* [PATCH 0/6] hugepage migration fixes (v3)
From: Naoya Horiguchi @ 2014-08-29  1:38 UTC
  To: Andrew Morton, Hugh Dickins
  Cc: David Rientjes, linux-mm, linux-kernel, Naoya Horiguchi

This is version 3 of the hugepage migration fix patchset.

The original problem discussed with Hugh was that follow_huge_pmd(FOLL_GET)
appears to do get_page() without any locking (fixed in patch 2/6). However,
thorough testing showed that we have a more fundamental issue in hugetlb_fault(),
which suffers from a race with migration entries. That is fixed in patch 3/6.

And as a cosmetic/readability issue, follow_huge_(addr|pud|pmd) are currently
defined either in common code or in arch-dependent code, depending on
CONFIG_ARCH_WANT_GENERAL_HUGETLB. But in reality most architectures do the same
thing, so patch 1/6 cleans this up and keeps an arch-dependent implementation
only where necessary, which removes more than 100 lines of code.

Another point mentioned in the previous cycle is that we have repeatedly fixed
migration entry issues one by one, which is inefficient considering that all
such patches are backported to stable trees, so it would be better to fix all
of the similar problems at once.
I audited all the code calling huge_pte_offset() and checked whether the
!pte_present() case is handled properly, and found that only two call sites
missed it; these are fixed in patch 4/6 and 5/6.
There are some non-trivial cases, so I put justifications for them below
(a minimal sketch of the check pattern follows the list):
- mincore_hugetlb_page_range() determines presence only by
  (ptep && !huge_pte_none()), but that's fine because we can consider a
  migrating or hwpoisoned hugepage as in-memory.
- follow_huge_addr() in arch/ia64/mm/hugetlbpage.c doesn't have to check
  pte_present(), because ia64 supports neither hugepage migration nor hwpoison,
  so it never sees such entries.
- huge_pmd_share() is called only when pud_none() returns true, in which case
  the pmd can never be a migration or hwpoisoned entry.

Patch 6/6 is just a cleanup of an unused parameter.

This patchset is based on mmotm-2014-08-25-16-52 and shows no regression in
the libhugetlbfs test suite.

I'd like to add Hugh's Suggested-by tag to patches 2 and 3 if he is OK with
that, because the solutions are mostly based on his ideas.

Tree: git@github.com:Naoya-Horiguchi/linux.git
Branch: mmotm-2014-08-25-16-52/fix_follow_huge_pmd.v3

v2: http://thread.gmane.org/gmane.linux.kernel/1761065
---
Summary:

Naoya Horiguchi (6):
      mm/hugetlb: reduce arch dependent code around follow_huge_*
      mm/hugetlb: take page table lock in follow_huge_(addr|pmd|pud)()
      mm/hugetlb: fix getting refcount 0 page in hugetlb_fault()
      mm/hugetlb: add migration entry check in hugetlb_change_protection
      mm/hugetlb: add migration entry check in __unmap_hugepage_range
      mm/hugetlb: remove unused argument of follow_huge_addr()

 arch/arm/mm/hugetlbpage.c     |   6 --
 arch/arm64/mm/hugetlbpage.c   |   6 --
 arch/ia64/mm/hugetlbpage.c    |  17 +++---
 arch/metag/mm/hugetlbpage.c   |  10 +---
 arch/mips/mm/hugetlbpage.c    |  18 ------
 arch/powerpc/mm/hugetlbpage.c |  28 +++++----
 arch/s390/mm/hugetlbpage.c    |  20 -------
 arch/sh/mm/hugetlbpage.c      |  12 ----
 arch/sparc/mm/hugetlbpage.c   |  12 ----
 arch/tile/mm/hugetlbpage.c    |  28 ---------
 arch/x86/mm/hugetlbpage.c     |  14 +----
 include/linux/hugetlb.h       |  17 +++---
 mm/gup.c                      |  27 ++-------
 mm/hugetlb.c                  | 130 +++++++++++++++++++++++++++++-------------
 14 files changed, 131 insertions(+), 214 deletions(-)

* [PATCH v3 1/6] mm/hugetlb: reduce arch dependent code around follow_huge_*
From: Naoya Horiguchi @ 2014-08-29  1:38 UTC
  To: Andrew Morton, Hugh Dickins
  Cc: David Rientjes, linux-mm, linux-kernel, Naoya Horiguchi

Currently we have many duplicated definitions of follow_huge_addr(),
follow_huge_pmd(), and follow_huge_pud(), and this patch removes them.
The basic idea is to put the default implementations of these functions in
mm/hugetlb.c as weak symbols (regardless of CONFIG_ARCH_WANT_GENERAL_HUGETLB),
and to implement arch-specific code only when the arch needs it.
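
(As a side note, not part of the patch: the weak symbol mechanism behaves
roughly as below; arch_hook() is a made-up name just for illustration.)

  /* common.c: generic fallback, used only if nobody else defines the symbol */
  int __attribute__((weak)) arch_hook(void)
  {
          return 0;
  }

  /* arch/xxx/foo.c: a regular (strong) definition silently replaces the
   * weak one at link time, with no #ifdef needed at the call site */
  int arch_hook(void)
  {
          return 1;
  }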

For follow_huge_addr(), only powerpc and ia64 have their own implementations;
in all other architectures this function just returns ERR_PTR(-EINVAL).
So this patch makes returning ERR_PTR(-EINVAL) the default.

As for follow_huge_(pmd|pud)(), if (pmd|pud)_huge() always returns 0 in your
architecture (as on ia64 or sparc), follow_huge_(pmd|pud)() is never called
(the call site is optimized away) regardless of how it is implemented, so
such architectures don't need an arch-specific implementation.

In some architectures (like mips, s390 and tile), the current arch-specific
follow_huge_(pmd|pud)() is effectively identical to the common code, so this
patch lets those architectures use the common code.

One exception is metag, where pmd_huge() can return non-zero but
follow_huge_pmd() is expected to always return NULL. This means that we need
an arch-specific implementation that returns NULL. This behavior looks strange
to me (non-zero pmd_huge() implies that the architecture supports PMD-based
hugepages, so follow_huge_pmd() can/should return something relevant), but
that's beyond the scope of this cleanup patch, so let's keep it.

Justification of non-trivial changes:
- in s390, follow_huge_pmd() checks !MACHINE_HAS_HPAGE first, and this patch
  removes that check. This is OK because we can assume MACHINE_HAS_HPAGE is
  true whenever follow_huge_pmd() can be called (note that pmd_huge() has the
  same check and always returns 0 for !MACHINE_HAS_HPAGE).
- in s390 and mips, HPAGE_MASK is used instead of the PMD_MASK used in the
  common code. This patch forces these archs to use PMD_MASK, but that's OK
  because the two are identical on both archs (a hypothetical compile-time
  check making this explicit follows the list).
  In s390, both HPAGE_SHIFT and PMD_SHIFT are 20.
  In mips, HPAGE_SHIFT is defined as (PAGE_SHIFT + PAGE_SHIFT - 3) and
  PMD_SHIFT is defined as (PAGE_SHIFT + PAGE_SHIFT + PTE_ORDER - 3), but
  PTE_ORDER is always 0, so these are identical.
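
(Not part of this patch, just to make the equivalence explicit: a hypothetical
compile-time check like the following could be dropped into the arch code,
assuming the macros are compile-time constants there.)

  #include <linux/bug.h>

  static inline void hugepage_shift_sanity_check(void)
  {
          /*
           * s390: HPAGE_SHIFT == 20 == PMD_SHIFT
           * mips: HPAGE_SHIFT == PAGE_SHIFT + PAGE_SHIFT - 3
           *       PMD_SHIFT   == PAGE_SHIFT + PAGE_SHIFT + PTE_ORDER - 3,
           *       and PTE_ORDER == 0
           */
          BUILD_BUG_ON(HPAGE_SHIFT != PMD_SHIFT);
          BUILD_BUG_ON(HPAGE_MASK != PMD_MASK);
  }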

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
---
 arch/arm/mm/hugetlbpage.c     |  6 ------
 arch/arm64/mm/hugetlbpage.c   |  6 ------
 arch/ia64/mm/hugetlbpage.c    |  6 ------
 arch/metag/mm/hugetlbpage.c   |  6 ------
 arch/mips/mm/hugetlbpage.c    | 18 ------------------
 arch/powerpc/mm/hugetlbpage.c |  8 ++++++++
 arch/s390/mm/hugetlbpage.c    | 20 --------------------
 arch/sh/mm/hugetlbpage.c      | 12 ------------
 arch/sparc/mm/hugetlbpage.c   | 12 ------------
 arch/tile/mm/hugetlbpage.c    | 28 ----------------------------
 arch/x86/mm/hugetlbpage.c     | 12 ------------
 mm/hugetlb.c                  | 30 +++++++++++++++---------------
 12 files changed, 23 insertions(+), 141 deletions(-)

diff --git mmotm-2014-08-25-16-52.orig/arch/arm/mm/hugetlbpage.c mmotm-2014-08-25-16-52/arch/arm/mm/hugetlbpage.c
index 66781bf34077..c72412415093 100644
--- mmotm-2014-08-25-16-52.orig/arch/arm/mm/hugetlbpage.c
+++ mmotm-2014-08-25-16-52/arch/arm/mm/hugetlbpage.c
@@ -36,12 +36,6 @@
  * of type casting from pmd_t * to pte_t *.
  */
 
-struct page *follow_huge_addr(struct mm_struct *mm, unsigned long address,
-			      int write)
-{
-	return ERR_PTR(-EINVAL);
-}
-
 int pud_huge(pud_t pud)
 {
 	return 0;
diff --git mmotm-2014-08-25-16-52.orig/arch/arm64/mm/hugetlbpage.c mmotm-2014-08-25-16-52/arch/arm64/mm/hugetlbpage.c
index 023747bf4dd7..2de9d2e59d96 100644
--- mmotm-2014-08-25-16-52.orig/arch/arm64/mm/hugetlbpage.c
+++ mmotm-2014-08-25-16-52/arch/arm64/mm/hugetlbpage.c
@@ -38,12 +38,6 @@ int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr, pte_t *ptep)
 }
 #endif
 
-struct page *follow_huge_addr(struct mm_struct *mm, unsigned long address,
-			      int write)
-{
-	return ERR_PTR(-EINVAL);
-}
-
 int pmd_huge(pmd_t pmd)
 {
 	return !(pmd_val(pmd) & PMD_TABLE_BIT);
diff --git mmotm-2014-08-25-16-52.orig/arch/ia64/mm/hugetlbpage.c mmotm-2014-08-25-16-52/arch/ia64/mm/hugetlbpage.c
index 76069c18ee42..52b7604b5215 100644
--- mmotm-2014-08-25-16-52.orig/arch/ia64/mm/hugetlbpage.c
+++ mmotm-2014-08-25-16-52/arch/ia64/mm/hugetlbpage.c
@@ -114,12 +114,6 @@ int pud_huge(pud_t pud)
 	return 0;
 }
 
-struct page *
-follow_huge_pmd(struct mm_struct *mm, unsigned long address, pmd_t *pmd, int write)
-{
-	return NULL;
-}
-
 void hugetlb_free_pgd_range(struct mmu_gather *tlb,
 			unsigned long addr, unsigned long end,
 			unsigned long floor, unsigned long ceiling)
diff --git mmotm-2014-08-25-16-52.orig/arch/metag/mm/hugetlbpage.c mmotm-2014-08-25-16-52/arch/metag/mm/hugetlbpage.c
index 3c52fa6d0f8e..745081427659 100644
--- mmotm-2014-08-25-16-52.orig/arch/metag/mm/hugetlbpage.c
+++ mmotm-2014-08-25-16-52/arch/metag/mm/hugetlbpage.c
@@ -94,12 +94,6 @@ int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr, pte_t *ptep)
 	return 0;
 }
 
-struct page *follow_huge_addr(struct mm_struct *mm,
-			      unsigned long address, int write)
-{
-	return ERR_PTR(-EINVAL);
-}
-
 int pmd_huge(pmd_t pmd)
 {
 	return pmd_page_shift(pmd) > PAGE_SHIFT;
diff --git mmotm-2014-08-25-16-52.orig/arch/mips/mm/hugetlbpage.c mmotm-2014-08-25-16-52/arch/mips/mm/hugetlbpage.c
index 4ec8ee10d371..06e0f421b41b 100644
--- mmotm-2014-08-25-16-52.orig/arch/mips/mm/hugetlbpage.c
+++ mmotm-2014-08-25-16-52/arch/mips/mm/hugetlbpage.c
@@ -68,12 +68,6 @@ int is_aligned_hugepage_range(unsigned long addr, unsigned long len)
 	return 0;
 }
 
-struct page *
-follow_huge_addr(struct mm_struct *mm, unsigned long address, int write)
-{
-	return ERR_PTR(-EINVAL);
-}
-
 int pmd_huge(pmd_t pmd)
 {
 	return (pmd_val(pmd) & _PAGE_HUGE) != 0;
@@ -83,15 +77,3 @@ int pud_huge(pud_t pud)
 {
 	return (pud_val(pud) & _PAGE_HUGE) != 0;
 }
-
-struct page *
-follow_huge_pmd(struct mm_struct *mm, unsigned long address,
-		pmd_t *pmd, int write)
-{
-	struct page *page;
-
-	page = pte_page(*(pte_t *)pmd);
-	if (page)
-		page += ((address & ~HPAGE_MASK) >> PAGE_SHIFT);
-	return page;
-}
diff --git mmotm-2014-08-25-16-52.orig/arch/powerpc/mm/hugetlbpage.c mmotm-2014-08-25-16-52/arch/powerpc/mm/hugetlbpage.c
index 7e70ae968e5f..9517a93a315c 100644
--- mmotm-2014-08-25-16-52.orig/arch/powerpc/mm/hugetlbpage.c
+++ mmotm-2014-08-25-16-52/arch/powerpc/mm/hugetlbpage.c
@@ -706,6 +706,14 @@ follow_huge_pmd(struct mm_struct *mm, unsigned long address,
 	return NULL;
 }
 
+struct page *
+follow_huge_pud(struct mm_struct *mm, unsigned long address,
+		pmd_t *pmd, int write)
+{
+	BUG();
+	return NULL;
+}
+
 static unsigned long hugepte_addr_end(unsigned long addr, unsigned long end,
 				      unsigned long sz)
 {
diff --git mmotm-2014-08-25-16-52.orig/arch/s390/mm/hugetlbpage.c mmotm-2014-08-25-16-52/arch/s390/mm/hugetlbpage.c
index 0ff66a7e29bb..811e7f9a2de0 100644
--- mmotm-2014-08-25-16-52.orig/arch/s390/mm/hugetlbpage.c
+++ mmotm-2014-08-25-16-52/arch/s390/mm/hugetlbpage.c
@@ -201,12 +201,6 @@ int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr, pte_t *ptep)
 	return 0;
 }
 
-struct page *follow_huge_addr(struct mm_struct *mm, unsigned long address,
-			      int write)
-{
-	return ERR_PTR(-EINVAL);
-}
-
 int pmd_huge(pmd_t pmd)
 {
 	if (!MACHINE_HAS_HPAGE)
@@ -219,17 +213,3 @@ int pud_huge(pud_t pud)
 {
 	return 0;
 }
-
-struct page *follow_huge_pmd(struct mm_struct *mm, unsigned long address,
-			     pmd_t *pmdp, int write)
-{
-	struct page *page;
-
-	if (!MACHINE_HAS_HPAGE)
-		return NULL;
-
-	page = pmd_page(*pmdp);
-	if (page)
-		page += ((address & ~HPAGE_MASK) >> PAGE_SHIFT);
-	return page;
-}
diff --git mmotm-2014-08-25-16-52.orig/arch/sh/mm/hugetlbpage.c mmotm-2014-08-25-16-52/arch/sh/mm/hugetlbpage.c
index d7762349ea48..534bc978af8a 100644
--- mmotm-2014-08-25-16-52.orig/arch/sh/mm/hugetlbpage.c
+++ mmotm-2014-08-25-16-52/arch/sh/mm/hugetlbpage.c
@@ -67,12 +67,6 @@ int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr, pte_t *ptep)
 	return 0;
 }
 
-struct page *follow_huge_addr(struct mm_struct *mm,
-			      unsigned long address, int write)
-{
-	return ERR_PTR(-EINVAL);
-}
-
 int pmd_huge(pmd_t pmd)
 {
 	return 0;
@@ -82,9 +76,3 @@ int pud_huge(pud_t pud)
 {
 	return 0;
 }
-
-struct page *follow_huge_pmd(struct mm_struct *mm, unsigned long address,
-			     pmd_t *pmd, int write)
-{
-	return NULL;
-}
diff --git mmotm-2014-08-25-16-52.orig/arch/sparc/mm/hugetlbpage.c mmotm-2014-08-25-16-52/arch/sparc/mm/hugetlbpage.c
index d329537739c6..4242eab12e10 100644
--- mmotm-2014-08-25-16-52.orig/arch/sparc/mm/hugetlbpage.c
+++ mmotm-2014-08-25-16-52/arch/sparc/mm/hugetlbpage.c
@@ -215,12 +215,6 @@ pte_t huge_ptep_get_and_clear(struct mm_struct *mm, unsigned long addr,
 	return entry;
 }
 
-struct page *follow_huge_addr(struct mm_struct *mm,
-			      unsigned long address, int write)
-{
-	return ERR_PTR(-EINVAL);
-}
-
 int pmd_huge(pmd_t pmd)
 {
 	return 0;
@@ -230,9 +224,3 @@ int pud_huge(pud_t pud)
 {
 	return 0;
 }
-
-struct page *follow_huge_pmd(struct mm_struct *mm, unsigned long address,
-			     pmd_t *pmd, int write)
-{
-	return NULL;
-}
diff --git mmotm-2014-08-25-16-52.orig/arch/tile/mm/hugetlbpage.c mmotm-2014-08-25-16-52/arch/tile/mm/hugetlbpage.c
index e514899e1100..8a00c7b7b862 100644
--- mmotm-2014-08-25-16-52.orig/arch/tile/mm/hugetlbpage.c
+++ mmotm-2014-08-25-16-52/arch/tile/mm/hugetlbpage.c
@@ -150,12 +150,6 @@ pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr)
 	return NULL;
 }
 
-struct page *follow_huge_addr(struct mm_struct *mm, unsigned long address,
-			      int write)
-{
-	return ERR_PTR(-EINVAL);
-}
-
 int pmd_huge(pmd_t pmd)
 {
 	return !!(pmd_val(pmd) & _PAGE_HUGE_PAGE);
@@ -166,28 +160,6 @@ int pud_huge(pud_t pud)
 	return !!(pud_val(pud) & _PAGE_HUGE_PAGE);
 }
 
-struct page *follow_huge_pmd(struct mm_struct *mm, unsigned long address,
-			     pmd_t *pmd, int write)
-{
-	struct page *page;
-
-	page = pte_page(*(pte_t *)pmd);
-	if (page)
-		page += ((address & ~PMD_MASK) >> PAGE_SHIFT);
-	return page;
-}
-
-struct page *follow_huge_pud(struct mm_struct *mm, unsigned long address,
-			     pud_t *pud, int write)
-{
-	struct page *page;
-
-	page = pte_page(*(pte_t *)pud);
-	if (page)
-		page += ((address & ~PUD_MASK) >> PAGE_SHIFT);
-	return page;
-}
-
 int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr, pte_t *ptep)
 {
 	return 0;
diff --git mmotm-2014-08-25-16-52.orig/arch/x86/mm/hugetlbpage.c mmotm-2014-08-25-16-52/arch/x86/mm/hugetlbpage.c
index 8b977ebf9388..03b8a7c11817 100644
--- mmotm-2014-08-25-16-52.orig/arch/x86/mm/hugetlbpage.c
+++ mmotm-2014-08-25-16-52/arch/x86/mm/hugetlbpage.c
@@ -52,20 +52,8 @@ int pud_huge(pud_t pud)
 	return 0;
 }
 
-struct page *
-follow_huge_pmd(struct mm_struct *mm, unsigned long address,
-		pmd_t *pmd, int write)
-{
-	return NULL;
-}
 #else
 
-struct page *
-follow_huge_addr(struct mm_struct *mm, unsigned long address, int write)
-{
-	return ERR_PTR(-EINVAL);
-}
-
 int pmd_huge(pmd_t pmd)
 {
 	return !!(pmd_val(pmd) & _PAGE_PSE);
diff --git mmotm-2014-08-25-16-52.orig/mm/hugetlb.c mmotm-2014-08-25-16-52/mm/hugetlb.c
index eeceeeb09019..022767506c7b 100644
--- mmotm-2014-08-25-16-52.orig/mm/hugetlb.c
+++ mmotm-2014-08-25-16-52/mm/hugetlb.c
@@ -3653,7 +3653,20 @@ pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr)
 	return (pte_t *) pmd;
 }
 
-struct page *
+#endif /* CONFIG_ARCH_WANT_GENERAL_HUGETLB */
+
+/*
+ * These functions are overwritable if your architecture needs its own
+ * behavior.
+ */
+struct page * __weak
+follow_huge_addr(struct mm_struct *mm, unsigned long address,
+			      int write)
+{
+	return ERR_PTR(-EINVAL);
+}
+
+struct page * __weak
 follow_huge_pmd(struct mm_struct *mm, unsigned long address,
 		pmd_t *pmd, int write)
 {
@@ -3665,7 +3678,7 @@ follow_huge_pmd(struct mm_struct *mm, unsigned long address,
 	return page;
 }
 
-struct page *
+struct page * __weak
 follow_huge_pud(struct mm_struct *mm, unsigned long address,
 		pud_t *pud, int write)
 {
@@ -3677,19 +3690,6 @@ follow_huge_pud(struct mm_struct *mm, unsigned long address,
 	return page;
 }
 
-#else /* !CONFIG_ARCH_WANT_GENERAL_HUGETLB */
-
-/* Can be overriden by architectures */
-struct page * __weak
-follow_huge_pud(struct mm_struct *mm, unsigned long address,
-	       pud_t *pud, int write)
-{
-	BUG();
-	return NULL;
-}
-
-#endif /* CONFIG_ARCH_WANT_GENERAL_HUGETLB */
-
 #ifdef CONFIG_MEMORY_FAILURE
 
 /* Should be called in hugetlb_lock */
-- 
1.9.3



* [PATCH v3 2/6] mm/hugetlb: take page table lock in follow_huge_(addr|pmd|pud)()
From: Naoya Horiguchi @ 2014-08-29  1:38 UTC
  To: Andrew Morton, Hugh Dickins
  Cc: David Rientjes, linux-mm, linux-kernel, Naoya Horiguchi

We have a race condition between move_pages() and freeing hugepages, where
move_pages() calls follow_page(FOLL_GET) for hugepages internally and tries to
take a refcount on the page without preventing concurrent freeing. This race
crashes the kernel, so this patch fixes it by moving the FOLL_GET code for
hugepages into follow_huge_pmd(), which now takes the page table lock.

This patch also adds similar locking to follow_huge_(addr|pud) for
consistency.
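
The core of the fix (the full mm/hugetlb.c hunk is below) boils down to
re-checking the entry and taking the refcount only while the page table lock
is held, roughly:

  ptl = huge_pte_lock(hstate_vma(vma), vma->vm_mm, (pte_t *)pmd);
  if (!pmd_huge(*pmd))    /* entry may have changed before we got ptl */
          goto out;
  page = pte_page(*(pte_t *)pmd) + ((address & ~PMD_MASK) >> PAGE_SHIFT);
  if ((flags & FOLL_GET) && !get_page_unless_zero(page))
          page = NULL;    /* the hugepage is already being freed */
  out:
          spin_unlock(ptl);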

Here is the reproducer:

  $ cat movepages.c
  #include <stdio.h>
  #include <stdlib.h>
  #include <err.h>
  #include <numa.h>
  #include <numaif.h>

  #define ADDR_INPUT      0x700000000000UL
  #define HPS             0x200000
  #define PS              0x1000

  int main(int argc, char *argv[]) {
          int i;
          int nr_hp = strtol(argv[1], NULL, 0);
          int nr_p  = nr_hp * HPS / PS;
          int ret;
          void **addrs;
          int *status;
          int *nodes;
          pid_t pid;

          pid = strtol(argv[2], NULL, 0);
          addrs  = malloc(sizeof(char *) * nr_p + 1);
          status = malloc(sizeof(char *) * nr_p + 1);
          nodes  = malloc(sizeof(char *) * nr_p + 1);

          while (1) {
                  for (i = 0; i < nr_p; i++) {
                          addrs[i] = (void *)ADDR_INPUT + i * PS;
                          nodes[i] = 1;
                          status[i] = 0;
                  }
                  ret = numa_move_pages(pid, nr_p, addrs, nodes, status,
                                        MPOL_MF_MOVE_ALL);
                  if (ret == -1)
                          err(1, "move_pages");

                  for (i = 0; i < nr_p; i++) {
                          addrs[i] = (void *)ADDR_INPUT + i * PS;
                          nodes[i] = 0;
                          status[i] = 0;
                  }
                  ret = numa_move_pages(pid, nr_p, addrs, nodes, status,
                                        MPOL_MF_MOVE_ALL);
                  if (ret == -1)
                          err(1, "move_pages");
          }
          return 0;
  }

  $ cat hugepage.c
  #include <stdio.h>
  #include <stdlib.h>
  #include <sys/mman.h>
  #include <string.h>

  #define ADDR_INPUT      0x700000000000UL
  #define HPS             0x200000

  int main(int argc, char *argv[]) {
          int nr_hp = strtol(argv[1], NULL, 0);
          char *p;

          while (1) {
                  p = mmap((void *)ADDR_INPUT, nr_hp * HPS, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
                  if (p != (void *)ADDR_INPUT) {
                          perror("mmap");
                          break;
                  }
                  memset(p, 0, nr_hp * HPS);
                  munmap(p, nr_hp * HPS);
          }
  }

  $ sysctl vm.nr_hugepages=40
  $ ./hugepage 10 &
  $ ./movepages 10 $(pgrep -f hugepage)
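
(For reference, the reproducers can be built along these lines, assuming
libnuma and its development headers are installed:)

  $ gcc -O2 -o movepages movepages.c -lnuma
  $ gcc -O2 -o hugepage hugepage.c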

Note for stable inclusion:
  This patch fixes e632a938d914 ("mm: migrate: add hugepage migration code
  to move_pages()"), so it is applicable to -stable kernels which include it.

ChangeLog v3:
- remove unnecessary if (page) check
- check (pmd|pud)_huge again after taking ptl
- make the same change to follow_huge_pud()
- also take the page table lock in follow_huge_addr()

ChangeLog v2:
- introduce follow_huge_pmd_lock() to do the locking in arch-independent code.

Reported-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: <stable@vger.kernel.org>  # [3.12+]
---
 arch/ia64/mm/hugetlbpage.c    |  9 +++++++--
 arch/metag/mm/hugetlbpage.c   |  4 ++--
 arch/powerpc/mm/hugetlbpage.c | 22 +++++++++++-----------
 include/linux/hugetlb.h       | 12 ++++++------
 mm/gup.c                      | 25 ++++---------------------
 mm/hugetlb.c                  | 43 +++++++++++++++++++++++++++++++------------
 6 files changed, 61 insertions(+), 54 deletions(-)

diff --git mmotm-2014-08-25-16-52.orig/arch/ia64/mm/hugetlbpage.c mmotm-2014-08-25-16-52/arch/ia64/mm/hugetlbpage.c
index 52b7604b5215..6170381bf074 100644
--- mmotm-2014-08-25-16-52.orig/arch/ia64/mm/hugetlbpage.c
+++ mmotm-2014-08-25-16-52/arch/ia64/mm/hugetlbpage.c
@@ -91,17 +91,22 @@ int prepare_hugepage_range(struct file *file,
 
 struct page *follow_huge_addr(struct mm_struct *mm, unsigned long addr, int write)
 {
-	struct page *page;
+	struct page *page = NULL;
 	pte_t *ptep;
+	spinlock_t *ptl;
 
 	if (REGION_NUMBER(addr) != RGN_HPAGE)
 		return ERR_PTR(-EINVAL);
 
 	ptep = huge_pte_offset(mm, addr);
+	ptl = huge_pte_lock(hstate_vma(vma), vma->vm_mm, ptep);
 	if (!ptep || pte_none(*ptep))
-		return NULL;
+		goto out;
+
 	page = pte_page(*ptep);
 	page += ((addr & ~HPAGE_MASK) >> PAGE_SHIFT);
+out:
+	spin_unlock(ptl);
 	return page;
 }
 int pmd_huge(pmd_t pmd)
diff --git mmotm-2014-08-25-16-52.orig/arch/metag/mm/hugetlbpage.c mmotm-2014-08-25-16-52/arch/metag/mm/hugetlbpage.c
index 745081427659..5e96ef096df9 100644
--- mmotm-2014-08-25-16-52.orig/arch/metag/mm/hugetlbpage.c
+++ mmotm-2014-08-25-16-52/arch/metag/mm/hugetlbpage.c
@@ -104,8 +104,8 @@ int pud_huge(pud_t pud)
 	return 0;
 }
 
-struct page *follow_huge_pmd(struct mm_struct *mm, unsigned long address,
-			     pmd_t *pmd, int write)
+struct page *follow_huge_pmd(struct vm_area_struct *vma, unsigned long address,
+			     pmd_t *pmd, int flags)
 {
 	return NULL;
 }
diff --git mmotm-2014-08-25-16-52.orig/arch/powerpc/mm/hugetlbpage.c mmotm-2014-08-25-16-52/arch/powerpc/mm/hugetlbpage.c
index 9517a93a315c..1d8854a56309 100644
--- mmotm-2014-08-25-16-52.orig/arch/powerpc/mm/hugetlbpage.c
+++ mmotm-2014-08-25-16-52/arch/powerpc/mm/hugetlbpage.c
@@ -677,38 +677,38 @@ struct page *
 follow_huge_addr(struct mm_struct *mm, unsigned long address, int write)
 {
 	pte_t *ptep;
-	struct page *page;
+	struct page *page = ERR_PTR(-EINVAL);
 	unsigned shift;
 	unsigned long mask;
+	spinlock_t *ptl;
 	/*
 	 * Transparent hugepages are handled by generic code. We can skip them
 	 * here.
 	 */
 	ptep = find_linux_pte_or_hugepte(mm->pgd, address, &shift);
-
+	ptl = huge_pte_lock(hstate_vma(vma), vma->vm_mm, ptep);
 	/* Verify it is a huge page else bail. */
 	if (!ptep || !shift || pmd_trans_huge(*(pmd_t *)ptep))
-		return ERR_PTR(-EINVAL);
+		goto out;
 
 	mask = (1UL << shift) - 1;
-	page = pte_page(*ptep);
-	if (page)
-		page += (address & mask) / PAGE_SIZE;
-
+	page = pte_page(*ptep) + ((address & mask) >> PAGE_SHIFT);
+out:
+	spin_unlock(ptl);
 	return page;
 }
 
 struct page *
-follow_huge_pmd(struct mm_struct *mm, unsigned long address,
-		pmd_t *pmd, int write)
+follow_huge_pmd(struct vm_area_struct *vma, unsigned long address,
+		pmd_t *pmd, int flags)
 {
 	BUG();
 	return NULL;
 }
 
 struct page *
-follow_huge_pud(struct mm_struct *mm, unsigned long address,
-		pmd_t *pmd, int write)
+follow_huge_pud(struct vm_area_struct *vma, unsigned long address,
+		pud_t *pud, int flags)
 {
 	BUG();
 	return NULL;
diff --git mmotm-2014-08-25-16-52.orig/include/linux/hugetlb.h mmotm-2014-08-25-16-52/include/linux/hugetlb.h
index 6e6d338641fe..b3200fce07aa 100644
--- mmotm-2014-08-25-16-52.orig/include/linux/hugetlb.h
+++ mmotm-2014-08-25-16-52/include/linux/hugetlb.h
@@ -98,10 +98,10 @@ pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr);
 int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr, pte_t *ptep);
 struct page *follow_huge_addr(struct mm_struct *mm, unsigned long address,
 			      int write);
-struct page *follow_huge_pmd(struct mm_struct *mm, unsigned long address,
-				pmd_t *pmd, int write);
-struct page *follow_huge_pud(struct mm_struct *mm, unsigned long address,
-				pud_t *pud, int write);
+struct page *follow_huge_pmd(struct vm_area_struct *vma, unsigned long address,
+				pmd_t *pmd, int flags);
+struct page *follow_huge_pud(struct vm_area_struct *vma, unsigned long address,
+				pud_t *pud, int flags);
 int pmd_huge(pmd_t pmd);
 int pud_huge(pud_t pmd);
 unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
@@ -133,8 +133,8 @@ static inline void hugetlb_report_meminfo(struct seq_file *m)
 static inline void hugetlb_show_meminfo(void)
 {
 }
-#define follow_huge_pmd(mm, addr, pmd, write)	NULL
-#define follow_huge_pud(mm, addr, pud, write)	NULL
+#define follow_huge_pmd(vma, addr, pmd, flags)	NULL
+#define follow_huge_pud(vma, addr, pud, flags)	NULL
 #define prepare_hugepage_range(file, addr, len)	(-EINVAL)
 #define pmd_huge(x)	0
 #define pud_huge(x)	0
diff --git mmotm-2014-08-25-16-52.orig/mm/gup.c mmotm-2014-08-25-16-52/mm/gup.c
index 91d044b1600d..597a5e92e265 100644
--- mmotm-2014-08-25-16-52.orig/mm/gup.c
+++ mmotm-2014-08-25-16-52/mm/gup.c
@@ -162,33 +162,16 @@ struct page *follow_page_mask(struct vm_area_struct *vma,
 	pud = pud_offset(pgd, address);
 	if (pud_none(*pud))
 		return no_page_table(vma, flags);
-	if (pud_huge(*pud) && vma->vm_flags & VM_HUGETLB) {
-		if (flags & FOLL_GET)
-			return NULL;
-		page = follow_huge_pud(mm, address, pud, flags & FOLL_WRITE);
-		return page;
-	}
+	if (pud_huge(*pud) && vma->vm_flags & VM_HUGETLB)
+		return follow_huge_pud(vma, address, pud, flags);
 	if (unlikely(pud_bad(*pud)))
 		return no_page_table(vma, flags);
 
 	pmd = pmd_offset(pud, address);
 	if (pmd_none(*pmd))
 		return no_page_table(vma, flags);
-	if (pmd_huge(*pmd) && vma->vm_flags & VM_HUGETLB) {
-		page = follow_huge_pmd(mm, address, pmd, flags & FOLL_WRITE);
-		if (flags & FOLL_GET) {
-			/*
-			 * Refcount on tail pages are not well-defined and
-			 * shouldn't be taken. The caller should handle a NULL
-			 * return when trying to follow tail pages.
-			 */
-			if (PageHead(page))
-				get_page(page);
-			else
-				page = NULL;
-		}
-		return page;
-	}
+	if (pmd_huge(*pmd) && vma->vm_flags & VM_HUGETLB)
+		return follow_huge_pmd(vma, address, pmd, flags);
 	if ((flags & FOLL_NUMA) && pmd_numa(*pmd))
 		return no_page_table(vma, flags);
 	if (pmd_trans_huge(*pmd)) {
diff --git mmotm-2014-08-25-16-52.orig/mm/hugetlb.c mmotm-2014-08-25-16-52/mm/hugetlb.c
index 022767506c7b..c5345c5edb50 100644
--- mmotm-2014-08-25-16-52.orig/mm/hugetlb.c
+++ mmotm-2014-08-25-16-52/mm/hugetlb.c
@@ -3667,26 +3667,45 @@ follow_huge_addr(struct mm_struct *mm, unsigned long address,
 }
 
 struct page * __weak
-follow_huge_pmd(struct mm_struct *mm, unsigned long address,
-		pmd_t *pmd, int write)
+follow_huge_pmd(struct vm_area_struct *vma, unsigned long address,
+		pmd_t *pmd, int flags)
 {
-	struct page *page;
+	struct page *page = NULL;
+	spinlock_t *ptl;
 
-	page = pte_page(*(pte_t *)pmd);
-	if (page)
-		page += ((address & ~PMD_MASK) >> PAGE_SHIFT);
+	ptl = huge_pte_lock(hstate_vma(vma), vma->vm_mm, (pte_t *)pmd);
+
+	if (!pmd_huge(*pmd))
+		goto out;
+
+	page = pte_page(*(pte_t *)pmd) + ((address & ~PMD_MASK) >> PAGE_SHIFT);
+
+	if (flags & FOLL_GET)
+		if (!get_page_unless_zero(page))
+			page = NULL;
+out:
+	spin_unlock(ptl);
 	return page;
 }
 
 struct page * __weak
-follow_huge_pud(struct mm_struct *mm, unsigned long address,
-		pud_t *pud, int write)
+follow_huge_pud(struct vm_area_struct *vma, unsigned long address,
+		pud_t *pud, int flags)
 {
-	struct page *page;
+	struct page *page = NULL;
+	spinlock_t *ptl;
 
-	page = pte_page(*(pte_t *)pud);
-	if (page)
-		page += ((address & ~PUD_MASK) >> PAGE_SHIFT);
+	if (flags & FOLL_GET)
+		return NULL;
+
+	ptl = huge_pte_lock(hstate_vma(vma), vma->vm_mm, (pte_t *)pud);
+
+	if (!pud_huge(*pud))
+		goto out;
+
+	page = pte_page(*(pte_t *)pud) + ((address & ~PUD_MASK) >> PAGE_SHIFT);
+out:
+	spin_unlock(ptl);
 	return page;
 }
 
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH v3 2/6] mm/hugetlb: take page table lock in follow_huge_(addr|pmd|pud)()
@ 2014-08-29  1:38   ` Naoya Horiguchi
  0 siblings, 0 replies; 48+ messages in thread
From: Naoya Horiguchi @ 2014-08-29  1:38 UTC (permalink / raw)
  To: Andrew Morton, Hugh Dickins
  Cc: David Rientjes, linux-mm, linux-kernel, Naoya Horiguchi

We have a race condition between move_pages() and freeing hugepages,
where move_pages() calls follow_page(FOLL_GET) for hugepages internally
and tries to get its refcount without preventing concurrent freeing.
This race crashes the kernel, so this patch fixes it by moving FOLL_GET
code for hugepages into follow_huge_pmd() with taking the page table lock.

This patch also adds the similar locking to follow_huge_(addr|pud)
for consistency.

Here is the reproducer:

  $ cat movepages.c
  #include <stdio.h>
  #include <stdlib.h>
  #include <numaif.h>

  #define ADDR_INPUT      0x700000000000UL
  #define HPS             0x200000
  #define PS              0x1000

  int main(int argc, char *argv[]) {
          int i;
          int nr_hp = strtol(argv[1], NULL, 0);
          int nr_p  = nr_hp * HPS / PS;
          int ret;
          void **addrs;
          int *status;
          int *nodes;
          pid_t pid;

          pid = strtol(argv[2], NULL, 0);
          addrs  = malloc(sizeof(char *) * nr_p + 1);
          status = malloc(sizeof(char *) * nr_p + 1);
          nodes  = malloc(sizeof(char *) * nr_p + 1);

          while (1) {
                  for (i = 0; i < nr_p; i++) {
                          addrs[i] = (void *)ADDR_INPUT + i * PS;
                          nodes[i] = 1;
                          status[i] = 0;
                  }
                  ret = numa_move_pages(pid, nr_p, addrs, nodes, status,
                                        MPOL_MF_MOVE_ALL);
                  if (ret == -1)
                          err("move_pages");

                  for (i = 0; i < nr_p; i++) {
                          addrs[i] = (void *)ADDR_INPUT + i * PS;
                          nodes[i] = 0;
                          status[i] = 0;
                  }
                  ret = numa_move_pages(pid, nr_p, addrs, nodes, status,
                                        MPOL_MF_MOVE_ALL);
                  if (ret == -1)
                          err("move_pages");
          }
          return 0;
  }

  $ cat hugepage.c
  #include <stdio.h>
  #include <sys/mman.h>
  #include <string.h>

  #define ADDR_INPUT      0x700000000000UL
  #define HPS             0x200000

  int main(int argc, char *argv[]) {
          int nr_hp = strtol(argv[1], NULL, 0);
          char *p;

          while (1) {
                  p = mmap((void *)ADDR_INPUT, nr_hp * HPS, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
                  if (p != (void *)ADDR_INPUT) {
                          perror("mmap");
                          break;
                  }
                  memset(p, 0, nr_hp * HPS);
                  munmap(p, nr_hp * HPS);
          }
  }

  $ sysctl vm.nr_hugepages=40
  $ ./hugepage 10 &
  $ ./movepages 10 $(pgrep -f hugepage)

Note for stable inclusion:
  This patch fixes e632a938d914 ("mm: migrate: add hugepage migration code
  to move_pages()"), so is applicable to -stable kernels which includes it.

ChangeLog v3:
- remove unnecessary if (page) check
- check (pmd|pud)_huge again after holding ptl
- do the same change also on follow_huge_pud()
- take page table lock also in follow_huge_addr()

ChangeLog v2:
- introduce follow_huge_pmd_lock() to do locking in arch-independent code.

Reported-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: <stable@vger.kernel.org>  # [3.12+]
---
 arch/ia64/mm/hugetlbpage.c    |  9 +++++++--
 arch/metag/mm/hugetlbpage.c   |  4 ++--
 arch/powerpc/mm/hugetlbpage.c | 22 +++++++++++-----------
 include/linux/hugetlb.h       | 12 ++++++------
 mm/gup.c                      | 25 ++++---------------------
 mm/hugetlb.c                  | 43 +++++++++++++++++++++++++++++++------------
 6 files changed, 61 insertions(+), 54 deletions(-)

diff --git mmotm-2014-08-25-16-52.orig/arch/ia64/mm/hugetlbpage.c mmotm-2014-08-25-16-52/arch/ia64/mm/hugetlbpage.c
index 52b7604b5215..6170381bf074 100644
--- mmotm-2014-08-25-16-52.orig/arch/ia64/mm/hugetlbpage.c
+++ mmotm-2014-08-25-16-52/arch/ia64/mm/hugetlbpage.c
@@ -91,17 +91,22 @@ int prepare_hugepage_range(struct file *file,
 
 struct page *follow_huge_addr(struct mm_struct *mm, unsigned long addr, int write)
 {
-	struct page *page;
+	struct page *page = NULL;
 	pte_t *ptep;
+	spinlock_t *ptl;
 
 	if (REGION_NUMBER(addr) != RGN_HPAGE)
 		return ERR_PTR(-EINVAL);
 
 	ptep = huge_pte_offset(mm, addr);
+	ptl = huge_pte_lock(hstate_vma(vma), vma->vm_mm, ptep);
 	if (!ptep || pte_none(*ptep))
-		return NULL;
+		goto out;
+
 	page = pte_page(*ptep);
 	page += ((addr & ~HPAGE_MASK) >> PAGE_SHIFT);
+out:
+	spin_unlock(ptl);
 	return page;
 }
 int pmd_huge(pmd_t pmd)
diff --git mmotm-2014-08-25-16-52.orig/arch/metag/mm/hugetlbpage.c mmotm-2014-08-25-16-52/arch/metag/mm/hugetlbpage.c
index 745081427659..5e96ef096df9 100644
--- mmotm-2014-08-25-16-52.orig/arch/metag/mm/hugetlbpage.c
+++ mmotm-2014-08-25-16-52/arch/metag/mm/hugetlbpage.c
@@ -104,8 +104,8 @@ int pud_huge(pud_t pud)
 	return 0;
 }
 
-struct page *follow_huge_pmd(struct mm_struct *mm, unsigned long address,
-			     pmd_t *pmd, int write)
+struct page *follow_huge_pmd(struct vm_area_struct *vma, unsigned long address,
+			     pmd_t *pmd, int flags)
 {
 	return NULL;
 }
diff --git mmotm-2014-08-25-16-52.orig/arch/powerpc/mm/hugetlbpage.c mmotm-2014-08-25-16-52/arch/powerpc/mm/hugetlbpage.c
index 9517a93a315c..1d8854a56309 100644
--- mmotm-2014-08-25-16-52.orig/arch/powerpc/mm/hugetlbpage.c
+++ mmotm-2014-08-25-16-52/arch/powerpc/mm/hugetlbpage.c
@@ -677,38 +677,38 @@ struct page *
 follow_huge_addr(struct mm_struct *mm, unsigned long address, int write)
 {
 	pte_t *ptep;
-	struct page *page;
+	struct page *page = ERR_PTR(-EINVAL);
 	unsigned shift;
 	unsigned long mask;
+	spinlock_t *ptl;
 	/*
 	 * Transparent hugepages are handled by generic code. We can skip them
 	 * here.
 	 */
 	ptep = find_linux_pte_or_hugepte(mm->pgd, address, &shift);
-
+	ptl = huge_pte_lock(hstate_vma(vma), vma->vm_mm, ptep);
 	/* Verify it is a huge page else bail. */
 	if (!ptep || !shift || pmd_trans_huge(*(pmd_t *)ptep))
-		return ERR_PTR(-EINVAL);
+		goto out;
 
 	mask = (1UL << shift) - 1;
-	page = pte_page(*ptep);
-	if (page)
-		page += (address & mask) / PAGE_SIZE;
-
+	page = pte_page(*ptep) + ((address & mask) >> PAGE_SHIFT);
+out:
+	spin_unlock(ptl);
 	return page;
 }
 
 struct page *
-follow_huge_pmd(struct mm_struct *mm, unsigned long address,
-		pmd_t *pmd, int write)
+follow_huge_pmd(struct vm_area_struct *vma, unsigned long address,
+		pmd_t *pmd, int flags)
 {
 	BUG();
 	return NULL;
 }
 
 struct page *
-follow_huge_pud(struct mm_struct *mm, unsigned long address,
-		pmd_t *pmd, int write)
+follow_huge_pud(struct vm_area_struct *vma, unsigned long address,
+		pud_t *pud, int flags)
 {
 	BUG();
 	return NULL;
diff --git mmotm-2014-08-25-16-52.orig/include/linux/hugetlb.h mmotm-2014-08-25-16-52/include/linux/hugetlb.h
index 6e6d338641fe..b3200fce07aa 100644
--- mmotm-2014-08-25-16-52.orig/include/linux/hugetlb.h
+++ mmotm-2014-08-25-16-52/include/linux/hugetlb.h
@@ -98,10 +98,10 @@ pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr);
 int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr, pte_t *ptep);
 struct page *follow_huge_addr(struct mm_struct *mm, unsigned long address,
 			      int write);
-struct page *follow_huge_pmd(struct mm_struct *mm, unsigned long address,
-				pmd_t *pmd, int write);
-struct page *follow_huge_pud(struct mm_struct *mm, unsigned long address,
-				pud_t *pud, int write);
+struct page *follow_huge_pmd(struct vm_area_struct *vma, unsigned long address,
+				pmd_t *pmd, int flags);
+struct page *follow_huge_pud(struct vm_area_struct *vma, unsigned long address,
+				pud_t *pud, int flags);
 int pmd_huge(pmd_t pmd);
 int pud_huge(pud_t pmd);
 unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
@@ -133,8 +133,8 @@ static inline void hugetlb_report_meminfo(struct seq_file *m)
 static inline void hugetlb_show_meminfo(void)
 {
 }
-#define follow_huge_pmd(mm, addr, pmd, write)	NULL
-#define follow_huge_pud(mm, addr, pud, write)	NULL
+#define follow_huge_pmd(vma, addr, pmd, flags)	NULL
+#define follow_huge_pud(vma, addr, pud, flags)	NULL
 #define prepare_hugepage_range(file, addr, len)	(-EINVAL)
 #define pmd_huge(x)	0
 #define pud_huge(x)	0
diff --git mmotm-2014-08-25-16-52.orig/mm/gup.c mmotm-2014-08-25-16-52/mm/gup.c
index 91d044b1600d..597a5e92e265 100644
--- mmotm-2014-08-25-16-52.orig/mm/gup.c
+++ mmotm-2014-08-25-16-52/mm/gup.c
@@ -162,33 +162,16 @@ struct page *follow_page_mask(struct vm_area_struct *vma,
 	pud = pud_offset(pgd, address);
 	if (pud_none(*pud))
 		return no_page_table(vma, flags);
-	if (pud_huge(*pud) && vma->vm_flags & VM_HUGETLB) {
-		if (flags & FOLL_GET)
-			return NULL;
-		page = follow_huge_pud(mm, address, pud, flags & FOLL_WRITE);
-		return page;
-	}
+	if (pud_huge(*pud) && vma->vm_flags & VM_HUGETLB)
+		return follow_huge_pud(vma, address, pud, flags);
 	if (unlikely(pud_bad(*pud)))
 		return no_page_table(vma, flags);
 
 	pmd = pmd_offset(pud, address);
 	if (pmd_none(*pmd))
 		return no_page_table(vma, flags);
-	if (pmd_huge(*pmd) && vma->vm_flags & VM_HUGETLB) {
-		page = follow_huge_pmd(mm, address, pmd, flags & FOLL_WRITE);
-		if (flags & FOLL_GET) {
-			/*
-			 * Refcount on tail pages are not well-defined and
-			 * shouldn't be taken. The caller should handle a NULL
-			 * return when trying to follow tail pages.
-			 */
-			if (PageHead(page))
-				get_page(page);
-			else
-				page = NULL;
-		}
-		return page;
-	}
+	if (pmd_huge(*pmd) && vma->vm_flags & VM_HUGETLB)
+		return follow_huge_pmd(vma, address, pmd, flags);
 	if ((flags & FOLL_NUMA) && pmd_numa(*pmd))
 		return no_page_table(vma, flags);
 	if (pmd_trans_huge(*pmd)) {
diff --git mmotm-2014-08-25-16-52.orig/mm/hugetlb.c mmotm-2014-08-25-16-52/mm/hugetlb.c
index 022767506c7b..c5345c5edb50 100644
--- mmotm-2014-08-25-16-52.orig/mm/hugetlb.c
+++ mmotm-2014-08-25-16-52/mm/hugetlb.c
@@ -3667,26 +3667,45 @@ follow_huge_addr(struct mm_struct *mm, unsigned long address,
 }
 
 struct page * __weak
-follow_huge_pmd(struct mm_struct *mm, unsigned long address,
-		pmd_t *pmd, int write)
+follow_huge_pmd(struct vm_area_struct *vma, unsigned long address,
+		pmd_t *pmd, int flags)
 {
-	struct page *page;
+	struct page *page = NULL;
+	spinlock_t *ptl;
 
-	page = pte_page(*(pte_t *)pmd);
-	if (page)
-		page += ((address & ~PMD_MASK) >> PAGE_SHIFT);
+	ptl = huge_pte_lock(hstate_vma(vma), vma->vm_mm, (pte_t *)pmd);
+
+	if (!pmd_huge(*pmd))
+		goto out;
+
+	page = pte_page(*(pte_t *)pmd) + ((address & ~PMD_MASK) >> PAGE_SHIFT);
+
+	if (flags & FOLL_GET)
+		if (!get_page_unless_zero(page))
+			page = NULL;
+out:
+	spin_unlock(ptl);
 	return page;
 }
 
 struct page * __weak
-follow_huge_pud(struct mm_struct *mm, unsigned long address,
-		pud_t *pud, int write)
+follow_huge_pud(struct vm_area_struct *vma, unsigned long address,
+		pud_t *pud, int flags)
 {
-	struct page *page;
+	struct page *page = NULL;
+	spinlock_t *ptl;
 
-	page = pte_page(*(pte_t *)pud);
-	if (page)
-		page += ((address & ~PUD_MASK) >> PAGE_SHIFT);
+	if (flags & FOLL_GET)
+		return NULL;
+
+	ptl = huge_pte_lock(hstate_vma(vma), vma->vm_mm, (pte_t *)pud);
+
+	if (!pud_huge(*pud))
+		goto out;
+
+	page = pte_page(*(pte_t *)pud) + ((address & ~PUD_MASK) >> PAGE_SHIFT);
+out:
+	spin_unlock(ptl);
 	return page;
 }
 
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH v3 3/6] mm/hugetlb: fix getting refcount 0 page in hugetlb_fault()
  2014-08-29  1:38 ` Naoya Horiguchi
@ 2014-08-29  1:38   ` Naoya Horiguchi
  -1 siblings, 0 replies; 48+ messages in thread
From: Naoya Horiguchi @ 2014-08-29  1:38 UTC (permalink / raw)
  To: Andrew Morton, Hugh Dickins
  Cc: David Rientjes, linux-mm, linux-kernel, Naoya Horiguchi

When running the test which triggers the race described in the previous patch,
we can hit the BUG "get_page() on refcount 0 page" in hugetlb_fault().

This race happens when the pte turns into a migration entry just after the
first is_hugetlb_entry_migration() check in hugetlb_fault() has returned false.
To fix this, we need to check pte_present() again while holding ptl.

This patch also reorders taking ptl and calling pte_page(), because pte_page()
should be done under ptl. Due to this reordering, we need to use trylock_page()
in the page != pagecache_page case to respect the locking order (lock_page()
may sleep, which we cannot do while holding the ptl spinlock).
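
A rough timeline of the window, as an illustrative sketch (not part of the
patch; the function names are the ones already used in mm/hugetlb.c):

	CPU A: hugetlb_fault()                  CPU B: hugepage migration
	-----------------------------------------------------------------
	entry = huge_ptep_get(ptep);
	is_hugetlb_entry_migration(entry)
	  -> false, entry is a present pte
	                                        replace the pte with a
	                                        migration entry, drop the
	                                        old page's last reference
	page = pte_page(entry);    /* stale */
	get_page(page);            /* BUG: refcount 0 */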

ChangeLog v3:
- doing pte_page() and taking refcount under page table lock
- check pte_present after taking ptl, which makes it unnecessary to use
  get_page_unless_zero()
- use trylock_page in page != pagecache_page case
- fixed target stable version

Fixes: 66aebce747ea ("hugetlb: fix race condition in hugetlb_fault()")
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: <stable@vger.kernel.org>  # [3.2+]
---
 mm/hugetlb.c | 32 ++++++++++++++++++--------------
 1 file changed, 18 insertions(+), 14 deletions(-)

diff --git mmotm-2014-08-25-16-52.orig/mm/hugetlb.c mmotm-2014-08-25-16-52/mm/hugetlb.c
index c5345c5edb50..2aafe073cb06 100644
--- mmotm-2014-08-25-16-52.orig/mm/hugetlb.c
+++ mmotm-2014-08-25-16-52/mm/hugetlb.c
@@ -3184,6 +3184,15 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 								vma, address);
 	}
 
+	ptl = huge_pte_lock(h, mm, ptep);
+
+	/* Check for a racing update before calling hugetlb_cow */
+	if (unlikely(!pte_same(entry, huge_ptep_get(ptep))))
+		goto out_ptl;
+
+	if (!pte_present(entry))
+		goto out_ptl;
+
 	/*
 	 * hugetlb_cow() requires page locks of pte_page(entry) and
 	 * pagecache_page, so here we need take the former one
@@ -3192,22 +3201,17 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	 * so no worry about deadlock.
 	 */
 	page = pte_page(entry);
-	get_page(page);
 	if (page != pagecache_page)
-		lock_page(page);
-
-	ptl = huge_pte_lockptr(h, mm, ptep);
-	spin_lock(ptl);
-	/* Check for a racing update before calling hugetlb_cow */
-	if (unlikely(!pte_same(entry, huge_ptep_get(ptep))))
-		goto out_ptl;
+		if (!trylock_page(page))
+			goto out_ptl;
 
+	get_page(page);
 
 	if (flags & FAULT_FLAG_WRITE) {
 		if (!huge_pte_write(entry)) {
 			ret = hugetlb_cow(mm, vma, address, ptep, entry,
 					pagecache_page, ptl);
-			goto out_ptl;
+			goto out_put_page;
 		}
 		entry = huge_pte_mkdirty(entry);
 	}
@@ -3215,7 +3219,11 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (huge_ptep_set_access_flags(vma, address, ptep, entry,
 						flags & FAULT_FLAG_WRITE))
 		update_mmu_cache(vma, address, ptep);
-
+out_put_page:
+	put_page(page);
+out_unlock_page:
+	if (page != pagecache_page)
+		unlock_page(page);
 out_ptl:
 	spin_unlock(ptl);
 
@@ -3223,10 +3231,6 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 		unlock_page(pagecache_page);
 		put_page(pagecache_page);
 	}
-	if (page != pagecache_page)
-		unlock_page(page);
-	put_page(page);
-
 out_mutex:
 	mutex_unlock(&htlb_fault_mutex_table[hash]);
 	return ret;
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH v3 4/6] mm/hugetlb: add migration entry check in hugetlb_change_protection
  2014-08-29  1:38 ` Naoya Horiguchi
@ 2014-08-29  1:38   ` Naoya Horiguchi
  -1 siblings, 0 replies; 48+ messages in thread
From: Naoya Horiguchi @ 2014-08-29  1:38 UTC (permalink / raw)
  To: Andrew Morton, Hugh Dickins
  Cc: David Rientjes, linux-mm, linux-kernel, Naoya Horiguchi

There is a race condition between hugepage migration and change_protection():
hugetlb_change_protection() doesn't care about migration entries and wrongly
overwrites them. That causes unexpected results such as a kernel crash.

This patch adds is_hugetlb_entry_(migration|hwpoisoned) checks to this
function so that such entries are handled properly.
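
For illustration (a sketch of the failure mode, not literally from a crash
report): without the check, a migration entry falls through to the generic
path below,

	pte = huge_ptep_get_and_clear(mm, address, ptep);
	pte = pte_mkhuge(huge_pte_modify(pte, newprot));
	pte = arch_make_huge_pte(pte, vma, NULL, 0);

which treats the swap-format entry as a normal huge pte and can corrupt it,
so the migration target is never restored. With this patch, a write migration
entry is instead downgraded to a read one, mirroring what change_pte_range()
already does for normal pages:

	make_migration_entry_read(&entry);
	set_pte_at(mm, address, ptep, swp_entry_to_pte(entry));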

ChangeLog v3:
- handle migration entry correctly (instead of just skipping)

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: <stable@vger.kernel.org> # [2.6.36+]
---
 mm/hugetlb.c | 21 ++++++++++++++++++++-
 1 file changed, 20 insertions(+), 1 deletion(-)

diff --git mmotm-2014-08-25-16-52.orig/mm/hugetlb.c mmotm-2014-08-25-16-52/mm/hugetlb.c
index 2aafe073cb06..1ed9df6def54 100644
--- mmotm-2014-08-25-16-52.orig/mm/hugetlb.c
+++ mmotm-2014-08-25-16-52/mm/hugetlb.c
@@ -3362,7 +3362,26 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 			spin_unlock(ptl);
 			continue;
 		}
-		if (!huge_pte_none(huge_ptep_get(ptep))) {
+		pte = huge_ptep_get(ptep);
+		if (unlikely(is_hugetlb_entry_hwpoisoned(pte))) {
+			spin_unlock(ptl);
+			continue;
+		}
+		if (unlikely(is_hugetlb_entry_migration(pte))) {
+			swp_entry_t entry = pte_to_swp_entry(pte);
+
+			if (is_write_migration_entry(entry)) {
+				pte_t newpte;
+
+				make_migration_entry_read(&entry);
+				newpte = swp_entry_to_pte(entry);
+				set_pte_at(mm, address, ptep, newpte);
+				pages++;
+			}
+			spin_unlock(ptl);
+			continue;
+		}
+		if (!huge_pte_none(pte)) {
 			pte = huge_ptep_get_and_clear(mm, address, ptep);
 			pte = pte_mkhuge(huge_pte_modify(pte, newprot));
 			pte = arch_make_huge_pte(pte, vma, NULL, 0);
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH v3 5/6] mm/hugetlb: add migration entry check in __unmap_hugepage_range
  2014-08-29  1:38 ` Naoya Horiguchi
@ 2014-08-29  1:38   ` Naoya Horiguchi
  -1 siblings, 0 replies; 48+ messages in thread
From: Naoya Horiguchi @ 2014-08-29  1:38 UTC (permalink / raw)
  To: Andrew Morton, Hugh Dickins
  Cc: David Rientjes, linux-mm, linux-kernel, Naoya Horiguchi

If __unmap_hugepage_range() tries to unmap an address range over which
hugepage migration is in progress, we get the wrong page because pte_page()
doesn't work for migration entries. This patch calls pte_to_swp_entry() and
migration_entry_to_page() to get the right page for migration entries.
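
To spell out why pte_page() goes wrong here: a migration entry is a
swap-format pte, so the pfn is not encoded where pte_page() expects it, and
the decoded page is unrelated to the one being migrated. The lookup that
works for such an entry is the one used below (shown here only as a minimal
sketch):

	swp_entry_t entry = pte_to_swp_entry(pte);
	struct page *page = migration_entry_to_page(entry);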

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: <stable@vger.kernel.org>  # [2.6.36+]
---
 mm/hugetlb.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git mmotm-2014-08-25-16-52.orig/mm/hugetlb.c mmotm-2014-08-25-16-52/mm/hugetlb.c
index 1ed9df6def54..0a4511115ee0 100644
--- mmotm-2014-08-25-16-52.orig/mm/hugetlb.c
+++ mmotm-2014-08-25-16-52/mm/hugetlb.c
@@ -2652,6 +2652,13 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
 		if (huge_pte_none(pte))
 			goto unlock;
 
+		if (unlikely(is_hugetlb_entry_migration(pte))) {
+			swp_entry_t entry = pte_to_swp_entry(pte);
+
+			page = migration_entry_to_page(entry);
+			goto clear;
+		}
+
 		/*
 		 * HWPoisoned hugepage is already unmapped and dropped reference
 		 */
@@ -2677,7 +2684,7 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
 			 */
 			set_vma_resv_flags(vma, HPAGE_RESV_UNMAPPED);
 		}
-
+clear:
 		pte = huge_ptep_get_and_clear(mm, address, ptep);
 		tlb_remove_tlb_entry(tlb, ptep, address);
 		if (huge_pte_dirty(pte))
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH v3 6/6] mm/hugetlb: remove unused argument of follow_huge_addr()
  2014-08-29  1:38 ` Naoya Horiguchi
@ 2014-08-29  1:39   ` Naoya Horiguchi
  -1 siblings, 0 replies; 48+ messages in thread
From: Naoya Horiguchi @ 2014-08-29  1:39 UTC (permalink / raw)
  To: Andrew Morton, Hugh Dickins
  Cc: David Rientjes, linux-mm, linux-kernel, Naoya Horiguchi

follow_huge_addr()'s 'write' parameter is not used, so let's remove it.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
---
 arch/ia64/mm/hugetlbpage.c    | 2 +-
 arch/powerpc/mm/hugetlbpage.c | 2 +-
 arch/x86/mm/hugetlbpage.c     | 2 +-
 include/linux/hugetlb.h       | 5 ++---
 mm/gup.c                      | 2 +-
 mm/hugetlb.c                  | 3 +--
 6 files changed, 7 insertions(+), 9 deletions(-)

diff --git mmotm-2014-08-25-16-52.orig/arch/ia64/mm/hugetlbpage.c mmotm-2014-08-25-16-52/arch/ia64/mm/hugetlbpage.c
index 6170381bf074..524a4e001bda 100644
--- mmotm-2014-08-25-16-52.orig/arch/ia64/mm/hugetlbpage.c
+++ mmotm-2014-08-25-16-52/arch/ia64/mm/hugetlbpage.c
@@ -89,7 +89,7 @@ int prepare_hugepage_range(struct file *file,
 	return 0;
 }
 
-struct page *follow_huge_addr(struct mm_struct *mm, unsigned long addr, int write)
+struct page *follow_huge_addr(struct mm_struct *mm, unsigned long addr)
 {
 	struct page *page = NULL;
 	pte_t *ptep;
diff --git mmotm-2014-08-25-16-52.orig/arch/powerpc/mm/hugetlbpage.c mmotm-2014-08-25-16-52/arch/powerpc/mm/hugetlbpage.c
index 1d8854a56309..5b6fe8b0cde3 100644
--- mmotm-2014-08-25-16-52.orig/arch/powerpc/mm/hugetlbpage.c
+++ mmotm-2014-08-25-16-52/arch/powerpc/mm/hugetlbpage.c
@@ -674,7 +674,7 @@ void hugetlb_free_pgd_range(struct mmu_gather *tlb,
 }
 
 struct page *
-follow_huge_addr(struct mm_struct *mm, unsigned long address, int write)
+follow_huge_addr(struct mm_struct *mm, unsigned long address)
 {
 	pte_t *ptep;
 	struct page *page = ERR_PTR(-EINVAL);
diff --git mmotm-2014-08-25-16-52.orig/arch/x86/mm/hugetlbpage.c mmotm-2014-08-25-16-52/arch/x86/mm/hugetlbpage.c
index 03b8a7c11817..cab09d87ae65 100644
--- mmotm-2014-08-25-16-52.orig/arch/x86/mm/hugetlbpage.c
+++ mmotm-2014-08-25-16-52/arch/x86/mm/hugetlbpage.c
@@ -18,7 +18,7 @@
 
 #if 0	/* This is just for testing */
 struct page *
-follow_huge_addr(struct mm_struct *mm, unsigned long address, int write)
+follow_huge_addr(struct mm_struct *mm, unsigned long address)
 {
 	unsigned long start = address;
 	int length = 1;
diff --git mmotm-2014-08-25-16-52.orig/include/linux/hugetlb.h mmotm-2014-08-25-16-52/include/linux/hugetlb.h
index b3200fce07aa..cdff1bd393bb 100644
--- mmotm-2014-08-25-16-52.orig/include/linux/hugetlb.h
+++ mmotm-2014-08-25-16-52/include/linux/hugetlb.h
@@ -96,8 +96,7 @@ pte_t *huge_pte_alloc(struct mm_struct *mm,
 			unsigned long addr, unsigned long sz);
 pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr);
 int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr, pte_t *ptep);
-struct page *follow_huge_addr(struct mm_struct *mm, unsigned long address,
-			      int write);
+struct page *follow_huge_addr(struct mm_struct *mm, unsigned long address);
 struct page *follow_huge_pmd(struct vm_area_struct *vma, unsigned long address,
 				pmd_t *pmd, int flags);
 struct page *follow_huge_pud(struct vm_area_struct *vma, unsigned long address,
@@ -124,7 +123,7 @@ static inline unsigned long hugetlb_total_pages(void)
 }
 
 #define follow_hugetlb_page(m,v,p,vs,a,b,i,w)	({ BUG(); 0; })
-#define follow_huge_addr(mm, addr, write)	ERR_PTR(-EINVAL)
+#define follow_huge_addr(mm, addr)	ERR_PTR(-EINVAL)
 #define copy_hugetlb_page_range(src, dst, vma)	({ BUG(); 0; })
 static inline void hugetlb_report_meminfo(struct seq_file *m)
 {
diff --git mmotm-2014-08-25-16-52.orig/mm/gup.c mmotm-2014-08-25-16-52/mm/gup.c
index 597a5e92e265..8f0550f1770d 100644
--- mmotm-2014-08-25-16-52.orig/mm/gup.c
+++ mmotm-2014-08-25-16-52/mm/gup.c
@@ -149,7 +149,7 @@ struct page *follow_page_mask(struct vm_area_struct *vma,
 
 	*page_mask = 0;
 
-	page = follow_huge_addr(mm, address, flags & FOLL_WRITE);
+	page = follow_huge_addr(mm, address);
 	if (!IS_ERR(page)) {
 		BUG_ON(flags & FOLL_GET);
 		return page;
diff --git mmotm-2014-08-25-16-52.orig/mm/hugetlb.c mmotm-2014-08-25-16-52/mm/hugetlb.c
index 0a4511115ee0..f7dcad3474ec 100644
--- mmotm-2014-08-25-16-52.orig/mm/hugetlb.c
+++ mmotm-2014-08-25-16-52/mm/hugetlb.c
@@ -3690,8 +3690,7 @@ pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr)
  * behavior.
  */
 struct page * __weak
-follow_huge_addr(struct mm_struct *mm, unsigned long address,
-			      int write)
+follow_huge_addr(struct mm_struct *mm, unsigned long address)
 {
 	return ERR_PTR(-EINVAL);
 }
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* Re: [PATCH 0/6] hugepage migration fixes (v3)
  2014-08-29  1:38 ` Naoya Horiguchi
@ 2014-08-31 15:27   ` Andi Kleen
  -1 siblings, 0 replies; 48+ messages in thread
From: Andi Kleen @ 2014-08-31 15:27 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: Andrew Morton, Hugh Dickins, David Rientjes, linux-mm,
	linux-kernel, Naoya Horiguchi

Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> writes:

> This is the ver.3 of hugepage migration fix patchset.

I wonder how far we are from supporting THP migration with the
standard migrate_pages() syscall?

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 0/6] hugepage migration fixes (v3)
  2014-08-31 15:27   ` Andi Kleen
@ 2014-09-01  4:08     ` Naoya Horiguchi
  -1 siblings, 0 replies; 48+ messages in thread
From: Naoya Horiguchi @ 2014-09-01  4:08 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andrew Morton, Hugh Dickins, David Rientjes, linux-mm,
	linux-kernel, Naoya Horiguchi

On Sun, Aug 31, 2014 at 08:27:35AM -0700, Andi Kleen wrote:
> Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> writes:
> 
> > This is the ver.3 of hugepage migration fix patchset.
> 
> I wonder how far we are from supporting THP migration with the
> standard migrate_pages() syscall?

I don't think that we are very far from this, because we can borrow
some code from migrate_misplaced_transhuge_page(), and the experience
from hugetlb migration will also be helpful.
The difficulty is rather in integrating THP support into the existing
migration code, which is already very complicated, so careful code
reading and testing are necessary.
This topic has been on my agenda for a long time, but I have nothing
concrete to show at this point.

Thanks,
Naoya Horiguchi

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v3 1/6] mm/hugetlb: reduce arch dependent code around follow_huge_*
  2014-08-29  1:38   ` Naoya Horiguchi
@ 2014-09-03 19:40     ` Hugh Dickins
  -1 siblings, 0 replies; 48+ messages in thread
From: Hugh Dickins @ 2014-09-03 19:40 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: James Hogan, Andrew Morton, Hugh Dickins, David Rientjes,
	linux-mm, linux-kernel, Naoya Horiguchi

On Thu, 28 Aug 2014, Naoya Horiguchi wrote:

> Currently we have many duplicates in definitions around follow_huge_addr(),
> follow_huge_pmd(), and follow_huge_pud(), so this patch tries to remove them.
> The basic idea is to put the default implementation for these functions in
> mm/hugetlb.c as weak symbols (regardless of CONFIG_ARCH_WANT_GENERAL_HUGETLB),
> and to implement arch-specific code only when the arch needs it.
> 
> For follow_huge_addr(), only powerpc and ia64 have their own implementation,
> and in all other architectures this function just returns ERR_PTR(-EINVAL).
> So this patch makes returning ERR_PTR(-EINVAL) the default.
> 
> As for follow_huge_(pmd|pud)(), if (pmd|pud)_huge() is implemented to always
> return 0 in your architecture (like in ia64 or sparc), it's never called
> (the callsite is optimized away) no matter how it is implemented.
> So in such architectures, we don't need an arch-specific implementation.
> 
> In some architectures (like mips, s390 and tile), the current arch-specific
> follow_huge_(pmd|pud)() are effectively identical to the common code,
> so this patch lets these architectures use the common code.
> 
> One exception is metag, where pmd_huge() could return non-zero but it expects
> follow_huge_pmd() to always return NULL. This means that we need an
> arch-specific implementation which returns NULL. This behavior looks strange
> to me (because non-zero pmd_huge() implies that the architecture supports
> PMD-based hugepages, so follow_huge_pmd() can/should return some relevant
> value), but that's beyond this cleanup patch, so let's keep it.
> 
> Justification of non-trivial changes:
> - in s390, follow_huge_pmd() checks !MACHINE_HAS_HPAGE at first, and this
>   patch removes the check. This is OK because we can assume MACHINE_HAS_HPAGE
>   is true when follow_huge_pmd() can be called (note that pmd_huge() has
>   the same check and always returns 0 for !MACHINE_HAS_HPAGE.)
> - in s390 and mips, we use HPAGE_MASK instead of PMD_MASK as done in common
>   code. This patch forces these archs to use PMD_MASK, but it's OK because
>   they are identical in both archs.
>   In s390, both of HPAGE_SHIFT and PMD_SHIFT are 20.
>   In mips, HPAGE_SHIFT is defined as (PAGE_SHIFT + PAGE_SHIFT - 3) and
>   PMD_SHIFT is defined as (PAGE_SHIFT + PAGE_SHIFT + PTE_ORDER - 3), but
>   PTE_ORDER is always 0, so these are identical.
> 
> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

Acked-by: Hugh Dickins <hughd@google.com>

> ---
>  arch/arm/mm/hugetlbpage.c     |  6 ------
>  arch/arm64/mm/hugetlbpage.c   |  6 ------
>  arch/ia64/mm/hugetlbpage.c    |  6 ------
>  arch/metag/mm/hugetlbpage.c   |  6 ------
>  arch/mips/mm/hugetlbpage.c    | 18 ------------------
>  arch/powerpc/mm/hugetlbpage.c |  8 ++++++++
>  arch/s390/mm/hugetlbpage.c    | 20 --------------------
>  arch/sh/mm/hugetlbpage.c      | 12 ------------
>  arch/sparc/mm/hugetlbpage.c   | 12 ------------
>  arch/tile/mm/hugetlbpage.c    | 28 ----------------------------
>  arch/x86/mm/hugetlbpage.c     | 12 ------------
>  mm/hugetlb.c                  | 30 +++++++++++++++---------------
>  12 files changed, 23 insertions(+), 141 deletions(-)

I like this very much.  And I agree with each of your decisions above,
which you described very well in the commit message.  Not everybody
likes __weak-ness, but hugetlb.c is already using that technique,
and I think you're right to extend it to these functions.
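
For anyone unfamiliar with the pattern, a minimal sketch of how the
weak/strong override resolves (simplified, not taken from the patch):

	/* mm/hugetlb.c: generic default, emitted as a weak symbol */
	struct page * __weak
	follow_huge_addr(struct mm_struct *mm, unsigned long address, int write)
	{
		return ERR_PTR(-EINVAL);
	}

	/* arch/ia64/mm/hugetlbpage.c: a strong definition with the same
	 * prototype; the linker prefers it over the weak default, so no
	 * #ifdef or Kconfig glue is needed */
	struct page *
	follow_huge_addr(struct mm_struct *mm, unsigned long address, int write)
	{
		return NULL;	/* the arch-specific page table walk is elided */
	}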

What a delight to be able to see at a grep-glance that only ia64
and powerpc use the follow_huge_addr() method!

I agree that the metag situation is odd: I suppose almost nothing ends
up using follow_huge_pmd() apart from move_pages(), so it barely matters
at present; but odd-ones-out present a risk, and it will prevent your
hugetlb migration from being extended to metag.

Let's Cc James Hogan, who I hope will be able to say that metag can
just use the default implementation of follow_huge_pmd().

It would be good to Cc the other affected architecture maintainers
next time, but I don't expect this will pose any problem for them.

The only problem I have with this patch is its position in the series:
it's a cleanup, and so not marked for stable; but it's 1/6, and so at
least one of the fixes for stable depends on it.

It might be possible to shift it to the end of the series, and fix
only the x86 hugetlb migration for stable (since you have disabled it
on all other architectures for now); but I expect that would get too
messy, and you'll end up preferring to keep this as 1/6 and mark it
for stable too.

> 
> diff --git mmotm-2014-08-25-16-52.orig/arch/arm/mm/hugetlbpage.c mmotm-2014-08-25-16-52/arch/arm/mm/hugetlbpage.c
> index 66781bf34077..c72412415093 100644
> --- mmotm-2014-08-25-16-52.orig/arch/arm/mm/hugetlbpage.c
> +++ mmotm-2014-08-25-16-52/arch/arm/mm/hugetlbpage.c
> @@ -36,12 +36,6 @@
>   * of type casting from pmd_t * to pte_t *.
>   */
>  
> -struct page *follow_huge_addr(struct mm_struct *mm, unsigned long address,
> -			      int write)
> -{
> -	return ERR_PTR(-EINVAL);
> -}
> -
>  int pud_huge(pud_t pud)
>  {
>  	return 0;
> diff --git mmotm-2014-08-25-16-52.orig/arch/arm64/mm/hugetlbpage.c mmotm-2014-08-25-16-52/arch/arm64/mm/hugetlbpage.c
> index 023747bf4dd7..2de9d2e59d96 100644
> --- mmotm-2014-08-25-16-52.orig/arch/arm64/mm/hugetlbpage.c
> +++ mmotm-2014-08-25-16-52/arch/arm64/mm/hugetlbpage.c
> @@ -38,12 +38,6 @@ int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr, pte_t *ptep)
>  }
>  #endif
>  
> -struct page *follow_huge_addr(struct mm_struct *mm, unsigned long address,
> -			      int write)
> -{
> -	return ERR_PTR(-EINVAL);
> -}
> -
>  int pmd_huge(pmd_t pmd)
>  {
>  	return !(pmd_val(pmd) & PMD_TABLE_BIT);
> diff --git mmotm-2014-08-25-16-52.orig/arch/ia64/mm/hugetlbpage.c mmotm-2014-08-25-16-52/arch/ia64/mm/hugetlbpage.c
> index 76069c18ee42..52b7604b5215 100644
> --- mmotm-2014-08-25-16-52.orig/arch/ia64/mm/hugetlbpage.c
> +++ mmotm-2014-08-25-16-52/arch/ia64/mm/hugetlbpage.c
> @@ -114,12 +114,6 @@ int pud_huge(pud_t pud)
>  	return 0;
>  }
>  
> -struct page *
> -follow_huge_pmd(struct mm_struct *mm, unsigned long address, pmd_t *pmd, int write)
> -{
> -	return NULL;
> -}
> -
>  void hugetlb_free_pgd_range(struct mmu_gather *tlb,
>  			unsigned long addr, unsigned long end,
>  			unsigned long floor, unsigned long ceiling)
> diff --git mmotm-2014-08-25-16-52.orig/arch/metag/mm/hugetlbpage.c mmotm-2014-08-25-16-52/arch/metag/mm/hugetlbpage.c
> index 3c52fa6d0f8e..745081427659 100644
> --- mmotm-2014-08-25-16-52.orig/arch/metag/mm/hugetlbpage.c
> +++ mmotm-2014-08-25-16-52/arch/metag/mm/hugetlbpage.c
> @@ -94,12 +94,6 @@ int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr, pte_t *ptep)
>  	return 0;
>  }
>  
> -struct page *follow_huge_addr(struct mm_struct *mm,
> -			      unsigned long address, int write)
> -{
> -	return ERR_PTR(-EINVAL);
> -}
> -
>  int pmd_huge(pmd_t pmd)
>  {
>  	return pmd_page_shift(pmd) > PAGE_SHIFT;
> diff --git mmotm-2014-08-25-16-52.orig/arch/mips/mm/hugetlbpage.c mmotm-2014-08-25-16-52/arch/mips/mm/hugetlbpage.c
> index 4ec8ee10d371..06e0f421b41b 100644
> --- mmotm-2014-08-25-16-52.orig/arch/mips/mm/hugetlbpage.c
> +++ mmotm-2014-08-25-16-52/arch/mips/mm/hugetlbpage.c
> @@ -68,12 +68,6 @@ int is_aligned_hugepage_range(unsigned long addr, unsigned long len)
>  	return 0;
>  }
>  
> -struct page *
> -follow_huge_addr(struct mm_struct *mm, unsigned long address, int write)
> -{
> -	return ERR_PTR(-EINVAL);
> -}
> -
>  int pmd_huge(pmd_t pmd)
>  {
>  	return (pmd_val(pmd) & _PAGE_HUGE) != 0;
> @@ -83,15 +77,3 @@ int pud_huge(pud_t pud)
>  {
>  	return (pud_val(pud) & _PAGE_HUGE) != 0;
>  }
> -
> -struct page *
> -follow_huge_pmd(struct mm_struct *mm, unsigned long address,
> -		pmd_t *pmd, int write)
> -{
> -	struct page *page;
> -
> -	page = pte_page(*(pte_t *)pmd);
> -	if (page)
> -		page += ((address & ~HPAGE_MASK) >> PAGE_SHIFT);
> -	return page;
> -}
> diff --git mmotm-2014-08-25-16-52.orig/arch/powerpc/mm/hugetlbpage.c mmotm-2014-08-25-16-52/arch/powerpc/mm/hugetlbpage.c
> index 7e70ae968e5f..9517a93a315c 100644
> --- mmotm-2014-08-25-16-52.orig/arch/powerpc/mm/hugetlbpage.c
> +++ mmotm-2014-08-25-16-52/arch/powerpc/mm/hugetlbpage.c
> @@ -706,6 +706,14 @@ follow_huge_pmd(struct mm_struct *mm, unsigned long address,
>  	return NULL;
>  }
>  
> +struct page *
> +follow_huge_pud(struct mm_struct *mm, unsigned long address,
> +		pmd_t *pmd, int write)
> +{
> +	BUG();
> +	return NULL;
> +}
> +
>  static unsigned long hugepte_addr_end(unsigned long addr, unsigned long end,
>  				      unsigned long sz)
>  {
> diff --git mmotm-2014-08-25-16-52.orig/arch/s390/mm/hugetlbpage.c mmotm-2014-08-25-16-52/arch/s390/mm/hugetlbpage.c
> index 0ff66a7e29bb..811e7f9a2de0 100644
> --- mmotm-2014-08-25-16-52.orig/arch/s390/mm/hugetlbpage.c
> +++ mmotm-2014-08-25-16-52/arch/s390/mm/hugetlbpage.c
> @@ -201,12 +201,6 @@ int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr, pte_t *ptep)
>  	return 0;
>  }
>  
> -struct page *follow_huge_addr(struct mm_struct *mm, unsigned long address,
> -			      int write)
> -{
> -	return ERR_PTR(-EINVAL);
> -}
> -
>  int pmd_huge(pmd_t pmd)
>  {
>  	if (!MACHINE_HAS_HPAGE)
> @@ -219,17 +213,3 @@ int pud_huge(pud_t pud)
>  {
>  	return 0;
>  }
> -
> -struct page *follow_huge_pmd(struct mm_struct *mm, unsigned long address,
> -			     pmd_t *pmdp, int write)
> -{
> -	struct page *page;
> -
> -	if (!MACHINE_HAS_HPAGE)
> -		return NULL;
> -
> -	page = pmd_page(*pmdp);
> -	if (page)
> -		page += ((address & ~HPAGE_MASK) >> PAGE_SHIFT);
> -	return page;
> -}
> diff --git mmotm-2014-08-25-16-52.orig/arch/sh/mm/hugetlbpage.c mmotm-2014-08-25-16-52/arch/sh/mm/hugetlbpage.c
> index d7762349ea48..534bc978af8a 100644
> --- mmotm-2014-08-25-16-52.orig/arch/sh/mm/hugetlbpage.c
> +++ mmotm-2014-08-25-16-52/arch/sh/mm/hugetlbpage.c
> @@ -67,12 +67,6 @@ int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr, pte_t *ptep)
>  	return 0;
>  }
>  
> -struct page *follow_huge_addr(struct mm_struct *mm,
> -			      unsigned long address, int write)
> -{
> -	return ERR_PTR(-EINVAL);
> -}
> -
>  int pmd_huge(pmd_t pmd)
>  {
>  	return 0;
> @@ -82,9 +76,3 @@ int pud_huge(pud_t pud)
>  {
>  	return 0;
>  }
> -
> -struct page *follow_huge_pmd(struct mm_struct *mm, unsigned long address,
> -			     pmd_t *pmd, int write)
> -{
> -	return NULL;
> -}
> diff --git mmotm-2014-08-25-16-52.orig/arch/sparc/mm/hugetlbpage.c mmotm-2014-08-25-16-52/arch/sparc/mm/hugetlbpage.c
> index d329537739c6..4242eab12e10 100644
> --- mmotm-2014-08-25-16-52.orig/arch/sparc/mm/hugetlbpage.c
> +++ mmotm-2014-08-25-16-52/arch/sparc/mm/hugetlbpage.c
> @@ -215,12 +215,6 @@ pte_t huge_ptep_get_and_clear(struct mm_struct *mm, unsigned long addr,
>  	return entry;
>  }
>  
> -struct page *follow_huge_addr(struct mm_struct *mm,
> -			      unsigned long address, int write)
> -{
> -	return ERR_PTR(-EINVAL);
> -}
> -
>  int pmd_huge(pmd_t pmd)
>  {
>  	return 0;
> @@ -230,9 +224,3 @@ int pud_huge(pud_t pud)
>  {
>  	return 0;
>  }
> -
> -struct page *follow_huge_pmd(struct mm_struct *mm, unsigned long address,
> -			     pmd_t *pmd, int write)
> -{
> -	return NULL;
> -}
> diff --git mmotm-2014-08-25-16-52.orig/arch/tile/mm/hugetlbpage.c mmotm-2014-08-25-16-52/arch/tile/mm/hugetlbpage.c
> index e514899e1100..8a00c7b7b862 100644
> --- mmotm-2014-08-25-16-52.orig/arch/tile/mm/hugetlbpage.c
> +++ mmotm-2014-08-25-16-52/arch/tile/mm/hugetlbpage.c
> @@ -150,12 +150,6 @@ pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr)
>  	return NULL;
>  }
>  
> -struct page *follow_huge_addr(struct mm_struct *mm, unsigned long address,
> -			      int write)
> -{
> -	return ERR_PTR(-EINVAL);
> -}
> -
>  int pmd_huge(pmd_t pmd)
>  {
>  	return !!(pmd_val(pmd) & _PAGE_HUGE_PAGE);
> @@ -166,28 +160,6 @@ int pud_huge(pud_t pud)
>  	return !!(pud_val(pud) & _PAGE_HUGE_PAGE);
>  }
>  
> -struct page *follow_huge_pmd(struct mm_struct *mm, unsigned long address,
> -			     pmd_t *pmd, int write)
> -{
> -	struct page *page;
> -
> -	page = pte_page(*(pte_t *)pmd);
> -	if (page)
> -		page += ((address & ~PMD_MASK) >> PAGE_SHIFT);
> -	return page;
> -}
> -
> -struct page *follow_huge_pud(struct mm_struct *mm, unsigned long address,
> -			     pud_t *pud, int write)
> -{
> -	struct page *page;
> -
> -	page = pte_page(*(pte_t *)pud);
> -	if (page)
> -		page += ((address & ~PUD_MASK) >> PAGE_SHIFT);
> -	return page;
> -}
> -
>  int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr, pte_t *ptep)
>  {
>  	return 0;
> diff --git mmotm-2014-08-25-16-52.orig/arch/x86/mm/hugetlbpage.c mmotm-2014-08-25-16-52/arch/x86/mm/hugetlbpage.c
> index 8b977ebf9388..03b8a7c11817 100644
> --- mmotm-2014-08-25-16-52.orig/arch/x86/mm/hugetlbpage.c
> +++ mmotm-2014-08-25-16-52/arch/x86/mm/hugetlbpage.c
> @@ -52,20 +52,8 @@ int pud_huge(pud_t pud)
>  	return 0;
>  }
>  
> -struct page *
> -follow_huge_pmd(struct mm_struct *mm, unsigned long address,
> -		pmd_t *pmd, int write)
> -{
> -	return NULL;
> -}
>  #else
>  
> -struct page *
> -follow_huge_addr(struct mm_struct *mm, unsigned long address, int write)
> -{
> -	return ERR_PTR(-EINVAL);
> -}
> -
>  int pmd_huge(pmd_t pmd)
>  {
>  	return !!(pmd_val(pmd) & _PAGE_PSE);
> diff --git mmotm-2014-08-25-16-52.orig/mm/hugetlb.c mmotm-2014-08-25-16-52/mm/hugetlb.c
> index eeceeeb09019..022767506c7b 100644
> --- mmotm-2014-08-25-16-52.orig/mm/hugetlb.c
> +++ mmotm-2014-08-25-16-52/mm/hugetlb.c
> @@ -3653,7 +3653,20 @@ pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr)
>  	return (pte_t *) pmd;
>  }
>  
> -struct page *
> +#endif /* CONFIG_ARCH_WANT_GENERAL_HUGETLB */
> +
> +/*
> + * These functions are overwritable if your architecture needs its own
> + * behavior.
> + */
> +struct page * __weak
> +follow_huge_addr(struct mm_struct *mm, unsigned long address,
> +			      int write)
> +{
> +	return ERR_PTR(-EINVAL);
> +}
> +
> +struct page * __weak
>  follow_huge_pmd(struct mm_struct *mm, unsigned long address,
>  		pmd_t *pmd, int write)
>  {
> @@ -3665,7 +3678,7 @@ follow_huge_pmd(struct mm_struct *mm, unsigned long address,
>  	return page;
>  }
>  
> -struct page *
> +struct page * __weak
>  follow_huge_pud(struct mm_struct *mm, unsigned long address,
>  		pud_t *pud, int write)
>  {
> @@ -3677,19 +3690,6 @@ follow_huge_pud(struct mm_struct *mm, unsigned long address,
>  	return page;
>  }
>  
> -#else /* !CONFIG_ARCH_WANT_GENERAL_HUGETLB */
> -
> -/* Can be overriden by architectures */
> -struct page * __weak
> -follow_huge_pud(struct mm_struct *mm, unsigned long address,
> -	       pud_t *pud, int write)
> -{
> -	BUG();
> -	return NULL;
> -}
> -
> -#endif /* CONFIG_ARCH_WANT_GENERAL_HUGETLB */
> -
>  #ifdef CONFIG_MEMORY_FAILURE
>  
>  /* Should be called in hugetlb_lock */
> -- 
> 1.9.3

^ permalink raw reply	[flat|nested] 48+ messages in thread

> -follow_huge_pud(struct mm_struct *mm, unsigned long address,
> -	       pud_t *pud, int write)
> -{
> -	BUG();
> -	return NULL;
> -}
> -
> -#endif /* CONFIG_ARCH_WANT_GENERAL_HUGETLB */
> -
>  #ifdef CONFIG_MEMORY_FAILURE
>  
>  /* Should be called in hugetlb_lock */
> -- 
> 1.9.3

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v3 2/6] mm/hugetlb: take page table lock in follow_huge_(addr|pmd|pud)()
  2014-08-29  1:38   ` Naoya Horiguchi
@ 2014-09-03 21:17     ` Hugh Dickins
  -1 siblings, 0 replies; 48+ messages in thread
From: Hugh Dickins @ 2014-09-03 21:17 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: Kirill A. Shutemov, Andrew Morton, Hugh Dickins, David Rientjes,
	linux-mm, linux-kernel, Naoya Horiguchi

On Thu, 28 Aug 2014, Naoya Horiguchi wrote:

> We have a race condition between move_pages() and freeing hugepages,
> where move_pages() calls follow_page(FOLL_GET) for hugepages internally
> and tries to get its refcount without preventing concurrent freeing.
> This race crashes the kernel, so this patch fixes it by moving FOLL_GET
> code for hugepages into follow_huge_pmd() with taking the page table lock.

You really ought to mention how you are intentionally dropping the
unnecessary check for NULL pte_page() in this patch: we agree on that,
but it does need to be mentioned somewhere in the comment.

> 
> This patch also adds the similar locking to follow_huge_(addr|pud)
> for consistency.
> 
> Here is the reproducer:
> 
>   $ cat movepages.c
>   #include <stdio.h>
>   #include <stdlib.h>
>   #include <numaif.h>
> 
>   #define ADDR_INPUT      0x700000000000UL
>   #define HPS             0x200000
>   #define PS              0x1000
> 
>   int main(int argc, char *argv[]) {
>           int i;
>           int nr_hp = strtol(argv[1], NULL, 0);
>           int nr_p  = nr_hp * HPS / PS;
>           int ret;
>           void **addrs;
>           int *status;
>           int *nodes;
>           pid_t pid;
> 
>           pid = strtol(argv[2], NULL, 0);
>           addrs  = malloc(sizeof(char *) * nr_p + 1);
>           status = malloc(sizeof(char *) * nr_p + 1);
>           nodes  = malloc(sizeof(char *) * nr_p + 1);
> 
>           while (1) {
>                   for (i = 0; i < nr_p; i++) {
>                           addrs[i] = (void *)ADDR_INPUT + i * PS;
>                           nodes[i] = 1;
>                           status[i] = 0;
>                   }
>                   ret = numa_move_pages(pid, nr_p, addrs, nodes, status,
>                                         MPOL_MF_MOVE_ALL);
>                   if (ret == -1)
>                           err("move_pages");
> 
>                   for (i = 0; i < nr_p; i++) {
>                           addrs[i] = (void *)ADDR_INPUT + i * PS;
>                           nodes[i] = 0;
>                           status[i] = 0;
>                   }
>                   ret = numa_move_pages(pid, nr_p, addrs, nodes, status,
>                                         MPOL_MF_MOVE_ALL);
>                   if (ret == -1)
>                           err("move_pages");
>           }
>           return 0;
>   }
> 
>   $ cat hugepage.c
>   #include <stdio.h>
>   #include <sys/mman.h>
>   #include <string.h>
> 
>   #define ADDR_INPUT      0x700000000000UL
>   #define HPS             0x200000
> 
>   int main(int argc, char *argv[]) {
>           int nr_hp = strtol(argv[1], NULL, 0);
>           char *p;
> 
>           while (1) {
>                   p = mmap((void *)ADDR_INPUT, nr_hp * HPS, PROT_READ | PROT_WRITE,
>                            MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
>                   if (p != (void *)ADDR_INPUT) {
>                           perror("mmap");
>                           break;
>                   }
>                   memset(p, 0, nr_hp * HPS);
>                   munmap(p, nr_hp * HPS);
>           }
>   }
> 
>   $ sysctl vm.nr_hugepages=40
>   $ ./hugepage 10 &
>   $ ./movepages 10 $(pgrep -f hugepage)
> 
> Note for stable inclusion:
>   This patch fixes e632a938d914 ("mm: migrate: add hugepage migration code
>   to move_pages()"), so is applicable to -stable kernels which include it.

Just say
Fixes: e632a938d914 ("mm: migrate: add hugepage migration code to move_pages()")

> 
> ChangeLog v3:
> - remove unnecessary if (page) check
> - check (pmd|pud)_huge again after holding ptl
> - do the same change also on follow_huge_pud()
> - take page table lock also in follow_huge_addr()
> 
> ChangeLog v2:
> - introduce follow_huge_pmd_lock() to do locking in arch-independent code.

ChangeLog vN info belongs below the ---

> 
> Reported-by: Hugh Dickins <hughd@google.com>
> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Cc: <stable@vger.kernel.org>  # [3.12+]

No ack to this one yet, I'm afraid.

> ---
>  arch/ia64/mm/hugetlbpage.c    |  9 +++++++--
>  arch/metag/mm/hugetlbpage.c   |  4 ++--
>  arch/powerpc/mm/hugetlbpage.c | 22 +++++++++++-----------
>  include/linux/hugetlb.h       | 12 ++++++------
>  mm/gup.c                      | 25 ++++---------------------
>  mm/hugetlb.c                  | 43 +++++++++++++++++++++++++++++++------------
>  6 files changed, 61 insertions(+), 54 deletions(-)
> 
> diff --git mmotm-2014-08-25-16-52.orig/arch/ia64/mm/hugetlbpage.c mmotm-2014-08-25-16-52/arch/ia64/mm/hugetlbpage.c
> index 52b7604b5215..6170381bf074 100644
> --- mmotm-2014-08-25-16-52.orig/arch/ia64/mm/hugetlbpage.c
> +++ mmotm-2014-08-25-16-52/arch/ia64/mm/hugetlbpage.c
> @@ -91,17 +91,22 @@ int prepare_hugepage_range(struct file *file,
>  
>  struct page *follow_huge_addr(struct mm_struct *mm, unsigned long addr, int write)
>  {
> -	struct page *page;
> +	struct page *page = NULL;
>  	pte_t *ptep;
> +	spinlock_t *ptl;
>  
>  	if (REGION_NUMBER(addr) != RGN_HPAGE)
>  		return ERR_PTR(-EINVAL);
>  
>  	ptep = huge_pte_offset(mm, addr);
> +	ptl = huge_pte_lock(hstate_vma(vma), vma->vm_mm, ptep);

It was a mistake to lump this follow_huge_addr() change in with the
rest: please defer it to your 6/6 (or send 5 and leave 6th to later).

Unless I'm missing something, all you succeed in doing here is breaking
the build on ia64 and powerpc, by introducing an undeclared "vma" variable.

There is no point whatever in taking and dropping this lock: the
point was to do the get_page while holding the relevant page table lock,
but you're not doing any get_page, and you still have an "int write"
argument instead of "int flags" to pass down the FOLL_GET flag,
and you still have the BUG_ON(flags & FOLL_GET) in follow_page_mask().

So, please throw these follow_huge_addr() parts out of this patch.

>  	if (!ptep || pte_none(*ptep))
> -		return NULL;
> +		goto out;
> +
>  	page = pte_page(*ptep);
>  	page += ((addr & ~HPAGE_MASK) >> PAGE_SHIFT);
> +out:
> +	spin_unlock(ptl);
>  	return page;
>  }
>  int pmd_huge(pmd_t pmd)
> diff --git mmotm-2014-08-25-16-52.orig/arch/metag/mm/hugetlbpage.c mmotm-2014-08-25-16-52/arch/metag/mm/hugetlbpage.c
> index 745081427659..5e96ef096df9 100644
> --- mmotm-2014-08-25-16-52.orig/arch/metag/mm/hugetlbpage.c
> +++ mmotm-2014-08-25-16-52/arch/metag/mm/hugetlbpage.c
> @@ -104,8 +104,8 @@ int pud_huge(pud_t pud)
>  	return 0;
>  }
>  
> -struct page *follow_huge_pmd(struct mm_struct *mm, unsigned long address,
> -			     pmd_t *pmd, int write)
> +struct page *follow_huge_pmd(struct vm_area_struct *vma, unsigned long address,
> +			     pmd_t *pmd, int flags)

Change from "write" to "flags" is good, but I question below whether
we actually need to change from mm to vma in follow_huge_pmd() and
follow_huge_pud().

>  {
>  	return NULL;
>  }
> diff --git mmotm-2014-08-25-16-52.orig/arch/powerpc/mm/hugetlbpage.c mmotm-2014-08-25-16-52/arch/powerpc/mm/hugetlbpage.c
> index 9517a93a315c..1d8854a56309 100644
> --- mmotm-2014-08-25-16-52.orig/arch/powerpc/mm/hugetlbpage.c
> +++ mmotm-2014-08-25-16-52/arch/powerpc/mm/hugetlbpage.c
> @@ -677,38 +677,38 @@ struct page *
>  follow_huge_addr(struct mm_struct *mm, unsigned long address, int write)
>  {
>  	pte_t *ptep;
> -	struct page *page;
> +	struct page *page = ERR_PTR(-EINVAL);
>  	unsigned shift;
>  	unsigned long mask;
> +	spinlock_t *ptl;
>  	/*
>  	 * Transparent hugepages are handled by generic code. We can skip them
>  	 * here.
>  	 */
>  	ptep = find_linux_pte_or_hugepte(mm->pgd, address, &shift);
> -
> +	ptl = huge_pte_lock(hstate_vma(vma), vma->vm_mm, ptep);

As above, you're breaking the build with a lock that serves no purpose
in the current patch.

>  	/* Verify it is a huge page else bail. */
>  	if (!ptep || !shift || pmd_trans_huge(*(pmd_t *)ptep))
> -		return ERR_PTR(-EINVAL);
> +		goto out;
>  
>  	mask = (1UL << shift) - 1;
> -	page = pte_page(*ptep);
> -	if (page)
> -		page += (address & mask) / PAGE_SIZE;
> -
> +	page = pte_page(*ptep) + ((address & mask) >> PAGE_SHIFT);
> +out:
> +	spin_unlock(ptl);
>  	return page;
>  }
>  
>  struct page *
> -follow_huge_pmd(struct mm_struct *mm, unsigned long address,
> -		pmd_t *pmd, int write)
> +follow_huge_pmd(struct vm_area_struct *vma, unsigned long address,
> +		pmd_t *pmd, int flags)
>  {
>  	BUG();
>  	return NULL;
>  }
>  
>  struct page *
> -follow_huge_pud(struct mm_struct *mm, unsigned long address,
> -		pmd_t *pmd, int write)
> +follow_huge_pud(struct vm_area_struct *vma, unsigned long address,
> +		pud_t *pud, int flags)
>  {
>  	BUG();
>  	return NULL;
> diff --git mmotm-2014-08-25-16-52.orig/include/linux/hugetlb.h mmotm-2014-08-25-16-52/include/linux/hugetlb.h
> index 6e6d338641fe..b3200fce07aa 100644
> --- mmotm-2014-08-25-16-52.orig/include/linux/hugetlb.h
> +++ mmotm-2014-08-25-16-52/include/linux/hugetlb.h
> @@ -98,10 +98,10 @@ pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr);
>  int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr, pte_t *ptep);
>  struct page *follow_huge_addr(struct mm_struct *mm, unsigned long address,
>  			      int write);
> -struct page *follow_huge_pmd(struct mm_struct *mm, unsigned long address,
> -				pmd_t *pmd, int write);
> -struct page *follow_huge_pud(struct mm_struct *mm, unsigned long address,
> -				pud_t *pud, int write);
> +struct page *follow_huge_pmd(struct vm_area_struct *vma, unsigned long address,
> +				pmd_t *pmd, int flags);
> +struct page *follow_huge_pud(struct vm_area_struct *vma, unsigned long address,
> +				pud_t *pud, int flags);
>  int pmd_huge(pmd_t pmd);
>  int pud_huge(pud_t pmd);
>  unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
> @@ -133,8 +133,8 @@ static inline void hugetlb_report_meminfo(struct seq_file *m)
>  static inline void hugetlb_show_meminfo(void)
>  {
>  }
> -#define follow_huge_pmd(mm, addr, pmd, write)	NULL
> -#define follow_huge_pud(mm, addr, pud, write)	NULL
> +#define follow_huge_pmd(vma, addr, pmd, flags)	NULL
> +#define follow_huge_pud(vma, addr, pud, flags)	NULL
>  #define prepare_hugepage_range(file, addr, len)	(-EINVAL)
>  #define pmd_huge(x)	0
>  #define pud_huge(x)	0
> diff --git mmotm-2014-08-25-16-52.orig/mm/gup.c mmotm-2014-08-25-16-52/mm/gup.c
> index 91d044b1600d..597a5e92e265 100644
> --- mmotm-2014-08-25-16-52.orig/mm/gup.c
> +++ mmotm-2014-08-25-16-52/mm/gup.c
> @@ -162,33 +162,16 @@ struct page *follow_page_mask(struct vm_area_struct *vma,
>  	pud = pud_offset(pgd, address);
>  	if (pud_none(*pud))
>  		return no_page_table(vma, flags);
> -	if (pud_huge(*pud) && vma->vm_flags & VM_HUGETLB) {
> -		if (flags & FOLL_GET)
> -			return NULL;
> -		page = follow_huge_pud(mm, address, pud, flags & FOLL_WRITE);
> -		return page;
> -	}
> +	if (pud_huge(*pud) && vma->vm_flags & VM_HUGETLB)
> +		return follow_huge_pud(vma, address, pud, flags);

Yes, this part is good, except I think mm rather than vma.

>  	if (unlikely(pud_bad(*pud)))
>  		return no_page_table(vma, flags);
>  
>  	pmd = pmd_offset(pud, address);
>  	if (pmd_none(*pmd))
>  		return no_page_table(vma, flags);
> -	if (pmd_huge(*pmd) && vma->vm_flags & VM_HUGETLB) {
> -		page = follow_huge_pmd(mm, address, pmd, flags & FOLL_WRITE);
> -		if (flags & FOLL_GET) {
> -			/*
> -			 * Refcount on tail pages are not well-defined and
> -			 * shouldn't be taken. The caller should handle a NULL
> -			 * return when trying to follow tail pages.
> -			 */
> -			if (PageHead(page))
> -				get_page(page);
> -			else
> -				page = NULL;
> -		}
> -		return page;
> -	}
> +	if (pmd_huge(*pmd) && vma->vm_flags & VM_HUGETLB)
> +		return follow_huge_pmd(vma, address, pmd, flags);

And this part is good, except I think mm rather than vma.

>  	if ((flags & FOLL_NUMA) && pmd_numa(*pmd))
>  		return no_page_table(vma, flags);
>  	if (pmd_trans_huge(*pmd)) {
> diff --git mmotm-2014-08-25-16-52.orig/mm/hugetlb.c mmotm-2014-08-25-16-52/mm/hugetlb.c
> index 022767506c7b..c5345c5edb50 100644
> --- mmotm-2014-08-25-16-52.orig/mm/hugetlb.c
> +++ mmotm-2014-08-25-16-52/mm/hugetlb.c
> @@ -3667,26 +3667,45 @@ follow_huge_addr(struct mm_struct *mm, unsigned long address,
>  }
>  
>  struct page * __weak
> -follow_huge_pmd(struct mm_struct *mm, unsigned long address,
> -		pmd_t *pmd, int write)
> +follow_huge_pmd(struct vm_area_struct *vma, unsigned long address,
> +		pmd_t *pmd, int flags)
>  {
> -	struct page *page;
> +	struct page *page = NULL;
> +	spinlock_t *ptl;
>  
> -	page = pte_page(*(pte_t *)pmd);
> -	if (page)
> -		page += ((address & ~PMD_MASK) >> PAGE_SHIFT);
> +	ptl = huge_pte_lock(hstate_vma(vma), vma->vm_mm, (pte_t *)pmd);

So, this is why you have had to change from "mm" to "vma" throughout.
And we might end up deciding that that is the right thing to do.

But here we are deep in page table code, dealing with a huge pmd entry:
I protest that it's very lame to be asking vma->vm_file to tell us what
lock the page table code needs at this level.  Isn't it pmd_lockptr()?

Now, I'm easily confused, and there may be reasons why it's more subtle
than that, and you really are forced to use huge_pte_lockptr(); but I'd
much rather not if we can avoid doing so, just as a matter of principle.

One subtlety to take care over: it's a long time since I've had to
worry about pmd folding and pud folding (what happens when you only
have 2 or 3 levels of page table instead of the full 4): macros get
defined to each other, and levels get optimized out (perhaps
differently on different architectures).

So although at first sight the lock to take in follow_huge_pud()
would seem to be mm->page_table_lock, I am not at this point certain
that that's necessarily so - sometimes pud_huge might be pmd_huge,
and the size PMD_SIZE, and pmd_lockptr appropriate at what appears
to be the pud level.  Maybe: needs checking through the architectures
and their configs, not obvious to me.

I realize that I am asking for you (or I) to do more work, when using
huge_pte_lock(hstate_vma(vma),,) would work it out "automatically";
but I do feel quite strongly that that's the right approach here
(and I'm not just trying to avoid a few edits of "mm" to "vma").

Cc'ing Kirill, who may have a strong view to the contrary,
or a good insight on where the problems if any might be.

Also Cc'ing Kirill because I'm not convinced that huge_pte_lockptr()
necessarily does the right thing on follow_huge_addr() architectures,
ia64 and powerpc.  Do they, for example, allocate the memory for their
hugetlb entries in such a way that we can indeed use pmd_lockptr() to
point to a useable spinlock, in the case when huge_page_size(h) just
happens to equal PMD_SIZE?

I don't know if this was thought through thoroughly
(now that's a satisfying phrase hugh thinks hugh never wrote before!)
when huge_pte_lockptr() was invented or not.  I think it would be safer
if huge_pte_lockptr() just gave mm->page_table_lock on follow_huge_addr()
architectures.
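
A rough sketch of that mm-based shape (illustrative only: it assumes
pmd_lockptr() really is the right lock here, which is exactly the open
question above, and that a plain get_page() suffices under it):

	struct page * __weak
	follow_huge_pmd(struct mm_struct *mm, unsigned long address,
			pmd_t *pmd, int flags)
	{
		struct page *page = NULL;
		spinlock_t *ptl;

		/* lock derived from mm + pmd, no reaching back to the vma */
		ptl = pmd_lockptr(mm, pmd);
		spin_lock(ptl);
		/* recheck under the lock: entry may have been zapped/migrated */
		if (!pmd_huge(*pmd))
			goto out;
		page = pte_page(*(pte_t *)pmd) +
			((address & ~PMD_MASK) >> PAGE_SHIFT);
		if (flags & FOLL_GET)
			get_page(page);
	out:
		spin_unlock(ptl);
		return page;
	}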

> +
> +	if (!pmd_huge(*pmd))
> +		goto out;
> +
> +	page = pte_page(*(pte_t *)pmd) + ((address & ~PMD_MASK) >> PAGE_SHIFT);
> +
> +	if (flags & FOLL_GET)
> +		if (!get_page_unless_zero(page))
> +			page = NULL;

get_page() should be quite good enough, shouldn't it?  We are holding
the necessary lock, and have tested pmd_huge(*pmd), so it would be a
bug if page_count(page) were zero here.

> +out:
> +	spin_unlock(ptl);
>  	return page;
>  }
>  
>  struct page * __weak
> -follow_huge_pud(struct mm_struct *mm, unsigned long address,
> -		pud_t *pud, int write)
> +follow_huge_pud(struct vm_area_struct *vma, unsigned long address,
> +		pud_t *pud, int flags)
>  {
> -	struct page *page;
> +	struct page *page = NULL;
> +	spinlock_t *ptl;
>  
> -	page = pte_page(*(pte_t *)pud);
> -	if (page)
> -		page += ((address & ~PUD_MASK) >> PAGE_SHIFT);
> +	if (flags & FOLL_GET)
> +		return NULL;
> +
> +	ptl = huge_pte_lock(hstate_vma(vma), vma->vm_mm, (pte_t *)pud);

Well, you do have vma declared here, but otherwise it's like what you
had in follow_huge_addr(): there is no point in taking and dropping
the lock if you're not getting the page while the lock is held.

So, which way to go on follow_huge_pud()?  I certainly think that we
should implement FOLL_GET on it, as we should for follow_huge_addr(),
simply for completeness, and so we don't need to come back here.

But whether we should do so in a patch which is Cc'ed to stable is not
so clear.  And leaving follow_huge_pud() and follow_huge_addr() out
of this patch may avoid those awkward where-is-the-lock questions
for now.  Convert follow_huge_pud() in a separate patch?

> +
> +	if (!pud_huge(*pud))
> +		goto out;
> +
> +	page = pte_page(*(pte_t *)pud) + ((address & ~PUD_MASK) >> PAGE_SHIFT);
> +out:
> +	spin_unlock(ptl);
>  	return page;
>  }
>  
> -- 
> 1.9.3

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v3 6/6] mm/hugetlb: remove unused argument of follow_huge_addr()
  2014-08-29  1:39   ` Naoya Horiguchi
@ 2014-09-03 21:26     ` Hugh Dickins
  -1 siblings, 0 replies; 48+ messages in thread
From: Hugh Dickins @ 2014-09-03 21:26 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: Andrew Morton, Hugh Dickins, David Rientjes, linux-mm,
	linux-kernel, Naoya Horiguchi

On Thu, 28 Aug 2014, Naoya Horiguchi wrote:

> follow_huge_addr()'s parameter write is not used, so let's remove it.
> 
> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

I think this patch is a waste of time: it should be replaced
by a patch which replaces the "write" argument with a "flags" argument,
so that follow_huge_addr() can do get_page() for FOLL_GET while holding
the appropriate lock, instead of the BUG_ON(flags & FOLL_GET) we
currently have.
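
For ia64, say, a sketch of that flags-based version might look like the
following (illustrative only: which lock is appropriate there is part of
the 2/6 discussion, so mm->page_table_lock is just an assumption here):

	struct page *follow_huge_addr(struct mm_struct *mm, unsigned long addr,
				      int flags)
	{
		struct page *page = NULL;
		pte_t *ptep;

		if (REGION_NUMBER(addr) != RGN_HPAGE)
			return ERR_PTR(-EINVAL);

		ptep = huge_pte_offset(mm, addr);
		if (!ptep)
			return NULL;

		spin_lock(&mm->page_table_lock);	/* assumed lock, see 2/6 */
		if (pte_none(*ptep))
			goto out;
		page = pte_page(*ptep) + ((addr & ~HPAGE_MASK) >> PAGE_SHIFT);
		/* taking the ref under the lock lets the caller's BUG_ON go */
		if (flags & FOLL_GET)
			get_page(page);
	out:
		spin_unlock(&mm->page_table_lock);
		return page;
	}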

Once that is implemented, you could try getting hugetlb migration
tested on ia64 and powerpc; but yes, keep hugetlb migration disabled
on all but x86 until it has been tested elsewhere.

> ---
>  arch/ia64/mm/hugetlbpage.c    | 2 +-
>  arch/powerpc/mm/hugetlbpage.c | 2 +-
>  arch/x86/mm/hugetlbpage.c     | 2 +-
>  include/linux/hugetlb.h       | 5 ++---
>  mm/gup.c                      | 2 +-
>  mm/hugetlb.c                  | 3 +--
>  6 files changed, 7 insertions(+), 9 deletions(-)
> 
> diff --git mmotm-2014-08-25-16-52.orig/arch/ia64/mm/hugetlbpage.c mmotm-2014-08-25-16-52/arch/ia64/mm/hugetlbpage.c
> index 6170381bf074..524a4e001bda 100644
> --- mmotm-2014-08-25-16-52.orig/arch/ia64/mm/hugetlbpage.c
> +++ mmotm-2014-08-25-16-52/arch/ia64/mm/hugetlbpage.c
> @@ -89,7 +89,7 @@ int prepare_hugepage_range(struct file *file,
>  	return 0;
>  }
>  
> -struct page *follow_huge_addr(struct mm_struct *mm, unsigned long addr, int write)
> +struct page *follow_huge_addr(struct mm_struct *mm, unsigned long addr)
>  {
>  	struct page *page = NULL;
>  	pte_t *ptep;
> diff --git mmotm-2014-08-25-16-52.orig/arch/powerpc/mm/hugetlbpage.c mmotm-2014-08-25-16-52/arch/powerpc/mm/hugetlbpage.c
> index 1d8854a56309..5b6fe8b0cde3 100644
> --- mmotm-2014-08-25-16-52.orig/arch/powerpc/mm/hugetlbpage.c
> +++ mmotm-2014-08-25-16-52/arch/powerpc/mm/hugetlbpage.c
> @@ -674,7 +674,7 @@ void hugetlb_free_pgd_range(struct mmu_gather *tlb,
>  }
>  
>  struct page *
> -follow_huge_addr(struct mm_struct *mm, unsigned long address, int write)
> +follow_huge_addr(struct mm_struct *mm, unsigned long address)
>  {
>  	pte_t *ptep;
>  	struct page *page = ERR_PTR(-EINVAL);
> diff --git mmotm-2014-08-25-16-52.orig/arch/x86/mm/hugetlbpage.c mmotm-2014-08-25-16-52/arch/x86/mm/hugetlbpage.c
> index 03b8a7c11817..cab09d87ae65 100644
> --- mmotm-2014-08-25-16-52.orig/arch/x86/mm/hugetlbpage.c
> +++ mmotm-2014-08-25-16-52/arch/x86/mm/hugetlbpage.c
> @@ -18,7 +18,7 @@
>  
>  #if 0	/* This is just for testing */
>  struct page *
> -follow_huge_addr(struct mm_struct *mm, unsigned long address, int write)
> +follow_huge_addr(struct mm_struct *mm, unsigned long address)
>  {
>  	unsigned long start = address;
>  	int length = 1;
> diff --git mmotm-2014-08-25-16-52.orig/include/linux/hugetlb.h mmotm-2014-08-25-16-52/include/linux/hugetlb.h
> index b3200fce07aa..cdff1bd393bb 100644
> --- mmotm-2014-08-25-16-52.orig/include/linux/hugetlb.h
> +++ mmotm-2014-08-25-16-52/include/linux/hugetlb.h
> @@ -96,8 +96,7 @@ pte_t *huge_pte_alloc(struct mm_struct *mm,
>  			unsigned long addr, unsigned long sz);
>  pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr);
>  int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr, pte_t *ptep);
> -struct page *follow_huge_addr(struct mm_struct *mm, unsigned long address,
> -			      int write);
> +struct page *follow_huge_addr(struct mm_struct *mm, unsigned long address);
>  struct page *follow_huge_pmd(struct vm_area_struct *vma, unsigned long address,
>  				pmd_t *pmd, int flags);
>  struct page *follow_huge_pud(struct vm_area_struct *vma, unsigned long address,
> @@ -124,7 +123,7 @@ static inline unsigned long hugetlb_total_pages(void)
>  }
>  
>  #define follow_hugetlb_page(m,v,p,vs,a,b,i,w)	({ BUG(); 0; })
> -#define follow_huge_addr(mm, addr, write)	ERR_PTR(-EINVAL)
> +#define follow_huge_addr(mm, addr)	ERR_PTR(-EINVAL)
>  #define copy_hugetlb_page_range(src, dst, vma)	({ BUG(); 0; })
>  static inline void hugetlb_report_meminfo(struct seq_file *m)
>  {
> diff --git mmotm-2014-08-25-16-52.orig/mm/gup.c mmotm-2014-08-25-16-52/mm/gup.c
> index 597a5e92e265..8f0550f1770d 100644
> --- mmotm-2014-08-25-16-52.orig/mm/gup.c
> +++ mmotm-2014-08-25-16-52/mm/gup.c
> @@ -149,7 +149,7 @@ struct page *follow_page_mask(struct vm_area_struct *vma,
>  
>  	*page_mask = 0;
>  
> -	page = follow_huge_addr(mm, address, flags & FOLL_WRITE);
> +	page = follow_huge_addr(mm, address);
>  	if (!IS_ERR(page)) {
>  		BUG_ON(flags & FOLL_GET);
>  		return page;
> diff --git mmotm-2014-08-25-16-52.orig/mm/hugetlb.c mmotm-2014-08-25-16-52/mm/hugetlb.c
> index 0a4511115ee0..f7dcad3474ec 100644
> --- mmotm-2014-08-25-16-52.orig/mm/hugetlb.c
> +++ mmotm-2014-08-25-16-52/mm/hugetlb.c
> @@ -3690,8 +3690,7 @@ pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr)
>   * behavior.
>   */
>  struct page * __weak
> -follow_huge_addr(struct mm_struct *mm, unsigned long address,
> -			      int write)
> +follow_huge_addr(struct mm_struct *mm, unsigned long address)
>  {
>  	return ERR_PTR(-EINVAL);
>  }
> -- 
> 1.9.3

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v3 3/6] mm/hugetlb: fix getting refcount 0 page in hugetlb_fault()
  2014-08-29  1:38   ` Naoya Horiguchi
@ 2014-09-04  0:20     ` Hugh Dickins
  -1 siblings, 0 replies; 48+ messages in thread
From: Hugh Dickins @ 2014-09-04  0:20 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: Andrew Morton, Hugh Dickins, David Rientjes, linux-mm,
	linux-kernel, Naoya Horiguchi

On Thu, 28 Aug 2014, Naoya Horiguchi wrote:

> When running the test which causes the race as shown in the previous patch,
> we can hit the BUG "get_page() on refcount 0 page" in hugetlb_fault().
> 
> This race happens when pte turns into migration entry just after the first
> check of is_hugetlb_entry_migration() in hugetlb_fault() passed with false.
> To fix this, we need to check pte_present() again with holding ptl.
> 
> This patch also reorders taking ptl and doing pte_page(), because pte_page()
> should be done in ptl. Due to this reordering, we need use trylock_page()
> in page != pagecache_page case to respect locking order.
> 
> ChangeLog v3:
> - doing pte_page() and taking refcount under page table lock
> - check pte_present after taking ptl, which makes it unnecessary to use
>   get_page_unless_zero()
> - use trylock_page in page != pagecache_page case
> - fixed target stable version

ChangeLog vN below the --- (or am I contradicting some other advice?)

> 
> Fixes: 66aebce747ea ("hugetlb: fix race condition in hugetlb_fault()")
> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Cc: <stable@vger.kernel.org>  # [3.2+]

One bug, one warning, a couple of suboptimals...

> ---
>  mm/hugetlb.c | 32 ++++++++++++++++++--------------
>  1 file changed, 18 insertions(+), 14 deletions(-)
> 
> diff --git mmotm-2014-08-25-16-52.orig/mm/hugetlb.c mmotm-2014-08-25-16-52/mm/hugetlb.c
> index c5345c5edb50..2aafe073cb06 100644
> --- mmotm-2014-08-25-16-52.orig/mm/hugetlb.c
> +++ mmotm-2014-08-25-16-52/mm/hugetlb.c
> @@ -3184,6 +3184,15 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>  								vma, address);
>  	}
>  
> +	ptl = huge_pte_lock(h, mm, ptep);
> +
> +	/* Check for a racing update before calling hugetlb_cow */
> +	if (unlikely(!pte_same(entry, huge_ptep_get(ptep))))
> +		goto out_ptl;
> +
> +	if (!pte_present(entry))
> +		goto out_ptl;

A comment on that test would be helpful.  Is a migration entry
the only !pte_present() case you would expect to find there?

It would be better to test "entry" for this (or for being a migration
entry) higher up, just after getting "entry": less to unwind on error.

And better to call migration_entry_wait_huge(), after dropping locks,
before returning 0, so that we don't keep the cpu busy faulting while
the migration entry remains there.  Maybe not important, but better.

Probably best done with a goto to unwinding code at the end of the function.

(Whereas we don't worry about "wait"s in the !pte_same case,
because !pte_same indicates that change is already occurring:
it's prolonged pte_same cases that we want to get away from.)
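
A sketch of that structure (illustrative only: the label name is made
up, hugetlb_fault()'s existing locals are reused, and hwpoisoned
entries would need their own branch rather than a bare return 0):

	entry = huge_ptep_get(ptep);	/* re-read under the fault mutex */
	if (unlikely(!pte_present(entry))) {
		/*
		 * The entry became a migration (or hwpoison) entry after
		 * the earlier check; nothing is locked or pinned yet, so
		 * there is little to unwind, and the wait happens only
		 * once the mutex has been dropped.
		 */
		ret = 0;
		goto out_migration_wait;
	}
	...
out_migration_wait:
	mutex_unlock(&htlb_fault_mutex_table[hash]);
	if (is_hugetlb_entry_migration(entry))
		migration_entry_wait_huge(vma, mm, ptep);
	return ret;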

> +
>  	/*
>  	 * hugetlb_cow() requires page locks of pte_page(entry) and
>  	 * pagecache_page, so here we need take the former one
> @@ -3192,22 +3201,17 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>  	 * so no worry about deadlock.
>  	 */
>  	page = pte_page(entry);
> -	get_page(page);
>  	if (page != pagecache_page)
> -		lock_page(page);
> -
> -	ptl = huge_pte_lockptr(h, mm, ptep);
> -	spin_lock(ptl);
> -	/* Check for a racing update before calling hugetlb_cow */
> -	if (unlikely(!pte_same(entry, huge_ptep_get(ptep))))
> -		goto out_ptl;
> +		if (!trylock_page(page))
> +			goto out_ptl;

And, again to avoid keeping the cpu busy refaulting, it would be better
to wait_on_page_locked(), after dropping locks, before returning 0;
probably best done with another goto to the end of the function.
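
Roughly like this, say (the need_wait_lock flag is a made-up local):

	if (page != pagecache_page)
		if (!trylock_page(page)) {
			need_wait_lock = 1;
			goto out_ptl;
		}
	...
out_mutex:
	mutex_unlock(&htlb_fault_mutex_table[hash]);
	/*
	 * Only once every lock is dropped: sleep until the page lock we
	 * failed to trylock is released, so the retried fault does not
	 * spin on the cpu.
	 */
	if (need_wait_lock)
		wait_on_page_locked(page);
	return ret;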

>  
> +	get_page(page);
>  
>  	if (flags & FAULT_FLAG_WRITE) {
>  		if (!huge_pte_write(entry)) {
>  			ret = hugetlb_cow(mm, vma, address, ptep, entry,
>  					pagecache_page, ptl);
> -			goto out_ptl;
> +			goto out_put_page;
>  		}
>  		entry = huge_pte_mkdirty(entry);
>  	}
> @@ -3215,7 +3219,11 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>  	if (huge_ptep_set_access_flags(vma, address, ptep, entry,
>  						flags & FAULT_FLAG_WRITE))
>  		update_mmu_cache(vma, address, ptep);
> -
> +out_put_page:
> +	put_page(page);

If I'm reading this correctly, there's now a small but nasty chance that
this put_page will be the one which frees the page, and the unlock_page
below will then be unlocking a freed page.  Our "Bad page" checks should
detect that case, so it won't be as serious as unlocking someone else's
page; but you still should avoid that possibility.
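
That is, the unwind wants to unlock before dropping the reference,
roughly as below (which also makes the extra out_unlock_page: label
noted just below unnecessary):

out_put_page:
	if (page != pagecache_page)
		unlock_page(page);
	put_page(page);		/* drop our ref only after unlocking */
out_ptl:
	spin_unlock(ptl);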

> +out_unlock_page:

mm/hugetlb.c:3231:1: warning: label `out_unlock_page' defined but not used [-Wunused-label]

> +	if (page != pagecache_page)
> +		unlock_page(page);
>  out_ptl:
>  	spin_unlock(ptl);
>  
> @@ -3223,10 +3231,6 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>  		unlock_page(pagecache_page);
>  		put_page(pagecache_page);
>  	}
> -	if (page != pagecache_page)
> -		unlock_page(page);
> -	put_page(page);
> -
>  out_mutex:
>  	mutex_unlock(&htlb_fault_mutex_table[hash]);
>  	return ret;
> -- 
> 1.9.3

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v3 4/6] mm/hugetlb: add migration entry check in hugetlb_change_protection
  2014-08-29  1:38   ` Naoya Horiguchi
@ 2014-09-04  1:06     ` Hugh Dickins
  -1 siblings, 0 replies; 48+ messages in thread
From: Hugh Dickins @ 2014-09-04  1:06 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: Andrew Morton, Hugh Dickins, David Rientjes, linux-mm,
	linux-kernel, Naoya Horiguchi

On Thu, 28 Aug 2014, Naoya Horiguchi wrote:

> There is a race condition between hugepage migration and change_protection(),
> where hugetlb_change_protection() doesn't care about migration entries and
> wrongly overwrites them. That causes unexpected results like kernel crash.
> 
> This patch adds is_hugetlb_entry_(migration|hwpoisoned) check in this
> function to do proper actions.
> 
> ChangeLog v3:
> - handle migration entry correctly (instead of just skipping)
> 
> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Cc: <stable@vger.kernel.org> # [2.6.36+]

2.6.36+?  For the hwpoisoned part of it, I suppose.
Then you'd better mention the hwpoisoned case in the comment above.

> ---
>  mm/hugetlb.c | 21 ++++++++++++++++++++-
>  1 file changed, 20 insertions(+), 1 deletion(-)
> 
> diff --git mmotm-2014-08-25-16-52.orig/mm/hugetlb.c mmotm-2014-08-25-16-52/mm/hugetlb.c
> index 2aafe073cb06..1ed9df6def54 100644
> --- mmotm-2014-08-25-16-52.orig/mm/hugetlb.c
> +++ mmotm-2014-08-25-16-52/mm/hugetlb.c
> @@ -3362,7 +3362,26 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
>  			spin_unlock(ptl);
>  			continue;
>  		}
> -		if (!huge_pte_none(huge_ptep_get(ptep))) {
> +		pte = huge_ptep_get(ptep);
> +		if (unlikely(is_hugetlb_entry_hwpoisoned(pte))) {
> +			spin_unlock(ptl);
> +			continue;
> +		}
> +		if (unlikely(is_hugetlb_entry_migration(pte))) {
> +			swp_entry_t entry = pte_to_swp_entry(pte);
> +
> +			if (is_write_migration_entry(entry)) {
> +				pte_t newpte;
> +
> +				make_migration_entry_read(&entry);
> +				newpte = swp_entry_to_pte(entry);
> +				set_pte_at(mm, address, ptep, newpte);

set_huge_pte_at.
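
i.e. the migration-entry branch would then be (sketch):

			if (is_write_migration_entry(entry)) {
				pte_t newpte;

				make_migration_entry_read(&entry);
				newpte = swp_entry_to_pte(entry);
				set_huge_pte_at(mm, address, ptep, newpte);
				pages++;
			}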

(As usual, I can't bear to see these is_hugetlb_entry_hwpoisoned and
is_hugetlb_entry_migration examples go past without bleating about
wanting to streamline them a little; but agreed last time to leave
that to some later cleanup once all the stable backports are stable.)

> +				pages++;
> +			}
> +			spin_unlock(ptl);
> +			continue;
> +		}
> +		if (!huge_pte_none(pte)) {
>  			pte = huge_ptep_get_and_clear(mm, address, ptep);
>  			pte = pte_mkhuge(huge_pte_modify(pte, newprot));
>  			pte = arch_make_huge_pte(pte, vma, NULL, 0);
> -- 
> 1.9.3

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v3 5/6] mm/hugetlb: add migration entry check in __unmap_hugepage_range
  2014-08-29  1:38   ` Naoya Horiguchi
@ 2014-09-04  1:47     ` Hugh Dickins
  -1 siblings, 0 replies; 48+ messages in thread
From: Hugh Dickins @ 2014-09-04  1:47 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: Andrew Morton, Hugh Dickins, David Rientjes, linux-mm,
	linux-kernel, Naoya Horiguchi

On Thu, 28 Aug 2014, Naoya Horiguchi wrote:

> If __unmap_hugepage_range() tries to unmap the address range over which
> hugepage migration is on the way, we get the wrong page because pte_page()
> doesn't work for migration entries. This patch calls pte_to_swp_entry() and
> migration_entry_to_page() to get the right page for migration entries.
> 
> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Cc: <stable@vger.kernel.org>  # [2.6.36+]

2.6.36+?  But this one doesn't affect hwpoisoned.
I admit I've lost track of how far back hugetlb migration goes:
oh, to 2.6.37+, that fits with what you marked on some commits earlier.
But then 2/6 says 3.12+.  Help!  Please remind me of the sequence of events.

> ---
>  mm/hugetlb.c | 9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git mmotm-2014-08-25-16-52.orig/mm/hugetlb.c mmotm-2014-08-25-16-52/mm/hugetlb.c
> index 1ed9df6def54..0a4511115ee0 100644
> --- mmotm-2014-08-25-16-52.orig/mm/hugetlb.c
> +++ mmotm-2014-08-25-16-52/mm/hugetlb.c
> @@ -2652,6 +2652,13 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
>  		if (huge_pte_none(pte))
>  			goto unlock;
>  
> +		if (unlikely(is_hugetlb_entry_migration(pte))) {
> +			swp_entry_t entry = pte_to_swp_entry(pte);
> +
> +			page = migration_entry_to_page(entry);
> +			goto clear;
> +		}
> +

This surprises me: are you sure?  Obviously you know hugetlb migration
much better than I do: is it done in a significantly different way from
order:0 page migration?  In the order:0 case, there is no reference to
the page corresponding to the migration entry placed in a page table,
just the remaining reference held by the task doing the migration.  But
here you are jumping to the code which unmaps and frees a present page.

I can see that a fix is necessary, but I would have expected it to
consist of merely changing the "HWPoisoned" comment below to include
migration entries, and changing its test from
		if (unlikely(is_hugetlb_entry_hwpoisoned(pte))) {
to
		if (unlikely(!pte_present(pte))) {
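
so that both migration and hwpoison entries take the same clear-and-skip
path; roughly (a sketch, assuming the existing hwpoison branch already does
huge_pte_clear() and goto unlock):

		/*
		 * Migrating hugepage and hwpoisoned hugepage are dealt
		 * with elsewhere as far as this mapping is concerned,
		 * so just clear the entry here and move on.
		 */
		if (unlikely(!pte_present(pte))) {
			huge_pte_clear(mm, address, ptep);
			goto unlock;
		}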

>  		/*
>  		 * HWPoisoned hugepage is already unmapped and dropped reference
>  		 */
> @@ -2677,7 +2684,7 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
>  			 */
>  			set_vma_resv_flags(vma, HPAGE_RESV_UNMAPPED);
>  		}
> -
> +clear:
>  		pte = huge_ptep_get_and_clear(mm, address, ptep);
>  		tlb_remove_tlb_entry(tlb, ptep, address);
>  		if (huge_pte_dirty(pte))
> -- 
> 1.9.3

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v3 2/6] mm/hugetlb: take page table lock in follow_huge_(addr|pmd|pud)()
  2014-09-03 21:17     ` Hugh Dickins
@ 2014-09-05  5:27       ` Naoya Horiguchi
  -1 siblings, 0 replies; 48+ messages in thread
From: Naoya Horiguchi @ 2014-09-05  5:27 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Kirill A. Shutemov, Andrew Morton, David Rientjes, linux-mm,
	linux-kernel, Naoya Horiguchi

Hi Hugh,

Thank you very much for your close look and valuable comments.
I can't help feeling ashamed of the many mistakes/misunderstandings
and the lack of thought throughout the patchset.
I promise that all of these will be fixed in the next version.

On Wed, Sep 03, 2014 at 02:17:41PM -0700, Hugh Dickins wrote:
> On Thu, 28 Aug 2014, Naoya Horiguchi wrote:
> 
> > We have a race condition between move_pages() and freeing hugepages,
> > where move_pages() calls follow_page(FOLL_GET) for hugepages internally
> > and tries to get its refcount without preventing concurrent freeing.
> > This race crashes the kernel, so this patch fixes it by moving FOLL_GET
> > code for hugepages into follow_huge_pmd() with taking the page table lock.
> 
> You really ought to mention how you are intentionally dropping the
> unnecessary check for NULL pte_page() in this patch: we agree on that,
> but it does need to be mentioned somewhere in the comment.

OK, I'll add it.

> > 
> > This patch also adds the similar locking to follow_huge_(addr|pud)
> > for consistency.
> > 
> > Here is the reproducer:
> > 
> >   $ cat movepages.c
> >   #include <stdio.h>
> >   #include <stdlib.h>
> >   #include <numaif.h>
> > 
> >   #define ADDR_INPUT      0x700000000000UL
> >   #define HPS             0x200000
> >   #define PS              0x1000
> > 
> >   int main(int argc, char *argv[]) {
> >           int i;
> >           int nr_hp = strtol(argv[1], NULL, 0);
> >           int nr_p  = nr_hp * HPS / PS;
> >           int ret;
> >           void **addrs;
> >           int *status;
> >           int *nodes;
> >           pid_t pid;
> > 
> >           pid = strtol(argv[2], NULL, 0);
> >           addrs  = malloc(sizeof(char *) * nr_p + 1);
> >           status = malloc(sizeof(char *) * nr_p + 1);
> >           nodes  = malloc(sizeof(char *) * nr_p + 1);
> > 
> >           while (1) {
> >                   for (i = 0; i < nr_p; i++) {
> >                           addrs[i] = (void *)ADDR_INPUT + i * PS;
> >                           nodes[i] = 1;
> >                           status[i] = 0;
> >                   }
> >                   ret = numa_move_pages(pid, nr_p, addrs, nodes, status,
> >                                         MPOL_MF_MOVE_ALL);
> >                   if (ret == -1)
> >                           err("move_pages");
> > 
> >                   for (i = 0; i < nr_p; i++) {
> >                           addrs[i] = (void *)ADDR_INPUT + i * PS;
> >                           nodes[i] = 0;
> >                           status[i] = 0;
> >                   }
> >                   ret = numa_move_pages(pid, nr_p, addrs, nodes, status,
> >                                         MPOL_MF_MOVE_ALL);
> >                   if (ret == -1)
> >                           err("move_pages");
> >           }
> >           return 0;
> >   }
> > 
> >   $ cat hugepage.c
> >   #include <stdio.h>
> >   #include <sys/mman.h>
> >   #include <string.h>
> > 
> >   #define ADDR_INPUT      0x700000000000UL
> >   #define HPS             0x200000
> > 
> >   int main(int argc, char *argv[]) {
> >           int nr_hp = strtol(argv[1], NULL, 0);
> >           char *p;
> > 
> >           while (1) {
> >                   p = mmap((void *)ADDR_INPUT, nr_hp * HPS, PROT_READ | PROT_WRITE,
> >                            MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
> >                   if (p != (void *)ADDR_INPUT) {
> >                           perror("mmap");
> >                           break;
> >                   }
> >                   memset(p, 0, nr_hp * HPS);
> >                   munmap(p, nr_hp * HPS);
> >           }
> >   }
> > 
> >   $ sysctl vm.nr_hugepages=40
> >   $ ./hugepage 10 &
> >   $ ./movepages 10 $(pgrep -f hugepage)
> > 
> > Note for stable inclusion:
> >   This patch fixes e632a938d914 ("mm: migrate: add hugepage migration code
> >   to move_pages()"), so is applicable to -stable kernels which includes it.
> 
> Just say
> Fixes: e632a938d914 ("mm: migrate: add hugepage migration code to move_pages()")

I just found that Documentation/SubmittingPatches now documents the
Fixes: tag. I'll use it from now on.

> > 
> > ChangeLog v3:
> > - remove unnecessary if (page) check
> > - check (pmd|pud)_huge again after holding ptl
> > - do the same change also on follow_huge_pud()
> > - take page table lock also in follow_huge_addr()
> > 
> > ChangeLog v2:
> > - introduce follow_huge_pmd_lock() to do locking in arch-independent code.
> 
> ChangeLog vN info belongs below the ---

OK.
I didn't know this but it's written in SubmittingPatches, so I'll keep it
in mind.

> > 
> > Reported-by: Hugh Dickins <hughd@google.com>
> > Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> > Cc: <stable@vger.kernel.org>  # [3.12+]
> 
> No ack to this one yet, I'm afraid.

OK, I'll defer the Reported-by until all the problems in this patch are solved.
I added this Reported-by because Andrew asked how I found this problem,
and advised me to credit the reporter.
I didn't intend this Reported-by to mean that you acked the patch.
In this case, should I have used some unofficial tag like
"Not-yet-Reported-by:" to avoid being rude?

> > ---
> >  arch/ia64/mm/hugetlbpage.c    |  9 +++++++--
> >  arch/metag/mm/hugetlbpage.c   |  4 ++--
> >  arch/powerpc/mm/hugetlbpage.c | 22 +++++++++++-----------
> >  include/linux/hugetlb.h       | 12 ++++++------
> >  mm/gup.c                      | 25 ++++---------------------
> >  mm/hugetlb.c                  | 43 +++++++++++++++++++++++++++++++------------
> >  6 files changed, 61 insertions(+), 54 deletions(-)
> > 
> > diff --git mmotm-2014-08-25-16-52.orig/arch/ia64/mm/hugetlbpage.c mmotm-2014-08-25-16-52/arch/ia64/mm/hugetlbpage.c
> > index 52b7604b5215..6170381bf074 100644
> > --- mmotm-2014-08-25-16-52.orig/arch/ia64/mm/hugetlbpage.c
> > +++ mmotm-2014-08-25-16-52/arch/ia64/mm/hugetlbpage.c
> > @@ -91,17 +91,22 @@ int prepare_hugepage_range(struct file *file,
> >  
> >  struct page *follow_huge_addr(struct mm_struct *mm, unsigned long addr, int write)
> >  {
> > -	struct page *page;
> > +	struct page *page = NULL;
> >  	pte_t *ptep;
> > +	spinlock_t *ptl;
> >  
> >  	if (REGION_NUMBER(addr) != RGN_HPAGE)
> >  		return ERR_PTR(-EINVAL);
> >  
> >  	ptep = huge_pte_offset(mm, addr);
> > +	ptl = huge_pte_lock(hstate_vma(vma), vma->vm_mm, ptep);
> 
> It was a mistake to lump this follow_huge_addr() change in with the
> rest: please defer it to your 6/6 (or send 5 and leave 6th to later).
> 
> Unless I'm missing something, all you succeed in doing here is break
> the build on ia64 and powerpc, by introducing undeclared "vma" variable.
> 
> There is no point whatever in taking and dropping this lock: the
> point was to do the get_page while holding the relevant page table lock,
> but you're not doing any get_page, and you still have an "int write"
> argument instead of "int flags" to pass down the FOLL_GET flag,
> and you still have the BUG_ON(flags & FOLL_GET) in follow_page_mask().
> 
> So, please throw these follow_huge_addr() parts out this patch.

Sorry, I'll drop them all.

> >  	if (!ptep || pte_none(*ptep))
> > -		return NULL;
> > +		goto out;
> > +
> >  	page = pte_page(*ptep);
> >  	page += ((addr & ~HPAGE_MASK) >> PAGE_SHIFT);
> > +out:
> > +	spin_unlock(ptl);
> >  	return page;
> >  }
> >  int pmd_huge(pmd_t pmd)
> > diff --git mmotm-2014-08-25-16-52.orig/arch/metag/mm/hugetlbpage.c mmotm-2014-08-25-16-52/arch/metag/mm/hugetlbpage.c
> > index 745081427659..5e96ef096df9 100644
> > --- mmotm-2014-08-25-16-52.orig/arch/metag/mm/hugetlbpage.c
> > +++ mmotm-2014-08-25-16-52/arch/metag/mm/hugetlbpage.c
> > @@ -104,8 +104,8 @@ int pud_huge(pud_t pud)
> >  	return 0;
> >  }
> >  
> > -struct page *follow_huge_pmd(struct mm_struct *mm, unsigned long address,
> > -			     pmd_t *pmd, int write)
> > +struct page *follow_huge_pmd(struct vm_area_struct *vma, unsigned long address,
> > +			     pmd_t *pmd, int flags)
> 
> Change from "write" to "flags" is good, but I question below whether
> we actually need to change from mm to vma in follow_huge_pmd() and
> follow_huge_pud().

Without changing mm to vma, we would need to call find_vma() to get the
relevant vma to find the ptl, which looks more expensive than getting mm
from vma. The caller already has the vma, so I thought passing vma was better.

... but as you wrote below, there's a better way to get the ptl.
With your suggestion, there's no need to change mm.

> >  {
> >  	return NULL;
> >  }
> > diff --git mmotm-2014-08-25-16-52.orig/arch/powerpc/mm/hugetlbpage.c mmotm-2014-08-25-16-52/arch/powerpc/mm/hugetlbpage.c
> > index 9517a93a315c..1d8854a56309 100644
> > --- mmotm-2014-08-25-16-52.orig/arch/powerpc/mm/hugetlbpage.c
> > +++ mmotm-2014-08-25-16-52/arch/powerpc/mm/hugetlbpage.c
> > @@ -677,38 +677,38 @@ struct page *
> >  follow_huge_addr(struct mm_struct *mm, unsigned long address, int write)
> >  {
> >  	pte_t *ptep;
> > -	struct page *page;
> > +	struct page *page = ERR_PTR(-EINVAL);
> >  	unsigned shift;
> >  	unsigned long mask;
> > +	spinlock_t *ptl;
> >  	/*
> >  	 * Transparent hugepages are handled by generic code. We can skip them
> >  	 * here.
> >  	 */
> >  	ptep = find_linux_pte_or_hugepte(mm->pgd, address, &shift);
> > -
> > +	ptl = huge_pte_lock(hstate_vma(vma), vma->vm_mm, ptep);
> 
> As above, you're breaking the build with a lock that serves no purpose
> in the current patch.

I'll just drop it, sorry for the silly code.

> >  	/* Verify it is a huge page else bail. */
> >  	if (!ptep || !shift || pmd_trans_huge(*(pmd_t *)ptep))
> > -		return ERR_PTR(-EINVAL);
> > +		goto out;
> >  
> >  	mask = (1UL << shift) - 1;
> > -	page = pte_page(*ptep);
> > -	if (page)
> > -		page += (address & mask) / PAGE_SIZE;
> > -
> > +	page = pte_page(*ptep) + ((address & mask) >> PAGE_SHIFT);
> > +out:
> > +	spin_unlock(ptl);
> >  	return page;
> >  }
> >  
> >  struct page *
> > -follow_huge_pmd(struct mm_struct *mm, unsigned long address,
> > -		pmd_t *pmd, int write)
> > +follow_huge_pmd(struct vm_area_struct *vma, unsigned long address,
> > +		pmd_t *pmd, int flags)
> >  {
> >  	BUG();
> >  	return NULL;
> >  }
> >  
> >  struct page *
> > -follow_huge_pud(struct mm_struct *mm, unsigned long address,
> > -		pmd_t *pmd, int write)
> > +follow_huge_pud(struct vm_area_struct *vma, unsigned long address,
> > +		pud_t *pud, int flags)
> >  {
> >  	BUG();
> >  	return NULL;
> > diff --git mmotm-2014-08-25-16-52.orig/include/linux/hugetlb.h mmotm-2014-08-25-16-52/include/linux/hugetlb.h
> > index 6e6d338641fe..b3200fce07aa 100644
> > --- mmotm-2014-08-25-16-52.orig/include/linux/hugetlb.h
> > +++ mmotm-2014-08-25-16-52/include/linux/hugetlb.h
> > @@ -98,10 +98,10 @@ pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr);
> >  int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr, pte_t *ptep);
> >  struct page *follow_huge_addr(struct mm_struct *mm, unsigned long address,
> >  			      int write);
> > -struct page *follow_huge_pmd(struct mm_struct *mm, unsigned long address,
> > -				pmd_t *pmd, int write);
> > -struct page *follow_huge_pud(struct mm_struct *mm, unsigned long address,
> > -				pud_t *pud, int write);
> > +struct page *follow_huge_pmd(struct vm_area_struct *vma, unsigned long address,
> > +				pmd_t *pmd, int flags);
> > +struct page *follow_huge_pud(struct vm_area_struct *vma, unsigned long address,
> > +				pud_t *pud, int flags);
> >  int pmd_huge(pmd_t pmd);
> >  int pud_huge(pud_t pmd);
> >  unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
> > @@ -133,8 +133,8 @@ static inline void hugetlb_report_meminfo(struct seq_file *m)
> >  static inline void hugetlb_show_meminfo(void)
> >  {
> >  }
> > -#define follow_huge_pmd(mm, addr, pmd, write)	NULL
> > -#define follow_huge_pud(mm, addr, pud, write)	NULL
> > +#define follow_huge_pmd(vma, addr, pmd, flags)	NULL
> > +#define follow_huge_pud(vma, addr, pud, flags)	NULL
> >  #define prepare_hugepage_range(file, addr, len)	(-EINVAL)
> >  #define pmd_huge(x)	0
> >  #define pud_huge(x)	0
> > diff --git mmotm-2014-08-25-16-52.orig/mm/gup.c mmotm-2014-08-25-16-52/mm/gup.c
> > index 91d044b1600d..597a5e92e265 100644
> > --- mmotm-2014-08-25-16-52.orig/mm/gup.c
> > +++ mmotm-2014-08-25-16-52/mm/gup.c
> > @@ -162,33 +162,16 @@ struct page *follow_page_mask(struct vm_area_struct *vma,
> >  	pud = pud_offset(pgd, address);
> >  	if (pud_none(*pud))
> >  		return no_page_table(vma, flags);
> > -	if (pud_huge(*pud) && vma->vm_flags & VM_HUGETLB) {
> > -		if (flags & FOLL_GET)
> > -			return NULL;
> > -		page = follow_huge_pud(mm, address, pud, flags & FOLL_WRITE);
> > -		return page;
> > -	}
> > +	if (pud_huge(*pud) && vma->vm_flags & VM_HUGETLB)
> > +		return follow_huge_pud(vma, address, pud, flags);
> 
> Yes, this part is good, except I think mm rather than vma.

I'll fix it.

> >  	if (unlikely(pud_bad(*pud)))
> >  		return no_page_table(vma, flags);
> >  
> >  	pmd = pmd_offset(pud, address);
> >  	if (pmd_none(*pmd))
> >  		return no_page_table(vma, flags);
> > -	if (pmd_huge(*pmd) && vma->vm_flags & VM_HUGETLB) {
> > -		page = follow_huge_pmd(mm, address, pmd, flags & FOLL_WRITE);
> > -		if (flags & FOLL_GET) {
> > -			/*
> > -			 * Refcount on tail pages are not well-defined and
> > -			 * shouldn't be taken. The caller should handle a NULL
> > -			 * return when trying to follow tail pages.
> > -			 */
> > -			if (PageHead(page))
> > -				get_page(page);
> > -			else
> > -				page = NULL;
> > -		}
> > -		return page;
> > -	}
> > +	if (pmd_huge(*pmd) && vma->vm_flags & VM_HUGETLB)
> > +		return follow_huge_pmd(vma, address, pmd, flags);
> 
> And this part is good, except I think mm rather than vma.

I'll fix it, too.

> >  	if ((flags & FOLL_NUMA) && pmd_numa(*pmd))
> >  		return no_page_table(vma, flags);
> >  	if (pmd_trans_huge(*pmd)) {
> > diff --git mmotm-2014-08-25-16-52.orig/mm/hugetlb.c mmotm-2014-08-25-16-52/mm/hugetlb.c
> > index 022767506c7b..c5345c5edb50 100644
> > --- mmotm-2014-08-25-16-52.orig/mm/hugetlb.c
> > +++ mmotm-2014-08-25-16-52/mm/hugetlb.c
> > @@ -3667,26 +3667,45 @@ follow_huge_addr(struct mm_struct *mm, unsigned long address,
> >  }
> >  
> >  struct page * __weak
> > -follow_huge_pmd(struct mm_struct *mm, unsigned long address,
> > -		pmd_t *pmd, int write)
> > +follow_huge_pmd(struct vm_area_struct *vma, unsigned long address,
> > +		pmd_t *pmd, int flags)
> >  {
> > -	struct page *page;
> > +	struct page *page = NULL;
> > +	spinlock_t *ptl;
> >  
> > -	page = pte_page(*(pte_t *)pmd);
> > -	if (page)
> > -		page += ((address & ~PMD_MASK) >> PAGE_SHIFT);
> > +	ptl = huge_pte_lock(hstate_vma(vma), vma->vm_mm, (pte_t *)pmd);
> 
> So, this is why you have had to change from "mm" to "vma" throughout.
> And we might end up deciding that that is the right thing to do.
> 
> But here we are deep in page table code, dealing with a huge pmd entry:
> I protest that it's very lame to be asking vma->vm_file to tell us what
> lock the page table code needs at this level.  Isn't it pmd_lockptr()?

Right, inside huge_pte_lock() we call pmd_lockptr() to get the ptl when
huge_page_size(h) == PMD_SIZE. This code can assume that is the case, so
calling pmd_lockptr() directly is better/faster.
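
Just to make the direction concrete, here is a rough sketch of
follow_huge_pmd() keeping its mm argument and using pmd_lockptr() directly
(untested, and it leaves aside the !pte_present/migration handling discussed
elsewhere):

struct page * __weak
follow_huge_pmd(struct mm_struct *mm, unsigned long address,
		pmd_t *pmd, int flags)
{
	struct page *page = NULL;
	spinlock_t *ptl;

	ptl = pmd_lockptr(mm, pmd);
	spin_lock(ptl);
	/* recheck under the lock: the entry may have changed */
	if (!pmd_huge(*pmd))
		goto out;
	page = pte_page(*(pte_t *)pmd) +
		((address & ~PMD_MASK) >> PAGE_SHIFT);
	if (flags & FOLL_GET)
		get_page(page);
out:
	spin_unlock(ptl);
	return page;
}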

> Now, I'm easily confused, and there may be reasons why it's more subtle
> than that, and you really are forced to use huge_pte_lockptr(); but I'd
> much rather not if we can avoid doing so, just as a matter of principle.

Using huge_pte_lockptr() is useful when we can't assume the hugepage's
properties, such as its size or whether it's pmd- or pud-based.

> One subtlety to take care over: it's a long time since I've had to
> worry about pmd folding and pud folding (what happens when you only
> have 2 or 3 levels of page table instead of the full 4): macros get
> defined to each other, and levels get optimized out (perhaps
> differently on different architectures).
> 
> So although at first sight the lock to take in follow_huge_pud()
> would seem to be mm->page_table_lock, I am not at this point certain
> that that's necessarily so - sometimes pud_huge might be pmd_huge,
> and the size PMD_SIZE, and pmd_lockptr appropriate at what appears
> to be the pud level.  Maybe: needs checking through the architectures
> and their configs, not obvious to me.

I think that every architecture uses mm->page_table_lock for pud-level
locking at least for now, but that could change in the future,
for example when 1GB hugepages or pud-based hugepages become common and
someone is interested in splitting the lock at the pud level.
So it would be helpful to introduce a pud_lockptr() which just returns
mm->page_table_lock for now, so that developers never forget to update it
if the pud lock is ever split.
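
e.g. a trivial helper along these lines (hypothetical, only to make the
intent explicit):

static inline spinlock_t *pud_lockptr(struct mm_struct *mm, pud_t *pud)
{
	/* no split pud locks yet: everyone uses the per-mm lock */
	return &mm->page_table_lock;
}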

> 
> I realize that I am asking for you (or I) to do more work, when using
> huge_pte_lock(hstate_vma(vma),,) would work it out "automatically";
> but I do feel quite strongly that that's the right approach here
> (and I'm not just trying to avoid a few edits of "mm" to "vma").

Yes, I agree.

> Cc'ing Kirill, who may have a strong view to the contrary,
> or a good insight on where the problems if any might be.
> 
> Also Cc'ing Kirill because I'm not convinced that huge_pte_lockptr()
> necessarily does the right thing on follow_huge_addr() architectures,
> ia64 and powerpc.  Do they, for example, allocate the memory for their
> hugetlb entries in such a way that we can indeed use pmd_lockptr() to
> point to a useable spinlock, in the case when huge_page_size(h) just
> happens to equal PMD_SIZE?
> 
> I don't know if this was thought through thoroughly
> (now that's a satisfying phrase hugh thinks hugh never wrote before!)
> when huge_pte_lockptr() was invented or not.  I think it would be safer
> if huge_pte_lockptr() just gave mm->page_table_lock on follow_huge_addr()
> architectures.

Yes, this seems like a real problem and is worth discussing with the
maintainers of these architectures. Maybe we can do this as separate work.

> 
> > +
> > +	if (!pmd_huge(*pmd))
> > +		goto out;
> > +
> > +	page = pte_page(*(pte_t *)pmd) + ((address & ~PMD_MASK) >> PAGE_SHIFT);
> > +
> > +	if (flags & FOLL_GET)
> > +		if (!get_page_unless_zero(page))
> > +			page = NULL;
> 
> get_page() should be quite good enough, shouldn't it?  We are holding
> the necessary lock, and have tested pmd_huge(*pmd), so it would be a
> bug if page_count(page) were zero here.

Yes, get_page() is enough, I'll fix it.

> > +out:
> > +	spin_unlock(ptl);
> >  	return page;
> >  }
> >  
> >  struct page * __weak
> > -follow_huge_pud(struct mm_struct *mm, unsigned long address,
> > -		pud_t *pud, int write)
> > +follow_huge_pud(struct vm_area_struct *vma, unsigned long address,
> > +		pud_t *pud, int flags)
> >  {
> > -	struct page *page;
> > +	struct page *page = NULL;
> > +	spinlock_t *ptl;
> >  
> > -	page = pte_page(*(pte_t *)pud);
> > -	if (page)
> > -		page += ((address & ~PUD_MASK) >> PAGE_SHIFT);
> > +	if (flags & FOLL_GET)
> > +		return NULL;
> > +
> > +	ptl = huge_pte_lock(hstate_vma(vma), vma->vm_mm, (pte_t *)pud);
> 
> Well, you do have vma declared here, but otherwise it's like what you
> had in follow_huge_addr(): there is no point in taking and dropping
> the lock if you're not getting the page while the lock is held.
> 
> So, which way to go on follow_huge_pud()?  I certainly think that we
> should implement FOLL_GET on it, as we should for follow_huge_addr(),
> simply for completeness, and so we don't need to come back here.

Right, this will become important when thinking of 1GB hugepage migration,

> But whether we should do so in a patch which is Cc'ed to stable is not
> so clear.  And leaving follow_huge_pmd() and follow_huge_addr() out
> of this patch may avoid those awkward where-is-the-lock questions
> for now.  Convert follow_huge_pmd() in a separate patch?

... but 1GB hugepage migration is not available now, so there's no reason
to send the follow_huge_pud() change to stable. I agree to separate that part.

Thanks,
Naoya Horiguchi

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v3 3/6] mm/hugetlb: fix getting refcount 0 page in hugetlb_fault()
  2014-09-04  0:20     ` Hugh Dickins
@ 2014-09-05  5:28       ` Naoya Horiguchi
  -1 siblings, 0 replies; 48+ messages in thread
From: Naoya Horiguchi @ 2014-09-05  5:28 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, David Rientjes, linux-mm, linux-kernel, Naoya Horiguchi

On Wed, Sep 03, 2014 at 05:20:59PM -0700, Hugh Dickins wrote:
> On Thu, 28 Aug 2014, Naoya Horiguchi wrote:
> 
> > When running the test which causes the race as shown in the previous patch,
> > we can hit the BUG "get_page() on refcount 0 page" in hugetlb_fault().
> > 
> > This race happens when pte turns into migration entry just after the first
> > check of is_hugetlb_entry_migration() in hugetlb_fault() passed with false.
> > To fix this, we need to check pte_present() again with holding ptl.
> > 
> > This patch also reorders taking ptl and doing pte_page(), because pte_page()
> > should be done in ptl. Due to this reordering, we need use trylock_page()
> > in page != pagecache_page case to respect locking order.
> > 
> > ChangeLog v3:
> > - doing pte_page() and taking refcount under page table lock
> > - check pte_present after taking ptl, which makes it unnecessary to use
> >   get_page_unless_zero()
> > - use trylock_page in page != pagecache_page case
> > - fixed target stable version
> 
> ChangeLog vN below the --- (or am I contradicting some other advice?)

No, this is practical advice.

> > 
> > Fixes: 66aebce747ea ("hugetlb: fix race condition in hugetlb_fault()")
> > Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> > Cc: <stable@vger.kernel.org>  # [3.2+]
> 
> One bug, one warning, a couple of suboptimals...
> 
> > ---
> >  mm/hugetlb.c | 32 ++++++++++++++++++--------------
> >  1 file changed, 18 insertions(+), 14 deletions(-)
> > 
> > diff --git mmotm-2014-08-25-16-52.orig/mm/hugetlb.c mmotm-2014-08-25-16-52/mm/hugetlb.c
> > index c5345c5edb50..2aafe073cb06 100644
> > --- mmotm-2014-08-25-16-52.orig/mm/hugetlb.c
> > +++ mmotm-2014-08-25-16-52/mm/hugetlb.c
> > @@ -3184,6 +3184,15 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> >  								vma, address);
> >  	}
> >  
> > +	ptl = huge_pte_lock(h, mm, ptep);
> > +
> > +	/* Check for a racing update before calling hugetlb_cow */
> > +	if (unlikely(!pte_same(entry, huge_ptep_get(ptep))))
> > +		goto out_ptl;
> > +
> > +	if (!pte_present(entry))
> > +		goto out_ptl;
> 
> A comment on that test would be helpful.  Is a migration entry
> the only !pte_present() case you would expect to find there?

No, we can hit the same race with a hwpoisoned entry, although it's
very rare.

> It would be better to test "entry" for this (or for being a migration
> entry) higher up, just after getting "entry": less to unwind on error.

Right, thanks.
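
For reference, a rough sketch (not the final code) of testing "entry" right
after it is fetched, so that there is less to unwind on error; the exact
placement and label depend on how the function ends up being laid out:

	entry = huge_ptep_get(ptep);
	/*
	 * entry can already be a migration or hwpoison entry at this point;
	 * bail out early and let the next fault (or a wait on the migration
	 * entry after the locks are dropped) handle it.
	 */
	if (!pte_present(entry))
		goto out_mutex;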

> And better to call migration_entry_wait_huge(), after dropping locks,
> before returning 0, so that we don't keep the cpu busy faulting while
> the migration entry remains there.  Maybe not important, but better.

OK.

> Probably best done with a goto unwinding code at end of function.
> 
> (Whereas we don't worry about "wait"s in the !pte_same case,
> because !pte_same indicates that change is already occurring:
> it's prolonged pte_same cases that we want to get away from.)
> 
> > +
> >  	/*
> >  	 * hugetlb_cow() requires page locks of pte_page(entry) and
> >  	 * pagecache_page, so here we need take the former one
> > @@ -3192,22 +3201,17 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> >  	 * so no worry about deadlock.
> >  	 */
> >  	page = pte_page(entry);
> > -	get_page(page);
> >  	if (page != pagecache_page)
> > -		lock_page(page);
> > -
> > -	ptl = huge_pte_lockptr(h, mm, ptep);
> > -	spin_lock(ptl);
> > -	/* Check for a racing update before calling hugetlb_cow */
> > -	if (unlikely(!pte_same(entry, huge_ptep_get(ptep))))
> > -		goto out_ptl;
> > +		if (!trylock_page(page))
> > +			goto out_ptl;
> 
> And, again to avoid keeping the cpu busy refaulting, it would be better
> to wait_on_page_locked(), after dropping locks, before returning 0;
> probably best done with another goto end of function.

OK.
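
A sketch of what that could look like; "need_wait_lock" is only an
illustrative local flag, not something in the current code:

	if (page != pagecache_page)
		if (!trylock_page(page)) {
			need_wait_lock = 1;
			goto out_ptl;
		}

and then, once every lock has been dropped on the way out:

	/*
	 * The page is still locked (most likely by the migration path);
	 * wait for it so the retried fault does not busy-loop.
	 */
	if (need_wait_lock)
		wait_on_page_locked(page);
	return ret;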

> >  
> > +	get_page(page);
> >  
> >  	if (flags & FAULT_FLAG_WRITE) {
> >  		if (!huge_pte_write(entry)) {
> >  			ret = hugetlb_cow(mm, vma, address, ptep, entry,
> >  					pagecache_page, ptl);
> > -			goto out_ptl;
> > +			goto out_put_page;
> >  		}
> >  		entry = huge_pte_mkdirty(entry);
> >  	}
> > @@ -3215,7 +3219,11 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> >  	if (huge_ptep_set_access_flags(vma, address, ptep, entry,
> >  						flags & FAULT_FLAG_WRITE))
> >  		update_mmu_cache(vma, address, ptep);
> > -
> > +out_put_page:
> > +	put_page(page);
> 
> If I'm reading this correctly, there's now a small but nasty chance that
> this put_page will be the one which frees the page, and the unlock_page
> below will then be unlocking a freed page.  Our "Bad page" checks should
> detect that case, so it won't be as serious as unlocking someone else's
> page; but you still should avoid that possibility.

Right, I shouldn't have changed the order of put_page() and unlock_page().
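
So the exit path should keep the original ordering, roughly (label names are
only illustrative):

out_put_page:
	if (page != pagecache_page)
		unlock_page(page);
	put_page(page);
out_ptl:
	spin_unlock(ptl);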

> 
> > +out_unlock_page:
> 
> mm/hugetlb.c:3231:1: warning: label `out_unlock_page' defined but not used [-Wunused-label]

Sorry, I'll fix it.

> > +	if (page != pagecache_page)
> > +		unlock_page(page);
> >  out_ptl:
> >  	spin_unlock(ptl);
> >  
> > @@ -3223,10 +3231,6 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> >  		unlock_page(pagecache_page);
> >  		put_page(pagecache_page);
> >  	}
> > -	if (page != pagecache_page)
> > -		unlock_page(page);
> > -	put_page(page);
> > -
> >  out_mutex:
> >  	mutex_unlock(&htlb_fault_mutex_table[hash]);
> >  	return ret;
> > -- 
> > 1.9.3
> 

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v3 4/6] mm/hugetlb: add migration entry check in hugetlb_change_protection
  2014-09-04  1:06     ` Hugh Dickins
@ 2014-09-05  5:28       ` Naoya Horiguchi
  -1 siblings, 0 replies; 48+ messages in thread
From: Naoya Horiguchi @ 2014-09-05  5:28 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, David Rientjes, linux-mm, linux-kernel, Naoya Horiguchi

On Wed, Sep 03, 2014 at 06:06:34PM -0700, Hugh Dickins wrote:
> On Thu, 28 Aug 2014, Naoya Horiguchi wrote:
> 
> > There is a race condition between hugepage migration and change_protection(),
> > where hugetlb_change_protection() doesn't care about migration entries and
> > wrongly overwrites them. That causes unexpected results like kernel crash.
> > 
> > This patch adds is_hugetlb_entry_(migration|hwpoisoned) check in this
> > function to do proper actions.
> > 
> > ChangeLog v3:
> > - handle migration entry correctly (instead of just skipping)
> > 
> > Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> > Cc: <stable@vger.kernel.org> # [2.6.36+]
> 
> 2.6.36+?  For the hwpoisoned part of it, I suppose.
> Then you'd better mentioned the hwpoisoned case in the comment above.

OK, I'll update the description and the subject.

> > ---
> >  mm/hugetlb.c | 21 ++++++++++++++++++++-
> >  1 file changed, 20 insertions(+), 1 deletion(-)
> > 
> > diff --git mmotm-2014-08-25-16-52.orig/mm/hugetlb.c mmotm-2014-08-25-16-52/mm/hugetlb.c
> > index 2aafe073cb06..1ed9df6def54 100644
> > --- mmotm-2014-08-25-16-52.orig/mm/hugetlb.c
> > +++ mmotm-2014-08-25-16-52/mm/hugetlb.c
> > @@ -3362,7 +3362,26 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
> >  			spin_unlock(ptl);
> >  			continue;
> >  		}
> > -		if (!huge_pte_none(huge_ptep_get(ptep))) {
> > +		pte = huge_ptep_get(ptep);
> > +		if (unlikely(is_hugetlb_entry_hwpoisoned(pte))) {
> > +			spin_unlock(ptl);
> > +			continue;
> > +		}
> > +		if (unlikely(is_hugetlb_entry_migration(pte))) {
> > +			swp_entry_t entry = pte_to_swp_entry(pte);
> > +
> > +			if (is_write_migration_entry(entry)) {
> > +				pte_t newpte;
> > +
> > +				make_migration_entry_read(&entry);
> > +				newpte = swp_entry_to_pte(entry);
> > +				set_pte_at(mm, address, ptep, newpte);
> 
> set_huge_pte_at.

Fixed, thanks.
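
So the inner block of the migration-entry branch becomes (only the
set_huge_pte_at() line differs from the hunk above):

			if (is_write_migration_entry(entry)) {
				pte_t newpte;

				make_migration_entry_read(&entry);
				newpte = swp_entry_to_pte(entry);
				set_huge_pte_at(mm, address, ptep, newpte);
				pages++;
			}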

> 
> (As usual, I can't bear to see these is_hugetlb_entry_hwpoisoned and
> is_hugetlb_entry_migration examples go past without bleating about
> wanting to streamline them a little; but agreed last time to leave
> that to some later cleanup once all the stable backports are stable.)

Yes, these two check routines need cleanup.
I'll do that as separate work later.

> > +				pages++;
> > +			}
> > +			spin_unlock(ptl);
> > +			continue;
> > +		}
> > +		if (!huge_pte_none(pte)) {
> >  			pte = huge_ptep_get_and_clear(mm, address, ptep);
> >  			pte = pte_mkhuge(huge_pte_modify(pte, newprot));
> >  			pte = arch_make_huge_pte(pte, vma, NULL, 0);
> > -- 
> > 1.9.3
> 

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v3 5/6] mm/hugetlb: add migration entry check in __unmap_hugepage_range
  2014-09-04  1:47     ` Hugh Dickins
@ 2014-09-05  5:28       ` Naoya Horiguchi
  -1 siblings, 0 replies; 48+ messages in thread
From: Naoya Horiguchi @ 2014-09-05  5:28 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, David Rientjes, linux-mm, linux-kernel, Naoya Horiguchi

On Wed, Sep 03, 2014 at 06:47:38PM -0700, Hugh Dickins wrote:
> On Thu, 28 Aug 2014, Naoya Horiguchi wrote:
> 
> > If __unmap_hugepage_range() tries to unmap the address range over which
> > hugepage migration is on the way, we get the wrong page because pte_page()
> > doesn't work for migration entries. This patch calls pte_to_swp_entry() and
> > migration_entry_to_page() to get the right page for migration entries.
> > 
> > Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> > Cc: <stable@vger.kernel.org>  # [2.6.36+]
> 
> 2.6.36+?  But this one doesn't affect hwpoisoned.
> I admit I've lost track of how far back hugetlb migration goes:
> oh, to 2.6.37+, that fits with what you marked on some commits earlier.
> But then 2/6 says 3.12+.  Help!  Please remind me of the sequence of events.

The bug fixed by this patch has existed since any kind of hugetlb migration appeared,
so I tagged it [2.6.36+] (Fixes: 290408d4a2 "hugetlb: hugepage migration core").
As for patch 2/6, the related bug was introduced by follow_huge_pmd()
with FOLL_GET, which became possible after commit e632a938d914 "mm: migrate:
add hugepage migration code to move_pages()", so I tagged it [3.12+].

> 
> > ---
> >  mm/hugetlb.c | 9 ++++++++-
> >  1 file changed, 8 insertions(+), 1 deletion(-)
> > 
> > diff --git mmotm-2014-08-25-16-52.orig/mm/hugetlb.c mmotm-2014-08-25-16-52/mm/hugetlb.c
> > index 1ed9df6def54..0a4511115ee0 100644
> > --- mmotm-2014-08-25-16-52.orig/mm/hugetlb.c
> > +++ mmotm-2014-08-25-16-52/mm/hugetlb.c
> > @@ -2652,6 +2652,13 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
> >  		if (huge_pte_none(pte))
> >  			goto unlock;
> >  
> > +		if (unlikely(is_hugetlb_entry_migration(pte))) {
> > +			swp_entry_t entry = pte_to_swp_entry(pte);
> > +
> > +			page = migration_entry_to_page(entry);
> > +			goto clear;
> > +		}
> > +
> 
> This surprises me: are you sure?  Obviously you know hugetlb migration
> much better than I do: is it done in a significantly different way from
> order:0 page migration?  In the order:0 case, there is no reference to
> the page corresponding to the migration entry placed in a page table,
> just the remaining reference held by the task doing the migration.  But
> here you are jumping to the code which unmaps and frees a present page.

Sorry, I misread the code again, you're right.

> I can see that a fix is necessary, but I would have expected it to
> consist of merely changing the "HWPoisoned" comment below to include
> migration entries, and changing its test from
> 		if (unlikely(is_hugetlb_entry_hwpoisoned(pte))) {
> to
> 		if (unlikely(!pte_present(pte))) {

Yes, this looks like the best way.
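
i.e. roughly the following, assuming the existing body under that comment
(the pte clear and the jump to unlock) stays as it is:

		/*
		 * Migrating hugepage or HWPoisoned hugepage is already
		 * unmapped and dropped its reference, so just clear the
		 * pte here.
		 */
		if (unlikely(!pte_present(pte))) {
			huge_pte_clear(mm, address, ptep);
			goto unlock;
		}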

> 
> >  		/*
> >  		 * HWPoisoned hugepage is already unmapped and dropped reference
> >  		 */
> > @@ -2677,7 +2684,7 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
> >  			 */
> >  			set_vma_resv_flags(vma, HPAGE_RESV_UNMAPPED);
> >  		}
> > -
> > +clear:
> >  		pte = huge_ptep_get_and_clear(mm, address, ptep);
> >  		tlb_remove_tlb_entry(tlb, ptep, address);
> >  		if (huge_pte_dirty(pte))
> > -- 
> > 1.9.3
> 

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v3 6/6] mm/hugetlb: remove unused argument of follow_huge_addr()
  2014-09-03 21:26     ` Hugh Dickins
@ 2014-09-05  5:29       ` Naoya Horiguchi
  -1 siblings, 0 replies; 48+ messages in thread
From: Naoya Horiguchi @ 2014-09-05  5:29 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, David Rientjes, linux-mm, linux-kernel, Naoya Horiguchi

On Wed, Sep 03, 2014 at 02:26:37PM -0700, Hugh Dickins wrote:
> On Thu, 28 Aug 2014, Naoya Horiguchi wrote:
> 
> > follow_huge_addr()'s parameter write is not used, so let's remove it.
> > 
> > Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> 
> I think this patch is a waste of time: that it should be replaced
> by a patch which replaces the "write" argument by a "flags" argument,

OK, I'll just drop this patch.

> so that follow_huge_addr() can do get_page() for FOLL_GET while holding
> appropriate lock, instead of the BUG_ON(flags & FOLL_GET) we currently
> have.
> 
> Once that is implemented, you could try getting hugetlb migration
> tested on ia64 and powerpc; but yes, keep hugetlb migration disabled
> on all but x86 until it has been tested elsewhere.
> 
> > ---
> >  arch/ia64/mm/hugetlbpage.c    | 2 +-
> >  arch/powerpc/mm/hugetlbpage.c | 2 +-
> >  arch/x86/mm/hugetlbpage.c     | 2 +-
> >  include/linux/hugetlb.h       | 5 ++---
> >  mm/gup.c                      | 2 +-
> >  mm/hugetlb.c                  | 3 +--
> >  6 files changed, 7 insertions(+), 9 deletions(-)
> > 
> > diff --git mmotm-2014-08-25-16-52.orig/arch/ia64/mm/hugetlbpage.c mmotm-2014-08-25-16-52/arch/ia64/mm/hugetlbpage.c
> > index 6170381bf074..524a4e001bda 100644
> > --- mmotm-2014-08-25-16-52.orig/arch/ia64/mm/hugetlbpage.c
> > +++ mmotm-2014-08-25-16-52/arch/ia64/mm/hugetlbpage.c
> > @@ -89,7 +89,7 @@ int prepare_hugepage_range(struct file *file,
> >  	return 0;
> >  }
> >  
> > -struct page *follow_huge_addr(struct mm_struct *mm, unsigned long addr, int write)
> > +struct page *follow_huge_addr(struct mm_struct *mm, unsigned long addr)
> >  {
> >  	struct page *page = NULL;
> >  	pte_t *ptep;
> > diff --git mmotm-2014-08-25-16-52.orig/arch/powerpc/mm/hugetlbpage.c mmotm-2014-08-25-16-52/arch/powerpc/mm/hugetlbpage.c
> > index 1d8854a56309..5b6fe8b0cde3 100644
> > --- mmotm-2014-08-25-16-52.orig/arch/powerpc/mm/hugetlbpage.c
> > +++ mmotm-2014-08-25-16-52/arch/powerpc/mm/hugetlbpage.c
> > @@ -674,7 +674,7 @@ void hugetlb_free_pgd_range(struct mmu_gather *tlb,
> >  }
> >  
> >  struct page *
> > -follow_huge_addr(struct mm_struct *mm, unsigned long address, int write)
> > +follow_huge_addr(struct mm_struct *mm, unsigned long address)
> >  {
> >  	pte_t *ptep;
> >  	struct page *page = ERR_PTR(-EINVAL);
> > diff --git mmotm-2014-08-25-16-52.orig/arch/x86/mm/hugetlbpage.c mmotm-2014-08-25-16-52/arch/x86/mm/hugetlbpage.c
> > index 03b8a7c11817..cab09d87ae65 100644
> > --- mmotm-2014-08-25-16-52.orig/arch/x86/mm/hugetlbpage.c
> > +++ mmotm-2014-08-25-16-52/arch/x86/mm/hugetlbpage.c
> > @@ -18,7 +18,7 @@
> >  
> >  #if 0	/* This is just for testing */
> >  struct page *
> > -follow_huge_addr(struct mm_struct *mm, unsigned long address, int write)
> > +follow_huge_addr(struct mm_struct *mm, unsigned long address)
> >  {
> >  	unsigned long start = address;
> >  	int length = 1;
> > diff --git mmotm-2014-08-25-16-52.orig/include/linux/hugetlb.h mmotm-2014-08-25-16-52/include/linux/hugetlb.h
> > index b3200fce07aa..cdff1bd393bb 100644
> > --- mmotm-2014-08-25-16-52.orig/include/linux/hugetlb.h
> > +++ mmotm-2014-08-25-16-52/include/linux/hugetlb.h
> > @@ -96,8 +96,7 @@ pte_t *huge_pte_alloc(struct mm_struct *mm,
> >  			unsigned long addr, unsigned long sz);
> >  pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr);
> >  int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr, pte_t *ptep);
> > -struct page *follow_huge_addr(struct mm_struct *mm, unsigned long address,
> > -			      int write);
> > +struct page *follow_huge_addr(struct mm_struct *mm, unsigned long address);
> >  struct page *follow_huge_pmd(struct vm_area_struct *vma, unsigned long address,
> >  				pmd_t *pmd, int flags);
> >  struct page *follow_huge_pud(struct vm_area_struct *vma, unsigned long address,
> > @@ -124,7 +123,7 @@ static inline unsigned long hugetlb_total_pages(void)
> >  }
> >  
> >  #define follow_hugetlb_page(m,v,p,vs,a,b,i,w)	({ BUG(); 0; })
> > -#define follow_huge_addr(mm, addr, write)	ERR_PTR(-EINVAL)
> > +#define follow_huge_addr(mm, addr)	ERR_PTR(-EINVAL)
> >  #define copy_hugetlb_page_range(src, dst, vma)	({ BUG(); 0; })
> >  static inline void hugetlb_report_meminfo(struct seq_file *m)
> >  {
> > diff --git mmotm-2014-08-25-16-52.orig/mm/gup.c mmotm-2014-08-25-16-52/mm/gup.c
> > index 597a5e92e265..8f0550f1770d 100644
> > --- mmotm-2014-08-25-16-52.orig/mm/gup.c
> > +++ mmotm-2014-08-25-16-52/mm/gup.c
> > @@ -149,7 +149,7 @@ struct page *follow_page_mask(struct vm_area_struct *vma,
> >  
> >  	*page_mask = 0;
> >  
> > -	page = follow_huge_addr(mm, address, flags & FOLL_WRITE);
> > +	page = follow_huge_addr(mm, address);
> >  	if (!IS_ERR(page)) {
> >  		BUG_ON(flags & FOLL_GET);
> >  		return page;
> > diff --git mmotm-2014-08-25-16-52.orig/mm/hugetlb.c mmotm-2014-08-25-16-52/mm/hugetlb.c
> > index 0a4511115ee0..f7dcad3474ec 100644
> > --- mmotm-2014-08-25-16-52.orig/mm/hugetlb.c
> > +++ mmotm-2014-08-25-16-52/mm/hugetlb.c
> > @@ -3690,8 +3690,7 @@ pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr)
> >   * behavior.
> >   */
> >  struct page * __weak
> > -follow_huge_addr(struct mm_struct *mm, unsigned long address,
> > -			      int write)
> > +follow_huge_addr(struct mm_struct *mm, unsigned long address)
> >  {
> >  	return ERR_PTR(-EINVAL);
> >  }
> > -- 
> > 1.9.3
> 

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v3 2/6] mm/hugetlb: take page table lock in follow_huge_(addr|pmd|pud)()
  2014-09-05  5:27       ` Naoya Horiguchi
@ 2014-09-08  7:13         ` Hugh Dickins
  -1 siblings, 0 replies; 48+ messages in thread
From: Hugh Dickins @ 2014-09-08  7:13 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: Hugh Dickins, Kirill A. Shutemov, Andrew Morton, David Rientjes,
	linux-mm, linux-kernel, Naoya Horiguchi

On Fri, 5 Sep 2014, Naoya Horiguchi wrote:
> On Wed, Sep 03, 2014 at 02:17:41PM -0700, Hugh Dickins wrote:
> > On Thu, 28 Aug 2014, Naoya Horiguchi wrote:
> > > 
> > > Reported-by: Hugh Dickins <hughd@google.com>
> > > Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> > > Cc: <stable@vger.kernel.org>  # [3.12+]
> > 
> > No ack to this one yet, I'm afraid.
> 
> OK, I defer Reported-by until all the problems in this patch are solved.
> I added this Reported-by because Andrew asked how In found this problem,
> and advised me to show the reporter.
> And I didn't intend by this Reported-by that you acked the patch.
> In this case, should I have used some unofficial tag like
> "Not-yet-Reported-by:" to avoid being rude?

Sorry, misunderstanding, I chose that position to write "No ack to this
one yet" because that is where I would insert my "Acked-by" to the patch
when ready.  I just meant that I cannot yet give you my "Acked-by".

You were not being rude to me at all, quite the reverse.

I have no objection to your writing "Reported-by: Hugh...": you are
being polite to acknowledge me, and I was not objecting to that.

Although usually, we save "Reported-by"s for users who have
reported a problem they saw in practice, rather than for fellow
developers who have looked at the code and seen a potential bug -
so I won't mind at all if you end up taking it out.

> 
> > One subtlety to take care over: it's a long time since I've had to
> > worry about pmd folding and pud folding (what happens when you only
> > have 2 or 3 levels of page table instead of the full 4): macros get
> > defined to each other, and levels get optimized out (perhaps
> > differently on different architectures).
> > 
> > So although at first sight the lock to take in follow_huge_pud()
> > would seem to be mm->page_table_lock, I am not at this point certain
> > that that's necessarily so - sometimes pud_huge might be pmd_huge,
> > and the size PMD_SIZE, and pmd_lockptr appropriate at what appears
> > to be the pud level.  Maybe: needs checking through the architectures
> > and their configs, not obvious to me.
> 
> I think that every architecture uses mm->page_table_lock for pud-level
> locking at least for now, but that could be changed in the future,
> for example when 1GB hugepages or pud-based hugepages become common and
> someone are interested in splitting lock for pud level.

I'm not convinced by your answer, that you understand the (perhaps
imaginary!) issue I'm referring to.  Try grep for __PAGETABLE_P.D_FOLDED.

Our infrastructure allows for 4 levels of pagetable, pgd pud pmd pte,
but many architectures/configurations support only 2 or 3 levels.
What pud functions and pmd functions work out to be in those
configs is confusing, and varies from architecture to architecture.

In particular, pud and pmd may be different expressions of the same
thing (with 1 pmd per pud, instead of say 512).  In that case PUD_SIZE
will equal PMD_SIZE: and then at the pud level huge_pte_lockptr()
will be using split locking instead of mm->page_table_lock.

Many of the hugetlb architectures have a pud_huge() which just returns
0, and we need not worry about those, nor the follow_huge_addr() powerpc.
But arm64, mips, tile, x86 look more interesting.

Frankly, I find myself too dumb to be sure of the right answer for all:
and think that when we put the proper locking into follow_huge_pud(),
we shall have to include a PUD_SIZE == PMD_SIZE test, to let the
compiler decide for us which is the appropriate locking to match
huge_pte_lockptr().

Unless Kirill can illuminate: I may be afraid of complications
where actually there are none.

> So it would be helpful to introduce pud_lockptr() which just returns
> mm->page_table_lock now, so that developers never forget to update it
> when considering splitting pud lock.
> 
> > 
> > I realize that I am asking for you (or I) to do more work, when using
> > huge_pte_lock(hstate_vma(vma),,) would work it out "automatically";
> > but I do feel quite strongly that that's the right approach here
> > (and I'm not just trying to avoid a few edits of "mm" to "vma").
> 
> Yes, I agree.
> 
> > Cc'ing Kirill, who may have a strong view to the contrary,
> > or a good insight on where the problems if any might be.
> > 
> > Also Cc'ing Kirill because I'm not convinced that huge_pte_lockptr()
> > necessarily does the right thing on follow_huge_addr() architectures,
> > ia64 and powerpc.  Do they, for example, allocate the memory for their
> > hugetlb entries in such a way that we can indeed use pmd_lockptr() to
> > point to a useable spinlock, in the case when huge_page_size(h) just
> > happens to equal PMD_SIZE?
> > 
> > I don't know if this was thought through thoroughly
> > (now that's a satisfying phrase hugh thinks hugh never wrote before!)
> > when huge_pte_lockptr() was invented or not.  I think it would be safer
> > if huge_pte_lockptr() just gave mm->page_table_lock on follow_huge_addr()
> > architectures.
> 
> Yes, this seems a real problem and is worth discussing with maintainers
> of these architectures. Maybe we can do this as a separate work.

Perhaps, but I'm hoping Kirill can say, whether it's something he
considered and felt safe with, or something he overlooked at the
time and would prefer to change now.

I suspect that either the follow_huge_addr() architectures should be
constrained to use mm->page_table_lock; or, when we do introduce the
proper locking into follow_huge_addr() (you appear to be backing away
from making any change there for now: yes, it's not needed urgently),
that one will have to take vma instead of mm, so that it can be sure
to match huge_pte_lockptr().

Hugh

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v3 2/6] mm/hugetlb: take page table lock in follow_huge_(addr|pmd|pud)()
  2014-09-08  7:13         ` Hugh Dickins
@ 2014-09-08 21:37           ` Naoya Horiguchi
  -1 siblings, 0 replies; 48+ messages in thread
From: Naoya Horiguchi @ 2014-09-08 21:37 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Kirill A. Shutemov, Andrew Morton, David Rientjes, linux-mm,
	linux-kernel, Naoya Horiguchi

On Mon, Sep 08, 2014 at 12:13:16AM -0700, Hugh Dickins wrote:
> On Fri, 5 Sep 2014, Naoya Horiguchi wrote:
> > On Wed, Sep 03, 2014 at 02:17:41PM -0700, Hugh Dickins wrote:
> > > On Thu, 28 Aug 2014, Naoya Horiguchi wrote:
> > > > 
> > > > Reported-by: Hugh Dickins <hughd@google.com>
> > > > Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> > > > Cc: <stable@vger.kernel.org>  # [3.12+]
> > > 
> > > No ack to this one yet, I'm afraid.
> > 
> > OK, I defer Reported-by until all the problems in this patch are solved.
> > I added this Reported-by because Andrew asked how In found this problem,
> > and advised me to show the reporter.
> > And I didn't intend by this Reported-by that you acked the patch.
> > In this case, should I have used some unofficial tag like
> > "Not-yet-Reported-by:" to avoid being rude?
> 
> Sorry, misunderstanding, I chose that position to write "No ack to this
> one yet" because that is where I would insert my "Acked-by" to the patch
> when ready.  I just meant that I cannot yet give you my "Acked-by".

I see, that was my misunderstanding, thanks.

> You were not being rude to me at all, quite the reverse.
> 
> I have no objection to your writing "Reported-by: Hugh...": you are
> being polite to acknowledge me, and I was not objecting to that.

Great.

> Although usually, we save "Reported-by"s for users who have
> reported a problem they saw in practice, rather than for fellow
> developers who have looked at the code and seen a potential bug -
> so I won't mind at all if you end up taking it out.
> 
> > 
> > > One subtlety to take care over: it's a long time since I've had to
> > > worry about pmd folding and pud folding (what happens when you only
> > > have 2 or 3 levels of page table instead of the full 4): macros get
> > > defined to each other, and levels get optimized out (perhaps
> > > differently on different architectures).
> > > 
> > > So although at first sight the lock to take in follow_huge_pud()
> > > would seem to be mm->page_table_lock, I am not at this point certain
> > > that that's necessarily so - sometimes pud_huge might be pmd_huge,
> > > and the size PMD_SIZE, and pmd_lockptr appropriate at what appears
> > > to be the pud level.  Maybe: needs checking through the architectures
> > > and their configs, not obvious to me.
> > 
> > I think that every architecture uses mm->page_table_lock for pud-level
> > locking at least for now, but that could be changed in the future,
> > for example when 1GB hugepages or pud-based hugepages become common and
> > someone are interested in splitting lock for pud level.
> 
> I'm not convinced by your answer, that you understand the (perhaps
> imaginary!) issue I'm referring to.  Try grep for __PAGETABLE_P.D_FOLDED.
> 
> Our infrastructure allows for 4 levels of pagetable, pgd pud pmd pte,
> but many architectures/configurations support only 2 or 3 levels.
> What pud functions and pmd functions work out to be in those
> configs is confusing, and varies from architecture to architecture.
> 
> In particular, pud and pmd may be different expressions of the same
> thing (with 1 pmd per pud, instead of say 512).  In that case PUD_SIZE
> will equal PMD_SIZE: and then at the pud level huge_pte_lockptr()
> will be using split locking instead of mm->page_table_lock.

Is that a real possibility? It seems to me that on such a system no one
can create pud-based hugepages, so no one needs to care about pud-level locking.

> Many of the hugetlb architectures have a pud_huge() which just returns
> 0, and we need not worry about those, nor the follow_huge_addr() powerpc.
> But arm64, mips, tile, x86 look more interesting.
> 
> Frankly, I find myself too dumb to be sure of the right answer for all:
> and think that when we put the proper locking into follow_huge_pud(),
> we shall have to include a PUD_SIZE == PMD_SIZE test, to let the
> compiler decide for us which is the appropriate locking to match
> huge_pte_lockptr().

Yes, both should be done at the same time.
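
A very rough sketch of that idea; pud_lockptr() is a hypothetical helper name,
and whether the folded-pmd cast is valid on every architecture is exactly the
open question here:

static inline spinlock_t *pud_lockptr(struct mm_struct *mm, pud_t *pud)
{
	/*
	 * When the pmd level is folded into the pud level, PUD_SIZE equals
	 * PMD_SIZE and huge_pte_lockptr() picks the split pmd lock, so
	 * follow it; otherwise use mm->page_table_lock as today.
	 */
	if (PUD_SIZE == PMD_SIZE)
		return pmd_lockptr(mm, (pmd_t *)pud);
	return &mm->page_table_lock;
}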

> 
> Unless Kirill can illuminate: I may be afraid of complications
> where actually there are none.

Yes. What we need right now is to fix follow_huge_pmd(), and combining
non-urgent changes with it is not easy for me.

Thanks,
Naoya Horiguchi

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v3 2/6] mm/hugetlb: take page table lock in follow_huge_(addr|pmd|pud)()
  2014-09-08 21:37           ` Naoya Horiguchi
@ 2014-09-09 19:03             ` Hugh Dickins
  -1 siblings, 0 replies; 48+ messages in thread
From: Hugh Dickins @ 2014-09-09 19:03 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: Hugh Dickins, Kirill A. Shutemov, Andrew Morton, David Rientjes,
	linux-mm, linux-kernel, Naoya Horiguchi

On Mon, 8 Sep 2014, Naoya Horiguchi wrote:
> On Mon, Sep 08, 2014 at 12:13:16AM -0700, Hugh Dickins wrote:
> > On Fri, 5 Sep 2014, Naoya Horiguchi wrote:
> > > On Wed, Sep 03, 2014 at 02:17:41PM -0700, Hugh Dickins wrote:
> > > 
> > > > One subtlety to take care over: it's a long time since I've had to
> > > > worry about pmd folding and pud folding (what happens when you only
> > > > have 2 or 3 levels of page table instead of the full 4): macros get
> > > > defined to each other, and levels get optimized out (perhaps
> > > > differently on different architectures).
> > > > 
> > > > So although at first sight the lock to take in follow_huge_pud()
> > > > would seem to be mm->page_table_lock, I am not at this point certain
> > > > that that's necessarily so - sometimes pud_huge might be pmd_huge,
> > > > and the size PMD_SIZE, and pmd_lockptr appropriate at what appears
> > > > to be the pud level.  Maybe: needs checking through the architectures
> > > > and their configs, not obvious to me.
> > > 
> > > I think that every architecture uses mm->page_table_lock for pud-level
> > > locking at least for now, but that could be changed in the future,
> > > for example when 1GB hugepages or pud-based hugepages become common and
> > someone is interested in splitting the lock at the pud level.
> > 
> > I'm not convinced by your answer, that you understand the (perhaps
> > imaginary!) issue I'm referring to.  Try grep for __PAGETABLE_P.D_FOLDED.
> > 
> > Our infrastructure allows for 4 levels of pagetable, pgd pud pmd pte,
> > but many architectures/configurations support only 2 or 3 levels.
> > What pud functions and pmd functions work out to be in those
> > configs is confusing, and varies from architecture to architecture.
> > 
> > In particular, pud and pmd may be different expressions of the same
> > thing (with 1 pmd per pud, instead of say 512).  In that case PUD_SIZE
> > will equal PMD_SIZE: and then at the pud level huge_pte_lockptr()
> > will be using split locking instead of mm->page_table_lock.
> 
> Is it really a possible problem? It seems to me that on such a system
> no one can create pud-based hugepages or care about pud-level locking.

Maybe it is not a possible problem, I already said I'm not certain.
(Maybe I just need to try a couple of x86_32 builds with printks,
to find that it is a real problem; but I haven't tried, and x86_32
would not disprove it for the other architectures.)

But again, your answer does not convince me that you begin to understand
the issue: please read again what I wrote.  I am not talking about
pud-based hugepages, I'm talking about pmd-based hugepages when the
pud level is identical to the pmd level.

Hopefully, you're seeing the issue from a different viewpoint than I am,
and from your good viewpoint the answer is obvious, whereas from my
muddled viewpoint it is not; but you're not making that clear to me.

What is certain is that we do not need to worry about this in a
patch fixing follow_huge_pmd() alone: it only becomes an issue in a
patch extending properly locked FOLL_GET support to follow_huge_pud(),
which I think you've decided to set aside for now.

Hugh

^ permalink raw reply	[flat|nested] 48+ messages in thread


* Re: [PATCH v3 2/6] mm/hugetlb: take page table lock in follow_huge_(addr|pmd|pud)()
  2014-09-08  7:13         ` Hugh Dickins
@ 2014-09-30 16:54           ` Kirill A. Shutemov
  -1 siblings, 0 replies; 48+ messages in thread
From: Kirill A. Shutemov @ 2014-09-30 16:54 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Naoya Horiguchi, Kirill A. Shutemov, Andrew Morton,
	David Rientjes, linux-mm, linux-kernel, Naoya Horiguchi

On Mon, Sep 08, 2014 at 12:13:16AM -0700, Hugh Dickins wrote:
> > > One subtlety to take care over: it's a long time since I've had to
> > > worry about pmd folding and pud folding (what happens when you only
> > > have 2 or 3 levels of page table instead of the full 4): macros get
> > > defined to each other, and levels get optimized out (perhaps
> > > differently on different architectures).
> > > 
> > > So although at first sight the lock to take in follow_huge_pud()
> > > would seem to be mm->page_table_lock, I am not at this point certain
> > > that that's necessarily so - sometimes pud_huge might be pmd_huge,
> > > and the size PMD_SIZE, and pmd_lockptr appropriate at what appears
> > > to be the pud level.  Maybe: needs checking through the architectures
> > > and their configs, not obvious to me.
> > 
> > I think that every architecture uses mm->page_table_lock for pud-level
> > locking at least for now, but that could be changed in the future,
> > for example when 1GB hugepages or pud-based hugepages become common and
> > someone is interested in splitting the lock at the pud level.
> 
> I'm not convinced by your answer, that you understand the (perhaps
> imaginary!) issue I'm referring to.  Try grep for __PAGETABLE_P.D_FOLDED.
> 
> Our infrastructure allows for 4 levels of pagetable, pgd pud pmd pte,
> but many architectures/configurations support only 2 or 3 levels.
> What pud functions and pmd functions work out to be in those
> configs is confusing, and varies from architecture to architecture.
> 
> In particular, pud and pmd may be different expressions of the same
> thing (with 1 pmd per pud, instead of say 512).  In that case PUD_SIZE
> will equal PMD_SIZE: and then at the pud level huge_pte_lockptr()
> will be using split locking instead of mm->page_table_lock.

<sorry for delay -- just back from vacation>

Looks like we can't have PMD folded unless PUD is folded too:

include/asm-generic/pgtable-nopmd.h:#include <asm-generic/pgtable-nopud.h>

It means we have three cases:

- Neither PMD nor PUD is folded. PUD_SIZE == PMD_SIZE can be true only
  if the PUD table consists of a single entry, which is, hmm... strange.
- PUD is folded, PMD is not. In this case PUD_SIZE is equal to PGDIR_SIZE,
  which is always (I believe) greater than PMD_SIZE.
- Both are folded: PMD_SIZE == PUD_SIZE == PGDIR_SIZE, but we handle that
  with ARCH_ENABLE_SPLIT_PMD_PTLOCK, which is only enabled on configurations
  where PMD is not folded. Without ARCH_ENABLE_SPLIT_PMD_PTLOCK,
  pmd_lockptr() points to mm->page_table_lock.

Does it make sense?
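
(For reference, the helpers in question boil down to something like this --
quoting the current code from memory, slightly simplified:)

/* include/linux/mm.h, simplified */
#if USE_SPLIT_PMD_PTLOCKS	/* split pte locks && ARCH_ENABLE_SPLIT_PMD_PTLOCK */
static inline spinlock_t *pmd_lockptr(struct mm_struct *mm, pmd_t *pmd)
{
	return ptlock_ptr(pmd_to_page(pmd));
}
#else
static inline spinlock_t *pmd_lockptr(struct mm_struct *mm, pmd_t *pmd)
{
	return &mm->page_table_lock;
}
#endif

/* include/linux/hugetlb.h, simplified */
static inline spinlock_t *huge_pte_lockptr(struct hstate *h,
					   struct mm_struct *mm, pte_t *pte)
{
	if (huge_page_size(h) == PMD_SIZE)
		return pmd_lockptr(mm, (pmd_t *) pte);
	return &mm->page_table_lock;
}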

> Many of the hugetlb architectures have a pud_huge() which just returns
> 0, and we need not worry about those, nor the follow_huge_addr() powerpc.
> But arm64, mips, tile, x86 look more interesting.
> 
> Frankly, I find myself too dumb to be sure of the right answer for all:
> and think that when we put the proper locking into follow_huge_pud(),
> we shall have to include a PUD_SIZE == PMD_SIZE test, to let the
> compiler decide for us which is the appropriate locking to match
> huge_pte_lockptr().
> 
> Unless Kirill can illuminate: I may be afraid of complications
> where actually there are none.

I'm more worried about a false-negative result from the huge_page_size(h) ==
PMD_SIZE check. I can imagine that some architectures (power and ia64, I
guess) allow several page sizes on the same page table level, but only
one of them is PMD_SIZE.

It doesn't seem to be a problem currently, since we enable the split PMD
lock only on x86 and s390.

A possible solution is to annotate each hstate with the page table level
it corresponds to.
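
Something along these lines, purely to illustrate the idea (the enum and
the hstate field are made up here):

enum hugetlb_level {		/* hypothetical */
	HUGETLB_LEVEL_PMD,
	HUGETLB_LEVEL_PUD,
	HUGETLB_LEVEL_PGD,
};

/* new field in struct hstate: enum hugetlb_level level; */

static inline spinlock_t *huge_pte_lockptr(struct hstate *h,
					   struct mm_struct *mm, pte_t *pte)
{
	/* decide by level, not by huge_page_size(h) == PMD_SIZE */
	if (h->level == HUGETLB_LEVEL_PMD)
		return pmd_lockptr(mm, (pmd_t *) pte);
	return &mm->page_table_lock;
}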

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 48+ messages in thread


end of thread, other threads:[~2014-09-30 16:55 UTC | newest]

Thread overview: 48+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-08-29  1:38 [PATCH 0/6] hugepage migration fixes (v3) Naoya Horiguchi
2014-08-29  1:38 ` Naoya Horiguchi
2014-08-29  1:38 ` [PATCH v3 1/6] mm/hugetlb: reduce arch dependent code around follow_huge_* Naoya Horiguchi
2014-08-29  1:38   ` Naoya Horiguchi
2014-09-03 19:40   ` Hugh Dickins
2014-09-03 19:40     ` Hugh Dickins
2014-08-29  1:38 ` [PATCH v3 2/6] mm/hugetlb: take page table lock in follow_huge_(addr|pmd|pud)() Naoya Horiguchi
2014-08-29  1:38   ` Naoya Horiguchi
2014-09-03 21:17   ` Hugh Dickins
2014-09-03 21:17     ` Hugh Dickins
2014-09-05  5:27     ` Naoya Horiguchi
2014-09-05  5:27       ` Naoya Horiguchi
2014-09-08  7:13       ` Hugh Dickins
2014-09-08  7:13         ` Hugh Dickins
2014-09-08 21:37         ` Naoya Horiguchi
2014-09-08 21:37           ` Naoya Horiguchi
2014-09-09 19:03           ` Hugh Dickins
2014-09-09 19:03             ` Hugh Dickins
2014-09-30 16:54         ` Kirill A. Shutemov
2014-09-30 16:54           ` Kirill A. Shutemov
2014-08-29  1:38 ` [PATCH v3 3/6] mm/hugetlb: fix getting refcount 0 page in hugetlb_fault() Naoya Horiguchi
2014-08-29  1:38   ` Naoya Horiguchi
2014-09-04  0:20   ` Hugh Dickins
2014-09-04  0:20     ` Hugh Dickins
2014-09-05  5:28     ` Naoya Horiguchi
2014-09-05  5:28       ` Naoya Horiguchi
2014-08-29  1:38 ` [PATCH v3 4/6] mm/hugetlb: add migration entry check in hugetlb_change_protection Naoya Horiguchi
2014-08-29  1:38   ` Naoya Horiguchi
2014-09-04  1:06   ` Hugh Dickins
2014-09-04  1:06     ` Hugh Dickins
2014-09-05  5:28     ` Naoya Horiguchi
2014-09-05  5:28       ` Naoya Horiguchi
2014-08-29  1:38 ` [PATCH v3 5/6] mm/hugetlb: add migration entry check in __unmap_hugepage_range Naoya Horiguchi
2014-08-29  1:38   ` Naoya Horiguchi
2014-09-04  1:47   ` Hugh Dickins
2014-09-04  1:47     ` Hugh Dickins
2014-09-05  5:28     ` Naoya Horiguchi
2014-09-05  5:28       ` Naoya Horiguchi
2014-08-29  1:39 ` [PATCH v3 6/6] mm/hugetlb: remove unused argument of follow_huge_addr() Naoya Horiguchi
2014-08-29  1:39   ` Naoya Horiguchi
2014-09-03 21:26   ` Hugh Dickins
2014-09-03 21:26     ` Hugh Dickins
2014-09-05  5:29     ` Naoya Horiguchi
2014-09-05  5:29       ` Naoya Horiguchi
2014-08-31 15:27 ` [PATCH 0/6] hugepage migration fixes (v3) Andi Kleen
2014-08-31 15:27   ` Andi Kleen
2014-09-01  4:08   ` Naoya Horiguchi
2014-09-01  4:08     ` Naoya Horiguchi
