* [PATCH 0/4] NUMA balancing related fixlets
@ 2014-10-02 13:29 Mel Gorman
  2014-10-02 13:29 ` [PATCH 1/4] mm: remove misleading ARCH_USES_NUMA_PROT_NONE (from Andrew's tree) Mel Gorman
                   ` (3 more replies)
  0 siblings, 4 replies; 18+ messages in thread
From: Mel Gorman @ 2014-10-02 13:29 UTC (permalink / raw)
  To: Dave Jones
  Cc: Linus Torvalds, Hugh Dickins, Al Viro, Rik van Riel, Ingo Molnar,
	Peter Zijlstra, Aneesh Kumar, Michel Lespinasse,
	Kirill A Shutemov, Mel Gorman, Linux Kernel

The following is in reaction to the thread "pipe/page fault oddness".
The first patch cleans up a lot of mess related to ARCH_USES_NUMA_PROT_NONE.
It's already in Andrew's tree but included here for context.

The second patch closes a race between migration completion and mprotect,
hit in particular from the compaction context.

The third patch should have no real effect, but it corrects an oversight
that would matter if MPOL_MF_LAZY were ever exposed to userspace.

The fourth patch addresses a problem whereby split_huge_page can leave a
pte_numa PTE in the wrong place; it is one potential source of the bug
Dave observed.

Note that none of this prevents a pte_protnone approach from being applied
on top, or from replacing patch 4. They should not conflict, but if patch 4
in particular addresses Dave's problem then it may be a considerably
lower-risk change for 3.17.

 arch/powerpc/include/asm/pgtable.h    | 57 ++++++++---------------------------
 arch/powerpc/include/asm/pte-common.h |  5 +++
 arch/x86/Kconfig                      |  1 -
 arch/x86/include/asm/pgtable_types.h  | 14 +++++++++
 include/asm-generic/pgtable.h         | 27 ++++++-----------
 init/Kconfig                          | 11 -------
 mm/huge_memory.c                      |  2 --
 mm/mempolicy.c                        |  4 ++-
 mm/migrate.c                          |  4 ++-
 9 files changed, 46 insertions(+), 79 deletions(-)

-- 
1.8.4.5


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH 1/4] mm: remove misleading ARCH_USES_NUMA_PROT_NONE (from Andrew's tree)
  2014-10-02 13:29 [PATCH 0/4] NUMA balancing related fixlets Mel Gorman
@ 2014-10-02 13:29 ` Mel Gorman
  2014-10-02 13:29 ` [PATCH 2/4] mm: migrate: Close race between migration completion and mprotect Mel Gorman
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 18+ messages in thread
From: Mel Gorman @ 2014-10-02 13:29 UTC (permalink / raw)
  To: Dave Jones
  Cc: Linus Torvalds, Hugh Dickins, Al Viro, Rik van Riel, Ingo Molnar,
	Peter Zijlstra, Aneesh Kumar, Michel Lespinasse,
	Kirill A Shutemov, Mel Gorman, Linux Kernel

(This is already in Andrew's tree but is included for context, because it
 adds a VM_BUG_ON that might trigger faster in the trinity-related testing,
 and because it updates a stale comment in mainline.)

ARCH_USES_NUMA_PROT_NONE was defined for architectures that implemented
_PAGE_NUMA using _PROT_NONE.  This saved using an additional PTE bit and
relied on the fact that PROT_NONE vmas were skipped by the NUMA hinting
fault scanner.  This was found to be conceptually confusing with a lot of
implicit assumptions and it was asked that an alternative be found.

Commit c46a7c81 ("x86: define _PAGE_NUMA by reusing software bits on the
PMD and PTE levels") redefined _PAGE_NUMA on x86 as one of the swap PTE
bits and shrank the maximum possible swap size, but it did not go far
enough.  There are no architectures that reuse _PROT_NONE as _PROT_NUMA
but the relics still exist.

This patch removes ARCH_USES_NUMA_PROT_NONE and eliminates unnecessary
duplication between powerpc and the generic implementation by defining the
types the core NUMA helpers expect (inherited from x86) in terms of their
ppc64 equivalents.  This necessitated creating a PTE bit mask that
identifies the bits distinguishing present from NUMA pte entries, but it is
expected this will only differ between arches based on _PAGE_PROTNONE.
The naming of the generic helpers was taken from x86, but ppc64 has types
that are equivalent for the purposes of the helpers, so they are mapped
instead of duplicating code.
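
As a rough, self-contained model of the encoding this relies on (the bit
positions below are invented for illustration; the real values are
per-architecture), the present, prot_none and NUMA hinting entry types
differ only in the bits covered by _PAGE_NUMA_MASK:

```c
#include <stdint.h>

/* Invented bit positions for illustration only */
#define _PAGE_PRESENT  (1u << 0)
#define _PAGE_PROTNONE (1u << 1)
#define _PAGE_NUMA     (1u << 2)

/* Bits that distinguish present, prot_none and NUMA hinting entries */
#define _PAGE_NUMA_MASK (_PAGE_NUMA | _PAGE_PROTNONE | _PAGE_PRESENT)

typedef uint64_t pteval_t;

static inline pteval_t ptenuma_flags(pteval_t pte)
{
	return pte & _PAGE_NUMA_MASK;
}

/* A NUMA hinting entry has _PAGE_NUMA set with present/protnone clear */
static inline int pte_numa(pteval_t pte)
{
	return ptenuma_flags(pte) == _PAGE_NUMA;
}
```

With _PAGE_PROTNONE folded into the mask, a NUMA hinting entry is exactly
the case where _PAGE_NUMA is the only distinguishing bit set; present and
prot_none entries compare unequal even if they carry other flag bits.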

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
 arch/powerpc/include/asm/pgtable.h    | 57 ++++++++---------------------------
 arch/powerpc/include/asm/pte-common.h |  5 +++
 arch/x86/Kconfig                      |  1 -
 arch/x86/include/asm/pgtable_types.h  | 14 +++++++++
 include/asm-generic/pgtable.h         | 27 ++++++-----------
 init/Kconfig                          | 11 -------
 6 files changed, 40 insertions(+), 75 deletions(-)

diff --git a/arch/powerpc/include/asm/pgtable.h b/arch/powerpc/include/asm/pgtable.h
index d98c1ec..f60d4ea 100644
--- a/arch/powerpc/include/asm/pgtable.h
+++ b/arch/powerpc/include/asm/pgtable.h
@@ -38,10 +38,9 @@ static inline int pte_none(pte_t pte)		{ return (pte_val(pte) & ~_PTE_NONE_MASK)
 static inline pgprot_t pte_pgprot(pte_t pte)	{ return __pgprot(pte_val(pte) & PAGE_PROT_BITS); }
 
 #ifdef CONFIG_NUMA_BALANCING
-
 static inline int pte_present(pte_t pte)
 {
-	return pte_val(pte) & (_PAGE_PRESENT | _PAGE_NUMA);
+	return pte_val(pte) & _PAGE_NUMA_MASK;
 }
 
 #define pte_present_nonuma pte_present_nonuma
@@ -50,37 +49,6 @@ static inline int pte_present_nonuma(pte_t pte)
 	return pte_val(pte) & (_PAGE_PRESENT);
 }
 
-#define pte_numa pte_numa
-static inline int pte_numa(pte_t pte)
-{
-	return (pte_val(pte) &
-		(_PAGE_NUMA|_PAGE_PRESENT)) == _PAGE_NUMA;
-}
-
-#define pte_mknonnuma pte_mknonnuma
-static inline pte_t pte_mknonnuma(pte_t pte)
-{
-	pte_val(pte) &= ~_PAGE_NUMA;
-	pte_val(pte) |=  _PAGE_PRESENT | _PAGE_ACCESSED;
-	return pte;
-}
-
-#define pte_mknuma pte_mknuma
-static inline pte_t pte_mknuma(pte_t pte)
-{
-	/*
-	 * We should not set _PAGE_NUMA on non present ptes. Also clear the
-	 * present bit so that hash_page will return 1 and we collect this
-	 * as numa fault.
-	 */
-	if (pte_present(pte)) {
-		pte_val(pte) |= _PAGE_NUMA;
-		pte_val(pte) &= ~_PAGE_PRESENT;
-	} else
-		VM_BUG_ON(1);
-	return pte;
-}
-
 #define ptep_set_numa ptep_set_numa
 static inline void ptep_set_numa(struct mm_struct *mm, unsigned long addr,
 				 pte_t *ptep)
@@ -92,12 +60,6 @@ static inline void ptep_set_numa(struct mm_struct *mm, unsigned long addr,
 	return;
 }
 
-#define pmd_numa pmd_numa
-static inline int pmd_numa(pmd_t pmd)
-{
-	return pte_numa(pmd_pte(pmd));
-}
-
 #define pmdp_set_numa pmdp_set_numa
 static inline void pmdp_set_numa(struct mm_struct *mm, unsigned long addr,
 				 pmd_t *pmdp)
@@ -109,16 +71,21 @@ static inline void pmdp_set_numa(struct mm_struct *mm, unsigned long addr,
 	return;
 }
 
-#define pmd_mknonnuma pmd_mknonnuma
-static inline pmd_t pmd_mknonnuma(pmd_t pmd)
+/*
+ * Generic NUMA pte helpers expect pteval_t and pmdval_t types to exist
+ * which was inherited from x86. For the purposes of powerpc pte_basic_t and
+ * pmd_t are equivalent
+ */
+#define pteval_t pte_basic_t
+#define pmdval_t pmd_t
+static inline pteval_t ptenuma_flags(pte_t pte)
 {
-	return pte_pmd(pte_mknonnuma(pmd_pte(pmd)));
+	return pte_val(pte) & _PAGE_NUMA_MASK;
 }
 
-#define pmd_mknuma pmd_mknuma
-static inline pmd_t pmd_mknuma(pmd_t pmd)
+static inline pmdval_t pmdnuma_flags(pmd_t pmd)
 {
-	return pte_pmd(pte_mknuma(pmd_pte(pmd)));
+	return pmd_val(pmd) & _PAGE_NUMA_MASK;
 }
 
 # else
diff --git a/arch/powerpc/include/asm/pte-common.h b/arch/powerpc/include/asm/pte-common.h
index 8d1569c..e040c35 100644
--- a/arch/powerpc/include/asm/pte-common.h
+++ b/arch/powerpc/include/asm/pte-common.h
@@ -98,6 +98,11 @@ extern unsigned long bad_call_to_PMD_PAGE_SIZE(void);
 			 _PAGE_USER | _PAGE_ACCESSED | \
 			 _PAGE_RW | _PAGE_HWWRITE | _PAGE_DIRTY | _PAGE_EXEC)
 
+#ifdef CONFIG_NUMA_BALANCING
+/* Mask of bits that distinguish present and numa ptes */
+#define _PAGE_NUMA_MASK (_PAGE_NUMA|_PAGE_PRESENT)
+#endif
+
 /*
  * We define 2 sets of base prot bits, one for basic pages (ie,
  * cacheable kernel and user pages) and one for non cacheable
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 3632743..a4c581f 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -30,7 +30,6 @@ config X86
 	select HAVE_UNSTABLE_SCHED_CLOCK
 	select ARCH_SUPPORTS_NUMA_BALANCING if X86_64
 	select ARCH_SUPPORTS_INT128 if X86_64
-	select ARCH_WANTS_PROT_NUMA_PROT_NONE
 	select HAVE_IDE
 	select HAVE_OPROFILE
 	select HAVE_PCSPKR_PLATFORM
diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index f216963..0f9724c 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -325,6 +325,20 @@ static inline pteval_t pte_flags(pte_t pte)
 	return native_pte_val(pte) & PTE_FLAGS_MASK;
 }
 
+#ifdef CONFIG_NUMA_BALANCING
+/* Set of bits that distinguishes present, prot_none and numa ptes */
+#define _PAGE_NUMA_MASK (_PAGE_NUMA|_PAGE_PROTNONE|_PAGE_PRESENT)
+static inline pteval_t ptenuma_flags(pte_t pte)
+{
+	return pte_flags(pte) & _PAGE_NUMA_MASK;
+}
+
+static inline pmdval_t pmdnuma_flags(pmd_t pmd)
+{
+	return pmd_flags(pmd) & _PAGE_NUMA_MASK;
+}
+#endif /* CONFIG_NUMA_BALANCING */
+
 #define pgprot_val(x)	((x).pgprot)
 #define __pgprot(x)	((pgprot_t) { (x) } )
 
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index 53b2acc..281870f 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -660,11 +660,12 @@ static inline int pmd_trans_unstable(pmd_t *pmd)
 }
 
 #ifdef CONFIG_NUMA_BALANCING
-#ifdef CONFIG_ARCH_USES_NUMA_PROT_NONE
 /*
- * _PAGE_NUMA works identical to _PAGE_PROTNONE (it's actually the
- * same bit too). It's set only when _PAGE_PRESET is not set and it's
- * never set if _PAGE_PRESENT is set.
+ * _PAGE_NUMA distinguishes between an unmapped page table entry, an entry that
+ * is protected for PROT_NONE and a NUMA hinting fault entry. If the
+ * architecture defines _PAGE_PROTNONE then it should take that into account
+ * but those that do not can rely on the fact that the NUMA hinting scanner
+ * skips inaccessible VMAs.
  *
  * pte/pmd_present() returns true if pte/pmd_numa returns true. Page
  * fault triggers on those regions if pte/pmd_numa returns true
@@ -673,16 +674,14 @@ static inline int pmd_trans_unstable(pmd_t *pmd)
 #ifndef pte_numa
 static inline int pte_numa(pte_t pte)
 {
-	return (pte_flags(pte) &
-		(_PAGE_NUMA|_PAGE_PROTNONE|_PAGE_PRESENT)) == _PAGE_NUMA;
+	return ptenuma_flags(pte) == _PAGE_NUMA;
 }
 #endif
 
 #ifndef pmd_numa
 static inline int pmd_numa(pmd_t pmd)
 {
-	return (pmd_flags(pmd) &
-		(_PAGE_NUMA|_PAGE_PROTNONE|_PAGE_PRESENT)) == _PAGE_NUMA;
+	return pmdnuma_flags(pmd) == _PAGE_NUMA;
 }
 #endif
 
@@ -722,6 +721,8 @@ static inline pte_t pte_mknuma(pte_t pte)
 {
 	pteval_t val = pte_val(pte);
 
+	VM_BUG_ON(!(val & _PAGE_PRESENT));
+
 	val &= ~_PAGE_PRESENT;
 	val |= _PAGE_NUMA;
 
@@ -765,16 +766,6 @@ static inline void pmdp_set_numa(struct mm_struct *mm, unsigned long addr,
 }
 #endif
 #else
-extern int pte_numa(pte_t pte);
-extern int pmd_numa(pmd_t pmd);
-extern pte_t pte_mknonnuma(pte_t pte);
-extern pmd_t pmd_mknonnuma(pmd_t pmd);
-extern pte_t pte_mknuma(pte_t pte);
-extern pmd_t pmd_mknuma(pmd_t pmd);
-extern void ptep_set_numa(struct mm_struct *mm, unsigned long addr, pte_t *ptep);
-extern void pmdp_set_numa(struct mm_struct *mm, unsigned long addr, pmd_t *pmdp);
-#endif /* CONFIG_ARCH_USES_NUMA_PROT_NONE */
-#else
 static inline int pmd_numa(pmd_t pmd)
 {
 	return 0;
diff --git a/init/Kconfig b/init/Kconfig
index e84c642..f5c6bb5 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -887,17 +887,6 @@ config ARCH_SUPPORTS_INT128
 config ARCH_WANT_NUMA_VARIABLE_LOCALITY
 	bool
 
-#
-# For architectures that are willing to define _PAGE_NUMA as _PAGE_PROTNONE
-config ARCH_WANTS_PROT_NUMA_PROT_NONE
-	bool
-
-config ARCH_USES_NUMA_PROT_NONE
-	bool
-	default y
-	depends on ARCH_WANTS_PROT_NUMA_PROT_NONE
-	depends on NUMA_BALANCING
-
 config NUMA_BALANCING_DEFAULT_ENABLED
 	bool "Automatically enable NUMA aware memory/task placement"
 	default y
-- 
1.8.4.5



* [PATCH 2/4] mm: migrate: Close race between migration completion and mprotect
  2014-10-02 13:29 [PATCH 0/4] NUMA balancing related fixlets Mel Gorman
  2014-10-02 13:29 ` [PATCH 1/4] mm: remove misleading ARCH_USES_NUMA_PROT_NONE (from Andrew's tree) Mel Gorman
@ 2014-10-02 13:29 ` Mel Gorman
  2014-10-02 15:48   ` Rik van Riel
  2014-10-02 16:41   ` Linus Torvalds
  2014-10-02 13:29 ` [PATCH 3/4] mm: mempolicy: Skip inaccessible VMAs when setting MPOL_MF_LAZY Mel Gorman
  2014-10-02 13:29 ` [PATCH 4/4] mm: numa: Do not mark PTEs pte_numa when splitting huge pages Mel Gorman
  3 siblings, 2 replies; 18+ messages in thread
From: Mel Gorman @ 2014-10-02 13:29 UTC (permalink / raw)
  To: Dave Jones
  Cc: Linus Torvalds, Hugh Dickins, Al Viro, Rik van Riel, Ingo Molnar,
	Peter Zijlstra, Aneesh Kumar, Michel Lespinasse,
	Kirill A Shutemov, Mel Gorman, Linux Kernel

A migration entry is marked as a write entry if pte_write was true at the
time the entry was created. The VMA protections are not double-checked when
migration entries are being removed, because mprotect marks
write-migration-entries as read. This means we may take a spurious fault to
mark PTEs writable again, but that is straightforward to handle. However,
there is a race between write migration entries being marked read and
migrations finishing, which can leave a PTE writable that should have been
read-only. Close the race by double-checking the VMA permissions when
migration completes.
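
The fix can be modelled in a few lines (types and flag values below are
invented for the sketch; the real code operates on pte_t and
vm_area_struct): the write bit carried by the migration entry is honoured
only if the VMA still permits writes when migration completes:

```c
#include <stdbool.h>
#include <stdint.h>

#define VM_WRITE  (1u << 1)   /* invented values for the sketch */
#define PTE_WRITE (1u << 5)

struct vma_model { uint32_t vm_flags; };

/* Mirror of the patched logic: honour the entry's write bit only if
 * the VMA is still writable once migration completes. */
static uint32_t restore_pte(bool entry_was_write, const struct vma_model *vma)
{
	uint32_t pte = 0;

	if (entry_was_write && (vma->vm_flags & VM_WRITE))
		pte |= PTE_WRITE;
	return pte;
}
```

If mprotect raced in and cleared VM_WRITE while the page was under
migration, the restored PTE stays read-only and the worst case is a
spurious fault, never a stray writable mapping.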

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/migrate.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index f78ec9b..f59b5de 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -146,7 +146,9 @@ static int remove_migration_pte(struct page *new, struct vm_area_struct *vma,
 	pte = pte_mkold(mk_pte(new, vma->vm_page_prot));
 	if (pte_swp_soft_dirty(*ptep))
 		pte = pte_mksoft_dirty(pte);
-	if (is_write_migration_entry(entry))
+
+	/* Recheck VMA as permissions can change since migration started */
+	if (is_write_migration_entry(entry) && (vma->vm_flags & (VM_WRITE)))
 		pte = pte_mkwrite(pte);
 #ifdef CONFIG_HUGETLB_PAGE
 	if (PageHuge(new)) {
-- 
1.8.4.5



* [PATCH 3/4] mm: mempolicy: Skip inaccessible VMAs when setting MPOL_MF_LAZY
  2014-10-02 13:29 [PATCH 0/4] NUMA balancing related fixlets Mel Gorman
  2014-10-02 13:29 ` [PATCH 1/4] mm: remove misleading ARCH_USES_NUMA_PROT_NONE (from Andrew's tree) Mel Gorman
  2014-10-02 13:29 ` [PATCH 2/4] mm: migrate: Close race between migration completion and mprotect Mel Gorman
@ 2014-10-02 13:29 ` Mel Gorman
  2014-10-02 15:51   ` Rik van Riel
  2014-10-02 13:29 ` [PATCH 4/4] mm: numa: Do not mark PTEs pte_numa when splitting huge pages Mel Gorman
  3 siblings, 1 reply; 18+ messages in thread
From: Mel Gorman @ 2014-10-02 13:29 UTC (permalink / raw)
  To: Dave Jones
  Cc: Linus Torvalds, Hugh Dickins, Al Viro, Rik van Riel, Ingo Molnar,
	Peter Zijlstra, Aneesh Kumar, Michel Lespinasse,
	Kirill A Shutemov, Mel Gorman, Linux Kernel

PROT_NUMA VMAs are skipped to avoid problems distinguishing between
present, prot_none and special entries. MPOL_MF_LAZY is not visible from
userspace since commit a720094ded8c ("mm: mempolicy: Hide MPOL_NOOP and
MPOL_MF_LAZY from userspace for now") but it should still skip VMAs the
same way task_numa_work does.
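
A minimal model of the check being added (the VM_* values below are
invented for the sketch; the real flags live in the kernel headers): a VMA
is handed to change_prot_numa() only if at least one of read, write or
exec is set, so PROT_NONE VMAs are skipped:

```c
#include <stdbool.h>
#include <stdint.h>

#define VM_READ  (1u << 0)   /* invented bit values for the sketch */
#define VM_WRITE (1u << 1)
#define VM_EXEC  (1u << 2)

/* PROT_NONE VMAs have none of these bits set and are skipped, matching
 * what task_numa_work does for the hinting fault scanner. */
static bool vma_is_accessible(uint32_t vm_flags)
{
	return vm_flags & (VM_READ | VM_WRITE | VM_EXEC);
}
```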

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/mempolicy.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 8f5330d..a5877ce 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -683,7 +683,9 @@ queue_pages_range(struct mm_struct *mm, unsigned long start, unsigned long end,
 		}
 
 		if (flags & MPOL_MF_LAZY) {
-			change_prot_numa(vma, start, endvma);
+			/* Similar to task_numa_work, skip inaccessible VMAs */
+			if (vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE))
+				change_prot_numa(vma, start, endvma);
 			goto next;
 		}
 
-- 
1.8.4.5



* [PATCH 4/4] mm: numa: Do not mark PTEs pte_numa when splitting huge pages
  2014-10-02 13:29 [PATCH 0/4] NUMA balancing related fixlets Mel Gorman
                   ` (2 preceding siblings ...)
  2014-10-02 13:29 ` [PATCH 3/4] mm: mempolicy: Skip inaccessible VMAs when setting MPOL_MF_LAZY Mel Gorman
@ 2014-10-02 13:29 ` Mel Gorman
  2014-10-02 15:18   ` Kirill A. Shutemov
                     ` (2 more replies)
  3 siblings, 3 replies; 18+ messages in thread
From: Mel Gorman @ 2014-10-02 13:29 UTC (permalink / raw)
  To: Dave Jones
  Cc: Linus Torvalds, Hugh Dickins, Al Viro, Rik van Riel, Ingo Molnar,
	Peter Zijlstra, Aneesh Kumar, Michel Lespinasse,
	Kirill A Shutemov, Mel Gorman, Linux Kernel

This patch reverts 1ba6e0b50b ("mm: numa: split_huge_page: transfer the
NUMA type from the pmd to the pte"). If a huge page is being split due to
a protection change and the tail will be in a PROT_NONE VMA, then NUMA
hinting PTEs are temporarily created in the protected VMA.

 VM_RW|VM_PROTNONE
|-----------------|
      ^
      split here

In the specific case above, it should get fixed up by change_pte_range()
but there is a window of opportunity for weirdness to happen. Similarly,
if a huge page is shrunk and split during a protection update but before
pmd_numa is cleared then a pte_numa can be left behind.

Instead of adding complexity to deal with the case, this patch simply does
not mark PTEs NUMA when splitting a huge page. The NUMA hinting faults that
are consequently not triggered are a marginal loss compared to the
complexity of handling the corner cases during THP split.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/huge_memory.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index d9a21d06..17a74d6 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1801,8 +1801,6 @@ static int __split_huge_page_map(struct page *page,
 				entry = pte_wrprotect(entry);
 			if (!pmd_young(*pmd))
 				entry = pte_mkold(entry);
-			if (pmd_numa(*pmd))
-				entry = pte_mknuma(entry);
 			pte = pte_offset_map(&_pmd, haddr);
 			BUG_ON(!pte_none(*pte));
 			set_pte_at(mm, haddr, pte, entry);
-- 
1.8.4.5



* Re: [PATCH 4/4] mm: numa: Do not mark PTEs pte_numa when splitting huge pages
  2014-10-02 13:29 ` [PATCH 4/4] mm: numa: Do not mark PTEs pte_numa when splitting huge pages Mel Gorman
@ 2014-10-02 15:18   ` Kirill A. Shutemov
  2014-10-02 16:19   ` Rik van Riel
  2014-10-02 16:36   ` Linus Torvalds
  2 siblings, 0 replies; 18+ messages in thread
From: Kirill A. Shutemov @ 2014-10-02 15:18 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Dave Jones, Linus Torvalds, Hugh Dickins, Al Viro, Rik van Riel,
	Ingo Molnar, Peter Zijlstra, Aneesh Kumar, Michel Lespinasse,
	Kirill A Shutemov, Linux Kernel

On Thu, Oct 02, 2014 at 02:29:18PM +0100, Mel Gorman wrote:
> This patch reverts 1ba6e0b50b ("mm: numa: split_huge_page: transfer the
> NUMA type from the pmd to the pte"). If a huge page is being split due to
> a protection change and the tail will be in a PROT_NONE vma then NUMA
> hinting PTEs are temporarily created in the protected VMA.
> 
>  VM_RW|VM_PROTNONE
> |-----------------|
>       ^
>       split here
> 
> In the specific case above, it should get fixed up by change_pte_range()
> but there is a window of opportunity for weirdness to happen. Similarly,
> if a huge page is shrunk and split during a protection update but before
> pmd_numa is cleared then a pte_numa can be left behind.
> 
> Instead of adding complexity trying to deal with the case, this patch
> will not mark PTEs NUMA when splitting a huge page. NUMA hinting faults
> will not be triggered which is marginal in comparison to the complexity
> in dealing with the corner cases during THP split.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>

I think it deserves a comment to avoid re-introducing NUMA type
transferring.

Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

-- 
 Kirill A. Shutemov


* Re: [PATCH 2/4] mm: migrate: Close race between migration completion and mprotect
  2014-10-02 13:29 ` [PATCH 2/4] mm: migrate: Close race between migration completion and mprotect Mel Gorman
@ 2014-10-02 15:48   ` Rik van Riel
  2014-10-02 16:41   ` Linus Torvalds
  1 sibling, 0 replies; 18+ messages in thread
From: Rik van Riel @ 2014-10-02 15:48 UTC (permalink / raw)
  To: Mel Gorman, Dave Jones
  Cc: Linus Torvalds, Hugh Dickins, Al Viro, Ingo Molnar,
	Peter Zijlstra, Aneesh Kumar, Michel Lespinasse,
	Kirill A Shutemov, Linux Kernel

On 10/02/2014 09:29 AM, Mel Gorman wrote:
> A migration entry is marked as write if pte_write was true at the time the
> entry was created. The VMA protections are not double checked when migration
> entries are being removed as mprotect marks write-migration-entries as
> read. It means that potentially we take a spurious fault to mark PTEs write
> again but it's straight-forward. However, there is a race between write
> migrations being marked read and migrations finishing. This potentially
> allows a PTE to be write that should have been read. Close this race by
> double checking the VMA permissions when migration completes.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Acked-by: Rik van Riel <riel@redhat.com>



* Re: [PATCH 3/4] mm: mempolicy: Skip inaccessible VMAs when setting MPOL_MF_LAZY
  2014-10-02 13:29 ` [PATCH 3/4] mm: mempolicy: Skip inaccessible VMAs when setting MPOL_MF_LAZY Mel Gorman
@ 2014-10-02 15:51   ` Rik van Riel
  0 siblings, 0 replies; 18+ messages in thread
From: Rik van Riel @ 2014-10-02 15:51 UTC (permalink / raw)
  To: Mel Gorman, Dave Jones
  Cc: Linus Torvalds, Hugh Dickins, Al Viro, Ingo Molnar,
	Peter Zijlstra, Aneesh Kumar, Michel Lespinasse,
	Kirill A Shutemov, Linux Kernel

On 10/02/2014 09:29 AM, Mel Gorman wrote:
> PROT_NUMA VMAs are skipped to avoid problems distinguishing between
> present, prot_none and special entries. MPOL_MF_LAZY is not visible from
> userspace since commit a720094ded8c ("mm: mempolicy: Hide MPOL_NOOP and
> MPOL_MF_LAZY from userspace for now") but it should still skip VMAs the
> same way task_numa_work does.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Acked-by: Rik van Riel <riel@redhat.com>



* Re: [PATCH 4/4] mm: numa: Do not mark PTEs pte_numa when splitting huge pages
  2014-10-02 13:29 ` [PATCH 4/4] mm: numa: Do not mark PTEs pte_numa when splitting huge pages Mel Gorman
  2014-10-02 15:18   ` Kirill A. Shutemov
@ 2014-10-02 16:19   ` Rik van Riel
  2014-10-02 16:36   ` Linus Torvalds
  2 siblings, 0 replies; 18+ messages in thread
From: Rik van Riel @ 2014-10-02 16:19 UTC (permalink / raw)
  To: Mel Gorman, Dave Jones
  Cc: Linus Torvalds, Hugh Dickins, Al Viro, Ingo Molnar,
	Peter Zijlstra, Aneesh Kumar, Michel Lespinasse,
	Kirill A Shutemov, Linux Kernel

On 10/02/2014 09:29 AM, Mel Gorman wrote:
> This patch reverts 1ba6e0b50b ("mm: numa: split_huge_page: transfer the
> NUMA type from the pmd to the pte"). If a huge page is being split due
> a protection change and the tail will be in a PROT_NONE vma then NUMA
> hinting PTEs are temporarily created in the protected VMA.
> 
>  VM_RW|VM_PROTNONE
> |-----------------|
>       ^
>       split here
> 
> In the specific case above, it should get fixed up by change_pte_range()
> but there is a window of opportunity for weirdness to happen. Similarly,
> if a huge page is shrunk and split during a protection update but before
> pmd_numa is cleared then a pte_numa can be left behind.
> 
> Instead of adding complexity trying to deal with the case, this patch
> will not mark PTEs NUMA when splitting a huge page. NUMA hinting faults
> will not be triggered which is marginal in comparison to the complexity
> in dealing with the corner cases during THP split.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Acked-by: Rik van Riel <riel@redhat.com>



* Re: [PATCH 4/4] mm: numa: Do not mark PTEs pte_numa when splitting huge pages
  2014-10-02 13:29 ` [PATCH 4/4] mm: numa: Do not mark PTEs pte_numa when splitting huge pages Mel Gorman
  2014-10-02 15:18   ` Kirill A. Shutemov
  2014-10-02 16:19   ` Rik van Riel
@ 2014-10-02 16:36   ` Linus Torvalds
  2014-10-02 18:58     ` Sasha Levin
  2 siblings, 1 reply; 18+ messages in thread
From: Linus Torvalds @ 2014-10-02 16:36 UTC (permalink / raw)
  To: Mel Gorman, Sasha Levin
  Cc: Dave Jones, Hugh Dickins, Al Viro, Rik van Riel, Ingo Molnar,
	Peter Zijlstra, Aneesh Kumar, Michel Lespinasse,
	Kirill A Shutemov, Linux Kernel

On Thu, Oct 2, 2014 at 6:29 AM, Mel Gorman <mgorman@suse.de> wrote:
> This patch reverts 1ba6e0b50b ("mm: numa: split_huge_page: transfer the
> NUMA type from the pmd to the pte"). If a huge page is being split due to
> a protection change and the tail will be in a PROT_NONE vma then NUMA
> hinting PTEs are temporarily created in the protected VMA.

So this is the particular bug I was worried about when tracing through the code.

Should I just apply this as-is? And mark it for stable, since this has
been around since 3.8 or so. It would seem to be a very safe change to
do, regardless of whether this is actually the issue that Dave and
maybe Sasha are seeing.

Sasha, I notice that you weren't on the cc for Mel's patches (probably
because you got added later to the other thread), but they were all
cc'd to lkml so you should see them there. Or I can forward them
separately.

            Linus


* Re: [PATCH 2/4] mm: migrate: Close race between migration completion and mprotect
  2014-10-02 13:29 ` [PATCH 2/4] mm: migrate: Close race between migration completion and mprotect Mel Gorman
  2014-10-02 15:48   ` Rik van Riel
@ 2014-10-02 16:41   ` Linus Torvalds
  1 sibling, 0 replies; 18+ messages in thread
From: Linus Torvalds @ 2014-10-02 16:41 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Dave Jones, Hugh Dickins, Al Viro, Rik van Riel, Ingo Molnar,
	Peter Zijlstra, Aneesh Kumar, Michel Lespinasse,
	Kirill A Shutemov, Linux Kernel

On Thu, Oct 2, 2014 at 6:29 AM, Mel Gorman <mgorman@suse.de> wrote:
> +
> +       /* Recheck VMA as permissions can change since migration started */
> +       if (is_write_migration_entry(entry) && (vma->vm_flags & (VM_WRITE)))
>                 pte = pte_mkwrite(pte);

Side note: we have a "maybe_mkwrite()" helper for the VM_WRITE testing
case. So I'd almost do this as

        if (is_write_migration_entry(entry))
                pte = maybe_mkwrite(pte, vma);

instead.

              Linus


* Re: [PATCH 4/4] mm: numa: Do not mark PTEs pte_numa when splitting huge pages
  2014-10-02 16:36   ` Linus Torvalds
@ 2014-10-02 18:58     ` Sasha Levin
  2014-10-02 19:03       ` Linus Torvalds
  0 siblings, 1 reply; 18+ messages in thread
From: Sasha Levin @ 2014-10-02 18:58 UTC (permalink / raw)
  To: Linus Torvalds, Mel Gorman
  Cc: Dave Jones, Hugh Dickins, Al Viro, Rik van Riel, Ingo Molnar,
	Peter Zijlstra, Aneesh Kumar, Michel Lespinasse,
	Kirill A Shutemov, Linux Kernel

On 10/02/2014 12:36 PM, Linus Torvalds wrote:
> On Thu, Oct 2, 2014 at 6:29 AM, Mel Gorman <mgorman@suse.de> wrote:
>> > This patch reverts 1ba6e0b50b ("mm: numa: split_huge_page: transfer the
>> > NUMA type from the pmd to the pte"). If a huge page is being split due to
>> > a protection change and the tail will be in a PROT_NONE vma then NUMA
>> > hinting PTEs are temporarily created in the protected VMA.
> So this is the particular bug I was worried about when tracing through the code.
> 
> Should I just apply this as-is? And mark it for stable, since this has
> been around since 3.8 or so. It would seem to be a very safe change to
> do, regardless of whether this is actually the issue that Dave and
> maybe Sasha are seeing.
> 
> Sasha, I notice that you weren't on the cc for Mel's patches (probably
> because you got added later to the other thread), but they were all
> cc'd to lkml so you should see them there. Or I can forward them
> separately.

I grabbed them and will keep them in my tree for now instead of your
NUMA-chainsaw-massacre patch.

You've also mentioned that while I can tell you if nothing dies, I can't
really tell you if everything is working well. Is there a reasonable way
to easily say if NUMA is working properly? Even something that would just
tell me "your NUMA balancing seems to be sane" would be good.


Thanks,
Sasha


* Re: [PATCH 4/4] mm: numa: Do not mark PTEs pte_numa when splitting huge pages
  2014-10-02 18:58     ` Sasha Levin
@ 2014-10-02 19:03       ` Linus Torvalds
  2014-10-02 19:07         ` Kirill A. Shutemov
  0 siblings, 1 reply; 18+ messages in thread
From: Linus Torvalds @ 2014-10-02 19:03 UTC (permalink / raw)
  To: Sasha Levin
  Cc: Mel Gorman, Dave Jones, Hugh Dickins, Al Viro, Rik van Riel,
	Ingo Molnar, Peter Zijlstra, Aneesh Kumar, Michel Lespinasse,
	Kirill A Shutemov, Linux Kernel

On Thu, Oct 2, 2014 at 11:58 AM, Sasha Levin <sasha.levin@oracle.com> wrote:
>
> You've also mentioned that while I can tell you if nothing dies, I can't
> really tell you if everything is working well. Is there a reasonable way
> to easily say if NUMA is working properly? Even something that would just
> tell me "your NUMA balancing seems to be sane" would be good.

So not having a NUMA machine (and not really wanting one), I can't
really test things like the migration even *working*.

But the easy/obvious test would be to run the autonuma benchmarks, and
verify that the performance is where it's expected to be and that the
migration actually works.

I think the "main" source site is

  https://gitorious.org/autonuma-benchmark

but haven't verified.
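
For a rough liveness check short of a full benchmark, the numa_* counters
in /proc/vmstat (present when CONFIG_NUMA_BALANCING is enabled) show
whether hinting faults and migrations are happening at all. A sketch,
parsing a captured sample (the numbers are made up) so the snippet runs on
any machine:

```shell
# On a live system the raw data comes from:  grep '^numa_' /proc/vmstat
# Rising numa_hint_faults / numa_pages_migrated over time suggests the
# balancing machinery is at least exercising.
sample="numa_hint_faults 10417
numa_hint_faults_local 9081
numa_pages_migrated 2643"

migrated=$(printf '%s\n' "$sample" | awk '$1 == "numa_pages_migrated" { print $2 }')
faults=$(printf '%s\n' "$sample" | awk '$1 == "numa_hint_faults" { print $2 }')

echo "hint faults: $faults, pages migrated: $migrated"
```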

             Linus


* Re: [PATCH 4/4] mm: numa: Do not mark PTEs pte_numa when splitting huge pages
  2014-10-02 19:03       ` Linus Torvalds
@ 2014-10-02 19:07         ` Kirill A. Shutemov
  2014-10-02 19:12           ` Sasha Levin
  2014-10-02 19:26           ` Linus Torvalds
  0 siblings, 2 replies; 18+ messages in thread
From: Kirill A. Shutemov @ 2014-10-02 19:07 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Sasha Levin, Mel Gorman, Dave Jones, Hugh Dickins, Al Viro,
	Rik van Riel, Ingo Molnar, Peter Zijlstra, Aneesh Kumar,
	Michel Lespinasse, Kirill A Shutemov, Linux Kernel

On Thu, Oct 02, 2014 at 12:03:58PM -0700, Linus Torvalds wrote:
> On Thu, Oct 2, 2014 at 11:58 AM, Sasha Levin <sasha.levin@oracle.com> wrote:
> >
> > You've also mentioned that while I can tell you if nothing dies, I can't
> > really tell you if everything is working well. Is there a reasonable way
> > to easily say if NUMA is working properly? Even something that would just
> > tell me "your NUMA balancing seems to be sane" would be good.
> 
> So not having a NUMA machine (and not really wanting one), I can't
> really test things like the migration even *working*.

I believe Sasha uses fakenuma in his KVM for that.

-- 
 Kirill A. Shutemov


* Re: [PATCH 4/4] mm: numa: Do not mark PTEs pte_numa when splitting huge pages
  2014-10-02 19:07         ` Kirill A. Shutemov
@ 2014-10-02 19:12           ` Sasha Levin
  2014-10-02 19:26           ` Linus Torvalds
  1 sibling, 0 replies; 18+ messages in thread
From: Sasha Levin @ 2014-10-02 19:12 UTC (permalink / raw)
  To: Kirill A. Shutemov, Linus Torvalds
  Cc: Mel Gorman, Dave Jones, Hugh Dickins, Al Viro, Rik van Riel,
	Ingo Molnar, Peter Zijlstra, Aneesh Kumar, Michel Lespinasse,
	Kirill A Shutemov, Linux Kernel

On 10/02/2014 03:07 PM, Kirill A. Shutemov wrote:
> On Thu, Oct 02, 2014 at 12:03:58PM -0700, Linus Torvalds wrote:
>> On Thu, Oct 2, 2014 at 11:58 AM, Sasha Levin <sasha.levin@oracle.com> wrote:
>>>
>>> You've also mentioned that while I can tell you if nothing dies, I can't
>>> really tell you if everything is working well. Is there a reasonable way
>>> to easily say if NUMA is working properly? Even something that would just
>>> tell me "your NUMA balancing seems to be sane" would be good.
>> 
>> So not having a NUMA machine (and not really wanting one), I can't
>> really test things like the migration even *working*.
> I believe Sasha uses fakenuma in his KVM for that.

That's true. You can impose arbitrary configurations on your kernel even without your
hardware telling you to...


Thanks,
Sasha
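[Editor's note: a rough sketch of the fakenuma setup described above, for readers unfamiliar with it. The QEMU flags, memory sizes, and kernel/rootfs paths are illustrative only, not Sasha's actual configuration; `-numa node,cpus=...,mem=...` is the QEMU syntax of that era, and `numa=fake=` requires an x86 kernel built with CONFIG_NUMA_EMU.]

```shell
# Boot a KVM guest whose firmware advertises four NUMA nodes even
# though the host may have only one.  Sizes/paths are illustrative.
qemu-system-x86_64 -enable-kvm -m 4096 -smp 4 \
    -numa node,cpus=0,mem=1024 \
    -numa node,cpus=1,mem=1024 \
    -numa node,cpus=2,mem=1024 \
    -numa node,cpus=3,mem=1024 \
    -kernel bzImage -append "root=/dev/vda console=ttyS0"

# Alternatively, a kernel built with CONFIG_NUMA_EMU can split a single
# physical node itself via the boot parameter numa=fake=4.  Either way,
# the fake topology is visible from inside the guest:
#   ls /sys/devices/system/node/        # node0 node1 node2 node3 ...
#   numactl --hardware
```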


* Re: [PATCH 4/4] mm: numa: Do not mark PTEs pte_numa when splitting huge pages
  2014-10-02 19:07         ` Kirill A. Shutemov
  2014-10-02 19:12           ` Sasha Levin
@ 2014-10-02 19:26           ` Linus Torvalds
  2014-10-02 19:28             ` Rik van Riel
  1 sibling, 1 reply; 18+ messages in thread
From: Linus Torvalds @ 2014-10-02 19:26 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Sasha Levin, Mel Gorman, Dave Jones, Hugh Dickins, Al Viro,
	Rik van Riel, Ingo Molnar, Peter Zijlstra, Aneesh Kumar,
	Michel Lespinasse, Kirill A Shutemov, Linux Kernel

On Thu, Oct 2, 2014 at 12:07 PM, Kirill A. Shutemov
<kirill@shutemov.name> wrote:
>
> I believe Sasha uses fakenuma in his KVM for that.

Ok, so the benchmarks won't do anything then.

I mean, I guess they might show some of the migration overhead, but
they won't show the actual end result in any meaningful manner, since
memory isn't actually NUMA.

                Linus


* Re: [PATCH 4/4] mm: numa: Do not mark PTEs pte_numa when splitting huge pages
  2014-10-02 19:26           ` Linus Torvalds
@ 2014-10-02 19:28             ` Rik van Riel
  2014-10-02 19:32               ` Linus Torvalds
  0 siblings, 1 reply; 18+ messages in thread
From: Rik van Riel @ 2014-10-02 19:28 UTC (permalink / raw)
  To: Linus Torvalds, Kirill A. Shutemov
  Cc: Sasha Levin, Mel Gorman, Dave Jones, Hugh Dickins, Al Viro,
	Ingo Molnar, Peter Zijlstra, Aneesh Kumar, Michel Lespinasse,
	Kirill A Shutemov, Linux Kernel

On 10/02/2014 03:26 PM, Linus Torvalds wrote:
> On Thu, Oct 2, 2014 at 12:07 PM, Kirill A. Shutemov
> <kirill@shutemov.name> wrote:
>>
>> I believe Sasha uses fakenuma in his KVM for that.
> 
> Ok, so the benchmarks won't do anything then.
> 
> I mean, I guess they might show some of the migration overhead, but
> they won't show the actual end result in any meaningful manner, since
> memory isn't actually NUMA.

Both autonuma and "perf bench numa mem" mostly tell us how
quickly the kernel manages to locate tasks and their memory
on the nodes where they belong, without doing much in the
way of NUMA performance measuring.

They are more useful as sanity tests than anything else.

"Does the kernel still properly place each process on its
own node, and how quickly does it do that?"



* Re: [PATCH 4/4] mm: numa: Do not mark PTEs pte_numa when splitting huge pages
  2014-10-02 19:28             ` Rik van Riel
@ 2014-10-02 19:32               ` Linus Torvalds
  0 siblings, 0 replies; 18+ messages in thread
From: Linus Torvalds @ 2014-10-02 19:32 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Kirill A. Shutemov, Sasha Levin, Mel Gorman, Dave Jones,
	Hugh Dickins, Al Viro, Ingo Molnar, Peter Zijlstra, Aneesh Kumar,
	Michel Lespinasse, Kirill A Shutemov, Linux Kernel

On Thu, Oct 2, 2014 at 12:28 PM, Rik van Riel <riel@redhat.com> wrote:
>
> They are more useful as sanity tests than anything else.

.. and that's exactly what I'd like to hear about the protnone patch.
Did the patch screw something up so that it doesn't count the faults
properly?

              Linus


end of thread, other threads:[~2014-10-02 19:32 UTC | newest]

Thread overview: 18+ messages
2014-10-02 13:29 [PATCH 0/4] NUMA balancing related fixlets Mel Gorman
2014-10-02 13:29 ` [PATCH 1/4] mm: remove misleading ARCH_USES_NUMA_PROT_NONE (from Andrew's tree) Mel Gorman
2014-10-02 13:29 ` [PATCH 2/4] mm: migrate: Close race between migration completion and mprotect Mel Gorman
2014-10-02 15:48   ` Rik van Riel
2014-10-02 16:41   ` Linus Torvalds
2014-10-02 13:29 ` [PATCH 3/4] mm: mempolicy: Skip inaccessible VMAs when setting MPOL_MF_LAZY Mel Gorman
2014-10-02 15:51   ` Rik van Riel
2014-10-02 13:29 ` [PATCH 4/4] mm: numa: Do not mark PTEs pte_numa when splitting huge pages Mel Gorman
2014-10-02 15:18   ` Kirill A. Shutemov
2014-10-02 16:19   ` Rik van Riel
2014-10-02 16:36   ` Linus Torvalds
2014-10-02 18:58     ` Sasha Levin
2014-10-02 19:03       ` Linus Torvalds
2014-10-02 19:07         ` Kirill A. Shutemov
2014-10-02 19:12           ` Sasha Levin
2014-10-02 19:26           ` Linus Torvalds
2014-10-02 19:28             ` Rik van Riel
2014-10-02 19:32               ` Linus Torvalds
