linux-kernel.vger.kernel.org archive mirror
* [RFC PATCH 0/7] Replace _PAGE_NUMA with PAGE_NONE protections
@ 2014-11-14 13:32 Mel Gorman
  2014-11-14 13:33 ` [PATCH 1/7] mm: Add p[te|md] protnone helpers for use by NUMA balancing Mel Gorman
                   ` (9 more replies)
  0 siblings, 10 replies; 20+ messages in thread
From: Mel Gorman @ 2014-11-14 13:32 UTC (permalink / raw)
  To: Linux Kernel
  Cc: Linux-MM, Aneesh Kumar, Hugh Dickins, Dave Jones, Rik van Riel,
	Ingo Molnar, Kirill Shutemov, Sasha Levin, Linus Torvalds,
	Mel Gorman

This is a follow-up to the "pipe/page fault oddness" thread.

Automatic NUMA balancing depends on being able to protect PTEs to trap a
fault and gather reference locality information. Very broadly speaking, it
marks PTEs as not present and uses another bit to distinguish NUMA hinting
faults from other types of faults. It was universally loved by everybody
and caused no problems whatsoever. That last sentence might be a lie.
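
As a rough illustration of the change in approach, the x86 test for a NUMA
hinting entry before and after this series looks approximately like this
(simplified from the patches that follow, not the complete definitions):

/* Old scheme: a dedicated software bit, tested with the present bit clear */
static inline int pte_numa(pte_t pte)
{
	return (pte_flags(pte) &
		(_PAGE_NUMA | _PAGE_PROTNONE | _PAGE_PRESENT)) == _PAGE_NUMA;
}

/* New scheme: a NUMA hinting entry is an ordinary PROT_NONE protected PTE */
static inline int pte_protnone_numa(pte_t pte)
{
	return pte_flags(pte) & _PAGE_PROTNONE;
}

The point is that no page table bit needs to be reserved for NUMA hinting;
the faults are trapped with the same PAGE_NONE protections that PROT_NONE
mappings already use.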

This series is very heavily based on patches from Linus and Aneesh to
replace the existing PTE/PMD NUMA helper functions with normal change
protections. I did alter and add parts of it but I consider them relatively
minor contributions. Note that the signed-offs here need addressing. I
couldn't use "From" or Signed-off-by from the original authors as the
patches had to be broken up and they were never signed off. I expect the
two people involved will just stick their signed-off-by on it.

This has received *no* testing at all on ppc64. I'm depending on Aneesh
for that. I did test on a 4-node x86-64 machine with just the basics --
specjbb2005 on single and multi JVM workloads both configured for short
runs and autonumabench. In most cases I'm leaving out detail as it's not
that interesting.

specjbb single JVM: There was a negligible performance difference that was
	well within the noise. Overall performance and system activity were
	roughly comparable. Memory node balance was roughly similar. System
	CPU usage was very slightly elevated.

specjbb multi JVM: Negligible performance difference; system CPU usage was
	slightly elevated but overall system activity was roughly similar.

autonumabench: This was all over the place and about all that can be
	reasonably concluded is that it's different but not necessarily
	better or worse. I'll go into more detail on this one below.

autonumabench
                                          3.18.0-rc4            3.18.0-rc4
                                             vanilla         protnone-v1r7
Time User-NUMA01                  32806.01 (  0.00%)    33049.91 ( -0.74%)
Time User-NUMA01_THEADLOCAL       23910.28 (  0.00%)    22874.91 (  4.33%)
Time User-NUMA02                   3176.85 (  0.00%)     3116.52 (  1.90%)
Time User-NUMA02_SMT               1600.06 (  0.00%)     1645.56 ( -2.84%)
Time System-NUMA01                  719.07 (  0.00%)     1065.31 (-48.15%)
Time System-NUMA01_THEADLOCAL       916.26 (  0.00%)      365.23 ( 60.14%)
Time System-NUMA02                   20.92 (  0.00%)       17.42 ( 16.73%)
Time System-NUMA02_SMT                8.76 (  0.00%)        5.20 ( 40.64%)
Time Elapsed-NUMA01                 728.27 (  0.00%)      759.89 ( -4.34%)
Time Elapsed-NUMA01_THEADLOCAL      589.15 (  0.00%)      560.95 (  4.79%)
Time Elapsed-NUMA02                  81.20 (  0.00%)       84.78 ( -4.41%)
Time Elapsed-NUMA02_SMT              80.49 (  0.00%)       85.29 ( -5.96%)
Time CPU-NUMA01                    4603.00 (  0.00%)     4489.00 (  2.48%)
Time CPU-NUMA01_THEADLOCAL         4213.00 (  0.00%)     4142.00 (  1.69%)
Time CPU-NUMA02                    3937.00 (  0.00%)     3696.00 (  6.12%)
Time CPU-NUMA02_SMT                1998.00 (  0.00%)     1935.00 (  3.15%)

System CPU usage of NUMA01 is worse but it's an adverse workload on this
machine so I'm reluctant to conclude that it's a problem that matters. On
the other workloads that are sensible on this machine, system CPU usage
is great. Overall time to complete the benchmark is comparable.

            3.18.0-rc4    3.18.0-rc4
               vanilla protnone-v1r7
User          61493.38      60687.08
System         1665.17       1453.32
Elapsed        1480.79       1492.53

The NUMA stats are as follows:

NUMA alloc hit                 4739774     4618019
NUMA alloc miss                      0           0
NUMA interleave hit                  0           0
NUMA alloc local               4664980     4589938
NUMA base PTE updates        556489407   589530598
NUMA huge PMD updates          1086000     1150114
NUMA page range updates     1112521407  1178388966
NUMA hint faults               1538964     1427999
NUMA hint local faults          835871      831469
NUMA hint local percent             54          58
NUMA pages migrated            7329212    18992993
AutoNUMA cost                   11729%      11627%

The NUMA pages migrated count looks terrible but when I looked at a graph of
the activity over time I saw that the massive spike in migration activity was
during NUMA01. This correlates with high system CPU usage and could simply be
down to bad luck, but any modifications that affect that workload would be
related to scan rates and migrations, not the protection mechanism. For
all other workloads, migration activity was comparable.

Based on these results I concluded that performance-wise this series
is similar but from a maintenance perspective it's probably better. I
suspect that system CPU usage may be slightly higher overall but nowhere
near enough to justify a lot of complexity.

 arch/powerpc/include/asm/pgtable.h    |  53 ++----------
 arch/powerpc/include/asm/pte-common.h |   5 --
 arch/powerpc/include/asm/pte-hash64.h |   6 --
 arch/powerpc/kvm/book3s_hv_rm_mmu.c   |   2 +-
 arch/powerpc/mm/fault.c               |   5 --
 arch/powerpc/mm/gup.c                 |   4 +-
 arch/x86/include/asm/pgtable.h        |  46 +++++-----
 arch/x86/include/asm/pgtable_64.h     |   5 --
 arch/x86/include/asm/pgtable_types.h  |  41 +--------
 arch/x86/mm/gup.c                     |   4 +-
 include/asm-generic/pgtable.h         | 152 ++--------------------------------
 include/linux/swapops.h               |   2 +-
 include/uapi/linux/mempolicy.h        |   2 +-
 mm/gup.c                              |   8 +-
 mm/huge_memory.c                      |  53 ++++++------
 mm/memory.c                           |  25 ++++--
 mm/mempolicy.c                        |   2 +-
 mm/migrate.c                          |   2 +-
 mm/mprotect.c                         |  44 +++++-----
 mm/pgtable-generic.c                  |   2 -
 20 files changed, 108 insertions(+), 355 deletions(-)

-- 
1.8.4.5



* [PATCH 1/7] mm: Add p[te|md] protnone helpers for use by NUMA balancing
  2014-11-14 13:32 [RFC PATCH 0/7] Replace _PAGE_NUMA with PAGE_NONE protections Mel Gorman
@ 2014-11-14 13:33 ` Mel Gorman
  2014-11-14 13:33 ` [PATCH 2/7] mm: Convert p[te|md]_numa users to p[te|md]_protnone_numa Mel Gorman
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 20+ messages in thread
From: Mel Gorman @ 2014-11-14 13:33 UTC (permalink / raw)
  To: Linux Kernel
  Cc: Linux-MM, Aneesh Kumar, Hugh Dickins, Dave Jones, Rik van Riel,
	Ingo Molnar, Kirill Shutemov, Sasha Levin, Linus Torvalds,
	Mel Gorman

This is a preparatory patch that introduces protnone helpers for automatic
NUMA balancing.

Needs-signed-off: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
Needs-signed-off: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 arch/powerpc/include/asm/pgtable.h | 11 +++++++++++
 arch/x86/include/asm/pgtable.h     | 16 ++++++++++++++++
 include/asm-generic/pgtable.h      | 19 +++++++++++++++++++
 3 files changed, 46 insertions(+)

diff --git a/arch/powerpc/include/asm/pgtable.h b/arch/powerpc/include/asm/pgtable.h
index 316f9a5..452c3b4 100644
--- a/arch/powerpc/include/asm/pgtable.h
+++ b/arch/powerpc/include/asm/pgtable.h
@@ -39,6 +39,17 @@ static inline int pte_none(pte_t pte)		{ return (pte_val(pte) & ~_PTE_NONE_MASK)
 static inline pgprot_t pte_pgprot(pte_t pte)	{ return __pgprot(pte_val(pte) & PAGE_PROT_BITS); }
 
 #ifdef CONFIG_NUMA_BALANCING
+static inline int pte_protnone_numa(pte_t pte)
+{
+	return (pte_val(pte) &
+		(_PAGE_PRESENT | _PAGE_USER)) == _PAGE_PRESENT;
+}
+
+static inline int pmd_protnone_numa(pmd_t pmd)
+{
+	return pte_protnone_numa(pmd_pte(pmd));
+}
+
 static inline int pte_present(pte_t pte)
 {
 	return pte_val(pte) & _PAGE_NUMA_MASK;
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index aa97a07..613cd00 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -497,6 +497,22 @@ static inline int pmd_present(pmd_t pmd)
 				 _PAGE_NUMA);
 }
 
+#ifdef CONFIG_NUMA_BALANCING
+/*
+ * These work without NUMA balancing but the kernel does not care. See the
+ * comment in include/asm-generic/pgtable.h
+ */
+static inline int pte_protnone_numa(pte_t pte)
+{
+	return pte_flags(pte) & _PAGE_PROTNONE;
+}
+
+static inline int pmd_protnone_numa(pmd_t pmd)
+{
+	return pmd_flags(pmd) & _PAGE_PROTNONE;
+}
+#endif /* CONFIG_NUMA_BALANCING */
+
 static inline int pmd_none(pmd_t pmd)
 {
 	/* Only check low word on 32-bit platforms, since it might be
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index 752e30d..7e74122 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -677,6 +677,25 @@ static inline int pmd_trans_unstable(pmd_t *pmd)
 #endif
 }
 
+#ifndef CONFIG_NUMA_BALANCING
+/*
+ * Technically a PTE can be PROTNONE even when not doing NUMA balancing but
+ * the only case the kernel cares is for NUMA balancing. By default,
+ * implement the helper as "always no". Note that this does not check VMA
+ * protection bits meaning that it is up to the caller to distinguish between
+ * PROT_NONE protections and NUMA hinting fault protections.
+ */
+static inline int pte_protnone_numa(pte_t pte)
+{
+	return 0;
+}
+
+static inline int pmd_protnone_numa(pmd_t pmd)
+{
+	return 0;
+}
+#endif /* CONFIG_NUMA_BALANCING */
+
 #ifdef CONFIG_NUMA_BALANCING
 /*
  * _PAGE_NUMA distinguishes between an unmapped page table entry, an entry that
-- 
1.8.4.5



* [PATCH 2/7] mm: Convert p[te|md]_numa users to p[te|md]_protnone_numa
  2014-11-14 13:32 [RFC PATCH 0/7] Replace _PAGE_NUMA with PAGE_NONE protections Mel Gorman
  2014-11-14 13:33 ` [PATCH 1/7] mm: Add p[te|md] protnone helpers for use by NUMA balancing Mel Gorman
@ 2014-11-14 13:33 ` Mel Gorman
  2014-11-14 13:33 ` [PATCH 3/7] mm: Convert p[te|md]_mknonnuma and remaining page table manipulations Mel Gorman
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 20+ messages in thread
From: Mel Gorman @ 2014-11-14 13:33 UTC (permalink / raw)
  To: Linux Kernel
  Cc: Linux-MM, Aneesh Kumar, Hugh Dickins, Dave Jones, Rik van Riel,
	Ingo Molnar, Kirill Shutemov, Sasha Levin, Linus Torvalds,
	Mel Gorman

Convert existing users of pte_numa and friends to the new helper. Note
that the kernel is broken after this patch is applied until the other
page table modifiers are also altered. This patch layout is to make
review easier.

Needs-signed-off: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
Needs-signed-off: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 arch/powerpc/kvm/book3s_hv_rm_mmu.c |  2 +-
 arch/powerpc/mm/fault.c             |  5 -----
 arch/powerpc/mm/gup.c               |  4 ++--
 arch/x86/mm/gup.c                   |  4 ++--
 include/uapi/linux/mempolicy.h      |  2 +-
 mm/gup.c                            |  8 ++++----
 mm/huge_memory.c                    | 16 +++++++--------
 mm/memory.c                         |  4 ++--
 mm/mprotect.c                       | 39 ++++++++++---------------------------
 mm/pgtable-generic.c                |  2 +-
 10 files changed, 31 insertions(+), 55 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
index 084ad54..bbd8499 100644
--- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c
+++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
@@ -235,7 +235,7 @@ long kvmppc_do_h_enter(struct kvm *kvm, unsigned long flags,
 		pte_size = psize;
 		pte = lookup_linux_pte_and_update(pgdir, hva, writing,
 						  &pte_size);
-		if (pte_present(pte) && !pte_numa(pte)) {
+		if (pte_present(pte) && !pte_protnone_numa(pte)) {
 			if (writing && !pte_write(pte))
 				/* make the actual HPTE be read-only */
 				ptel = hpte_make_readonly(ptel);
diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index 08d659a..5007497 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -405,8 +405,6 @@ good_area:
 		 * processors use the same I/D cache coherency mechanism
 		 * as embedded.
 		 */
-		if (error_code & DSISR_PROTFAULT)
-			goto bad_area;
 #endif /* CONFIG_PPC_STD_MMU */
 
 		/*
@@ -430,9 +428,6 @@ good_area:
 		flags |= FAULT_FLAG_WRITE;
 	/* a read */
 	} else {
-		/* protection fault */
-		if (error_code & 0x08000000)
-			goto bad_area;
 		if (!(vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE)))
 			goto bad_area;
 	}
diff --git a/arch/powerpc/mm/gup.c b/arch/powerpc/mm/gup.c
index d874668..d870d93 100644
--- a/arch/powerpc/mm/gup.c
+++ b/arch/powerpc/mm/gup.c
@@ -39,7 +39,7 @@ static noinline int gup_pte_range(pmd_t pmd, unsigned long addr,
 		/*
 		 * Similar to the PMD case, NUMA hinting must take slow path
 		 */
-		if (pte_numa(pte))
+		if (pte_protnone_numa(pte))
 			return 0;
 
 		if ((pte_val(pte) & mask) != result)
@@ -85,7 +85,7 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
 			 * slowpath for accounting purposes and so that they
 			 * can be serialised against THP migration.
 			 */
-			if (pmd_numa(pmd))
+			if (pmd_protnone_numa(pmd))
 				return 0;
 
 			if (!gup_hugepte((pte_t *)pmdp, PMD_SIZE, addr, next,
diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
index 207d9aef..47ce479 100644
--- a/arch/x86/mm/gup.c
+++ b/arch/x86/mm/gup.c
@@ -84,7 +84,7 @@ static noinline int gup_pte_range(pmd_t pmd, unsigned long addr,
 		struct page *page;
 
 		/* Similar to the PMD case, NUMA hinting must take slow path */
-		if (pte_numa(pte)) {
+		if (pte_protnone_numa(pte)) {
 			pte_unmap(ptep);
 			return 0;
 		}
@@ -178,7 +178,7 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
 			 * slowpath for accounting purposes and so that they
 			 * can be serialised against THP migration.
 			 */
-			if (pmd_numa(pmd))
+			if (pmd_protnone_numa(pmd))
 				return 0;
 			if (!gup_huge_pmd(pmd, addr, next, write, pages, nr))
 				return 0;
diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
index 0d11c3d..e52379b 100644
--- a/include/uapi/linux/mempolicy.h
+++ b/include/uapi/linux/mempolicy.h
@@ -67,7 +67,7 @@ enum mpol_rebind_step {
 #define MPOL_F_LOCAL   (1 << 1)	/* preferred local allocation */
 #define MPOL_F_REBINDING (1 << 2)	/* identify policies in rebinding */
 #define MPOL_F_MOF	(1 << 3) /* this policy wants migrate on fault */
-#define MPOL_F_MORON	(1 << 4) /* Migrate On pte_numa Reference On Node */
+#define MPOL_F_MORON	(1 << 4) /* Migrate On pte_protnone_numa Reference On Node */
 
 
 #endif /* _UAPI_LINUX_MEMPOLICY_H */
diff --git a/mm/gup.c b/mm/gup.c
index cd62c8c..aec34cb 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -64,7 +64,7 @@ retry:
 		migration_entry_wait(mm, pmd, address);
 		goto retry;
 	}
-	if ((flags & FOLL_NUMA) && pte_numa(pte))
+	if ((flags & FOLL_NUMA) && pte_protnone_numa(pte))
 		goto no_page;
 	if ((flags & FOLL_WRITE) && !pte_write(pte)) {
 		pte_unmap_unlock(ptep, ptl);
@@ -193,7 +193,7 @@ struct page *follow_page_mask(struct vm_area_struct *vma,
 		}
 		return page;
 	}
-	if ((flags & FOLL_NUMA) && pmd_numa(*pmd))
+	if ((flags & FOLL_NUMA) && pmd_protnone_numa(*pmd))
 		return no_page_table(vma, flags);
 	if (pmd_trans_huge(*pmd)) {
 		if (flags & FOLL_SPLIT) {
@@ -743,7 +743,7 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
 		 * path
 		 */
 		if (!pte_present(pte) || pte_special(pte) ||
-			pte_numa(pte) || (write && !pte_write(pte)))
+			pte_protnone_numa(pte) || (write && !pte_write(pte)))
 			goto pte_unmap;
 
 		VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
@@ -895,7 +895,7 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
 			 * slowpath for accounting purposes and so that they
 			 * can be serialised against THP migration.
 			 */
-			if (pmd_numa(pmd))
+			if (pmd_protnone_numa(pmd))
 				return 0;
 
 			if (!gup_huge_pmd(pmd, pmdp, addr, next, write,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index de98415..f6e5a8b 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1223,7 +1223,7 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
 		return ERR_PTR(-EFAULT);
 
 	/* Full NUMA hinting faults to serialise migration in fault paths */
-	if ((flags & FOLL_NUMA) && pmd_numa(*pmd))
+	if ((flags & FOLL_NUMA) && pmd_protnone_numa(*pmd))
 		goto out;
 
 	page = pmd_page(*pmd);
@@ -1353,7 +1353,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 
 	/*
 	 * Migrate the THP to the requested node, returns with page unlocked
-	 * and pmd_numa cleared.
+	 * and access rights restored.
 	 */
 	spin_unlock(ptl);
 	migrated = migrate_misplaced_transhuge_page(mm, vma,
@@ -1368,7 +1368,7 @@ clear_pmdnuma:
 	BUG_ON(!PageLocked(page));
 	pmd = pmd_mknonnuma(pmd);
 	set_pmd_at(mm, haddr, pmdp, pmd);
-	VM_BUG_ON(pmd_numa(*pmdp));
+	VM_BUG_ON(pmd_protnone_numa(*pmdp));
 	update_mmu_cache_pmd(vma, addr, pmdp);
 	unlock_page(page);
 out_unlock:
@@ -1513,7 +1513,7 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 		ret = 1;
 		if (!prot_numa) {
 			entry = pmdp_get_and_clear(mm, addr, pmd);
-			if (pmd_numa(entry))
+			if (pmd_protnone_numa(entry))
 				entry = pmd_mknonnuma(entry);
 			entry = pmd_modify(entry, newprot);
 			ret = HPAGE_PMD_NR;
@@ -1529,7 +1529,7 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 			 * local vs remote hits on the zero page.
 			 */
 			if (!is_huge_zero_page(page) &&
-			    !pmd_numa(*pmd)) {
+			    !pmd_protnone_numa(*pmd)) {
 				pmdp_set_numa(mm, addr, pmd);
 				ret = HPAGE_PMD_NR;
 			}
@@ -1796,9 +1796,9 @@ static int __split_huge_page_map(struct page *page,
 			pte_t *pte, entry;
 			BUG_ON(PageCompound(page+i));
 			/*
-			 * Note that pmd_numa is not transferred deliberately
-			 * to avoid any possibility that pte_numa leaks to
-			 * a PROT_NONE VMA by accident.
+			 * Note that NUMA hinting access restrictions are not
+			 * transferred to avoid any possibility of altering
+			 * permissions across VMAs.
 			 */
 			entry = mk_pte(page + i, vma->vm_page_prot);
 			entry = maybe_mkwrite(pte_mkdirty(entry), vma);
diff --git a/mm/memory.c b/mm/memory.c
index 3e50383..96ceb0a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3220,7 +3220,7 @@ static int handle_pte_fault(struct mm_struct *mm,
 					pte, pmd, flags, entry);
 	}
 
-	if (pte_numa(entry))
+	if (pte_protnone_numa(entry))
 		return do_numa_page(mm, vma, address, entry, pte, pmd);
 
 	ptl = pte_lockptr(mm, pmd);
@@ -3298,7 +3298,7 @@ static int __handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 			if (pmd_trans_splitting(orig_pmd))
 				return 0;
 
-			if (pmd_numa(orig_pmd))
+			if (pmd_protnone_numa(orig_pmd))
 				return do_huge_pmd_numa_page(mm, vma, address,
 							     orig_pmd, pmd);
 
diff --git a/mm/mprotect.c b/mm/mprotect.c
index ace9345..e93ddac 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -75,36 +75,17 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 		oldpte = *pte;
 		if (pte_present(oldpte)) {
 			pte_t ptent;
-			bool updated = false;
-
-			if (!prot_numa) {
-				ptent = ptep_modify_prot_start(mm, addr, pte);
-				if (pte_numa(ptent))
-					ptent = pte_mknonnuma(ptent);
-				ptent = pte_modify(ptent, newprot);
-				/*
-				 * Avoid taking write faults for pages we
-				 * know to be dirty.
-				 */
-				if (dirty_accountable && pte_dirty(ptent) &&
-				    (pte_soft_dirty(ptent) ||
-				     !(vma->vm_flags & VM_SOFTDIRTY)))
-					ptent = pte_mkwrite(ptent);
-				ptep_modify_prot_commit(mm, addr, pte, ptent);
-				updated = true;
-			} else {
-				struct page *page;
-
-				page = vm_normal_page(vma, addr, oldpte);
-				if (page && !PageKsm(page)) {
-					if (!pte_numa(oldpte)) {
-						ptep_set_numa(mm, addr, pte);
-						updated = true;
-					}
-				}
+			ptent = ptep_modify_prot_start(mm, addr, pte);
+			ptent = pte_modify(ptent, newprot);
+
+			/* Avoid taking write faults for known dirty pages */
+			if (dirty_accountable && pte_dirty(ptent) &&
+					(pte_soft_dirty(ptent) ||
+					 !(vma->vm_flags & VM_SOFTDIRTY))) {
+				ptent = pte_mkwrite(ptent);
 			}
-			if (updated)
-				pages++;
+			ptep_modify_prot_commit(mm, addr, pte, ptent);
+			pages++;
 		} else if (IS_ENABLED(CONFIG_MIGRATION) && !pte_file(oldpte)) {
 			swp_entry_t entry = pte_to_swp_entry(oldpte);
 
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index dfb79e0..a2d8587 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -193,7 +193,7 @@ void pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
 		     pmd_t *pmdp)
 {
 	pmd_t entry = *pmdp;
-	if (pmd_numa(entry))
+	if (pmd_protnone_numa(entry))
 		entry = pmd_mknonnuma(entry);
 	set_pmd_at(vma->vm_mm, address, pmdp, pmd_mknotpresent(entry));
 	flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
-- 
1.8.4.5



* [PATCH 3/7] mm: Convert p[te|md]_mknonnuma and remaining page table manipulations
  2014-11-14 13:32 [RFC PATCH 0/7] Replace _PAGE_NUMA with PAGE_NONE protections Mel Gorman
  2014-11-14 13:33 ` [PATCH 1/7] mm: Add p[te|md] protnone helpers for use by NUMA balancing Mel Gorman
  2014-11-14 13:33 ` [PATCH 2/7] mm: Convert p[te|md]_numa users to p[te|md]_protnone_numa Mel Gorman
@ 2014-11-14 13:33 ` Mel Gorman
  2014-11-14 13:33 ` [PATCH 4/7] mm: Remove remaining references to NUMA hinting bits and helpers Mel Gorman
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 20+ messages in thread
From: Mel Gorman @ 2014-11-14 13:33 UTC (permalink / raw)
  To: Linux Kernel
  Cc: Linux-MM, Aneesh Kumar, Hugh Dickins, Dave Jones, Rik van Riel,
	Ingo Molnar, Kirill Shutemov, Sasha Levin, Linus Torvalds,
	Mel Gorman

With PROT_NONE, the traditional page table manipulation functions are
sufficient.
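
In other words, the special-purpose helpers collapse into ordinary protection
operations. A condensed sketch of what that means (taken loosely from the
hunks below, not new code):

	/* Marking a range for NUMA hinting is a normal protection change */
	nr_updated = change_protection(vma, addr, end, PAGE_NONE, 0, 1);

	/* ... and the hinting fault clears it with a normal modify */
	pte = pte_modify(pte, vma->vm_page_prot);
	pte = pte_mkyoung(pte);
	set_pte_at(mm, addr, ptep, pte);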

Needs-signed-off: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
Needs-signed-off: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/huge_mm.h |  3 +--
 mm/huge_memory.c        | 33 +++++++--------------------------
 mm/memory.c             | 17 +++++++++++------
 mm/mempolicy.c          |  2 +-
 mm/migrate.c            |  2 +-
 mm/mprotect.c           |  2 +-
 mm/pgtable-generic.c    |  2 --
 7 files changed, 22 insertions(+), 39 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index ad9051b..554bbe3 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -31,8 +31,7 @@ extern int move_huge_pmd(struct vm_area_struct *vma,
 			 unsigned long new_addr, unsigned long old_end,
 			 pmd_t *old_pmd, pmd_t *new_pmd);
 extern int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
-			unsigned long addr, pgprot_t newprot,
-			int prot_numa);
+			unsigned long addr, pgprot_t newprot);
 
 enum transparent_hugepage_flag {
 	TRANSPARENT_HUGEPAGE_FLAG,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index f6e5a8b..8295c9a 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1366,9 +1366,8 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	goto out;
 clear_pmdnuma:
 	BUG_ON(!PageLocked(page));
-	pmd = pmd_mknonnuma(pmd);
+	pmd = pmd_modify(pmd, vma->vm_page_prot);
 	set_pmd_at(mm, haddr, pmdp, pmd);
-	VM_BUG_ON(pmd_protnone_numa(*pmdp));
 	update_mmu_cache_pmd(vma, addr, pmdp);
 	unlock_page(page);
 out_unlock:
@@ -1502,7 +1501,7 @@ out:
  *  - HPAGE_PMD_NR is protections changed and TLB flush necessary
  */
 int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
-		unsigned long addr, pgprot_t newprot, int prot_numa)
+		unsigned long addr, pgprot_t newprot)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	spinlock_t *ptl;
@@ -1511,29 +1510,11 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 	if (__pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
 		pmd_t entry;
 		ret = 1;
-		if (!prot_numa) {
-			entry = pmdp_get_and_clear(mm, addr, pmd);
-			if (pmd_protnone_numa(entry))
-				entry = pmd_mknonnuma(entry);
-			entry = pmd_modify(entry, newprot);
-			ret = HPAGE_PMD_NR;
-			set_pmd_at(mm, addr, pmd, entry);
-			BUG_ON(pmd_write(entry));
-		} else {
-			struct page *page = pmd_page(*pmd);
-
-			/*
-			 * Do not trap faults against the zero page. The
-			 * read-only data is likely to be read-cached on the
-			 * local CPU cache and it is less useful to know about
-			 * local vs remote hits on the zero page.
-			 */
-			if (!is_huge_zero_page(page) &&
-			    !pmd_protnone_numa(*pmd)) {
-				pmdp_set_numa(mm, addr, pmd);
-				ret = HPAGE_PMD_NR;
-			}
-		}
+		entry = pmdp_get_and_clear(mm, addr, pmd);
+		entry = pmd_modify(entry, newprot);
+		ret = HPAGE_PMD_NR;
+		set_pmd_at(mm, addr, pmd, entry);
+		BUG_ON(pmd_write(entry));
 		spin_unlock(ptl);
 	}
 
diff --git a/mm/memory.c b/mm/memory.c
index 96ceb0a..62604b1 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3105,7 +3105,8 @@ static int numa_migrate_prep(struct page *page, struct vm_area_struct *vma,
 }
 
 static int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
-		   unsigned long addr, pte_t pte, pte_t *ptep, pmd_t *pmd)
+		   unsigned long addr, pte_t pte, pte_t *ptep, pmd_t *pmd,
+		   unsigned int fault_flags)
 {
 	struct page *page = NULL;
 	spinlock_t *ptl;
@@ -3120,9 +3121,9 @@ static int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	* validation through pte_unmap_same(). It's of NUMA type but
 	* the pfn may be screwed if the read is non atomic.
 	*
-	* ptep_modify_prot_start is not called as this is clearing
-	* the _PAGE_NUMA bit and it is not really expected that there
-	* would be concurrent hardware modifications to the PTE.
+	* We can safely just do a "set_pte_at()", because the old
+	* page table entry is not accessible, so there would be no
+	* concurrent hardware modifications to the PTE.
 	*/
 	ptl = pte_lockptr(mm, pmd);
 	spin_lock(ptl);
@@ -3131,7 +3132,11 @@ static int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		goto out;
 	}
 
-	pte = pte_mknonnuma(pte);
+	/* Make it present again */
+	pte = pte_modify(pte, vma->vm_page_prot);
+	pte = pte_mkyoung(pte);
+	if (fault_flags & FAULT_FLAG_WRITE)
+		pte = maybe_mkwrite(pte_mkdirty(pte), vma);
 	set_pte_at(mm, addr, ptep, pte);
 	update_mmu_cache(vma, addr, ptep);
 
@@ -3221,7 +3226,7 @@ static int handle_pte_fault(struct mm_struct *mm,
 	}
 
 	if (pte_protnone_numa(entry))
-		return do_numa_page(mm, vma, address, entry, pte, pmd);
+		return do_numa_page(mm, vma, address, entry, pte, pmd, flags);
 
 	ptl = pte_lockptr(mm, pmd);
 	spin_lock(ptl);
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index e58725a..9d61dce 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -633,7 +633,7 @@ unsigned long change_prot_numa(struct vm_area_struct *vma,
 {
 	int nr_updated;
 
-	nr_updated = change_protection(vma, addr, end, vma->vm_page_prot, 0, 1);
+	nr_updated = change_protection(vma, addr, end, PAGE_NONE, 0, 1);
 	if (nr_updated)
 		count_vm_numa_events(NUMA_PTE_UPDATES, nr_updated);
 
diff --git a/mm/migrate.c b/mm/migrate.c
index 0143995..26fa71f 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1896,7 +1896,7 @@ out_fail:
 out_dropref:
 	ptl = pmd_lock(mm, pmd);
 	if (pmd_same(*pmd, entry)) {
-		entry = pmd_mknonnuma(entry);
+		entry = pmd_modify(entry, vma->vm_page_prot);
 		set_pmd_at(mm, mmun_start, pmd, entry);
 		update_mmu_cache_pmd(vma, address, &entry);
 	}
diff --git a/mm/mprotect.c b/mm/mprotect.c
index e93ddac..dc65c0f 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -141,7 +141,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 				split_huge_page_pmd(vma, addr, pmd);
 			else {
 				int nr_ptes = change_huge_pmd(vma, pmd, addr,
-						newprot, prot_numa);
+						newprot);
 
 				if (nr_ptes) {
 					if (nr_ptes == HPAGE_PMD_NR) {
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index a2d8587..c25f94b 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -193,8 +193,6 @@ void pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
 		     pmd_t *pmdp)
 {
 	pmd_t entry = *pmdp;
-	if (pmd_protnone_numa(entry))
-		entry = pmd_mknonnuma(entry);
 	set_pmd_at(vma->vm_mm, address, pmdp, pmd_mknotpresent(entry));
 	flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
 }
-- 
1.8.4.5



* [PATCH 4/7] mm: Remove remaining references to NUMA hinting bits and helpers
  2014-11-14 13:32 [RFC PATCH 0/7] Replace _PAGE_NUMA with PAGE_NONE protections Mel Gorman
                   ` (2 preceding siblings ...)
  2014-11-14 13:33 ` [PATCH 3/7] mm: Convert p[te|md]_mknonnuma and remaining page table manipulations Mel Gorman
@ 2014-11-14 13:33 ` Mel Gorman
  2014-11-14 13:33 ` [PATCH 5/7] mm: numa: Do not trap faults on the huge zero page Mel Gorman
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 20+ messages in thread
From: Mel Gorman @ 2014-11-14 13:33 UTC (permalink / raw)
  To: Linux Kernel
  Cc: Linux-MM, Aneesh Kumar, Hugh Dickins, Dave Jones, Rik van Riel,
	Ingo Molnar, Kirill Shutemov, Sasha Levin, Linus Torvalds,
	Mel Gorman

This patch removes the NUMA PTE bits and associated helpers. As a side-effect
it increases the maximum possible swap space on x86-64.

One potential source of problems is races between the marking of PTEs
PROT_NONE, NUMA hinting faults and migration. It must be guaranteed that
a PTE being protected is not faulted in parallel, seen as a pte_none and
corrupting memory. The base case is safe but transhuge has had problems in
the past due to a different migration mechanism and a dependence on the page
lock to serialise migrations, so it warrants a closer look.

task_work hinting update			parallel fault
------------------------			--------------
change_pmd_range
  change_huge_pmd
    __pmd_trans_huge_lock
      pmdp_get_and_clear
						__handle_mm_fault
						pmd_none
						  do_huge_pmd_anonymous_page
						  read? pmd_lock blocks until hinting complete, fail !pmd_none test
						  write? __do_huge_pmd_anonymous_page acquires pmd_lock, checks pmd_none
      pmd_modify
      set_pmd_at

task_work hinting update			parallel migration
------------------------			------------------
change_pmd_range
  change_huge_pmd
    __pmd_trans_huge_lock
      pmdp_get_and_clear
						__handle_mm_fault
						  do_huge_pmd_numa_page
						    migrate_misplaced_transhuge_page
						    pmd_lock waits for updates to complete, recheck pmd_same
      pmd_modify
      set_pmd_at

Both of those are safe and the case where a transhuge page is inserted
during a protection update is unchanged. The case where two processes try
migrating at the same time is unchanged by this series so should still be
ok. I could not find a case where we are accidentally depending on the
PTE not being cleared and flushed. If one is missed, it'll manifest as
corruption problems that start triggering shortly after this series is
merged and only happen when NUMA balancing is enabled.
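
For reference, the fault side serialises against a parallel protection update
by taking the PMD lock and rechecking that the PMD is unchanged before acting
on it, roughly as follows (abridged from the existing do_huge_pmd_numa_page,
not something introduced by this patch):

	ptl = pmd_lock(mm, pmdp);
	if (unlikely(!pmd_same(pmd, *pmdp)))
		goto out_unlock;	/* raced with the update, rely on a new fault */

It is this recheck, combined with pmdp_get_and_clear being issued under the
same lock on the update side, that the two scenarios above depend on.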

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 arch/powerpc/include/asm/pgtable.h    |  54 +-----------
 arch/powerpc/include/asm/pte-common.h |   5 --
 arch/powerpc/include/asm/pte-hash64.h |   6 --
 arch/x86/include/asm/pgtable.h        |  22 +----
 arch/x86/include/asm/pgtable_64.h     |   5 --
 arch/x86/include/asm/pgtable_types.h  |  41 +--------
 include/asm-generic/pgtable.h         | 155 ----------------------------------
 include/linux/swapops.h               |   2 +-
 8 files changed, 7 insertions(+), 283 deletions(-)

diff --git a/arch/powerpc/include/asm/pgtable.h b/arch/powerpc/include/asm/pgtable.h
index 452c3b4..2e074e7 100644
--- a/arch/powerpc/include/asm/pgtable.h
+++ b/arch/powerpc/include/asm/pgtable.h
@@ -49,64 +49,12 @@ static inline int pmd_protnone_numa(pmd_t pmd)
 {
 	return pte_protnone_numa(pmd_pte(pmd));
 }
-
-static inline int pte_present(pte_t pte)
-{
-	return pte_val(pte) & _PAGE_NUMA_MASK;
-}
-
-#define pte_present_nonuma pte_present_nonuma
-static inline int pte_present_nonuma(pte_t pte)
-{
-	return pte_val(pte) & (_PAGE_PRESENT);
-}
-
-#define ptep_set_numa ptep_set_numa
-static inline void ptep_set_numa(struct mm_struct *mm, unsigned long addr,
-				 pte_t *ptep)
-{
-	if ((pte_val(*ptep) & _PAGE_PRESENT) == 0)
-		VM_BUG_ON(1);
-
-	pte_update(mm, addr, ptep, _PAGE_PRESENT, _PAGE_NUMA, 0);
-	return;
-}
-
-#define pmdp_set_numa pmdp_set_numa
-static inline void pmdp_set_numa(struct mm_struct *mm, unsigned long addr,
-				 pmd_t *pmdp)
-{
-	if ((pmd_val(*pmdp) & _PAGE_PRESENT) == 0)
-		VM_BUG_ON(1);
-
-	pmd_hugepage_update(mm, addr, pmdp, _PAGE_PRESENT, _PAGE_NUMA);
-	return;
-}
-
-/*
- * Generic NUMA pte helpers expect pteval_t and pmdval_t types to exist
- * which was inherited from x86. For the purposes of powerpc pte_basic_t and
- * pmd_t are equivalent
- */
-#define pteval_t pte_basic_t
-#define pmdval_t pmd_t
-static inline pteval_t ptenuma_flags(pte_t pte)
-{
-	return pte_val(pte) & _PAGE_NUMA_MASK;
-}
-
-static inline pmdval_t pmdnuma_flags(pmd_t pmd)
-{
-	return pmd_val(pmd) & _PAGE_NUMA_MASK;
-}
-
-# else
+#endif /* CONFIG_NUMA_BALANCING */
 
 static inline int pte_present(pte_t pte)
 {
 	return pte_val(pte) & _PAGE_PRESENT;
 }
-#endif /* CONFIG_NUMA_BALANCING */
 
 /* Conversion functions: convert a page and protection to a page entry,
  * and a page entry and page directory to the page they refer to.
diff --git a/arch/powerpc/include/asm/pte-common.h b/arch/powerpc/include/asm/pte-common.h
index e040c35..8d1569c 100644
--- a/arch/powerpc/include/asm/pte-common.h
+++ b/arch/powerpc/include/asm/pte-common.h
@@ -98,11 +98,6 @@ extern unsigned long bad_call_to_PMD_PAGE_SIZE(void);
 			 _PAGE_USER | _PAGE_ACCESSED | \
 			 _PAGE_RW | _PAGE_HWWRITE | _PAGE_DIRTY | _PAGE_EXEC)
 
-#ifdef CONFIG_NUMA_BALANCING
-/* Mask of bits that distinguish present and numa ptes */
-#define _PAGE_NUMA_MASK (_PAGE_NUMA|_PAGE_PRESENT)
-#endif
-
 /*
  * We define 2 sets of base prot bits, one for basic pages (ie,
  * cacheable kernel and user pages) and one for non cacheable
diff --git a/arch/powerpc/include/asm/pte-hash64.h b/arch/powerpc/include/asm/pte-hash64.h
index 2505d8e..55aea0c 100644
--- a/arch/powerpc/include/asm/pte-hash64.h
+++ b/arch/powerpc/include/asm/pte-hash64.h
@@ -27,12 +27,6 @@
 #define _PAGE_RW		0x0200 /* software: user write access allowed */
 #define _PAGE_BUSY		0x0800 /* software: PTE & hash are busy */
 
-/*
- * Used for tracking numa faults
- */
-#define _PAGE_NUMA	0x00000010 /* Gather numa placement stats */
-
-
 /* No separate kernel read-only */
 #define _PAGE_KERNEL_RW		(_PAGE_RW | _PAGE_DIRTY) /* user access blocked by key */
 #define _PAGE_KERNEL_RO		 _PAGE_KERNEL_RW
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 613cd00..f8799e0 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -299,7 +299,7 @@ static inline pmd_t pmd_mkwrite(pmd_t pmd)
 
 static inline pmd_t pmd_mknotpresent(pmd_t pmd)
 {
-	return pmd_clear_flags(pmd, _PAGE_PRESENT);
+	return pmd_clear_flags(pmd, _PAGE_PRESENT | _PAGE_PROTNONE);
 }
 
 #ifdef CONFIG_HAVE_ARCH_SOFT_DIRTY
@@ -457,13 +457,6 @@ static inline int pte_same(pte_t a, pte_t b)
 
 static inline int pte_present(pte_t a)
 {
-	return pte_flags(a) & (_PAGE_PRESENT | _PAGE_PROTNONE |
-			       _PAGE_NUMA);
-}
-
-#define pte_present_nonuma pte_present_nonuma
-static inline int pte_present_nonuma(pte_t a)
-{
 	return pte_flags(a) & (_PAGE_PRESENT | _PAGE_PROTNONE);
 }
 
@@ -473,7 +466,7 @@ static inline bool pte_accessible(struct mm_struct *mm, pte_t a)
 	if (pte_flags(a) & _PAGE_PRESENT)
 		return true;
 
-	if ((pte_flags(a) & (_PAGE_PROTNONE | _PAGE_NUMA)) &&
+	if ((pte_flags(a) & _PAGE_PROTNONE) &&
 			mm_tlb_flush_pending(mm))
 		return true;
 
@@ -493,8 +486,7 @@ static inline int pmd_present(pmd_t pmd)
 	 * the _PAGE_PSE flag will remain set at all times while the
 	 * _PAGE_PRESENT bit is clear).
 	 */
-	return pmd_flags(pmd) & (_PAGE_PRESENT | _PAGE_PROTNONE | _PAGE_PSE |
-				 _PAGE_NUMA);
+	return pmd_flags(pmd) & (_PAGE_PRESENT | _PAGE_PROTNONE | _PAGE_PSE);
 }
 
 #ifdef CONFIG_NUMA_BALANCING
@@ -569,11 +561,6 @@ static inline pte_t *pte_offset_kernel(pmd_t *pmd, unsigned long address)
 
 static inline int pmd_bad(pmd_t pmd)
 {
-#ifdef CONFIG_NUMA_BALANCING
-	/* pmd_numa check */
-	if ((pmd_flags(pmd) & (_PAGE_NUMA|_PAGE_PRESENT)) == _PAGE_NUMA)
-		return 0;
-#endif
 	return (pmd_flags(pmd) & ~_PAGE_USER) != _KERNPG_TABLE;
 }
 
@@ -892,19 +879,16 @@ static inline void update_mmu_cache_pmd(struct vm_area_struct *vma,
 #ifdef CONFIG_HAVE_ARCH_SOFT_DIRTY
 static inline pte_t pte_swp_mksoft_dirty(pte_t pte)
 {
-	VM_BUG_ON(pte_present_nonuma(pte));
 	return pte_set_flags(pte, _PAGE_SWP_SOFT_DIRTY);
 }
 
 static inline int pte_swp_soft_dirty(pte_t pte)
 {
-	VM_BUG_ON(pte_present_nonuma(pte));
 	return pte_flags(pte) & _PAGE_SWP_SOFT_DIRTY;
 }
 
 static inline pte_t pte_swp_clear_soft_dirty(pte_t pte)
 {
-	VM_BUG_ON(pte_present_nonuma(pte));
 	return pte_clear_flags(pte, _PAGE_SWP_SOFT_DIRTY);
 }
 #endif
diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
index 4572b2f..06ffca8 100644
--- a/arch/x86/include/asm/pgtable_64.h
+++ b/arch/x86/include/asm/pgtable_64.h
@@ -146,12 +146,7 @@ static inline int pgd_large(pgd_t pgd) { return 0; }
 
 /* Encode and de-code a swap entry */
 #define SWP_TYPE_BITS (_PAGE_BIT_FILE - _PAGE_BIT_PRESENT - 1)
-#ifdef CONFIG_NUMA_BALANCING
-/* Automatic NUMA balancing needs to be distinguishable from swap entries */
-#define SWP_OFFSET_SHIFT (_PAGE_BIT_PROTNONE + 2)
-#else
 #define SWP_OFFSET_SHIFT (_PAGE_BIT_PROTNONE + 1)
-#endif
 
 #define MAX_SWAPFILES_CHECK() BUILD_BUG_ON(MAX_SWAPFILES_SHIFT > SWP_TYPE_BITS)
 
diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index 0778964..d299cdd 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -27,14 +27,6 @@
 #define _PAGE_BIT_SOFT_DIRTY	_PAGE_BIT_SOFTW3 /* software dirty tracking */
 #define _PAGE_BIT_NX           63       /* No execute: only valid after cpuid check */
 
-/*
- * Swap offsets on configurations that allow automatic NUMA balancing use the
- * bits after _PAGE_BIT_GLOBAL. To uniquely distinguish NUMA hinting PTEs from
- * swap entries, we use the first bit after _PAGE_BIT_GLOBAL and shrink the
- * maximum possible swap space from 16TB to 8TB.
- */
-#define _PAGE_BIT_NUMA		(_PAGE_BIT_GLOBAL+1)
-
 /* If _PAGE_BIT_PRESENT is clear, we use these: */
 /* - if the user mapped it with PROT_NONE; pte_present gives true */
 #define _PAGE_BIT_PROTNONE	_PAGE_BIT_GLOBAL
@@ -78,21 +70,6 @@
 #endif
 
 /*
- * _PAGE_NUMA distinguishes between a numa hinting minor fault and a page
- * that is not present. The hinting fault gathers numa placement statistics
- * (see pte_numa()). The bit is always zero when the PTE is not present.
- *
- * The bit picked must be always zero when the pmd is present and not
- * present, so that we don't lose information when we set it while
- * atomically clearing the present bit.
- */
-#ifdef CONFIG_NUMA_BALANCING
-#define _PAGE_NUMA	(_AT(pteval_t, 1) << _PAGE_BIT_NUMA)
-#else
-#define _PAGE_NUMA	(_AT(pteval_t, 0))
-#endif
-
-/*
  * Tracking soft dirty bit when a page goes to a swap is tricky.
  * We need a bit which can be stored in pte _and_ not conflict
  * with swap entry format. On x86 bits 6 and 7 are *not* involved
@@ -125,8 +102,8 @@
 /* Set of bits not changed in pte_modify */
 #define _PAGE_CHG_MASK	(PTE_PFN_MASK | _PAGE_PCD | _PAGE_PWT |		\
 			 _PAGE_SPECIAL | _PAGE_ACCESSED | _PAGE_DIRTY |	\
-			 _PAGE_SOFT_DIRTY | _PAGE_NUMA)
-#define _HPAGE_CHG_MASK (_PAGE_CHG_MASK | _PAGE_PSE | _PAGE_NUMA)
+			 _PAGE_SOFT_DIRTY)
+#define _HPAGE_CHG_MASK (_PAGE_CHG_MASK | _PAGE_PSE)
 
 #define _PAGE_CACHE_MASK	(_PAGE_PCD | _PAGE_PWT)
 #define _PAGE_CACHE_WB		(0)
@@ -324,20 +301,6 @@ static inline pteval_t pte_flags(pte_t pte)
 	return native_pte_val(pte) & PTE_FLAGS_MASK;
 }
 
-#ifdef CONFIG_NUMA_BALANCING
-/* Set of bits that distinguishes present, prot_none and numa ptes */
-#define _PAGE_NUMA_MASK (_PAGE_NUMA|_PAGE_PROTNONE|_PAGE_PRESENT)
-static inline pteval_t ptenuma_flags(pte_t pte)
-{
-	return pte_flags(pte) & _PAGE_NUMA_MASK;
-}
-
-static inline pmdval_t pmdnuma_flags(pmd_t pmd)
-{
-	return pmd_flags(pmd) & _PAGE_NUMA_MASK;
-}
-#endif /* CONFIG_NUMA_BALANCING */
-
 #define pgprot_val(x)	((x).pgprot)
 #define __pgprot(x)	((pgprot_t) { (x) } )
 
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index 7e74122..323e914 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -233,10 +233,6 @@ static inline int pmd_same(pmd_t pmd_a, pmd_t pmd_b)
 # define pte_accessible(mm, pte)	((void)(pte), 1)
 #endif
 
-#ifndef pte_present_nonuma
-#define pte_present_nonuma(pte) pte_present(pte)
-#endif
-
 #ifndef flush_tlb_fix_spurious_fault
 #define flush_tlb_fix_spurious_fault(vma, address) flush_tlb_page(vma, address)
 #endif
@@ -696,157 +692,6 @@ static inline int pmd_protnone_numa(pmd_t pmd)
 }
 #endif /* CONFIG_NUMA_BALANCING */
 
-#ifdef CONFIG_NUMA_BALANCING
-/*
- * _PAGE_NUMA distinguishes between an unmapped page table entry, an entry that
- * is protected for PROT_NONE and a NUMA hinting fault entry. If the
- * architecture defines __PAGE_PROTNONE then it should take that into account
- * but those that do not can rely on the fact that the NUMA hinting scanner
- * skips inaccessible VMAs.
- *
- * pte/pmd_present() returns true if pte/pmd_numa returns true. Page
- * fault triggers on those regions if pte/pmd_numa returns true
- * (because _PAGE_PRESENT is not set).
- */
-#ifndef pte_numa
-static inline int pte_numa(pte_t pte)
-{
-	return ptenuma_flags(pte) == _PAGE_NUMA;
-}
-#endif
-
-#ifndef pmd_numa
-static inline int pmd_numa(pmd_t pmd)
-{
-	return pmdnuma_flags(pmd) == _PAGE_NUMA;
-}
-#endif
-
-/*
- * pte/pmd_mknuma sets the _PAGE_ACCESSED bitflag automatically
- * because they're called by the NUMA hinting minor page fault. If we
- * wouldn't set the _PAGE_ACCESSED bitflag here, the TLB miss handler
- * would be forced to set it later while filling the TLB after we
- * return to userland. That would trigger a second write to memory
- * that we optimize away by setting _PAGE_ACCESSED here.
- */
-#ifndef pte_mknonnuma
-static inline pte_t pte_mknonnuma(pte_t pte)
-{
-	pteval_t val = pte_val(pte);
-
-	val &= ~_PAGE_NUMA;
-	val |= (_PAGE_PRESENT|_PAGE_ACCESSED);
-	return __pte(val);
-}
-#endif
-
-#ifndef pmd_mknonnuma
-static inline pmd_t pmd_mknonnuma(pmd_t pmd)
-{
-	pmdval_t val = pmd_val(pmd);
-
-	val &= ~_PAGE_NUMA;
-	val |= (_PAGE_PRESENT|_PAGE_ACCESSED);
-
-	return __pmd(val);
-}
-#endif
-
-#ifndef pte_mknuma
-static inline pte_t pte_mknuma(pte_t pte)
-{
-	pteval_t val = pte_val(pte);
-
-	VM_BUG_ON(!(val & _PAGE_PRESENT));
-
-	val &= ~_PAGE_PRESENT;
-	val |= _PAGE_NUMA;
-
-	return __pte(val);
-}
-#endif
-
-#ifndef ptep_set_numa
-static inline void ptep_set_numa(struct mm_struct *mm, unsigned long addr,
-				 pte_t *ptep)
-{
-	pte_t ptent = *ptep;
-
-	ptent = pte_mknuma(ptent);
-	set_pte_at(mm, addr, ptep, ptent);
-	return;
-}
-#endif
-
-#ifndef pmd_mknuma
-static inline pmd_t pmd_mknuma(pmd_t pmd)
-{
-	pmdval_t val = pmd_val(pmd);
-
-	val &= ~_PAGE_PRESENT;
-	val |= _PAGE_NUMA;
-
-	return __pmd(val);
-}
-#endif
-
-#ifndef pmdp_set_numa
-static inline void pmdp_set_numa(struct mm_struct *mm, unsigned long addr,
-				 pmd_t *pmdp)
-{
-	pmd_t pmd = *pmdp;
-
-	pmd = pmd_mknuma(pmd);
-	set_pmd_at(mm, addr, pmdp, pmd);
-	return;
-}
-#endif
-#else
-static inline int pmd_numa(pmd_t pmd)
-{
-	return 0;
-}
-
-static inline int pte_numa(pte_t pte)
-{
-	return 0;
-}
-
-static inline pte_t pte_mknonnuma(pte_t pte)
-{
-	return pte;
-}
-
-static inline pmd_t pmd_mknonnuma(pmd_t pmd)
-{
-	return pmd;
-}
-
-static inline pte_t pte_mknuma(pte_t pte)
-{
-	return pte;
-}
-
-static inline void ptep_set_numa(struct mm_struct *mm, unsigned long addr,
-				 pte_t *ptep)
-{
-	return;
-}
-
-
-static inline pmd_t pmd_mknuma(pmd_t pmd)
-{
-	return pmd;
-}
-
-static inline void pmdp_set_numa(struct mm_struct *mm, unsigned long addr,
-				 pmd_t *pmdp)
-{
-	return ;
-}
-#endif /* CONFIG_NUMA_BALANCING */
-
 #endif /* CONFIG_MMU */
 
 #endif /* !__ASSEMBLY__ */
diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index 6adfb7b..2b1fa56 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -54,7 +54,7 @@ static inline pgoff_t swp_offset(swp_entry_t entry)
 /* check whether a pte points to a swap entry */
 static inline int is_swap_pte(pte_t pte)
 {
-	return !pte_none(pte) && !pte_present_nonuma(pte) && !pte_file(pte);
+	return !pte_none(pte) && !pte_file(pte);
 }
 #endif
 
-- 
1.8.4.5



* [PATCH 5/7] mm: numa: Do not trap faults on the huge zero page
  2014-11-14 13:32 [RFC PATCH 0/7] Replace _PAGE_NUMA with PAGE_NONE protections Mel Gorman
                   ` (3 preceding siblings ...)
  2014-11-14 13:33 ` [PATCH 4/7] mm: Remove remaining references to NUMA hinting bits and helpers Mel Gorman
@ 2014-11-14 13:33 ` Mel Gorman
  2014-11-14 13:33 ` [PATCH 6/7] x86: mm: Restore original pte_special check Mel Gorman
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 20+ messages in thread
From: Mel Gorman @ 2014-11-14 13:33 UTC (permalink / raw)
  To: Linux Kernel
  Cc: Linux-MM, Aneesh Kumar, Hugh Dickins, Dave Jones, Rik van Riel,
	Ingo Molnar, Kirill Shutemov, Sasha Levin, Linus Torvalds,
	Mel Gorman

Faults on the huge zero page are pointless and there is a BUG_ON to catch
them at fault time. This patch reintroduces a check that avoids marking the
zero page PAGE_NONE.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/huge_mm.h |  3 ++-
 mm/huge_memory.c        | 13 ++++++++++++-
 mm/memory.c             |  1 -
 mm/mprotect.c           | 15 ++++++++++++++-
 4 files changed, 28 insertions(+), 4 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 554bbe3..ad9051b 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -31,7 +31,8 @@ extern int move_huge_pmd(struct vm_area_struct *vma,
 			 unsigned long new_addr, unsigned long old_end,
 			 pmd_t *old_pmd, pmd_t *new_pmd);
 extern int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
-			unsigned long addr, pgprot_t newprot);
+			unsigned long addr, pgprot_t newprot,
+			int prot_numa);
 
 enum transparent_hugepage_flag {
 	TRANSPARENT_HUGEPAGE_FLAG,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 8295c9a..4f02010 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1501,7 +1501,7 @@ out:
  *  - HPAGE_PMD_NR is protections changed and TLB flush necessary
  */
 int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
-		unsigned long addr, pgprot_t newprot)
+		unsigned long addr, pgprot_t newprot, int prot_numa)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	spinlock_t *ptl;
@@ -1509,6 +1509,17 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 
 	if (__pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
 		pmd_t entry;
+
+		/*
+		 * Avoid trapping faults against the zero page. The read-only
+		 * data is likely to be read-cached on the local CPU and
+		 * local/remote hits to the zero page are not interesting.
+		 */
+		if (prot_numa && is_huge_zero_pmd(*pmd)) {
+			spin_unlock(ptl);
+			return 0;
+		}
+
 		ret = 1;
 		entry = pmdp_get_and_clear(mm, addr, pmd);
 		entry = pmd_modify(entry, newprot);
diff --git a/mm/memory.c b/mm/memory.c
index 62604b1..360c3e3 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3145,7 +3145,6 @@ static int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		pte_unmap_unlock(ptep, ptl);
 		return 0;
 	}
-	BUG_ON(is_zero_pfn(page_to_pfn(page)));
 
 	/*
 	 * Avoid grouping on DSO/COW pages in specific and RO pages
diff --git a/mm/mprotect.c b/mm/mprotect.c
index dc65c0f..33dfafb 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -75,6 +75,19 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 		oldpte = *pte;
 		if (pte_present(oldpte)) {
 			pte_t ptent;
+
+			/*
+			 * Avoid trapping faults against the zero or KSM
+			 * pages. See similar comment in change_huge_pmd.
+			 */
+			if (prot_numa) {
+				struct page *page;
+
+				page = vm_normal_page(vma, addr, oldpte);
+				if (!page || PageKsm(page))
+					continue;
+			}
+
 			ptent = ptep_modify_prot_start(mm, addr, pte);
 			ptent = pte_modify(ptent, newprot);
 
@@ -141,7 +154,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 				split_huge_page_pmd(vma, addr, pmd);
 			else {
 				int nr_ptes = change_huge_pmd(vma, pmd, addr,
-						newprot);
+						newprot, prot_numa);
 
 				if (nr_ptes) {
 					if (nr_ptes == HPAGE_PMD_NR) {
-- 
1.8.4.5



* [PATCH 6/7] x86: mm: Restore original pte_special check
  2014-11-14 13:32 [RFC PATCH 0/7] Replace _PAGE_NUMA with PAGE_NONE protections Mel Gorman
                   ` (4 preceding siblings ...)
  2014-11-14 13:33 ` [PATCH 5/7] mm: numa: Do not trap faults on the huge zero page Mel Gorman
@ 2014-11-14 13:33 ` Mel Gorman
  2014-11-14 13:33 ` [PATCH 7/7] mm: numa: Add paranoid check around pte_protnone_numa Mel Gorman
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 20+ messages in thread
From: Mel Gorman @ 2014-11-14 13:33 UTC (permalink / raw)
  To: Linux Kernel
  Cc: Linux-MM, Aneesh Kumar, Hugh Dickins, Dave Jones, Rik van Riel,
	Ingo Molnar, Kirill Shutemov, Sasha Levin, Linus Torvalds,
	Mel Gorman

Commit b38af4721f59 ("x86,mm: fix pte_special versus pte_numa") adjusted
the pte_special check to take into account that a special pte had SPECIAL
and neither PRESENT nor PROTNONE. Now that NUMA hinting PTEs are no
longer modifying _PAGE_PRESENT it should be safe to restore the original
pte_special behaviour.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 arch/x86/include/asm/pgtable.h | 8 +-------
 1 file changed, 1 insertion(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index f8799e0..5241332 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -131,13 +131,7 @@ static inline int pte_exec(pte_t pte)
 
 static inline int pte_special(pte_t pte)
 {
-	/*
-	 * See CONFIG_NUMA_BALANCING pte_numa in include/asm-generic/pgtable.h.
-	 * On x86 we have _PAGE_BIT_NUMA == _PAGE_BIT_GLOBAL+1 ==
-	 * __PAGE_BIT_SOFTW1 == _PAGE_BIT_SPECIAL.
-	 */
-	return (pte_flags(pte) & _PAGE_SPECIAL) &&
-		(pte_flags(pte) & (_PAGE_PRESENT|_PAGE_PROTNONE));
+	return pte_flags(pte) & _PAGE_SPECIAL;
 }
 
 static inline unsigned long pte_pfn(pte_t pte)
-- 
1.8.4.5



* [PATCH 7/7] mm: numa: Add paranoid check around pte_protnone_numa
  2014-11-14 13:32 [RFC PATCH 0/7] Replace _PAGE_NUMA with PAGE_NONE protections Mel Gorman
                   ` (5 preceding siblings ...)
  2014-11-14 13:33 ` [PATCH 6/7] x86: mm: Restore original pte_special check Mel Gorman
@ 2014-11-14 13:33 ` Mel Gorman
  2014-11-15  1:41 ` [RFC PATCH 0/7] Replace _PAGE_NUMA with PAGE_NONE protections Linus Torvalds
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 20+ messages in thread
From: Mel Gorman @ 2014-11-14 13:33 UTC (permalink / raw)
  To: Linux Kernel
  Cc: Linux-MM, Aneesh Kumar, Hugh Dickins, Dave Jones, Rik van Riel,
	Ingo Molnar, Kirill Shutemov, Sasha Levin, Linus Torvalds,
	Mel Gorman

pte_protnone_numa is only safe to use after VMA checks for PROT_NONE are
complete. Treating a real PROT_NONE PTE as a NUMA hinting fault is going
to result in strangeness so add a check for it. BUG_ON looks like overkill
but if this is hit then it's a serious bug that could result in corruption
so do not even try recovering. It would have been more comprehensive to
check VMA flags in pte_protnone_numa but it would have made the API ugly
just for a debugging check.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/huge_memory.c | 3 +++
 mm/memory.c      | 3 +++
 2 files changed, 6 insertions(+)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 4f02010..ae3f3e0 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1274,6 +1274,9 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	bool migrated = false;
 	int flags = 0;
 
+	/* A PROT_NONE fault should not end up here */
+	BUG_ON(!(vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE)));
+
 	ptl = pmd_lock(mm, pmdp);
 	if (unlikely(!pmd_same(pmd, *pmdp)))
 		goto out_unlock;
diff --git a/mm/memory.c b/mm/memory.c
index 360c3e3..5d45026 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3116,6 +3116,9 @@ static int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	bool migrated = false;
 	int flags = 0;
 
+	/* A PROT_NONE fault should not end up here */
+	BUG_ON(!(vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE)));
+
 	/*
 	* The "pte" at this point cannot be used safely without
 	* validation through pte_unmap_same(). It's of NUMA type but
-- 
1.8.4.5



* Re: [RFC PATCH 0/7] Replace _PAGE_NUMA with PAGE_NONE protections
  2014-11-14 13:32 [RFC PATCH 0/7] Replace _PAGE_NUMA with PAGE_NONE protections Mel Gorman
                   ` (6 preceding siblings ...)
  2014-11-14 13:33 ` [PATCH 7/7] mm: numa: Add paranoid check around pte_protnone_numa Mel Gorman
@ 2014-11-15  1:41 ` Linus Torvalds
  2014-11-15  3:29 ` Sasha Levin
  2014-11-17  8:26 ` Aneesh Kumar K.V
  9 siblings, 0 replies; 20+ messages in thread
From: Linus Torvalds @ 2014-11-15  1:41 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux Kernel, Linux-MM, Aneesh Kumar, Hugh Dickins, Dave Jones,
	Rik van Riel, Ingo Molnar, Kirill Shutemov, Sasha Levin

On Fri, Nov 14, 2014 at 5:32 AM, Mel Gorman <mgorman@suse.de> wrote:
>
> This series is very heavily based on patches from Linus and Aneesh to
> replace the existing PTE/PMD NUMA helper functions with normal change
> protections. I did alter and add parts of it but I consider them relatively
> minor contributions. Note that the signed-offs here need addressing. I
> couldn't use "From" or Signed-off-by from the original authors as the
> patches had to be broken up and they were never signed off. I expect the
> two people involved will just stick their signed-off-by on it.

Feel free to just take authorship of my parts, and make my
"Needs-sign-off's" be just "Acked-by:"

Or alternatively keep them as "Signed-off-by:", even when it looks a
bit odd if it doesn't have a "From:" me, when the actual patch won't
then actually go through me - I'm assuming this will come in through
the -mm tree.

As to the ppc parts, obviously it would be good to have Aneesh re-test
the series..

                    Linus


* Re: [RFC PATCH 0/7] Replace _PAGE_NUMA with PAGE_NONE protections
  2014-11-14 13:32 [RFC PATCH 0/7] Replace _PAGE_NUMA with PAGE_NONE protections Mel Gorman
                   ` (7 preceding siblings ...)
  2014-11-15  1:41 ` [RFC PATCH 0/7] Replace _PAGE_NUMA with PAGE_NONE protections Linus Torvalds
@ 2014-11-15  3:29 ` Sasha Levin
  2014-11-18 15:42   ` Mel Gorman
  2014-11-17  8:26 ` Aneesh Kumar K.V
  9 siblings, 1 reply; 20+ messages in thread
From: Sasha Levin @ 2014-11-15  3:29 UTC (permalink / raw)
  To: Mel Gorman, Linux Kernel
  Cc: Linux-MM, Aneesh Kumar, Hugh Dickins, Dave Jones, Rik van Riel,
	Ingo Molnar, Kirill Shutemov, Linus Torvalds

On 11/14/2014 08:32 AM, Mel Gorman wrote:
> This is a follow-up to the "pipe/page fault oddness" thread.

Hi Mel,

Applying this patch series I've started seeing the following straight away:

[  367.547848] page:ffffea0003fb7db0 count:1007 mapcount:1005 mapping:ffff8800691f2f58 index:0x37
[  367.551481] flags: 0x5001aa8030202d(locked|referenced|uptodate|lru|writeback|unevictable|mlocked)
[  367.555382] page dumped because: VM_BUG_ON_PAGE(!v9inode->writeback_fid)
[  367.558262] page->mem_cgroup:ffff88006d8a1bd8
[  367.560403] ------------[ cut here ]------------
[  367.562343] kernel BUG at fs/9p/vfs_addr.c:190!
[  367.564239] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC KASAN
[  367.566991] Dumping ftrace buffer:
[  367.568481]    (ftrace buffer empty)
[  367.569914] Modules linked in:
[  367.570254] CPU: 3 PID: 8234 Comm: kworker/u52:1 Not tainted 3.18.0-rc4-next-20141114-sasha-00054-ga9ff95e-dirty #1459
[  367.570254] Workqueue: writeback bdi_writeback_workfn (flush-9p-1)
[  367.570254] task: ffff8801e21d8000 ti: ffff8801e1f34000 task.ti: ffff8801e1f34000
[  367.570254] RIP: v9fs_vfs_writepage_locked (fs/9p/vfs_addr.c:190 (discriminator 1))
[  367.570254] RSP: 0018:ffff8801e1f376c8  EFLAGS: 00010286
[  367.570254] RAX: 0000000000000021 RBX: ffffea0003fb7db0 RCX: 0000000000000000
[  367.570254] RDX: 0000000000000021 RSI: ffffffff9208b2e6 RDI: ffff8801e21d8d0c
[  367.570254] RBP: ffff8801e1f37728 R08: 0000000000000001 R09: 0000000000000000
[  367.570254] R10: 0000000000000001 R11: 0000000000000001 R12: ffff8800691f2d48
[  367.570254] R13: 0000000000001000 R14: ffff8800691f2c30 R15: ffff8800691f2c98
[  367.570254] FS:  0000000000000000(0000) GS:ffff8801e5c00000(0000) knlGS:0000000000000000
[  367.570254] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  367.570254] CR2: 0000000000000000 CR3: 00000000ca00c000 CR4: 00000000000006a0
[  367.570254] DR0: ffffffff81000000 DR1: 0000000000000000 DR2: 0000000000000000
[  367.570254] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600
[  367.570254] Stack:
[  367.570254]  ffffda003c43b1a1 ffffda003c43b1a1 0000000000000000 0000000000000002
[  367.570254]  0000000000000002 0000000000037000 ffff8801e1f37758 ffffea0003fb7db0
[  367.570254]  0000000000000000 ffff8801e1f37a60 ffff8801e1f37a60 ffffea0003fb7db0
[  367.570254] Call Trace:
[  367.570254] v9fs_vfs_writepage (fs/9p/vfs_addr.c:212)
[  367.570254] __writepage (include/linux/pagemap.h:32 mm/page-writeback.c:2006)
[  367.570254] write_cache_pages (mm/page-writeback.c:1943)
[  367.570254] ? bdi_set_max_ratio (mm/page-writeback.c:2003)
[  367.570254] ? sched_clock_local (kernel/sched/clock.c:202)
[  367.570254] generic_writepages (mm/page-writeback.c:2030)
[  367.570254] do_writepages (mm/page-writeback.c:2047)
[  367.570254] __writeback_single_inode (fs/fs-writeback.c:461 (discriminator 3))
[  367.570254] writeback_sb_inodes (fs/fs-writeback.c:706)
[  367.570254] __writeback_inodes_wb (fs/fs-writeback.c:749)
[  367.570254] wb_writeback (fs/fs-writeback.c:880)
[  367.570254] ? __lock_is_held (kernel/locking/lockdep.c:3518)
[  367.570254] bdi_writeback_workfn (fs/fs-writeback.c:1015 fs/fs-writeback.c:1060)
[  367.570254] process_one_work (kernel/workqueue.c:2023 include/linux/jump_label.h:114 include/trace/events/workqueue.h:111 kernel/workqueue.c:2028)
[  367.570254] ? process_one_work (kernel/workqueue.c:2020)
[  367.570254] ? get_lock_stats (kernel/locking/lockdep.c:249)
[  367.570254] worker_thread (include/linux/list.h:189 kernel/workqueue.c:2156)
[  367.570254] ? __schedule (./arch/x86/include/asm/bitops.h:311 include/linux/thread_info.h:91 include/linux/sched.h:2939 kernel/sched/core.c:2848)
[  367.570254] ? rescuer_thread (kernel/workqueue.c:2100)
[  367.570254] kthread (kernel/kthread.c:207)
[  367.570254] ? flush_kthread_work (kernel/kthread.c:176)
[  367.570254] ret_from_fork (arch/x86/kernel/entry_64.S:348)
[  367.570254] ? flush_kthread_work (kernel/kthread.c:176)
[ 367.570254] Code: 48 83 c4 38 44 89 f0 5b 41 5c 41 5d 41 5e 41 5f 5d c3 66 2e 0f 1f 84 00 00 00 00 00 48 c7 c6 f8 18 37 93 48 89 df e8 e1 8b 93 fe <0f> 0b 48 89 de 48 c7 c7 30 bd 9f 95 48 89 4d b8 e8 10 5f 02 0f

All code
========
   0:	48 83 c4 38          	add    $0x38,%rsp
   4:	44 89 f0             	mov    %r14d,%eax
   7:	5b                   	pop    %rbx
   8:	41 5c                	pop    %r12
   a:	41 5d                	pop    %r13
   c:	41 5e                	pop    %r14
   e:	41 5f                	pop    %r15
  10:	5d                   	pop    %rbp
  11:	c3                   	retq
  12:	66 2e 0f 1f 84 00 00 	nopw   %cs:0x0(%rax,%rax,1)
  19:	00 00 00
  1c:	48 c7 c6 f8 18 37 93 	mov    $0xffffffff933718f8,%rsi
  23:	48 89 df             	mov    %rbx,%rdi
  26:	e8 e1 8b 93 fe       	callq  0xfffffffffe938c0c
  2b:*	0f 0b                	ud2    		<-- trapping instruction
  2d:	48 89 de             	mov    %rbx,%rsi
  30:	48 c7 c7 30 bd 9f 95 	mov    $0xffffffff959fbd30,%rdi
  37:	48 89 4d b8          	mov    %rcx,-0x48(%rbp)
  3b:	e8 10 5f 02 0f       	callq  0xf025f50
	...

Code starting with the faulting instruction
===========================================
   0:	0f 0b                	ud2
   2:	48 89 de             	mov    %rbx,%rsi
   5:	48 c7 c7 30 bd 9f 95 	mov    $0xffffffff959fbd30,%rdi
   c:	48 89 4d b8          	mov    %rcx,-0x48(%rbp)
  10:	e8 10 5f 02 0f       	callq  0xf025f25
	...
[  367.570254] RIP v9fs_vfs_writepage_locked (fs/9p/vfs_addr.c:190 (discriminator 1))
[  367.570254]  RSP <ffff8801e1f376c8>

(Note that I replaced the BUG_ON with a VM_BUG_ON_PAGE to get some extra information.)
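
For reference, the tweak in v9fs_vfs_writepage_locked() was roughly the
following (a sketch from memory rather than the exact diff I'm carrying):

	/* rough sketch of the local debug change, not the exact hunk */
	-	BUG_ON(!v9inode->writeback_fid);
	+	VM_BUG_ON_PAGE(!v9inode->writeback_fid, page);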


Thanks,
Sasha


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH 0/7] Replace _PAGE_NUMA with PAGE_NONE protections
  2014-11-14 13:32 [RFC PATCH 0/7] Replace _PAGE_NUMA with PAGE_NONE protections Mel Gorman
                   ` (8 preceding siblings ...)
  2014-11-15  3:29 ` Sasha Levin
@ 2014-11-17  8:26 ` Aneesh Kumar K.V
  2014-11-18 16:01   ` Mel Gorman
  9 siblings, 1 reply; 20+ messages in thread
From: Aneesh Kumar K.V @ 2014-11-17  8:26 UTC (permalink / raw)
  To: Mel Gorman, Linux Kernel
  Cc: Linux-MM, Hugh Dickins, Dave Jones, Rik van Riel, Ingo Molnar,
	Kirill Shutemov, Sasha Levin, Linus Torvalds, Mel Gorman,
	Benjamin Herrenschmidt, Paul Mackerras, linuxppc-dev

Mel Gorman <mgorman@suse.de> writes:

> This is follow up from the "pipe/page fault oddness" thread.
>
> Automatic NUMA balancing depends on being able to protect PTEs to trap a
> fault and gather reference locality information. Very broadly speaking it
> would mark PTEs as not present and use another bit to distinguish between
> NUMA hinting faults and other types of faults. It was universally loved
> by everybody and caused no problems whatsoever. That last sentence might
> be a lie.
>
> This series is very heavily based on patches from Linus and Aneesh to
> replace the existing PTE/PMD NUMA helper functions with normal change
> protections. I did alter and add parts of it but I consider them relatively
> minor contributions. Note that the signed-offs here need addressing. I
> couldn't use "From" or Signed-off-by from the original authors as the
> patches had to be broken up and they were never signed off. I expect the
> two people involved will just stick their signed-off-by on it.


How about the additional change listed below for ppc64? One part of the
patch is to make sure that we don't hit the WARN_ON in set_pte and set_pmd
because we find the _PAGE_PRESENT bit set in case of numa fault. I
ended up relaxing the check there.

The second part of the change is to add a WARN_ON to make sure we are
not depending on DSISR_PROTFAULT for anything else. We ideally should not
get a DSISR_PROTFAULT for a PROT_NONE or NUMA fault. hash_page_mm does check
whether the access is allowed by the pte before inserting a pte into the
hash page table. Hence we will never find PROT_NONE or PROT_NONE_NUMA ptes
in the hash page table. But it is still good to run with the VM_WARN_ON?

I also added a similar change to handle CAPI. 

This will also need an ack from Ben and Paul (added them to Cc:).

With the below patch you can add

Acked-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

for the respective patches.

diff --git a/arch/powerpc/mm/copro_fault.c b/arch/powerpc/mm/copro_fault.c
index 5a236f082c78..2e208afb7f4c 100644
--- a/arch/powerpc/mm/copro_fault.c
+++ b/arch/powerpc/mm/copro_fault.c
@@ -64,10 +64,14 @@ int copro_handle_mm_fault(struct mm_struct *mm, unsigned long ea,
 		if (!(vma->vm_flags & VM_WRITE))
 			goto out_unlock;
 	} else {
-		if (dsisr & DSISR_PROTFAULT)
-			goto out_unlock;
 		if (!(vma->vm_flags & (VM_READ | VM_EXEC)))
 			goto out_unlock;
+		/*
+		 * protfault should only happen due to us
+		 * mapping a region readonly temporarily. PROT_NONE
+		 * is also covered by the VMA check above.
+		 */
+		VM_WARN_ON(dsisr & DSISR_PROTFAULT);
 	}
 
 	ret = 0;
diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index 50074972d555..6df9483e316f 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -396,17 +396,6 @@ good_area:
 #endif /* CONFIG_8xx */
 
 	if (is_exec) {
-#ifdef CONFIG_PPC_STD_MMU
-		/* Protection fault on exec go straight to failure on
-		 * Hash based MMUs as they either don't support per-page
-		 * execute permission, or if they do, it's handled already
-		 * at the hash level. This test would probably have to
-		 * be removed if we change the way this works to make hash
-		 * processors use the same I/D cache coherency mechanism
-		 * as embedded.
-		 */
-#endif /* CONFIG_PPC_STD_MMU */
-
 		/*
 		 * Allow execution from readable areas if the MMU does not
 		 * provide separate controls over reading and executing.
@@ -421,6 +410,14 @@ good_area:
 		    (cpu_has_feature(CPU_FTR_NOEXECUTE) ||
 		     !(vma->vm_flags & (VM_READ | VM_WRITE))))
 			goto bad_area;
+#ifdef CONFIG_PPC_STD_MMU
+		/*
+		 * protfault should only happen due to us
+		 * mapping a region readonly temporarily. PROT_NONE
+		 * is also covered by the VMA check above.
+		 */
+		VM_WARN_ON(error_code & DSISR_PROTFAULT);
+#endif /* CONFIG_PPC_STD_MMU */
 	/* a write */
 	} else if (is_write) {
 		if (!(vma->vm_flags & VM_WRITE))
@@ -430,6 +427,7 @@ good_area:
 	} else {
 		if (!(vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE)))
 			goto bad_area;
+		VM_WARN_ON(error_code & DSISR_PROTFAULT);
 	}
 
 	/*
diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c
index c90e602677c9..75b08098fcf5 100644
--- a/arch/powerpc/mm/pgtable.c
+++ b/arch/powerpc/mm/pgtable.c
@@ -172,9 +172,13 @@ static pte_t set_access_flags_filter(pte_t pte, struct vm_area_struct *vma,
 void set_pte_at(struct mm_struct *mm, unsigned long addr, pte_t *ptep,
 		pte_t pte)
 {
-#ifdef CONFIG_DEBUG_VM
-	WARN_ON(pte_val(*ptep) & _PAGE_PRESENT);
-#endif
+	/*
+	 * When handling numa faults, we already have the pte marked
+	 * _PAGE_PRESENT, but we can be sure that it is not in hpte.
+	 * Hence we can use set_pte_at for them.
+	 */
+	VM_WARN_ON((pte_val(*ptep) & (_PAGE_PRESENT | _PAGE_USER)) ==
+		   (_PAGE_PRESENT | _PAGE_USER));
 	/* Note: mm->context.id might not yet have been assigned as
 	 * this context might not have been activated yet when this
 	 * is called.
diff --git a/arch/powerpc/mm/pgtable_64.c b/arch/powerpc/mm/pgtable_64.c
index c8d709ab489d..c721c5efb4df 100644
--- a/arch/powerpc/mm/pgtable_64.c
+++ b/arch/powerpc/mm/pgtable_64.c
@@ -710,7 +710,8 @@ void set_pmd_at(struct mm_struct *mm, unsigned long addr,
 		pmd_t *pmdp, pmd_t pmd)
 {
 #ifdef CONFIG_DEBUG_VM
-	WARN_ON(pmd_val(*pmdp) & _PAGE_PRESENT);
+	WARN_ON((pmd_val(*pmdp) & (_PAGE_PRESENT | _PAGE_USER)) ==
+		(_PAGE_PRESENT | _PAGE_USER));
 	assert_spin_locked(&mm->page_table_lock);
 	WARN_ON(!pmd_trans_huge(pmd));
 #endif


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH 0/7] Replace _PAGE_NUMA with PAGE_NONE protections
  2014-11-15  3:29 ` Sasha Levin
@ 2014-11-18 15:42   ` Mel Gorman
  2014-11-18 16:33     ` Sasha Levin
  0 siblings, 1 reply; 20+ messages in thread
From: Mel Gorman @ 2014-11-18 15:42 UTC (permalink / raw)
  To: Sasha Levin
  Cc: Linux Kernel, Linux-MM, Aneesh Kumar, Hugh Dickins, Dave Jones,
	Rik van Riel, Ingo Molnar, Kirill Shutemov, Linus Torvalds

On Fri, Nov 14, 2014 at 10:29:41PM -0500, Sasha Levin wrote:
> On 11/14/2014 08:32 AM, Mel Gorman wrote:
> > This is follow up from the "pipe/page fault oddness" thread.
> 
> Hi Mel,
> 
> Applying this patch series I've started seeing the following straight away:
> 
> [  367.547848] page:ffffea0003fb7db0 count:1007 mapcount:1005 mapping:ffff8800691f2f58 index:0x37
> [  367.551481] flags: 0x5001aa8030202d(locked|referenced|uptodate|lru|writeback|unevictable|mlocked)
> [  367.555382] page dumped because: VM_BUG_ON_PAGE(!v9inode->writeback_fid)
> [  367.558262] page->mem_cgroup:ffff88006d8a1bd8
> [  367.560403] ------------[ cut here ]------------
> [  367.562343] kernel BUG at fs/9p/vfs_addr.c:190!
> [  367.564239] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC KASAN
> [  367.566991] Dumping ftrace buffer:
> [  367.568481]    (ftrace buffer empty)
> [  367.569914] Modules linked in:
> [  367.570254] CPU: 3 PID: 8234 Comm: kworker/u52:1 Not tainted 3.18.0-rc4-next-20141114-sasha-00054-ga9ff95e-dirty #1459

Thanks Sasha. I don't see a next-20141114 so I looked at next-20141113 and
am assuming they are similar. It does not appear that writeback_fid is a
struct page, so it's not clear what VM_BUG_ON_PAGE means in this context.
Certainly the fields look screwy, but I think it's just accessing garbage.

I tried reproducing this but my KVM setup appears to be broken after an
update and is not even able to boot 3.17 properly, let alone with the
patches. I still have a few questions though.

1. I'm assuming this is a KVM setup but can you confirm?
2. Are you using numa=fake=N?
3. If you are using fake NUMA, what happens if you boot without it as
   that should make the patches a no-op?
4. Similarly, does the kernel boot properly without the patches?
5. Are any other patches applied because the line numbers are not lining
   up exactly?
6. As my own KVM setup appears broken, can you tell me if the host
   kernel has changed recently? If so, does using an older host kernel
   make a difference?

At the moment I'm scratching my head trying to figure out how the
patches could break 9p like this as I don't believe KVM is doing any
tricks with the same bits that could result in loss.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH 0/7] Replace _PAGE_NUMA with PAGE_NONE protections
  2014-11-17  8:26 ` Aneesh Kumar K.V
@ 2014-11-18 16:01   ` Mel Gorman
  2014-11-18 16:33     ` Aneesh Kumar K.V
  0 siblings, 1 reply; 20+ messages in thread
From: Mel Gorman @ 2014-11-18 16:01 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: Linux Kernel, Linux-MM, Hugh Dickins, Dave Jones, Rik van Riel,
	Ingo Molnar, Kirill Shutemov, Sasha Levin, Linus Torvalds,
	Benjamin Herrenschmidt, Paul Mackerras, linuxppc-dev

On Mon, Nov 17, 2014 at 01:56:19PM +0530, Aneesh Kumar K.V wrote:
> Mel Gorman <mgorman@suse.de> writes:
> 
> > This is follow up from the "pipe/page fault oddness" thread.
> >
> > Automatic NUMA balancing depends on being able to protect PTEs to trap a
> > fault and gather reference locality information. Very broadly speaking it
> > would mark PTEs as not present and use another bit to distinguish between
> > NUMA hinting faults and other types of faults. It was universally loved
> > by everybody and caused no problems whatsoever. That last sentence might
> > be a lie.
> >
> > This series is very heavily based on patches from Linus and Aneesh to
> > replace the existing PTE/PMD NUMA helper functions with normal change
> > protections. I did alter and add parts of it but I consider them relatively
> > minor contributions. Note that the signed-offs here need addressing. I
> > couldn't use "From" or Signed-off-by from the original authors as the
> > patches had to be broken up and they were never signed off. I expect the
> > two people involved will just stick their signed-off-by on it.
> 
> 
> How about the additional change listed below for ppc64? One part of the
> patch is to make sure that we don't hit the WARN_ON in set_pte and set_pmd
> because we find the _PAGE_PRESENT bit set in case of numa fault. I
> ended up relaxing the check there.
> 

I folded the set_pte_at and set_pmd_at changes into the patch "mm: Convert
p[te|md]_numa users to p[te|md]_protnone_numa" with one change -- both
set_pte_at and set_pmd_at checks are under CONFIG_DEBUG_VM for consistency.
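
For illustration, the folded ppc64 set_pte_at() check ends up looking roughly
like the following (a sketch only -- the real hunk is in the updated patch):

#ifdef CONFIG_DEBUG_VM
	/*
	 * Illustrative sketch: a pte that is _PAGE_PRESENT but not
	 * _PAGE_USER is a PROT_NONE/NUMA hinting pte and may be
	 * legitimately overwritten, so only warn when a present,
	 * user-accessible pte is being replaced.
	 */
	WARN_ON((pte_val(*ptep) & (_PAGE_PRESENT | _PAGE_USER)) ==
		(_PAGE_PRESENT | _PAGE_USER));
#endif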

> The second part of the change is to add a WARN_ON to make sure we are
> not depending on DSISR_PROTFAULT for anything else. We ideally should not
> get a DSISR_PROTFAULT for a PROT_NONE or NUMA fault. hash_page_mm does check
> whether the access is allowed by the pte before inserting a pte into the
> hash page table. Hence we will never find PROT_NONE or PROT_NONE_NUMA ptes
> in the hash page table. But it is still good to run with the VM_WARN_ON?
> 

Due to the nature of the check and when it is hit, I converted it to
a WARN_ON_ONCE. Given the exceptional circumstances, the overhead should
be non-existent and shouldn't need to be hidden below VM_WARN_ON. I also
noted that with the patch the kernel potentially no longer recovers
from this exceptional circumstance and instead falls through. To avoid
this, I preserved the "goto out_unlock".

Is this still ok?

---8<---
ppc64: Add paranoid warnings for unexpected DSISR_PROTFAULT

ppc64 should not be depending on DSISR_PROTFAULT and it is unexpected
for such faults to be triggered. This patch adds warnings just in case
they are being accidentally depended upon.

Requires-signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 arch/powerpc/mm/copro_fault.c |  7 ++++++-
 arch/powerpc/mm/fault.c       | 20 +++++++++-----------
 2 files changed, 15 insertions(+), 12 deletions(-)

diff --git a/arch/powerpc/mm/copro_fault.c b/arch/powerpc/mm/copro_fault.c
index 5a236f0..46152aa 100644
--- a/arch/powerpc/mm/copro_fault.c
+++ b/arch/powerpc/mm/copro_fault.c
@@ -64,7 +64,12 @@ int copro_handle_mm_fault(struct mm_struct *mm, unsigned long ea,
 		if (!(vma->vm_flags & VM_WRITE))
 			goto out_unlock;
 	} else {
-		if (dsisr & DSISR_PROTFAULT)
+		/*
+		 * protfault should only happen due to us
+		 * mapping a region readonly temporarily. PROT_NONE
+		 * is also covered by the VMA check above.
+		 */
+		if (WARN_ON_ONCE(dsisr & DSISR_PROTFAULT))
 			goto out_unlock;
 		if (!(vma->vm_flags & (VM_READ | VM_EXEC)))
 			goto out_unlock;
diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index 5007497..9d6e0b3 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -396,17 +396,6 @@ good_area:
 #endif /* CONFIG_8xx */
 
 	if (is_exec) {
-#ifdef CONFIG_PPC_STD_MMU
-		/* Protection fault on exec go straight to failure on
-		 * Hash based MMUs as they either don't support per-page
-		 * execute permission, or if they do, it's handled already
-		 * at the hash level. This test would probably have to
-		 * be removed if we change the way this works to make hash
-		 * processors use the same I/D cache coherency mechanism
-		 * as embedded.
-		 */
-#endif /* CONFIG_PPC_STD_MMU */
-
 		/*
 		 * Allow execution from readable areas if the MMU does not
 		 * provide separate controls over reading and executing.
@@ -421,6 +410,14 @@ good_area:
 		    (cpu_has_feature(CPU_FTR_NOEXECUTE) ||
 		     !(vma->vm_flags & (VM_READ | VM_WRITE))))
 			goto bad_area;
+#ifdef CONFIG_PPC_STD_MMU
+		/*
+		 * protfault should only happen due to us
+		 * mapping a region readonly temporarily. PROT_NONE
+		 * is also covered by the VMA check above.
+		 */
+		WARN_ON_ONCE(error_code & DSISR_PROTFAULT);
+#endif /* CONFIG_PPC_STD_MMU */
 	/* a write */
 	} else if (is_write) {
 		if (!(vma->vm_flags & VM_WRITE))
@@ -430,6 +427,7 @@ good_area:
 	} else {
 		if (!(vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE)))
 			goto bad_area;
+		WARN_ON_ONCE(error_code & DSISR_PROTFAULT);
 	}
 
 	/*

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH 0/7] Replace _PAGE_NUMA with PAGE_NONE protections
  2014-11-18 16:01   ` Mel Gorman
@ 2014-11-18 16:33     ` Aneesh Kumar K.V
  2014-11-18 17:08       ` Mel Gorman
  0 siblings, 1 reply; 20+ messages in thread
From: Aneesh Kumar K.V @ 2014-11-18 16:33 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux Kernel, Linux-MM, Hugh Dickins, Dave Jones, Rik van Riel,
	Ingo Molnar, Kirill Shutemov, Sasha Levin, Linus Torvalds,
	Benjamin Herrenschmidt, Paul Mackerras, linuxppc-dev

Mel Gorman <mgorman@suse.de> writes:

> On Mon, Nov 17, 2014 at 01:56:19PM +0530, Aneesh Kumar K.V wrote:
>> Mel Gorman <mgorman@suse.de> writes:
>> 
>> > This is follow up from the "pipe/page fault oddness" thread.
>> >
>> > Automatic NUMA balancing depends on being able to protect PTEs to trap a
>> > fault and gather reference locality information. Very broadly speaking it
>> > would mark PTEs as not present and use another bit to distinguish between
>> > NUMA hinting faults and other types of faults. It was universally loved
>> > by everybody and caused no problems whatsoever. That last sentence might
>> > be a lie.
>> >
>> > This series is very heavily based on patches from Linus and Aneesh to
>> > replace the existing PTE/PMD NUMA helper functions with normal change
>> > protections. I did alter and add parts of it but I consider them relatively
>> > minor contributions. Note that the signed-offs here need addressing. I
>> > couldn't use "From" or Signed-off-by from the original authors as the
>> > patches had to be broken up and they were never signed off. I expect the
>> > two people involved will just stick their signed-off-by on it.
>> 
>> 
>> How about the additional change listed below for ppc64? One part of the
>> patch is to make sure that we don't hit the WARN_ON in set_pte and set_pmd
>> because we find the _PAGE_PRESENT bit set in case of numa fault. I
>> ended up relaxing the check there.
>> 
>
> I folded the set_pte_at and set_pmd_at changes into the patch "mm: Convert
> p[te|md]_numa users to p[te|md]_protnone_numa" with one change -- both
> set_pte_at and set_pmd_at checks are under CONFIG_DEBUG_VM for consistency.
>
>> The second part of the change is to add a WARN_ON to make sure we are
>> not depending on DSISR_PROTFAULT for anything else. We ideally should not
>> get a DSISR_PROTFAULT for a PROT_NONE or NUMA fault. hash_page_mm does check
>> whether the access is allowed by the pte before inserting a pte into the
>> hash page table. Hence we will never find PROT_NONE or PROT_NONE_NUMA ptes
>> in the hash page table. But it is still good to run with the VM_WARN_ON?
>> 
>
> Due to the nature of the check and when it is hit, I converted it to
> a WARN_ON_ONCE. Given the exceptional circumstances, the overhead should
> be non-existent and shouldn't need to be hidden below VM_WARN_ON. I also
> noted that with the patch the kernel potentially no longer recovers
> from this exceptional circumstance and instead falls through. To avoid
> this, I preserved the "goto out_unlock".
>
> Is this still ok?
>
> ---8<---
> ppc64: Add paranoid warnings for unexpected DSISR_PROTFAULT
>
> ppc64 should not be depending on DSISR_PROTFAULT and it is unexpected
> for such faults to be triggered. This patch adds warnings just in case
> they are being accidentally depended upon.
>
> Requires-signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
> Signed-off-by: Mel Gorman <mgorman@suse.de>
> ---
>  arch/powerpc/mm/copro_fault.c |  7 ++++++-
>  arch/powerpc/mm/fault.c       | 20 +++++++++-----------
>  2 files changed, 15 insertions(+), 12 deletions(-)
>
> diff --git a/arch/powerpc/mm/copro_fault.c b/arch/powerpc/mm/copro_fault.c
> index 5a236f0..46152aa 100644
> --- a/arch/powerpc/mm/copro_fault.c
> +++ b/arch/powerpc/mm/copro_fault.c
> @@ -64,7 +64,12 @@ int copro_handle_mm_fault(struct mm_struct *mm, unsigned long ea,
>  		if (!(vma->vm_flags & VM_WRITE))
>  			goto out_unlock;
>  	} else {
> -		if (dsisr & DSISR_PROTFAULT)
> +		/*
> +		 * protfault should only happen due to us
> +		 * mapping a region readonly temporarily. PROT_NONE
> +		 * is also covered by the VMA check above.
> +		 */
> +		if (WARN_ON_ONCE(dsisr & DSISR_PROTFAULT))
>  			goto out_unlock;
>  		if (!(vma->vm_flags & (VM_READ | VM_EXEC)))
>  			goto out_unlock;


We should do the DSISR_PROTFAULT check after the vma->vm_flags check. It is
not that we will never hit DSISR_PROTFAULT; what we want to ensure here is
that we get a prot fault only for the cases covered by that vma check. So
everything should be taking the if (!(vma->vm_flags & (VM_READ |
VM_EXEC))) branch if it is a protfault, and if not we would like to know
about that. Hence the idea of not using WARN_ON_ONCE. I was also not
sure whether we want to enable the check always. The reason for keeping it
within CONFIG_DEBUG_VM is to make sure that nobody ends up depending on
PROTFAULT outside the cases covered by the vma check. So the expectation is
that developers working on the feature will run with DEBUG_VM enabled and
find this warning. We don't expect to hit it otherwise.
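
In other words, something like the below in copro_handle_mm_fault() (a
rough sketch to show the ordering, not a tested hunk):

	} else {
		if (!(vma->vm_flags & (VM_READ | VM_EXEC)))
			goto out_unlock;
		/*
		 * Illustrative ordering only: a protfault that survives
		 * the vma check above is unexpected, so warn about it
		 * when DEBUG_VM is enabled.
		 */
		VM_WARN_ON(dsisr & DSISR_PROTFAULT);
	}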

> diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
> index 5007497..9d6e0b3 100644
> --- a/arch/powerpc/mm/fault.c
> +++ b/arch/powerpc/mm/fault.c
> @@ -396,17 +396,6 @@ good_area:
>  #endif /* CONFIG_8xx */
>
>  	if (is_exec) {
> -#ifdef CONFIG_PPC_STD_MMU
> -		/* Protection fault on exec go straight to failure on
> -		 * Hash based MMUs as they either don't support per-page
> -		 * execute permission, or if they do, it's handled already
> -		 * at the hash level. This test would probably have to
> -		 * be removed if we change the way this works to make hash
> -		 * processors use the same I/D cache coherency mechanism
> -		 * as embedded.
> -		 */
> -#endif /* CONFIG_PPC_STD_MMU */
> -
>  		/*
>  		 * Allow execution from readable areas if the MMU does not
>  		 * provide separate controls over reading and executing.
> @@ -421,6 +410,14 @@ good_area:
>  		    (cpu_has_feature(CPU_FTR_NOEXECUTE) ||
>  		     !(vma->vm_flags & (VM_READ | VM_WRITE))))
>  			goto bad_area;
> +#ifdef CONFIG_PPC_STD_MMU
> +		/*
> +		 * protfault should only happen due to us
> +		 * mapping a region readonly temporarily. PROT_NONE
> +		 * is also covered by the VMA check above.
> +		 */
> +		WARN_ON_ONCE(error_code & DSISR_PROTFAULT);
> +#endif /* CONFIG_PPC_STD_MMU */
>  	/* a write */
>  	} else if (is_write) {
>  		if (!(vma->vm_flags & VM_WRITE))
> @@ -430,6 +427,7 @@ good_area:
>  	} else {
>  		if (!(vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE)))
>  			goto bad_area;
> +		WARN_ON_ONCE(error_code & DSISR_PROTFAULT);
>  	}
>
>  	/*


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH 0/7] Replace _PAGE_NUMA with PAGE_NONE protections
  2014-11-18 15:42   ` Mel Gorman
@ 2014-11-18 16:33     ` Sasha Levin
  2014-11-18 16:56       ` Aneesh Kumar K.V
  0 siblings, 1 reply; 20+ messages in thread
From: Sasha Levin @ 2014-11-18 16:33 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux Kernel, Linux-MM, Aneesh Kumar, Hugh Dickins, Dave Jones,
	Rik van Riel, Ingo Molnar, Kirill Shutemov, Linus Torvalds

On 11/18/2014 10:42 AM, Mel Gorman wrote:
> 1. I'm assuming this is a KVM setup but can you confirm?

Yes.

> 2. Are you using numa=fake=N?

Yes. numa=fake=24, which is probably way more nodes than any physical
machine the new code was tested on?

> 3. If you are using fake NUMA, what happens if you boot without it as
>    that should make the patches a no-op?

Nope, still seeing it without fake numa.

> 4. Similarly, does the kernel boot properly without the patches?

Yes, the kernel works fine without the patches both with and without fake
numa.

> 5. Are any other patches applied because the line numbers are not lining
>    up exactly?

I have quite a few more patches on top of next, but they're debug patches
that add VM_BUG_ONs in quite a few places.

One thing that was odd is that your patches had merge conflicts when applied
on -next in mm/huge_memory.c, so maybe that's where the line number
differences are coming from.

> 6. As my own KVM setup appears broken, can you tell me if the host
>    kernel has changed recently? If so, does using an older host kernel
>    make a difference?

Nope, I've been using the same host kernel (Ubuntu's 3.16.0-24-generic #32)
for a while now.

> At the moment I'm scratching my head trying to figure out how the
> patches could break 9p like this as I don't believe KVM is doing any
> tricks with the same bits that could result in loss.

This issue reproduces rather easily, so I'd be happy to try out debug patches
rather than having you guess at what might have gone wrong.


Thanks,
Sasha

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH 0/7] Replace _PAGE_NUMA with PAGE_NONE protections
  2014-11-18 16:33     ` Sasha Levin
@ 2014-11-18 16:56       ` Aneesh Kumar K.V
  2014-11-18 17:14         ` Mel Gorman
  2014-11-18 17:18         ` Sasha Levin
  0 siblings, 2 replies; 20+ messages in thread
From: Aneesh Kumar K.V @ 2014-11-18 16:56 UTC (permalink / raw)
  To: Sasha Levin, Mel Gorman
  Cc: Linux Kernel, Linux-MM, Hugh Dickins, Dave Jones, Rik van Riel,
	Ingo Molnar, Kirill Shutemov, Linus Torvalds

Sasha Levin <sasha.levin@oracle.com> writes:

> On 11/18/2014 10:42 AM, Mel Gorman wrote:
>> 1. I'm assuming this is a KVM setup but can you confirm?
>
> Yes.
>
>> 2. Are you using numa=fake=N?
>
> Yes. numa=fake=24, which is probably way more nodes than any physical
> machine the new code was tested on?
>
>> 3. If you are using fake NUMA, what happens if you boot without it as
>>    that should make the patches a no-op?
>
> Nope, still seeing it without fake numa.
>
>> 4. Similarly, does the kernel boot properly without the patches?
>
> Yes, the kernel works fine without the patches both with and without fake
> numa.


Hmm, that is interesting. I am not sure how writeback_fid can be
related. We use the writeback fid to enable client-side caching with 9p
(cache=loose), and we use this fid to write back dirty pages later. Can you
share the qemu command line used, the 9p mount options and the test details?


-aneesh


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH 0/7] Replace _PAGE_NUMA with PAGE_NONE protections
  2014-11-18 16:33     ` Aneesh Kumar K.V
@ 2014-11-18 17:08       ` Mel Gorman
  0 siblings, 0 replies; 20+ messages in thread
From: Mel Gorman @ 2014-11-18 17:08 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: Linux Kernel, Linux-MM, Hugh Dickins, Dave Jones, Rik van Riel,
	Ingo Molnar, Kirill Shutemov, Sasha Levin, Linus Torvalds,
	Benjamin Herrenschmidt, Paul Mackerras, linuxppc-dev

On Tue, Nov 18, 2014 at 10:03:30PM +0530, Aneesh Kumar K.V wrote:
> > diff --git a/arch/powerpc/mm/copro_fault.c b/arch/powerpc/mm/copro_fault.c
> > index 5a236f0..46152aa 100644
> > --- a/arch/powerpc/mm/copro_fault.c
> > +++ b/arch/powerpc/mm/copro_fault.c
> > @@ -64,7 +64,12 @@ int copro_handle_mm_fault(struct mm_struct *mm, unsigned long ea,
> >  		if (!(vma->vm_flags & VM_WRITE))
> >  			goto out_unlock;
> >  	} else {
> > -		if (dsisr & DSISR_PROTFAULT)
> > +		/*
> > +		 * protfault should only happen due to us
> > +		 * mapping a region readonly temporarily. PROT_NONE
> > +		 * is also covered by the VMA check above.
> > +		 */
> > +		if (WARN_ON_ONCE(dsisr & DSISR_PROTFAULT))
> >  			goto out_unlock;
> >  		if (!(vma->vm_flags & (VM_READ | VM_EXEC)))
> >  			goto out_unlock;
> 
> 
> we should do that DSISR_PROTFAILT check after vma->vm_flags. It is not
> that we will not hit DSISR_PROTFAULT, what we want to ensure here is that
> we get a prot fault only for cases convered by that vma check. So
> everything should be taking the if (!(vma->vm_flags & (VM_READ |
> VM_EXEC))) branch if it is a protfault. If not we would like to know
> about that. And hence the idea of not using WARN_ON_ONCE. I was also not
> sure whether we want to enable that always. The reason for keeping that
> within CONFIG_DEBUG_VM is to make sure that nobody ends up depending on
> PROTFAULT outside the vma check convered. So expectations is that
> developers working on feature will run with DEBUG_VM enable and finds
> this warning. We don't expect to hit this otherwise.
> 

/me slaps self. It's clear now and updated accordingly. Thanks.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH 0/7] Replace _PAGE_NUMA with PAGE_NONE protections
  2014-11-18 16:56       ` Aneesh Kumar K.V
@ 2014-11-18 17:14         ` Mel Gorman
  2014-11-18 17:18         ` Sasha Levin
  1 sibling, 0 replies; 20+ messages in thread
From: Mel Gorman @ 2014-11-18 17:14 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: Sasha Levin, Linux Kernel, Linux-MM, Hugh Dickins, Dave Jones,
	Rik van Riel, Ingo Molnar, Kirill Shutemov, Linus Torvalds

On Tue, Nov 18, 2014 at 10:26:41PM +0530, Aneesh Kumar K.V wrote:
> Sasha Levin <sasha.levin@oracle.com> writes:
> 
> > On 11/18/2014 10:42 AM, Mel Gorman wrote:
> >> 1. I'm assuming this is a KVM setup but can you confirm?
> >
> > Yes.
> >
> >> 2. Are you using numa=fake=N?
> >
> > Yes. numa=fake=24, which is probably way more nodes than any physical
> > machine the new code was tested on?
> >
> >> 3. If you are using fake NUMA, what happens if you boot without it as
> >>    that should make the patches a no-op?
> >
> > Nope, still seeing it without fake numa.
> >
> >> 4. Similarly, does the kernel boot properly without the patches?
> >
> > Yes, the kernel works fine without the patches both with and without fake
> > numa.
> 
> 
> Hmm that is interesting. I am not sure how writeback_fid can be
> related. We use writeback fid to enable client side caching with 9p
> (cache=loose). We use this fid to write back dirty pages later. Can you
> share the qemu command line used, 9p mount options and the test details ? 
> 

It would help if the test details included the kernel config. I got KVM
working again on a server with an older installation and, while it
doesn't use 9p, I'm not seeing any other oddities yet while
running trinity.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH 0/7] Replace _PAGE_NUMA with PAGE_NONE protections
  2014-11-18 16:56       ` Aneesh Kumar K.V
  2014-11-18 17:14         ` Mel Gorman
@ 2014-11-18 17:18         ` Sasha Levin
  2014-11-19 13:14           ` Mel Gorman
  1 sibling, 1 reply; 20+ messages in thread
From: Sasha Levin @ 2014-11-18 17:18 UTC (permalink / raw)
  To: Aneesh Kumar K.V, Mel Gorman
  Cc: Linux Kernel, Linux-MM, Hugh Dickins, Dave Jones, Rik van Riel,
	Ingo Molnar, Kirill Shutemov, Linus Torvalds

[-- Attachment #1: Type: text/plain, Size: 782 bytes --]

On 11/18/2014 11:56 AM, Aneesh Kumar K.V wrote:
>>> 4. Similarly, does the kernel boot properly without the patches?
>> >
>> > Yes, the kernel works fine without the patches both with and without fake
>> > numa.
> 
> Hmm that is interesting. I am not sure how writeback_fid can be
> related. We use writeback fid to enable client side caching with 9p
> (cache=loose). We use this fid to write back dirty pages later. Can you
> share the qemu command line used, 9p mount options and the test details ? 

I'm using kvmtool rather than qemu. The rootfs is created via kernel parameters:

root=/dev/root rw rootflags=rw,trans=virtio,version=9p2000.L rootfstype=9p

The test is just running trinity; there's no 9p- or mm-specific test going on.

I've attached my .config.


Thanks,
Sasha

[-- Attachment #2: config-sasha.gz --]
[-- Type: application/gzip, Size: 39288 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH 0/7] Replace _PAGE_NUMA with PAGE_NONE protections
  2014-11-18 17:18         ` Sasha Levin
@ 2014-11-19 13:14           ` Mel Gorman
  0 siblings, 0 replies; 20+ messages in thread
From: Mel Gorman @ 2014-11-19 13:14 UTC (permalink / raw)
  To: Sasha Levin
  Cc: Aneesh Kumar K.V, Linux Kernel, Linux-MM, Hugh Dickins,
	Dave Jones, Rik van Riel, Ingo Molnar, Kirill Shutemov,
	Linus Torvalds

On Tue, Nov 18, 2014 at 12:18:43PM -0500, Sasha Levin wrote:
> On 11/18/2014 11:56 AM, Aneesh Kumar K.V wrote:
> >>> 4. Similarly, does the kernel boot properly without the patches?
> >> >
> >> > Yes, the kernel works fine without the patches both with and without fake
> >> > numa.
> > 
> > Hmm that is interesting. I am not sure how writeback_fid can be
> > related. We use writeback fid to enable client side caching with 9p
> > (cache=loose). We use this fid to write back dirty pages later. Can you
> > share the qemu command line used, 9p mount options and the test details ? 
> 
> I'm using kvmtool rather than qemu. rootfs is created via kernel parameters:
> 
> root=/dev/root rw rootflags=rw,trans=virtio,version=9p2000.L rootfstype=9p
> 
> The test is just running trinity, there's no 9p or mm specific test going on.
> 
> I've attached my .config.
> 

Ok, based on that I was able to reproduce the problem. I hope to have a
V2 before the end of the week. Thanks.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2014-11-19 13:14 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-11-14 13:32 [RFC PATCH 0/7] Replace _PAGE_NUMA with PAGE_NONE protections Mel Gorman
2014-11-14 13:33 ` [PATCH 1/7] mm: Add p[te|md] protnone helpers for use by NUMA balancing Mel Gorman
2014-11-14 13:33 ` [PATCH 2/7] mm: Convert p[te|md]_numa users to p[te|md]_protnone_numa Mel Gorman
2014-11-14 13:33 ` [PATCH 3/7] mm: Convert p[te|md]_mknonnuma and remaining page table manipulations Mel Gorman
2014-11-14 13:33 ` [PATCH 4/7] mm: Remove remaining references to NUMA hinting bits and helpers Mel Gorman
2014-11-14 13:33 ` [PATCH 5/7] mm: numa: Do not trap faults on the huge zero page Mel Gorman
2014-11-14 13:33 ` [PATCH 6/7] x86: mm: Restore original pte_special check Mel Gorman
2014-11-14 13:33 ` [PATCH 7/7] mm: numa: Add paranoid check around pte_protnone_numa Mel Gorman
2014-11-15  1:41 ` [RFC PATCH 0/7] Replace _PAGE_NUMA with PAGE_NONE protections Linus Torvalds
2014-11-15  3:29 ` Sasha Levin
2014-11-18 15:42   ` Mel Gorman
2014-11-18 16:33     ` Sasha Levin
2014-11-18 16:56       ` Aneesh Kumar K.V
2014-11-18 17:14         ` Mel Gorman
2014-11-18 17:18         ` Sasha Levin
2014-11-19 13:14           ` Mel Gorman
2014-11-17  8:26 ` Aneesh Kumar K.V
2014-11-18 16:01   ` Mel Gorman
2014-11-18 16:33     ` Aneesh Kumar K.V
2014-11-18 17:08       ` Mel Gorman

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).