linux-mm.kvack.org archive mirror
* [RFC][PATCH 0/9] Clean up i386-PAE
@ 2020-11-30 11:27 Peter Zijlstra
  2020-11-30 11:27 ` [RFC][PATCH 1/9] mm: Update ptep_get_lockless()'s comment Peter Zijlstra
                   ` (8 more replies)
  0 siblings, 9 replies; 14+ messages in thread
From: Peter Zijlstra @ 2020-11-30 11:27 UTC (permalink / raw)
  To: x86, willy
  Cc: linux-mm, linux-kernel, aarcange, kirill.shutemov, jroedel, peterz

Hi!

While reviewing some other patches [1], Willy pointed out that lockless
page-table walkers should probably be using the same magic for PMD as we do for
PTE on things like i386-PAE.

These patches are the result of that and apply on top of that earlier series.

[1] https://lkml.kernel.org/r/20201126120114.071913521@infradead.org
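
As a reminder, the magic in question: with 64-bit entries on a 32-bit
kernel, a plain load can tear, so the lockless read loads the two
32-bit halves separately and retries whenever the low half changed
underneath it. A stand-alone user-space sketch of the pattern
(illustration only, not kernel code; all names are made up):

  #include <stdint.h>

  struct split64 {
          volatile uint32_t lo, hi;
  };

  /* Read a 64-bit entry as two halves, retrying on a torn read. */
  static uint64_t read_lockless(struct split64 *p)
  {
          uint32_t lo, hi;

          do {
                  lo = p->lo;
                  __atomic_thread_fence(__ATOMIC_ACQUIRE); /* smp_rmb() */
                  hi = p->hi;
                  __atomic_thread_fence(__ATOMIC_ACQUIRE); /* smp_rmb() */
          } while (lo != p->lo); /* low half moved under us: retry */

          return (uint64_t)hi << 32 | lo;
  }

This is only sound because writers are constrained in how they may
update an entry; patch 1 spells out the exact rules.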


* [RFC][PATCH 1/9] mm: Update ptep_get_lockless()'s comment
  2020-11-30 11:27 [RFC][PATCH 0/9] Clean up i386-PAE Peter Zijlstra
@ 2020-11-30 11:27 ` Peter Zijlstra
  2020-11-30 11:27 ` [RFC][PATCH 2/9] x86/mm/pae: Make pmd_t similar to pte_t Peter Zijlstra
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 14+ messages in thread
From: Peter Zijlstra @ 2020-11-30 11:27 UTC (permalink / raw)
  To: x86, willy
  Cc: linux-mm, linux-kernel, aarcange, kirill.shutemov, jroedel, peterz

Improve the comment and add a lockdep assertion because nobody reads
comments.

Suggested-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/pgtable.h |   17 ++++++++---------
 1 file changed, 8 insertions(+), 9 deletions(-)

--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -260,15 +260,12 @@ static inline pte_t ptep_get(pte_t *ptep
 
 #ifdef CONFIG_GUP_GET_PTE_LOW_HIGH
 /*
- * WARNING: only to be used in the get_user_pages_fast() implementation.
- *
- * With get_user_pages_fast(), we walk down the pagetables without taking any
- * locks.  For this we would like to load the pointers atomically, but sometimes
- * that is not possible (e.g. without expensive cmpxchg8b on x86_32 PAE).  What
- * we do have is the guarantee that a PTE will only either go from not present
- * to present, or present to not present or both -- it will not switch to a
- * completely different present page without a TLB flush in between; something
- * that we are blocking by holding interrupts off.
+ * For walking the pagetables without holding any locks.  Some architectures
+ * (eg x86-32 PAE) cannot load the entries atomically without using expensive
+ * instructions.  We are guaranteed that a PTE will only either go from not
+ * present to present, or present to not present -- it will not switch to a
+ * completely different present page without a TLB flush in between; which we
+ * are blocking by holding interrupts off.
  *
  * Setting ptes from not present to present goes:
  *
@@ -294,6 +291,8 @@ static inline pte_t ptep_get_lockless(pt
 {
 	pte_t pte;
 
+	lockdep_assert_irqs_disabled();
+
 	do {
 		pte.pte_low = ptep->pte_low;
 		smp_rmb();
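
For reference, the store orderings the updated comment relies on,
continuing the user-space sketch from the cover letter (illustration
only, not kernel code):

  /* not present -> present: high half first, low half last */
  static void set_present(struct split64 *p, uint64_t v)
  {
          p->hi = (uint32_t)(v >> 32);
          __atomic_thread_fence(__ATOMIC_RELEASE); /* smp_wmb() */
          p->lo = (uint32_t)v;
  }

  /* present -> not present: low half first, high half last */
  static void clear_present(struct split64 *p)
  {
          p->lo = 0;
          __atomic_thread_fence(__ATOMIC_RELEASE); /* smp_wmb() */
          p->hi = 0;
  }

Because the low half is written last when an entry becomes present and
first when it is torn down, a reader that observes the same low half
before and after loading the high half knows the two halves belong to
the same entry.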


* [RFC][PATCH 2/9] x86/mm/pae: Make pmd_t similar to pte_t
  2020-11-30 11:27 [RFC][PATCH 0/9] Clean up i386-PAE Peter Zijlstra
  2020-11-30 11:27 ` [RFC][PATCH 1/9] mm: Update ptep_get_lockless()'s comment Peter Zijlstra
@ 2020-11-30 11:27 ` Peter Zijlstra
  2020-11-30 11:27 ` [RFC][PATCH 3/9] sh/mm: " Peter Zijlstra
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 14+ messages in thread
From: Peter Zijlstra @ 2020-11-30 11:27 UTC (permalink / raw)
  To: x86, willy
  Cc: linux-mm, linux-kernel, aarcange, kirill.shutemov, jroedel, peterz

Instead of mucking about with at least two different ways of fudging
it, do the same thing we do for pte_t.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/include/asm/pgtable-3level.h       |   42 +++++++++-------------------
 arch/x86/include/asm/pgtable-3level_types.h |    7 ++++
 arch/x86/include/asm/pgtable_64_types.h     |    1 
 arch/x86/include/asm/pgtable_types.h        |    4 --
 4 files changed, 23 insertions(+), 31 deletions(-)

--- a/arch/x86/include/asm/pgtable-3level.h
+++ b/arch/x86/include/asm/pgtable-3level.h
@@ -87,7 +87,7 @@ static inline pmd_t pmd_read_atomic(pmd_
 		ret |= ((pmdval_t)*(tmp + 1)) << 32;
 	}
 
-	return (pmd_t) { ret };
+	return (pmd_t) { .pmd = ret };
 }
 
 static inline void native_set_pte_atomic(pte_t *ptep, pte_t pte)
@@ -121,12 +121,11 @@ static inline void native_pte_clear(stru
 	ptep->pte_high = 0;
 }
 
-static inline void native_pmd_clear(pmd_t *pmd)
+static inline void native_pmd_clear(pmd_t *pmdp)
 {
-	u32 *tmp = (u32 *)pmd;
-	*tmp = 0;
+	pmdp->pmd_low = 0;
 	smp_wmb();
-	*(tmp + 1) = 0;
+	pmdp->pmd_high = 0;
 }
 
 static inline void native_pud_clear(pud_t *pudp)
@@ -162,25 +161,17 @@ static inline pte_t native_ptep_get_and_
 #define native_ptep_get_and_clear(xp) native_local_ptep_get_and_clear(xp)
 #endif
 
-union split_pmd {
-	struct {
-		u32 pmd_low;
-		u32 pmd_high;
-	};
-	pmd_t pmd;
-};
-
 #ifdef CONFIG_SMP
 static inline pmd_t native_pmdp_get_and_clear(pmd_t *pmdp)
 {
-	union split_pmd res, *orig = (union split_pmd *)pmdp;
+	pmd_t res;
 
 	/* xchg acts as a barrier before setting of the high bits */
-	res.pmd_low = xchg(&orig->pmd_low, 0);
-	res.pmd_high = orig->pmd_high;
-	orig->pmd_high = 0;
+	res.pmd_low = xchg(&pmdp->pmd_low, 0);
+	res.pmd_high = READ_ONCE(pmdp->pmd_high);
+	WRITE_ONCE(pmdp->pmd_high, 0);
 
-	return res.pmd;
+	return res;
 }
 #else
 #define native_pmdp_get_and_clear(xp) native_local_pmdp_get_and_clear(xp)
@@ -199,17 +190,12 @@ static inline pmd_t pmdp_establish(struc
 	 * anybody.
 	 */
 	if (!(pmd_val(pmd) & _PAGE_PRESENT)) {
-		union split_pmd old, new, *ptr;
-
-		ptr = (union split_pmd *)pmdp;
-
-		new.pmd = pmd;
-
 		/* xchg acts as a barrier before setting of the high bits */
-		old.pmd_low = xchg(&ptr->pmd_low, new.pmd_low);
-		old.pmd_high = ptr->pmd_high;
-		ptr->pmd_high = new.pmd_high;
-		return old.pmd;
+		old.pmd_low = xchg(&pmdp->pmd_low, pmd.pmd_low);
+		old.pmd_high = READ_ONCE(pmdp->pmd_high);
+		WRITE_ONCE(pmdp->pmd_high, pmd.pmd_high);
+
+		return old;
 	}
 
 	do {
--- a/arch/x86/include/asm/pgtable-3level_types.h
+++ b/arch/x86/include/asm/pgtable-3level_types.h
@@ -18,6 +18,13 @@ typedef union {
 	};
 	pteval_t pte;
 } pte_t;
+
+typedef union {
+	struct {
+		unsigned long pmd_low, pmd_high;
+	};
+	pmdval_t pmd;
+} pmd_t;
 #endif	/* !__ASSEMBLY__ */
 
 #define SHARED_KERNEL_PMD	(!static_cpu_has(X86_FEATURE_PTI))
--- a/arch/x86/include/asm/pgtable_64_types.h
+++ b/arch/x86/include/asm/pgtable_64_types.h
@@ -19,6 +19,7 @@ typedef unsigned long	pgdval_t;
 typedef unsigned long	pgprotval_t;
 
 typedef struct { pteval_t pte; } pte_t;
+typedef struct { pmdval_t pmd; } pmd_t;
 
 #ifdef CONFIG_X86_5LEVEL
 extern unsigned int __pgtable_l5_enabled;
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -381,11 +381,9 @@ static inline pudval_t native_pud_val(pu
 #endif
 
 #if CONFIG_PGTABLE_LEVELS > 2
-typedef struct { pmdval_t pmd; } pmd_t;
-
 static inline pmd_t native_make_pmd(pmdval_t val)
 {
-	return (pmd_t) { val };
+	return (pmd_t) { .pmd = val };
 }
 
 static inline pmdval_t native_pmd_val(pmd_t pmd)
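
The layout argument, as a user-space model one can actually compile and
run (illustration only; upmd_t is a made-up stand-in for the new pmd_t):

  #include <assert.h>
  #include <stdint.h>

  typedef union {
          struct {
                  uint32_t pmd_low;
                  uint32_t pmd_high;
          };
          uint64_t pmd;
  } upmd_t;

  int main(void)
  {
          upmd_t v = { .pmd = 0x1122334455667788ULL };

          static_assert(sizeof(upmd_t) == sizeof(uint64_t), "no padding");
          /* x86 is little-endian: the low half is the low 32 bits. */
          assert(v.pmd_low == 0x55667788u);
          assert(v.pmd_high == 0x11223344u);
          return 0;
  }

Because all three views alias the same storage, xchg() on pmd_low,
READ_ONCE()/WRITE_ONCE() on pmd_high and a whole-value read of pmd all
operate on the entry itself, exactly as they already do for pte_t.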


* [RFC][PATCH 3/9] sh/mm: Make pmd_t similar to pte_t
  2020-11-30 11:27 [RFC][PATCH 0/9] Clean up i386-PAE Peter Zijlstra
  2020-11-30 11:27 ` [RFC][PATCH 1/9] mm: Update ptep_get_lockless()'s comment Peter Zijlstra
  2020-11-30 11:27 ` [RFC][PATCH 2/9] x86/mm/pae: Make pmd_t similar to pte_t Peter Zijlstra
@ 2020-11-30 11:27 ` Peter Zijlstra
  2020-11-30 14:10   ` David Laight
  2020-11-30 11:27 ` [RFC][PATCH 4/9] mm: Fix pmd_read_atomic() Peter Zijlstra
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 14+ messages in thread
From: Peter Zijlstra @ 2020-11-30 11:27 UTC (permalink / raw)
  To: x86, willy
  Cc: linux-mm, linux-kernel, aarcange, kirill.shutemov, jroedel, peterz

Just like 64bit pte_t, have a low/high split in pmd_t.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/sh/include/asm/pgtable-3level.h |   10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

--- a/arch/sh/include/asm/pgtable-3level.h
+++ b/arch/sh/include/asm/pgtable-3level.h
@@ -28,9 +28,15 @@
 #define pmd_ERROR(e) \
 	printk("%s:%d: bad pmd %016llx.\n", __FILE__, __LINE__, pmd_val(e))
 
-typedef struct { unsigned long long pmd; } pmd_t;
+typedef union {
+	struct {
+		unsigned long pmd_low;
+		unsigned long pmd_high;
+	};
+	unsigned long long pmd;
+} pmd_t;
 #define pmd_val(x)	((x).pmd)
-#define __pmd(x)	((pmd_t) { (x) } )
+#define __pmd(x)	((pmd_t) { .pmd = (x) } )
 
 static inline unsigned long pud_page_vaddr(pud_t pud)
 {
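
Since sh is ILP32, the two unsigned longs overlay the unsigned long
long exactly; a compile-time check one could add (not part of the
patch, just a suggested sanity assertion):

  _Static_assert(2 * sizeof(unsigned long) == sizeof(unsigned long long),
                 "pmd_low/pmd_high must exactly overlay pmd");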


* [RFC][PATCH 4/9] mm: Fix pmd_read_atomic()
  2020-11-30 11:27 [RFC][PATCH 0/9] Clean up i386-PAE Peter Zijlstra
                   ` (2 preceding siblings ...)
  2020-11-30 11:27 ` [RFC][PATCH 3/9] sh/mm: " Peter Zijlstra
@ 2020-11-30 11:27 ` Peter Zijlstra
  2020-11-30 11:27 ` [RFC][PATCH 5/9] mm: Rename pmd_read_atomic() Peter Zijlstra
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 14+ messages in thread
From: Peter Zijlstra @ 2020-11-30 11:27 UTC (permalink / raw)
  To: x86, willy
  Cc: linux-mm, linux-kernel, aarcange, kirill.shutemov, jroedel, peterz

AFAICT there's no reason to do anything different than what we do for
PTEs. Make it so (also affects SH).

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/include/asm/pgtable-3level.h |   56 ----------------------------------
 include/linux/pgtable.h               |   49 +++++++++++++++++++++++------
 2 files changed, 39 insertions(+), 66 deletions(-)

--- a/arch/x86/include/asm/pgtable-3level.h
+++ b/arch/x86/include/asm/pgtable-3level.h
@@ -34,62 +34,6 @@ static inline void native_set_pte(pte_t
 	ptep->pte_low = pte.pte_low;
 }
 
-#define pmd_read_atomic pmd_read_atomic
-/*
- * pte_offset_map_lock() on 32-bit PAE kernels was reading the pmd_t with
- * a "*pmdp" dereference done by GCC. Problem is, in certain places
- * where pte_offset_map_lock() is called, concurrent page faults are
- * allowed, if the mmap_lock is hold for reading. An example is mincore
- * vs page faults vs MADV_DONTNEED. On the page fault side
- * pmd_populate() rightfully does a set_64bit(), but if we're reading the
- * pmd_t with a "*pmdp" on the mincore side, a SMP race can happen
- * because GCC will not read the 64-bit value of the pmd atomically.
- *
- * To fix this all places running pte_offset_map_lock() while holding the
- * mmap_lock in read mode, shall read the pmdp pointer using this
- * function to know if the pmd is null or not, and in turn to know if
- * they can run pte_offset_map_lock() or pmd_trans_huge() or other pmd
- * operations.
- *
- * Without THP if the mmap_lock is held for reading, the pmd can only
- * transition from null to not null while pmd_read_atomic() runs. So
- * we can always return atomic pmd values with this function.
- *
- * With THP if the mmap_lock is held for reading, the pmd can become
- * trans_huge or none or point to a pte (and in turn become "stable")
- * at any time under pmd_read_atomic(). We could read it truly
- * atomically here with an atomic64_read() for the THP enabled case (and
- * it would be a whole lot simpler), but to avoid using cmpxchg8b we
- * only return an atomic pmdval if the low part of the pmdval is later
- * found to be stable (i.e. pointing to a pte). We are also returning a
- * 'none' (zero) pmdval if the low part of the pmd is zero.
- *
- * In some cases the high and low part of the pmdval returned may not be
- * consistent if THP is enabled (the low part may point to previously
- * mapped hugepage, while the high part may point to a more recently
- * mapped hugepage), but pmd_none_or_trans_huge_or_clear_bad() only
- * needs the low part of the pmd to be read atomically to decide if the
- * pmd is unstable or not, with the only exception when the low part
- * of the pmd is zero, in which case we return a 'none' pmd.
- */
-static inline pmd_t pmd_read_atomic(pmd_t *pmdp)
-{
-	pmdval_t ret;
-	u32 *tmp = (u32 *)pmdp;
-
-	ret = (pmdval_t) (*tmp);
-	if (ret) {
-		/*
-		 * If the low part is null, we must not read the high part
-		 * or we can end up with a partial pmd.
-		 */
-		smp_rmb();
-		ret |= ((pmdval_t)*(tmp + 1)) << 32;
-	}
-
-	return (pmd_t) { .pmd = ret };
-}
-
 static inline void native_set_pte_atomic(pte_t *ptep, pte_t pte)
 {
 	set_64bit((unsigned long long *)(ptep), native_pte_val(pte));
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -258,6 +258,13 @@ static inline pte_t ptep_get(pte_t *ptep
 }
 #endif
 
+#ifndef __HAVE_ARCH_PMDP_GET
+static inline pmd_t pmdp_get(pmd_t *pmdp)
+{
+	return READ_ONCE(*pmdp);
+}
+#endif
+
 #ifdef CONFIG_GUP_GET_PTE_LOW_HIGH
 /*
  * For walking the pagetables without holding any locks.  Some architectures
@@ -302,15 +309,44 @@ static inline pte_t ptep_get_lockless(pt
 
 	return pte;
 }
-#else /* CONFIG_GUP_GET_PTE_LOW_HIGH */
+#define ptep_get_lockless ptep_get_lockless
+
+#if CONFIG_PGTABLE_LEVELS > 2
+static inline pmd_t pmdp_get_lockless(pmd_t *pmdp)
+{
+	pmd_t pmd;
+
+	lockdep_assert_irqs_disabled();
+
+	do {
+		pmd.pmd_low = pmdp->pmd_low;
+		smp_rmb();
+		pmd.pmd_high = pmdp->pmd_high;
+		smp_rmb();
+	} while (unlikely(pmd.pmd_low != pmdp->pmd_low));
+
+	return pmd;
+}
+#define pmdp_get_lockless pmdp_get_lockless
+#endif /* CONFIG_PGTABLE_LEVELS > 2 */
+#endif /* CONFIG_GUP_GET_PTE_LOW_HIGH */
+
 /*
  * We require that the PTE can be read atomically.
  */
+#ifndef ptep_get_lockless
 static inline pte_t ptep_get_lockless(pte_t *ptep)
 {
 	return ptep_get(ptep);
 }
-#endif /* CONFIG_GUP_GET_PTE_LOW_HIGH */
+#endif
+
+#ifndef pmdp_get_lockless
+static inline pmd_t pmdp_get_lockless(pmd_t *pmdp)
+{
+	return pmdp_get(pmdp);
+}
+#endif
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 #ifndef __HAVE_ARCH_PMDP_HUGE_GET_AND_CLEAR
@@ -1211,17 +1247,10 @@ static inline int pud_trans_unstable(pud
 #endif
 }
 
-#ifndef pmd_read_atomic
 static inline pmd_t pmd_read_atomic(pmd_t *pmdp)
 {
-	/*
-	 * Depend on compiler for an atomic pmd read. NOTE: this is
-	 * only going to work, if the pmdval_t isn't larger than
-	 * an unsigned long.
-	 */
-	return *pmdp;
+	return pmdp_get_lockless(pmdp);
 }
-#endif
 
 #ifndef arch_needs_pgtable_deposit
 #define arch_needs_pgtable_deposit() (false)
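
The '#define pmdp_get_lockless pmdp_get_lockless' dance above is the
usual trick for letting a strong definition suppress the generic
fallback without introducing another __HAVE_ARCH_ symbol. Reduced to a
runnable toy (made-up names, nothing kernel-specific):

  #include <stdio.h>

  /* Strong version; the #define advertises its existence ... */
  static inline int my_get(const int *p) { return *p + 100; }
  #define my_get my_get

  /* ... so the generic fallback compiles away. */
  #ifndef my_get
  static inline int my_get(const int *p) { return *p; }
  #endif

  int main(void)
  {
          int x = 1;
          printf("%d\n", my_get(&x)); /* 101: the strong version won */
          return 0;
  }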


* [RFC][PATCH 5/9] mm: Rename pmd_read_atomic()
  2020-11-30 11:27 [RFC][PATCH 0/9] Clean up i386-PAE Peter Zijlstra
                   ` (3 preceding siblings ...)
  2020-11-30 11:27 ` [RFC][PATCH 4/9] mm: Fix pmd_read_atomic() Peter Zijlstra
@ 2020-11-30 11:27 ` Peter Zijlstra
  2020-11-30 15:31   ` Jason Gunthorpe
  2020-11-30 11:27 ` [RFC][PATCH 6/9] mm/gup: Fix the lockless walkers Peter Zijlstra
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 14+ messages in thread
From: Peter Zijlstra @ 2020-11-30 11:27 UTC (permalink / raw)
  To: x86, willy
  Cc: linux-mm, linux-kernel, aarcange, kirill.shutemov, jroedel, peterz

There's no point in having identical PTE/PMD routines with
different names.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/pgtable.h    |    7 +------
 mm/hmm.c                   |    2 +-
 mm/mapping_dirty_helpers.c |    2 +-
 mm/mprotect.c              |    2 +-
 mm/userfaultfd.c           |    2 +-
 5 files changed, 5 insertions(+), 10 deletions(-)

--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1244,11 +1244,6 @@ static inline int pud_trans_unstable(pud
 #endif
 }
 
-static inline pmd_t pmd_read_atomic(pmd_t *pmdp)
-{
-	return pmdp_get_lockless(pmdp);
-}
-
 #ifndef arch_needs_pgtable_deposit
 #define arch_needs_pgtable_deposit() (false)
 #endif
@@ -1275,7 +1270,7 @@ static inline pmd_t pmd_read_atomic(pmd_
  */
 static inline int pmd_none_or_trans_huge_or_clear_bad(pmd_t *pmd)
 {
-	pmd_t pmdval = pmd_read_atomic(pmd);
+	pmd_t pmdval = pmdp_get_lockless(pmd);
 	/*
 	 * The barrier will stabilize the pmdval in a register or on
 	 * the stack so that it will stop changing under the code.
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -356,7 +356,7 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
 		 * huge or device mapping one and compute corresponding pfn
 		 * values.
 		 */
-		pmd = pmd_read_atomic(pmdp);
+		pmd = pmdp_get_lockless(pmdp);
 		barrier();
 		if (!pmd_devmap(pmd) && !pmd_trans_huge(pmd))
 			goto again;
--- a/mm/mapping_dirty_helpers.c
+++ b/mm/mapping_dirty_helpers.c
@@ -123,7 +123,7 @@ static int clean_record_pte(pte_t *pte,
 static int wp_clean_pmd_entry(pmd_t *pmd, unsigned long addr, unsigned long end,
 			      struct mm_walk *walk)
 {
-	pmd_t pmdval = pmd_read_atomic(pmd);
+	pmd_t pmdval = pmdp_get_lockless(pmd);
 
 	if (!pmd_trans_unstable(&pmdval))
 		return 0;
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -190,7 +190,7 @@ static unsigned long change_pte_range(st
  */
 static inline int pmd_none_or_clear_bad_unless_trans_huge(pmd_t *pmd)
 {
-	pmd_t pmdval = pmd_read_atomic(pmd);
+	pmd_t pmdval = pmdp_get_lockless(pmd);
 
 	/* See pmd_none_or_trans_huge_or_clear_bad for info on barrier */
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -553,7 +553,7 @@ static __always_inline ssize_t __mcopy_a
 			break;
 		}
 
-		dst_pmdval = pmd_read_atomic(dst_pmd);
+		dst_pmdval = pmdp_get_lockless(dst_pmd);
 		/*
 		 * If the dst_pmd is mapped as THP don't
 		 * override it and just be strict.


* [RFC][PATCH 6/9] mm/gup: Fix the lockless walkers
  2020-11-30 11:27 [RFC][PATCH 0/9] Clean up i386-PAE Peter Zijlstra
                   ` (4 preceding siblings ...)
  2020-11-30 11:27 ` [RFC][PATCH 5/9] mm: Rename pmd_read_atomic() Peter Zijlstra
@ 2020-11-30 11:27 ` Peter Zijlstra
  2020-11-30 11:27 ` [RFC][PATCH 7/9] x86/mm/pae: Don't (ab)use atomic64 Peter Zijlstra
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 14+ messages in thread
From: Peter Zijlstra @ 2020-11-30 11:27 UTC (permalink / raw)
  To: x86, willy
  Cc: linux-mm, linux-kernel, aarcange, kirill.shutemov, jroedel, peterz

On architectures where the PTE/PMD is larger than the native word size
(i386-PAE for example), READ_ONCE() can do the wrong thing. Use
pmdp_get_lockless() just like we use ptep_get_lockless().

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/events/core.c |    2 +-
 mm/gup.c             |    2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -7041,7 +7041,7 @@ static u64 perf_get_pgtable_size(struct
 		return pud_leaf_size(pud);
 
 	pmdp = pmd_offset_lockless(pudp, pud, addr);
-	pmd = READ_ONCE(*pmdp);
+	pmd = pmdp_get_lockless(pmdp);
 	if (!pmd_present(pmd))
 		return 0;
 
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2467,7 +2467,7 @@ static int gup_pmd_range(pud_t *pudp, pu
 
 	pmdp = pmd_offset_lockless(pudp, pud, addr);
 	do {
-		pmd_t pmd = READ_ONCE(*pmdp);
+		pmd_t pmd = pmdp_get_lockless(pmdp);
 
 		next = pmd_addr_end(addr, end);
 		if (!pmd_present(pmd))
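
To see why READ_ONCE() is insufficient here: on a 32-bit machine there
is no single instruction that loads a 64-bit integer, so even a
volatile access is issued as two loads. Roughly (the exact codegen is
of course compiler-dependent):

  #include <stdint.h>

  volatile uint64_t entry;

  uint64_t read_entry(void)
  {
          /*
           * What READ_ONCE(entry) degenerates to on i386:
           *
           *      movl entry, %eax
           *      movl entry+4, %edx
           *
           * A concurrent writer can slip in between the two loads,
           * yielding a value that was never actually stored -- a torn
           * read. Hence pmdp_get_lockless() and its retry loop.
           */
          return entry;
  }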


* [RFC][PATCH 7/9] x86/mm/pae: Don't (ab)use atomic64
  2020-11-30 11:27 [RFC][PATCH 0/9] Clean up i386-PAE Peter Zijlstra
                   ` (5 preceding siblings ...)
  2020-11-30 11:27 ` [RFC][PATCH 6/9] mm/gup: Fix the lockless walkers Peter Zijlstra
@ 2020-11-30 11:27 ` Peter Zijlstra
  2020-11-30 11:27 ` [RFC][PATCH 8/9] x86/mm/pae: Use WRITE_ONCE() Peter Zijlstra
  2020-11-30 11:27 ` [RFC][PATCH 9/9] x86/mm/pae: Be consistent with pXXp_get_and_clear() Peter Zijlstra
  8 siblings, 0 replies; 14+ messages in thread
From: Peter Zijlstra @ 2020-11-30 11:27 UTC (permalink / raw)
  To: x86, willy
  Cc: linux-mm, linux-kernel, aarcange, kirill.shutemov, jroedel, peterz

PAE implies CX8 (cmpxchg8b), so use cmpxchg64() directly and write
readable code instead of (ab)using the atomic64 API.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/include/asm/pgtable-3level.h |   10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

--- a/arch/x86/include/asm/pgtable-3level.h
+++ b/arch/x86/include/asm/pgtable-3level.h
@@ -2,8 +2,6 @@
 #ifndef _ASM_X86_PGTABLE_3LEVEL_H
 #define _ASM_X86_PGTABLE_3LEVEL_H
 
-#include <asm/atomic64_32.h>
-
 /*
  * Intel Physical Address Extension (PAE) Mode - three-level page
  * tables on PPro+ CPUs.
@@ -96,11 +94,13 @@ static inline void pud_clear(pud_t *pudp
 #ifdef CONFIG_SMP
 static inline pte_t native_ptep_get_and_clear(pte_t *ptep)
 {
-	pte_t res;
+	pte_t old;
 
-	res.pte = (pteval_t)arch_atomic64_xchg((atomic64_t *)ptep, 0);
+	do {
+		old = *ptep;
+	} while (cmpxchg64(&ptep->pte, old.pte, 0ULL) != old.pte);
 
-	return res;
+	return old;
 }
 #else
 #define native_ptep_get_and_clear(xp) native_local_ptep_get_and_clear(xp)
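
The shape of the new get-and-clear, as a stand-alone user-space
analogue built on the GCC/Clang atomic builtins (illustration only;
on i386 the 8-byte compare-and-exchange is exactly cmpxchg8b):

  #include <stdint.h>

  static uint64_t get_and_clear(uint64_t *p)
  {
          uint64_t old = *p;      /* may be stale; the CAS re-checks */

          /* On failure the builtin refreshes 'old' from *p. */
          while (!__atomic_compare_exchange_n(p, &old, 0, 0,
                                              __ATOMIC_SEQ_CST,
                                              __ATOMIC_SEQ_CST))
                  ;

          return old;
  }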


* [RFC][PATCH 8/9] x86/mm/pae: Use WRITE_ONCE()
  2020-11-30 11:27 [RFC][PATCH 0/9] Clean up i386-PAE Peter Zijlstra
                   ` (6 preceding siblings ...)
  2020-11-30 11:27 ` [RFC][PATCH 7/9] x86/mm/pae: Don't (ab)use atomic64 Peter Zijlstra
@ 2020-11-30 11:27 ` Peter Zijlstra
  2020-11-30 11:27 ` [RFC][PATCH 9/9] x86/mm/pae: Be consistent with pXXp_get_and_clear() Peter Zijlstra
  8 siblings, 0 replies; 14+ messages in thread
From: Peter Zijlstra @ 2020-11-30 11:27 UTC (permalink / raw)
  To: x86, willy
  Cc: linux-mm, linux-kernel, aarcange, kirill.shutemov, jroedel, peterz

Use WRITE_ONCE() for the half-word stores that set and clear PAE page
table entries, so the compiler cannot tear or otherwise mangle them.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/include/asm/pgtable-3level.h |   12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

--- a/arch/x86/include/asm/pgtable-3level.h
+++ b/arch/x86/include/asm/pgtable-3level.h
@@ -27,9 +27,9 @@
  */
 static inline void native_set_pte(pte_t *ptep, pte_t pte)
 {
-	ptep->pte_high = pte.pte_high;
+	WRITE_ONCE(ptep->pte_high, pte.pte_high);
 	smp_wmb();
-	ptep->pte_low = pte.pte_low;
+	WRITE_ONCE(ptep->pte_low, pte.pte_low);
 }
 
 static inline void native_set_pte_atomic(pte_t *ptep, pte_t pte)
@@ -58,14 +58,14 @@ static inline void native_set_pud(pud_t
 static inline void native_pte_clear(struct mm_struct *mm, unsigned long addr,
 				    pte_t *ptep)
 {
-	ptep->pte_low = 0;
+	WRITE_ONCE(ptep->pte_low, 0);
 	smp_wmb();
-	ptep->pte_high = 0;
+	WRITE_ONCE(ptep->pte_high, 0);
 }
 
 static inline void native_pmd_clear(pmd_t *pmdp)
 {
-	pmdp->pmd_low = 0;
+	WRITE_ONCE(pmdp->pmd_low, 0);
 	smp_wmb();
-	pmdp->pmd_high = 0;
+	WRITE_ONCE(pmdp->pmd_high, 0);
 }
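
For reference, WRITE_ONCE() is in essence a volatile store (simplified
from include/asm-generic/rwonce.h):

  #define WRITE_ONCE(x, val)                                      \
  do {                                                            \
          *(volatile typeof(x) *)&(x) = (val);                    \
  } while (0)

It keeps the compiler from tearing, fusing or re-issuing the store,
which is what makes the pairing with smp_wmb() meaningful.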


* [RFC][PATCH 9/9] x86/mm/pae: Be consistent with pXXp_get_and_clear()
  2020-11-30 11:27 [RFC][PATCH 0/9] Clean up i386-PAE Peter Zijlstra
                   ` (7 preceding siblings ...)
  2020-11-30 11:27 ` [RFC][PATCH 8/9] x86/mm/pae: Use WRITE_ONCE() Peter Zijlstra
@ 2020-11-30 11:27 ` Peter Zijlstra
  8 siblings, 0 replies; 14+ messages in thread
From: Peter Zijlstra @ 2020-11-30 11:27 UTC (permalink / raw)
  To: x86, willy
  Cc: linux-mm, linux-kernel, aarcange, kirill.shutemov, jroedel, peterz

Given that ptep_get_and_clear() now uses cmpxchg8b, and that PTEs
should be by far the most common case, there's no point in keeping
optimized xchg-based variants for pmd/pud.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/include/asm/pgtable-3level.h |   34 ++++++++++------------------------
 1 file changed, 10 insertions(+), 24 deletions(-)

--- a/arch/x86/include/asm/pgtable-3level.h
+++ b/arch/x86/include/asm/pgtable-3level.h
@@ -108,14 +108,13 @@ static inline pte_t native_ptep_get_and_
 #ifdef CONFIG_SMP
 static inline pmd_t native_pmdp_get_and_clear(pmd_t *pmdp)
 {
-	pmd_t res;
+	pmd_t old;
 
-	/* xchg acts as a barrier before setting of the high bits */
-	res.pmd_low = xchg(&pmdp->pmd_low, 0);
-	res.pmd_high = READ_ONCE(pmdp->pmd_high);
-	WRITE_ONCE(pmdp->pmd_high, 0);
+	do {
+		old = *pmdp;
+	} while (cmpxchg64(&pmdp->pmd, old.pmd, 0ULL) != old.pmd);
 
-	return res;
+	return old;
 }
 #else
 #define native_pmdp_get_and_clear(xp) native_local_pmdp_get_and_clear(xp)
@@ -151,28 +150,15 @@ static inline pmd_t pmdp_establish(struc
 #endif
 
 #ifdef CONFIG_SMP
-union split_pud {
-	struct {
-		u32 pud_low;
-		u32 pud_high;
-	};
-	pud_t pud;
-};
-
 static inline pud_t native_pudp_get_and_clear(pud_t *pudp)
 {
-	union split_pud res, *orig = (union split_pud *)pudp;
+	pud_t old;
 
-#ifdef CONFIG_PAGE_TABLE_ISOLATION
-	pti_set_user_pgtbl(&pudp->p4d.pgd, __pgd(0));
-#endif
-
-	/* xchg acts as a barrier before setting of the high bits */
-	res.pud_low = xchg(&orig->pud_low, 0);
-	res.pud_high = orig->pud_high;
-	orig->pud_high = 0;
+	do {
+		old = *pudp;
+	} while (cmpxchg64(&pudp->pud, old.pud, 0ULL) != old.pud);
 
-	return res.pud;
+	return old;
 }
 #else
 #define native_pudp_get_and_clear(xp) native_local_pudp_get_and_clear(xp)

* RE: [RFC][PATCH 3/9] sh/mm: Make pmd_t similar to pte_t
  2020-11-30 11:27 ` [RFC][PATCH 3/9] sh/mm: " Peter Zijlstra
@ 2020-11-30 14:10   ` David Laight
  2020-11-30 14:21     ` Peter Zijlstra
  0 siblings, 1 reply; 14+ messages in thread
From: David Laight @ 2020-11-30 14:10 UTC (permalink / raw)
  To: 'Peter Zijlstra', x86, willy
  Cc: linux-mm, linux-kernel, aarcange, kirill.shutemov, jroedel

From: Peter Zijlstra
> Sent: 30 November 2020 11:27
> 
> Just like 64bit pte_t, have a low/high split in pmd_t.
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  arch/sh/include/asm/pgtable-3level.h |   10 ++++++++--
>  1 file changed, 8 insertions(+), 2 deletions(-)
> 
> --- a/arch/sh/include/asm/pgtable-3level.h
> +++ b/arch/sh/include/asm/pgtable-3level.h
> @@ -28,9 +28,15 @@
>  #define pmd_ERROR(e) \
>  	printk("%s:%d: bad pmd %016llx.\n", __FILE__, __LINE__, pmd_val(e))
> 
> -typedef struct { unsigned long long pmd; } pmd_t;
> +typedef union {
> +	struct {
> +		unsigned long pmd_low;
> +		unsigned long pmd_high;
> +	};
> +	unsigned long long pmd;
> +} pmd_t;

Would it be better to use u32 and u64?

	David


* Re: [RFC][PATCH 3/9] sh/mm: Make pmd_t similar to pte_t
  2020-11-30 14:10   ` David Laight
@ 2020-11-30 14:21     ` Peter Zijlstra
  0 siblings, 0 replies; 14+ messages in thread
From: Peter Zijlstra @ 2020-11-30 14:21 UTC (permalink / raw)
  To: David Laight
  Cc: x86, willy, linux-mm, linux-kernel, aarcange, kirill.shutemov, jroedel

On Mon, Nov 30, 2020 at 02:10:42PM +0000, David Laight wrote:
> From: Peter Zijlstra
> > Sent: 30 November 2020 11:27
> > 
> > Just like 64bit pte_t, have a low/high split in pmd_t.
> > 
> > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > ---
> >  arch/sh/include/asm/pgtable-3level.h |   10 ++++++++--
> >  1 file changed, 8 insertions(+), 2 deletions(-)
> > 
> > --- a/arch/sh/include/asm/pgtable-3level.h
> > +++ b/arch/sh/include/asm/pgtable-3level.h
> > @@ -28,9 +28,15 @@
> >  #define pmd_ERROR(e) \
> >  	printk("%s:%d: bad pmd %016llx.\n", __FILE__, __LINE__, pmd_val(e))
> > 
> > -typedef struct { unsigned long long pmd; } pmd_t;
> > +typedef union {
> > +	struct {
> > +		unsigned long pmd_low;
> > +		unsigned long pmd_high;
> > +	};
> > +	unsigned long long pmd;
> > +} pmd_t;
> 
> Would it be better to use u32 and u64?

That would be inconsistent with the rest of SH. If you want to go clean
up SH, have at, but that's not what this series is for.

* Re: [RFC][PATCH 5/9] mm: Rename pmd_read_atomic()
  2020-11-30 11:27 ` [RFC][PATCH 5/9] mm: Rename pmd_read_atomic() Peter Zijlstra
@ 2020-11-30 15:31   ` Jason Gunthorpe
  2020-12-01  8:57     ` Peter Zijlstra
  0 siblings, 1 reply; 14+ messages in thread
From: Jason Gunthorpe @ 2020-11-30 15:31 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: x86, willy, linux-mm, linux-kernel, aarcange, kirill.shutemov, jroedel

On Mon, Nov 30, 2020 at 12:27:10PM +0100, Peter Zijlstra wrote:
> There's no point in having identical PTE/PMD routines with
> different names.
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
>  include/linux/pgtable.h    |    7 +------
>  mm/hmm.c                   |    2 +-
>  mm/mapping_dirty_helpers.c |    2 +-
>  mm/mprotect.c              |    2 +-
>  mm/userfaultfd.c           |    2 +-
>  5 files changed, 5 insertions(+), 10 deletions(-)
> 
> +++ b/include/linux/pgtable.h
> @@ -1244,11 +1244,6 @@ static inline int pud_trans_unstable(pud
>  #endif
>  }
>  
> -static inline pmd_t pmd_read_atomic(pmd_t *pmdp)
> -{
> -	return pmdp_get_lockless(pmdp);
> -}
> -
>  #ifndef arch_needs_pgtable_deposit
>  #define arch_needs_pgtable_deposit() (false)
>  #endif
> @@ -1275,7 +1270,7 @@ static inline pmd_t pmd_read_atomic(pmd_
>   */
>  static inline int pmd_none_or_trans_huge_or_clear_bad(pmd_t *pmd)
>  {
> -	pmd_t pmdval = pmd_read_atomic(pmd);
> +	pmd_t pmdval = pmdp_get_lockless(pmd);
>  	/*
>  	 * The barrier will stabilize the pmdval in a register or on
>  	 * the stack so that it will stop changing under the code.
> +++ b/mm/hmm.c
> @@ -356,7 +356,7 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
>  		 * huge or device mapping one and compute corresponding pfn
>  		 * values.
>  		 */
> -		pmd = pmd_read_atomic(pmdp);
> +		pmd = pmdp_get_lockless(pmdp);
>  		barrier();
>  		if (!pmd_devmap(pmd) && !pmd_trans_huge(pmd))
>  			goto again;

The pagewalk API doesn't call these functions with interrupts
disabled, so doesn't this mean we hit this assertion?

+#if CONFIG_PGTABLE_LEVELS > 2
+static inline pmd_t pmdp_get_lockless(pmd_t *pmdp)
+{
+       pmd_t pmd;
+
+       lockdep_assert_irqs_disabled();
+

It is only holding the read side of the mmap_sem here

Jason

* Re: [RFC][PATCH 5/9] mm: Rename pmd_read_atomic()
  2020-11-30 15:31   ` Jason Gunthorpe
@ 2020-12-01  8:57     ` Peter Zijlstra
  0 siblings, 0 replies; 14+ messages in thread
From: Peter Zijlstra @ 2020-12-01  8:57 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: x86, willy, linux-mm, linux-kernel, aarcange, kirill.shutemov, jroedel

On Mon, Nov 30, 2020 at 11:31:20AM -0400, Jason Gunthorpe wrote:
> On Mon, Nov 30, 2020 at 12:27:10PM +0100, Peter Zijlstra wrote:
> > There's no point in having identical PTE/PMD routines with
> > different names.
> > 
> > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> >  include/linux/pgtable.h    |    7 +------
> >  mm/hmm.c                   |    2 +-
> >  mm/mapping_dirty_helpers.c |    2 +-
> >  mm/mprotect.c              |    2 +-
> >  mm/userfaultfd.c           |    2 +-
> >  5 files changed, 5 insertions(+), 10 deletions(-)
> > 
> > +++ b/include/linux/pgtable.h
> > @@ -1244,11 +1244,6 @@ static inline int pud_trans_unstable(pud
> >  #endif
> >  }
> >  
> > -static inline pmd_t pmd_read_atomic(pmd_t *pmdp)
> > -{
> > -	return pmdp_get_lockless(pmdp);
> > -}
> > -
> >  #ifndef arch_needs_pgtable_deposit
> >  #define arch_needs_pgtable_deposit() (false)
> >  #endif
> > @@ -1275,7 +1270,7 @@ static inline pmd_t pmd_read_atomic(pmd_
> >   */
> >  static inline int pmd_none_or_trans_huge_or_clear_bad(pmd_t *pmd)
> >  {
> > -	pmd_t pmdval = pmd_read_atomic(pmd);
> > +	pmd_t pmdval = pmdp_get_lockless(pmd);
> >  	/*
> >  	 * The barrier will stabilize the pmdval in a register or on
> >  	 * the stack so that it will stop changing under the code.
> > +++ b/mm/hmm.c
> > @@ -356,7 +356,7 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
> >  		 * huge or device mapping one and compute corresponding pfn
> >  		 * values.
> >  		 */
> > -		pmd = pmd_read_atomic(pmdp);
> > +		pmd = pmdp_get_lockless(pmdp);
> >  		barrier();
> >  		if (!pmd_devmap(pmd) && !pmd_trans_huge(pmd))
> >  			goto again;
> 
> The pagewalk API doesn't call these functions with interrupts
> disabled, so doesn't this mean we hit this assertion?
> 
> +#if CONFIG_PGTABLE_LEVELS > 2
> +static inline pmd_t pmdp_get_lockless(pmd_t *pmdp)
> +{
> +       pmd_t pmd;
> +
> +       lockdep_assert_irqs_disabled();
> +
> 
> It is only holding the read side of the mmap_sem here

Hurmph, good point. I'll see what I can do about that.