* [PATCH 00/13] Clean up pmd_get_atomic() and i386-PAE
@ 2022-10-22 11:14 Peter Zijlstra
  2022-10-22 11:14 ` [PATCH 01/13] mm: Update ptep_get_lockless()s comment Peter Zijlstra
                   ` (14 more replies)
  0 siblings, 15 replies; 148+ messages in thread
From: Peter Zijlstra @ 2022-10-22 11:14 UTC (permalink / raw)
  To: x86, willy, torvalds, akpm
  Cc: linux-kernel, peterz, linux-mm, aarcange, kirill.shutemov,
	jroedel, ubizjak

Hi,

At long *long* last a respin of the patches that clean up pmd_get_atomic() and
i386-PAE. I'd nearly forgotten why I did this, but the old posting gave a clue
that patch #7 was the whole purpose of me doing these patches.

I've been carrying these patches for at least 2 years; they recently hit a rebase
bump against the mg-lru patches, which is what prompted this repost.

Linus' comment about try_cmpxchg64() (and Uros before him) made me redo those
patches (see patch #10) which resulted in pxx_xchg64(). This in turn led to
killing off set_64bit().
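
For anyone who has not seen the idiom, a minimal sketch of what such a
try_cmpxchg64() based 64-bit exchange looks like on a 32-bit CPU (illustrative
only; the actual helper is the pxx_xchg64() macro introduced in patch #10):

  /* Sketch: emulate a 64-bit xchg with cmpxchg8b on a 32-bit CPU. */
  static inline u64 xchg64_sketch(u64 *p, u64 val)
  {
          u64 old = *p;   /* may be a torn read; the loop below fixes it up */

          /* try_cmpxchg64() updates 'old' with the current value on failure */
          do { } while (!try_cmpxchg64(p, &old, val));

          return old;
  }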

The robot doesn't hate on these patches and they boot in kvm (because who still
has i386 hardware).

Patches also available at:

  git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git x86/mm.pae

---
 arch/mips/Kconfig                           |   2 +-
 arch/sh/Kconfig                             |   2 +-
 arch/sh/include/asm/pgtable-3level.h        |  10 +-
 arch/um/include/asm/pgtable-3level.h        |   8 --
 arch/x86/Kconfig                            |   2 +-
 arch/x86/include/asm/cmpxchg_32.h           |  28 -----
 arch/x86/include/asm/cmpxchg_64.h           |   5 -
 arch/x86/include/asm/pgtable-3level.h       | 171 ++++++----------------------
 arch/x86/include/asm/pgtable-3level_types.h |   7 ++
 arch/x86/include/asm/pgtable_64_types.h     |   1 +
 arch/x86/include/asm/pgtable_types.h        |   4 +-
 drivers/iommu/intel/irq_remapping.c         |  10 +-
 include/linux/pgtable.h                     |  71 +++++++-----
 kernel/events/core.c                        |   2 +-
 mm/Kconfig                                  |   2 +-
 mm/gup.c                                    |   2 +-
 mm/hmm.c                                    |   3 +-
 mm/khugepaged.c                             |   2 +-
 mm/mapping_dirty_helpers.c                  |   2 +-
 mm/mprotect.c                               |   2 +-
 mm/userfaultfd.c                            |   2 +-
 mm/vmscan.c                                 |   5 +-
 22 files changed, 110 insertions(+), 233 deletions(-)



* [PATCH 01/13] mm: Update ptep_get_lockless()s comment
  2022-10-22 11:14 [PATCH 00/13] Clean up pmd_get_atomic() and i386-PAE Peter Zijlstra
@ 2022-10-22 11:14 ` Peter Zijlstra
  2022-10-24  5:42   ` John Hubbard
  2022-10-22 11:14 ` [PATCH 02/13] x86/mm/pae: Make pmd_t similar to pte_t Peter Zijlstra
                   ` (13 subsequent siblings)
  14 siblings, 1 reply; 148+ messages in thread
From: Peter Zijlstra @ 2022-10-22 11:14 UTC (permalink / raw)
  To: x86, willy, torvalds, akpm
  Cc: linux-kernel, peterz, linux-mm, aarcange, kirill.shutemov,
	jroedel, ubizjak

Improve the comment.

Suggested-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/pgtable.h |   17 ++++++++---------
 1 file changed, 8 insertions(+), 9 deletions(-)

--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -260,15 +260,12 @@ static inline pte_t ptep_get(pte_t *ptep
 
 #ifdef CONFIG_GUP_GET_PTE_LOW_HIGH
 /*
- * WARNING: only to be used in the get_user_pages_fast() implementation.
- *
- * With get_user_pages_fast(), we walk down the pagetables without taking any
- * locks.  For this we would like to load the pointers atomically, but sometimes
- * that is not possible (e.g. without expensive cmpxchg8b on x86_32 PAE).  What
- * we do have is the guarantee that a PTE will only either go from not present
- * to present, or present to not present or both -- it will not switch to a
- * completely different present page without a TLB flush in between; something
- * that we are blocking by holding interrupts off.
+ * For walking the pagetables without holding any locks.  Some architectures
+ * (eg x86-32 PAE) cannot load the entries atomically without using expensive
+ * instructions.  We are guaranteed that a PTE will only either go from not
+ * present to present, or present to not present -- it will not switch to a
+ * completely different present page without a TLB flush inbetween; which we
+ * are blocking by holding interrupts off.
  *
  * Setting ptes from not present to present goes:
  *




* [PATCH 02/13] x86/mm/pae: Make pmd_t similar to pte_t
  2022-10-22 11:14 [PATCH 00/13] Clean up pmd_get_atomic() and i386-PAE Peter Zijlstra
  2022-10-22 11:14 ` [PATCH 01/13] mm: Update ptep_get_lockless()s comment Peter Zijlstra
@ 2022-10-22 11:14 ` Peter Zijlstra
  2022-10-22 11:14 ` [PATCH 03/13] sh/mm: " Peter Zijlstra
                   ` (12 subsequent siblings)
  14 siblings, 0 replies; 148+ messages in thread
From: Peter Zijlstra @ 2022-10-22 11:14 UTC (permalink / raw)
  To: x86, willy, torvalds, akpm
  Cc: linux-kernel, peterz, linux-mm, aarcange, kirill.shutemov,
	jroedel, ubizjak

Instead of mucking about with at least 2 different ways of fudging
it, do the same thing we do for pte_t.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/include/asm/pgtable-3level.h       |   42 +++++++++-------------------
 arch/x86/include/asm/pgtable-3level_types.h |    7 ++++
 arch/x86/include/asm/pgtable_64_types.h     |    1 
 arch/x86/include/asm/pgtable_types.h        |    4 --
 4 files changed, 23 insertions(+), 31 deletions(-)

--- a/arch/x86/include/asm/pgtable-3level.h
+++ b/arch/x86/include/asm/pgtable-3level.h
@@ -87,7 +87,7 @@ static inline pmd_t pmd_read_atomic(pmd_
 		ret |= ((pmdval_t)*(tmp + 1)) << 32;
 	}
 
-	return (pmd_t) { ret };
+	return (pmd_t) { .pmd = ret };
 }
 
 static inline void native_set_pte_atomic(pte_t *ptep, pte_t pte)
@@ -121,12 +121,11 @@ static inline void native_pte_clear(stru
 	ptep->pte_high = 0;
 }
 
-static inline void native_pmd_clear(pmd_t *pmd)
+static inline void native_pmd_clear(pmd_t *pmdp)
 {
-	u32 *tmp = (u32 *)pmd;
-	*tmp = 0;
+	pmdp->pmd_low = 0;
 	smp_wmb();
-	*(tmp + 1) = 0;
+	pmdp->pmd_high = 0;
 }
 
 static inline void native_pud_clear(pud_t *pudp)
@@ -162,25 +161,17 @@ static inline pte_t native_ptep_get_and_
 #define native_ptep_get_and_clear(xp) native_local_ptep_get_and_clear(xp)
 #endif
 
-union split_pmd {
-	struct {
-		u32 pmd_low;
-		u32 pmd_high;
-	};
-	pmd_t pmd;
-};
-
 #ifdef CONFIG_SMP
 static inline pmd_t native_pmdp_get_and_clear(pmd_t *pmdp)
 {
-	union split_pmd res, *orig = (union split_pmd *)pmdp;
+	pmd_t res;
 
 	/* xchg acts as a barrier before setting of the high bits */
-	res.pmd_low = xchg(&orig->pmd_low, 0);
-	res.pmd_high = orig->pmd_high;
-	orig->pmd_high = 0;
+	res.pmd_low = xchg(&pmdp->pmd_low, 0);
+	res.pmd_high = READ_ONCE(pmdp->pmd_high);
+	WRITE_ONCE(pmdp->pmd_high, 0);
 
-	return res.pmd;
+	return res;
 }
 #else
 #define native_pmdp_get_and_clear(xp) native_local_pmdp_get_and_clear(xp)
@@ -199,17 +190,12 @@ static inline pmd_t pmdp_establish(struc
 	 * anybody.
 	 */
 	if (!(pmd_val(pmd) & _PAGE_PRESENT)) {
-		union split_pmd old, new, *ptr;
-
-		ptr = (union split_pmd *)pmdp;
-
-		new.pmd = pmd;
-
 		/* xchg acts as a barrier before setting of the high bits */
-		old.pmd_low = xchg(&ptr->pmd_low, new.pmd_low);
-		old.pmd_high = ptr->pmd_high;
-		ptr->pmd_high = new.pmd_high;
-		return old.pmd;
+		old.pmd_low = xchg(&pmdp->pmd_low, pmd.pmd_low);
+		old.pmd_high = READ_ONCE(pmdp->pmd_high);
+		WRITE_ONCE(pmdp->pmd_high, pmd.pmd_high);
+
+		return old;
 	}
 
 	do {
--- a/arch/x86/include/asm/pgtable-3level_types.h
+++ b/arch/x86/include/asm/pgtable-3level_types.h
@@ -18,6 +18,13 @@ typedef union {
 	};
 	pteval_t pte;
 } pte_t;
+
+typedef union {
+	struct {
+		unsigned long pmd_low, pmd_high;
+	};
+	pmdval_t pmd;
+} pmd_t;
 #endif	/* !__ASSEMBLY__ */
 
 #define SHARED_KERNEL_PMD	(!static_cpu_has(X86_FEATURE_PTI))
--- a/arch/x86/include/asm/pgtable_64_types.h
+++ b/arch/x86/include/asm/pgtable_64_types.h
@@ -19,6 +19,7 @@ typedef unsigned long	pgdval_t;
 typedef unsigned long	pgprotval_t;
 
 typedef struct { pteval_t pte; } pte_t;
+typedef struct { pmdval_t pmd; } pmd_t;
 
 #ifdef CONFIG_X86_5LEVEL
 extern unsigned int __pgtable_l5_enabled;
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -381,11 +381,9 @@ static inline pudval_t native_pud_val(pu
 #endif
 
 #if CONFIG_PGTABLE_LEVELS > 2
-typedef struct { pmdval_t pmd; } pmd_t;
-
 static inline pmd_t native_make_pmd(pmdval_t val)
 {
-	return (pmd_t) { val };
+	return (pmd_t) { .pmd = val };
 }
 
 static inline pmdval_t native_pmd_val(pmd_t pmd)




* [PATCH 03/13] sh/mm: Make pmd_t similar to pte_t
  2022-10-22 11:14 [PATCH 00/13] Clean up pmd_get_atomic() and i386-PAE Peter Zijlstra
  2022-10-22 11:14 ` [PATCH 01/13] mm: Update ptep_get_lockless()s comment Peter Zijlstra
  2022-10-22 11:14 ` [PATCH 02/13] x86/mm/pae: Make pmd_t similar to pte_t Peter Zijlstra
@ 2022-10-22 11:14 ` Peter Zijlstra
  2022-12-21 13:54   ` Guenter Roeck
  2022-10-22 11:14 ` [PATCH 04/13] mm: Fix pmd_read_atomic() Peter Zijlstra
                   ` (11 subsequent siblings)
  14 siblings, 1 reply; 148+ messages in thread
From: Peter Zijlstra @ 2022-10-22 11:14 UTC (permalink / raw)
  To: x86, willy, torvalds, akpm
  Cc: linux-kernel, peterz, linux-mm, aarcange, kirill.shutemov,
	jroedel, ubizjak

Just like 64bit pte_t, have a low/high split in pmd_t.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/sh/include/asm/pgtable-3level.h |   10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

--- a/arch/sh/include/asm/pgtable-3level.h
+++ b/arch/sh/include/asm/pgtable-3level.h
@@ -28,9 +28,15 @@
 #define pmd_ERROR(e) \
 	printk("%s:%d: bad pmd %016llx.\n", __FILE__, __LINE__, pmd_val(e))
 
-typedef struct { unsigned long long pmd; } pmd_t;
+typedef struct {
+	struct {
+		unsigned long pmd_low;
+		unsigned long pmd_high;
+	};
+	unsigned long long pmd;
+} pmd_t;
 #define pmd_val(x)	((x).pmd)
-#define __pmd(x)	((pmd_t) { (x) } )
+#define __pmd(x)	((pmd_t) { .pmd = (x) } )
 
 static inline pmd_t *pud_pgtable(pud_t pud)
 {




* [PATCH 04/13] mm: Fix pmd_read_atomic()
  2022-10-22 11:14 [PATCH 00/13] Clean up pmd_get_atomic() and i386-PAE Peter Zijlstra
                   ` (2 preceding siblings ...)
  2022-10-22 11:14 ` [PATCH 03/13] sh/mm: " Peter Zijlstra
@ 2022-10-22 11:14 ` Peter Zijlstra
  2022-10-22 17:30   ` Linus Torvalds
  2022-10-22 11:14 ` [PATCH 05/13] mm: Rename GUP_GET_PTE_LOW_HIGH Peter Zijlstra
                   ` (10 subsequent siblings)
  14 siblings, 1 reply; 148+ messages in thread
From: Peter Zijlstra @ 2022-10-22 11:14 UTC (permalink / raw)
  To: x86, willy, torvalds, akpm
  Cc: linux-kernel, peterz, linux-mm, aarcange, kirill.shutemov,
	jroedel, ubizjak

AFAICT there's no reason to do anything different than what we do for
PTEs. Make it so (also affects SH).

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/include/asm/pgtable-3level.h |   56 ----------------------------------
 include/linux/pgtable.h               |   49 +++++++++++++++++++++++------
 2 files changed, 39 insertions(+), 66 deletions(-)

--- a/arch/x86/include/asm/pgtable-3level.h
+++ b/arch/x86/include/asm/pgtable-3level.h
@@ -34,62 +34,6 @@ static inline void native_set_pte(pte_t
 	ptep->pte_low = pte.pte_low;
 }
 
-#define pmd_read_atomic pmd_read_atomic
-/*
- * pte_offset_map_lock() on 32-bit PAE kernels was reading the pmd_t with
- * a "*pmdp" dereference done by GCC. Problem is, in certain places
- * where pte_offset_map_lock() is called, concurrent page faults are
- * allowed, if the mmap_lock is hold for reading. An example is mincore
- * vs page faults vs MADV_DONTNEED. On the page fault side
- * pmd_populate() rightfully does a set_64bit(), but if we're reading the
- * pmd_t with a "*pmdp" on the mincore side, a SMP race can happen
- * because GCC will not read the 64-bit value of the pmd atomically.
- *
- * To fix this all places running pte_offset_map_lock() while holding the
- * mmap_lock in read mode, shall read the pmdp pointer using this
- * function to know if the pmd is null or not, and in turn to know if
- * they can run pte_offset_map_lock() or pmd_trans_huge() or other pmd
- * operations.
- *
- * Without THP if the mmap_lock is held for reading, the pmd can only
- * transition from null to not null while pmd_read_atomic() runs. So
- * we can always return atomic pmd values with this function.
- *
- * With THP if the mmap_lock is held for reading, the pmd can become
- * trans_huge or none or point to a pte (and in turn become "stable")
- * at any time under pmd_read_atomic(). We could read it truly
- * atomically here with an atomic64_read() for the THP enabled case (and
- * it would be a whole lot simpler), but to avoid using cmpxchg8b we
- * only return an atomic pmdval if the low part of the pmdval is later
- * found to be stable (i.e. pointing to a pte). We are also returning a
- * 'none' (zero) pmdval if the low part of the pmd is zero.
- *
- * In some cases the high and low part of the pmdval returned may not be
- * consistent if THP is enabled (the low part may point to previously
- * mapped hugepage, while the high part may point to a more recently
- * mapped hugepage), but pmd_none_or_trans_huge_or_clear_bad() only
- * needs the low part of the pmd to be read atomically to decide if the
- * pmd is unstable or not, with the only exception when the low part
- * of the pmd is zero, in which case we return a 'none' pmd.
- */
-static inline pmd_t pmd_read_atomic(pmd_t *pmdp)
-{
-	pmdval_t ret;
-	u32 *tmp = (u32 *)pmdp;
-
-	ret = (pmdval_t) (*tmp);
-	if (ret) {
-		/*
-		 * If the low part is null, we must not read the high part
-		 * or we can end up with a partial pmd.
-		 */
-		smp_rmb();
-		ret |= ((pmdval_t)*(tmp + 1)) << 32;
-	}
-
-	return (pmd_t) { .pmd = ret };
-}
-
 static inline void native_set_pte_atomic(pte_t *ptep, pte_t pte)
 {
 	set_64bit((unsigned long long *)(ptep), native_pte_val(pte));
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -258,6 +258,13 @@ static inline pte_t ptep_get(pte_t *ptep
 }
 #endif
 
+#ifndef __HAVE_ARCH_PMDP_GET
+static inline pmd_t pmdp_get(pmd_t *pmdp)
+{
+	return READ_ONCE(*pmdp);
+}
+#endif
+
 #ifdef CONFIG_GUP_GET_PTE_LOW_HIGH
 /*
  * For walking the pagetables without holding any locks.  Some architectures
@@ -302,15 +309,42 @@ static inline pte_t ptep_get_lockless(pt
 
 	return pte;
 }
-#else /* CONFIG_GUP_GET_PTE_LOW_HIGH */
+#define ptep_get_lockless ptep_get_lockless
+
+#if CONFIG_PGTABLE_LEVELS > 2
+static inline pmd_t pmdp_get_lockless(pmd_t *pmdp)
+{
+	pmd_t pmd;
+
+	do {
+		pmd.pmd_low = pmdp->pmd_low;
+		smp_rmb();
+		pmd.pmd_high = pmdp->pmd_high;
+		smp_rmb();
+	} while (unlikely(pmd.pmd_low != pmdp->pmd_low));
+
+	return pmd;
+}
+#define pmdp_get_lockless pmdp_get_lockless
+#endif /* CONFIG_PGTABLE_LEVELS > 2 */
+#endif /* CONFIG_GUP_GET_PTE_LOW_HIGH */
+
 /*
  * We require that the PTE can be read atomically.
  */
+#ifndef ptep_get_lockless
 static inline pte_t ptep_get_lockless(pte_t *ptep)
 {
 	return ptep_get(ptep);
 }
-#endif /* CONFIG_GUP_GET_PTE_LOW_HIGH */
+#endif
+
+#ifndef pmdp_get_lockless
+static inline pmd_t pmdp_get_lockless(pmd_t *pmdp)
+{
+	return pmdp_get(pmdp);
+}
+#endif
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 #ifndef __HAVE_ARCH_PMDP_HUGE_GET_AND_CLEAR
@@ -1211,17 +1247,10 @@ static inline int pud_trans_unstable(pud
 #endif
 }
 
-#ifndef pmd_read_atomic
 static inline pmd_t pmd_read_atomic(pmd_t *pmdp)
 {
-	/*
-	 * Depend on compiler for an atomic pmd read. NOTE: this is
-	 * only going to work, if the pmdval_t isn't larger than
-	 * an unsigned long.
-	 */
-	return *pmdp;
+	return pmdp_get_lockless(pmdp);
 }
-#endif
 
 #ifndef arch_needs_pgtable_deposit
 #define arch_needs_pgtable_deposit() (false)




* [PATCH 05/13] mm: Rename GUP_GET_PTE_LOW_HIGH
  2022-10-22 11:14 [PATCH 00/13] Clean up pmd_get_atomic() and i386-PAE Peter Zijlstra
                   ` (3 preceding siblings ...)
  2022-10-22 11:14 ` [PATCH 04/13] mm: Fix pmd_read_atomic() Peter Zijlstra
@ 2022-10-22 11:14 ` Peter Zijlstra
  2022-10-22 11:14 ` [PATCH 06/13] mm: Rename pmd_read_atomic() Peter Zijlstra
                   ` (9 subsequent siblings)
  14 siblings, 0 replies; 148+ messages in thread
From: Peter Zijlstra @ 2022-10-22 11:14 UTC (permalink / raw)
  To: x86, willy, torvalds, akpm
  Cc: linux-kernel, peterz, linux-mm, aarcange, kirill.shutemov,
	jroedel, ubizjak

Since it no longer applies to only PTEs, rename it to PXX.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/mips/Kconfig       |    2 +-
 arch/sh/Kconfig         |    2 +-
 arch/x86/Kconfig        |    2 +-
 include/linux/pgtable.h |    4 ++--
 mm/Kconfig              |    2 +-
 5 files changed, 6 insertions(+), 6 deletions(-)

--- a/arch/mips/Kconfig
+++ b/arch/mips/Kconfig
@@ -46,7 +46,7 @@ config MIPS
 	select GENERIC_SCHED_CLOCK if !CAVIUM_OCTEON_SOC
 	select GENERIC_SMP_IDLE_THREAD
 	select GENERIC_TIME_VSYSCALL
-	select GUP_GET_PTE_LOW_HIGH if CPU_MIPS32 && PHYS_ADDR_T_64BIT
+	select GUP_GET_PXX_LOW_HIGH if CPU_MIPS32 && PHYS_ADDR_T_64BIT
 	select HAVE_ARCH_COMPILER_H
 	select HAVE_ARCH_JUMP_LABEL
 	select HAVE_ARCH_KGDB if MIPS_FP_SUPPORT
--- a/arch/sh/Kconfig
+++ b/arch/sh/Kconfig
@@ -24,7 +24,7 @@ config SUPERH
 	select GENERIC_PCI_IOMAP if PCI
 	select GENERIC_SCHED_CLOCK
 	select GENERIC_SMP_IDLE_THREAD
-	select GUP_GET_PTE_LOW_HIGH if X2TLB
+	select GUP_GET_PXX_LOW_HIGH if X2TLB
 	select HAVE_ARCH_AUDITSYSCALL
 	select HAVE_ARCH_KGDB
 	select HAVE_ARCH_SECCOMP_FILTER
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -157,7 +157,7 @@ config X86
 	select GENERIC_TIME_VSYSCALL
 	select GENERIC_GETTIMEOFDAY
 	select GENERIC_VDSO_TIME_NS
-	select GUP_GET_PTE_LOW_HIGH		if X86_PAE
+	select GUP_GET_PXX_LOW_HIGH		if X86_PAE
 	select HARDIRQS_SW_RESEND
 	select HARDLOCKUP_CHECK_TIMESTAMP	if X86_64
 	select HAVE_ACPI_APEI			if ACPI
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -305,7 +305,7 @@ static inline pmd_t pmdp_get(pmd_t *pmdp
 }
 #endif
 
-#ifdef CONFIG_GUP_GET_PTE_LOW_HIGH
+#ifdef CONFIG_GUP_GET_PXX_LOW_HIGH
 /*
  * For walking the pagetables without holding any locks.  Some architectures
  * (eg x86-32 PAE) cannot load the entries atomically without using expensive
@@ -365,7 +365,7 @@ static inline pmd_t pmdp_get_lockless(pm
 }
 #define pmdp_get_lockless pmdp_get_lockless
 #endif /* CONFIG_PGTABLE_LEVELS > 2 */
-#endif /* CONFIG_GUP_GET_PTE_LOW_HIGH */
+#endif /* CONFIG_GUP_GET_PXX_LOW_HIGH */
 
 /*
  * We require that the PTE can be read atomically.
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1044,7 +1044,7 @@ config GUP_TEST
 comment "GUP_TEST needs to have DEBUG_FS enabled"
 	depends on !GUP_TEST && !DEBUG_FS
 
-config GUP_GET_PTE_LOW_HIGH
+config GUP_GET_PXX_LOW_HIGH
 	bool
 
 config ARCH_HAS_PTE_SPECIAL




* [PATCH 06/13] mm: Rename pmd_read_atomic()
  2022-10-22 11:14 [PATCH 00/13] Clean up pmd_get_atomic() and i386-PAE Peter Zijlstra
                   ` (4 preceding siblings ...)
  2022-10-22 11:14 ` [PATCH 05/13] mm: Rename GUP_GET_PTE_LOW_HIGH Peter Zijlstra
@ 2022-10-22 11:14 ` Peter Zijlstra
  2022-10-22 11:14 ` [PATCH 07/13] mm/gup: Fix the lockless PMD access Peter Zijlstra
                   ` (8 subsequent siblings)
  14 siblings, 0 replies; 148+ messages in thread
From: Peter Zijlstra @ 2022-10-22 11:14 UTC (permalink / raw)
  To: x86, willy, torvalds, akpm
  Cc: linux-kernel, peterz, linux-mm, aarcange, kirill.shutemov,
	jroedel, ubizjak

There's no point in having the identical routines for PTE/PMD have
different names.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/pgtable.h    |    9 ++-------
 mm/hmm.c                   |    2 +-
 mm/khugepaged.c            |    2 +-
 mm/mapping_dirty_helpers.c |    2 +-
 mm/mprotect.c              |    2 +-
 mm/userfaultfd.c           |    2 +-
 mm/vmscan.c                |    4 ++--
 7 files changed, 9 insertions(+), 14 deletions(-)

--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1352,11 +1352,6 @@ static inline int pud_trans_unstable(pud
 #endif
 }
 
-static inline pmd_t pmd_read_atomic(pmd_t *pmdp)
-{
-	return pmdp_get_lockless(pmdp);
-}
-
 #ifndef arch_needs_pgtable_deposit
 #define arch_needs_pgtable_deposit() (false)
 #endif
@@ -1383,13 +1378,13 @@ static inline pmd_t pmd_read_atomic(pmd_
  */
 static inline int pmd_none_or_trans_huge_or_clear_bad(pmd_t *pmd)
 {
-	pmd_t pmdval = pmd_read_atomic(pmd);
+	pmd_t pmdval = pmdp_get_lockless(pmd);
 	/*
 	 * The barrier will stabilize the pmdval in a register or on
 	 * the stack so that it will stop changing under the code.
 	 *
 	 * When CONFIG_TRANSPARENT_HUGEPAGE=y on x86 32bit PAE,
-	 * pmd_read_atomic is allowed to return a not atomic pmdval
+	 * pmdp_get_lockless is allowed to return a not atomic pmdval
 	 * (for example pointing to an hugepage that has never been
 	 * mapped in the pmd). The below checks will only care about
 	 * the low part of the pmd with 32bit PAE x86 anyway, with the
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -361,7 +361,7 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
 		 * huge or device mapping one and compute corresponding pfn
 		 * values.
 		 */
-		pmd = pmd_read_atomic(pmdp);
+		pmd = pmdp_get_lockless(pmdp);
 		barrier();
 		if (!pmd_devmap(pmd) && !pmd_trans_huge(pmd))
 			goto again;
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -862,7 +862,7 @@ static int find_pmd_or_thp_or_none(struc
 	if (!*pmd)
 		return SCAN_PMD_NULL;
 
-	pmde = pmd_read_atomic(*pmd);
+	pmde = pmdp_get_lockless(*pmd);
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	/* See comments in pmd_none_or_trans_huge_or_clear_bad() */
--- a/mm/mapping_dirty_helpers.c
+++ b/mm/mapping_dirty_helpers.c
@@ -126,7 +126,7 @@ static int clean_record_pte(pte_t *pte,
 static int wp_clean_pmd_entry(pmd_t *pmd, unsigned long addr, unsigned long end,
 			      struct mm_walk *walk)
 {
-	pmd_t pmdval = pmd_read_atomic(pmd);
+	pmd_t pmdval = pmdp_get_lockless(pmd);
 
 	if (!pmd_trans_unstable(&pmdval))
 		return 0;
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -292,7 +292,7 @@ static unsigned long change_pte_range(st
  */
 static inline int pmd_none_or_clear_bad_unless_trans_huge(pmd_t *pmd)
 {
-	pmd_t pmdval = pmd_read_atomic(pmd);
+	pmd_t pmdval = pmdp_get_lockless(pmd);
 
 	/* See pmd_none_or_trans_huge_or_clear_bad for info on barrier */
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -613,7 +613,7 @@ static __always_inline ssize_t __mcopy_a
 			break;
 		}
 
-		dst_pmdval = pmd_read_atomic(dst_pmd);
+		dst_pmdval = pmdp_get_lockless(dst_pmd);
 		/*
 		 * If the dst_pmd is mapped as THP don't
 		 * override it and just be strict.
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4039,9 +4039,9 @@ static void walk_pmd_range(pud_t *pud, u
 	/* walk_pte_range() may call get_next_vma() */
 	vma = args->vma;
 	for (i = pmd_index(start), addr = start; addr != end; i++, addr = next) {
-		pmd_t val = pmd_read_atomic(pmd + i);
+		pmd_t val = pmdp_get_lockless(pmd + i);
 
-		/* for pmd_read_atomic() */
+		/* for pmdp_get_lockless() */
 		barrier();
 
 		next = pmd_addr_end(addr, end);




* [PATCH 07/13] mm/gup: Fix the lockless PMD access
  2022-10-22 11:14 [PATCH 00/13] Clean up pmd_get_atomic() and i386-PAE Peter Zijlstra
                   ` (5 preceding siblings ...)
  2022-10-22 11:14 ` [PATCH 06/13] mm: Rename pmd_read_atomic() Peter Zijlstra
@ 2022-10-22 11:14 ` Peter Zijlstra
  2022-10-23  0:42   ` Hugh Dickins
  2022-10-22 11:14 ` [PATCH 08/13] x86/mm/pae: Dont (ab)use atomic64 Peter Zijlstra
                   ` (7 subsequent siblings)
  14 siblings, 1 reply; 148+ messages in thread
From: Peter Zijlstra @ 2022-10-22 11:14 UTC (permalink / raw)
  To: x86, willy, torvalds, akpm
  Cc: linux-kernel, peterz, linux-mm, aarcange, kirill.shutemov,
	jroedel, ubizjak

On architectures where the PTE/PMD is larger than the native word size
(i386-PAE for example), READ_ONCE() can do the wrong thing. Use
pmdp_get_lockless() just like we use ptep_get_lockless().

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/events/core.c |    2 +-
 mm/gup.c             |    2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -7186,7 +7186,7 @@ static u64 perf_get_pgtable_size(struct
 		return pud_leaf_size(pud);
 
 	pmdp = pmd_offset_lockless(pudp, pud, addr);
-	pmd = READ_ONCE(*pmdp);
+	pmd = pmdp_get_lockless(pmdp);
 	if (!pmd_present(pmd))
 		return 0;
 
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2507,7 +2507,7 @@ static int gup_pmd_range(pud_t *pudp, pu
 
 	pmdp = pmd_offset_lockless(pudp, pud, addr);
 	do {
-		pmd_t pmd = READ_ONCE(*pmdp);
+		pmd_t pmd = pmdp_get_lockless(pmdp);
 
 		next = pmd_addr_end(addr, end);
 		if (!pmd_present(pmd))




* [PATCH 08/13] x86/mm/pae: Dont (ab)use atomic64
  2022-10-22 11:14 [PATCH 00/13] Clean up pmd_get_atomic() and i386-PAE Peter Zijlstra
                   ` (6 preceding siblings ...)
  2022-10-22 11:14 ` [PATCH 07/13] mm/gup: Fix the lockless PMD access Peter Zijlstra
@ 2022-10-22 11:14 ` Peter Zijlstra
  2022-10-22 11:14 ` [PATCH 09/13] x86/mm/pae: Use WRITE_ONCE() Peter Zijlstra
                   ` (6 subsequent siblings)
  14 siblings, 0 replies; 148+ messages in thread
From: Peter Zijlstra @ 2022-10-22 11:14 UTC (permalink / raw)
  To: x86, willy, torvalds, akpm
  Cc: linux-kernel, peterz, linux-mm, aarcange, kirill.shutemov,
	jroedel, ubizjak

PAE implies CX8, write readable code.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/include/asm/pgtable-3level.h |    9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)

--- a/arch/x86/include/asm/pgtable-3level.h
+++ b/arch/x86/include/asm/pgtable-3level.h
@@ -2,8 +2,6 @@
 #ifndef _ASM_X86_PGTABLE_3LEVEL_H
 #define _ASM_X86_PGTABLE_3LEVEL_H
 
-#include <asm/atomic64_32.h>
-
 /*
  * Intel Physical Address Extension (PAE) Mode - three-level page
  * tables on PPro+ CPUs.
@@ -95,11 +93,12 @@ static inline void pud_clear(pud_t *pudp
 #ifdef CONFIG_SMP
 static inline pte_t native_ptep_get_and_clear(pte_t *ptep)
 {
-	pte_t res;
+	pte_t old = *ptep;
 
-	res.pte = (pteval_t)arch_atomic64_xchg((atomic64_t *)ptep, 0);
+	do {
+	} while (!try_cmpxchg64(&ptep->pte, &old.pte, 0ULL));
 
-	return res;
+	return old;
 }
 #else
 #define native_ptep_get_and_clear(xp) native_local_ptep_get_and_clear(xp)




* [PATCH 09/13] x86/mm/pae: Use WRITE_ONCE()
  2022-10-22 11:14 [PATCH 00/13] Clean up pmd_get_atomic() and i386-PAE Peter Zijlstra
                   ` (7 preceding siblings ...)
  2022-10-22 11:14 ` [PATCH 08/13] x86/mm/pae: Dont (ab)use atomic64 Peter Zijlstra
@ 2022-10-22 11:14 ` Peter Zijlstra
  2022-10-22 17:42   ` Linus Torvalds
  2022-10-22 11:14 ` [PATCH 10/13] x86/mm/pae: Be consistent with pXXp_get_and_clear() Peter Zijlstra
                   ` (5 subsequent siblings)
  14 siblings, 1 reply; 148+ messages in thread
From: Peter Zijlstra @ 2022-10-22 11:14 UTC (permalink / raw)
  To: x86, willy, torvalds, akpm
  Cc: linux-kernel, peterz, linux-mm, aarcange, kirill.shutemov,
	jroedel, ubizjak

Disallow write-tearing; that would be really unfortunate.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/include/asm/pgtable-3level.h |   12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

--- a/arch/x86/include/asm/pgtable-3level.h
+++ b/arch/x86/include/asm/pgtable-3level.h
@@ -27,9 +27,9 @@
  */
 static inline void native_set_pte(pte_t *ptep, pte_t pte)
 {
-	ptep->pte_high = pte.pte_high;
+	WRITE_ONCE(ptep->pte_high, pte.pte_high);
 	smp_wmb();
-	ptep->pte_low = pte.pte_low;
+	WRITE_ONCE(ptep->pte_low, pte.pte_low);
 }
 
 static inline void native_set_pte_atomic(pte_t *ptep, pte_t pte)
@@ -58,16 +58,16 @@ static inline void native_set_pud(pud_t
 static inline void native_pte_clear(struct mm_struct *mm, unsigned long addr,
 				    pte_t *ptep)
 {
-	ptep->pte_low = 0;
+	WRITE_ONCE(ptep->pte_low, 0);
 	smp_wmb();
-	ptep->pte_high = 0;
+	WRITE_ONCE(ptep->pte_high, 0);
 }
 
 static inline void native_pmd_clear(pmd_t *pmdp)
 {
-	pmdp->pmd_low = 0;
+	WRITE_ONCE(pmdp->pmd_low, 0);
 	smp_wmb();
-	pmdp->pmd_high = 0;
+	WRITE_ONCE(pmdp->pmd_high, 0);
 }
 
 static inline void native_pud_clear(pud_t *pudp)




* [PATCH 10/13] x86/mm/pae: Be consistent with pXXp_get_and_clear()
  2022-10-22 11:14 [PATCH 00/13] Clean up pmd_get_atomic() and i386-PAE Peter Zijlstra
                   ` (8 preceding siblings ...)
  2022-10-22 11:14 ` [PATCH 09/13] x86/mm/pae: Use WRITE_ONCE() Peter Zijlstra
@ 2022-10-22 11:14 ` Peter Zijlstra
  2022-10-22 17:53   ` Linus Torvalds
  2022-10-22 11:14 ` [PATCH 11/13] x86_64: Remove pointless set_64bit() usage Peter Zijlstra
                   ` (4 subsequent siblings)
  14 siblings, 1 reply; 148+ messages in thread
From: Peter Zijlstra @ 2022-10-22 11:14 UTC (permalink / raw)
  To: x86, willy, torvalds, akpm
  Cc: linux-kernel, peterz, linux-mm, aarcange, kirill.shutemov,
	jroedel, ubizjak

Given that ptep_get_and_clear() uses cmpxchg8b, and that should be by
far the most common case, there's no point in having an optimized
variant for pmd/pud.

Introduce the pxx_xchg64() helper to implement the common logic once.
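
For illustration, pxx_xchg64(pmd, pmdp, 0ULL) expands to roughly the
following (hand-expanded sketch, not verbatim preprocessor output):

  ({
          pmdval_t *_p = (pmdval_t *)pmdp;
          pmdval_t _o = *_p;
          do { } while (!try_cmpxchg64(_p, &_o, (0ULL)));
          native_make_pmd(_o);    /* value of the whole ({ }) expression */
  })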

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/include/asm/pgtable-3level.h |   67 ++++++++--------------------------
 1 file changed, 17 insertions(+), 50 deletions(-)

--- a/arch/x86/include/asm/pgtable-3level.h
+++ b/arch/x86/include/asm/pgtable-3level.h
@@ -90,34 +90,33 @@ static inline void pud_clear(pud_t *pudp
 	 */
 }
 
+
+#define pxx_xchg64(_pxx, _ptr, _val) ({					\
+	_pxx##val_t *_p = (_pxx##val_t *)_ptr;				\
+	_pxx##val_t _o = *_p;						\
+	do { } while (!try_cmpxchg64(_p, &_o, (_val)));			\
+	native_make_##_pxx(_o);						\
+})
+
 #ifdef CONFIG_SMP
 static inline pte_t native_ptep_get_and_clear(pte_t *ptep)
 {
-	pte_t old = *ptep;
-
-	do {
-	} while (!try_cmpxchg64(&ptep->pte, &old.pte, 0ULL));
-
-	return old;
+	return pxx_xchg64(pte, ptep, 0ULL);
 }
-#else
-#define native_ptep_get_and_clear(xp) native_local_ptep_get_and_clear(xp)
-#endif
 
-#ifdef CONFIG_SMP
 static inline pmd_t native_pmdp_get_and_clear(pmd_t *pmdp)
 {
-	pmd_t res;
-
-	/* xchg acts as a barrier before setting of the high bits */
-	res.pmd_low = xchg(&pmdp->pmd_low, 0);
-	res.pmd_high = READ_ONCE(pmdp->pmd_high);
-	WRITE_ONCE(pmdp->pmd_high, 0);
+	return pxx_xchg64(pmd, pmdp, 0ULL);
+}
 
-	return res;
+static inline pud_t native_pudp_get_and_clear(pud_t *pudp)
+{
+	return pxx_xchg64(pud, pudp, 0ULL);
 }
 #else
+#define native_ptep_get_and_clear(xp) native_local_ptep_get_and_clear(xp)
 #define native_pmdp_get_and_clear(xp) native_local_pmdp_get_and_clear(xp)
+#define native_pudp_get_and_clear(xp) native_local_pudp_get_and_clear(xp)
 #endif
 
 #ifndef pmdp_establish
@@ -141,40 +140,8 @@ static inline pmd_t pmdp_establish(struc
 		return old;
 	}
 
-	do {
-		old = *pmdp;
-	} while (cmpxchg64(&pmdp->pmd, old.pmd, pmd.pmd) != old.pmd);
-
-	return old;
-}
-#endif
-
-#ifdef CONFIG_SMP
-union split_pud {
-	struct {
-		u32 pud_low;
-		u32 pud_high;
-	};
-	pud_t pud;
-};
-
-static inline pud_t native_pudp_get_and_clear(pud_t *pudp)
-{
-	union split_pud res, *orig = (union split_pud *)pudp;
-
-#ifdef CONFIG_PAGE_TABLE_ISOLATION
-	pti_set_user_pgtbl(&pudp->p4d.pgd, __pgd(0));
-#endif
-
-	/* xchg acts as a barrier before setting of the high bits */
-	res.pud_low = xchg(&orig->pud_low, 0);
-	res.pud_high = orig->pud_high;
-	orig->pud_high = 0;
-
-	return res.pud;
+	return pxx_xchg64(pmd, pmdp, pmd.pmd);
 }
-#else
-#define native_pudp_get_and_clear(xp) native_local_pudp_get_and_clear(xp)
 #endif
 
 /* Encode and de-code a swap entry */




* [PATCH 11/13] x86_64: Remove pointless set_64bit() usage
  2022-10-22 11:14 [PATCH 00/13] Clean up pmd_get_atomic() and i386-PAE Peter Zijlstra
                   ` (9 preceding siblings ...)
  2022-10-22 11:14 ` [PATCH 10/13] x86/mm/pae: Be consistent with pXXp_get_and_clear() Peter Zijlstra
@ 2022-10-22 11:14 ` Peter Zijlstra
  2022-10-22 17:55   ` Linus Torvalds
  2022-11-03 19:09   ` Nathan Chancellor
  2022-10-22 11:14 ` [PATCH 12/13] x86/mm/pae: Get rid of set_64bit() Peter Zijlstra
                   ` (3 subsequent siblings)
  14 siblings, 2 replies; 148+ messages in thread
From: Peter Zijlstra @ 2022-10-22 11:14 UTC (permalink / raw)
  To: x86, willy, torvalds, akpm
  Cc: linux-kernel, peterz, linux-mm, aarcange, kirill.shutemov,
	jroedel, ubizjak

The use of set_64bit() in X86_64 only code is pretty pointless, seeing
how it's a direct assignment. Remove all this nonsense.

Additionally, since x86_64 unconditionally has HAVE_CMPXCHG_DOUBLE,
there is no point in even having that fallback.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/um/include/asm/pgtable-3level.h |    8 --------
 arch/x86/include/asm/cmpxchg_64.h    |    5 -----
 drivers/iommu/intel/irq_remapping.c  |   10 ++--------
 3 files changed, 2 insertions(+), 21 deletions(-)

--- a/arch/um/include/asm/pgtable-3level.h
+++ b/arch/um/include/asm/pgtable-3level.h
@@ -58,11 +58,7 @@
 #define pud_populate(mm, pud, pmd) \
 	set_pud(pud, __pud(_PAGE_TABLE + __pa(pmd)))
 
-#ifdef CONFIG_64BIT
-#define set_pud(pudptr, pudval) set_64bit((u64 *) (pudptr), pud_val(pudval))
-#else
 #define set_pud(pudptr, pudval) (*(pudptr) = (pudval))
-#endif
 
 static inline int pgd_newpage(pgd_t pgd)
 {
@@ -71,11 +67,7 @@ static inline int pgd_newpage(pgd_t pgd)
 
 static inline void pgd_mkuptodate(pgd_t pgd) { pgd_val(pgd) &= ~_PAGE_NEWPAGE; }
 
-#ifdef CONFIG_64BIT
-#define set_pmd(pmdptr, pmdval) set_64bit((u64 *) (pmdptr), pmd_val(pmdval))
-#else
 #define set_pmd(pmdptr, pmdval) (*(pmdptr) = (pmdval))
-#endif
 
 static inline void pud_clear (pud_t *pud)
 {
--- a/arch/x86/include/asm/cmpxchg_64.h
+++ b/arch/x86/include/asm/cmpxchg_64.h
@@ -2,11 +2,6 @@
 #ifndef _ASM_X86_CMPXCHG_64_H
 #define _ASM_X86_CMPXCHG_64_H
 
-static inline void set_64bit(volatile u64 *ptr, u64 val)
-{
-	*ptr = val;
-}
-
 #define arch_cmpxchg64(ptr, o, n)					\
 ({									\
 	BUILD_BUG_ON(sizeof(*(ptr)) != 8);				\
--- a/drivers/iommu/intel/irq_remapping.c
+++ b/drivers/iommu/intel/irq_remapping.c
@@ -173,7 +173,6 @@ static int modify_irte(struct irq_2_iomm
 	index = irq_iommu->irte_index + irq_iommu->sub_handle;
 	irte = &iommu->ir_table->base[index];
 
-#if defined(CONFIG_HAVE_CMPXCHG_DOUBLE)
 	if ((irte->pst == 1) || (irte_modified->pst == 1)) {
 		bool ret;
 
@@ -187,11 +186,6 @@ static int modify_irte(struct irq_2_iomm
 		 * same as the old value.
 		 */
 		WARN_ON(!ret);
-	} else
-#endif
-	{
-		set_64bit(&irte->low, irte_modified->low);
-		set_64bit(&irte->high, irte_modified->high);
 	}
 	__iommu_flush_cache(iommu, irte, sizeof(*irte));
 
@@ -249,8 +243,8 @@ static int clear_entries(struct irq_2_io
 	end = start + (1 << irq_iommu->irte_mask);
 
 	for (entry = start; entry < end; entry++) {
-		set_64bit(&entry->low, 0);
-		set_64bit(&entry->high, 0);
+		WRITE_ONCE(entry->low, 0);
+		WRITE_ONCE(entry->high, 0);
 	}
 	bitmap_release_region(iommu->ir_table->bitmap, index,
 			      irq_iommu->irte_mask);




* [PATCH 12/13] x86/mm/pae: Get rid of set_64bit()
  2022-10-22 11:14 [PATCH 00/13] Clean up pmd_get_atomic() and i386-PAE Peter Zijlstra
                   ` (10 preceding siblings ...)
  2022-10-22 11:14 ` [PATCH 11/13] x86_64: Remove pointless set_64bit() usage Peter Zijlstra
@ 2022-10-22 11:14 ` Peter Zijlstra
  2022-10-22 11:14 ` [PATCH 13/13] mm: Remove pointless barrier() after pmdp_get_lockless() Peter Zijlstra
                   ` (2 subsequent siblings)
  14 siblings, 0 replies; 148+ messages in thread
From: Peter Zijlstra @ 2022-10-22 11:14 UTC (permalink / raw)
  To: x86, willy, torvalds, akpm
  Cc: linux-kernel, peterz, linux-mm, aarcange, kirill.shutemov,
	jroedel, ubizjak

Recognise that set_64bit() is a special case of our previously
introduced pxx_xchg64(), so use that and get rid of set_64bit().

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/include/asm/cmpxchg_32.h     |   28 ----------------------------
 arch/x86/include/asm/pgtable-3level.h |   23 ++++++++++++-----------
 2 files changed, 12 insertions(+), 39 deletions(-)

--- a/arch/x86/include/asm/cmpxchg_32.h
+++ b/arch/x86/include/asm/cmpxchg_32.h
@@ -7,34 +7,6 @@
  *       you need to test for the feature in boot_cpu_data.
  */
 
-/*
- * CMPXCHG8B only writes to the target if we had the previous
- * value in registers, otherwise it acts as a read and gives us the
- * "new previous" value.  That is why there is a loop.  Preloading
- * EDX:EAX is a performance optimization: in the common case it means
- * we need only one locked operation.
- *
- * A SIMD/3DNOW!/MMX/FPU 64-bit store here would require at the very
- * least an FPU save and/or %cr0.ts manipulation.
- *
- * cmpxchg8b must be used with the lock prefix here to allow the
- * instruction to be executed atomically.  We need to have the reader
- * side to see the coherent 64bit value.
- */
-static inline void set_64bit(volatile u64 *ptr, u64 value)
-{
-	u32 low  = value;
-	u32 high = value >> 32;
-	u64 prev = *ptr;
-
-	asm volatile("\n1:\t"
-		     LOCK_PREFIX "cmpxchg8b %0\n\t"
-		     "jnz 1b"
-		     : "=m" (*ptr), "+A" (prev)
-		     : "b" (low), "c" (high)
-		     : "memory");
-}
-
 #ifdef CONFIG_X86_CMPXCHG64
 #define arch_cmpxchg64(ptr, o, n)					\
 	((__typeof__(*(ptr)))__cmpxchg64((ptr), (unsigned long long)(o), \
--- a/arch/x86/include/asm/pgtable-3level.h
+++ b/arch/x86/include/asm/pgtable-3level.h
@@ -19,7 +19,15 @@
 	pr_err("%s:%d: bad pgd %p(%016Lx)\n",				\
 	       __FILE__, __LINE__, &(e), pgd_val(e))
 
-/* Rules for using set_pte: the pte being assigned *must* be
+#define pxx_xchg64(_pxx, _ptr, _val) ({					\
+	_pxx##val_t *_p = (_pxx##val_t *)_ptr;				\
+	_pxx##val_t _o = *_p;						\
+	do { } while (!try_cmpxchg64(_p, &_o, (_val)));			\
+	native_make_##_pxx(_o);						\
+})
+
+/*
+ * Rules for using set_pte: the pte being assigned *must* be
  * either not present or in a state where the hardware will
  * not attempt to update the pte.  In places where this is
  * not possible, use pte_get_and_clear to obtain the old pte
@@ -34,12 +42,12 @@ static inline void native_set_pte(pte_t
 
 static inline void native_set_pte_atomic(pte_t *ptep, pte_t pte)
 {
-	set_64bit((unsigned long long *)(ptep), native_pte_val(pte));
+	pxx_xchg64(pte, ptep, native_pte_val(pte));
 }
 
 static inline void native_set_pmd(pmd_t *pmdp, pmd_t pmd)
 {
-	set_64bit((unsigned long long *)(pmdp), native_pmd_val(pmd));
+	pxx_xchg64(pmd, pmdp, native_pmd_val(pmd));
 }
 
 static inline void native_set_pud(pud_t *pudp, pud_t pud)
@@ -47,7 +55,7 @@ static inline void native_set_pud(pud_t
 #ifdef CONFIG_PAGE_TABLE_ISOLATION
 	pud.p4d.pgd = pti_set_user_pgtbl(&pudp->p4d.pgd, pud.p4d.pgd);
 #endif
-	set_64bit((unsigned long long *)(pudp), native_pud_val(pud));
+	pxx_xchg64(pud, pudp, native_pud_val(pud));
 }
 
 /*
@@ -91,13 +99,6 @@ static inline void pud_clear(pud_t *pudp
 }
 
 
-#define pxx_xchg64(_pxx, _ptr, _val) ({					\
-	_pxx##val_t *_p = (_pxx##val_t *)_ptr;				\
-	_pxx##val_t _o = *_p;						\
-	do { } while (!try_cmpxchg64(_p, &_o, (_val)));			\
-	native_make_##_pxx(_o);						\
-})
-
 #ifdef CONFIG_SMP
 static inline pte_t native_ptep_get_and_clear(pte_t *ptep)
 {




* [PATCH 13/13] mm: Remove pointless barrier() after pmdp_get_lockless()
  2022-10-22 11:14 [PATCH 00/13] Clean up pmd_get_atomic() and i386-PAE Peter Zijlstra
                   ` (11 preceding siblings ...)
  2022-10-22 11:14 ` [PATCH 12/13] x86/mm/pae: Get rid of set_64bit() Peter Zijlstra
@ 2022-10-22 11:14 ` Peter Zijlstra
  2022-10-22 19:59   ` Yu Zhao
  2022-10-22 17:57 ` [PATCH 00/13] Clean up pmd_get_atomic() and i386-PAE Linus Torvalds
  2022-10-29 12:21 ` Peter Zijlstra
  14 siblings, 1 reply; 148+ messages in thread
From: Peter Zijlstra @ 2022-10-22 11:14 UTC (permalink / raw)
  To: x86, willy, torvalds, akpm
  Cc: linux-kernel, peterz, linux-mm, aarcange, kirill.shutemov,
	jroedel, ubizjak

pmdp_get_lockless() should itself imply any ordering required.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 mm/hmm.c    |    1 -
 mm/vmscan.c |    3 ---
 2 files changed, 4 deletions(-)

--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -362,7 +362,6 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
 		 * values.
 		 */
 		pmd = pmdp_get_lockless(pmdp);
-		barrier();
 		if (!pmd_devmap(pmd) && !pmd_trans_huge(pmd))
 			goto again;
 
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4041,9 +4041,6 @@ static void walk_pmd_range(pud_t *pud, u
 	for (i = pmd_index(start), addr = start; addr != end; i++, addr = next) {
 		pmd_t val = pmdp_get_lockless(pmd + i);
 
-		/* for pmdp_get_lockless() */
-		barrier();
-
 		next = pmd_addr_end(addr, end);
 
 		if (!pmd_present(val) || is_huge_zero_pmd(val)) {




* Re: [PATCH 04/13] mm: Fix pmd_read_atomic()
  2022-10-22 11:14 ` [PATCH 04/13] mm: Fix pmd_read_atomic() Peter Zijlstra
@ 2022-10-22 17:30   ` Linus Torvalds
  2022-10-24  8:09     ` Peter Zijlstra
  2022-11-01 12:41     ` Peter Zijlstra
  0 siblings, 2 replies; 148+ messages in thread
From: Linus Torvalds @ 2022-10-22 17:30 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: x86, willy, akpm, linux-kernel, linux-mm, aarcange,
	kirill.shutemov, jroedel, ubizjak

On Sat, Oct 22, 2022 at 4:48 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -258,6 +258,13 @@ static inline pte_t ptep_get(pte_t *ptep
>  }
>  #endif
>
> +#ifndef __HAVE_ARCH_PMDP_GET
> +static inline pmd_t pmdp_get(pmd_t *pmdp)
> +{
> +       return READ_ONCE(*pmdp);
> +}
> +#endif

What, what, what?

Where did that __HAVE_ARCH_PMDP_GET come from?

I'm not seeing it #define'd anywhere, and we _really_ shouldn't be
doing this any more.

Please just do

    #ifndef pmdp_get
    static inline pmd_t pmdp_get(pmd_t *pmdp)
    ..

and have the architectures that do their own pmdp_get(), just have that

   #define pmdp_get pmdp_get

to let the generic code know about it. Instead of making up a new
__HAVE_ARCH_XYZ name.
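
Spelled out, the pattern is something like this (sketch only; the arch-side
body is whatever that architecture actually needs):

    /* arch/foo/include/asm/pgtable.h -- hypothetical architecture */
    static inline pmd_t pmdp_get(pmd_t *pmdp)
    {
            /* arch-specific lockless read goes here */
            return READ_ONCE(*pmdp);
    }
    #define pmdp_get pmdp_get   /* tell the generic code it exists */

    /* include/linux/pgtable.h -- generic fallback, only used when the
     * arch header above did not define pmdp_get */
    #ifndef pmdp_get
    static inline pmd_t pmdp_get(pmd_t *pmdp)
    {
            return READ_ONCE(*pmdp);
    }
    #endif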

That "use the same name for testing" pattern means that it shows up
much nicer when grepping for "where does this come from", but also
means that you really never need to make up new names for "does this
exist".

             Linus


* Re: [PATCH 09/13] x86/mm/pae: Use WRITE_ONCE()
  2022-10-22 11:14 ` [PATCH 09/13] x86/mm/pae: Use WRITE_ONCE() Peter Zijlstra
@ 2022-10-22 17:42   ` Linus Torvalds
  2022-10-24 10:21     ` Peter Zijlstra
  0 siblings, 1 reply; 148+ messages in thread
From: Linus Torvalds @ 2022-10-22 17:42 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: x86, willy, akpm, linux-kernel, linux-mm, aarcange,
	kirill.shutemov, jroedel, ubizjak

On Sat, Oct 22, 2022 at 4:48 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
>  static inline void native_set_pte(pte_t *ptep, pte_t pte)
>  {
> -       ptep->pte_high = pte.pte_high;
> +       WRITE_ONCE(ptep->pte_high, pte.pte_high);
>         smp_wmb();
> -       ptep->pte_low = pte.pte_low;
> +       WRITE_ONCE(ptep->pte_low, pte.pte_low);

With this, the smp_wmb() should just go away too. It was really only
ever there as a compiler barrier.

Two WRITE_ONCE() statements are inherently ordered for the compiler
(due to volatile rules), and x86 doesn't re-order writes.

It's not a big deal, since smp_wmb() is just a barrier() on x86-64
anyway, but it might make some improvement to code generation to
remove it, and the smp_wmb() really isn't adding anything.

If somebody likes the smp_wmb() as a comment, I think it would be
better to actually _make_ it a comment, and have these functions turn
into just

  /* Force ordered word-sized writes, set low word with present bit last */
  static inline void native_set_pte(pte_t *ptep, pte_t pte)
  {
        WRITE_ONCE(ptep->pte_high, pte.pte_high);
        WRITE_ONCE(ptep->pte_low, pte.pte_low);
  }

or similar. I think that kind of one-liner comment is much more
informative than a "smp_wmb()".

Or do we already have a comment elsewhere about why the ordering is
important (and how *clearing* clears the low word with the present bit
first, but setting a *new* entry sets the high word first so that the
64-bit entry is complete when the present bit is set?)
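
The clear side of that same idea would then look something like this
(sketch; whether dropping the smp_wmb() is actually safe is exactly the
question being raised here):

   /* Clear the low word (which carries the present bit) first, then the
    * high word, so a lockless reader never sees the present bit paired
    * with a half-cleared high word. */
   static inline void native_pte_clear(struct mm_struct *mm, unsigned long addr,
                                       pte_t *ptep)
   {
         WRITE_ONCE(ptep->pte_low, 0);
         WRITE_ONCE(ptep->pte_high, 0);
   }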

                 Linus


* Re: [PATCH 10/13] x86/mm/pae: Be consistent with pXXp_get_and_clear()
  2022-10-22 11:14 ` [PATCH 10/13] x86/mm/pae: Be consistent with pXXp_get_and_clear() Peter Zijlstra
@ 2022-10-22 17:53   ` Linus Torvalds
  2022-10-24 11:13     ` Peter Zijlstra
  0 siblings, 1 reply; 148+ messages in thread
From: Linus Torvalds @ 2022-10-22 17:53 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: x86, willy, akpm, linux-kernel, linux-mm, aarcange,
	kirill.shutemov, jroedel, ubizjak

On Sat, Oct 22, 2022 at 4:48 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> +
> +#define pxx_xchg64(_pxx, _ptr, _val) ({                                        \
> +       _pxx##val_t *_p = (_pxx##val_t *)_ptr;                          \
> +       _pxx##val_t _o = *_p;                                           \
> +       do { } while (!try_cmpxchg64(_p, &_o, (_val)));                 \
> +       native_make_##_pxx(_o);                                         \
> +})

I think this could just be a "xchg64()", but if the pte/pmd code is
the only thing that actually wants this on 32-bit architectures, I'm
certainly ok with making it be specific to just this code, and calling
it "pxx_xchg()".

I wonder if there's some driver somewhere that wanted to use it, but
just made it be

        depends on CONFIG_64BIT

instead, or made it use a cmpxchg64() loop because a plain xchg() didn't work.

I guess it really doesn't matter, with 32-bit being relegated to
legacy status anyway. No need to try to expand usage.

                 Linus


* Re: [PATCH 11/13] x86_64: Remove pointless set_64bit() usage
  2022-10-22 11:14 ` [PATCH 11/13] x86_64: Remove pointless set_64bit() usage Peter Zijlstra
@ 2022-10-22 17:55   ` Linus Torvalds
  2022-11-03 19:09   ` Nathan Chancellor
  1 sibling, 0 replies; 148+ messages in thread
From: Linus Torvalds @ 2022-10-22 17:55 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: x86, willy, akpm, linux-kernel, linux-mm, aarcange,
	kirill.shutemov, jroedel, ubizjak

On Sat, Oct 22, 2022 at 4:48 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> The use of set_64bit() in X86_64 only code is pretty pointless, seeing
> how it's a direct assignment. Remove all this nonsense.

Thanks. That was really confusing code, using set_64bit() in exactly
the situation where it was _not_ needed.

                  Linus


* Re: [PATCH 00/13] Clean up pmd_get_atomic() and i386-PAE
  2022-10-22 11:14 [PATCH 00/13] Clean up pmd_get_atomic() and i386-PAE Peter Zijlstra
                   ` (12 preceding siblings ...)
  2022-10-22 11:14 ` [PATCH 13/13] mm: Remove pointless barrier() after pmdp_get_lockless() Peter Zijlstra
@ 2022-10-22 17:57 ` Linus Torvalds
  2022-10-29 12:21 ` Peter Zijlstra
  14 siblings, 0 replies; 148+ messages in thread
From: Linus Torvalds @ 2022-10-22 17:57 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: x86, willy, akpm, linux-kernel, linux-mm, aarcange,
	kirill.shutemov, jroedel, ubizjak

On Sat, Oct 22, 2022 at 4:48 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> The robot doesn't hate on these patches and they boot in kvm (because who still
> has i386 hardware).

Well, I had a couple of comments, the only serious one being that odd
'__HAVE_ARCH_PMDP_GET' that I really didn't see anywhere else and
seemed actively wrong.

Other than that, it all looks good to me.

Thanks,
              Linus


* Re: [PATCH 13/13] mm: Remove pointless barrier() after pmdp_get_lockless()
  2022-10-22 11:14 ` [PATCH 13/13] mm: Remove pointless barrier() after pmdp_get_lockless() Peter Zijlstra
@ 2022-10-22 19:59   ` Yu Zhao
  0 siblings, 0 replies; 148+ messages in thread
From: Yu Zhao @ 2022-10-22 19:59 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: x86, willy, torvalds, akpm, linux-kernel, linux-mm, aarcange,
	kirill.shutemov, jroedel, ubizjak

On Sat, Oct 22, 2022 at 5:48 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> pmdp_get_lockless() should itself imply any ordering required.

There are three remaining barriers that should be removed as well.


* Re: [PATCH 07/13] mm/gup: Fix the lockless PMD access
  2022-10-22 11:14 ` [PATCH 07/13] mm/gup: Fix the lockless PMD access Peter Zijlstra
@ 2022-10-23  0:42   ` Hugh Dickins
  2022-10-24  7:42     ` Peter Zijlstra
  0 siblings, 1 reply; 148+ messages in thread
From: Hugh Dickins @ 2022-10-23  0:42 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Will Deacon, x86, willy, torvalds, akpm, linux-kernel, linux-mm,
	aarcange, kirill.shutemov, jroedel, ubizjak

On Sat, 22 Oct 2022, Peter Zijlstra wrote:

> On architectures where the PTE/PMD is larger than the native word size
> (i386-PAE for example), READ_ONCE() can do the wrong thing. Use
> pmdp_get_lockless() just like we use ptep_get_lockless().

I thought that was something Will Deacon put a lot of effort
into handling around 5.8 and 5.9: see "strong prevailing wind" in
include/asm-generic/rwonce.h, formerly in include/linux/compiler.h.

Was it too optimistic?  Did the wind drop?

I'm interested in the answer, but I've certainly no objection
to making this all more obviously robust - thanks.

Hugh

> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  kernel/events/core.c |    2 +-
>  mm/gup.c             |    2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
> 
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -7186,7 +7186,7 @@ static u64 perf_get_pgtable_size(struct
>  		return pud_leaf_size(pud);
>  
>  	pmdp = pmd_offset_lockless(pudp, pud, addr);
> -	pmd = READ_ONCE(*pmdp);
> +	pmd = pmdp_get_lockless(pmdp);
>  	if (!pmd_present(pmd))
>  		return 0;
>  
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -2507,7 +2507,7 @@ static int gup_pmd_range(pud_t *pudp, pu
>  
>  	pmdp = pmd_offset_lockless(pudp, pud, addr);
>  	do {
> -		pmd_t pmd = READ_ONCE(*pmdp);
> +		pmd_t pmd = pmdp_get_lockless(pmdp);
>  
>  		next = pmd_addr_end(addr, end);
>  		if (!pmd_present(pmd))


* Re: [PATCH 01/13] mm: Update ptep_get_lockless()s comment
  2022-10-22 11:14 ` [PATCH 01/13] mm: Update ptep_get_lockless()s comment Peter Zijlstra
@ 2022-10-24  5:42   ` John Hubbard
  2022-10-24  8:00     ` Peter Zijlstra
  0 siblings, 1 reply; 148+ messages in thread
From: John Hubbard @ 2022-10-24  5:42 UTC (permalink / raw)
  To: Peter Zijlstra, x86, willy, torvalds, akpm, Jann Horn
  Cc: linux-kernel, linux-mm, aarcange, kirill.shutemov, jroedel, ubizjak

On 10/22/22 04:14, Peter Zijlstra wrote:
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -260,15 +260,12 @@ static inline pte_t ptep_get(pte_t *ptep
>  
>  #ifdef CONFIG_GUP_GET_PTE_LOW_HIGH
>  /*
> - * WARNING: only to be used in the get_user_pages_fast() implementation.
> - *
> - * With get_user_pages_fast(), we walk down the pagetables without taking any
> - * locks.  For this we would like to load the pointers atomically, but sometimes
> - * that is not possible (e.g. without expensive cmpxchg8b on x86_32 PAE).  What
> - * we do have is the guarantee that a PTE will only either go from not present
> - * to present, or present to not present or both -- it will not switch to a
> - * completely different present page without a TLB flush in between; something
> - * that we are blocking by holding interrupts off.
> + * For walking the pagetables without holding any locks.  Some architectures
> + * (eg x86-32 PAE) cannot load the entries atomically without using expensive
> + * instructions.  We are guaranteed that a PTE will only either go from not
> + * present to present, or present to not present -- it will not switch to a
> + * completely different present page without a TLB flush inbetween; which we
> + * are blocking by holding interrupts off.


This is getting interesting. My latest understanding of this story is
that both the "before" and "after" versions of that comment are
incorrect! Because, as Jann Horn noticed recently [1], there might not
be any IPIs involved in a TLB flush, if x86 is running under a
hypervisor, and that breaks the chain of reasoning here.


[1] https://lore.kernel.org/all/CAG48ez3h-mnp9ZFC10v+-BW_8NQvxbwBsMYJFP8JX31o0B17Pg@mail.gmail.com/


thanks,
-- 
John Hubbard
NVIDIA



* Re: [PATCH 07/13] mm/gup: Fix the lockless PMD access
  2022-10-23  0:42   ` Hugh Dickins
@ 2022-10-24  7:42     ` Peter Zijlstra
  2022-10-25  3:58       ` Hugh Dickins
  0 siblings, 1 reply; 148+ messages in thread
From: Peter Zijlstra @ 2022-10-24  7:42 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Will Deacon, x86, willy, torvalds, akpm, linux-kernel, linux-mm,
	aarcange, kirill.shutemov, jroedel, ubizjak

On Sat, Oct 22, 2022 at 05:42:18PM -0700, Hugh Dickins wrote:
> On Sat, 22 Oct 2022, Peter Zijlstra wrote:
> 
> > On architectures where the PTE/PMD is larger than the native word size
> > (i386-PAE for example), READ_ONCE() can do the wrong thing. Use
> > pmdp_get_lockless() just like we use ptep_get_lockless().
> 
> I thought that was something Will Deacon put a lot of effort
> into handling around 5.8 and 5.9: see "strong prevailing wind" in
> include/asm-generic/rwonce.h, formerly in include/linux/compiler.h.
> 
> Was it too optimistic?  Did the wind drop?
> 
> I'm interested in the answer, but I've certainly no objection
> to making this all more obviously robust - thanks.

READ_ONCE() can't do what the hardware can't do. There is absolutely no
way i386 can do an atomic 64bit load without resorting to cmpxchg8b.

Also see the comment that goes with compiletime_assert_rwonce_type(). It
explicitly allows 64bit because there's just too much stuff that does
that (and there's actually 32bit hardware that *can* do it).

But it's still very wrong.
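
Concretely, on i386-PAE a 64-bit READ_ONCE() degrades into two 32-bit
loads, so a concurrent writer can be observed half-way. The series instead
re-reads until the low word (the one carrying the present bit) is stable;
a simplified sketch of pmdp_get_lockless() from patch #4:

	static inline pmd_t pmdp_get_lockless_sketch(pmd_t *pmdp)
	{
		pmd_t pmd;

		/* A plain *pmdp could pair an old low word with a new
		 * high word (or vice versa); retry until the low word
		 * is stable around the high-word load. */
		do {
			pmd.pmd_low = pmdp->pmd_low;
			smp_rmb();
			pmd.pmd_high = pmdp->pmd_high;
			smp_rmb();
		} while (unlikely(pmd.pmd_low != pmdp->pmd_low));

		return pmd;
	}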


* Re: [PATCH 01/13] mm: Update ptep_get_lockless()s comment
  2022-10-24  5:42   ` John Hubbard
@ 2022-10-24  8:00     ` Peter Zijlstra
  2022-10-24 19:58       ` Jann Horn
  0 siblings, 1 reply; 148+ messages in thread
From: Peter Zijlstra @ 2022-10-24  8:00 UTC (permalink / raw)
  To: John Hubbard
  Cc: x86, willy, torvalds, akpm, Jann Horn, linux-kernel, linux-mm,
	aarcange, kirill.shutemov, jroedel, ubizjak

On Sun, Oct 23, 2022 at 10:42:49PM -0700, John Hubbard wrote:
> On 10/22/22 04:14, Peter Zijlstra wrote:
> > --- a/include/linux/pgtable.h
> > +++ b/include/linux/pgtable.h
> > @@ -260,15 +260,12 @@ static inline pte_t ptep_get(pte_t *ptep
> >  
> >  #ifdef CONFIG_GUP_GET_PTE_LOW_HIGH
> >  /*
> > - * WARNING: only to be used in the get_user_pages_fast() implementation.
> > - *
> > - * With get_user_pages_fast(), we walk down the pagetables without taking any
> > - * locks.  For this we would like to load the pointers atomically, but sometimes
> > - * that is not possible (e.g. without expensive cmpxchg8b on x86_32 PAE).  What
> > - * we do have is the guarantee that a PTE will only either go from not present
> > - * to present, or present to not present or both -- it will not switch to a
> > - * completely different present page without a TLB flush in between; something
> > - * that we are blocking by holding interrupts off.
> > + * For walking the pagetables without holding any locks.  Some architectures
> > + * (eg x86-32 PAE) cannot load the entries atomically without using expensive
> > + * instructions.  We are guaranteed that a PTE will only either go from not
> > + * present to present, or present to not present -- it will not switch to a
> > + * completely different present page without a TLB flush in between; which we
> > + * are blocking by holding interrupts off.
> 
> 
> This is getting interesting. My latest understanding of this story is
> that both the "before" and "after" versions of that comment are
> incorrect! Because, as Jann Horn noticed recently [1], there might not
> be any IPIs involved in a TLB flush, if x86 is running under a
> hypervisor, and that breaks the chain of reasoning here.

That mail doesn't really include enough detail. The way x86 HV TLB
flushing is supposed to work is by making use of
MMU_GATHER_RCU_TABLE_FREE. Specifically, something like:


	vCPU0				vCPU1

					tlb_gather_mmu(&tlb, mm);

					....

	local_irq_disable();
	... starts page-table walk ...

	<schedules out; sets KVM_VCPU_PREEMPTED>

					tlb_finish_mmu(&tlb)
					  ...
					  kvm_flush_tlb_multi()
					    if (state & KVM_VCPU_PREEMPTED)
					      if (try_cmpxchg(,&state, state | KVM_VCPU_FLUSH_TLB))
						__cpumask_clear_cpu(cpu, flushmask);


					  tlb_remove_table_sync_one() / call_rcu()


	<schedules back in>

	... continues page-table walk ...
	local_irq_enable();

If mmu gather is forced into tlb_remove_table_sync_one() (by memory
pressure), then you've got your IPI back; otherwise it does call_rcu()
and RCU itself will need vCPU0 to enable IRQs in order to make progress.

Either way around, the actual freeing of the pages is delayed until the
page-table walk is finished.
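
For reference, the free path in question is tlb_remove_table() in
mm/mmu_gather.c; heavily simplified below, with the batching plumbing
hidden behind made-up helpers (batch_has_room()/queue_in_batch() do not
exist under those names):

void tlb_remove_table(struct mmu_gather *tlb, void *table)
{
	if (!batch_has_room(tlb)) {
		/* no batch page available: fall back to the IPI */
		tlb_remove_table_sync_one();	/* waits out IRQs-off walkers */
		__tlb_remove_table(table);	/* and only then free it */
		return;
	}
	queue_in_batch(tlb, table);		/* freed later from the call_rcu() callback */
}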

What am I missing?

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 04/13] mm: Fix pmd_read_atomic()
  2022-10-22 17:30   ` Linus Torvalds
@ 2022-10-24  8:09     ` Peter Zijlstra
  2022-11-01 12:41     ` Peter Zijlstra
  1 sibling, 0 replies; 148+ messages in thread
From: Peter Zijlstra @ 2022-10-24  8:09 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: x86, willy, akpm, linux-kernel, linux-mm, aarcange,
	kirill.shutemov, jroedel, ubizjak

On Sat, Oct 22, 2022 at 10:30:51AM -0700, Linus Torvalds wrote:
> On Sat, Oct 22, 2022 at 4:48 AM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > --- a/include/linux/pgtable.h
> > +++ b/include/linux/pgtable.h
> > @@ -258,6 +258,13 @@ static inline pte_t ptep_get(pte_t *ptep
> >  }
> >  #endif
> >
> > +#ifndef __HAVE_ARCH_PMDP_GET
> > +static inline pmd_t pmdp_get(pmd_t *pmdp)
> > +{
> > +       return READ_ONCE(*pmdp);
> > +}
> > +#endif
> 
> What, what, what?
> 
> Where did that __HAVE_ARCH_PMDP_GET come from?

Copy/paste from ptep_get(), which has __HAVE_ARCH_PTEP_GET (and which
does appear to get used, once).

Do I break the pattern and simply leave this off, or do I stay
consistent even though we hate it a little? ;-)

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 09/13] x86/mm/pae: Use WRITE_ONCE()
  2022-10-22 17:42   ` Linus Torvalds
@ 2022-10-24 10:21     ` Peter Zijlstra
  0 siblings, 0 replies; 148+ messages in thread
From: Peter Zijlstra @ 2022-10-24 10:21 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: x86, willy, akpm, linux-kernel, linux-mm, aarcange,
	kirill.shutemov, jroedel, ubizjak

On Sat, Oct 22, 2022 at 10:42:52AM -0700, Linus Torvalds wrote:
> On Sat, Oct 22, 2022 at 4:48 AM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> >  static inline void native_set_pte(pte_t *ptep, pte_t pte)
> >  {
> > -       ptep->pte_high = pte.pte_high;
> > +       WRITE_ONCE(ptep->pte_high, pte.pte_high);
> >         smp_wmb();
> > -       ptep->pte_low = pte.pte_low;
> > +       WRITE_ONCE(ptep->pte_low, pte.pte_low);
> 
> With this, the smp_wmb() should just go away too. It was really only
> ever there as a compiler barrier.

Right; however, I find it easier to reason about this with the smp_wmb()
there, esp. since the counterpart is in generic code and carries (indeed
must carry) those smp_rmb()s.

Still, I can take them out if you prefer.

> Or do we already have a comment elsewhere about why the ordering is
> important (and how *clearing* clears the low word with the present bit
> first, but setting a *new* entry sets the high word first so that the
> 64-bit entry is complete when the present bit is set?)

There's a comment in include/linux/pgtable.h near ptep_get_lockless().

Now, I've been on the fence about making those READ_ONCE(); I think
KCSAN would want that, but I think the code is correct without them:
even if the loads get torn, we rely on the equality of the first and
third load, and the barriers then guarantee the second load is coherent.

OTOH, if the stores (this patch) go funny and get torn, bad things can
happen: imagine it writing the byte with the present bit in first and
then the other bytes (because the compiler is an evil bastard and wants a
giggle).
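
For context, the two orderings pair up roughly like this -- a simplified
sketch of what the PAE helpers look like with this patch applied, not the
literal hunks:

static inline void native_set_pte(pte_t *ptep, pte_t pte)
{
	WRITE_ONCE(ptep->pte_high, pte.pte_high);	/* new high half first ... */
	smp_wmb();
	WRITE_ONCE(ptep->pte_low, pte.pte_low);		/* ... present bit goes live last */
}

static inline void native_pte_clear(struct mm_struct *mm, unsigned long addr,
				    pte_t *ptep)
{
	WRITE_ONCE(ptep->pte_low, 0);			/* kill the present bit first ... */
	smp_wmb();
	WRITE_ONCE(ptep->pte_high, 0);			/* ... then the rest */
}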



^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 10/13] x86/mm/pae: Be consistent with pXXp_get_and_clear()
  2022-10-22 17:53   ` Linus Torvalds
@ 2022-10-24 11:13     ` Peter Zijlstra
  0 siblings, 0 replies; 148+ messages in thread
From: Peter Zijlstra @ 2022-10-24 11:13 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: x86, willy, akpm, linux-kernel, linux-mm, aarcange,
	kirill.shutemov, jroedel, ubizjak

On Sat, Oct 22, 2022 at 10:53:42AM -0700, Linus Torvalds wrote:
> On Sat, Oct 22, 2022 at 4:48 AM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > +
> > +#define pxx_xchg64(_pxx, _ptr, _val) ({                                        \
> > +       _pxx##val_t *_p = (_pxx##val_t *)_ptr;                          \
> > +       _pxx##val_t _o = *_p;                                           \
> > +       do { } while (!try_cmpxchg64(_p, &_o, (_val)));                 \
> > +       native_make_##_pxx(_o);                                         \
> > +})
> 
> I think this could just be a "xchp64()", but if the pte/pmd code is
> the only thing that actually wants this on 32-bit architectures, I'm
> certainly ok with making it be specific to just this code, and calling
> it "pxx_xchg()".

Regular xchg64() didn't work; the casting crud there is required because
pxx_t is a struct.

Now I could obviously do a xchg64(), but then we'd still need this
wrapper -- and yeah, I don't know how many other users there are.
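
For illustration, a typical user ends up looking something like this
(a sketch, not the exact hunk from the patch):

/* atomically read-and-clear a PAE pmd via the macro above */
static inline pmd_t native_pmdp_get_and_clear(pmd_t *pmdp)
{
	return pxx_xchg64(pmd, pmdp, 0ULL);
}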

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 01/13] mm: Update ptep_get_lockless()s comment
  2022-10-24  8:00     ` Peter Zijlstra
@ 2022-10-24 19:58       ` Jann Horn
  2022-10-24 20:19         ` Linus Torvalds
  2022-10-25 14:02         ` Peter Zijlstra
  0 siblings, 2 replies; 148+ messages in thread
From: Jann Horn @ 2022-10-24 19:58 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: John Hubbard, x86, willy, torvalds, akpm, linux-kernel, linux-mm,
	aarcange, kirill.shutemov, jroedel, ubizjak

On Mon, Oct 24, 2022 at 10:01 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Sun, Oct 23, 2022 at 10:42:49PM -0700, John Hubbard wrote:
> > On 10/22/22 04:14, Peter Zijlstra wrote:
> > > --- a/include/linux/pgtable.h
> > > +++ b/include/linux/pgtable.h
> > > @@ -260,15 +260,12 @@ static inline pte_t ptep_get(pte_t *ptep
> > >
> > >  #ifdef CONFIG_GUP_GET_PTE_LOW_HIGH
> > >  /*
> > > - * WARNING: only to be used in the get_user_pages_fast() implementation.
> > > - *
> > > - * With get_user_pages_fast(), we walk down the pagetables without taking any
> > > - * locks.  For this we would like to load the pointers atomically, but sometimes
> > > - * that is not possible (e.g. without expensive cmpxchg8b on x86_32 PAE).  What
> > > - * we do have is the guarantee that a PTE will only either go from not present
> > > - * to present, or present to not present or both -- it will not switch to a
> > > - * completely different present page without a TLB flush in between; something
> > > - * that we are blocking by holding interrupts off.
> > > + * For walking the pagetables without holding any locks.  Some architectures
> > > + * (eg x86-32 PAE) cannot load the entries atomically without using expensive
> > > + * instructions.  We are guaranteed that a PTE will only either go from not
> > > + * present to present, or present to not present -- it will not switch to a
> > > + * completely different present page without a TLB flush in between; which we
> > > + * are blocking by holding interrupts off.
> >
> >
> > This is getting interesting. My latest understanding of this story is
> > that both the "before" and "after" versions of that comment are
> > incorrect! Because, as Jann Horn noticed recently [1], there might not
> > be any IPIs involved in a TLB flush, if x86 is running under a
> > hypervisor, and that breaks the chain of reasoning here.
>
> That mail doesn't really include enough detail. The way x86 HV TLB
> flushing is supposed to work is by making use of
> MMU_GATHER_RCU_TABLE_FREE. Specifically, something like:
>
>
>         vCPU0                           vCPU1
>
>                                         tlb_gather_mmu(&tlb, mm);
>
>                                         ....
>
>         local_irq_disable();
>         ... starts page-table walk ...
>
>         <schedules out; sets KVM_VCPU_PREEMPTED>
>
>                                         tlb_finish_mmu(&tlb)
>                                           ...
>                                           kvm_flush_tlb_multi()
>                                             if (state & KVM_VCPU_PREEMPTED)
>                                               if (try_cmpxchg(,&state, state | KVM_VCPU_FLUSH_TLB))
>                                                 __cpumask_clear_cpu(cpu, flushmask);
>
>
>                                           tlb_remove_table_sync_one() / call_rcu()
>
>
>         <schedules back in>
>
>         ... continues page-table walk ...
>         local_irq_enable();
>
> If mmu gather is forced into tlb_remove_table_sync_one() (by memory
> pressure), then you've got your IPI back; otherwise it does call_rcu()
> and RCU itself will need vCPU0 to enable IRQs in order to make progress.
>
> Either way around, the actual freeing of the pages is delayed until the
> page-table walk is finished.
>
> What am I missing?

Unless I'm completely misunderstanding what's going on here, the whole
"remove_table" thing only happens when you "remove a table", meaning
you free an entire *pagetable*. Just zapping PTEs doesn't trigger that
logic.

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 01/13] mm: Update ptep_get_lockless()s comment
  2022-10-24 19:58       ` Jann Horn
@ 2022-10-24 20:19         ` Linus Torvalds
  2022-10-24 20:23           ` Jann Horn
  2022-10-25 14:02         ` Peter Zijlstra
  1 sibling, 1 reply; 148+ messages in thread
From: Linus Torvalds @ 2022-10-24 20:19 UTC (permalink / raw)
  To: Jann Horn
  Cc: Peter Zijlstra, John Hubbard, x86, willy, akpm, linux-kernel,
	linux-mm, aarcange, kirill.shutemov, jroedel, ubizjak

On Mon, Oct 24, 2022 at 12:58 PM Jann Horn <jannh@google.com> wrote:
>
> Unless I'm completely misunderstanding what's going on here, the whole
> "remove_table" thing only happens when you "remove a table", meaning
> you free an entire *pagetable*. Just zapping PTEs doesn't trigger that
> logic.

I do have to admit that I'd be happier if this code - and the GUP code
that also relies on "interrupts off" behavior - would just use a
sequence counter instead.
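
Something like the usual seqcount retry pattern, in other words (purely
illustrative -- 'mm_walk_seq' is a made-up field, nothing the kernel
actually has; writers would bump it under the pte_lock):

	unsigned int seq;
	pte_t pte;

	do {
		seq = read_seqcount_begin(&mm->mm_walk_seq);
		pte = ptep_get_lockless(ptep);
		/* ... speculative use of pte ... */
	} while (read_seqcount_retry(&mm->mm_walk_seq, seq));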

Relying on blocking IPI's is clever, but also clearly very subtle and
somewhat dangerous.

I think our GUP code is a *lot* more important than some "legacy
x86-32 has problems in case you have an incredibly unlikely race that
re-populates the page table with a different page that just happens to
be exactly the same MOD-4GB", so honestly, I don't think the
load-tearing is even worth worrying about - if you have hardware that
is good enough at virtualizing things, it's almost certainly already
64-bit, and running 32-bit virtual machines with PAE you really only
have yourself to blame.

So I can't find it in myself to care about the 32-bit tearing thing,
but this discussion makes me worried about Fast GUP.

Note that even with proper atomic

                pte_t pte = ptep_get_lockless(ptep);

in gup_pte_range(), and even if the page tables are RCU-free'd, that
just means that the 'ptep' access itself is safe.

But then you have the whole "the lookup of the page pointer is not
atomic" wrt that. And right now that GUP code does rely on the "block
IPI" to make it basically valid.

I don't think it matters if GUP races with munmap or madvise() or
something like that - if you get the old page, that's still a valid
page, and the user only has himself to blame.

But if we have memory pressure that causes vmscan to push out a page,
and it gets replaced with a new page, and GUP gets the old page with
no serialization, that sounds like a possible source of data
inconsistency.

I don't know if this can happen, but the whole "interrupts disabled
doesn't actually block IPI's and synchronize with TLB flushes" really
sounds like it would affect GUP too. And be much more serious there
than on some x86-32 platform that nobody should be using anyway.

               Linus

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 01/13] mm: Update ptep_get_lockless()s comment
  2022-10-24 20:19         ` Linus Torvalds
@ 2022-10-24 20:23           ` Jann Horn
  2022-10-24 20:36             ` Linus Torvalds
  2022-10-25  3:21             ` Matthew Wilcox
  0 siblings, 2 replies; 148+ messages in thread
From: Jann Horn @ 2022-10-24 20:23 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, John Hubbard, x86, willy, akpm, linux-kernel,
	linux-mm, aarcange, kirill.shutemov, jroedel, ubizjak

On Mon, Oct 24, 2022 at 10:19 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Mon, Oct 24, 2022 at 12:58 PM Jann Horn <jannh@google.com> wrote:
> >
> > Unless I'm completely misunderstanding what's going on here, the whole
> > "remove_table" thing only happens when you "remove a table", meaning
> > you free an entire *pagetable*. Just zapping PTEs doesn't trigger that
> > logic.
>
> I do have to admit that I'd be happier if this code - and the GUP code
> that also relies on "interrupts off" behavior - would just use a
> sequence counter instead.
>
> Relying on blocking IPI's is clever, but also clearly very subtle and
> somewhat dangerous.
>
> I think our GUP code is a *lot* more important than some "legacy
> x86-32 has problems in case you have an incredibly unlikely race that
> re-populates the page table with a different page that just happens to
> be exactly the same MOD-4GB", so honestly, I don't think the
> load-tearing is even worth worrying about - if you have hardware that
> is good enough at virtualizing things, it's almost certainly already
> 64-bit, and running 32-bit virtual machines with PAE you really only
> have yourself to blame.
>
> So I can't find it in myself to care about the 32-bit tearing thing,
> but this discussion makes me worried about Fast GUP.
>
> Note that even with proper atomic
>
>                 pte_t pte = ptep_get_lockless(ptep);
>
> in gup_pte_range(), and even if the page tables are RCU-free'd, that
> just means that the 'ptep' access itself is safe.
>
> But then you have the whole "the lookup of the page pointer is not
> atomic" wrt that. And right now that GUP code does rely on the "block
> IPI" to make it basically valid.
>
> I don't think it matters if GUP races with munmap or madvise() or
> something like that - if you get the old page, that's still a valid
> page, and the user only has himself to blame.
>
> But if we have memory pressure that causes vmscan to push out a page,
> and it gets replaced with a new page, and GUP gets the old page with
> no serialization, that sounds like a possible source of data
> inconsistency.
>
> I don't know if this can happen, but the whole "interrupts disabled
> doesn't actually block IPI's and synchronize with TLB flushes" really
> sounds like it would affect GUP too. And be much more serious there
> than on some x86-32 platform that nobody should be using anyway.

That's why GUP-fast re-checks the PTE after it has grabbed a reference
to the page. Pasting from a longer writeup that I plan to publish
soon, describing how gup_pte_range() works:

"""
This guarantees that the page tables that are being walked
aren't freed concurrently, but at the end of the walk, we
have to grab a stable reference to the referenced page.
For this we use the grab-reference-and-revalidate trick
from above again: First we (locklessly) load the page
table entry, then we grab a reference to the page that it
points to (which can fail if the refcount is zero, in that
case we bail), then we recheck that the page table entry
is still the same, and if it changed in between, we drop the
page reference and bail.
This can, again, grab a reference to a page after it has
already been freed and reallocated. The reason why this is
fine is that the metadata structure that holds this refcount,
`struct folio` (or `struct page`, depending on which kernel
version you're looking at; in current kernels it's `folio`
but `struct page` and `struct folio` are actually aliases for
the same memory, basically, though that is supposed to maybe
change at some point) is never freed; even when a page is
freed and reallocated, the corresponding `struct folio`
stays. This does have the fun consequence that whenever a
page/folio has a non-zero refcount, the refcount can
spuriously go up and then back down for a little bit.
(Also it's technically not as simple as I just described it,
because the `struct page` that the PTE points to might be
a "tail page" of a `struct folio`.
So actually we first read the PTE, the PTE gives us the
`page*`, then from that we go to the `folio*`, then we
try to grab a reference to the `folio`, then if that worked
we check that the `page` still points to the same `folio`,
and then we recheck that the PTE is still the same.)
"""

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 01/13] mm: Update ptep_get_lockless()s comment
  2022-10-24 20:23           ` Jann Horn
@ 2022-10-24 20:36             ` Linus Torvalds
  2022-10-25  3:21             ` Matthew Wilcox
  1 sibling, 0 replies; 148+ messages in thread
From: Linus Torvalds @ 2022-10-24 20:36 UTC (permalink / raw)
  To: Jann Horn
  Cc: Peter Zijlstra, John Hubbard, x86, willy, akpm, linux-kernel,
	linux-mm, aarcange, kirill.shutemov, jroedel, ubizjak

On Mon, Oct 24, 2022 at 1:24 PM Jann Horn <jannh@google.com> wrote:
>
> That's why GUP-fast re-checks the PTE after it has grabbed a reference
> to the page.

Bah. I should have known that. We got that through the PPC version
that never had the whole IPI serialization thing (just the RCU
freeing).

I haven't looked at that code in much too long.

Actually, by "much too long" I really mean "thankfully I haven't needed to" ;)

                 Linus

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 01/13] mm: Update ptep_get_lockless()s comment
  2022-10-24 20:23           ` Jann Horn
  2022-10-24 20:36             ` Linus Torvalds
@ 2022-10-25  3:21             ` Matthew Wilcox
  2022-10-25  7:54               ` Alistair Popple
  1 sibling, 1 reply; 148+ messages in thread
From: Matthew Wilcox @ 2022-10-25  3:21 UTC (permalink / raw)
  To: Jann Horn
  Cc: Linus Torvalds, Peter Zijlstra, John Hubbard, x86, akpm,
	linux-kernel, linux-mm, aarcange, kirill.shutemov, jroedel,
	ubizjak

On Mon, Oct 24, 2022 at 10:23:51PM +0200, Jann Horn wrote:
> """
> This guarantees that the page tables that are being walked
> aren't freed concurrently, but at the end of the walk, we
> have to grab a stable reference to the referenced page.
> For this we use the grab-reference-and-revalidate trick
> from above again:
> First we (locklessly) load the page
> table entry, then we grab a reference to the page that it
> points to (which can fail if the refcount is zero, in that
> case we bail), then we recheck that the page table entry
> is still the same, and if it changed in between, we drop the
> page reference and bail.
> This can, again, grab a reference to a page after it has
> already been freed and reallocated. The reason why this is
> fine is that the metadata structure that holds this refcount,
> `struct folio` (or `struct page`, depending on which kernel
> version you're looking at; in current kernels it's `folio`
> but `struct page` and `struct folio` are actually aliases for
> the same memory, basically, though that is supposed to maybe
> change at some point) is never freed; even when a page is
> freed and reallocated, the corresponding `struct folio`
> stays. This does have the fun consequence that whenever a
> page/folio has a non-zero refcount, the refcount can
> spuriously go up and then back down for a little bit.
> (Also it's technically not as simple as I just described it,
> because the `struct page` that the PTE points to might be
> a "tail page" of a `struct folio`.
> So actually we first read the PTE, the PTE gives us the
> `page*`, then from that we go to the `folio*`, then we
> try to grab a reference to the `folio`, then if that worked
> we check that the `page` still points to the same `folio`,
> and then we recheck that the PTE is still the same.)
> """

Nngh.  In trying to make this description fit all kernels (with
both pages and folios), you've complicated it maximally.  Let's
try a more simple explanation:

First we (locklessly) load the page table entry, then we grab a
reference to the folio that contains it (which can fail if the
refcount is zero, in that case we bail), then we recheck that the
page table entry is still the same, and if it changed in between,
we drop the folio reference and bail.
This can, again, grab a reference to a folio after it has
already been freed and reallocated. The reason why this is
fine is that the metadata structure that holds this refcount,
`struct folio` is never freed; even when a folio is
freed and reallocated, the corresponding `struct folio`
stays. This does have the fun consequence that whenever a
folio has a non-zero refcount, the refcount can
spuriously go up and then back down for a little bit.
(Also it's slightly more complex than I just described,
because the page that the PTE we just loaded points to might be in
the middle of being reallocated into a different folio.
So actually we first read the PTE, translate the PTE into the
`page*`, then from that we go to the `folio*`, then we
try to grab a reference to the `folio`, then if that worked
we check that the `page` is still in the same `folio`,
and then we recheck that the PTE is still the same.  Older kernels
did not make a clear distinction between pages and folios, so
it was even more confusing.)


Better?

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 07/13] mm/gup: Fix the lockless PMD access
  2022-10-24  7:42     ` Peter Zijlstra
@ 2022-10-25  3:58       ` Hugh Dickins
  0 siblings, 0 replies; 148+ messages in thread
From: Hugh Dickins @ 2022-10-25  3:58 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Hugh Dickins, Will Deacon, x86, willy, torvalds, akpm,
	linux-kernel, linux-mm, aarcange, kirill.shutemov, jroedel,
	ubizjak

On Mon, 24 Oct 2022, Peter Zijlstra wrote:
> On Sat, Oct 22, 2022 at 05:42:18PM -0700, Hugh Dickins wrote:
> > On Sat, 22 Oct 2022, Peter Zijlstra wrote:
> > 
> > > On architectures where the PTE/PMD is larger than the native word size
> > > (i386-PAE for example), READ_ONCE() can do the wrong thing. Use
> > > pmdp_get_lockless() just like we use ptep_get_lockless().
> > 
> > I thought that was something Will Deacon put a lot of effort
> > into handling around 5.8 and 5.9: see "strong prevailing wind" in
> > include/asm-generic/rwonce.h, formerly in include/linux/compiler.h.
> > 
> > Was it too optimistic?  Did the wind drop?
> > 
> > I'm interested in the answer, but I've certainly no objection
> > to making this all more obviously robust - thanks.
> 
> READ_ONCE() can't do what the hardware can't do. There is absolutely no
> way i386 can do an atomic 64bit load without resorting to cmpxchg8b.

Right.

> 
> Also see the comment that goes with compiletime_assert_rwonce_type(). It
> explicitly allows 64bit because there's just too much stuff that does
> that (and there's actually 32bit hardware that *can* do it).

Yes, the "strong prevailing wind" comment. I think I've never read that
carefully enough, until you redirected me back there: it is in fact
quite clear that it's only *atomic* in the Armv7 + LPAE case; but
READ_ONCEy (READ_EACH_HALF_ONCE I guess) for other 64-on-32 cases.

> 
> But it's still very wrong.

Somewhat clearer to me now, thanks.

Hugh

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 01/13] mm: Update ptep_get_lockless()s comment
  2022-10-25  3:21             ` Matthew Wilcox
@ 2022-10-25  7:54               ` Alistair Popple
  2022-10-25 13:33                 ` Peter Zijlstra
  2022-10-25 13:44                 ` Jann Horn
  0 siblings, 2 replies; 148+ messages in thread
From: Alistair Popple @ 2022-10-25  7:54 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Jann Horn, Linus Torvalds, Peter Zijlstra, John Hubbard, x86,
	akpm, linux-kernel, linux-mm, aarcange, kirill.shutemov, jroedel,
	ubizjak


Matthew Wilcox <willy@infradead.org> writes:

> On Mon, Oct 24, 2022 at 10:23:51PM +0200, Jann Horn wrote:
>> """
>> This guarantees that the page tables that are being walked
>> aren't freed concurrently, but at the end of the walk, we
>> have to grab a stable reference to the referenced page.
>> For this we use the grab-reference-and-revalidate trick
>> from above again:
>> First we (locklessly) load the page
>> table entry, then we grab a reference to the page that it
>> points to (which can fail if the refcount is zero, in that
>> case we bail), then we recheck that the page table entry
>> is still the same, and if it changed in between, we drop the
>> page reference and bail.
>> This can, again, grab a reference to a page after it has
>> already been freed and reallocated. The reason why this is
>> fine is that the metadata structure that holds this refcount,
>> `struct folio` (or `struct page`, depending on which kernel
>> version you're looking at; in current kernels it's `folio`
>> but `struct page` and `struct folio` are actually aliases for
>> the same memory, basically, though that is supposed to maybe
>> change at some point) is never freed; even when a page is
>> freed and reallocated, the corresponding `struct folio`
>> stays. This does have the fun consequence that whenever a
>> page/folio has a non-zero refcount, the refcount can
>> spuriously go up and then back down for a little bit.
>> (Also it's technically not as simple as I just described it,
>> because the `struct page` that the PTE points to might be
>> a "tail page" of a `struct folio`.
>> So actually we first read the PTE, the PTE gives us the
>> `page*`, then from that we go to the `folio*`, then we
>> try to grab a reference to the `folio`, then if that worked
>> we check that the `page` still points to the same `folio`,
>> and then we recheck that the PTE is still the same.)
>> """
>
> Nngh.  In trying to make this description fit all kernels (with
> both pages and folios), you've complicated it maximally.  Let's
> try a more simple explanation:
>
> First we (locklessly) load the page table entry, then we grab a
> reference to the folio that contains it (which can fail if the
> refcount is zero, in that case we bail), then we recheck that the
> page table entry is still the same, and if it changed in between,
> we drop the folio reference and bail.
> This can, again, grab a reference to a folio after it has
> already been freed and reallocated. The reason why this is
> fine is that the metadata structure that holds this refcount,
> `struct folio` is never freed; even when a folio is
> freed and reallocated, the corresponding `struct folio`
> stays.

I'm probably missing something obvious but how is that synchronised
against memory hotplug? AFAICT if it isn't couldn't the pages be freed
and memory removed? In that case the above would no longer hold because
(I think) the metadata structure could have been freed.

> This does have the fun consequence that whenever a
> folio has a non-zero refcount, the refcount can
> spuriously go up and then back down for a little bit.
> (Also it's slightly more complex than I just described,
> because the page that the PTE we just loaded points to might be in
> the middle of being reallocated into a different folio.
> So actually we first read the PTE, translate the PTE into the
> `page*`, then from that we go to the `folio*`, then we
> try to grab a reference to the `folio`, then if that worked
> we check that the `page` is still in the same `folio`,
> and then we recheck that the PTE is still the same.  Older kernels
> did not make a clear distinction between pages and folios, so
> it was even more confusing.)
>
>
> Better?

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 01/13] mm: Update ptep_get_lockless()s comment
  2022-10-25  7:54               ` Alistair Popple
@ 2022-10-25 13:33                 ` Peter Zijlstra
  2022-10-25 13:44                 ` Jann Horn
  1 sibling, 0 replies; 148+ messages in thread
From: Peter Zijlstra @ 2022-10-25 13:33 UTC (permalink / raw)
  To: Alistair Popple
  Cc: Matthew Wilcox, Jann Horn, Linus Torvalds, John Hubbard, x86,
	akpm, linux-kernel, linux-mm, aarcange, kirill.shutemov, jroedel,
	ubizjak

On Tue, Oct 25, 2022 at 06:54:10PM +1100, Alistair Popple wrote:

> > First we (locklessly) load the page table entry, then we grab a
> > reference to the folio that contains it (which can fail if the
> > refcount is zero, in that case we bail), then we recheck that the
> > page table entry is still the same, and if it changed in between,
> > we drop the folio reference and bail.
> > This can, again, grab a reference to a folio after it has
> > already been freed and reallocated. The reason why this is
> > fine is that the metadata structure that holds this refcount,
> > `struct folio` is never freed; even when a folio is
> > freed and reallocated, the corresponding `struct folio`
> > stays.
> 
> I'm probably missing something obvious but how is that synchronised
> against memory hotplug? AFAICT if it isn't couldn't the pages be freed
> and memory removed? In that case the above would no longer hold because
> (I think) the metadata structure could have been freed.

Note, this scheme is older than memory hot-plug, so if anybody is to
blame it's the memory hotplug code.

Anyway, since all that is done with IRQs disabled, all the hotplug stuff
needs to do is synchronize_rcu() in order to ensure all active
IRQ-disabled regions are finished (between ensuring the memory is unused
and taking out the struct page).
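
I.e., roughly this ordering (a sketch of the idea, not an actual patch):

	mem_hotplug_begin();
	/* ... isolate/offline the pages so no new references can show up ... */
	synchronize_rcu();	/* waits out all IRQs-disabled walkers (GUP-fast et al.) */
	/* ... only now tear down the memmap / struct pages ... */
	mem_hotplug_done();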

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 01/13] mm: Update ptep_get_lockless()s comment
  2022-10-25  7:54               ` Alistair Popple
  2022-10-25 13:33                 ` Peter Zijlstra
@ 2022-10-25 13:44                 ` Jann Horn
  2022-10-26  0:45                   ` Alistair Popple
  1 sibling, 1 reply; 148+ messages in thread
From: Jann Horn @ 2022-10-25 13:44 UTC (permalink / raw)
  To: Alistair Popple
  Cc: Matthew Wilcox, Linus Torvalds, Peter Zijlstra, John Hubbard,
	x86, akpm, linux-kernel, linux-mm, aarcange, kirill.shutemov,
	jroedel, ubizjak

On Tue, Oct 25, 2022 at 10:11 AM Alistair Popple <apopple@nvidia.com> wrote:
>
>
> Matthew Wilcox <willy@infradead.org> writes:
>
> > On Mon, Oct 24, 2022 at 10:23:51PM +0200, Jann Horn wrote:
> >> """
> >> This guarantees that the page tables that are being walked
> >> aren't freed concurrently, but at the end of the walk, we
> >> have to grab a stable reference to the referenced page.
> >> For this we use the grab-reference-and-revalidate trick
> >> from above again:
> >> First we (locklessly) load the page
> >> table entry, then we grab a reference to the page that it
> >> points to (which can fail if the refcount is zero, in that
> >> case we bail), then we recheck that the page table entry
> >> is still the same, and if it changed in between, we drop the
> >> page reference and bail.
> >> This can, again, grab a reference to a page after it has
> >> already been freed and reallocated. The reason why this is
> >> fine is that the metadata structure that holds this refcount,
> >> `struct folio` (or `struct page`, depending on which kernel
> >> version you're looking at; in current kernels it's `folio`
> >> but `struct page` and `struct folio` are actually aliases for
> >> the same memory, basically, though that is supposed to maybe
> >> change at some point) is never freed; even when a page is
> >> freed and reallocated, the corresponding `struct folio`
> >> stays. This does have the fun consequence that whenever a
> >> page/folio has a non-zero refcount, the refcount can
> >> spuriously go up and then back down for a little bit.
> >> (Also it's technically not as simple as I just described it,
> >> because the `struct page` that the PTE points to might be
> >> a "tail page" of a `struct folio`.
> >> So actually we first read the PTE, the PTE gives us the
> >> `page*`, then from that we go to the `folio*`, then we
> >> try to grab a reference to the `folio`, then if that worked
> >> we check that the `page` still points to the same `folio`,
> >> and then we recheck that the PTE is still the same.)
> >> """
> >
> > Nngh.  In trying to make this description fit all kernels (with
> > both pages and folios), you've complicated it maximally.  Let's
> > try a more simple explanation:
> >
> > First we (locklessly) load the page table entry, then we grab a
> > reference to the folio that contains it (which can fail if the
> > refcount is zero, in that case we bail), then we recheck that the
> > page table entry is still the same, and if it changed in between,
> > we drop the folio reference and bail.
> > This can, again, grab a reference to a folio after it has
> > already been freed and reallocated. The reason why this is
> > fine is that the metadata structure that holds this refcount,
> > `struct folio` is never freed; even when a folio is
> > freed and reallocated, the corresponding `struct folio`
> > stays.

Oh, thanks. You're right, trying to talk about kernels with folios
made it unnecessarily complicated...

> I'm probably missing something obvious but how is that synchronised
> against memory hotplug? AFAICT if it isn't couldn't the pages be freed
> and memory removed? In that case the above would no longer hold because
> (I think) the metadata structure could have been freed.

Hm... that's this codepath?

arch_remove_memory -> __remove_pages -> __remove_section ->
sparse_remove_section -> section_deactivate ->
depopulate_section_memmap -> vmemmap_free -> remove_pagetable

which then walks down the page tables and ends up freeing individual
pages in remove_pte_table() using the confusingly-named
free_pagetable()?

I'm not sure what the synchronization against hotplug is - GUP-fast is
running with IRQs disabled, but other codepaths might not, like
get_ksm_page()? I don't know if that's holding something else for protection...

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 01/13] mm: Update ptep_get_lockless()s comment
  2022-10-24 19:58       ` Jann Horn
  2022-10-24 20:19         ` Linus Torvalds
@ 2022-10-25 14:02         ` Peter Zijlstra
  2022-10-25 14:18           ` Jann Horn
  1 sibling, 1 reply; 148+ messages in thread
From: Peter Zijlstra @ 2022-10-25 14:02 UTC (permalink / raw)
  To: Jann Horn
  Cc: John Hubbard, x86, willy, torvalds, akpm, linux-kernel, linux-mm,
	aarcange, kirill.shutemov, jroedel, ubizjak

On Mon, Oct 24, 2022 at 09:58:07PM +0200, Jann Horn wrote:

> Unless I'm completely misunderstanding what's going on here, the whole
> "remove_table" thing only happens when you "remove a table", meaning
> you free an entire *pagetable*. Just zapping PTEs doesn't trigger that
> logic.

Aah; yes true. OTOH even if that were not so, I think it would still be
broken because the current code relies on the TLB flush to have
completed, whereas the RCU scheme is effectively async and can be
considered pending until the callback runs.

Hurmph... easiest fix is probably to dis-allow kvm_flush_tlb_multi()
for i386-pae builds.

Something like so... nobody in his right mind should care about i386-pae
virt performance much.

---
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 95fb85bea111..cbfb84e88251 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -473,6 +473,12 @@ static DEFINE_PER_CPU(cpumask_var_t, __pv_cpu_mask);
 
 static bool pv_tlb_flush_supported(void)
 {
+	/*
+	 * i386-PAE split loads are incompatible with optimized TLB flushes.
+	 */
+	if (IS_ENABLED(CONFIG_GUP_GET_PTE_LOW_HIGH))
+		return false;
+
 	return (kvm_para_has_feature(KVM_FEATURE_PV_TLB_FLUSH) &&
 		!kvm_para_has_hint(KVM_HINTS_REALTIME) &&
 		kvm_para_has_feature(KVM_FEATURE_STEAL_TIME) &&

^ permalink raw reply related	[flat|nested] 148+ messages in thread

* Re: [PATCH 01/13] mm: Update ptep_get_lockless()s comment
  2022-10-25 14:02         ` Peter Zijlstra
@ 2022-10-25 14:18           ` Jann Horn
  2022-10-25 15:06             ` Peter Zijlstra
  0 siblings, 1 reply; 148+ messages in thread
From: Jann Horn @ 2022-10-25 14:18 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: John Hubbard, x86, willy, torvalds, akpm, linux-kernel, linux-mm,
	aarcange, kirill.shutemov, jroedel, ubizjak

On Tue, Oct 25, 2022 at 4:02 PM Peter Zijlstra <peterz@infradead.org> wrote:
> On Mon, Oct 24, 2022 at 09:58:07PM +0200, Jann Horn wrote:
>
> > Unless I'm completely misunderstanding what's going on here, the whole
> > "remove_table" thing only happens when you "remove a table", meaning
> > you free an entire *pagetable*. Just zapping PTEs doesn't trigger that
> > logic.
>
> Aah; yes true. OTOH even if that were not so, I think it would still be
> broken because the current code relies on the TLB flush to have
> completed, whereas the RCU scheme is effectively async and can be
> considered pending until the callback runs.
>
> Hurmph... easiest fix is probably to dis-allow kvm_flush_tlb_multi()
> for i386-pae builds.
>
> Something like so... nobody in his right mind should care about i386-pae
> virt performance much.

I think Xen and HyperV have similar codepaths.
hyperv_flush_tlb_multi() looks like it uses remote flush hypercalls,
xen_flush_tlb_multi() too.

On top of that, I think that theoretically, Linux doesn't even ensure
that you have a TLB flush in between tearing down one PTE and
installing another PTE (see
https://lore.kernel.org/all/CAG48ez1Oz4tT-N2Y=Zs6jumu=zOp7SQRZ=V2c+b5bT9P4retJA@mail.gmail.com/),
but I haven't tested that, and if it is true, I'm also not entirely
sure if it's correct (in the sense that it only creates incoherent-TLB
states when userspace is doing something stupid like racing
MADV_DONTNEED and page faults on the same region).

I think the more clearly correct fix would be to get rid of the split
loads and use CMPXCHG16B instead (probably destroying the performance
of GUP-fast completely), but that's complicated because some of the
architectures that use the split loads path don't have cmpxchg_double
(or at least don't have it wired up).
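
For the 8-byte x86-32 case that would be something along the lines of the
sketch below -- not wired up anywhere, and the lock-prefixed access on
every walk is exactly the cost GUP-fast wants to avoid:

/* sketch: atomic 64bit read by (ab)using cmpxchg64(); when *pmdp == 0 it
 * "stores" 0 back, so memory is never visibly changed, and either way the
 * return value is an atomic snapshot of the entry */
static inline pmd_t pmdp_get_atomic_cmpxchg(pmd_t *pmdp)
{
	pmdval_t old = cmpxchg64((u64 *)pmdp, 0ULL, 0ULL);
	return native_make_pmd(old);
}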

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 01/13] mm: Update ptep_get_lockless()s comment
  2022-10-25 14:18           ` Jann Horn
@ 2022-10-25 15:06             ` Peter Zijlstra
  2022-10-26 16:45               ` Jann Horn
  2022-10-26 19:43               ` Nadav Amit
  0 siblings, 2 replies; 148+ messages in thread
From: Peter Zijlstra @ 2022-10-25 15:06 UTC (permalink / raw)
  To: Jann Horn
  Cc: John Hubbard, x86, willy, torvalds, akpm, linux-kernel, linux-mm,
	aarcange, kirill.shutemov, jroedel, ubizjak

On Tue, Oct 25, 2022 at 04:18:20PM +0200, Jann Horn wrote:
> On Tue, Oct 25, 2022 at 4:02 PM Peter Zijlstra <peterz@infradead.org> wrote:
> > On Mon, Oct 24, 2022 at 09:58:07PM +0200, Jann Horn wrote:
> >
> > > Unless I'm completely misunderstanding what's going on here, the whole
> > > "remove_table" thing only happens when you "remove a table", meaning
> > > you free an entire *pagetable*. Just zapping PTEs doesn't trigger that
> > > logic.
> >
> > > Aah; yes true. OTOH even if that were not so, I think it would still be
> > broken because the current code relies on the TLB flush to have
> > completed, whereas the RCU scheme is effectively async and can be
> > considered pending until the callback runs.
> >
> > Hurmph... easiest fix is probably to dis-allow kvm_flush_tlb_multi()
> > for i386-pae builds.
> >
> > Something like so... nobody in his right mind should care about i386-pae
> > virt performance much.
> 
> I think Xen and HyperV have similar codepaths.
> hyperv_flush_tlb_multi() looks like it uses remote flush hypercalls,
> xen_flush_tlb_multi() too.

Sure (not updated).

> On top of that, I think that theoretically, Linux doesn't even ensure
> that you have a TLB flush in between tearing down one PTE and
> installing another PTE (see
> https://lore.kernel.org/all/CAG48ez1Oz4tT-N2Y=Zs6jumu=zOp7SQRZ=V2c+b5bT9P4retJA@mail.gmail.com/),
> but I haven't tested that, and if it is true, I'm also not entirely
> sure if it's correct (in the sense that it only creates incoherent-TLB
> states when userspace is doing something stupid like racing
> MADV_DONTNEED and page faults on the same region).
> 
> I think the more clearly correct fix would be to get rid of the split
> loads and use CMPXCHG16B instead (probably destroying the performance
> of GUP-fast completely), but that's complicated because some of the
> architectures that use the split loads path don't have cmpxchg_double
> (or at least don't have it wired up).

cmpxchg8b; but no, I think we want to fix MADV_DONTNEED, incoherent TLB
states are a pain nobody needs.

Something like so should force TLB flushes before dropping pte_lock (not
looked at the various pmd level things yet).

diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 95fb85bea111..cbfb84e88251 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -473,6 +473,12 @@ static DEFINE_PER_CPU(cpumask_var_t, __pv_cpu_mask);
 
 static bool pv_tlb_flush_supported(void)
 {
+	/*
+	 * i386-PAE split loads are incompatible with optimized TLB flushes.
+	 */
+	if (IS_ENABLED(CONFIG_GUP_GET_PTE_LOW_HIGH))
+		return false;
+
 	return (kvm_para_has_feature(KVM_FEATURE_PV_TLB_FLUSH) &&
 		!kvm_para_has_hint(KVM_HINTS_REALTIME) &&
 		kvm_para_has_feature(KVM_FEATURE_STEAL_TIME) &&
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8bbcccbc5565..397bc04e2d82 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3474,5 +3474,6 @@ madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
  * default, the flag is not set.
  */
 #define  ZAP_FLAG_DROP_MARKER        ((__force zap_flags_t) BIT(0))
+#define  ZAP_FLAG_FORCE_FLUSH	     ((__force zap_flags_t) BIT(1))
 
 #endif /* _LINUX_MM_H */
diff --git a/mm/memory.c b/mm/memory.c
index f88c351aecd4..9bb63b3fbee1 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1440,6 +1440,11 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 			tlb_remove_tlb_entry(tlb, pte, addr);
 			zap_install_uffd_wp_if_needed(vma, addr, pte, details,
 						      ptent);
+
+			if (!force_flush && !tlb->fullmm && details &&
+			    details->zap_flags & ZAP_FLAG_FORCE_FLUSH)
+				force_flush = 1;
+
 			if (unlikely(!page))
 				continue;
 
@@ -1749,6 +1754,9 @@ void zap_page_range(struct vm_area_struct *vma, unsigned long start,
 	struct maple_tree *mt = &vma->vm_mm->mm_mt;
 	unsigned long end = start + size;
 	struct mmu_notifier_range range;
+	struct zap_details details = {
+		.zap_flags = ZAP_FLAG_FORCE_FLUSH,
+	};
 	struct mmu_gather tlb;
 	MA_STATE(mas, mt, vma->vm_end, vma->vm_end);
 
@@ -1759,7 +1767,7 @@ void zap_page_range(struct vm_area_struct *vma, unsigned long start,
 	update_hiwater_rss(vma->vm_mm);
 	mmu_notifier_invalidate_range_start(&range);
 	do {
-		unmap_single_vma(&tlb, vma, start, range.end, NULL);
+		unmap_single_vma(&tlb, vma, start, range.end, &details);
 	} while ((vma = mas_find(&mas, end - 1)) != NULL);
 	mmu_notifier_invalidate_range_end(&range);
 	tlb_finish_mmu(&tlb);
@@ -1806,7 +1814,7 @@ void zap_vma_ptes(struct vm_area_struct *vma, unsigned long address,
 		unsigned long size)
 {
 	if (!range_in_vma(vma, address, address + size) ||
-	    		!(vma->vm_flags & VM_PFNMAP))
+	    !(vma->vm_flags & VM_PFNMAP))
 		return;
 
 	zap_page_range_single(vma, address, size, NULL);

^ permalink raw reply related	[flat|nested] 148+ messages in thread

* Re: [PATCH 01/13] mm: Update ptep_get_lockless()s comment
  2022-10-25 13:44                 ` Jann Horn
@ 2022-10-26  0:45                   ` Alistair Popple
  0 siblings, 0 replies; 148+ messages in thread
From: Alistair Popple @ 2022-10-26  0:45 UTC (permalink / raw)
  To: Jann Horn
  Cc: Matthew Wilcox, Linus Torvalds, Peter Zijlstra, John Hubbard,
	x86, akpm, linux-kernel, linux-mm, aarcange, kirill.shutemov,
	jroedel, ubizjak


Jann Horn <jannh@google.com> writes:

> On Tue, Oct 25, 2022 at 10:11 AM Alistair Popple <apopple@nvidia.com> wrote:
>>
>>
>> Matthew Wilcox <willy@infradead.org> writes:
>>
>> > On Mon, Oct 24, 2022 at 10:23:51PM +0200, Jann Horn wrote:
>> >> """
>> >> This guarantees that the page tables that are being walked
>> >> aren't freed concurrently, but at the end of the walk, we
>> >> have to grab a stable reference to the referenced page.
>> >> For this we use the grab-reference-and-revalidate trick
>> >> from above again:
>> >> First we (locklessly) load the page
>> >> table entry, then we grab a reference to the page that it
>> >> points to (which can fail if the refcount is zero, in that
>> >> case we bail), then we recheck that the page table entry
>> >> is still the same, and if it changed in between, we drop the
>> >> page reference and bail.
>> >> This can, again, grab a reference to a page after it has
>> >> already been freed and reallocated. The reason why this is
>> >> fine is that the metadata structure that holds this refcount,
>> >> `struct folio` (or `struct page`, depending on which kernel
>> >> version you're looking at; in current kernels it's `folio`
>> >> but `struct page` and `struct folio` are actually aliases for
>> >> the same memory, basically, though that is supposed to maybe
>> >> change at some point) is never freed; even when a page is
>> >> freed and reallocated, the corresponding `struct folio`
>> >> stays. This does have the fun consequence that whenever a
>> >> page/folio has a non-zero refcount, the refcount can
>> >> spuriously go up and then back down for a little bit.
>> >> (Also it's technically not as simple as I just described it,
>> >> because the `struct page` that the PTE points to might be
>> >> a "tail page" of a `struct folio`.
>> >> So actually we first read the PTE, the PTE gives us the
>> >> `page*`, then from that we go to the `folio*`, then we
>> >> try to grab a reference to the `folio`, then if that worked
>> >> we check that the `page` still points to the same `folio`,
>> >> and then we recheck that the PTE is still the same.)
>> >> """
>> >
>> > Nngh.  In trying to make this description fit all kernels (with
>> > both pages and folios), you've complicated it maximally.  Let's
>> > try a more simple explanation:
>> >
>> > First we (locklessly) load the page table entry, then we grab a
>> > reference to the folio that contains it (which can fail if the
>> > refcount is zero, in that case we bail), then we recheck that the
>> > page table entry is still the same, and if it changed in between,
>> > we drop the folio reference and bail.
>> > This can, again, grab a reference to a folio after it has
>> > already been freed and reallocated. The reason why this is
>> > fine is that the metadata structure that holds this refcount,
>> > `struct folio` is never freed; even when a folio is
>> > freed and reallocated, the corresponding `struct folio`
>> > stays.
>
> Oh, thanks. You're right, trying to talk about kernels with folios
> made it unnecessarily complicated...
>
>> I'm probably missing something obvious but how is that synchronised
>> against memory hotplug? AFAICT if it isn't couldn't the pages be freed
>> and memory removed? In that case the above would no longer hold because
>> (I think) the metadata structure could have been freed.
>
> Hm... that's this codepath?
>
> arch_remove_memory -> __remove_pages -> __remove_section ->
> sparse_remove_section -> section_deactivate ->
> depopulate_section_memmap -> vmemmap_free -> remove_pagetable
> which then walks down the page tables and ends up freeing individual
> pages in remove_pte_table() using the confusingly-named
> free_pagetable()?

Right. section_deactivate() will also clear SECTION_HAS_MEM_MAP which
would trigger VM_BUG_ON(!pfn_valid(pte_pfn(pte))) in gup_pte_range().

> I'm not sure what the synchronization against hotplug is - GUP-fast is
> running with IRQs disabled, but other codepaths might not, like
> get_ksm_page()? I don't know if that's holding something else for protection...

I was thinking about this from the ZONE_DEVICE perspective (ie.
memunmap_pages -> pageunmap_range -> arch_remove_memory -> ...). That
runs with IRQs enabled, and I couldn't see any other synchronization.
pageunmap_range() does call mem_hotplug_begin() which takes hotplug
locks but GUP-fast doesn't take those locks. So based on Peter's
response I think I need to add a synchronize_rcu() call to
pageunmap_range() right after calling mem_hotplug_begin().

I could just add it to mem_hotplug_begin() but offline_pages() calls
that too early (before pages have been isolated) so will need a separate
change.

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 01/13] mm: Update ptep_get_lockless()s comment
  2022-10-25 15:06             ` Peter Zijlstra
@ 2022-10-26 16:45               ` Jann Horn
  2022-10-27  7:08                 ` Peter Zijlstra
  2022-10-26 19:43               ` Nadav Amit
  1 sibling, 1 reply; 148+ messages in thread
From: Jann Horn @ 2022-10-26 16:45 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: John Hubbard, x86, willy, torvalds, akpm, linux-kernel, linux-mm,
	aarcange, kirill.shutemov, jroedel, ubizjak

On Tue, Oct 25, 2022 at 5:06 PM Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, Oct 25, 2022 at 04:18:20PM +0200, Jann Horn wrote:
> > On Tue, Oct 25, 2022 at 4:02 PM Peter Zijlstra <peterz@infradead.org> wrote:
> > > On Mon, Oct 24, 2022 at 09:58:07PM +0200, Jann Horn wrote:
> > >
> > > > Unless I'm completely misunderstanding what's going on here, the whole
> > > > "remove_table" thing only happens when you "remove a table", meaning
> > > > you free an entire *pagetable*. Just zapping PTEs doesn't trigger that
> > > > logic.
> > >
> > > Aah; yes true. OTOH even if that were not so, I think it would still be
> > > broken because the current code relies on the TLB flush to have
> > > completed, whereas the RCU scheme is effectively async and can be
> > > considered pending until the callback runs.
> > >
> > > Hurmph... easiest fix is probably to dis-allow kvm_flush_tlb_multi()
> > > for i386-pae builds.
> > >
> > > Something like so... nobody in his right mind should care about i386-pae
> > > virt performance much.
> >
> > I think Xen and HyperV have similar codepaths.
> > hyperv_flush_tlb_multi() looks like it uses remote flush hypercalls,
> > xen_flush_tlb_multi() too.
>
> Sure (not updated).
>
> > On top of that, I think that theoretically, Linux doesn't even ensure
> > that you have a TLB flush in between tearing down one PTE and
> > installing another PTE (see
> > https://lore.kernel.org/all/CAG48ez1Oz4tT-N2Y=Zs6jumu=zOp7SQRZ=V2c+b5bT9P4retJA@mail.gmail.com/),
> > but I haven't tested that, and if it is true, I'm also not entirely
> > sure if it's correct (in the sense that it only creates incoherent-TLB
> > states when userspace is doing something stupid like racing
> > MADV_DONTNEED and page faults on the same region).
> >
> > I think the more clearly correct fix would be to get rid of the split
> > loads and use CMPXCHG16B instead (probably destroying the performance
> > of GUP-fast completely), but that's complicated because some of the
> > architectures that use the split loads path don't have cmpxchg_double
> > (or at least don't have it wired up).
>
> cmpxchg8b; but no, I think we want to fix MADV_DONTNEED, incoherent TLB
> states are a pain nobody needs.
>
> Something like so should force TLB flushes before dropping pte_lock (not
> looked at the various pmd level things yet).
[...]
>  #endif /* _LINUX_MM_H */
> diff --git a/mm/memory.c b/mm/memory.c
> index f88c351aecd4..9bb63b3fbee1 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1440,6 +1440,11 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>                         tlb_remove_tlb_entry(tlb, pte, addr);
>                         zap_install_uffd_wp_if_needed(vma, addr, pte, details,
>                                                       ptent);
> +
> +                       if (!force_flush && !tlb->fullmm && details &&
> +                           details->zap_flags & ZAP_FLAG_FORCE_FLUSH)
> +                               force_flush = 1;
> +

Hmm... I guess that might work, assuming that there is no other
codepath we might race with that first turns the present PTE into a
non-present PTE but keeps the flush queued for later. At least
codepaths that use the tlb_batched infrastructure are unproblematic...

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 01/13] mm: Update ptep_get_lockless()s comment
  2022-10-25 15:06             ` Peter Zijlstra
  2022-10-26 16:45               ` Jann Horn
@ 2022-10-26 19:43               ` Nadav Amit
  2022-10-27  7:27                 ` Peter Zijlstra
  1 sibling, 1 reply; 148+ messages in thread
From: Nadav Amit @ 2022-10-26 19:43 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Jann Horn, John Hubbard, X86 ML, Matthew Wilcox, Linus Torvalds,
	Andrew Morton, kernel list, Linux-MM, Andrea Arcangeli,
	Kirill A . Shutemov, jroedel, ubizjak

On Oct 25, 2022, at 6:06 PM, Peter Zijlstra <peterz@infradead.org> wrote:

> 			if (!force_flush && !tlb->fullmm && details &&
> +			    details->zap_flags & ZAP_FLAG_FORCE_FLUSH)
> +				force_flush = 1;

Isn’t it too big of a hammer?

At the same time, the whole reasoning about TLB flushes is not getting any
simpler. We had cases in which MADV_DONTNEED and another concurrent
operation that effectively zapped PTEs (e.g., another MADV_DONTNEED) caused
the zap_pte_range() to skip entries since pte_none() was true. To resolve
these cases we relied on tlb_finish_mmu() to flush the range when needed
(i.e., flush the whole range when mm_tlb_flush_nested()).

Now, I do not have a specific broken scenario in mind following this change,
but it all sounds to me a bit dangerous and at the same time can potentially
introduce new overheads.

One alternative may be using mm_tlb_flush_pending() when setting a new PTE
to check for pending flushes and flushing the TLB if that is the case. This
is somewhat similar to what ptep_clear_flush() does. Anyhow, I guess this
might induce some overheads. As noted before, it is possible to track
pending TLB flushes in VMA/page-table granularity, with different tradeoffs
of overheads.
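
Roughly, at the point where a fault installs a new PTE -- a sketch of
that alternative, not a patch:

	/* honour any still-pending zap flush before reusing the slot */
	if (mm_tlb_flush_pending(vma->vm_mm))
		flush_tlb_range(vma, addr, addr + PAGE_SIZE);
	set_pte_at(vma->vm_mm, addr, ptep, new_pte);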


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 01/13] mm: Update ptep_get_lockless()s comment
  2022-10-26 16:45               ` Jann Horn
@ 2022-10-27  7:08                 ` Peter Zijlstra
  2022-10-27 18:13                   ` Linus Torvalds
  0 siblings, 1 reply; 148+ messages in thread
From: Peter Zijlstra @ 2022-10-27  7:08 UTC (permalink / raw)
  To: Jann Horn
  Cc: John Hubbard, x86, willy, torvalds, akpm, linux-kernel, linux-mm,
	aarcange, kirill.shutemov, jroedel, ubizjak

On Wed, Oct 26, 2022 at 06:45:16PM +0200, Jann Horn wrote:

> >  #endif /* _LINUX_MM_H */
> > diff --git a/mm/memory.c b/mm/memory.c
> > index f88c351aecd4..9bb63b3fbee1 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -1440,6 +1440,11 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
> >                         tlb_remove_tlb_entry(tlb, pte, addr);
> >                         zap_install_uffd_wp_if_needed(vma, addr, pte, details,
> >                                                       ptent);
> > +
> > +                       if (!force_flush && !tlb->fullmm && details &&
> > +                           details->zap_flags & ZAP_FLAG_FORCE_FLUSH)
> > +                               force_flush = 1;
> > +
> 
> Hmm... I guess that might work, assuming that there is no other
> codepath we might race with that first turns the present PTE into a
> non-present PTE but keeps the flush queued for later. At least
> codepaths that use the tlb_batched infrastructure are unproblematic...

So I thought the general rule was that if you modify a PTE and have not
unmapped things -- IOW, there's actual concurrency possible on the
thing, then the TLB invalidate needs to happen under pte_lock, since
that is what controls concurrency at the pte level.
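
Schematically, the rule I mean is the following ordering (an illustrative
sketch, not lifted from any particular call site):

	/* Sketch of the "invalidate under pte_lock" rule. */
	static void clear_one_pte(struct vm_area_struct *vma, struct mm_struct *mm,
				  pte_t *pte, unsigned long addr, spinlock_t *ptl)
	{
		spin_lock(ptl);
		ptep_get_and_clear(mm, addr, pte);
		/* Flush before dropping the lock that serializes PTE users. */
		flush_tlb_range(vma, addr, addr + PAGE_SIZE);
		spin_unlock(ptl);
	}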

As it stands MADV_DONTNEED seems to blatantly violate that general rule.

Then again; I could've missed something and the rules changed?

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 01/13] mm: Update ptep_get_lockless()s comment
  2022-10-26 19:43               ` Nadav Amit
@ 2022-10-27  7:27                 ` Peter Zijlstra
  2022-10-27 17:30                   ` Nadav Amit
  0 siblings, 1 reply; 148+ messages in thread
From: Peter Zijlstra @ 2022-10-27  7:27 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Jann Horn, John Hubbard, X86 ML, Matthew Wilcox, Linus Torvalds,
	Andrew Morton, kernel list, Linux-MM, Andrea Arcangeli,
	Kirill A . Shutemov, jroedel, ubizjak

On Wed, Oct 26, 2022 at 10:43:21PM +0300, Nadav Amit wrote:
> On Oct 25, 2022, at 6:06 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > 			if (!force_flush && !tlb->fullmm && details &&
> > +			    details->zap_flags & ZAP_FLAG_FORCE_FLUSH)
> > +				force_flush = 1;
> 
> Isn’t it too big of a hammer?

It is the obvious hammer :-) TLB invalidate under pte_lock when
clearing.

> At the same time, the whole reasoning about TLB flushes is not getting any
> simpler. We had cases in which MADV_DONTNEED and another concurrent
> operation that effectively zapped PTEs (e.g., another MADV_DONTNEED) caused
> the zap_pte_range() to skip entries since pte_none() was true. To resolve
> these cases we relied on tlb_finish_mmu() to flush the range when needed
> (i.e., flush the whole range when mm_tlb_flush_nested()).

Yeah, whoever thought that allowing concurrency there was a great idea :/

And I must admit to hating the pending thing with a passion. And that
mm_tlb_flush_nested() thing in tlb_finish_mmu() is a giant hack at the
best of times.

Also; I feel it's part of the problem here; it violates the basic rules
we've had for a very long time.

> Now, I do not have a specific broken scenario in mind following this change,
> but it is all sounds to me a bit dangerous and at same time can potentially
> introduce new overheads.

I'll take correctness over being fast. As you say, this whole TLB thing
is getting out of hand.

> One alternative may be using mm_tlb_flush_pending() when setting a new PTE
> to check for pending flushes and flushing the TLB if that is the case. This
> is somewhat similar to what ptep_clear_flush() does. Anyhow, I guess this
> might induce some overheads. As noted before, it is possible to track
> pending TLB flushes in VMA/page-table granularity, with different tradeoffs
> of overheads.

Right; I just don't believe in VMAs for this, they're *waaay* too big.

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 01/13] mm: Update ptep_get_lockless()s comment
  2022-10-27  7:27                 ` Peter Zijlstra
@ 2022-10-27 17:30                   ` Nadav Amit
  0 siblings, 0 replies; 148+ messages in thread
From: Nadav Amit @ 2022-10-27 17:30 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Jann Horn, John Hubbard, X86 ML, Matthew Wilcox, Linus Torvalds,
	Andrew Morton, kernel list, Linux-MM, Andrea Arcangeli,
	Kirill A . Shutemov, jroedel, ubizjak

On Oct 27, 2022, at 12:27 AM, Peter Zijlstra <peterz@infradead.org> wrote:

>> One alternative may be using mm_tlb_flush_pending() when setting a new PTE
>> to check for pending flushes and flushing the TLB if that is the case. This
>> is somewhat similar to what ptep_clear_flush() does. Anyhow, I guess this
>> might induce some overheads. As noted before, it is possible to track
>> pending TLB flushes in VMA/page-table granularity, with different tradeoffs
>> of overheads.
> 
> Right; I just don't believe in VMAs for this, they're *waaay* too big.

Well, I did it for VMA in an RFC only because I was pushed. I thought, and
still think, that page-table granularity is the right one.

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 01/13] mm: Update ptep_get_lockless()s comment
  2022-10-27  7:08                 ` Peter Zijlstra
@ 2022-10-27 18:13                   ` Linus Torvalds
  2022-10-27 19:35                     ` Peter Zijlstra
  2022-10-27 20:15                     ` Nadav Amit
  0 siblings, 2 replies; 148+ messages in thread
From: Linus Torvalds @ 2022-10-27 18:13 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Jann Horn, John Hubbard, x86, willy, akpm, linux-kernel,
	linux-mm, aarcange, kirill.shutemov, jroedel, ubizjak

On Thu, Oct 27, 2022 at 12:08 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> So I thought the general rule was that if you modify a PTE and have not
> unmapped things -- IOW, there's actual concurrency possible on the
> thing, then the TLB invalidate needs to happen under pte_lock, since
> that is what controls concurrency at the pte level.

Yeah, I think we should really think about the TLB issue and make the
rules very clear.

A lot of our thinking behind "fix TLB races" has been about
"use-after-free wrt TLB on another CPU", and the design in
zap_page_ranges() is almost entirely about that: making sure that the
TLB flush happens before the underlying pages are free'd.

I say "almost entirely", because you can see the effects of the
*other* kind of race in

                        if (!PageAnon(page)) {
                                if (pte_dirty(ptent)) {
                                        force_flush = 1;
                                        set_page_dirty(page);
                                }

where it isn't about the lifetime of the page, but about the state
coherency wrt other users.

But I think even that code is quite questionable, and we really should
write down the rules much more strictly somewhere.

For example, wrt that pte_dirty() causing force_flush, I'd love to
have some core set of rules that would explain

 (a) why it is only relevant for shared pages

 (b) do even shared pages need it for the "fullmm" case when we tear
down the whole VM?

 (c) what are the force_flush interactions wrt holding the mmap lock
for writing vs reading?

In the above, I think (b)/(c) are related - I suspect "fullmm" is
mostly equivalent to "mmap held for writing", in that it guarantees
that there should be no concurrent faults or other active sw ops on
the table.

But "fullmm" is probably even stronger than "mmap write-lock" in that
it should also mean "no other CPU can be actively using this" - either
for hardware page table walking, or for GUP.

But both have the vmscan code and rmap that can still get to the
pages - and the vmscan code clearly only cares about the page table
lock. But the vmscan code itself also shouldn't care about some stale
TLB value, so ...

> As it stands MADV_DONTNEED seems to blatantly violate that general rule.

So let's talk about exactly what and why the TLB would need to be
flushed before dropping the page table lock.

For example, MADV_DONTNEED does this all with just the mmap lock held
for reading, so *unless* we have that 'force_flush', we can

 (a) have another CPU continue to use the old stale TLB entry for quite a while

 (b) yet another CPU (that didn't have a TLB entry, or wanted to write
to a read-only one ) could take a page fault, and install a *new* PTE
entry in the same slot, all at the same time.

Now, that's clearly *very* confusing. But being confusing may not mean
"wrong" - we're still delaying the free of the old entry, so there's
no use-after-free.

The biggest problem I can see is that this means that user space
memory ordering might be screwed up: the CPU in (a) will see not just
an old TLB entry, but simply old *data*, when the CPU in (b) may be
writing to that same address with new data.

So I think we clearly do *NOT* serialize as much as we could for
MADV_DONTNEED, and I think the above is a real semantic result of
that.

But equally arguably if you do this kind of unserialized MADV_DONTNEED
in user space, that's a case of "you asked for the breakage - you get
to keep both pieces".

So is that ever an actual problem? If the worst issue is that some
CPU's may see old ("pre-DONTNEED") data, while other CPU's then see
new data while the MADV_DONTNEED is executing, I think maybe *THAT*
part is fine.

But because it's so very confusing, maybe we have other problems in
this area, which is why I think it would be really good to have an
actual set of real hard documented rules about this all, and exactly
when we need to flush TLBs synchronously with the page table lock, and
when we do not.

Anybody willing to try to write up the rules (and have each rule
document *why* it's a rule - not just "by fiat", but an actual "these
are the rules and this is *why* they are the rules")?

Because right now I think all of our rules are almost entirely just
encoded in the code, with a couple of comments, and a few people who
just remember why we do what we do.

                   Linus

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 01/13] mm: Update ptep_get_lockless()s comment
  2022-10-27 18:13                   ` Linus Torvalds
@ 2022-10-27 19:35                     ` Peter Zijlstra
  2022-10-27 19:43                       ` Linus Torvalds
  2022-10-27 20:15                     ` Nadav Amit
  1 sibling, 1 reply; 148+ messages in thread
From: Peter Zijlstra @ 2022-10-27 19:35 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jann Horn, John Hubbard, x86, willy, akpm, linux-kernel,
	linux-mm, aarcange, kirill.shutemov, jroedel, ubizjak


Just two quick remarks; it's far too late to really think :-)

On Thu, Oct 27, 2022 at 11:13:55AM -0700, Linus Torvalds wrote:

> But "fullmm" is probably even stronger than "mmap write-lock" in that
> it should also mean "no other CPU can be actively using this" - either
> for hardware page table walking, or for GUP.

IIRC fullmm is really: this is the last user and we're taking the whole
mm down -- IOW exit().

> For example, MADV_DONTNEED does this all with just the mmap lock held
> for reading, so *unless* we have that 'force_flush', we can
> 
>  (a) have another CPU continue to use the old stale TLB entry for quite a while
> 
>  (b) yet another CPU (that didn't have a TLB entry, or wanted to write
> to a read-only one ) could take a page fault, and install a *new* PTE
> entry in the same slot, all at the same time.
> 
> Now, that's clearly *very* confusing. But being confusing may not mean
> "wrong" - we're still delaying the free of the old entry, so there's
> no use-after-free.

Do we worry about CPU errata where things go side-ways if multiple CPUs
have inconsistent TLB state?

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 01/13] mm: Update ptep_get_lockless()s comment
  2022-10-27 19:35                     ` Peter Zijlstra
@ 2022-10-27 19:43                       ` Linus Torvalds
  0 siblings, 0 replies; 148+ messages in thread
From: Linus Torvalds @ 2022-10-27 19:43 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Jann Horn, John Hubbard, x86, willy, akpm, linux-kernel,
	linux-mm, aarcange, kirill.shutemov, jroedel, ubizjak

On Thu, Oct 27, 2022 at 12:35 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Thu, Oct 27, 2022 at 11:13:55AM -0700, Linus Torvalds wrote:
>
> > But "fullmm" is probably even stronger than "mmap write-lock" in that
> > it should also mean "no other CPU can be actively using this" - either
> > for hardware page table walking, or for GUP.
>
> IIRC fullmm is really: this is the last user and we're taking the whole
> mm down -- IOW exit().

Yes.

But that doesn't mean that it's entirely "just ours" - vmscan can
still see the entries due to rmap, I think. So there can still be some
concurrency concerns, but it's limited.

> Do we worry about CPU errata where things go side-ways if multiple CPUs
> have inconsistent TLB state?

Yeah, we should definitely worry about those, since I think they have
been known to cause machine checks etc, which then crashes the machine
because the machine check architecture is broken garbage.

"User gets the odd memory ordering they asked for" is different from
"user can crash machine because of bad machine check architecture" ;)

That said, I don't think this is a real worry here. Because if I
recall the errata correctly, they are not about "different TLB
contents", but "different cacheability for the same physical page".

Because different TLB contents are normal and even expected, I think.
Things like kmap_local etc already end up doing some lazy TLB
flushing. No?

I think it's only "somebody did an UC access to a cacheline I have"
that ends up being bad.

Note the *WILD HANDWAVING* above - I didn't actually look up the
errata. The above is from my dim memories of the issues we had, and I
might just be wrong.

               Linus

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 01/13] mm: Update ptep_get_lockless()s comment
  2022-10-27 18:13                   ` Linus Torvalds
  2022-10-27 19:35                     ` Peter Zijlstra
@ 2022-10-27 20:15                     ` Nadav Amit
  2022-10-27 20:31                       ` Linus Torvalds
  1 sibling, 1 reply; 148+ messages in thread
From: Nadav Amit @ 2022-10-27 20:15 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, Jann Horn, John Hubbard, x86, willy,
	Andrew Morton, linux-kernel, linux-mm, Andrea Arcangeli,
	kirill.shutemov, jroedel, ubizjak

On Oct 27, 2022, at 11:13 AM, Linus Torvalds <torvalds@linux-foundation.org> wrote:

> Anybody willing to try to write up the rules (and have each rule
> document *why* it's a rule - not just "by fiat", but an actual "these
> are the rules and this is *why* they are the rules").
> 
> Because right now I think all of our rules are almost entirely just
> encoded in the code, with a couple of comments, and a few people who
> just remember why we do what we do.

I think it might be easier to come up with new rules instead of phrasing the
existing ones.

The approach I suggested before [1] is something like:

1. Turn x86’s TLB-generation mechanism into a generic one. Turn the
   TLB-generation into a “pending TLB-generation”.

2. For each mm, track a “completed TLB-generation” that is updated whenever
   an actual flush takes place.

3. When you defer a TLB-flush, while holding the PTL:
  a. Increase the TLB-generation.
  b. Save the updated “table generation” in a new field in the
     page-table’s page-struct.

4. When you are about to rely on a PTE value that is read from a page-table,
   first check if a TLB flush is needed. The check is performed by comparing
   the “table generation” with the “completed generation”. If the “table
   generation” is behind, a TLB flush is needed.

   [ You rely on the PTE value when you install new PTEs or change them ]

That’s about it. I might not have covered some issues with fast-GUP. But in
general I think it is a simple scheme. The thing I like about this scheme
the most is that it avoids relying on almost all the OS data-structures
(e.g., PageAnon()), making it much easier to grasp.
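
As a rough sketch of the check in step 4 (the field and helper names below
are invented for illustration; they are not from the old patch-set):

	/*
	 * Illustrative only. Assumes the mm carries a "completed"
	 * TLB-generation counter and the page-table page records the
	 * pending generation of its last deferred flush (step 3b).
	 */
	static bool table_flush_needed(struct mm_struct *mm, struct page *pt_page)
	{
		u64 table_gen = READ_ONCE(pt_page->pt_tlb_gen);		/* hypothetical field */
		u64 done_gen  = READ_ONCE(mm->completed_tlb_gen);	/* hypothetical field */

		/* The deferred flush for this table has not completed yet. */
		return table_gen > done_gen;
	}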

I can revive the patch-set if the overall approach is agreeable.


[1] https://lore.kernel.org/lkml/20210131001132.3368247-1-namit@vmware.com/

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 01/13] mm: Update ptep_get_lockless()s comment
  2022-10-27 20:15                     ` Nadav Amit
@ 2022-10-27 20:31                       ` Linus Torvalds
  2022-10-27 21:44                         ` Nadav Amit
  0 siblings, 1 reply; 148+ messages in thread
From: Linus Torvalds @ 2022-10-27 20:31 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Peter Zijlstra, Jann Horn, John Hubbard, x86, willy,
	Andrew Morton, linux-kernel, linux-mm, Andrea Arcangeli,
	kirill.shutemov, jroedel, ubizjak

On Thu, Oct 27, 2022 at 1:15 PM Nadav Amit <nadav.amit@gmail.com> wrote:
>
> I think it might be easier to come up with new rules instead of phrasing the
> existing ones.

I'm ok with that, but I think you are missing a very important issue:
all the cases where we can short-circuit TLB invalidations *entirely*.

You don't mention those at all.

Those optimizations are *very* important. Process exit is one of the
most performance-critical pieces of code in the kernel on some loads,
because a lot of traditional unix loads have a *ton* of small
fork/exec/exit sequences, and the whole "do just one TLB flush" was at
least historically quite a big deal.

So one very big issue here is when zap_page_tables() can end up
skipping TLB flushes entirely, because nobody cares.

And no, the fix is not to turn it into some "just increment a
generation number".

We want to avoid *even that* cost for the whole "we don't actually
need a TLB flush at all, until we actually free the pages".

So there are two levels of tlb flush optimizations

 (a) avoiding them entirely in the first place

 (b) the whole "once you have to flush, keep track of lazy modes and
TLB generations, and flush ranges"

And honestly, I think you ignored (a), and that's where we do exactly
those kinds of "this case doesn't need to flush AT ALL" things.

So when you say

>       The thing I like about this scheme
> the most is that it avoids relying on almost all the OS data-structures
> (e.g., PageAnon()), making it much easier to grasp.

I think it's because you've ignored a big part of the whole issue.

               Linus

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 01/13] mm: Update ptep_get_lockless()s comment
  2022-10-27 20:31                       ` Linus Torvalds
@ 2022-10-27 21:44                         ` Nadav Amit
  2022-10-28 23:57                           ` Nadav Amit
  0 siblings, 1 reply; 148+ messages in thread
From: Nadav Amit @ 2022-10-27 21:44 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, Jann Horn, John Hubbard, x86, willy,
	Andrew Morton, linux-kernel, linux-mm, Andrea Arcangeli,
	kirill.shutemov, jroedel, ubizjak

On Oct 27, 2022, at 1:31 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote:

> So there are two levels of tlb flush optimizations
> 
> (a) avoiding them entirely in the first place
> 
> (b) the whole "once you have to flush, keep track of lazy modes and
> TLB generations, and flush ranges"
> 
> And honestly, I think you ignored (a), and that's where we do exactly
> those kinds of "this case doesn't need to flush AT ALL" things.

I did try to avoid TLB flushes by introducing pte_needs_flush() and avoiding
flushes based on the architectural PTE changes. There are even more x86
arch-based opportunities to further avoid TLB flushes (and then only flush
the TLB if a spurious #PF occurs).
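
As a much-simplified illustration of the idea (the real helper looks at more
of the architectural bits than this):

	/*
	 * Simplified sketch: decide on a flush purely from the architectural
	 * old/new PTE values; not the actual x86 helper.
	 */
	static inline bool change_needs_flush(pte_t oldpte, pte_t newpte)
	{
		/* A non-present old PTE cannot be cached in the TLB. */
		if (!pte_present(oldpte))
			return false;

		/* Pure permission relaxation can rely on a spurious #PF instead. */
		if (pte_pfn(oldpte) == pte_pfn(newpte) &&
		    pte_write(newpte) && !pte_write(oldpte))
			return false;

		return true;
	}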

Personally, I still think that making decisions on flushes based on (mostly)
only the arch state makes the code more robust against misuse (e.g., see
various confusions between mmu_gather’s fullmm and need_flush_all).

Having said that, I will follow your feedback that the extra complexity is
worth the extra performance.

Anyhow, admittedly, I need to give it more thought. For instance, with respect
to the code that you mentioned (in zap_pte_range), after reading it again,
it seems strange: how is it ok to defer the TLB flush until after the rmap was
removed, even if it is done while the PTL is held?
folio_clear_dirty_for_io() would not sync on the PTL afterwards, so the page
might be later re-dirtied using a stale cached PTE. Oh well.


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 01/13] mm: Update ptep_get_lockless()s comment
  2022-10-27 21:44                         ` Nadav Amit
@ 2022-10-28 23:57                           ` Nadav Amit
  2022-10-29  0:42                             ` Linus Torvalds
  0 siblings, 1 reply; 148+ messages in thread
From: Nadav Amit @ 2022-10-28 23:57 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Jann Horn, John Hubbard, X86 ML, Matthew Wilcox, Andrew Morton,
	kernel list, Linux-MM, Andrea Arcangeli, Kirill A . Shutemov,
	jroedel, ubizjak, Linus Torvalds

On Oct 27, 2022, at 2:44 PM, Nadav Amit <nadav.amit@gmail.com> wrote:

> Anyhow, admittedly, I need to give it more thought. For instance, with respect
> to the code that you mentioned (in zap_pte_range), after reading it again,
> it seems strange: how is it ok to defer the TLB flush until after the rmap was
> removed, even if it is done while the PTL is held?
> folio_clear_dirty_for_io() would not sync on the PTL afterwards, so the page
> might be later re-dirtied using a stale cached PTE. Oh well.

Peter,

So it appears to be a problem - flushing after removing the rmap. I attach a
PoC that shows the problem.

The problem is in the following code of zap_pte_range():

	if (!PageAnon(page)) {
		if (pte_dirty(ptent)) {
			force_flush = 1;
			set_page_dirty(page);
		}
                …
	}
	page_remove_rmap(page, vma, false);

Once we remove the rmap, rmap_walk() would not acquire the page-table lock
anymore. As a result, nothing prevents the kernel from performing writeback
and cleaning the page-struct dirty-bit before the TLB is actually flushed.
As a result, if there is an additional thread that has the dirty-PTE cached
in the TLB, it can keep writing to the page and nothing (PTE/page-struct)
will keep track that the page has been dirtied.

In other words, writes to the memory mapped file after
munmap()/MADV_DONTNEED started can be lost.

This affects both munmap() and MADV_DONTNEED. One might argue that if you
don’t keep writing after munmap()/MADV_DONTNEED it’s your fault
(questionable), but if that’s the case why do we bother with force_flush at
all?

If we want it to behave correctly - i.e., writes after munmap/MADV_DONTNEED
to propagate to the file - we need to collect dirty pages and remove their
rmap only after we flush the TLB (and before we release the ptl). mmu_gather
would probably need to be changed for this matter.

Thoughts?

---

Some details about the PoC (below):

We've got 3 threads that use a 2MB range:

1. Maintains a counter in the first 4KB of the 2MB range and checks it
   actually updates the memory.

2. Dirties pages in 2MB range (just to make the race more likely).

3. Either (i) maps the same mapping at the first 4KB (to test munmap
   indirectly); or (ii) runs MADV_DONTNEED on the first 4KB.

In addition, a child process runs msync and fdatasync to writeback the first
4KB.

The PoC should be run with a file on a block RAM device. It manages to
trigger the issue relatively reliably (within 60 seconds) with munmap() and
slightly less reliably with MADV_DONTNEED. I have no idea if it works in a VM,
with deep C-states, etc.

-- >8 --

#define _GNU_SOURCE
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdbool.h>
#include <unistd.h>
#include <time.h>

#define handle_error(msg) \
   do { perror(msg); exit(EXIT_FAILURE); } while (0)

void *p;
volatile bool stop = false;
pid_t flusher_pid;
int fd;

#define PAGE_SIZE	(4096ul)
#define PAGES_PER_PMD	(512)
#define HPAGE_SIZE	(PAGE_SIZE * PAGES_PER_PMD)

// Comment MUNMAP_TEST for MADV_DONTNEED test
#define MUNMAP_TEST

void *dirtying_thread(void *arg)
{
	int i;

	while (!stop) {
		for (i = 1; i < PAGES_PER_PMD; i++) {
			*(volatile char *)(p + (i * PAGE_SIZE) + 64) = 5;
		}
	}
	return NULL;
}

void *checking_thread(void *arg)
{
	volatile unsigned long *ul_p = (volatile unsigned long*)p;
	unsigned long cnt = 0;

	while (!stop) {
		*ul_p = cnt;
		if (*ul_p != cnt) {
			printf("FAILED: expected %ld, got %ld\n", cnt, *ul_p);
			kill(flusher_pid, SIGTERM);
			exit(0);
		}
		cnt++;
	}
	return NULL;
}

void *remap_thread(void *arg)
{
	void *ptr;
	struct timespec t = {
		.tv_nsec = 10000,
	};

	while (!stop) {
#ifdef MUNMAP_TEST
		ptr = mmap(p, PAGE_SIZE, PROT_READ|PROT_WRITE,
			   MAP_SHARED|MAP_FIXED|MAP_POPULATE, fd, 0);
		if (ptr == MAP_FAILED)
			handle_error("remap_thread");
#else
		if (madvise(p, PAGE_SIZE, MADV_DONTNEED) < 0)
			handle_error("MADV_DONTNEED");
		nanosleep(&t, NULL);
#endif
	}
	return NULL;
}

void flushing_process(void)
{
	// Remove the pages to speed up rmap_walk and allow to drop caches.
	if (madvise(p, HPAGE_SIZE, MADV_DONTNEED) < 0)
		handle_error("MADV_DONTNEED");

	while (true) {
		if (msync(p, PAGE_SIZE, MS_SYNC))
			handle_error("msync");
		if (posix_fadvise(fd, 0, PAGE_SIZE, POSIX_FADV_DONTNEED))
			handle_error("posix_fadvise");
	}
}

int main(int argc, char *argv[])
{
	void *(*thread_funcs[])(void*) = {
		&dirtying_thread,
		&checking_thread,
		&remap_thread,
	};	
	int r, i;
	int rc1, rc2;
	unsigned long addr;
	void *ptr;
	char *page = malloc(PAGE_SIZE);
	int n_threads = sizeof(thread_funcs) / sizeof(*thread_funcs);
	pthread_t *threads = malloc(sizeof(pthread_t) * n_threads);
	pid_t pid;
	
	if (argc < 2) {
		fprintf(stderr, "usages: %s [filename]\n", argv[0]);
		exit(EXIT_FAILURE);
	}

	fd = open(argv[1], O_RDWR|O_CREAT, 0666);
	if (fd == -1)
		handle_error("open fd");

	for (i = 0; i < PAGES_PER_PMD; i++) {
		if (write(fd, page, PAGE_SIZE) != PAGE_SIZE)
			handle_error("write");
	}
	free(page);

	ptr = mmap(NULL, HPAGE_SIZE * 2, PROT_NONE, MAP_PRIVATE|MAP_ANON,
                   -1, 0);

	if (ptr == MAP_FAILED)
		handle_error("mmap anon");

	addr = (unsigned long)(ptr + HPAGE_SIZE - 1) & ~(HPAGE_SIZE - 1);
	printf("starting...\n");

	ptr = mmap((void *)addr, HPAGE_SIZE, PROT_READ|PROT_WRITE,
		   MAP_SHARED|MAP_FIXED|MAP_POPULATE, fd, 0);

	if (ptr == MAP_FAILED)
		handle_error("mmap file - start");
	
	p = ptr;

	for (i = 0; i < n_threads; i++) {
		r = pthread_create(&threads[i], NULL, thread_funcs[i], NULL);
		if (r)
			handle_error("pthread_create");
	}

	// Run the flushing process in a different process, so msync() would
	// not require mmap_lock.
	pid = fork();
	if (pid == 0)
		flushing_process();
	flusher_pid = pid;

	sleep(60);

	stop = true;
	for (i = 0; i < n_threads; i++)
		pthread_join(threads[i], NULL);
	kill(flusher_pid, SIGTERM);
	printf("Finished without an error");

	exit(0);
}

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 01/13] mm: Update ptep_get_lockless()s comment
  2022-10-28 23:57                           ` Nadav Amit
@ 2022-10-29  0:42                             ` Linus Torvalds
  2022-10-29 18:05                               ` Nadav Amit
  0 siblings, 1 reply; 148+ messages in thread
From: Linus Torvalds @ 2022-10-29  0:42 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Peter Zijlstra, Jann Horn, John Hubbard, X86 ML, Matthew Wilcox,
	Andrew Morton, kernel list, Linux-MM, Andrea Arcangeli,
	Kirill A . Shutemov, jroedel, ubizjak

On Fri, Oct 28, 2022 at 4:57 PM Nadav Amit <nadav.amit@gmail.com> wrote:
>
> The problem is in the following code of zap_pte_range():
>
>         if (!PageAnon(page)) {
>                 if (pte_dirty(ptent)) {
>                         force_flush = 1;
>                         set_page_dirty(page);
>                 }
>                 …
>         }
>         page_remove_rmap(page, vma, false);
>
> Once we remove the rmap, rmap_walk() would not acquire the page-table lock
> anymore. As a result, nothing prevents the kernel from performing writeback
> and cleaning the page-struct dirty-bit before the TLB is actually flushed.

Hah.

The original reason for force_flush there was similar, with a race wrt
page_mkclean() because this code doesn't take the page lock that would
normally serialize things, because the page lock is so painful and
ends up having nasty nasty interactions with slow IO operations.

So all the page-dirty handling really *wants* to use the page lock,
and for the IO side (writeback) that ends up being acceptable and
works well, but from that "serialize VM state" it's horrendous.

So instead the code intentionally serialized on the rmap data
structures which page_mkclean() also walks, and as you point out,
that's broken. It's not broken at the point where we do
set_page_dirty(), but it *becomes* broken when we drop the rmap, and the
problem is exactly that "we still have the dirty bit hidden in the TLB
state" issue that you pinpoint.

I think the proper fix (or at least _a_ proper fix) would be to
actually carry the dirty bit along to the __tlb_remove_page() point,
and actually treat it exactly the same way as the page pointer itself
- set the page dirty after the TLB flush, the same way we can free the
page after the TLB flush.

We could easily hide said dirty bit in the low bits of the
"batch->pages[]" array or something like that. We'd just have to add
the 'dirty' argument to __tlb_remove_page_size() and friends.
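
The bit-stuffing itself is just pointer tagging, something like the sketch
below ('struct page' pointers are at least word-aligned, so bit 0 is free):

	static inline unsigned long encode_dirty_page(struct page *page, bool dirty)
	{
		return (unsigned long)page | dirty;
	}

	static inline struct page *decode_dirty_page(unsigned long val, bool *dirty)
	{
		*dirty = val & 1;
		return (struct page *)(val & ~1ul);
	}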

Hmm?

Your idea of "do the page_remove_rmap() late instead" would also work,
but the reason I think just squirrelling away the dirty bit is the
"proper" fix is that it would get rid of the whole need for
'force_flush' in this area entirely. So we'd not only fix that race
you noticed, we'd actually do so and reduce the number of TLB flushes
too.

I don't know. Maybe I'm missing something fundamental, and my idea is
just stupid.

               Linus

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 00/13] Clean up pmd_get_atomic() and i386-PAE
  2022-10-22 11:14 [PATCH 00/13] Clean up pmd_get_atomic() and i386-PAE Peter Zijlstra
                   ` (13 preceding siblings ...)
  2022-10-22 17:57 ` [PATCH 00/13] Clean up pmd_get_atomic() and i386-PAE Linus Torvalds
@ 2022-10-29 12:21 ` Peter Zijlstra
  14 siblings, 0 replies; 148+ messages in thread
From: Peter Zijlstra @ 2022-10-29 12:21 UTC (permalink / raw)
  To: x86, willy, torvalds, akpm
  Cc: linux-kernel, linux-mm, aarcange, kirill.shutemov, jroedel, ubizjak

On Sat, Oct 22, 2022 at 01:14:03PM +0200, Peter Zijlstra wrote:
> Hi,
> 
> At long *long* last a respin of the patches that clean up pmd_get_atomic() and
> i386-PAE. I'd nearly forgotten why I did this, but the old posting gave clue
> that patch #7 was the whole purpose of me doing these patches.
> 
> Having carried these patches for at least 2 years, they recently hit a rebase
> bump against the mg-lru patches, which is what prompted this repost.
> 
> Linus' comment about try_cmpxchg64() (and Uros before him) made me redo those
> patches (see patch #10) which resulted in pxx_xchg64(). This in turn led to
> killing off set_64bit().
> 
> The robot doesn't hate on these patches and they boot in kvm (because who still
> has i386 hardware).
> 
> Patches also available at:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git x86/mm.pae

Aside from the whole discussion about TLB invalidates, I mean to commit
these patches to tip/x86/mm somewhere next week since they do improve
the current situation.

Holler if there's objections.

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 01/13] mm: Update ptep_get_lockless()s comment
  2022-10-29  0:42                             ` Linus Torvalds
@ 2022-10-29 18:05                               ` Nadav Amit
  2022-10-29 18:36                                 ` Linus Torvalds
  0 siblings, 1 reply; 148+ messages in thread
From: Nadav Amit @ 2022-10-29 18:05 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, Jann Horn, John Hubbard, X86 ML, Matthew Wilcox,
	Andrew Morton, kernel list, Linux-MM, Andrea Arcangeli,
	Kirill A . Shutemov, jroedel, ubizjak, Alistair Popple

On Oct 28, 2022, at 5:42 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote:

> I think the proper fix (or at least _a_ proper fix) would be to
> actually carry the dirty bit along to the __tlb_remove_page() point,
> and actually treat it exactly the same way as the page pointer itself
> - set the page dirty after the TLB flush, the same way we can free the
> page after the TLB flush.
> 
> We could easily hide said dirty bit in the low bits of the
> "batch->pages[]" array or something like that. We'd just have to add
> the 'dirty' argument to __tlb_remove_page_size() and friends.

Thank you for your quick response. I was slow to respond due to jet lag.

Anyhow, I am not sure whether the solution that you propose would work.
Please let me know if my understanding makes sense.

Let’s assume that we do not call set_page_dirty() before we remove the rmap
but only after we invalidate the page [*]. Let’s assume that
shrink_page_list() is called after the page’s rmap is removed and the page
is no longer mapped, but before set_page_dirty() was actually called.

In such a case, shrink_page_list() would consider the page clean, and would
indeed keep the page (since __remove_mapping() would find elevated page
refcount), which appears to give us a chance to mark the page as dirty
later.

However, IIUC, in this case shrink_page_list() might still call
filemap_release_folio() and release the buffers, so calling set_page_dirty()
afterwards - after the actual TLB invalidation took place - would fail.

> Your idea of "do the page_remove_rmap() late instead" would also work,
> but the reason I think just squirrelling away the dirty bit is the
> "proper" fix is that it would get rid of the whole need for
> 'force_flush' in this area entirely. So we'd not only fix that race
> you noticed, we'd actually do so and reduce the number of TLB flushes
> too.

I’m all for reducing the number of TLB flushes, and your solution does sound
better in general. I proposed something that I considered to be the path of
least resistance (i.e., least chance of breaking something). I can do what
you proposed, but I am not sure how to deal with the buffers being removed.

One more note: This issue, I think, also affects migrate_vma_collect_pmd().
Alistair recently addressed an issue there, but in my prior feedback to him
I missed this issue.


[*] Note that for our scenario this would be pretty much the same if we also
called set_page_dirty() before removing the rmap, but the page was cleaned
while the TLB invalidation had still not been performed.


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 01/13] mm: Update ptep_get_lockless()s comment
  2022-10-29 18:05                               ` Nadav Amit
@ 2022-10-29 18:36                                 ` Linus Torvalds
  2022-10-29 18:58                                   ` Linus Torvalds
  2022-10-29 19:39                                   ` John Hubbard
  0 siblings, 2 replies; 148+ messages in thread
From: Linus Torvalds @ 2022-10-29 18:36 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Peter Zijlstra, Jann Horn, John Hubbard, X86 ML, Matthew Wilcox,
	Andrew Morton, kernel list, Linux-MM, Andrea Arcangeli,
	Kirill A . Shutemov, jroedel, ubizjak, Alistair Popple

[-- Attachment #1: Type: text/plain, Size: 3113 bytes --]

On Sat, Oct 29, 2022 at 11:05 AM Nadav Amit <nadav.amit@gmail.com> wrote:
>
> Anyhow, I am not sure whether the solution that you propose would work.
> Please let me know if my understanding makes sense.

Let me include my (UNTESTED! PROBABLY GARBAGE!) patch here just as a
"I meant something like this".

Note that it's untested, but I tried to make it intentionally use
different names and an opaque type in the 'mmu_gather' data structure
so that at least these bit games end up being type-safe and more
likely to be correct if it compiles.

And I will say that it does compile for me - but only in my normal
x86-64 config. I haven't tested anything else, and I guarantee it
won't build on s390, for example, because s390 has its own mmu_gather
functions.

> Let’s assume that we do not call set_page_dirty() before we remove the rmap
> but only after we invalidate the page [*]. Let’s assume that
> shrink_page_list() is called after the page’s rmap is removed and the page
> is no longer mapped, but before set_page_dirty() was actually called.
>
> In such a case, shrink_page_list() would consider the page clean, and would
> indeed keep the page (since __remove_mapping() would find elevated page
> refcount), which appears to give us a chance to mark the page as dirty
> later.

Right. That is no different from any other function (like "write()")
having looked up the page.

> However, IIUC, in this case shrink_page_list() might still call
> filemap_release_folio() and release the buffers, so calling set_page_dirty()
> afterwards - after the actual TLB invalidation took place - would fail.

I'm not seeing why.

That would imply that any "look up page, do set_page_dirty()" is
broken. They don't have rmap either. And we have a number of them all
over (eg think "GUP users" etc).

But hey, you were right about the stale TLB case, so I may just be
missing something.

I *think* the important thing is just that we need to mark the page
dirty from the page tables _after_ we've flushed the TLB, just to
make sure that there can be no subsequent dirtiers that then get lost.

Anyway, I think the best documentation for "this is what I meant" is
simply the patch. Does this affect your PoC on your setup?

I tried to run your program on my machine (WITHOUT this patch - I have
compiled this patch, but I haven't even booted it - it really could be
horrible broken).

But it doesn't fail for me even without the patch and I just get
"Finished without an error" over and over again - but you said it had
to be run on a RAM block device, which I didn't do, so my testing is
very suspect.

Again: THIS PATCH IS UNTESTED. I feel like it might actually work, if
only because I tried to be so careful with the type system.

But that "might actually work" is probably me being wildly optimistic,
and also depends on the whole concept being ok in the first place.

So realistically, think of this patch more as a "document in code what
Linus meant with his incoherent ramblings" rather than anything else.

Hmm?

               Linus

[-- Attachment #2: patch.diff --]
[-- Type: text/x-patch, Size: 6722 bytes --]

 include/asm-generic/tlb.h | 28 +++++++++++++++++++++++-----
 mm/memory.c               | 17 ++++++++---------
 mm/mmu_gather.c           | 36 ++++++++++++++++++++++++++++++++----
 3 files changed, 63 insertions(+), 18 deletions(-)

diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index 492dce43236e..a95085f6dd47 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -238,11 +238,29 @@ extern void tlb_remove_table(struct mmu_gather *tlb, void *table);
  */
 #define MMU_GATHER_BUNDLE	8
 
+/* Fake type for an encoded page pointer with the dirty bit in the low bit */
+struct encoded_page;
+
+static inline struct encoded_page *encode_page(struct page *page, bool dirty)
+{
+	return (struct encoded_page *)(dirty | (unsigned long)page);
+}
+
+static inline bool encoded_page_dirty(struct encoded_page *page)
+{
+	return 1 & (unsigned long)page;
+}
+
+static inline struct page *encoded_page_ptr(struct encoded_page *page)
+{
+	return (struct page *)(~1ul & (unsigned long)page);
+}
+
 struct mmu_gather_batch {
 	struct mmu_gather_batch	*next;
 	unsigned int		nr;
 	unsigned int		max;
-	struct page		*pages[];
+	struct encoded_page	*encoded_pages[];
 };
 
 #define MAX_GATHER_BATCH	\
@@ -257,7 +275,7 @@ struct mmu_gather_batch {
 #define MAX_GATHER_BATCH_COUNT	(10000UL/MAX_GATHER_BATCH)
 
 extern bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page,
-				   int page_size);
+				   int page_size, bool dirty);
 #endif
 
 /*
@@ -431,13 +449,13 @@ static inline void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb)
 static inline void tlb_remove_page_size(struct mmu_gather *tlb,
 					struct page *page, int page_size)
 {
-	if (__tlb_remove_page_size(tlb, page, page_size))
+	if (__tlb_remove_page_size(tlb, page, page_size, false))
 		tlb_flush_mmu(tlb);
 }
 
-static inline bool __tlb_remove_page(struct mmu_gather *tlb, struct page *page)
+static inline bool __tlb_remove_page(struct mmu_gather *tlb, struct page *page, bool dirty)
 {
-	return __tlb_remove_page_size(tlb, page, PAGE_SIZE);
+	return __tlb_remove_page_size(tlb, page, PAGE_SIZE, dirty);
 }
 
 /* tlb_remove_page
diff --git a/mm/memory.c b/mm/memory.c
index f88c351aecd4..8ab4c0d7e99e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1423,7 +1423,6 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 	arch_enter_lazy_mmu_mode();
 	do {
 		pte_t ptent = *pte;
-		struct page *page;
 
 		if (pte_none(ptent))
 			continue;
@@ -1432,7 +1431,9 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 			break;
 
 		if (pte_present(ptent)) {
-			page = vm_normal_page(vma, addr, ptent);
+			struct page *page = vm_normal_page(vma, addr, ptent);
+			int dirty;
+
 			if (unlikely(!should_zap_page(details, page)))
 				continue;
 			ptent = ptep_get_and_clear_full(mm, addr, pte,
@@ -1443,11 +1444,9 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 			if (unlikely(!page))
 				continue;
 
+			dirty = 0;
 			if (!PageAnon(page)) {
-				if (pte_dirty(ptent)) {
-					force_flush = 1;
-					set_page_dirty(page);
-				}
+				dirty = pte_dirty(ptent);
 				if (pte_young(ptent) &&
 				    likely(!(vma->vm_flags & VM_SEQ_READ)))
 					mark_page_accessed(page);
@@ -1456,7 +1455,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 			page_remove_rmap(page, vma, false);
 			if (unlikely(page_mapcount(page) < 0))
 				print_bad_pte(vma, addr, ptent, page);
-			if (unlikely(__tlb_remove_page(tlb, page))) {
+			if (unlikely(__tlb_remove_page(tlb, page, dirty))) {
 				force_flush = 1;
 				addr += PAGE_SIZE;
 				break;
@@ -1467,7 +1466,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 		entry = pte_to_swp_entry(ptent);
 		if (is_device_private_entry(entry) ||
 		    is_device_exclusive_entry(entry)) {
-			page = pfn_swap_entry_to_page(entry);
+			struct page *page = pfn_swap_entry_to_page(entry);
 			if (unlikely(!should_zap_page(details, page)))
 				continue;
 			/*
@@ -1489,7 +1488,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 			if (unlikely(!free_swap_and_cache(entry)))
 				print_bad_pte(vma, addr, ptent, NULL);
 		} else if (is_migration_entry(entry)) {
-			page = pfn_swap_entry_to_page(entry);
+			struct page *page = pfn_swap_entry_to_page(entry);
 			if (!should_zap_page(details, page))
 				continue;
 			rss[mm_counter(page)]--;
diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
index add4244e5790..fa79e054413a 100644
--- a/mm/mmu_gather.c
+++ b/mm/mmu_gather.c
@@ -43,12 +43,40 @@ static bool tlb_next_batch(struct mmu_gather *tlb)
 	return true;
 }
 
+/*
+ * We get an 'encoded page' array, which has page pointers with
+ * the dirty bit in the low bit of the array.
+ *
+ * The TLB has been flushed, now we need to move the dirty bit into
+ * the 'struct page', clean the array in-place, and then free the
+ * pages and their swap cache.
+ */
+static void clean_and_free_pages_and_swap_cache(struct encoded_page **pages, unsigned int nr)
+{
+	for (unsigned int i = 0; i < nr; i++) {
+		struct encoded_page *encoded = pages[i];
+		if (encoded_page_dirty(encoded)) {
+			struct page *page = encoded_page_ptr(encoded);
+			/* Clean the dirty pointer in-place */
+			pages[i] = encode_page(page, 0);
+			set_page_dirty(page);
+		}
+	}
+
+	/*
+	 * Now all entries have been un-encoded, and changed to plain
+	 * page pointers, so we can cast the 'encoded_page' array to
+	 * a plain page array and free them
+	 */
+	free_pages_and_swap_cache((struct page **)pages, nr);
+}
+
 static void tlb_batch_pages_flush(struct mmu_gather *tlb)
 {
 	struct mmu_gather_batch *batch;
 
 	for (batch = &tlb->local; batch && batch->nr; batch = batch->next) {
-		struct page **pages = batch->pages;
+		struct encoded_page **pages = batch->encoded_pages;
 
 		do {
 			/*
@@ -56,7 +84,7 @@ static void tlb_batch_pages_flush(struct mmu_gather *tlb)
 			 */
 			unsigned int nr = min(512U, batch->nr);
 
-			free_pages_and_swap_cache(pages, nr);
+			clean_and_free_pages_and_swap_cache(pages, nr);
 			pages += nr;
 			batch->nr -= nr;
 
@@ -77,7 +105,7 @@ static void tlb_batch_list_free(struct mmu_gather *tlb)
 	tlb->local.next = NULL;
 }
 
-bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page, int page_size)
+bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page, int page_size, bool dirty)
 {
 	struct mmu_gather_batch *batch;
 
@@ -92,7 +120,7 @@ bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page, int page_
 	 * Add the page and check if we are full. If so
 	 * force a flush.
 	 */
-	batch->pages[batch->nr++] = page;
+	batch->encoded_pages[batch->nr++] = encode_page(page, dirty);
 	if (batch->nr == batch->max) {
 		if (!tlb_next_batch(tlb))
 			return true;

^ permalink raw reply related	[flat|nested] 148+ messages in thread

* Re: [PATCH 01/13] mm: Update ptep_get_lockless()s comment
  2022-10-29 18:36                                 ` Linus Torvalds
@ 2022-10-29 18:58                                   ` Linus Torvalds
  2022-10-29 19:14                                     ` Linus Torvalds
  2022-10-30  2:17                                     ` Nadav Amit
  2022-10-29 19:39                                   ` John Hubbard
  1 sibling, 2 replies; 148+ messages in thread
From: Linus Torvalds @ 2022-10-29 18:58 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Peter Zijlstra, Jann Horn, John Hubbard, X86 ML, Matthew Wilcox,
	Andrew Morton, kernel list, Linux-MM, Andrea Arcangeli,
	Kirill A . Shutemov, jroedel, ubizjak, Alistair Popple

[-- Attachment #1: Type: text/plain, Size: 768 bytes --]

On Sat, Oct 29, 2022 at 11:36 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Anyway, I think the best documentation for "this is what I meant" is
> simply the patch. Does this affect your PoC on your setup?

Here's a slightly cleaned up set with preliminary commit messages, and
an explanation for why some of the 'struct page' declarations were
moved around a bit in case you wondered about that part of the change
in the full patch.

The end result should be the same, so if you already looked at the
previous unified patch, never mind. But this one tries to make for a
better patch series.

Still not tested in any way, shape, or form. I decided I wanted to
send this one before booting into this and possibly blowing up ;^)

                   Linus

[-- Attachment #2: 0001-mm-zap_page_range-narrow-down-page-variable-scope.patch --]
[-- Type: text/x-patch, Size: 2124 bytes --]

From 8caca6a93ebe3b0e4adabfb1b8d13e86d41fd329 Mon Sep 17 00:00:00 2001
From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Sat, 29 Oct 2022 10:42:25 -0700
Subject: [PATCH 1/2] mm: zap_page_range: narrow down 'page' variable scope

We're using the same 'struct page *page' variable for three very
distinct cases.  That works and the compiler does the right thing, but
I'm about to add some page-related attributes that only affect one of
them, so let's make the whole "these are really different uses"
explicit.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
 mm/memory.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index f88c351aecd4..d52f5a68c561 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1423,7 +1423,6 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 	arch_enter_lazy_mmu_mode();
 	do {
 		pte_t ptent = *pte;
-		struct page *page;
 
 		if (pte_none(ptent))
 			continue;
@@ -1432,7 +1431,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 			break;
 
 		if (pte_present(ptent)) {
-			page = vm_normal_page(vma, addr, ptent);
+			struct page *page = vm_normal_page(vma, addr, ptent);
 			if (unlikely(!should_zap_page(details, page)))
 				continue;
 			ptent = ptep_get_and_clear_full(mm, addr, pte,
@@ -1467,7 +1466,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 		entry = pte_to_swp_entry(ptent);
 		if (is_device_private_entry(entry) ||
 		    is_device_exclusive_entry(entry)) {
-			page = pfn_swap_entry_to_page(entry);
+			struct page *page = pfn_swap_entry_to_page(entry);
 			if (unlikely(!should_zap_page(details, page)))
 				continue;
 			/*
@@ -1489,7 +1488,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 			if (unlikely(!free_swap_and_cache(entry)))
 				print_bad_pte(vma, addr, ptent, NULL);
 		} else if (is_migration_entry(entry)) {
-			page = pfn_swap_entry_to_page(entry);
+			struct page *page = pfn_swap_entry_to_page(entry);
 			if (!should_zap_page(details, page))
 				continue;
 			rss[mm_counter(page)]--;
-- 
2.37.1.289.g45aa1e5c72.dirty


[-- Attachment #3: 0002-mm-make-sure-to-flush-TLB-before-marking-page-dirty.patch --]
[-- Type: text/x-patch, Size: 7018 bytes --]

From 86d1a3807c013abca72086278d9308e398e7b41d Mon Sep 17 00:00:00 2001
From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Sat, 29 Oct 2022 11:45:07 -0700
Subject: [PATCH 2/2] mm: make sure to flush TLB before marking page dirty

When we remove a page table entry, we are very careful to only free the
page after we have flushed the TLB, because other CPU's could still be
using the page through stale TLB entries until after the flush.

However, we mark the underlying page dirty immediately, and then remove
the rmap entry for the page, which means that

 (a) another CPU could come in and clean it, never seeing our mapping of
     the page

 (b) yet another CPU could continue to use the stale and dirty TLB entry
     and continue to write to said page

resulting in a page that has been dirtied, but then marked clean again,
all while another CPU might have dirtied it some more.  End result:
possibly lost dirty data.

This commit uses the same old TLB gather array that we use to delay the
freeing of the page to also keep the dirty state of the page table
entry, so that the 'set_page_dirty()' from the page table can be done
after the TLB flush, closing the race.

Reported-by: Nadav Amit <nadav.amit@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
 include/asm-generic/tlb.h | 28 +++++++++++++++++++++++-----
 mm/memory.c               | 10 +++++-----
 mm/mmu_gather.c           | 36 ++++++++++++++++++++++++++++++++----
 3 files changed, 60 insertions(+), 14 deletions(-)

diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index 492dce43236e..a95085f6dd47 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -238,11 +238,29 @@ extern void tlb_remove_table(struct mmu_gather *tlb, void *table);
  */
 #define MMU_GATHER_BUNDLE	8
 
+/* Fake type for an encoded page pointer with the dirty bit in the low bit */
+struct encoded_page;
+
+static inline struct encoded_page *encode_page(struct page *page, bool dirty)
+{
+	return (struct encoded_page *)(dirty | (unsigned long)page);
+}
+
+static inline bool encoded_page_dirty(struct encoded_page *page)
+{
+	return 1 & (unsigned long)page;
+}
+
+static inline struct page *encoded_page_ptr(struct encoded_page *page)
+{
+	return (struct page *)(~1ul & (unsigned long)page);
+}
+
 struct mmu_gather_batch {
 	struct mmu_gather_batch	*next;
 	unsigned int		nr;
 	unsigned int		max;
-	struct page		*pages[];
+	struct encoded_page	*encoded_pages[];
 };
 
 #define MAX_GATHER_BATCH	\
@@ -257,7 +275,7 @@ struct mmu_gather_batch {
 #define MAX_GATHER_BATCH_COUNT	(10000UL/MAX_GATHER_BATCH)
 
 extern bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page,
-				   int page_size);
+				   int page_size, bool dirty);
 #endif
 
 /*
@@ -431,13 +449,13 @@ static inline void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb)
 static inline void tlb_remove_page_size(struct mmu_gather *tlb,
 					struct page *page, int page_size)
 {
-	if (__tlb_remove_page_size(tlb, page, page_size))
+	if (__tlb_remove_page_size(tlb, page, page_size, false))
 		tlb_flush_mmu(tlb);
 }
 
-static inline bool __tlb_remove_page(struct mmu_gather *tlb, struct page *page)
+static inline bool __tlb_remove_page(struct mmu_gather *tlb, struct page *page, bool dirty)
 {
-	return __tlb_remove_page_size(tlb, page, PAGE_SIZE);
+	return __tlb_remove_page_size(tlb, page, PAGE_SIZE, dirty);
 }
 
 /* tlb_remove_page
diff --git a/mm/memory.c b/mm/memory.c
index d52f5a68c561..8ab4c0d7e99e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1432,6 +1432,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 
 		if (pte_present(ptent)) {
 			struct page *page = vm_normal_page(vma, addr, ptent);
+			int dirty;
+
 			if (unlikely(!should_zap_page(details, page)))
 				continue;
 			ptent = ptep_get_and_clear_full(mm, addr, pte,
@@ -1442,11 +1444,9 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 			if (unlikely(!page))
 				continue;
 
+			dirty = 0;
 			if (!PageAnon(page)) {
-				if (pte_dirty(ptent)) {
-					force_flush = 1;
-					set_page_dirty(page);
-				}
+				dirty = pte_dirty(ptent);
 				if (pte_young(ptent) &&
 				    likely(!(vma->vm_flags & VM_SEQ_READ)))
 					mark_page_accessed(page);
@@ -1455,7 +1455,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 			page_remove_rmap(page, vma, false);
 			if (unlikely(page_mapcount(page) < 0))
 				print_bad_pte(vma, addr, ptent, page);
-			if (unlikely(__tlb_remove_page(tlb, page))) {
+			if (unlikely(__tlb_remove_page(tlb, page, dirty))) {
 				force_flush = 1;
 				addr += PAGE_SIZE;
 				break;
diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
index add4244e5790..fa79e054413a 100644
--- a/mm/mmu_gather.c
+++ b/mm/mmu_gather.c
@@ -43,12 +43,40 @@ static bool tlb_next_batch(struct mmu_gather *tlb)
 	return true;
 }
 
+/*
+ * We get an 'encoded page' array, which has page pointers with
+ * the dirty bit in the low bit of the array.
+ *
+ * The TLB has been flushed, now we need to move the dirty bit into
+ * the 'struct page', clean the array in-place, and then free the
+ * pages and their swap cache.
+ */
+static void clean_and_free_pages_and_swap_cache(struct encoded_page **pages, unsigned int nr)
+{
+	for (unsigned int i = 0; i < nr; i++) {
+		struct encoded_page *encoded = pages[i];
+		if (encoded_page_dirty(encoded)) {
+			struct page *page = encoded_page_ptr(encoded);
+			/* Clean the dirty pointer in-place */
+			pages[i] = encode_page(page, 0);
+			set_page_dirty(page);
+		}
+	}
+
+	/*
+	 * Now all entries have been un-encoded, and changed to plain
+	 * page pointers, so we can cast the 'encoded_page' array to
+	 * a plain page array and free them
+	 */
+	free_pages_and_swap_cache((struct page **)pages, nr);
+}
+
 static void tlb_batch_pages_flush(struct mmu_gather *tlb)
 {
 	struct mmu_gather_batch *batch;
 
 	for (batch = &tlb->local; batch && batch->nr; batch = batch->next) {
-		struct page **pages = batch->pages;
+		struct encoded_page **pages = batch->encoded_pages;
 
 		do {
 			/*
@@ -56,7 +84,7 @@ static void tlb_batch_pages_flush(struct mmu_gather *tlb)
 			 */
 			unsigned int nr = min(512U, batch->nr);
 
-			free_pages_and_swap_cache(pages, nr);
+			clean_and_free_pages_and_swap_cache(pages, nr);
 			pages += nr;
 			batch->nr -= nr;
 
@@ -77,7 +105,7 @@ static void tlb_batch_list_free(struct mmu_gather *tlb)
 	tlb->local.next = NULL;
 }
 
-bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page, int page_size)
+bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page, int page_size, bool dirty)
 {
 	struct mmu_gather_batch *batch;
 
@@ -92,7 +120,7 @@ bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page, int page_
 	 * Add the page and check if we are full. If so
 	 * force a flush.
 	 */
-	batch->pages[batch->nr++] = page;
+	batch->encoded_pages[batch->nr++] = encode_page(page, dirty);
 	if (batch->nr == batch->max) {
 		if (!tlb_next_batch(tlb))
 			return true;
-- 
2.37.1.289.g45aa1e5c72.dirty


^ permalink raw reply related	[flat|nested] 148+ messages in thread

* Re: [PATCH 01/13] mm: Update ptep_get_lockless()s comment
  2022-10-29 18:58                                   ` Linus Torvalds
@ 2022-10-29 19:14                                     ` Linus Torvalds
  2022-10-29 19:28                                       ` Nadav Amit
  2022-10-30  0:18                                       ` Nadav Amit
  2022-10-30  2:17                                     ` Nadav Amit
  1 sibling, 2 replies; 148+ messages in thread
From: Linus Torvalds @ 2022-10-29 19:14 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Peter Zijlstra, Jann Horn, John Hubbard, X86 ML, Matthew Wilcox,
	Andrew Morton, kernel list, Linux-MM, Andrea Arcangeli,
	Kirill A . Shutemov, jroedel, ubizjak, Alistair Popple

On Sat, Oct 29, 2022 at 11:58 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Still not tested in any way, shape, or form. I decided I wanted to
> send this one before booting into this and possibly blowing up ;^)

Well, it boots, and I see no difference with your PoC code.

It didn't fail for me before, it doesn't fail for me with those patches.

Again, the "it doesn't fail for me" is probably because I'm running it
incorrectly, although for all I know there can also be hardware
differences.

I'm testing on an older AMD threadripper, and as I'm sure you are very
aware, some AMD cores used to have special support for keeping the TLB
coherent with the actual page table contents in order to then avoid
TLB flushes entirely.

Those things ended up being buggy and disabled, but my point is that
hardware differences can obviously actively hide this issue by making
the TLB contents track page table changes.

So even if I were to run it the same way you do, I might not see the
failure due to just running it on different hardware with different
TLB and timing.

Anyway, the patches don't seem to cause any *obvious* problems. That's
not to say that they are correct, or that they fix anything, but it's
certainly a fairly simple and straightforward patch, and it "feels
right" to me.

Sadly, reality doesn't always agree with my feelings. Damn.

                Linus

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 01/13] mm: Update ptep_get_lockless()s comment
  2022-10-29 19:14                                     ` Linus Torvalds
@ 2022-10-29 19:28                                       ` Nadav Amit
  2022-10-30  0:18                                       ` Nadav Amit
  1 sibling, 0 replies; 148+ messages in thread
From: Nadav Amit @ 2022-10-29 19:28 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, Jann Horn, John Hubbard, X86 ML, Matthew Wilcox,
	Andrew Morton, kernel list, Linux-MM, Andrea Arcangeli,
	Kirill A . Shutemov, jroedel, ubizjak, Alistair Popple

On Oct 29, 2022, at 12:14 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Sat, Oct 29, 2022 at 11:58 AM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>> Still not tested in any way, shape, or form. I decided I wanted to
>> send this one before booting into this and possibly blowing up ;^)
> 
> Well, it boots, and I see no difference with your PoC code.
> 
> It didn't fail for me before, it doesn't fail for me with those patches.
> 
> Again, the "it doesn't fail for me" is probably because I'm running it
> incorrectly, although for all I know there can also be hardware
> differences.

Please give me some time to test it. I presume you ran it with a block ram
device (not tmpfs) and not on a virtual machine (which can matter beyond
the Intel/AMD implementation differences).

But even if your patches work and the tests pass, I am not sure it means
that everything is fine. I did not try to trigger a race with
shrink_page_list(), and doing that might be harder than the race I tried to
create before. I need to do some tracing to understand what I was missing
about shrink_page_list() - assuming that I am mistaken about the buffers
being potentially released.

I would note that my concern about releasing the buffers is partially driven
by issues that were reported before [1]. I am actually not sure how they
were resolved.


[1] https://lore.kernel.org/all/20180103100430.GE4911@quack2.suse.cz/


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 01/13] mm: Update ptep_get_lockless()s comment
  2022-10-29 18:36                                 ` Linus Torvalds
  2022-10-29 18:58                                   ` Linus Torvalds
@ 2022-10-29 19:39                                   ` John Hubbard
  2022-10-29 20:15                                     ` Linus Torvalds
  1 sibling, 1 reply; 148+ messages in thread
From: John Hubbard @ 2022-10-29 19:39 UTC (permalink / raw)
  To: Linus Torvalds, Nadav Amit
  Cc: Peter Zijlstra, Jann Horn, X86 ML, Matthew Wilcox, Andrew Morton,
	kernel list, Linux-MM, Andrea Arcangeli, Kirill A . Shutemov,
	jroedel, ubizjak, Alistair Popple

On 10/29/22 11:36, Linus Torvalds wrote:
>> In such a case, shrink_page_list() would consider the page clean, and would
>> indeed keep the page (since __remove_mapping() would find elevated page
>> refcount), which appears to give us a chance to mark the page as dirty
>> later.
> 
> Right. That is not different to any other function (like "write()"
> having looked up the page.
> 
>> However, IIUC, in this case shrink_page_list() might still call
>> filemap_release_folio() and release the buffers, so calling set_page_dirty()
>> afterwards - after the actual TLB invalidation took place - would fail.
> 
> I'm not seeing why.
> 
> That would imply that any "look up page, do set_page_dirty()" is
> broken. They don't have rmap either. And we have a number of them all
> over (eg think "GUP users" etc).

Yes, we do have a bunch of "look up page, do set_page_dirty()" cases.
And I think that many (most?) of them are in fact broken!

Because: the dirtiness of a page is something that the filesystem
believes that it is managing, and so filesystem coordination is, in
general, required in order to mark a page as dirty.

Jan Kara's 2018 analysis [1] (which launched the pin_user_pages()
effort) shows a nice clear example. And since then, I've come to believe
that most of the gup/pup call sites have it wrong:

    a) pin_user_pages()
    b) /* access page contents */
    c) set_page_dirty() or set_page_dirty_lock() // PROBLEM HERE
    d) unpin_user_page()
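
Spelled out as code, the pattern looks roughly like this (an
illustrative sketch only - the surrounding helper is made up, but
pin_user_pages_fast(), set_page_dirty_lock() and unpin_user_page() are
the real APIs):

#include <linux/errno.h>
#include <linux/mm.h>

static int example_pin_write_dirty(unsigned long addr)
{
	struct page *page;
	int ret;

	/* step (a): long-term pin of a single writable page */
	ret = pin_user_pages_fast(addr, 1, FOLL_WRITE | FOLL_LONGTERM, &page);
	if (ret != 1)
		return ret < 0 ? ret : -EFAULT;

	/* step (b): CPU or device DMA writes into the page contents */

	set_page_dirty_lock(page);	/* step (c): no filesystem coordination */
	unpin_user_page(page);		/* step (d) */
	return 0;
}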

ext4 has since papered over the problem, by soldiering on if it finds a
page without writeback buffers when it expected to be able to write back
a dirty page. But you get the idea.

And I think that applies beyond the gup/pup situation.


[1] https://lore.kernel.org/linux-mm/20180103100430.GE4911@quack2.suse.cz/T/#u


thanks,
-- 
John Hubbard
NVIDIA


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 01/13] mm: Update ptep_get_lockless()s comment
  2022-10-29 19:39                                   ` John Hubbard
@ 2022-10-29 20:15                                     ` Linus Torvalds
  2022-10-29 20:30                                       ` Linus Torvalds
                                                         ` (2 more replies)
  0 siblings, 3 replies; 148+ messages in thread
From: Linus Torvalds @ 2022-10-29 20:15 UTC (permalink / raw)
  To: John Hubbard
  Cc: Nadav Amit, Peter Zijlstra, Jann Horn, X86 ML, Matthew Wilcox,
	Andrew Morton, kernel list, Linux-MM, Andrea Arcangeli,
	Kirill A . Shutemov, jroedel, ubizjak, Alistair Popple

On Sat, Oct 29, 2022 at 12:39 PM John Hubbard <jhubbard@nvidia.com> wrote:
>
> ext4 has since papered over the problem, by soldiering on if it finds a
> page without writeback buffers when it expected to be able to writeback
> a dirty page. But you get the idea.

I suspect that "soldiering on" is the right thing to do, but yes, our
'mkdirty' vs 'mkclean' thing has always been problematic.

I think we always needed a page lock for it, but PG_lock itself
doesn't work (as mentioned earlier) because the VM can't serialize
with IO, and needs the lock to basically be a spinlock.

The page table lock kind of took its place, and then the rmap removal
makes for problems (since it is what basically ends up being the
shared place to look it up).

I can think of three options:

 (a) filesystems just deal with it

 (b) we could move the "page_remove_rmap()" into the "flush-and-free" path too

 (c) we could actually add a spinlock (hashed on the page?) for this

I think (a) is basically our current expectation.

And (b) would be fairly easy - same model as that dirty bit patch,
just a 'do page_remove_rmap too' - except page_remove_rmap() wants the
vma as well (and we delay the TLB flush over multiple vma's, so it's
not just a "save vma in mmu_gather").

Doing (c) doesn't look hard, except for the "new lock" thing, which is
always a potential huge disaster. If it's only across set_page_dirty()
and page_mkclean(), though, and uses some simple page-based hash, it
sounds fairly benign.
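
As a very rough sketch of what (c) could look like - made-up names,
purely to illustrate the shape of it, not something proposed in this
thread - a small static table of locks hashed on the page pointer:

#include <linux/hash.h>
#include <linux/init.h>
#include <linux/mm_types.h>
#include <linux/spinlock.h>

#define PAGE_DIRTY_LOCK_BITS	8
static spinlock_t page_dirty_locks[1 << PAGE_DIRTY_LOCK_BITS];

/* Both set_page_dirty() and folio_mkclean() would take this lock. */
static spinlock_t *page_dirty_lock(struct page *page)
{
	return &page_dirty_locks[hash_ptr(page, PAGE_DIRTY_LOCK_BITS)];
}

static int __init page_dirty_locks_init(void)
{
	int i;

	for (i = 0; i < (1 << PAGE_DIRTY_LOCK_BITS); i++)
		spin_lock_init(&page_dirty_locks[i]);
	return 0;
}
early_initcall(page_dirty_locks_init);

That would give the dirty/clean transition a serialization point that
does not depend on rmap or on holding the page table lock.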

                  Linus

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 01/13] mm: Update ptep_get_lockless()s comment
  2022-10-29 20:15                                     ` Linus Torvalds
@ 2022-10-29 20:30                                       ` Linus Torvalds
  2022-10-29 20:42                                         ` John Hubbard
  2022-10-29 20:56                                       ` Nadav Amit
  2022-10-29 20:59                                       ` Theodore Ts'o
  2 siblings, 1 reply; 148+ messages in thread
From: Linus Torvalds @ 2022-10-29 20:30 UTC (permalink / raw)
  To: John Hubbard
  Cc: Nadav Amit, Peter Zijlstra, Jann Horn, X86 ML, Matthew Wilcox,
	Andrew Morton, kernel list, Linux-MM, Andrea Arcangeli,
	Kirill A . Shutemov, jroedel, ubizjak, Alistair Popple

On Sat, Oct 29, 2022 at 1:15 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> I can think of three options:
>
>  (a) filesystems just deal with it
>
>  (b) we could move the "page_remove_rmap()" into the "flush-and-free" path too
>
>  (c) we could actually add a spinlock (hashed on the page?) for this
>
> I think (a) is basically our current expectation.

Side note: anybody doing gup + set_page_dirty() won't be fixed by (b)/(c)
anyway, so I think (a) is basically the only thing.

And that's true even if you do a page pinning gup, since the source of
the gup may be actively unmapped after the gup.

So a filesystem that thinks that only write, or a rmap-accessible mmap
can turn the page dirty really seems to be fundamentally broken.

And I think that has always been the case, it's just that filesystem
writers may not have been happy with it, and may not have had
test-cases for it.

It's not surprising that the filesystem people then try to blame users.

          Linus

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 01/13] mm: Update ptep_get_lockless()s comment
  2022-10-29 20:30                                       ` Linus Torvalds
@ 2022-10-29 20:42                                         ` John Hubbard
  0 siblings, 0 replies; 148+ messages in thread
From: John Hubbard @ 2022-10-29 20:42 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Nadav Amit, Peter Zijlstra, Jann Horn, X86 ML, Matthew Wilcox,
	Andrew Morton, kernel list, Linux-MM, Andrea Arcangeli,
	Kirill A . Shutemov, jroedel, ubizjak, Alistair Popple

On 10/29/22 13:30, Linus Torvalds wrote:
>> I can think of three options:
>>
>>  (a) filesystems just deal with it
>>
>>  (b) we could move the "page_remove_rmap()" into the "flush-and-free" path too
>>
>>  (c) we could actually add a spinlock (hashed on the page?) for this
>>
>> I think (a) is basically our current expectation.
> 
> Side note: anybody doing gup + set_page_dirty() won't be fixed by (b)/(c)
> anyway, so I think (a) is basically the only thing.
> 
> And that's true even if you do a page pinning gup, since the source of
> the gup may be actively unmapped after the gup.

I was just now writing a response that favored (c) over (b), precisely
because of that, yes. :)

> 
> So a filesystem that thinks that only write, or a rmap-accessible mmap
> can turn the page dirty really seems to be fundamentally broken.
> 
> And I think that has always been the case, it's just that filesystem
> writers may not have been happy with it, and may not have had
> test-cases for it.
> 
> It's not surprising that the filesystem people then try to blame users.
> 
>           Linus

Yes, lots of unhappy debates about this over the years.

However, I remain intrigued by (c), because if we had a "dirty page lock"
that is looked up by page (much like looking up the ptl), it seems like
a building block that would potentially help solve the whole thing.

The above points about "file system needs to coordinate with mm about
what's allowed to be dirtied, including gup/dma cases", those are still
true and not yet solved, yes. But having a solid point of synchronization
for this, definitely looks interesting.

Of course, without working through this more thoroughly, it's not fair
to impose this constraint on the current discussion, understood. :)

thanks,
-- 
John Hubbard
NVIDIA


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 01/13] mm: Update ptep_get_lockless()s comment
  2022-10-29 20:15                                     ` Linus Torvalds
  2022-10-29 20:30                                       ` Linus Torvalds
@ 2022-10-29 20:56                                       ` Nadav Amit
  2022-10-29 21:03                                         ` Nadav Amit
  2022-10-29 21:12                                         ` Linus Torvalds
  2022-10-29 20:59                                       ` Theodore Ts'o
  2 siblings, 2 replies; 148+ messages in thread
From: Nadav Amit @ 2022-10-29 20:56 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: John Hubbard, Peter Zijlstra, Jann Horn, X86 ML, Matthew Wilcox,
	Andrew Morton, kernel list, Linux-MM, Andrea Arcangeli,
	Kirill A . Shutemov, jroedel, ubizjak, Alistair Popple

On Oct 29, 2022, at 1:15 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote:

> (b) we could move the "page_remove_rmap()" into the "flush-and-free" path too
> 
> And (b) would be fairly easy - same model as that dirty bit patch,
> just a 'do page_remove_rmap too' - except page_remove_rmap() wants the
> vma as well (and we delay the TLB flush over multiple vma's, so it's
> not just a "save vma in mmu_gather”).

(b) sounds reasonable and may potentially allow future performance
improvements (batching, doing stuff without locks).

It does appear to break a potential hidden assumption that rmap is removed
while the ptl is acquired (at least in the several instances I sampled).
Yet, anyhow page_vma_mapped_walk() checks the PTE before calling the
function, so it should be fine.

I’ll give it a try.


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 01/13] mm: Update ptep_get_lockless()s comment
  2022-10-29 20:15                                     ` Linus Torvalds
  2022-10-29 20:30                                       ` Linus Torvalds
  2022-10-29 20:56                                       ` Nadav Amit
@ 2022-10-29 20:59                                       ` Theodore Ts'o
  2 siblings, 0 replies; 148+ messages in thread
From: Theodore Ts'o @ 2022-10-29 20:59 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: John Hubbard, Nadav Amit, Peter Zijlstra, Jann Horn, X86 ML,
	Matthew Wilcox, Andrew Morton, kernel list, Linux-MM,
	Andrea Arcangeli, Kirill A . Shutemov, jroedel, ubizjak,
	Alistair Popple

On Sat, Oct 29, 2022 at 01:15:26PM -0700, Linus Torvalds wrote:
> On Sat, Oct 29, 2022 at 12:39 PM John Hubbard <jhubbard@nvidia.com> wrote:
> >
> > ext4 has since papered over the problem, by soldiering on if it finds a
> > page without writeback buffers when it expected to be able to writeback
> > a dirty page. But you get the idea.
> 
> I suspect that "soldiering on" is the right thing to do, but yes, our
> 'mkdirty' vs 'mkclean' thing has always been problematic.
>
> ...
>
>  (a) filesystems just deal with it

It should be noted that "soldiering on" just means that the kernel
will not crash or BUG.  It may mean that the dirty page will not
get written back (since at the time when it is discovered we are in
a context where we may not allocate memory or block, which is needed
to allocate blocks if the file system uses delayed allocation).

Furthermore, since the file system does not know that one or more
pages have been dirtied behind its back, if the file system is almost
full, some writes may silently fail --- including writes where the
userspace application was implicitly promised that the write would
succeed by having the write(2) system call return without errors.

If people are OK with that, it's fine.  Just don't complain to the
file system maintainers.  :-)

						- Ted

P.S.  The reason why this isn't an utter disaster is because normally
users of RDMA use preallocated and pre-written/initialized
files.  And there aren't _that_ many other users of gup.  So long as
this remains the case, we might be happy to let sleeping canines lie.
Just please dear $DEITY, let's not have any additional users of gup
until we have a better solution.

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 01/13] mm: Update ptep_get_lockless()s comment
  2022-10-29 20:56                                       ` Nadav Amit
@ 2022-10-29 21:03                                         ` Nadav Amit
  2022-10-29 21:12                                         ` Linus Torvalds
  1 sibling, 0 replies; 148+ messages in thread
From: Nadav Amit @ 2022-10-29 21:03 UTC (permalink / raw)
  To: Linus Torvalds, John Hubbard
  Cc: Peter Zijlstra, Jann Horn, X86 ML, Matthew Wilcox, Andrew Morton,
	kernel list, Linux-MM, Andrea Arcangeli, Kirill A . Shutemov,
	jroedel, ubizjak, Alistair Popple

On Oct 29, 2022, at 1:56 PM, Nadav Amit <nadav.amit@gmail.com> wrote:

> On Oct 29, 2022, at 1:15 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> 
>> (b) we could move the "page_remove_rmap()" into the "flush-and-free" path too
>> 
>> And (b) would be fairly easy - same model as that dirty bit patch,
>> just a 'do page_remove_rmap too' - except page_remove_rmap() wants the
>> vma as well (and we delay the TLB flush over multiple vma's, so it's
>> not just a "save vma in mmu_gather”).
> 
> (b) sounds reasonable and may potentially allow future performance
> improvements (batching, doing stuff without locks).
> 
> It does appear to break a potential hidden assumption that rmap is removed
> while the ptl is acquired (at least in the several instances I samples).
> Yet, anyhow page_vma_mapped_walk() checks the PTE before calling the
> function, so it should be fine.
> 
> I’ll give it a try.

I have just seen John’s and your emails. It seems (b) fell off. (a) is out
of my “zone”, and anyhow assuming it would not be solved soon, deferring
page_remove_rmap() might cause regressions.

(c) might be more intrusive and potentially induce overheads. If we need a
small backportable solution, I think the approach that I proposed (marking
the page dirty after the invalidation, before the PTL is released) is the
simplest one.

Please advise how to proceed.


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 01/13] mm: Update ptep_get_lockless()s comment
  2022-10-29 20:56                                       ` Nadav Amit
  2022-10-29 21:03                                         ` Nadav Amit
@ 2022-10-29 21:12                                         ` Linus Torvalds
  1 sibling, 0 replies; 148+ messages in thread
From: Linus Torvalds @ 2022-10-29 21:12 UTC (permalink / raw)
  To: Nadav Amit
  Cc: John Hubbard, Peter Zijlstra, Jann Horn, X86 ML, Matthew Wilcox,
	Andrew Morton, kernel list, Linux-MM, Andrea Arcangeli,
	Kirill A . Shutemov, jroedel, ubizjak, Alistair Popple

On Sat, Oct 29, 2022 at 1:56 PM Nadav Amit <nadav.amit@gmail.com> wrote:
>
> On Oct 29, 2022, at 1:15 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> > (b) we could move the "page_remove_rmap()" into the "flush-and-free" path too
> >
> > And (b) would be fairly easy - same model as that dirty bit patch,
> > just a 'do page_remove_rmap too' - except page_remove_rmap() wants the
> > vma as well (and we delay the TLB flush over multiple vma's, so it's
> > not just a "save vma in mmu_gather”).
>
> (b) sounds reasonable and may potentially allow future performance
> improvements (batching, doing stuff without locks).

So the thing is, I think (b) makes sense from a TLB flush standpoint,
but as mentioned, if filesystems _really_ want to feel in control,
none of these locks fundamentally help, because the whole "gup +
set_page_dirty()" situation still exists.

And that fundamentally does not hold any locks between the lookup and
the set_page_dirty(), and fundamentally doesn't stop an rmap from
happening in between, so while the zap_page_range() and dirty TLB case
can be dealt with that way, you don't actually fix any fundamental
issues.

Now, neither does my (c) case on its own, but as John Hubbard alluded
to, *if* we had some sane serialization for a struct page, maybe that
could then be used to at least avoid the issue with "rmap no longer
exists", and make the filesystem handling case a bit better.

IOW, I really do think that not only is (a) the current solution, it's
the *correct* solution.

But it is possible that (c) could then be used as a way to make (a)
more palatable to filesystems, in that at least then there would be
some way to serialize with "set_page_dirty()".

But (b) cannot be used for that - because GUP fundamentally breaks
that rmap association.

It's all nasty.

                Linus

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 01/13] mm: Update ptep_get_lockless()s comment
  2022-10-29 19:14                                     ` Linus Torvalds
  2022-10-29 19:28                                       ` Nadav Amit
@ 2022-10-30  0:18                                       ` Nadav Amit
  1 sibling, 0 replies; 148+ messages in thread
From: Nadav Amit @ 2022-10-30  0:18 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, Jann Horn, John Hubbard, X86 ML, Matthew Wilcox,
	Andrew Morton, kernel list, Linux-MM, Andrea Arcangeli,
	Kirill A . Shutemov, jroedel, ubizjak, Alistair Popple

On Oct 29, 2022, at 12:14 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote:

> It didn't fail for me before, it doesn't fail for me with those patches.

For the record, I tried to run the PoC on another machine, and it indeed did
not fail.

Turns out I had a small bug in one of the mechanisms that were intended to
make the failure more likely (I should have mapped again or madvised
HPAGE_SIZE to increase the time zap_pte_range() spends, and with it the
probability of the race).

I am still trying to figure out how to address this issue, and whether the
rmap_walk() callers that do not use PVMW_SYNC are an issue.

---

#define _GNU_SOURCE
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <time.h>

#define handle_error(msg) \
   do { perror(msg); exit(EXIT_FAILURE); } while (0)

void *p;
volatile bool stop = false;
pid_t flusher_pid;
int fd;

#define PAGE_SIZE	(4096ul)
#define PAGES_PER_PMD	(512)
#define HPAGE_SIZE	(PAGE_SIZE * PAGES_PER_PMD)

// Comment MUNMAP_TEST for MADV_DONTNEED test
#define MUNMAP_TEST

void *dirtying_thread(void *arg)
{
	int i;

	while (!stop) {
		for (i = 1; i < PAGES_PER_PMD; i++) {
			*(volatile char *)(p + (i * PAGE_SIZE) + 64) = 5;
		}
	}
	return NULL;
}

void *checking_thread(void *arg)
{
	volatile unsigned long *ul_p = (volatile unsigned long*)p;
	unsigned long cnt = 0;

	while (!stop) {
		*ul_p = cnt;
		if (*ul_p != cnt) {
			printf("FAILED: expected %ld, got %ld\n", cnt, *ul_p);
			kill(flusher_pid, SIGTERM);
			exit(0);
		}
		cnt++;
	}
	return NULL;
}

void *remap_thread(void *arg)
{
	void *ptr;
	struct timespec t = {
		.tv_nsec = 10000,
	};

	while (!stop) {
#ifdef MUNMAP_TEST
		ptr = mmap(p, HPAGE_SIZE, PROT_READ|PROT_WRITE,
			   MAP_SHARED|MAP_FIXED|MAP_POPULATE, fd, 0);
		if (ptr == MAP_FAILED)
			handle_error("remap_thread");
#else
		if (madvise(p, HPAGE_SIZE, MADV_DONTNEED) < 0)
			handle_error("MADV_DONTNEED");
		nanosleep(&t, NULL);
#endif
	}
	return NULL;
}

void flushing_process(void)
{
	// Remove the pages to speed up rmap_walk and allow the caches to be dropped.
	if (madvise(p, HPAGE_SIZE, MADV_DONTNEED) < 0)
		handle_error("MADV_DONTNEED");

	while (true) {
		if (msync(p, PAGE_SIZE, MS_SYNC))
			handle_error("msync");
		if (posix_fadvise(fd, 0, PAGE_SIZE, POSIX_FADV_DONTNEED))
			handle_error("posix_fadvise");
	}
}

int main(int argc, char *argv[])
{
	void *(*thread_funcs[])(void*) = {
		&dirtying_thread,
		&checking_thread,
		&remap_thread,
	};
	int r, i;
	int rc1, rc2;
	unsigned long addr;
	void *ptr;
	char *page = malloc(PAGE_SIZE);
	int n_threads = sizeof(thread_funcs) / sizeof(*thread_funcs);
	pthread_t *threads = malloc(sizeof(pthread_t) * n_threads);
	pid_t pid;

	if (argc < 2) {
		fprintf(stderr, "usage: %s [filename]\n", argv[0]);
		exit(EXIT_FAILURE);
	}

	fd = open(argv[1], O_RDWR|O_CREAT, 0666);
	if (fd == -1)
		handle_error("open fd");

	for (i = 0; i < PAGES_PER_PMD; i++) {
		if (write(fd, page, PAGE_SIZE) != PAGE_SIZE)
			handle_error("write");
	}
	free(page);

	ptr = mmap(NULL, HPAGE_SIZE * 2, PROT_NONE, MAP_PRIVATE|MAP_ANON,
                   -1, 0);

	if (ptr == MAP_FAILED)
		handle_error("mmap anon");

	addr = (unsigned long)(ptr + HPAGE_SIZE - 1) & ~(HPAGE_SIZE - 1);
	printf("starting...\n");

	ptr = mmap((void *)addr, HPAGE_SIZE, PROT_READ|PROT_WRITE,
		   MAP_SHARED|MAP_FIXED|MAP_POPULATE, fd, 0);

	if (ptr == MAP_FAILED)
		handle_error("mmap file - start");

	p = ptr;

	for (i = 0; i < n_threads; i++) {
		r = pthread_create(&threads[i], NULL, thread_funcs[i], NULL);
		if (r)
			handle_error("pthread_create");
	}

	// Run the flushing process in a different process, so msync() would
	// not require mmap_lock.
	pid = fork();
	if (pid == 0)
		flushing_process();
	flusher_pid = pid;

	sleep(60);

	stop = true;
	for (i = 0; i < n_threads; i++)
		pthread_join(threads[i], NULL);
	kill(flusher_pid, SIGTERM);
	printf("Finished without an error\n");

	exit(0);
}

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 01/13] mm: Update ptep_get_lockless()s comment
  2022-10-29 18:58                                   ` Linus Torvalds
  2022-10-29 19:14                                     ` Linus Torvalds
@ 2022-10-30  2:17                                     ` Nadav Amit
  2022-10-30 18:19                                       ` Linus Torvalds
  1 sibling, 1 reply; 148+ messages in thread
From: Nadav Amit @ 2022-10-30  2:17 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, Jann Horn, John Hubbard, X86 ML, Matthew Wilcox,
	Andrew Morton, kernel list, Linux-MM, Andrea Arcangeli,
	Kirill A . Shutemov, jroedel, ubizjak, Alistair Popple

On Oct 29, 2022, at 11:58 AM, Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Sat, Oct 29, 2022 at 11:36 AM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>> Anyway, I think the best documentation for "this is what I meant" is
>> simply the patch. Does this affect your PoC on your setup?
> 
> Here's a slightly cleaned up set with preliminary commit messages, and
> an explanation for why some of the 'struct page' declarations were
> moved around a bit in case you wondered about that part of the change
> in the full patch.
> 
> The end result should be the same, so if you already looked at the
> previous unified patch, never mind. But this one tries to make for a
> better patch series.
> 
> Still not tested in any way, shape, or form. I decided I wanted to
> send this one before booting into this and possibly blowing up ;^)

Running the PoC on Linux 6.0.6 with these patches caused the following splat
on the following line:

	WARN_ON_ONCE(!folio_test_locked(folio) && !folio_test_dirty(folio));

Although I did not hit the warning on the next line (!folio_buffers(folio)),
the commit log for the warning that actually triggered also leads to the
same patch by Jan Kara that is intended to check if a page is dirtied
without buffers (the scenario we are concerned about).


  Author: Jan Kara <jack@suse.cz>
  Date:   Thu Dec 1 11:46:40 2016 -0500

    ext4: warn when page is dirtied without buffers
    
    Warn when a page is dirtied without buffers (as that will likely lead to
    a crash in ext4_writepages()) or when it gets newly dirtied without the
    page being locked (as there is nothing that prevents buffers to get
    stripped just before calling set_page_dirty() under memory pressure). 



[  908.444806] ------------[ cut here ]------------
[  908.451010] WARNING: CPU: 16 PID: 2113 at fs/ext4/inode.c:3634 ext4_dirty_folio+0x74/0x80
[  908.460343] Modules linked in:
[  908.463856] CPU: 16 PID: 2113 Comm: poc Not tainted 6.0.6+ #21
[  908.470521] Hardware name: Dell Inc. PowerEdge R630/0CNCJW, BIOS 2.13.0 05/14/2021
[  908.479202] RIP: 0010:ext4_dirty_folio+0x74/0x80
[  908.484489] Code: d5 ee ff 41 5c 41 5d 5d c3 cc cc cc cc be 08 00 00 00 4c 89 e7 e8 bc 03 e0 ff 4c 89 e7 e8 f4 f8 df ff 49 8b 04 24 a8 08 75 bc <0f> 0b eb b8 0f 0b eb c6 0f 1f 40 00 0f 1f 44 00 00 55 48 89 e5 41
[  908.505851] RSP: 0018:ffff88a1197df9a8 EFLAGS: 00010246
[  908.511826] RAX: 0057ffffc0002014 RBX: ffffffff83414b60 RCX: ffffffff818ceafc
[  908.519964] RDX: dffffc0000000000 RSI: 0000000000000008 RDI: ffffea00fffd9f40
[  908.528103] RBP: ffff88a1197df9b8 R08: 0000000000000001 R09: fffff9401fffb3e9
[  908.536239] R10: ffffea00fffd9f47 R11: fffff9401fffb3e8 R12: ffffea00fffd9f40
[  908.544376] R13: ffff88a087d368d8 R14: ffff88a1197dfb08 R15: ffff88a1197dfb00
[  908.552509] FS:  00007ff7caa68700(0000) GS:ffff8897edc00000(0000) knlGS:0000000000000000
[  908.561731] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  908.568299] CR2: 00007ff7caa67ed8 CR3: 00000020cc970001 CR4: 00000000001706e0
[  908.576437] Call Trace:
[  908.579252]  <TASK>
[  908.581683]  folio_mark_dirty+0x69/0xa0
[  908.586097]  set_page_dirty+0x2a/0x90
[  908.590301]  tlb_flush_mmu+0xc1/0x320
[  908.594517]  tlb_finish_mmu+0x49/0x190
[  908.598822]  unmap_region+0x1fa/0x250
[  908.603029]  ? anon_vma_compatible+0x120/0x120
[  908.608110]  ? __kasan_check_read+0x11/0x20
[  908.612926]  ? __vma_rb_erase+0x38a/0x610
[  908.617547]  __do_munmap+0x313/0x770
[  908.621669]  mmap_region+0x227/0xa50
[  908.625774]  ? down_read+0x320/0x320
[  908.629874]  ? lock_acquire+0x19a/0x450
[  908.634285]  ? __x64_sys_brk+0x4e0/0x4e0
[  908.641552]  ? thp_get_unmapped_area+0xca/0x150
[  908.649404]  ? cap_mmap_addr+0x1d/0x90
[  908.656373]  ? security_mmap_addr+0x3c/0x50
[  908.663781]  ? get_unmapped_area+0x173/0x1f0
[  908.671248]  ? arch_get_unmapped_area+0x330/0x330
[  908.679231]  do_mmap+0x3c3/0x610
[  908.685519]  vm_mmap_pgoff+0x177/0x230
[  908.692303]  ? randomize_page+0x70/0x70
[  908.699133]  ksys_mmap_pgoff+0x241/0x2a0
[  908.706011]  __x64_sys_mmap+0x8d/0xb0
[  908.712594]  do_syscall_64+0x3b/0x90
[  908.719090]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
[  908.727201] RIP: 0033:0x7ff7cbf868e6
[  908.733559] Code: 00 00 00 00 f3 0f 1e fa 41 f7 c1 ff 0f 00 00 75 2b 55 48 89 fd 53 89 cb 48 85 ff 74 37 41 89 da 48 89 ef b8 09 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 62 5b 5d c3 0f 1f 80 00 00 00 00 48 8b 05 71
[  908.759522] RSP: 002b:00007ff7caa67ea8 EFLAGS: 00000206 ORIG_RAX: 0000000000000009
[  908.770475] RAX: ffffffffffffffda RBX: 0000000000008011 RCX: 00007ff7cbf868e6
[  908.780919] RDX: 0000000000000003 RSI: 0000000000200000 RDI: 00007ff7cbc00000
[  908.791344] RBP: 00007ff7cbc00000 R08: 0000000000000003 R09: 0000000000000000
[  908.801751] R10: 0000000000008011 R11: 0000000000000206 R12: 00007ffed51cbc4e
[  908.812118] R13: 00007ffed51cbc4f R14: 00007ffed51cbc50 R15: 00007ff7caa67fc0
[  908.822523]  </TASK>
[  908.827213] irq event stamp: 4169
[  908.833101] hardirqs last  enabled at (4183): [<ffffffff8133f028>] __up_console_sem+0x68/0x80
[  908.844884] hardirqs last disabled at (4194): [<ffffffff8133f00d>] __up_console_sem+0x4d/0x80
[  908.856622] softirqs last  enabled at (4154): [<ffffffff83000430>] __do_softirq+0x430/0x5db
[  908.868167] softirqs last disabled at (4149): [<ffffffff8125fd89>] irq_exit_rcu+0xe9/0x120
[  908.879611] ---[ end trace 0000000000000000 ]---


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 01/13] mm: Update ptep_get_lockless()s comment
  2022-10-30  2:17                                     ` Nadav Amit
@ 2022-10-30 18:19                                       ` Linus Torvalds
  2022-10-30 18:51                                         ` Linus Torvalds
  2022-10-30 19:34                                         ` Nadav Amit
  0 siblings, 2 replies; 148+ messages in thread
From: Linus Torvalds @ 2022-10-30 18:19 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Peter Zijlstra, Jann Horn, John Hubbard, X86 ML, Matthew Wilcox,
	Andrew Morton, kernel list, Linux-MM, Andrea Arcangeli,
	Kirill A . Shutemov, jroedel, ubizjak, Alistair Popple

On Sat, Oct 29, 2022 at 7:17 PM Nadav Amit <nadav.amit@gmail.com> wrote:
>
> Running the PoC on Linux 6.0.6 with these patches caused the following splat
> on the following line:
>
>         WARN_ON_ONCE(!folio_test_locked(folio) && !folio_test_dirty(folio));

Yeah, this is a sign of that "folio_mkclean() serializes with
folio_mark_dirty using rmap and the page table lock".

And page_remove_rmap() could *almost* be called later, but it does
have code that also depends on the page table lock, although it looks
like realistically that's just because it "knows" that means that
preemption is disabled, so it uses a non-atomic statistics update.

I say "knows" in quotes, because that's what the comment says, but it
turns out that __mod_node_page_state() has to deal with CONFIG_RT
anyway and does that

        preempt_disable_nested();
        ...
        preempt_enable_nested();

thing.

And then it wants to see the vma, although that's actually only to see
if it's 'mlock'ed, so we could just squirrel that away.

So we *could* move page_remove_rmap() later into the TLB flush region,
but then we would have lost the page table lock anyway, so then
folio_mkclean() can come in regardless.

So that doesn't even help.

End result: we do want to do the set_page_dirty() and the
remove_rmap() under the page table lock, because it's what serializes
folio_mkclean().

And we'd _like_ to do the TLB flush before the remove_rmap(), but we
*really* don't want to do that for every page.

So my current gut feel is that we should just say that if you do
"MADV_DONTNEED or do a munmap() (which includes the "re-mmap() over
the area", while some other thread is still writing to that memory
region, you may lose writes.

IOW, just accept the behavior that Nadav's test-program tries to show,
and say "look, you're doing insane things, we've never given you any
other semantics, it's your problem" to any user program that does
that.

If a user program does MADV_DONTNEED on an area that it is actively
using at the same time in another thread, that sounds really really
bogus. Same goes doubly for 'munmap()' or over-mapping.

              Linus

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 01/13] mm: Update ptep_get_lockless()s comment
  2022-10-30 18:19                                       ` Linus Torvalds
@ 2022-10-30 18:51                                         ` Linus Torvalds
  2022-10-30 22:47                                           ` Linus Torvalds
  2022-10-30 19:34                                         ` Nadav Amit
  1 sibling, 1 reply; 148+ messages in thread
From: Linus Torvalds @ 2022-10-30 18:51 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Peter Zijlstra, Jann Horn, John Hubbard, X86 ML, Matthew Wilcox,
	Andrew Morton, kernel list, Linux-MM, Andrea Arcangeli,
	Kirill A . Shutemov, jroedel, ubizjak, Alistair Popple

On Sun, Oct 30, 2022 at 11:19 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> And we'd _like_ to do the TLB flush before the remove_rmap(), but we
> *really* don't want to do that for every page.

Hmm. I have yet another crazy idea.

We could keep the current placement of the TLB flush, to just before
we drop the page table lock.

And we could do all the things we do in 'page_remove_rmap()' right now
*except* for the mapcount stuff.

And only move the mapcount code to the page freeing stage.

Because all the rmap() walk synchronization really needs is that
'page->_mapcount' is still elevated, and if it is it will serialize
with the page table lock.

And it turns out that 'page_remove_rmap()' already treats the case we
care about differently, and all it does is

        lock_page_memcg(page);

        if (!PageAnon(page)) {
                page_remove_file_rmap(page, compound);
                goto out;
        }
       ...
out:
        unlock_page_memcg(page);

        munlock_vma_page(page, vma, compound);

for that case.

And that 'page_remove_file_rmap()' is literally the code that modifies
the _mapcount.

Annoyingly, this is all complicated by that 'compound' argument, but
that's always false in that zap_page_range() case.

So what we *could* do, is make a new version of page_remove_rmap(),
which is specialized for this case: no 'compound' argument (always
false), and doesn't call 'page_remove_file_rmap()', because we'll do
that for the !PageAnon(page) case later after the TLB flush.

That would keep the existing TLB flush logic, keep the existing 'mark
page dirty' and would just make sure that 'folio_mkclean()' ends up
being serialized with the TLB flush simply because it will take the
page table lock because we delay the '._mapcount' update until
afterwards.

Annoyingly, the organization of 'page_remove_rmap()' is a bit ugly,
and we have several other callers that want the existing logic, so
while the above sounds conceptually simple, I think the patch would be
a bit messy.

                  Linus

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 01/13] mm: Update ptep_get_lockless()s comment
  2022-10-30 18:19                                       ` Linus Torvalds
  2022-10-30 18:51                                         ` Linus Torvalds
@ 2022-10-30 19:34                                         ` Nadav Amit
  1 sibling, 0 replies; 148+ messages in thread
From: Nadav Amit @ 2022-10-30 19:34 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, Jann Horn, John Hubbard, X86 ML, Matthew Wilcox,
	Andrew Morton, kernel list, Linux-MM, Andrea Arcangeli,
	Kirill A . Shutemov, jroedel, ubizjak, Alistair Popple

On Oct 30, 2022, at 11:19 AM, Linus Torvalds <torvalds@linux-foundation.org> wrote:

> And page_remove_rmap() could *almost* be called later, but it does
> have code that also depends on the page table lock, although it looks
> like realistically that's just because it "knows" that means that
> preemption is disabled, so it uses non-atomic statistics update.
> 
> I say "knows" in quotes, because that's what the comment says, but it
> turns out that __mod_node_page_state() has to deal with CONFIG_RT
> anyway and does that
> 
>        preempt_disable_nested();
>        ...
>        preempt_enable_nested();
> 
> thing.
> 
> And then it wants to see the vma, although that's actually only to see
> if it's 'mlock'ed, so we could just squirrel that away.
> 
> So we *could* move page_remove_rmap() later into the TLB flush region,
> but then we would have lost the page table lock anyway, so then
> folio_mkclean() can come in regardless.
> 
> So that doesn't even help.

Well, if you combine it with the per-page-table stale TLB detection
mechanism that I proposed, I think this could work.

Reminder (feel free to skip): you would have per-mm “completed
TLB-generation” in addition to the current one, which would be renamed to
“pending TLB-generation”. Whenever you update the page-tables in a manner
that might require a TLB flush, you would increase the “pending
TLB-generation” and save the pending TLB-generation in the page-table’s
page-struct. All of that is done once under the page-table lock. When you
finish a TLB-flush, you update the “completed TLB-generation”.

Then on page_vma_mkclean_one(), you would check if the page-table’s
TLB-generation is greater than the completed TLB-generation, which would
indicate that TLB entries for PTEs in this table might be stale. In that
case you would just flush the TLB. [ Of course you can instead just flush if
mm_tlb_flush_pending(), but nobody likes this mechanism that has a very
coarse granularity, and therefore can lead to many unnecessary TLB flushes.
]
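
In rough code form the idea would be something like the sketch below
(hypothetical types, fields and helpers, purely to illustrate; nothing
here exists in the tree):

#include <linux/atomic.h>
#include <linux/types.h>

struct mm_tlb_gen {
	atomic64_t pending;	/* bumped under the PTL whenever PTEs change */
	atomic64_t completed;	/* advanced once the matching flush is done */
};

/*
 * Under the page-table lock, when PTEs in a page-table page change:
 * record (in a hypothetical field of that page-table's struct page)
 * the generation that must complete before its TLB entries can be
 * trusted again.
 */
static inline void pt_note_pending_flush(struct mm_tlb_gen *tg, u64 *pt_gen)
{
	*pt_gen = atomic64_inc_return(&tg->pending);
}

/*
 * In page_vma_mkclean_one(): if the page-table's generation has not
 * completed yet, some CPUs may still hold stale writable TLB entries,
 * so flush before trusting the clean/write-protected PTE.
 */
static inline bool pt_tlb_may_be_stale(struct mm_tlb_gen *tg, u64 pt_gen)
{
	return pt_gen > (u64)atomic64_read(&tg->completed);
}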

Indeed, there would potentially be some overhead when updating the mm's
TLB-generation, since its cache-line is already highly contended in extreme
cases. But I think it is worth it to have simple logic that allows reasoning
about correctness.

My intuition is that although you appear to be right that we can just mark
this case as “extreme case nobody cares about”, it might have now or in the
future some other implications that are hard to predict and prevent.


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 01/13] mm: Update ptep_get_lockless()s comment
  2022-10-30 18:51                                         ` Linus Torvalds
@ 2022-10-30 22:47                                           ` Linus Torvalds
  2022-10-31  1:47                                             ` Linus Torvalds
  2022-10-31  9:28                                             ` Peter Zijlstra
  0 siblings, 2 replies; 148+ messages in thread
From: Linus Torvalds @ 2022-10-30 22:47 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Peter Zijlstra, Jann Horn, John Hubbard, X86 ML, Matthew Wilcox,
	Andrew Morton, kernel list, Linux-MM, Andrea Arcangeli,
	Kirill A . Shutemov, jroedel, ubizjak, Alistair Popple

[-- Attachment #1: Type: text/plain, Size: 1987 bytes --]

On Sun, Oct 30, 2022 at 11:51 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> We could keep the current placement of the TLB flush, to just before
> we drop the page table lock.
>
> And we could do all the things we do in 'page_remove_rmap()' right now
> *except* for the mapcount stuff.
>
> And only move the mapcount code to the page freeing stage.

So I actually have a three-commit series to do the rmap
simplification, but let me just post the end result of that series,
because the end result is actually smaller than the individual commits
(I did it as three incremental commits just to make it more obvious to
me how to get to that end result).

The three commits end up being

      mm: introduce simplified versions of 'page_remove_rmap()'
      mm: inline simpler case of page_remove_file_rmap()
      mm: re-unify the simplified page_zap_*_rmap() function

and the end result of them is this attached patch.

I'm *claiming* that the attached patch is semantically identical to
what we do before it, just _hugely_ simplified.

Basically, that new 'page_zap_pte_rmap()' does the same things that
'page_remove_rmap()' did, except it is limited to only last-level PTE
entries (and that munlock_vma_page() has to be called separately).

The simplification comes from 'compound' being false, from it always
being about small pages, and from the atomic mapcount decrement having
been moved outside the memcg lock, since it is independent of it.

Anyway, this simplification patch basically means that the *next* step
could be to just move that 'page_zap_pte_rmap()' after the TLB flush,
and now it's trivial and no longer scary.

I did *not* do that yet, because it still needs that "encoded_page[]"
array - except now it doesn't encode the 'dirty' bit, now it would
encode the 'do a page->_mapcount decrement' bit.

I didn't do that part, because I needed to do the rc3 release, plus I'd
like to have somebody look at this introductory patch first.

              Linus

[-- Attachment #2: patch.diff --]
[-- Type: text/x-patch, Size: 2229 bytes --]

 include/linux/rmap.h |  1 +
 mm/memory.c          |  3 ++-
 mm/rmap.c            | 24 ++++++++++++++++++++++++
 3 files changed, 27 insertions(+), 1 deletion(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index bd3504d11b15..f62af001707c 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -196,6 +196,7 @@ void page_add_new_anon_rmap(struct page *, struct vm_area_struct *,
 		unsigned long address);
 void page_add_file_rmap(struct page *, struct vm_area_struct *,
 		bool compound);
+void page_zap_pte_rmap(struct page *);
 void page_remove_rmap(struct page *, struct vm_area_struct *,
 		bool compound);
 
diff --git a/mm/memory.c b/mm/memory.c
index f88c351aecd4..c893f5ffc5a8 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1452,8 +1452,9 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 				    likely(!(vma->vm_flags & VM_SEQ_READ)))
 					mark_page_accessed(page);
 			}
+			page_zap_pte_rmap(page);
+			munlock_vma_page(page, vma, false);
 			rss[mm_counter(page)]--;
-			page_remove_rmap(page, vma, false);
 			if (unlikely(page_mapcount(page) < 0))
 				print_bad_pte(vma, addr, ptent, page);
 			if (unlikely(__tlb_remove_page(tlb, page))) {
diff --git a/mm/rmap.c b/mm/rmap.c
index 2ec925e5fa6a..28b51a31ebb0 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1412,6 +1412,30 @@ static void page_remove_anon_compound_rmap(struct page *page)
 		__mod_lruvec_page_state(page, NR_ANON_MAPPED, -nr);
 }
 
+/**
+ * page_zap_pte_rmap - take down a pte mapping from a page
+ * @page:	page to remove mapping from
+ *
+ * This is the simplified form of page_remove_rmap(), that only
+ * deals with last-level pages, so 'compound' is always false,
+ * and the caller does 'munlock_vma_page(page, vma, compound)'
+ * separately.
+ *
+ * This allows for a much simpler calling convention and code.
+ *
+ * The caller holds the pte lock.
+ */
+void page_zap_pte_rmap(struct page *page)
+{
+	if (!atomic_add_negative(-1, &page->_mapcount))
+		return;
+
+	lock_page_memcg(page);
+	__dec_lruvec_page_state(page,
+		PageAnon(page) ? NR_ANON_MAPPED : NR_FILE_MAPPED);
+	unlock_page_memcg(page);
+}
+
 /**
  * page_remove_rmap - take down pte mapping from a page
  * @page:	page to remove mapping from

^ permalink raw reply related	[flat|nested] 148+ messages in thread

* Re: [PATCH 01/13] mm: Update ptep_get_lockless()s comment
  2022-10-30 22:47                                           ` Linus Torvalds
@ 2022-10-31  1:47                                             ` Linus Torvalds
  2022-10-31  4:09                                               ` Nadav Amit
                                                                 ` (3 more replies)
  2022-10-31  9:28                                             ` Peter Zijlstra
  1 sibling, 4 replies; 148+ messages in thread
From: Linus Torvalds @ 2022-10-31  1:47 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Peter Zijlstra, Jann Horn, John Hubbard, X86 ML, Matthew Wilcox,
	Andrew Morton, kernel list, Linux-MM, Andrea Arcangeli,
	Kirill A . Shutemov, jroedel, ubizjak, Alistair Popple

[-- Attachment #1: Type: text/plain, Size: 2673 bytes --]

On Sun, Oct 30, 2022 at 3:47 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Anyway, this simplification patch basically means that the *next* step
> could be to just move that 'page_zap_pte_rmap()' after the TLB flush,
> and now it's trivial and no longer scary.

So here's that next patch: it's patch 4/4 in this series.

Patches 1-3 are the patches that I already sent out as one (smaller)
combined patch. I'm including them here as the whole series in case
somebody else wants to follow along with how I did the simplified
version of page_remove_rmap().

So if you already looked at the previous patch and don't have any
questions about that one, you could just apply PATCH 4/4 on top of
that one.

Or you can do the whole series with commit messages and (hopefully)
clearer individual steps to the simplified version of
page_remove_rmap().

I haven't actually tested 4/4 yet, so this is yet another of my "I
think this should work" patches.

The reason I haven't actually tested it is partly because I never
recreated the original problem Nadav reported, and partly because the
meat of patch 4/4 is just the same "encode an extra flag bit in the
low bit of the page pointer" that I _did_ test, just doing the "remove
rmap" instead of "set dirty".

In other words, I *think* this should make Nadav's test-case happy,
and avoid the warning he saw.

If somebody wants a git branch, I guess I can push that too, but it
would be a non-stable branch only for testing.

Also, it's worth noting that zap_pte_range() does this sanity test:

                        if (unlikely(page_mapcount(page) < 0))
                                print_bad_pte(vma, addr, ptent, page);

and that is likely worthless now (because it hasn't actually
decremented the mapcount yet). I didn't remove it, because I wasn't
sure which option was best:

 (a) just remove it entirely

 (b) change the "< 0" to "<= 0"

 (c) move it to clean_and_free_pages_and_swap_cache() that actually
does the page_zap_pte_rmap() now.

so I'm in no way claiming this series is any kind of final word, but I
do mostly like how the code ended up looking.

Now, this whole "remove rmap after flush" would probably make sense
for some of the largepage cases too, and there's room for another bit
in there (or more, if you align 'struct page') and that whole code
could use page size hints etc too. But I suspect that the main case we
really care is for individual small pages, just because that's the
case where I think it would be much worse to do any fancy TLB
tracking. The largepage cases presumably aren't as critical, since
there by definition is fewer of those.

Comments?

                 Linus

[-- Attachment #2: 0001-mm-introduce-simplified-versions-of-page_remove_rmap.patch --]
[-- Type: text/x-patch, Size: 5434 bytes --]

From aeea35b14fa697ab4e5aabc03915d954cdbedaf8 Mon Sep 17 00:00:00 2001
From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Sun, 30 Oct 2022 13:26:07 -0700
Subject: [PATCH 1/4] mm: introduce simplified versions of 'page_remove_rmap()'

The rmap handling is proving a bit problematic, and part of it comes
from the complexities of all the different cases of our implementation
of 'page_remove_rmap()'.

And a large part of that complexity comes from the fact that while we
have multiple different versions of _adding_ an rmap, this 'remove rmap'
function tries to deal with all possible cases.

So we have these specific versions for page_add_anon_rmap(),
page_add_new_anon_rmap() and page_add_file_rmap() which all do slightly
different things, but then 'page_remove_rmap()' has to handle all the
cases.

That's particularly annoying for 'zap_pte_range()', which already knows
which special case it's dealing with.  It already checked for its own
reasons whether it's an anonymous page, and it already knows it's not
the compound page case and passed in an unconditional 'false' argument.

So this introduces the specialized versions of 'page_remove_rmap()' for
the cases that zap_pte_range() wants.  We also make it the job of the
caller to do the munlock_vma_page(), which is really unrelated and is
the only thing that cares about the 'vma'.

This just means that we end up with several simplifications:

 - there's no 'vma' argument any more, because it's not used

 - there's no 'compound' argument any more, because it was always false

 - we can get rid of the tests for 'compound' and 'PageAnon()' since we
   know what they are

and so instead of having that fairly complicated page_remove_rmap()
function, we end up with a couple of specialized functions that are
_much_ simpler.

There is supposed to be no semantic difference from this change,
although this does end up simplifying the code further by moving the
atomic_add_negative() on the PageAnon mapcount to outside the memcg
locking.

That locking protects other data structures (the page state statistics),
and this avoids not only an ugly 'goto', but means that we don't need to
take and release the lock when we're not actually doing anything with
the state statistics.

We also remove the test for PageTransCompound(), since this is only
called for the final pte level from zap_pte_range().

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
 include/linux/rmap.h |  2 ++
 mm/memory.c          |  6 ++++--
 mm/rmap.c            | 42 ++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 48 insertions(+), 2 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index bd3504d11b15..8d29b7c38368 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -196,6 +196,8 @@ void page_add_new_anon_rmap(struct page *, struct vm_area_struct *,
 		unsigned long address);
 void page_add_file_rmap(struct page *, struct vm_area_struct *,
 		bool compound);
+void page_zap_file_rmap(struct page *);
+void page_zap_anon_rmap(struct page *);
 void page_remove_rmap(struct page *, struct vm_area_struct *,
 		bool compound);
 
diff --git a/mm/memory.c b/mm/memory.c
index f88c351aecd4..ba1d08a908a4 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1451,9 +1451,11 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 				if (pte_young(ptent) &&
 				    likely(!(vma->vm_flags & VM_SEQ_READ)))
 					mark_page_accessed(page);
-			}
+				page_zap_file_rmap(page);
+			} else
+				page_zap_anon_rmap(page);
+			munlock_vma_page(page, vma, false);
 			rss[mm_counter(page)]--;
-			page_remove_rmap(page, vma, false);
 			if (unlikely(page_mapcount(page) < 0))
 				print_bad_pte(vma, addr, ptent, page);
 			if (unlikely(__tlb_remove_page(tlb, page))) {
diff --git a/mm/rmap.c b/mm/rmap.c
index 2ec925e5fa6a..71a5365f23f3 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1412,6 +1412,48 @@ static void page_remove_anon_compound_rmap(struct page *page)
 		__mod_lruvec_page_state(page, NR_ANON_MAPPED, -nr);
 }
 
+/**
+ * page_zap_file_rmap - take down non-anon pte mapping from a page
+ * @page:	page to remove mapping from
+ *
+ * This is the simplified form of page_remove_rmap(), with:
+ *  - we've already checked for '!PageAnon(page)'
+ *  - 'compound' is always false
+ *  - the caller does 'munlock_vma_page(page, vma, compound)' separately
+ * which allows for a much simpler calling convention.
+ *
+ * The caller holds the pte lock.
+ */
+void page_zap_file_rmap(struct page *page)
+{
+	lock_page_memcg(page);
+	page_remove_file_rmap(page, false);
+	unlock_page_memcg(page);
+}
+
+/**
+ * page_zap_anon_rmap(page) - take down an anon pte mapping from a page
+ * @page:	page to remove mapping from
+ *
+ * This is the simplified form of page_remove_rmap(), with:
+ *  - we've already checked for 'PageAnon(page)'
+ *  - 'compound' is always false
+ *  - the caller does 'munlock_vma_page(page, vma, compound)' separately
+ * which allows for a much simpler calling convention.
+ *
+ * The caller holds the pte lock.
+ */
+void page_zap_anon_rmap(struct page *page)
+{
+	/* page still mapped by someone else? */
+	if (!atomic_add_negative(-1, &page->_mapcount))
+		return;
+
+	lock_page_memcg(page);
+	__dec_lruvec_page_state(page, NR_ANON_MAPPED);
+	unlock_page_memcg(page);
+}
+
 /**
  * page_remove_rmap - take down pte mapping from a page
  * @page:	page to remove mapping from
-- 
2.37.1.289.g45aa1e5c72.dirty


[-- Attachment #3: 0002-mm-inline-simpler-case-of-page_remove_file_rmap.patch --]
[-- Type: text/x-patch, Size: 1914 bytes --]

From 79c23c212f9e21edb2dbb440dd499d0a49e79bea Mon Sep 17 00:00:00 2001
From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Sun, 30 Oct 2022 13:50:39 -0700
Subject: [PATCH 2/4] mm: inline simpler case of page_remove_file_rmap()

Now that we have a simplified special case of 'page_remove_rmap()' that
doesn't deal with the 'compound' case and always gets a file-mapped (ie
not anonymous) page, it ended up doing just

	lock_page_memcg(page);
	page_remove_file_rmap(page, false);
	unlock_page_memcg(page);

but 'page_remove_file_rmap()' is actually trivial when 'compound' is false.

So just inline that non-compound case in the caller, and - like we did
in the previous commit for the anon pages - only do the memcg locking for
the parts that actually matter: the page statistics.

Also, as the previous commit did for anonymous pages, knowing we only
get called for the last-level page table entries allows for a further
simplification: we can get rid of the 'PageHuge(page)' case too.

You can't map a huge-page in a pte without splitting it (and the full
code in the generic page_remove_file_rmap() function has a comment to
that effect: "hugetlb pages are always mapped with pmds").

That means that the page_zap_file_rmap() case of that whole function is
really small and trivial.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
 mm/rmap.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index 71a5365f23f3..69de6c833d5c 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1426,8 +1426,11 @@ static void page_remove_anon_compound_rmap(struct page *page)
  */
 void page_zap_file_rmap(struct page *page)
 {
+	if (!atomic_add_negative(-1, &page->_mapcount))
+		return;
+
 	lock_page_memcg(page);
-	page_remove_file_rmap(page, false);
+	__dec_lruvec_page_state(page, NR_FILE_MAPPED);
 	unlock_page_memcg(page);
 }
 
-- 
2.37.1.289.g45aa1e5c72.dirty


[-- Attachment #4: 0003-mm-re-unify-the-simplified-page_zap_-_rmap-function.patch --]
[-- Type: text/x-patch, Size: 4013 bytes --]

From 25d9e6a9b37e573390af2e3f6c1db429d8ddb4ad Mon Sep 17 00:00:00 2001
From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Sun, 30 Oct 2022 15:14:43 -0700
Subject: [PATCH 3/4] mm: re-unify the simplified page_zap_*_rmap() function

Now that we've simplified both the anonymous and file-backed page zap
functions, they end up being identical except for which page statistic
they update, and we can re-unify the implementation of that much
simplified code.

To make it very clear that this is only for the final pte zapping (since
a lot of the simplifications depended on that), name the unified
function 'page_zap_pte_rmap()'.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
 include/linux/rmap.h |  3 +--
 mm/memory.c          |  5 ++---
 mm/rmap.c            | 39 +++++++++------------------------------
 3 files changed, 12 insertions(+), 35 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 8d29b7c38368..f62af001707c 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -196,8 +196,7 @@ void page_add_new_anon_rmap(struct page *, struct vm_area_struct *,
 		unsigned long address);
 void page_add_file_rmap(struct page *, struct vm_area_struct *,
 		bool compound);
-void page_zap_file_rmap(struct page *);
-void page_zap_anon_rmap(struct page *);
+void page_zap_pte_rmap(struct page *);
 void page_remove_rmap(struct page *, struct vm_area_struct *,
 		bool compound);
 
diff --git a/mm/memory.c b/mm/memory.c
index ba1d08a908a4..c893f5ffc5a8 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1451,9 +1451,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 				if (pte_young(ptent) &&
 				    likely(!(vma->vm_flags & VM_SEQ_READ)))
 					mark_page_accessed(page);
-				page_zap_file_rmap(page);
-			} else
-				page_zap_anon_rmap(page);
+			}
+			page_zap_pte_rmap(page);
 			munlock_vma_page(page, vma, false);
 			rss[mm_counter(page)]--;
 			if (unlikely(page_mapcount(page) < 0))
diff --git a/mm/rmap.c b/mm/rmap.c
index 69de6c833d5c..28b51a31ebb0 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1413,47 +1413,26 @@ static void page_remove_anon_compound_rmap(struct page *page)
 }
 
 /**
- * page_zap_file_rmap - take down non-anon pte mapping from a page
+ * page_zap_pte_rmap - take down a pte mapping from a page
  * @page:	page to remove mapping from
  *
- * This is the simplified form of page_remove_rmap(), with:
- *  - we've already checked for '!PageAnon(page)'
- *  - 'compound' is always false
- *  - the caller does 'munlock_vma_page(page, vma, compound)' separately
- * which allows for a much simpler calling convention.
+ * This is the simplified form of page_remove_rmap(), that only
+ * deals with last-level pages, so 'compound' is always false,
+ * and the caller does 'munlock_vma_page(page, vma, compound)'
+ * separately.
  *
- * The caller holds the pte lock.
- */
-void page_zap_file_rmap(struct page *page)
-{
-	if (!atomic_add_negative(-1, &page->_mapcount))
-		return;
-
-	lock_page_memcg(page);
-	__dec_lruvec_page_state(page, NR_FILE_MAPPED);
-	unlock_page_memcg(page);
-}
-
-/**
- * page_zap_anon_rmap(page) - take down anon pte mapping from a page
- * @page:	page to remove mapping from
- *
- * This is the simplified form of page_remove_rmap(), with:
- *  - we've already checked for 'PageAnon(page)'
- *  - 'compound' is always false
- *  - the caller does 'munlock_vma_page(page, vma, compound)' separately
- * which allows for a much simpler calling convention.
+ * This allows for a much simpler calling convention and code.
  *
  * The caller holds the pte lock.
  */
-void page_zap_anon_rmap(struct page *page)
+void page_zap_pte_rmap(struct page *page)
 {
-	/* page still mapped by someone else? */
 	if (!atomic_add_negative(-1, &page->_mapcount))
 		return;
 
 	lock_page_memcg(page);
-	__dec_lruvec_page_state(page, NR_ANON_MAPPED);
+	__dec_lruvec_page_state(page,
+		PageAnon(page) ? NR_ANON_MAPPED : NR_FILE_MAPPED);
 	unlock_page_memcg(page);
 }
 
-- 
2.37.1.289.g45aa1e5c72.dirty


[-- Attachment #5: 0004-mm-delay-rmap-removal-until-after-TLB-flush.patch --]
[-- Type: text/x-patch, Size: 8374 bytes --]

From 552d121375f88ba4db55460cd378c9150f994945 Mon Sep 17 00:00:00 2001
From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Sat, 29 Oct 2022 11:45:07 -0700
Subject: [PATCH 4/4] mm: delay rmap removal until after TLB flush

When we remove a page table entry, we are very careful to only free the
page after we have flushed the TLB, because other CPUs could still be
using the page through stale TLB entries until after the flush.

However, we have removed the rmap entry for that page early, which means
that functions like folio_mkclean() would end up not serializing with
the page table lock because the page had already been made invisible to
rmap.

And that is a problem, because while the TLB entry exists, we could end
up with the following situation:

 (a) one CPU could come in and clean it, never seeing our mapping of
     the page

 (b) another CPU could continue to use the stale and dirty TLB entry
     and continue to write to said page

resulting in a page that has been dirtied, but then marked clean again,
all while another CPU might have dirtied it some more.

End result: possibly lost dirty data.

This commit uses the same old TLB gather array that we use to delay the
freeing of the page to also say 'remove from rmap after flush', so that
we can keep the rmap entries alive until all TLB entries have been
flushed.

NOTE! While the "possibly lost dirty data" sounds catastrophic, for this
all to happen you need to have a user thread doing either madvise() with
MADV_DONTNEED or a full re-mmap() of the area concurrently with another
thread continuing to use said mapping.

So arguably this is about user space doing crazy things, but from a VM
consistency standpoint it's better if we track the dirty bit properly
even then.

Reported-by: Nadav Amit <nadav.amit@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
 include/asm-generic/tlb.h | 36 +++++++++++++++++++++++++++++-----
 mm/memory.c               |  3 +--
 mm/mmu_gather.c           | 41 +++++++++++++++++++++++++++++++++++----
 mm/rmap.c                 |  4 +---
 4 files changed, 70 insertions(+), 14 deletions(-)

diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index 492dce43236e..a5c9c9989fd2 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -238,11 +238,37 @@ extern void tlb_remove_table(struct mmu_gather *tlb, void *table);
  */
 #define MMU_GATHER_BUNDLE	8
 
+/*
+ * Fake type for an encoded page with flag bits in the low bits.
+ *
+ * Right now just one bit, but we could have more depending on the
+ * alignment of 'struct page'.
+ */
+struct encoded_page;
+#define TLB_ZAP_RMAP 1ul
+#define ENCODE_PAGE_BITS (TLB_ZAP_RMAP)
+
+static inline struct encoded_page *encode_page(struct page *page, unsigned long flags)
+{
+	flags &= ENCODE_PAGE_BITS;
+	return (struct encoded_page *)(flags | (unsigned long)page);
+}
+
+static inline bool encoded_page_flags(struct encoded_page *page)
+{
+	return ENCODE_PAGE_BITS & (unsigned long)page;
+}
+
+static inline struct page *encoded_page_ptr(struct encoded_page *page)
+{
+	return (struct page *)(~ENCODE_PAGE_BITS & (unsigned long)page);
+}
+
 struct mmu_gather_batch {
 	struct mmu_gather_batch	*next;
 	unsigned int		nr;
 	unsigned int		max;
-	struct page		*pages[];
+	struct encoded_page	*encoded_pages[];
 };
 
 #define MAX_GATHER_BATCH	\
@@ -257,7 +283,7 @@ struct mmu_gather_batch {
 #define MAX_GATHER_BATCH_COUNT	(10000UL/MAX_GATHER_BATCH)
 
 extern bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page,
-				   int page_size);
+				   int page_size, unsigned int flags);
 #endif
 
 /*
@@ -431,13 +457,13 @@ static inline void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb)
 static inline void tlb_remove_page_size(struct mmu_gather *tlb,
 					struct page *page, int page_size)
 {
-	if (__tlb_remove_page_size(tlb, page, page_size))
+	if (__tlb_remove_page_size(tlb, page, page_size, 0))
 		tlb_flush_mmu(tlb);
 }
 
-static inline bool __tlb_remove_page(struct mmu_gather *tlb, struct page *page)
+static inline bool __tlb_remove_page(struct mmu_gather *tlb, struct page *page, unsigned int flags)
 {
-	return __tlb_remove_page_size(tlb, page, PAGE_SIZE);
+	return __tlb_remove_page_size(tlb, page, PAGE_SIZE, flags);
 }
 
 /* tlb_remove_page
diff --git a/mm/memory.c b/mm/memory.c
index c893f5ffc5a8..230946536115 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1452,12 +1452,11 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 				    likely(!(vma->vm_flags & VM_SEQ_READ)))
 					mark_page_accessed(page);
 			}
-			page_zap_pte_rmap(page);
 			munlock_vma_page(page, vma, false);
 			rss[mm_counter(page)]--;
 			if (unlikely(page_mapcount(page) < 0))
 				print_bad_pte(vma, addr, ptent, page);
-			if (unlikely(__tlb_remove_page(tlb, page))) {
+			if (unlikely(__tlb_remove_page(tlb, page, TLB_ZAP_RMAP))) {
 				force_flush = 1;
 				addr += PAGE_SIZE;
 				break;
diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
index add4244e5790..587873e5984c 100644
--- a/mm/mmu_gather.c
+++ b/mm/mmu_gather.c
@@ -9,6 +9,7 @@
 #include <linux/rcupdate.h>
 #include <linux/smp.h>
 #include <linux/swap.h>
+#include <linux/rmap.h>
 
 #include <asm/pgalloc.h>
 #include <asm/tlb.h>
@@ -43,12 +44,43 @@ static bool tlb_next_batch(struct mmu_gather *tlb)
 	return true;
 }
 
+/*
+ * We get an 'encoded page' array, which has page pointers with
+ * encoded flags in the low bits of the array.
+ *
+ * The TLB has been flushed, now we need to react to the flag bits
+ * the 'struct page', clean the array in-place, and then free the
+ * pages and their swap cache.
+ */
+static void clean_and_free_pages_and_swap_cache(struct encoded_page **pages, unsigned int nr)
+{
+	for (unsigned int i = 0; i < nr; i++) {
+		struct encoded_page *encoded = pages[i];
+		unsigned int flags = encoded_page_flags(encoded);
+		if (flags) {
+			/* Clean the flagged pointer in-place */
+			struct page *page = encoded_page_ptr(encoded);
+			pages[i] = encode_page(page, 0);
+
+			/* The flag bit being set means that we should zap the rmap */
+			page_zap_pte_rmap(page);
+		}
+	}
+
+	/*
+	 * Now all entries have been un-encoded, and changed to plain
+	 * page pointers, so we can cast the 'encoded_page' array to
+	 * a plain page array and free them
+	 */
+	free_pages_and_swap_cache((struct page **)pages, nr);
+}
+
 static void tlb_batch_pages_flush(struct mmu_gather *tlb)
 {
 	struct mmu_gather_batch *batch;
 
 	for (batch = &tlb->local; batch && batch->nr; batch = batch->next) {
-		struct page **pages = batch->pages;
+		struct encoded_page **pages = batch->encoded_pages;
 
 		do {
 			/*
@@ -56,7 +88,7 @@ static void tlb_batch_pages_flush(struct mmu_gather *tlb)
 			 */
 			unsigned int nr = min(512U, batch->nr);
 
-			free_pages_and_swap_cache(pages, nr);
+			clean_and_free_pages_and_swap_cache(pages, nr);
 			pages += nr;
 			batch->nr -= nr;
 
@@ -77,11 +109,12 @@ static void tlb_batch_list_free(struct mmu_gather *tlb)
 	tlb->local.next = NULL;
 }
 
-bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page, int page_size)
+bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page, int page_size, unsigned int flags)
 {
 	struct mmu_gather_batch *batch;
 
 	VM_BUG_ON(!tlb->end);
+	VM_BUG_ON(flags & ~ENCODE_PAGE_BITS);
 
 #ifdef CONFIG_MMU_GATHER_PAGE_SIZE
 	VM_WARN_ON(tlb->page_size != page_size);
@@ -92,7 +125,7 @@ bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page, int page_
 	 * Add the page and check if we are full. If so
 	 * force a flush.
 	 */
-	batch->pages[batch->nr++] = page;
+	batch->encoded_pages[batch->nr++] = encode_page(page, flags);
 	if (batch->nr == batch->max) {
 		if (!tlb_next_batch(tlb))
 			return true;
diff --git a/mm/rmap.c b/mm/rmap.c
index 28b51a31ebb0..416b7078b75f 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1422,8 +1422,6 @@ static void page_remove_anon_compound_rmap(struct page *page)
  * separately.
  *
  * This allows for a much simpler calling convention and code.
- *
- * The caller holds the pte lock.
  */
 void page_zap_pte_rmap(struct page *page)
 {
@@ -1431,7 +1429,7 @@ void page_zap_pte_rmap(struct page *page)
 		return;
 
 	lock_page_memcg(page);
-	__dec_lruvec_page_state(page,
+	dec_lruvec_page_state(page,
 		PageAnon(page) ? NR_ANON_MAPPED : NR_FILE_MAPPED);
 	unlock_page_memcg(page);
 }
-- 
2.37.1.289.g45aa1e5c72.dirty


^ permalink raw reply related	[flat|nested] 148+ messages in thread

* Re: [PATCH 01/13] mm: Update ptep_get_lockless()s comment
  2022-10-31  1:47                                             ` Linus Torvalds
@ 2022-10-31  4:09                                               ` Nadav Amit
  2022-10-31  4:55                                                 ` Nadav Amit
  2022-10-31  5:00                                                 ` Linus Torvalds
  2022-10-31  9:36                                               ` Peter Zijlstra
                                                                 ` (2 subsequent siblings)
  3 siblings, 2 replies; 148+ messages in thread
From: Nadav Amit @ 2022-10-31  4:09 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, Jann Horn, John Hubbard, X86 ML, Matthew Wilcox,
	Andrew Morton, kernel list, Linux-MM, Andrea Arcangeli,
	Kirill A . Shutemov, jroedel, ubizjak, Alistair Popple

On Oct 30, 2022, at 6:47 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote:

> The reason I haven't actually tested it is partly because I never
> recreated the original problem Nadav reported, and partly because the
> meat of patch 4/4 is just the same "encode an extra flag bit in the
> low bit of the page pointer" that I _did_ test, just doing the "remove
> rmap" instead of "set dirty".
> 
> In other words, I *think* this should make Nadav's test-case happy,
> and avoid the warning he saw.

I am sorry for not managing to make it reproducible on your system. The fact
that you did not get the warning that I got means that it is not a
hardware-TLB differences issue (at least not only that), but the race does
not happen on your system (assuming you used ext4 on the BRD).
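
For anyone without the test at hand, the racy usage pattern being
discussed boils down to something like the sketch below (purely
illustrative, not the actual reproducer; the file name, size and lack
of error handling are made up for brevity):

        #include <fcntl.h>
        #include <pthread.h>
        #include <sys/mman.h>
        #include <unistd.h>

        #define LEN (1UL << 20)

        static volatile char *map;

        /* One thread keeps dirtying the shared file mapping... */
        static void *writer(void *arg)
        {
                (void)arg;
                for (;;)
                        map[0] = 1;
                return NULL;
        }

        int main(void)
        {
                pthread_t t;
                int fd = open("testfile", O_RDWR | O_CREAT, 0600);

                ftruncate(fd, LEN);
                map = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
                           MAP_SHARED, fd, 0);
                pthread_create(&t, NULL, writer, NULL);

                /* ...while another keeps zapping the range, so a write
                   through a stale TLB entry can race with the teardown. */
                for (;;)
                        madvise((void *)map, LEN, MADV_DONTNEED);
        }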

Anyhow, I ran the tests with the patches and there are no failures.
Thanks for addressing this issue.

I understand from the code that you decided to drop the deferring of
set_page_dirty(), which could - at least for the munmap case (where
mmap_lock is taken for write) - prevent the need for “force_flush” and
potentially save TLB flushes.

I was just wondering whether the reason for that is that you wanted
to have small backportable and conservative patches, or whether you
changed your mind about it.


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 01/13] mm: Update ptep_get_lockless()s comment
  2022-10-31  4:09                                               ` Nadav Amit
@ 2022-10-31  4:55                                                 ` Nadav Amit
  2022-10-31  5:00                                                 ` Linus Torvalds
  1 sibling, 0 replies; 148+ messages in thread
From: Nadav Amit @ 2022-10-31  4:55 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, Jann Horn, John Hubbard, X86 ML, Matthew Wilcox,
	Andrew Morton, kernel list, Linux-MM, Andrea Arcangeli,
	Kirill A . Shutemov, jroedel, ubizjak, Alistair Popple

On Oct 30, 2022, at 9:09 PM, Nadav Amit <nadav.amit@gmail.com> wrote:

> I understand from the code that you decided to drop the deferring of
> set_page_dirty(), which could - at least for the munmap case (where
> mmap_lock is taken for write) - prevent the need for “force_flush” and
> potentially save TLB flushes.
> 
> I was just wondering whether the reason for that is that you wanted
> to have small backportable and conservative patches, or whether you
> changed your mind about it.

Please ignore this silly question. I understand - the buffers might still be
dropped.

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 01/13] mm: Update ptep_get_lockless()s comment
  2022-10-31  4:09                                               ` Nadav Amit
  2022-10-31  4:55                                                 ` Nadav Amit
@ 2022-10-31  5:00                                                 ` Linus Torvalds
  2022-10-31 15:43                                                   ` Nadav Amit
  1 sibling, 1 reply; 148+ messages in thread
From: Linus Torvalds @ 2022-10-31  5:00 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Peter Zijlstra, Jann Horn, John Hubbard, X86 ML, Matthew Wilcox,
	Andrew Morton, kernel list, Linux-MM, Andrea Arcangeli,
	Kirill A . Shutemov, jroedel, ubizjak, Alistair Popple

On Sun, Oct 30, 2022 at 9:09 PM Nadav Amit <nadav.amit@gmail.com> wrote:
>
> I am sorry for not managing to make it reproducible on your system.

Heh, that's very much *not* your fault. Honestly, I didn't try very
much or very hard.

I felt like I understood the problem cause sufficiently that I didn't
really need to have a reproducer, and I much prefer to just think the
solution through and try to make it really robust.

Or, put another way - I'm just lazy.

> Anyhow, I ran the tests with the patches and there are no failures.

Lovely.

> Thanks for addressing this issue.

Well, I'm not sure the issue is "addressed" yet. I think the patch
series is likely the right thing to do, but others may disagree with
this approach.

And regardless of that, this still leaves some questions open.

 (a) there's the issue of s390, which does its own version of
__tlb_remove_page_size.

I *think* s390 basically does the TLB flush synchronously in
zap_pte_range(), and that it would be for that reason trivial to just
add that 'flags' argument to the s390 __tlb_remove_page_size(), and
make it do

        if (flags & TLB_ZAP_RMAP)
                page_zap_pte_rmap(page);

at the top synchronously too. But some s390 person would need to look at it.

I *think* the issue is literally that straightforward and not a big
deal, but it's probably not even worth bothering the s390 people until
VM people have decided "yes, this makes sense".
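
Concretely, the s390 side would presumably end up looking something
like this (a sketch only, based on the current shape of
arch/s390/include/asm/tlb.h; the actual body may differ and would
still need an s390 ack):

        static inline bool __tlb_remove_page_size(struct mmu_gather *tlb,
                        struct page *page, int page_size, unsigned int flags)
        {
                /*
                 * The TLB has already been flushed synchronously, so the
                 * deferred rmap zap can simply be done right here.
                 */
                if (flags & TLB_ZAP_RMAP)
                        page_zap_pte_rmap(page);
                free_page_and_swap_cache(page);
                return false;
        }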

 (b) the issue I mentioned with the currently useless
"page_mapcount(page) < 0" test with that patch.

Again, this is mostly just janitorial stuff associated with that patch series.

 (c) whether to worry about back-porting

I don't *think* this is worth backporting, but if it causes other
changes, then maybe..

> I understand from the code that you decided to drop the deferring of
> set_page_dirty(), which could - at least for the munmap case (where
> mmap_lock is taken for write) - prevent the need for “force_flush” and
> potentially save TLB flushes.

I really liked my dirty patch, but your warning case really made it
obvious that it was just broken.

The thing is, moving the "set_page_dirty()" to later is really nice,
and really makes a *lot* of sense from a conceptual standpoint: only
after that TLB flush do we really have no more people who can dirty
it.

BUT.

Even if we just used another bit in the array for "dirty", and did
the set_page_dirty() later (but still before getting rid of the rmap),
that wouldn't actually *work*.

Why? Because the race with folio_mkclean() would just come back. Yes,
now we'd have the rmap data, so mkclean would be forced to serialize
with the page table lock.

But if we get rid of the "force_flush" for the dirty bit, that
serialization won't help, simply because we've *dropped* the page
table lock before we actually then do the set_page_dirty() again.

So the mkclean serialization needs *both* the late rmap dropping _and_
the page table lock being kept.

So deferring set_page_dirty() is conceptually the right thing to do
from a pure "just track dirty bit" standpoint, but it doesn't work
with the way we currently expect mkclean to work.

> I was just wondering whether the reason for that is that you wanted
> to have small backportable and conservative patches, or whether you
> changed your mind about it.

See above: I still think it would be the right thing in a perfect world.

But with the current folio_mkclean(), we just can't do it. I had
completely forgotten / repressed that horror-show.

So the current ordering rules are basically that we need to do
set_page_dirty() *and* we need to flush the TLB's before dropping the
page table lock. That's what gets us serialized with "mkclean".

The whole "drop rmap" can then happen at any later time, the only
important thing was that it was kept to at least after the TLB flush.

We could do the rmap drop still inside the page table lock, but
honestly, it just makes more sense to do it as we free the batched pages
anyway.

Am I missing something still?

And again, this is about our horrid serialization between
folio_mkclean and set_page_dirty(). It's related to how GUP +
set_page_dirty() is also fundamentally problematic. So that dirty bit
situation *may* change if the rules for folio_mkclean() change...

                Linus

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 01/13] mm: Update ptep_get_lockless()s comment
  2022-10-30 22:47                                           ` Linus Torvalds
  2022-10-31  1:47                                             ` Linus Torvalds
@ 2022-10-31  9:28                                             ` Peter Zijlstra
  2022-10-31 17:19                                               ` Linus Torvalds
  1 sibling, 1 reply; 148+ messages in thread
From: Peter Zijlstra @ 2022-10-31  9:28 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Nadav Amit, Jann Horn, John Hubbard, X86 ML, Matthew Wilcox,
	Andrew Morton, kernel list, Linux-MM, Andrea Arcangeli,
	Kirill A . Shutemov, jroedel, ubizjak, Alistair Popple

On Sun, Oct 30, 2022 at 03:47:36PM -0700, Linus Torvalds wrote:

>  include/linux/rmap.h |  1 +
>  mm/memory.c          |  3 ++-
>  mm/rmap.c            | 24 ++++++++++++++++++++++++
>  3 files changed, 27 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> index bd3504d11b15..f62af001707c 100644
> --- a/include/linux/rmap.h
> +++ b/include/linux/rmap.h
> @@ -196,6 +196,7 @@ void page_add_new_anon_rmap(struct page *, struct vm_area_struct *,
>  		unsigned long address);
>  void page_add_file_rmap(struct page *, struct vm_area_struct *,
>  		bool compound);
> +void page_zap_pte_rmap(struct page *);
>  void page_remove_rmap(struct page *, struct vm_area_struct *,
>  		bool compound);
>  
> diff --git a/mm/memory.c b/mm/memory.c
> index f88c351aecd4..c893f5ffc5a8 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1452,8 +1452,9 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>  				    likely(!(vma->vm_flags & VM_SEQ_READ)))
>  					mark_page_accessed(page);
>  			}
> +			page_zap_pte_rmap(page);
> +			munlock_vma_page(page, vma, false);
>  			rss[mm_counter(page)]--;
> -			page_remove_rmap(page, vma, false);
>  			if (unlikely(page_mapcount(page) < 0))
>  				print_bad_pte(vma, addr, ptent, page);
>  			if (unlikely(__tlb_remove_page(tlb, page))) {
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 2ec925e5fa6a..28b51a31ebb0 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1412,6 +1412,30 @@ static void page_remove_anon_compound_rmap(struct page *page)
>  		__mod_lruvec_page_state(page, NR_ANON_MAPPED, -nr);
>  }
>  
> +/**
> + * page_zap_pte_rmap - take down a pte mapping from a page
> + * @page:	page to remove mapping from
> + *
> + * This is the simplified form of page_remove_rmap(), that only
> + * deals with last-level pages, so 'compound' is always false,
> + * and the caller does 'munlock_vma_page(page, vma, compound)'
> + * separately.
> + *
> + * This allows for a much simpler calling convention and code.
> + *
> + * The caller holds the pte lock.
> + */
> +void page_zap_pte_rmap(struct page *page)
> +{

One could consider adding something like:

#ifdef USE_SPLIT_PTE_PTLOCKS
	lockdep_assert_held(ptlock_ptr(page))
#endif


> +	if (!atomic_add_negative(-1, &page->_mapcount))
> +		return;
> +
> +	lock_page_memcg(page);
> +	__dec_lruvec_page_state(page,
> +		PageAnon(page) ? NR_ANON_MAPPED : NR_FILE_MAPPED);
> +	unlock_page_memcg(page);
> +}

Took me a little while, but yes, .compound=false seems to reduce to
this.

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 01/13] mm: Update ptep_get_lockless()s comment
  2022-10-31  1:47                                             ` Linus Torvalds
  2022-10-31  4:09                                               ` Nadav Amit
@ 2022-10-31  9:36                                               ` Peter Zijlstra
  2022-10-31 17:28                                                 ` Linus Torvalds
  2022-10-31  9:39                                               ` [PATCH 01/13] mm: Update ptep_get_lockless()s comment Peter Zijlstra
  2022-10-31  9:46                                               ` Peter Zijlstra
  3 siblings, 1 reply; 148+ messages in thread
From: Peter Zijlstra @ 2022-10-31  9:36 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Nadav Amit, Jann Horn, John Hubbard, X86 ML, Matthew Wilcox,
	Andrew Morton, kernel list, Linux-MM, Andrea Arcangeli,
	Kirill A . Shutemov, jroedel, ubizjak, Alistair Popple

On Sun, Oct 30, 2022 at 06:47:23PM -0700, Linus Torvalds wrote:

> Also, it's worth noting that zap_pte_range() does this sanity test:
> 
>                         if (unlikely(page_mapcount(page) < 0))
>                                 print_bad_pte(vma, addr, ptent, page);
> 
> and that is likely worthless now (because it hasn't actually
> decremented the mapcount yet). I didn't remove it, because I wasn't
> sure which option was best:
> 
>  (a) just remove it entirely
> 
>  (b) change the "< 0" to "<= 0"
> 
>  (c) move it to clean_and_free_pages_and_swap_cache() that actually
> does the page_zap_pte_rmap() now.

I'm leaning towards (c); simply because the error case is so terrifying
I feel we should check for it (and I do have vague memories of us
actually hitting something like this in the very distant past).


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 01/13] mm: Update ptep_get_lockless()s comment
  2022-10-31  1:47                                             ` Linus Torvalds
  2022-10-31  4:09                                               ` Nadav Amit
  2022-10-31  9:36                                               ` Peter Zijlstra
@ 2022-10-31  9:39                                               ` Peter Zijlstra
  2022-10-31 17:22                                                 ` Linus Torvalds
  2022-10-31  9:46                                               ` Peter Zijlstra
  3 siblings, 1 reply; 148+ messages in thread
From: Peter Zijlstra @ 2022-10-31  9:39 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Nadav Amit, Jann Horn, John Hubbard, X86 ML, Matthew Wilcox,
	Andrew Morton, kernel list, Linux-MM, Andrea Arcangeli,
	Kirill A . Shutemov, jroedel, ubizjak, Alistair Popple

On Sun, Oct 30, 2022 at 06:47:23PM -0700, Linus Torvalds wrote:

>  - there's no 'compund' argument any more, because it was always false
> 
>  - we can get rid of the tests for 'compund' and 'PageAnon()' since we
>    know that they are

You're making up new words ;-)

  s/compund/compound/g

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 01/13] mm: Update ptep_get_lockless()s comment
  2022-10-31  1:47                                             ` Linus Torvalds
                                                                 ` (2 preceding siblings ...)
  2022-10-31  9:39                                               ` [PATCH 01/13] mm: Update ptep_get_lockless()s comment Peter Zijlstra
@ 2022-10-31  9:46                                               ` Peter Zijlstra
  3 siblings, 0 replies; 148+ messages in thread
From: Peter Zijlstra @ 2022-10-31  9:46 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Nadav Amit, Jann Horn, John Hubbard, X86 ML, Matthew Wilcox,
	Andrew Morton, kernel list, Linux-MM, Andrea Arcangeli,
	Kirill A . Shutemov, jroedel, ubizjak, Alistair Popple

On Sun, Oct 30, 2022 at 06:47:23PM -0700, Linus Torvalds wrote:

> diff --git a/mm/memory.c b/mm/memory.c
> index ba1d08a908a4..c893f5ffc5a8 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1451,9 +1451,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>  				if (pte_young(ptent) &&
>  				    likely(!(vma->vm_flags & VM_SEQ_READ)))
>  					mark_page_accessed(page);
> +			}
> +			page_zap_pte_rmap(page);
>  			munlock_vma_page(page, vma, false);
>  			rss[mm_counter(page)]--;
>  			if (unlikely(page_mapcount(page) < 0))
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 69de6c833d5c..28b51a31ebb0 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1413,47 +1413,26 @@ static void page_remove_anon_compound_rmap(struct page *page)
>  }
>  
>  /**
> + * page_zap_pte_rmap - take down a pte mapping from a page
>   * @page:	page to remove mapping from
>   *
> + * This is the simplified form of page_remove_rmap(), that only
> + * deals with last-level pages, so 'compound' is always false,
> + * and the caller does 'munlock_vma_page(page, vma, compound)'
> + * separately.
>   *
> + * This allows for a much simpler calling convention and code.
>   *
>   * The caller holds the pte lock.
>   */
> +void page_zap_pte_rmap(struct page *page)
>  {
>  	if (!atomic_add_negative(-1, &page->_mapcount))
>  		return;
>  
>  	lock_page_memcg(page);
> +	__dec_lruvec_page_state(page,
> +		PageAnon(page) ? NR_ANON_MAPPED : NR_FILE_MAPPED);
>  	unlock_page_memcg(page);
>  }

So we *could* use atomic_add_return() and include the print_bad_pte()
thing in this function -- however that turns the whole thing into a mess
again :/
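
For illustration, that variant would look something like the sketch
below - part of the mess being that print_bad_pte() is static to
mm/memory.c and would need the vma, address and pte plumbed through,
so the signature change here is hypothetical:

        void page_zap_pte_rmap(struct vm_area_struct *vma, unsigned long addr,
                               pte_t ptent, struct page *page)
        {
                int val = atomic_add_return(-1, &page->_mapcount);

                /* _mapcount is -1 when unmapped, so < -1 means underflow */
                if (unlikely(val < -1))
                        print_bad_pte(vma, addr, ptent, page);

                /* page still mapped by someone else? */
                if (val >= 0)
                        return;

                lock_page_memcg(page);
                __dec_lruvec_page_state(page,
                        PageAnon(page) ? NR_ANON_MAPPED : NR_FILE_MAPPED);
                unlock_page_memcg(page);
        }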

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 01/13] mm: Update ptep_get_lockless()s comment
  2022-10-31  5:00                                                 ` Linus Torvalds
@ 2022-10-31 15:43                                                   ` Nadav Amit
  2022-10-31 17:32                                                     ` Linus Torvalds
  0 siblings, 1 reply; 148+ messages in thread
From: Nadav Amit @ 2022-10-31 15:43 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, Jann Horn, John Hubbard, X86 ML, Matthew Wilcox,
	Andrew Morton, kernel list, Linux-MM, Andrea Arcangeli,
	Kirill A . Shutemov, jroedel, ubizjak, Alistair Popple

On Oct 30, 2022, at 10:00 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote:

> So the current ordering rules are basically that we need to do
> set_page_dirty() *and* we need to flush the TLB's before dropping the
> page table lock. That's what gets us serialized with "mkclean”.

I understand. I am still not sure whether ordering the set_page_dirty() and
dropping the mapcount reference cannot suffice for the reclaim logic not to
free the buffers if the page is dirtied.

According to the code, shrink_page_list() first checks for folio_mapped()
and then for folio_test_dirty() to check whether pageout() is necessary.
IIUC, the buffers are not dropped up to this point and set_page_dirty()
would always set the page-struct dirty bit.

IOW: In shrink_page_list(), when we decide on whether to pageout(), we
should see whether the page is dirty (give or take smp_rmb()).

But this is an optimization and I do not know all the cases in which buffers
might be dropped. My intuition says that they cannot be dropped while
mapcount != 0, but I need to further explore it.


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 01/13] mm: Update ptep_get_lockless()s comment
  2022-10-31  9:28                                             ` Peter Zijlstra
@ 2022-10-31 17:19                                               ` Linus Torvalds
  0 siblings, 0 replies; 148+ messages in thread
From: Linus Torvalds @ 2022-10-31 17:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Nadav Amit, Jann Horn, John Hubbard, X86 ML, Matthew Wilcox,
	Andrew Morton, kernel list, Linux-MM, Andrea Arcangeli,
	Kirill A . Shutemov, jroedel, ubizjak, Alistair Popple

On Mon, Oct 31, 2022 at 2:29 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Sun, Oct 30, 2022 at 03:47:36PM -0700, Linus Torvalds wrote:
>
> > + * This is the simplified form of page_remove_rmap(), that only
> > + * deals with last-level pages, so 'compound' is always false,
> > + * and the caller does 'munlock_vma_page(page, vma, compound)'
> > + * separately.
> > + *
> > + * This allows for a much simpler calling convention and code.
> > + *
> > + * The caller holds the pte lock.
> > + */
> > +void page_zap_pte_rmap(struct page *page)
> > +{
>
> One could consider adding something like:
>
> #ifdef USE_SPLIT_PTE_PTLOCKS
>         lockdep_assert_held(ptlock_ptr(page))
> #endif

Yes, except that the page lock goes away in the next few patches and
gets replaced by just using the safe dec_lruvec_page_state() instead,
so it's not really worth it.

> > +     if (!atomic_add_negative(-1, &page->_mapcount))
> > +             return;
> > +
> > +     lock_page_memcg(page);
> > +     __dec_lruvec_page_state(page,
> > +             PageAnon(page) ? NR_ANON_MAPPED : NR_FILE_MAPPED);
> > +     unlock_page_memcg(page);
> > +}
>
> Took me a little while, but yes, .compound=false seems to reduce to
> this.

Yeah - it's why I kept that thing as three separate patches, because
even if each of the patches isn't "immediately obvious", you can at
least go back and follow along and see what I did.

The *full* simplification end result just looks like magic.

Admittedly, I think a lot of that "looks like magic" is because the
rmap code has seriously cruftified over the years. We had that time
when we actually did exactly that.

Go back a decade, and we literally used to do pretty much exactly what
the simplified form does. The transformation to complexity hell starts
with commit 89c06bd52fb9 ("memcg: use new logic for page stat
accounting"), but just look at what it looked like before that:

  git show 89c06bd52fb9^:mm/rmap.c

gets you the state back when it was simple. And look at what it did:

        void page_remove_rmap(struct page *page)
        {
                /* page still mapped by someone else? */
                if (!atomic_add_negative(-1, &page->_mapcount))
                        return;
                ... statistics go here ..

so in the end the simplified form of this page_zap_pte_rmap() really
isn't *that* surprising.

In fact, that old code handled PageHuge() fairly naturally too, and
almost all the mess comes from the memcg accounting - and locking -
changes.

And I actually really wanted to go one step further, and try to then
batch up the page state accounting too. It's kind of stupid how we do
all that memcg locking for each page, and with this new setup we have
one nice array of pages that we *could* try to just batch things with.

The pages in normal situations *probably* mostly all share the same
memcg and node, so just optimistically trying to do something like "as
long as it's the same memcg as last time, just keep the lock".

Instead of locking and unlocking for every single page.
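
As a very rough sketch of that idea (hand-wavy: the helper name is
made up, it ignores the per-node dimension, and it just assumes the
encoded page array from clean_and_free_pages_and_swap_cache() is
handy):

        static void zap_pte_rmaps_batched(struct encoded_page **pages,
                                          unsigned int nr)
        {
                struct page *locked = NULL;

                for (unsigned int i = 0; i < nr; i++) {
                        struct page *page = encoded_page_ptr(pages[i]);

                        if (!encoded_page_flags(pages[i]))
                                continue;

                        /* page still mapped by someone else? */
                        if (!atomic_add_negative(-1, &page->_mapcount))
                                continue;

                        /* only re-take the memcg lock when the memcg changes */
                        if (!locked || page_memcg(page) != page_memcg(locked)) {
                                if (locked)
                                        unlock_page_memcg(locked);
                                lock_page_memcg(page);
                                locked = page;
                        }

                        __dec_lruvec_page_state(page, PageAnon(page) ?
                                NR_ANON_MAPPED : NR_FILE_MAPPED);
                }
                if (locked)
                        unlock_page_memcg(locked);
        }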

But just looking at it exhausted me enough that I don't think I'll go there.

Put another way: after this series, it's not the 'rmap' code that
makes me go Ugh, it's the memcg tracking..

(But the hugepage rmap code is incredibly nasty stuff still, and I
think the whole 'compound=true' case would be better with somebody
taking a look at that case too, but that somebody won't be me).

                  Linus

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 01/13] mm: Update ptep_get_lockless()s comment
  2022-10-31  9:39                                               ` [PATCH 01/13] mm: Update ptep_get_lockless()s comment Peter Zijlstra
@ 2022-10-31 17:22                                                 ` Linus Torvalds
  0 siblings, 0 replies; 148+ messages in thread
From: Linus Torvalds @ 2022-10-31 17:22 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Nadav Amit, Jann Horn, John Hubbard, X86 ML, Matthew Wilcox,
	Andrew Morton, kernel list, Linux-MM, Andrea Arcangeli,
	Kirill A . Shutemov, jroedel, ubizjak, Alistair Popple

On Mon, Oct 31, 2022 at 2:39 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> You're making up new words ;-)
>
>   s/compund/compound/g

Oops.

Fixed. Thanks,

              Linus

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 01/13] mm: Update ptep_get_lockless()s comment
  2022-10-31  9:36                                               ` Peter Zijlstra
@ 2022-10-31 17:28                                                 ` Linus Torvalds
  2022-10-31 18:43                                                   ` mm: delay rmap removal until after TLB flush Linus Torvalds
  0 siblings, 1 reply; 148+ messages in thread
From: Linus Torvalds @ 2022-10-31 17:28 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Nadav Amit, Jann Horn, John Hubbard, X86 ML, Matthew Wilcox,
	Andrew Morton, kernel list, Linux-MM, Andrea Arcangeli,
	Kirill A . Shutemov, jroedel, ubizjak, Alistair Popple

On Mon, Oct 31, 2022 at 2:36 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> >  (c) move it to clean_and_free_pages_and_swap_cache() that actually
> > does the page_zap_pte_rmap() now.
>
> I'm leaning towards (c); simply because the error case is so terrifying
> I feel we should check for it (and I do have vague memories of us
> actually hitting something like this in the very distant past).

Ok. At that point we no longer have the pte or the virtual address, so
it's not going to be exactly the same debug output.

But I think it ends up being fairly natural to do

        VM_WARN_ON_ONCE_PAGE(page_mapcount(page) < 0, page);

instead, and I've fixed that last patch up to do that.
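
In the loop of clean_and_free_pages_and_swap_cache() from the last
patch, that fixup would look roughly like this (a sketch, with the
check placed after the decrement so it keeps the old underflow
semantics):

                if (flags) {
                        /* Clean the flagged pointer in-place */
                        struct page *page = encoded_page_ptr(encoded);
                        pages[i] = encode_page(page, 0);

                        /* The flag bit being set means that we should zap the rmap */
                        page_zap_pte_rmap(page);

                        /* Replaces the old mapcount sanity check in zap_pte_range() */
                        VM_WARN_ON_ONCE_PAGE(page_mapcount(page) < 0, page);
                }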

                      Linus

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 01/13] mm: Update ptep_get_lockless()s comment
  2022-10-31 15:43                                                   ` Nadav Amit
@ 2022-10-31 17:32                                                     ` Linus Torvalds
  0 siblings, 0 replies; 148+ messages in thread
From: Linus Torvalds @ 2022-10-31 17:32 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Peter Zijlstra, Jann Horn, John Hubbard, X86 ML, Matthew Wilcox,
	Andrew Morton, kernel list, Linux-MM, Andrea Arcangeli,
	Kirill A . Shutemov, jroedel, ubizjak, Alistair Popple

On Mon, Oct 31, 2022 at 8:43 AM Nadav Amit <nadav.amit@gmail.com> wrote:
>
> On Oct 30, 2022, at 10:00 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> > So the current ordering rules are basically that we need to do
> > set_page_dirty() *and* we need to flush the TLB's before dropping the
> > page table lock. That's what gets us serialized with "mkclean”.
>
> I understand. I am still not sure whether ordering the set_page_dirty() and
> dropping the mapcount reference cannot suffice for the reclaim logic not to
> free the buffers if the page is dirtied.

Ahh, ok.

> According to the code, shrink_page_list() first checks for folio_mapped()
> and then for folio_test_dirty() to check whether pageout() is necessary.
> IIUC, the buffers are not dropped up to this point and set_page_dirty()
> would always set the page-struct dirty bit.
>
> IOW: In shrink_page_list(), when we decide on whether to pageout(), we
> should see whether the page is dirty (give or take smp_rmb()).
>
> But this is an optimization and I do not know all the cases in which buffers
> might be dropped. My intuition says that they cannot be dropped while
> mapcount != 0, but I need to further explore it.

Yes, the above sounds like one fairly good way out of the whole forced
TLB flushing for dirty pages, while still keeping the filesystem code
happy.

But at this point it's an independent issue.

And I really would like any fix like that to also fix the whole issue
with GUP too, which seems related.

           Linus

^ permalink raw reply	[flat|nested] 148+ messages in thread

* mm: delay rmap removal until after TLB flush
  2022-10-31 17:28                                                 ` Linus Torvalds
@ 2022-10-31 18:43                                                   ` Linus Torvalds
  2022-11-02  9:14                                                     ` Christian Borntraeger
                                                                       ` (4 more replies)
  0 siblings, 5 replies; 148+ messages in thread
From: Linus Torvalds @ 2022-10-31 18:43 UTC (permalink / raw)
  To: Peter Zijlstra, Will Deacon, Aneesh Kumar, Nick Piggin,
	Heiko Carstens, Vasily Gorbik, Alexander Gordeev,
	Christian Borntraeger, Sven Schnelle
  Cc: Nadav Amit, Jann Horn, John Hubbard, X86 ML, Matthew Wilcox,
	Andrew Morton, kernel list, Linux-MM, Andrea Arcangeli,
	Kirill A . Shutemov, Joerg Roedel, Uros Bizjak, Alistair Popple,
	linux-arch

Updated subject line, and here's the link to the original discussion
for new people:

    https://lore.kernel.org/all/B88D3073-440A-41C7-95F4-895D3F657EF2@gmail.com/

On Mon, Oct 31, 2022 at 10:28 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Ok. At that point we no longer have the pte or the virtual address, so
> it's not going to be exactly the same debug output.
>
> But I think it ends up being fairly natural to do
>
>         VM_WARN_ON_ONCE_PAGE(page_mapcount(page) < 0, page);
>
> instead, and I've fixed that last patch up to do that.

Ok, so I've got a fixed set of patches based on the feedback from
PeterZ, and also tried to do the s390 updates for this blindly, and
pushed them out into a git branch:

    https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/log/?h=mmu_gather-race-fix

If people really want to see the patches in email again, I can do
that, but most of you already have, and the changes are either trivial
fixes or the s390 updates.

For the s390 people that I've now added to the participant list maybe
the git tree is fine - and the fundamental explanation of the problem
is in that top-most commit (with the three preceding commits being
prep-work). Or that link to the thread about this all.

That top-most commit is also where I tried to fix things up for s390
that uses its own non-gathering TLB flush due to
CONFIG_MMU_GATHER_NO_GATHER.

NOTE NOTE NOTE! Unlike my regular git branch, this one may end up
rebased etc for further comments and fixes. So don't consider that
stable, it's still more of an RFC branch.

At a minimum I'll update it with Ack's etc, assuming I get those, and
my s390 changes are entirely untested and probably won't work.

As far as I can tell, s390 doesn't actually *have* the problem that
causes this change, because of its synchronous TLB flush, but it
obviously needs to deal with the change of rmap zapping logic.

Also added a few people who are explicitly listed as being mmu_gather
maintainers. Maybe people saw the discussion on the linux-mm list, but
let's make it explicit.

Do people have any objections to this approach, or other suggestions?

I do *not* consider this critical, so it's a "queue for 6.2" issue for me.

It probably makes most sense to queue in the -MM tree (after the thing
is acked and people agree), but I can keep that branch alive too and
just deal with it all myself as well.

Anybody?

                     Linus

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 04/13] mm: Fix pmd_read_atomic()
  2022-10-22 17:30   ` Linus Torvalds
  2022-10-24  8:09     ` Peter Zijlstra
@ 2022-11-01 12:41     ` Peter Zijlstra
  2022-11-01 17:42       ` Linus Torvalds
                         ` (3 more replies)
  1 sibling, 4 replies; 148+ messages in thread
From: Peter Zijlstra @ 2022-11-01 12:41 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: x86, willy, akpm, linux-kernel, linux-mm, aarcange,
	kirill.shutemov, jroedel, ubizjak

On Sat, Oct 22, 2022 at 10:30:51AM -0700, Linus Torvalds wrote:
> On Sat, Oct 22, 2022 at 4:48 AM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > --- a/include/linux/pgtable.h
> > +++ b/include/linux/pgtable.h
> > @@ -258,6 +258,13 @@ static inline pte_t ptep_get(pte_t *ptep
> >  }
> >  #endif
> >
> > +#ifndef __HAVE_ARCH_PMDP_GET
> > +static inline pmd_t pmdp_get(pmd_t *pmdp)
> > +{
> > +       return READ_ONCE(*pmdp);
> > +}
> > +#endif
> 
> What, what, what?
> 
> Where did that __HAVE_ARCH_PMDP_GET come from?
> 
> I'm not seeing it #define'd anywhere, and we _really_ shouldn't be
> doing this any more.
> 
> Please just do
> 
>     #ifndef pmdp_get
>     static inline pmd_t pmdp_get(pmd_t *pmdp)
>     ..
> 
> and have the architectures that do their own pmdp_get(), just have that
> 
>    #define pmdp_get pmdp_get
> 
> to let the generic code know about it. Instead of making up a new
> __HAVE_ARCH_XYZ name.

So I've stuck the below on. There's a *TON* more to convert and I'm not
going to be doing that just now (seems like a clever enough script
should be able to), but this gets rid of the new one I introduced.

---

Subject: mm: Convert __HAVE_ARCH_P..P_GET to the new style
From: Peter Zijlstra <peterz@infradead.org>
Date: Tue Nov  1 12:53:18 CET 2022

Since __HAVE_ARCH_* style guards have been deprecated in favour of
defining the function name onto itself, convert pxxp_get().

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/powerpc/include/asm/nohash/32/pgtable.h |    2 +-
 include/linux/pgtable.h                      |    4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

--- a/arch/powerpc/include/asm/nohash/32/pgtable.h
+++ b/arch/powerpc/include/asm/nohash/32/pgtable.h
@@ -263,7 +263,7 @@ static inline pte_basic_t pte_update(str
 }
 
 #ifdef CONFIG_PPC_16K_PAGES
-#define __HAVE_ARCH_PTEP_GET
+#define ptep_get ptep_get
 static inline pte_t ptep_get(pte_t *ptep)
 {
	pte_basic_t val = READ_ONCE(ptep->pte);
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -291,14 +291,14 @@ static inline void ptep_clear(struct mm_
	ptep_get_and_clear(mm, addr, ptep);
 }
 
-#ifndef __HAVE_ARCH_PTEP_GET
+#ifndef ptep_get
 static inline pte_t ptep_get(pte_t *ptep)
 {
	return READ_ONCE(*ptep);
 }
 #endif
 
-#ifndef __HAVE_ARCH_PMDP_GET
+#ifndef pmdp_get
 static inline pmd_t pmdp_get(pmd_t *pmdp)
 {
	return READ_ONCE(*pmdp);


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 04/13] mm: Fix pmd_read_atomic()
  2022-11-01 12:41     ` Peter Zijlstra
@ 2022-11-01 17:42       ` Linus Torvalds
  2022-11-02  9:12       ` [tip: x86/mm] mm: Convert __HAVE_ARCH_P..P_GET to the new style tip-bot2 for Peter Zijlstra
                         ` (2 subsequent siblings)
  3 siblings, 0 replies; 148+ messages in thread
From: Linus Torvalds @ 2022-11-01 17:42 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: x86, willy, akpm, linux-kernel, linux-mm, aarcange,
	kirill.shutemov, jroedel, ubizjak

On Tue, Nov 1, 2022 at 5:42 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> So I've stuck the below on. There's a *TON* more to convert and I'm not
> going to be doing that just now (seems like a clever enough script
> should be able to), but this gets rid of the new one I introduced.

Thanks.

And no, I don't think the churn of converting old cases is worth it. I
just want to discourage *more* of this.

Using the same name really helps when you do a "git grep" for a
symbol: the whole '#ifndef' pattern for alternate architecture
definitions shows up really clearly.

So that - together with not having the possibility of mixing up names
- is the main reason I don't like the ARCH_HAS_XYZ pattern, and much
prefer just using the name of whichever function gets an architecture
override.

                 Linus

PS. I'd love to get an ack/nak on the "mm: delay rmap removal until
after TLB flush" thing.

^ permalink raw reply	[flat|nested] 148+ messages in thread

* [tip: x86/mm] mm: Convert __HAVE_ARCH_P..P_GET to the new style
  2022-11-01 12:41     ` Peter Zijlstra
  2022-11-01 17:42       ` Linus Torvalds
@ 2022-11-02  9:12       ` tip-bot2 for Peter Zijlstra
  2022-11-03 21:15       ` tip-bot2 for Peter Zijlstra
  2022-12-17 18:55       ` tip-bot2 for Peter Zijlstra
  3 siblings, 0 replies; 148+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2022-11-02  9:12 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Linus Torvalds, Peter Zijlstra (Intel), x86, linux-kernel

The following commit has been merged into the x86/mm branch of tip:

Commit-ID:     0daf48b9e44674fb5ffc33cd41a3a17326e26cca
Gitweb:        https://git.kernel.org/tip/0daf48b9e44674fb5ffc33cd41a3a17326e26cca
Author:        Peter Zijlstra <peterz@infradead.org>
AuthorDate:    Tue, 01 Nov 2022 12:53:18 +01:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Tue, 01 Nov 2022 13:44:05 +01:00

mm: Convert __HAVE_ARCH_P..P_GET to the new style

Since __HAVE_ARCH_* style guards have been deprecated in favour of
defining the function name onto itself, convert pxxp_get().

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/Y2EUEBlQXNgaJgoI@hirez.programming.kicks-ass.net
---
 arch/powerpc/include/asm/nohash/32/pgtable.h | 2 +-
 include/linux/pgtable.h                      | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/include/asm/nohash/32/pgtable.h b/arch/powerpc/include/asm/nohash/32/pgtable.h
index 0d40b33..cb1ac02 100644
--- a/arch/powerpc/include/asm/nohash/32/pgtable.h
+++ b/arch/powerpc/include/asm/nohash/32/pgtable.h
@@ -263,7 +263,7 @@ static inline pte_basic_t pte_update(struct mm_struct *mm, unsigned long addr, p
 }
 
 #ifdef CONFIG_PPC_16K_PAGES
-#define __HAVE_ARCH_PTEP_GET
+#define ptep_get ptep_get
 static inline pte_t ptep_get(pte_t *ptep)
 {
 	pte_basic_t val = READ_ONCE(ptep->pte);
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 2334852..70e2a7e 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -291,14 +291,14 @@ static inline void ptep_clear(struct mm_struct *mm, unsigned long addr,
 	ptep_get_and_clear(mm, addr, ptep);
 }
 
-#ifndef __HAVE_ARCH_PTEP_GET
+#ifndef ptep_get
 static inline pte_t ptep_get(pte_t *ptep)
 {
 	return READ_ONCE(*ptep);
 }
 #endif
 
-#ifndef __HAVE_ARCH_PMDP_GET
+#ifndef pmdp_get
 static inline pmd_t pmdp_get(pmd_t *pmdp)
 {
 	return READ_ONCE(*pmdp);

^ permalink raw reply related	[flat|nested] 148+ messages in thread

* Re: mm: delay rmap removal until after TLB flush
  2022-10-31 18:43                                                   ` mm: delay rmap removal until after TLB flush Linus Torvalds
@ 2022-11-02  9:14                                                     ` Christian Borntraeger
  2022-11-02  9:23                                                       ` Christian Borntraeger
  2022-11-02 17:55                                                       ` Linus Torvalds
  2022-11-02 12:45                                                     ` Peter Zijlstra
                                                                       ` (3 subsequent siblings)
  4 siblings, 2 replies; 148+ messages in thread
From: Christian Borntraeger @ 2022-11-02  9:14 UTC (permalink / raw)
  To: Linus Torvalds, Peter Zijlstra, Will Deacon, Aneesh Kumar,
	Nick Piggin, Heiko Carstens, Vasily Gorbik, Alexander Gordeev,
	Sven Schnelle, Gerald Schaefer
  Cc: Nadav Amit, Jann Horn, John Hubbard, X86 ML, Matthew Wilcox,
	Andrew Morton, kernel list, Linux-MM, Andrea Arcangeli,
	Kirill A . Shutemov, Joerg Roedel, Uros Bizjak, Alistair Popple,
	linux-arch

Am 31.10.22 um 19:43 schrieb Linus Torvalds:
> Updated subject line, and here's the link to the original discussion
> for new people:
> 
>      https://lore.kernel.org/all/B88D3073-440A-41C7-95F4-895D3F657EF2@gmail.com/
> 
> On Mon, Oct 31, 2022 at 10:28 AM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>> Ok. At that point we no longer have the pte or the virtual address, so
>> it's not going to be exactly the same debug output.
>>
>> But I think it ends up being fairly natural to do
>>
>>          VM_WARN_ON_ONCE_PAGE(page_mapcount(page) < 0, page);
>>
>> instead, and I've fixed that last patch up to do that.
> 
> Ok, so I've got a fixed set of patches based on the feedback from
> PeterZ, and also tried to do the s390 updates for this blindly, and
> pushed them out into a git branch:
> 
>      https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/log/?h=mmu_gather-race-fix
> 
> If people really want to see the patches in email again, I can do
> that, but most of you already have, and the changes are either trivial
> fixes or the s390 updates.
> 
> For the s390 people that I've now added to the participant list maybe
> the git tree is fine - and the fundamental explanation of the problem
> is in that top-most commit (with the three preceding commits being
> prep-work). Or that link to the thread about this all.

Adding Gerald.

> 
> That top-most commit is also where I tried to fix things up for s390
> that uses its own non-gathering TLB flush due to
> CONFIG_MMU_GATHER_NO_GATHER.
> 
> NOTE NOTE NOTE! Unlike my regular git branch, this one may end up
> rebased etc for further comments and fixes. So don't consider that
> stable, it's still more of an RFC branch.
> 
> At a minimum I'll update it with Ack's etc, assuming I get those, and
> my s390 changes are entirely untested and probably won't work.
> 
> As far as I can tell, s390 doesn't actually *have* the problem that
> causes this change, because of its synchronous TLB flush, but it
> obviously needs to deal with the change of rmap zapping logic.
> 
> Also added a few people who are explicitly listed as being mmu_gather
> maintainers. Maybe people saw the discussion on the linux-mm list, but
> let's make it explicit.
> 
> Do people have any objections to this approach, or other suggestions?
> 
> I do *not* consider this critical, so it's a "queue for 6.2" issue for me.
> 
> It probably makes most sense to queue in the -MM tree (after the thing
> is acked and people agree), but I can keep that branch alive too and
> just deal with it all myself as well.
> 
> Anybody?
> 
>                       Linus

It certainly needs a build fix for s390:


In file included from kernel/sched/core.c:78:
./arch/s390/include/asm/tlb.h: In function '__tlb_remove_page_size':
./arch/s390/include/asm/tlb.h:50:17: error: implicit declaration of function 'page_zap_pte_rmap' [-Werror=implicit-function-declaration]
    50 |                 page_zap_pte_rmap(page);
       |                 ^~~~~~~~~~~~~~~~~
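
(One minimal way to make that build would be to make the prototype
visible to arch/s390/include/asm/tlb.h, e.g. with a forward declaration
along the lines below - an assumption on my side, the branch may well
fix it differently, for instance by pulling in the rmap header.)

        struct page;
        void page_zap_pte_rmap(struct page *);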

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: mm: delay rmap removal until after TLB flush
  2022-11-02  9:14                                                     ` Christian Borntraeger
@ 2022-11-02  9:23                                                       ` Christian Borntraeger
  2022-11-02 17:55                                                       ` Linus Torvalds
  1 sibling, 0 replies; 148+ messages in thread
From: Christian Borntraeger @ 2022-11-02  9:23 UTC (permalink / raw)
  To: Linus Torvalds, Peter Zijlstra, Will Deacon, Aneesh Kumar,
	Nick Piggin, Heiko Carstens, Vasily Gorbik, Alexander Gordeev,
	Sven Schnelle, Gerald Schaefer
  Cc: Nadav Amit, Jann Horn, John Hubbard, X86 ML, Matthew Wilcox,
	Andrew Morton, kernel list, Linux-MM, Andrea Arcangeli,
	Kirill A . Shutemov, Joerg Roedel, Uros Bizjak, Alistair Popple,
	linux-arch


Am 02.11.22 um 10:14 schrieb Christian Borntraeger:
> Am 31.10.22 um 19:43 schrieb Linus Torvalds:
>> Updated subject line, and here's the link to the original discussion
>> for new people:
>>
>>      https://lore.kernel.org/all/B88D3073-440A-41C7-95F4-895D3F657EF2@gmail.com/
>>
>> On Mon, Oct 31, 2022 at 10:28 AM Linus Torvalds
>> <torvalds@linux-foundation.org> wrote:
>>>
>>> Ok. At that point we no longer have the pte or the virtual address, so
>>> it's not going to be exactly the same debug output.
>>>
>>> But I think it ends up being fairly natural to do
>>>
>>>          VM_WARN_ON_ONCE_PAGE(page_mapcount(page) < 0, page);
>>>
>>> instead, and I've fixed that last patch up to do that.
>>
>> Ok, so I've got a fixed set of patches based on the feedback from
>> PeterZ, and also tried to do the s390 updates for this blindly, and
>> pushed them out into a git branch:
>>
>>      https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/log/?h=mmu_gather-race-fix
>>
>> If people really want to see the patches in email again, I can do
>> that, but most of you already have, and the changes are either trivial
>> fixes or the s390 updates.
>>
>> For the s390 people that I've now added to the participant list maybe
>> the git tree is fine - and the fundamental explanation of the problem
>> is in that top-most commit (with the three preceding commits being
>> prep-work). Or that link to the thread about this all.
> 
> Adding Gerald.

now the correct Gerald....
> 
>>
>> That top-most commit is also where I tried to fix things up for s390
>> that uses its own non-gathering TLB flush due to
>> CONFIG_MMU_GATHER_NO_GATHER.
>>
>> NOTE NOTE NOTE! Unlike my regular git branch, this one may end up
>> rebased etc for further comments and fixes. So don't consider that
>> stable, it's still more of an RFC branch.
>>
>> At a minimum I'll update it with Ack's etc, assuming I get those, and
>> my s390 changes are entirely untested and probably won't work.
>>
>> As far as I can tell, s390 doesn't actually *have* the problem that
>> causes this change, because of its synchronous TLB flush, but it
>> obviously needs to deal with the change of rmap zapping logic.
>>
>> Also added a few people who are explicitly listed as being mmu_gather
>> maintainers. Maybe people saw the discussion on the linux-mm list, but
>> let's make it explicit.
>>
>> Do people have any objections to this approach, or other suggestions?
>>
>> I do *not* consider this critical, so it's a "queue for 6.2" issue for me.
>>
>> It probably makes most sense to queue in the -MM tree (after the thing
>> is acked and people agree), but I can keep that branch alive too and
>> just deal with it all myself as well.
>>
>> Anybody?
>>
>>                       Linus
> 
> It certainly needs a build fix for s390:
> 
> 
> In file included from kernel/sched/core.c:78:
> ./arch/s390/include/asm/tlb.h: In function '__tlb_remove_page_size':
> ./arch/s390/include/asm/tlb.h:50:17: error: implicit declaration of function 'page_zap_pte_rmap' [-Werror=implicit-function-declaration]
>     50 |                 page_zap_pte_rmap(page);
>        |                 ^~~~~~~~~~~~~~~~~

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: mm: delay rmap removal until after TLB flush
  2022-10-31 18:43                                                   ` mm: delay rmap removal until after TLB flush Linus Torvalds
  2022-11-02  9:14                                                     ` Christian Borntraeger
@ 2022-11-02 12:45                                                     ` Peter Zijlstra
  2022-11-02 22:31                                                     ` Gerald Schaefer
                                                                       ` (2 subsequent siblings)
  4 siblings, 0 replies; 148+ messages in thread
From: Peter Zijlstra @ 2022-11-02 12:45 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Will Deacon, Aneesh Kumar, Nick Piggin, Heiko Carstens,
	Vasily Gorbik, Alexander Gordeev, Christian Borntraeger,
	Sven Schnelle, Nadav Amit, Jann Horn, John Hubbard, X86 ML,
	Matthew Wilcox, Andrew Morton, kernel list, Linux-MM,
	Andrea Arcangeli, Kirill A . Shutemov, Joerg Roedel, Uros Bizjak,
	Alistair Popple, linux-arch

On Mon, Oct 31, 2022 at 11:43:30AM -0700, Linus Torvalds wrote:

>     https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/log/?h=mmu_gather-race-fix


Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: mm: delay rmap removal until after TLB flush
  2022-11-02  9:14                                                     ` Christian Borntraeger
  2022-11-02  9:23                                                       ` Christian Borntraeger
@ 2022-11-02 17:55                                                       ` Linus Torvalds
  2022-11-02 18:28                                                         ` Linus Torvalds
  2022-11-02 22:29                                                         ` Gerald Schaefer
  1 sibling, 2 replies; 148+ messages in thread
From: Linus Torvalds @ 2022-11-02 17:55 UTC (permalink / raw)
  To: Christian Borntraeger, Gerald Schaefer
  Cc: Peter Zijlstra, Will Deacon, Aneesh Kumar, Nick Piggin,
	Heiko Carstens, Vasily Gorbik, Alexander Gordeev, Sven Schnelle,
	Nadav Amit, Jann Horn, John Hubbard, X86 ML, Matthew Wilcox,
	Andrew Morton, kernel list, Linux-MM, Andrea Arcangeli,
	Kirill A . Shutemov, Joerg Roedel, Uros Bizjak, Alistair Popple,
	linux-arch

On Wed, Nov 2, 2022 at 2:15 AM Christian Borntraeger
<borntraeger@linux.ibm.com> wrote:
>
> It certainly needs a build fix for s390:
>
> In file included from kernel/sched/core.c:78:
> ./arch/s390/include/asm/tlb.h: In function '__tlb_remove_page_size':
> ./arch/s390/include/asm/tlb.h:50:17: error: implicit declaration of function 'page_zap_pte_rmap' [-Werror=implicit-function-declaration]
>     50 |                 page_zap_pte_rmap(page);
>        |                 ^~~~~~~~~~~~~~~~~

Hmm. I'm not sure if I can add a

   #include <linux/rmap.h>

to that s390 asm header file without causing more issues.

The minimal damage would probably be to duplicate the declaration of
page_zap_pte_rmap() in the s390 asm/tlb.h header where it is used.

Not pretty to have two different declarations of that thing, but
anything that then includes both <asm/tlb.h> and <linux/rmap.h> (which
is much of mm) would then verify the consistency of  them.

So I'll do that minimal fix and update that branch, but if s390 people
end up having a better fix, please holler.
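
For illustration only, a minimal sketch of that duplicated declaration
(the signature here is an assumption based on the call site in the build
error above, not taken from the branch):

    /* Sketch: duplicate of the <linux/rmap.h> declaration, kept in
     * sync by any file that ends up including both headers. */
    struct page;
    void page_zap_pte_rmap(struct page *page);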

                Linus

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: mm: delay rmap removal until after TLB flush
  2022-11-02 17:55                                                       ` Linus Torvalds
@ 2022-11-02 18:28                                                         ` Linus Torvalds
  2022-11-02 22:29                                                         ` Gerald Schaefer
  1 sibling, 0 replies; 148+ messages in thread
From: Linus Torvalds @ 2022-11-02 18:28 UTC (permalink / raw)
  To: Christian Borntraeger, Gerald Schaefer
  Cc: Peter Zijlstra, Will Deacon, Aneesh Kumar, Nick Piggin,
	Heiko Carstens, Vasily Gorbik, Alexander Gordeev, Sven Schnelle,
	Nadav Amit, Jann Horn, John Hubbard, X86 ML, Matthew Wilcox,
	Andrew Morton, kernel list, Linux-MM, Andrea Arcangeli,
	Kirill A . Shutemov, Joerg Roedel, Uros Bizjak, Alistair Popple,
	linux-arch

On Wed, Nov 2, 2022 at 10:55 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> So I'll do that minimal fix and update that branch, but if s390 people
> end up having a better fix, please holler.

I've updated the branch with that, so hopefully s390 builds now.

I also fixed a typo in the commit message and added Peter's ack. Other
than that it's all the same it was before.

                 Linus

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: mm: delay rmap removal until after TLB flush
  2022-11-02 17:55                                                       ` Linus Torvalds
  2022-11-02 18:28                                                         ` Linus Torvalds
@ 2022-11-02 22:29                                                         ` Gerald Schaefer
  1 sibling, 0 replies; 148+ messages in thread
From: Gerald Schaefer @ 2022-11-02 22:29 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Christian Borntraeger, Peter Zijlstra, Will Deacon, Aneesh Kumar,
	Nick Piggin, Heiko Carstens, Vasily Gorbik, Alexander Gordeev,
	Sven Schnelle, Nadav Amit, Jann Horn, John Hubbard, X86 ML,
	Matthew Wilcox, Andrew Morton, kernel list, Linux-MM,
	Andrea Arcangeli, Kirill A . Shutemov, Joerg Roedel, Uros Bizjak,
	Alistair Popple, linux-arch

On Wed, 2 Nov 2022 10:55:10 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Wed, Nov 2, 2022 at 2:15 AM Christian Borntraeger
> <borntraeger@linux.ibm.com> wrote:
> >
> > It certainly needs a build fix for s390:
> >
> > In file included from kernel/sched/core.c:78:
> > ./arch/s390/include/asm/tlb.h: In function '__tlb_remove_page_size':
> > ./arch/s390/include/asm/tlb.h:50:17: error: implicit declaration of function 'page_zap_pte_rmap' [-Werror=implicit-function-declaration]
> >     50 |                 page_zap_pte_rmap(page);
> >        |                 ^~~~~~~~~~~~~~~~~
> 
> Hmm. I'm not sure if I can add a
> 
>    #include <linux/rmap.h>
> 
> to that s390 asm header file without causing more issues.
> 
> The minimal damage would probably be to duplicate the declaration of
> page_zap_pte_rmap() in the s390 asm/tlb.h header where it is used.
> 
> Not pretty to have two different declarations of that thing, but
> anything that then includes both <asm/tlb.h> and <linux/rmap.h> (which
> is much of mm) would then verify the consistency of  them.
> 
> So I'll do that minimal fix and update that branch, but if s390 people
> end up having a better fix, please holler.

It compiles now with your duplicate declaration, but adding the #include
also did not cause any damage, so that should also be OK.

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: mm: delay rmap removal until after TLB flush
  2022-10-31 18:43                                                   ` mm: delay rmap removal until after TLB flush Linus Torvalds
  2022-11-02  9:14                                                     ` Christian Borntraeger
  2022-11-02 12:45                                                     ` Peter Zijlstra
@ 2022-11-02 22:31                                                     ` Gerald Schaefer
  2022-11-02 23:13                                                       ` Linus Torvalds
  2022-11-03  9:52                                                     ` David Hildenbrand
  2022-11-04  6:33                                                     ` Alexander Gordeev
  4 siblings, 1 reply; 148+ messages in thread
From: Gerald Schaefer @ 2022-11-02 22:31 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, Will Deacon, Aneesh Kumar, Nick Piggin,
	Heiko Carstens, Vasily Gorbik, Alexander Gordeev,
	Christian Borntraeger, Sven Schnelle, Nadav Amit, Jann Horn,
	John Hubbard, X86 ML, Matthew Wilcox, Andrew Morton, kernel list,
	Linux-MM, Andrea Arcangeli, Kirill A . Shutemov, Joerg Roedel,
	Uros Bizjak, Alistair Popple, linux-arch

On Mon, 31 Oct 2022 11:43:30 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> Updated subject line, and here's the link to the original discussion
> for new people:
> 
>     https://lore.kernel.org/all/B88D3073-440A-41C7-95F4-895D3F657EF2@gmail.com/
> 
> On Mon, Oct 31, 2022 at 10:28 AM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > Ok. At that point we no longer have the pte or the virtual address, so
> > it's not going to be exactly the same debug output.
> >
> > But I think it ends up being fairly natural to do
> >
> >         VM_WARN_ON_ONCE_PAGE(page_mapcount(page) < 0, page);
> >
> > instead, and I've fixed that last patch up to do that.
> 
> Ok, so I've got a fixed set of patches based on the feedback from
> PeterZ, and also tried to do the s390 updates for this blindly, and
> pushed them out into a git branch:
> 
>     https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/log/?h=mmu_gather-race-fix
> 
> If people really want to see the patches in email again, I can do
> that, but most of you already have, and the changes are either trivial
> fixes or the s390 updates.
> 
> For the s390 people that I've now added to the participant list maybe
> the git tree is fine - and the fundamental explanation of the problem
> is in that top-most commit (with the three preceding commits being
> prep-work). Or that link to the thread about this all.
> 
> That top-most commit is also where I tried to fix things up for s390
> that uses its own non-gathering TLB flush due to
> CONFIG_MMU_GATHER_NO_GATHER.
> 
> NOTE NOTE NOTE! Unlike my regular git branch, this one may end up
> rebased etc for further comments and fixes. So don't consider that
> stable, it's still more of an RFC branch.
> 
> At a minimum I'll update it with Ack's etc, assuming I get those, and
> my s390 changes are entirely untested and probably won't work.
> 
> As far as I can tell, s390 doesn't actually *have* the problem that
> causes this change, because of its synchronous TLB flush, but it
> obviously needs to deal with the change of rmap zapping logic.

Correct, we already need to flush when we change a PTE, which is
done in ptep_get_and_clear() etc. The only exception would be lazy
flushing when only one active thread is attached; then we would
flush later in flush_tlb_mm/range(), or as soon as another thread
is attached (IIRC).

So it seems straightforward to just call page_zap_pte_rmap()
from our private __tlb_remove_page_size() implementation.
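
For illustration, a rough sketch of what that could look like (the body
and the helper names here are assumptions based on the quoted build error,
not the actual arch/s390/include/asm/tlb.h):

    /* sketch only: s390 flushes synchronously when the PTE is cleared,
     * so the rmap can be dropped right away instead of after a gather */
    static inline bool __tlb_remove_page_size(struct mmu_gather *tlb,
                                              struct page *page, int page_size)
    {
            page_zap_pte_rmap(page);
            free_page_and_swap_cache(page);
            return false;
    }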

Just wondering a bit why you did not also add the
VM_WARN_ON_ONCE_PAGE(page_mapcount(page) < 0, page), like
in the generic change.

Acked-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com> # s390

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: mm: delay rmap removal until after TLB flush
  2022-11-02 22:31                                                     ` Gerald Schaefer
@ 2022-11-02 23:13                                                       ` Linus Torvalds
  0 siblings, 0 replies; 148+ messages in thread
From: Linus Torvalds @ 2022-11-02 23:13 UTC (permalink / raw)
  To: Gerald Schaefer
  Cc: Peter Zijlstra, Will Deacon, Aneesh Kumar, Nick Piggin,
	Heiko Carstens, Vasily Gorbik, Alexander Gordeev,
	Christian Borntraeger, Sven Schnelle, Nadav Amit, Jann Horn,
	John Hubbard, X86 ML, Matthew Wilcox, Andrew Morton, kernel list,
	Linux-MM, Andrea Arcangeli, Kirill A . Shutemov, Joerg Roedel,
	Uros Bizjak, Alistair Popple, linux-arch

On Wed, Nov 2, 2022 at 3:31 PM Gerald Schaefer
<gerald.schaefer@linux.ibm.com> wrote:
>
> Just wondering a bit why you did not also add the
> VM_WARN_ON_ONCE_PAGE(page_mapcount(page) < 0, page), like
> in the generic change.

Heh, I had considered dropping it entirely even from the generic code,
since I don't remember seeing that ever trigger, but PeterZ convinced
me otherwise.

For the s390 side I really wanted to keep things minimal since I
(obviously) didn't even built-test it, so..

I'm perfectly happy with s390 people adding it later, of course.

                Linus

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: mm: delay rmap removal until after TLB flush
  2022-10-31 18:43                                                   ` mm: delay rmap removal until after TLB flush Linus Torvalds
                                                                       ` (2 preceding siblings ...)
  2022-11-02 22:31                                                     ` Gerald Schaefer
@ 2022-11-03  9:52                                                     ` David Hildenbrand
  2022-11-03 16:54                                                       ` Linus Torvalds
  2022-11-04  6:33                                                     ` Alexander Gordeev
  4 siblings, 1 reply; 148+ messages in thread
From: David Hildenbrand @ 2022-11-03  9:52 UTC (permalink / raw)
  To: Linus Torvalds, Peter Zijlstra, Will Deacon, Aneesh Kumar,
	Nick Piggin, Heiko Carstens, Vasily Gorbik, Alexander Gordeev,
	Christian Borntraeger, Sven Schnelle
  Cc: Nadav Amit, Jann Horn, John Hubbard, X86 ML, Matthew Wilcox,
	Andrew Morton, kernel list, Linux-MM, Andrea Arcangeli,
	Kirill A . Shutemov, Joerg Roedel, Uros Bizjak, Alistair Popple,
	linux-arch

On 31.10.22 19:43, Linus Torvalds wrote:
> Updated subject line, and here's the link to the original discussion
> for new people:
> 
>      https://lore.kernel.org/all/B88D3073-440A-41C7-95F4-895D3F657EF2@gmail.com/
> 
> On Mon, Oct 31, 2022 at 10:28 AM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>> Ok. At that point we no longer have the pte or the virtual address, so
>> it's not going to be exactly the same debug output.
>>
>> But I think it ends up being fairly natural to do
>>
>>          VM_WARN_ON_ONCE_PAGE(page_mapcount(page) < 0, page);
>>
>> instead, and I've fixed that last patch up to do that.
> 
> Ok, so I've got a fixed set of patches based on the feedback from
> PeterZ, and also tried to do the s390 updates for this blindly, and
> pushed them out into a git branch:
> 
>      https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/log/?h=mmu_gather-race-fix
> 
> If people really want to see the patches in email again, I can do
> that, but most of you already have, and the changes are either trivial
> fixes or the s390 updates.
> 
> For the s390 people that I've now added to the participant list maybe
> the git tree is fine - and the fundamental explanation of the problem
> is in that top-most commit (with the three preceding commits being
> prep-work). Or that link to the thread about this all.
> 
> That top-most commit is also where I tried to fix things up for s390
> that uses its own non-gathering TLB flush due to
> CONFIG_MMU_GATHER_NO_GATHER.
> 
> NOTE NOTE NOTE! Unlike my regular git branch, this one may end up
> rebased etc for further comments and fixes. So don't consider that
> stable, it's still more of an RFC branch.
> 
> At a minimum I'll update it with Ack's etc, assuming I get those, and
> my s390 changes are entirely untested and probably won't work.
> 
> As far as I can tell, s390 doesn't actually *have* the problem that
> causes this change, because of its synchronous TLB flush, but it
> obviously needs to deal with the change of rmap zapping logic.
> 
> Also added a few people who are explicitly listed as being mmu_gather
> maintainers. Maybe people saw the discussion on the linux-mm list, but
> let's make it explicit.
> 
> Do people have any objections to this approach, or other suggestions?
> 
> I do *not* consider this critical, so it's a "queue for 6.2" issue for me.
> 
> It probably makes most sense to queue in the -MM tree (after the thing
> is acked and people agree), but I can keep that branch alive too and
> just deal with it all myself as well.
> 
> Anybody?

Happy to see that we're still decrementing the mapcount before 
decrementing the refcount; I was briefly concerned.

I was not able to quickly come up with something that would be 
fundamentally wrong here, but the devil is in the detail.

Some minor things could be improved IMHO (ENCODE_PAGE_BITS naming is 
unfortunate, TLB_ZAP_RMAP could be a __bitwise type, using VM_WARN_ON 
instead of VM_BUG_ON).

I agree that 6.2 is good enough and that upstreaming this via the -MM 
tree would be a good way to move forward.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: mm: delay rmap removal until after TLB flush
  2022-11-03  9:52                                                     ` David Hildenbrand
@ 2022-11-03 16:54                                                       ` Linus Torvalds
  2022-11-03 17:09                                                         ` Linus Torvalds
  0 siblings, 1 reply; 148+ messages in thread
From: Linus Torvalds @ 2022-11-03 16:54 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Peter Zijlstra, Will Deacon, Aneesh Kumar, Nick Piggin,
	Heiko Carstens, Vasily Gorbik, Alexander Gordeev,
	Christian Borntraeger, Sven Schnelle, Nadav Amit, Jann Horn,
	John Hubbard, X86 ML, Matthew Wilcox, Andrew Morton, kernel list,
	Linux-MM, Andrea Arcangeli, Kirill A . Shutemov, Joerg Roedel,
	Uros Bizjak, Alistair Popple, linux-arch

On Thu, Nov 3, 2022 at 2:52 AM David Hildenbrand <david@redhat.com> wrote:
>
> Happy to see that we're still decrementing the mapcount before
> decrementing the refcount; I was briefly concerned.

Oh, that would have been horribly wrong.

> I was not able to quickly come up with something that would be
> fundamentally wrong here, but the devil is in the detail.

So I tried to be very careful.

The biggest change in the whole series (visible in last patch, but
there in the prep-patches too) is how it narrows down some lock
coverage.

Now, that locking didn't *do* anything valid, but I did try to point
it out when it happens - how first the mapcount is decremented outside
the memcg lock (in preparatory patches), and then later on the
__dec_lruvec_page_state() turns into a dec_lruvec_page_state() because
it's then done outside the page table lock.

The locking in the second case did do something - not locking-wise,
but simply the "running under the spinlock means we are not
preemptable without RT".

And in the memcg case it was just plain overly wide lock regions.

None of the other changes should have any real semantic meaning
*apart* from just keeping ->_mapcount elevated slightly longer.

> Some minor things could be improved IMHO (ENCODE_PAGE_BITS naming is
> unfortunate, TLB_ZAP_RMAP could be a __bitwise type, using VM_WARN_ON
> instead of VM_BUG_ON).

That VM_BUG_ON() is a case of "what to do if this ever triggers?" So a
WARN_ON() would be fatal too; it's some seriously bogus stuff to try
to put bits in that won't fit.

It probably should be removed, since the value should always be pretty
much a simple constant. It was more of a "let's be careful with new
code, not for production".

Probably like pretty much all VM_BUG_ON's are - test code that just
got left around.

I considered just getting rid of ENCODE_PAGE_BITS entirely, since
there is only one bit. But it was always "let's keep the option open
for dirty bits etc", so I kept it, but I agree that the name isn't
wonderful.

And in fact I wanted the encoding to really be done by the caller (so
that TLB_ZAP_RMAP wouldn't be a new argument, but the 'page' argument
to __tlb_remove_page_*() would simply be an 'encoded page' pointer,
but that would have caused the patch to be much bigger (and expanded
the s390 side too). Which I didn't want to do.

Long-term that's probably still the right thing to do, including
passing the encoded pointers all the way to
free_pages_and_swap_cache().

Because it's pretty disgusting how it cleans up that array in place
and then does that cast to a new array type, but it is also disgusting
how it traverses that array multiple times (ie
free_pages_and_swap_cache() will just have *another* loop).

But again, those changes would have made the patch bigger, which I
didn't want at this point (and 'release_pages()' would need that
clean-in-place anyway, unless we changed *that* too and made the whole
page encoding be something widely available).

That's also why I then went with that fairly generic
"ENCODE_PAGE_BITS" name. The *use* of it right now is very very
specific to just the TLB gather, and the TLB_ZAP_RMAP bit shows that
in the name. But then I went for a fairly generic "encode extra bits
in the page pointer" name because it felt like it might expand beyond
the initial intentionally small patch in the long run.

So it's a combination of "we might want to expand on this in the
future" and yet also "I really want to keep the changes localized in
this patch".

And the two are kind of inverses of each other, which hopefully
explains the seemingly disparate naming logic.
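
For reference, the core of the encoding trick is tiny. A sketch (the
helper names match the series, but the exact definitions here are
assumptions, not the actual patch):

    struct encoded_page;

    #define ENCODE_PAGE_BITS        3ul

    static inline struct encoded_page *encode_page(struct page *page,
                                                   unsigned long flags)
    {
            /* the flags must fit in the pointer's alignment bits */
            VM_BUG_ON(flags > ENCODE_PAGE_BITS);
            return (struct encoded_page *)(flags | (unsigned long)page);
    }

    static inline unsigned long encoded_page_flags(struct encoded_page *page)
    {
            return ENCODE_PAGE_BITS & (unsigned long)page;
    }

    static inline struct page *encoded_page_ptr(struct encoded_page *page)
    {
            return (struct page *)(~ENCODE_PAGE_BITS & (unsigned long)page);
    }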

                 Linus

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: mm: delay rmap removal until after TLB flush
  2022-11-03 16:54                                                       ` Linus Torvalds
@ 2022-11-03 17:09                                                         ` Linus Torvalds
  2022-11-03 17:36                                                           ` David Hildenbrand
  0 siblings, 1 reply; 148+ messages in thread
From: Linus Torvalds @ 2022-11-03 17:09 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Peter Zijlstra, Will Deacon, Aneesh Kumar, Nick Piggin,
	Heiko Carstens, Vasily Gorbik, Alexander Gordeev,
	Christian Borntraeger, Sven Schnelle, Nadav Amit, Jann Horn,
	John Hubbard, X86 ML, Matthew Wilcox, Andrew Morton, kernel list,
	Linux-MM, Andrea Arcangeli, Kirill A . Shutemov, Joerg Roedel,
	Uros Bizjak, Alistair Popple, linux-arch

On Thu, Nov 3, 2022 at 9:54 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> But again, those changes would have made the patch bigger, which I
> didn't want at this point (and 'release_pages()' would need that
> clean-in-place anyway, unless we changed *that* too and made the whole
> page encoding be something widely available).

And just to clarify: this is not just me trying to expand the reach of my patch.

I'd suggest people look at mlock_pagevec(), and realize that LRU_PAGE
and NEW_PAGE are both *exactly* the same kind of "encoded_page" bits
that TLB_ZAP_RMAP is.

Except the mlock code does *not* show that in the type system, and
instead just passes a "struct page **" array around in pvec->pages,
and then you'd just better know that "oh, it's not *really* just a
page pointer".

So I really think that the "array of encoded page pointers" thing is a
generic notion that we *already* have.

It's just that we've done it disgustingly in the past, and I didn't
want to do that disgusting thing again.

So I would hope that the nasty things that the mlock code does would
some day use the same page pointer encoding logic to actually make the
whole "this is not a page pointer that you can use directly, it has
low bits set for flags" very explicit.

I am *not* sure if then the actual encoded bits would be unified.
Probably not - you might have very different and distinct uses of the
encode_page() thing where the bits mean different things in different
contexts.

Anyway, this is me just explaining the thinking behind it all. The
page bit encoding is a very generic thing (well, "very generic" in
this case means "has at least one other independent user"), explaining
the very generic naming.

But at the same time, the particular _patch_ was meant to be very targeted.

So slightly schizophrenic name choices as a result.

             Linus

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: mm: delay rmap removal until after TLB flush
  2022-11-03 17:09                                                         ` Linus Torvalds
@ 2022-11-03 17:36                                                           ` David Hildenbrand
  0 siblings, 0 replies; 148+ messages in thread
From: David Hildenbrand @ 2022-11-03 17:36 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, Will Deacon, Aneesh Kumar, Nick Piggin,
	Heiko Carstens, Vasily Gorbik, Alexander Gordeev,
	Christian Borntraeger, Sven Schnelle, Nadav Amit, Jann Horn,
	John Hubbard, X86 ML, Matthew Wilcox, Andrew Morton, kernel list,
	Linux-MM, Andrea Arcangeli, Kirill A . Shutemov, Joerg Roedel,
	Uros Bizjak, Alistair Popple, linux-arch

On 03.11.22 18:09, Linus Torvalds wrote:
> On Thu, Nov 3, 2022 at 9:54 AM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>> But again, those changes would have made the patch bigger, which I
>> didn't want at this point (and 'release_pages()' would need that
>> clean-in-place anyway, unless we changed *that* too and made the whole
>> page encoding be something widely available).
> 
> And just to clarify: this is not just me trying to expand the reach of my patch.
> 
> I'd suggest people look at mlock_pagevec(), and realize that LRU_PAGE
> and NEW_PAGE are both *exactly* the same kind of "encoded_page" bits
> that TLB_ZAP_RMAP is.
> 
> Except the mlock code does *not* show that in the type system, and
> instead just passes a "struct page **" array around in pvec->pages,
> and then you'd just better know that "oh, it's not *really* just a
> page pointer".
> 
> So I really think that the "array of encoded page pointers" thing is a
> generic notion that we *already* have.
> 
> It's just that we've done it disgustingly in the past, and I didn't
> want to do that disgusting thing again.
> 
> So I would hope that the nasty things that the mlock code does would
> some day use the same page pointer encoding logic to actually make the
> whole "this is not a page pointer that you can use directly, it has
> low bits set for flags" very explicit.
> 
> I am *not* sure if then the actual encoded bits would be unified.
> Probably not - you might have very different and distinct uses of the
> encode_page() thing where the bits mean different things in different
> contexts.
> 
> Anyway, this is me just explaining the thinking behind it all. The
> page bit encoding is a very generic thing (well, "very generic" in
> this case means "has at least one other independent user"), explaining
> the very generic naming.
> 
> But at the same time, the particular _patch_ was meant to be very targeted.
> 
> So slightly schizophrenic name choices as a result.

Thanks for the explanation. I brought it up because the generic name 
somehow felt weird in include/asm-generic/tlb.h. Skimming over the code 
I'd have expected something like TLB_ENCODE_PAGE_BITS, so making the 
"very generic" things "very specific" as long as it lives in tlb.h :)

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 11/13] x86_64: Remove pointless set_64bit() usage
  2022-10-22 11:14 ` [PATCH 11/13] x86_64: Remove pointless set_64bit() usage Peter Zijlstra
  2022-10-22 17:55   ` Linus Torvalds
@ 2022-11-03 19:09   ` Nathan Chancellor
  2022-11-03 19:23     ` Uros Bizjak
  1 sibling, 1 reply; 148+ messages in thread
From: Nathan Chancellor @ 2022-11-03 19:09 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: x86, willy, torvalds, akpm, linux-kernel, linux-mm, aarcange,
	kirill.shutemov, jroedel, ubizjak

Hi Peter,

On Sat, Oct 22, 2022 at 01:14:14PM +0200, Peter Zijlstra wrote:
> The use of set_64bit() in X86_64 only code is pretty pointless, seeing
> how it's a direct assignment. Remove all this nonsense.
> 
> Additionally, since x86_64 unconditionally has HAVE_CMPXCHG_DOUBLE,
> there is no point in even having that fallback.
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  arch/um/include/asm/pgtable-3level.h |    8 --------
>  arch/x86/include/asm/cmpxchg_64.h    |    5 -----
>  drivers/iommu/intel/irq_remapping.c  |   10 ++--------
>  3 files changed, 2 insertions(+), 21 deletions(-)
> 
> --- a/arch/um/include/asm/pgtable-3level.h
> +++ b/arch/um/include/asm/pgtable-3level.h
> @@ -58,11 +58,7 @@
>  #define pud_populate(mm, pud, pmd) \
>  	set_pud(pud, __pud(_PAGE_TABLE + __pa(pmd)))
>  
> -#ifdef CONFIG_64BIT
> -#define set_pud(pudptr, pudval) set_64bit((u64 *) (pudptr), pud_val(pudval))
> -#else
>  #define set_pud(pudptr, pudval) (*(pudptr) = (pudval))
> -#endif
>  
>  static inline int pgd_newpage(pgd_t pgd)
>  {
> @@ -71,11 +67,7 @@ static inline int pgd_newpage(pgd_t pgd)
>  
>  static inline void pgd_mkuptodate(pgd_t pgd) { pgd_val(pgd) &= ~_PAGE_NEWPAGE; }
>  
> -#ifdef CONFIG_64BIT
> -#define set_pmd(pmdptr, pmdval) set_64bit((u64 *) (pmdptr), pmd_val(pmdval))
> -#else
>  #define set_pmd(pmdptr, pmdval) (*(pmdptr) = (pmdval))
> -#endif
>  
>  static inline void pud_clear (pud_t *pud)
>  {
> --- a/arch/x86/include/asm/cmpxchg_64.h
> +++ b/arch/x86/include/asm/cmpxchg_64.h
> @@ -2,11 +2,6 @@
>  #ifndef _ASM_X86_CMPXCHG_64_H
>  #define _ASM_X86_CMPXCHG_64_H
>  
> -static inline void set_64bit(volatile u64 *ptr, u64 val)
> -{
> -	*ptr = val;
> -}
> -
>  #define arch_cmpxchg64(ptr, o, n)					\
>  ({									\
>  	BUILD_BUG_ON(sizeof(*(ptr)) != 8);				\
> --- a/drivers/iommu/intel/irq_remapping.c
> +++ b/drivers/iommu/intel/irq_remapping.c
> @@ -173,7 +173,6 @@ static int modify_irte(struct irq_2_iomm
>  	index = irq_iommu->irte_index + irq_iommu->sub_handle;
>  	irte = &iommu->ir_table->base[index];
>  
> -#if defined(CONFIG_HAVE_CMPXCHG_DOUBLE)
>  	if ((irte->pst == 1) || (irte_modified->pst == 1)) {
>  		bool ret;
>  
> @@ -187,11 +186,6 @@ static int modify_irte(struct irq_2_iomm
>  		 * same as the old value.
>  		 */
>  		WARN_ON(!ret);
> -	} else
> -#endif
> -	{
> -		set_64bit(&irte->low, irte_modified->low);
> -		set_64bit(&irte->high, irte_modified->high);
>  	}
>  	__iommu_flush_cache(iommu, irte, sizeof(*irte));
>  
> @@ -249,8 +243,8 @@ static int clear_entries(struct irq_2_io
>  	end = start + (1 << irq_iommu->irte_mask);
>  
>  	for (entry = start; entry < end; entry++) {
> -		set_64bit(&entry->low, 0);
> -		set_64bit(&entry->high, 0);
> +		WRITE_ONCE(entry->low, 0);
> +		WRITE_ONCE(entry->high, 0);
>  	}
>  	bitmap_release_region(iommu->ir_table->bitmap, index,
>  			      irq_iommu->irte_mask);
> 
> 

This commit is now in -next as commit 0475a2d10fc7 ("x86_64: Remove
pointless set_64bit() usage") and I just bisected a boot failure on my
Intel test desktop to it.

# bad: [81214a573d19ae2fa5b528286ba23cd1cb17feec] Add linux-next specific files for 20221103
# good: [8e5423e991e8cd0988d0c4a3f4ac4ca1af7d148a] Merge tag 'parisc-for-6.1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux
git bisect start '81214a573d19ae2fa5b528286ba23cd1cb17feec' '8e5423e991e8cd0988d0c4a3f4ac4ca1af7d148a'
# good: [8c13089d26d070fef87a64b48191cb7ae6dfbdb2] Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/cryptodev-2.6.git
git bisect good 8c13089d26d070fef87a64b48191cb7ae6dfbdb2
# bad: [1bba8e9d15551d2f1c304d8f9d5c647a5b54bfc0] Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git
git bisect bad 1bba8e9d15551d2f1c304d8f9d5c647a5b54bfc0
# good: [748c419c7ade509684ce5bcf74f50e13e0447afd] Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/sound.git
git bisect good 748c419c7ade509684ce5bcf74f50e13e0447afd
# good: [0acc81a3bf9f875c5ef03037ff5431d37f536f05] Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux.git
git bisect good 0acc81a3bf9f875c5ef03037ff5431d37f536f05
# bad: [c0fb84e0698d2ce57f9391c7f4112f6e17676f99] Merge branch into tip/master: 'x86/cleanups'
git bisect bad c0fb84e0698d2ce57f9391c7f4112f6e17676f99
# good: [7212c34aac1ec6abadf8b439824c8307ef0dd338] Merge branch 'x86/core' into x86/paravirt, to resolve conflicts
git bisect good 7212c34aac1ec6abadf8b439824c8307ef0dd338
# good: [e1f2ac1d285d963a783a027a1b109420b07f30c1] Merge branch into tip/master: 'x86/cpu'
git bisect good e1f2ac1d285d963a783a027a1b109420b07f30c1
# good: [306b75edbf25b86fe8189a4f96c217e49483f8ae] Merge branch into tip/master: 'x86/cleanups'
git bisect good 306b75edbf25b86fe8189a4f96c217e49483f8ae
# good: [8f28b415703e1935457a4bf0be7f03dc5471d09f] mm: Rename GUP_GET_PTE_LOW_HIGH
git bisect good 8f28b415703e1935457a4bf0be7f03dc5471d09f
# bad: [0475a2d10fc7ced3268cd0f0551390b5858f90cd] x86_64: Remove pointless set_64bit() usage
git bisect bad 0475a2d10fc7ced3268cd0f0551390b5858f90cd
# good: [a677802d5b0258f93f54620e1cd181b56547c36c] x86/mm/pae: Don't (ab)use atomic64
git bisect good a677802d5b0258f93f54620e1cd181b56547c36c
# good: [533627610ae7709572a4fac1393fb61153e2a5b3] x86/mm/pae: Be consistent with pXXp_get_and_clear()
git bisect good 533627610ae7709572a4fac1393fb61153e2a5b3
# first bad commit: [0475a2d10fc7ced3268cd0f0551390b5858f90cd] x86_64: Remove pointless set_64bit() usage
# good: [533627610ae7709572a4fac1393fb61153e2a5b3] x86/mm/pae: Be consistent with pXXp_get_and_clear()
git bisect good 533627610ae7709572a4fac1393fb61153e2a5b3
# first bad commit: [0475a2d10fc7ced3268cd0f0551390b5858f90cd] x86_64: Remove pointless set_64bit() usage

Unfortunately, I see no output on the screen it is attached to so I
assume it is happening pretty early during the boot sequence, which will
probably make getting logs somewhat hard. I can provide information
about the system if that would help reveal anything. If there is
anything I can test, I am more than happy to do so.

Cheers,
Nathan

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 11/13] x86_64: Remove pointless set_64bit() usage
  2022-11-03 19:09   ` Nathan Chancellor
@ 2022-11-03 19:23     ` Uros Bizjak
  2022-11-03 19:35       ` Nathan Chancellor
  0 siblings, 1 reply; 148+ messages in thread
From: Uros Bizjak @ 2022-11-03 19:23 UTC (permalink / raw)
  To: Nathan Chancellor
  Cc: Peter Zijlstra, x86, willy, torvalds, akpm, linux-kernel,
	linux-mm, aarcange, kirill.shutemov, jroedel

On Thu, Nov 3, 2022 at 8:09 PM Nathan Chancellor <nathan@kernel.org> wrote:
>
> Hi Peter,
>
> On Sat, Oct 22, 2022 at 01:14:14PM +0200, Peter Zijlstra wrote:
> > The use of set_64bit() in X86_64 only code is pretty pointless, seeing
> > how it's a direct assignment. Remove all this nonsense.
> >
> > Additionally, since x86_64 unconditionally has HAVE_CMPXCHG_DOUBLE,
> > there is no point in even having that fallback.
> >
> > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > ---
> >  arch/um/include/asm/pgtable-3level.h |    8 --------
> >  arch/x86/include/asm/cmpxchg_64.h    |    5 -----
> >  drivers/iommu/intel/irq_remapping.c  |   10 ++--------
> >  3 files changed, 2 insertions(+), 21 deletions(-)
> >
> > --- a/arch/um/include/asm/pgtable-3level.h
> > +++ b/arch/um/include/asm/pgtable-3level.h
> > @@ -58,11 +58,7 @@
> >  #define pud_populate(mm, pud, pmd) \
> >       set_pud(pud, __pud(_PAGE_TABLE + __pa(pmd)))
> >
> > -#ifdef CONFIG_64BIT
> > -#define set_pud(pudptr, pudval) set_64bit((u64 *) (pudptr), pud_val(pudval))
> > -#else
> >  #define set_pud(pudptr, pudval) (*(pudptr) = (pudval))
> > -#endif
> >
> >  static inline int pgd_newpage(pgd_t pgd)
> >  {
> > @@ -71,11 +67,7 @@ static inline int pgd_newpage(pgd_t pgd)
> >
> >  static inline void pgd_mkuptodate(pgd_t pgd) { pgd_val(pgd) &= ~_PAGE_NEWPAGE; }
> >
> > -#ifdef CONFIG_64BIT
> > -#define set_pmd(pmdptr, pmdval) set_64bit((u64 *) (pmdptr), pmd_val(pmdval))
> > -#else
> >  #define set_pmd(pmdptr, pmdval) (*(pmdptr) = (pmdval))
> > -#endif
> >
> >  static inline void pud_clear (pud_t *pud)
> >  {
> > --- a/arch/x86/include/asm/cmpxchg_64.h
> > +++ b/arch/x86/include/asm/cmpxchg_64.h
> > @@ -2,11 +2,6 @@
> >  #ifndef _ASM_X86_CMPXCHG_64_H
> >  #define _ASM_X86_CMPXCHG_64_H
> >
> > -static inline void set_64bit(volatile u64 *ptr, u64 val)
> > -{
> > -     *ptr = val;
> > -}
> > -
> >  #define arch_cmpxchg64(ptr, o, n)                                    \
> >  ({                                                                   \
> >       BUILD_BUG_ON(sizeof(*(ptr)) != 8);                              \
> > --- a/drivers/iommu/intel/irq_remapping.c
> > +++ b/drivers/iommu/intel/irq_remapping.c
> > @@ -173,7 +173,6 @@ static int modify_irte(struct irq_2_iomm
> >       index = irq_iommu->irte_index + irq_iommu->sub_handle;
> >       irte = &iommu->ir_table->base[index];
> >
> > -#if defined(CONFIG_HAVE_CMPXCHG_DOUBLE)
> >       if ((irte->pst == 1) || (irte_modified->pst == 1)) {
> >               bool ret;
> >
> > @@ -187,11 +186,6 @@ static int modify_irte(struct irq_2_iomm
> >                * same as the old value.
> >                */
> >               WARN_ON(!ret);
> > -     } else
> > -#endif
> > -     {
> > -             set_64bit(&irte->low, irte_modified->low);
> > -             set_64bit(&irte->high, irte_modified->high);
> >       }
> >       __iommu_flush_cache(iommu, irte, sizeof(*irte));

It looks to me that the above part should not be removed, but
set_64bit should be substituted with WRITE_ONCE. Only #if/#endif lines
should be removed.

Uros.

> >
> > @@ -249,8 +243,8 @@ static int clear_entries(struct irq_2_io
> >       end = start + (1 << irq_iommu->irte_mask);
> >
> >       for (entry = start; entry < end; entry++) {
> > -             set_64bit(&entry->low, 0);
> > -             set_64bit(&entry->high, 0);
> > +             WRITE_ONCE(entry->low, 0);
> > +             WRITE_ONCE(entry->high, 0);
> >       }
> >       bitmap_release_region(iommu->ir_table->bitmap, index,
> >                             irq_iommu->irte_mask);
> >
> >
>
> This commit is now in -next as commit 0475a2d10fc7 ("x86_64: Remove
> pointless set_64bit() usage") and I just bisected a boot failure on my
> Intel test desktop to it.
>
> # bad: [81214a573d19ae2fa5b528286ba23cd1cb17feec] Add linux-next specific files for 20221103
> # good: [8e5423e991e8cd0988d0c4a3f4ac4ca1af7d148a] Merge tag 'parisc-for-6.1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux
> git bisect start '81214a573d19ae2fa5b528286ba23cd1cb17feec' '8e5423e991e8cd0988d0c4a3f4ac4ca1af7d148a'
> # good: [8c13089d26d070fef87a64b48191cb7ae6dfbdb2] Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/cryptodev-2.6.git
> git bisect good 8c13089d26d070fef87a64b48191cb7ae6dfbdb2
> # bad: [1bba8e9d15551d2f1c304d8f9d5c647a5b54bfc0] Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git
> git bisect bad 1bba8e9d15551d2f1c304d8f9d5c647a5b54bfc0
> # good: [748c419c7ade509684ce5bcf74f50e13e0447afd] Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/sound.git
> git bisect good 748c419c7ade509684ce5bcf74f50e13e0447afd
> # good: [0acc81a3bf9f875c5ef03037ff5431d37f536f05] Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux.git
> git bisect good 0acc81a3bf9f875c5ef03037ff5431d37f536f05
> # bad: [c0fb84e0698d2ce57f9391c7f4112f6e17676f99] Merge branch into tip/master: 'x86/cleanups'
> git bisect bad c0fb84e0698d2ce57f9391c7f4112f6e17676f99
> # good: [7212c34aac1ec6abadf8b439824c8307ef0dd338] Merge branch 'x86/core' into x86/paravirt, to resolve conflicts
> git bisect good 7212c34aac1ec6abadf8b439824c8307ef0dd338
> # good: [e1f2ac1d285d963a783a027a1b109420b07f30c1] Merge branch into tip/master: 'x86/cpu'
> git bisect good e1f2ac1d285d963a783a027a1b109420b07f30c1
> # good: [306b75edbf25b86fe8189a4f96c217e49483f8ae] Merge branch into tip/master: 'x86/cleanups'
> git bisect good 306b75edbf25b86fe8189a4f96c217e49483f8ae
> # good: [8f28b415703e1935457a4bf0be7f03dc5471d09f] mm: Rename GUP_GET_PTE_LOW_HIGH
> git bisect good 8f28b415703e1935457a4bf0be7f03dc5471d09f
> # bad: [0475a2d10fc7ced3268cd0f0551390b5858f90cd] x86_64: Remove pointless set_64bit() usage
> git bisect bad 0475a2d10fc7ced3268cd0f0551390b5858f90cd
> # good: [a677802d5b0258f93f54620e1cd181b56547c36c] x86/mm/pae: Don't (ab)use atomic64
> git bisect good a677802d5b0258f93f54620e1cd181b56547c36c
> # good: [533627610ae7709572a4fac1393fb61153e2a5b3] x86/mm/pae: Be consistent with pXXp_get_and_clear()
> git bisect good 533627610ae7709572a4fac1393fb61153e2a5b3
> # first bad commit: [0475a2d10fc7ced3268cd0f0551390b5858f90cd] x86_64: Remove pointless set_64bit() usage
> # good: [533627610ae7709572a4fac1393fb61153e2a5b3] x86/mm/pae: Be consistent with pXXp_get_and_clear()
> git bisect good 533627610ae7709572a4fac1393fb61153e2a5b3
> # first bad commit: [0475a2d10fc7ced3268cd0f0551390b5858f90cd] x86_64: Remove pointless set_64bit() usage
>
> Unfortunately, I see no output on the screen it is attached to so I
> assume it is happening pretty early during the boot sequence, which will
> probably make getting logs somewhat hard. I can provide information
> about the system if that would help reveal anything. If there is
> anything I can test, I am more than happy to do so.
>
> Cheers,
> Nathan

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 11/13] x86_64: Remove pointless set_64bit() usage
  2022-11-03 19:23     ` Uros Bizjak
@ 2022-11-03 19:35       ` Nathan Chancellor
  2022-11-03 20:39         ` Linus Torvalds
  0 siblings, 1 reply; 148+ messages in thread
From: Nathan Chancellor @ 2022-11-03 19:35 UTC (permalink / raw)
  To: Uros Bizjak
  Cc: Peter Zijlstra, x86, willy, torvalds, akpm, linux-kernel,
	linux-mm, aarcange, kirill.shutemov, jroedel

On Thu, Nov 03, 2022 at 08:23:41PM +0100, Uros Bizjak wrote:
> On Thu, Nov 3, 2022 at 8:09 PM Nathan Chancellor <nathan@kernel.org> wrote:
> >
> > Hi Peter,
> >
> > On Sat, Oct 22, 2022 at 01:14:14PM +0200, Peter Zijlstra wrote:
> > > The use of set_64bit() in X86_64 only code is pretty pointless, seeing
> > > how it's a direct assignment. Remove all this nonsense.
> > >
> > > Additionally, since x86_64 unconditionally has HAVE_CMPXCHG_DOUBLE,
> > > there is no point in even having that fallback.
> > >
> > > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > > ---
> > >  arch/um/include/asm/pgtable-3level.h |    8 --------
> > >  arch/x86/include/asm/cmpxchg_64.h    |    5 -----
> > >  drivers/iommu/intel/irq_remapping.c  |   10 ++--------
> > >  3 files changed, 2 insertions(+), 21 deletions(-)
> > >
> > > --- a/arch/um/include/asm/pgtable-3level.h
> > > +++ b/arch/um/include/asm/pgtable-3level.h
> > > @@ -58,11 +58,7 @@
> > >  #define pud_populate(mm, pud, pmd) \
> > >       set_pud(pud, __pud(_PAGE_TABLE + __pa(pmd)))
> > >
> > > -#ifdef CONFIG_64BIT
> > > -#define set_pud(pudptr, pudval) set_64bit((u64 *) (pudptr), pud_val(pudval))
> > > -#else
> > >  #define set_pud(pudptr, pudval) (*(pudptr) = (pudval))
> > > -#endif
> > >
> > >  static inline int pgd_newpage(pgd_t pgd)
> > >  {
> > > @@ -71,11 +67,7 @@ static inline int pgd_newpage(pgd_t pgd)
> > >
> > >  static inline void pgd_mkuptodate(pgd_t pgd) { pgd_val(pgd) &= ~_PAGE_NEWPAGE; }
> > >
> > > -#ifdef CONFIG_64BIT
> > > -#define set_pmd(pmdptr, pmdval) set_64bit((u64 *) (pmdptr), pmd_val(pmdval))
> > > -#else
> > >  #define set_pmd(pmdptr, pmdval) (*(pmdptr) = (pmdval))
> > > -#endif
> > >
> > >  static inline void pud_clear (pud_t *pud)
> > >  {
> > > --- a/arch/x86/include/asm/cmpxchg_64.h
> > > +++ b/arch/x86/include/asm/cmpxchg_64.h
> > > @@ -2,11 +2,6 @@
> > >  #ifndef _ASM_X86_CMPXCHG_64_H
> > >  #define _ASM_X86_CMPXCHG_64_H
> > >
> > > -static inline void set_64bit(volatile u64 *ptr, u64 val)
> > > -{
> > > -     *ptr = val;
> > > -}
> > > -
> > >  #define arch_cmpxchg64(ptr, o, n)                                    \
> > >  ({                                                                   \
> > >       BUILD_BUG_ON(sizeof(*(ptr)) != 8);                              \
> > > --- a/drivers/iommu/intel/irq_remapping.c
> > > +++ b/drivers/iommu/intel/irq_remapping.c
> > > @@ -173,7 +173,6 @@ static int modify_irte(struct irq_2_iomm
> > >       index = irq_iommu->irte_index + irq_iommu->sub_handle;
> > >       irte = &iommu->ir_table->base[index];
> > >
> > > -#if defined(CONFIG_HAVE_CMPXCHG_DOUBLE)
> > >       if ((irte->pst == 1) || (irte_modified->pst == 1)) {
> > >               bool ret;
> > >
> > > @@ -187,11 +186,6 @@ static int modify_irte(struct irq_2_iomm
> > >                * same as the old value.
> > >                */
> > >               WARN_ON(!ret);
> > > -     } else
> > > -#endif
> > > -     {
> > > -             set_64bit(&irte->low, irte_modified->low);
> > > -             set_64bit(&irte->high, irte_modified->high);
> > >       }
> > >       __iommu_flush_cache(iommu, irte, sizeof(*irte));
> 
> It looks to me that the above part should not be removed, but
> set_64bit should be substituted with WRITE_ONCE. Only #if/#endif lines
> should be removed.

Thanks, I also realized that only a couple minutes after I sent my
initial message. I just got done testing the following diff, which
resolves my issue. Peter, would you like a formal patch or did you want
to squash it into the original change?

diff --git a/drivers/iommu/intel/irq_remapping.c b/drivers/iommu/intel/irq_remapping.c
index 4216cafd67e7..5d176168bb76 100644
--- a/drivers/iommu/intel/irq_remapping.c
+++ b/drivers/iommu/intel/irq_remapping.c
@@ -186,6 +186,9 @@ static int modify_irte(struct irq_2_iommu *irq_iommu,
 		 * same as the old value.
 		 */
 		WARN_ON(!ret);
+	} else {
+		WRITE_ONCE(irte->low, irte_modified->low);
+		WRITE_ONCE(irte->high, irte_modified->high);
 	}
 	__iommu_flush_cache(iommu, irte, sizeof(*irte));
 

> > >
> > > @@ -249,8 +243,8 @@ static int clear_entries(struct irq_2_io
> > >       end = start + (1 << irq_iommu->irte_mask);
> > >
> > >       for (entry = start; entry < end; entry++) {
> > > -             set_64bit(&entry->low, 0);
> > > -             set_64bit(&entry->high, 0);
> > > +             WRITE_ONCE(entry->low, 0);
> > > +             WRITE_ONCE(entry->high, 0);
> > >       }
> > >       bitmap_release_region(iommu->ir_table->bitmap, index,
> > >                             irq_iommu->irte_mask);
> > >
> > >
> >
> > This commit is now in -next as commit 0475a2d10fc7 ("x86_64: Remove
> > pointless set_64bit() usage") and I just bisected a boot failure on my
> > Intel test desktop to it.
> >
> > # bad: [81214a573d19ae2fa5b528286ba23cd1cb17feec] Add linux-next specific files for 20221103
> > # good: [8e5423e991e8cd0988d0c4a3f4ac4ca1af7d148a] Merge tag 'parisc-for-6.1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux
> > git bisect start '81214a573d19ae2fa5b528286ba23cd1cb17feec' '8e5423e991e8cd0988d0c4a3f4ac4ca1af7d148a'
> > # good: [8c13089d26d070fef87a64b48191cb7ae6dfbdb2] Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/cryptodev-2.6.git
> > git bisect good 8c13089d26d070fef87a64b48191cb7ae6dfbdb2
> > # bad: [1bba8e9d15551d2f1c304d8f9d5c647a5b54bfc0] Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git
> > git bisect bad 1bba8e9d15551d2f1c304d8f9d5c647a5b54bfc0
> > # good: [748c419c7ade509684ce5bcf74f50e13e0447afd] Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/sound.git
> > git bisect good 748c419c7ade509684ce5bcf74f50e13e0447afd
> > # good: [0acc81a3bf9f875c5ef03037ff5431d37f536f05] Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux.git
> > git bisect good 0acc81a3bf9f875c5ef03037ff5431d37f536f05
> > # bad: [c0fb84e0698d2ce57f9391c7f4112f6e17676f99] Merge branch into tip/master: 'x86/cleanups'
> > git bisect bad c0fb84e0698d2ce57f9391c7f4112f6e17676f99
> > # good: [7212c34aac1ec6abadf8b439824c8307ef0dd338] Merge branch 'x86/core' into x86/paravirt, to resolve conflicts
> > git bisect good 7212c34aac1ec6abadf8b439824c8307ef0dd338
> > # good: [e1f2ac1d285d963a783a027a1b109420b07f30c1] Merge branch into tip/master: 'x86/cpu'
> > git bisect good e1f2ac1d285d963a783a027a1b109420b07f30c1
> > # good: [306b75edbf25b86fe8189a4f96c217e49483f8ae] Merge branch into tip/master: 'x86/cleanups'
> > git bisect good 306b75edbf25b86fe8189a4f96c217e49483f8ae
> > # good: [8f28b415703e1935457a4bf0be7f03dc5471d09f] mm: Rename GUP_GET_PTE_LOW_HIGH
> > git bisect good 8f28b415703e1935457a4bf0be7f03dc5471d09f
> > # bad: [0475a2d10fc7ced3268cd0f0551390b5858f90cd] x86_64: Remove pointless set_64bit() usage
> > git bisect bad 0475a2d10fc7ced3268cd0f0551390b5858f90cd
> > # good: [a677802d5b0258f93f54620e1cd181b56547c36c] x86/mm/pae: Don't (ab)use atomic64
> > git bisect good a677802d5b0258f93f54620e1cd181b56547c36c
> > # good: [533627610ae7709572a4fac1393fb61153e2a5b3] x86/mm/pae: Be consistent with pXXp_get_and_clear()
> > git bisect good 533627610ae7709572a4fac1393fb61153e2a5b3
> > # first bad commit: [0475a2d10fc7ced3268cd0f0551390b5858f90cd] x86_64: Remove pointless set_64bit() usage
> > # good: [533627610ae7709572a4fac1393fb61153e2a5b3] x86/mm/pae: Be consistent with pXXp_get_and_clear()
> > git bisect good 533627610ae7709572a4fac1393fb61153e2a5b3
> > # first bad commit: [0475a2d10fc7ced3268cd0f0551390b5858f90cd] x86_64: Remove pointless set_64bit() usage
> >
> > Unfortunately, I see no output on the screen it is attached to so I
> > assume it is happening pretty early during the boot sequence, which will
> > probably make getting logs somewhat hard. I can provide information
> > about the system if that would help reveal anything. If there is
> > anything I can test, I am more than happy to do so.
> >
> > Cheers,
> > Nathan

^ permalink raw reply related	[flat|nested] 148+ messages in thread

* Re: [PATCH 11/13] x86_64: Remove pointless set_64bit() usage
  2022-11-03 19:35       ` Nathan Chancellor
@ 2022-11-03 20:39         ` Linus Torvalds
  2022-11-03 21:06           ` Peter Zijlstra
  2022-11-04 16:01           ` Peter Zijlstra
  0 siblings, 2 replies; 148+ messages in thread
From: Linus Torvalds @ 2022-11-03 20:39 UTC (permalink / raw)
  To: Nathan Chancellor
  Cc: Uros Bizjak, Peter Zijlstra, x86, willy, akpm, linux-kernel,
	linux-mm, aarcange, kirill.shutemov, jroedel

On Thu, Nov 3, 2022 at 12:36 PM Nathan Chancellor <nathan@kernel.org> wrote:
>
> Thanks, I also realized that only a couple minutes after I sent my
> initial message. I just got done testing the following diff, which
> resolves my issue.

That looks obviously correct.

Except in this case "obviously correct patch" is to some very
non-obvious code, and I think the whole code around it is very very
questionable.

I had to actually go check that this code can only be enabled on
x86-64 (because "IRQ_REMAP" has a "depends on X86_64" on it), because
it also uses cmpxchg_double and that now exists on x86-32 too (but
only does 64 bits, not 128 bits, of course).

Now, to make things even more confusing, I think that

    #if defined(CONFIG_HAVE_CMPXCHG_DOUBLE)

has *never* made sense, since it's always enabled for x86.

HOWEVER - there were actually early AMD x86-64 machines that didn't
have CMPXCHG16B. So the conditional kind of makes sense, but doing it
based on CONFIG_HAVE_CMPXCHG_DOUBLE does not.

It turns out that we do do this all correctly, except we do it at boot
time, with a test for boot_cpu_has(X86_FEATURE_CX16):

        /*
         * Note: GA (128-bit IRTE) mode requires cmpxchg16b supports.
         * XT, GAM also requires GA mode. Therefore, we need to
         * check cmpxchg16b support before enabling them.
         */
        if (!boot_cpu_has(X86_FEATURE_CX16) ||
              ...

but that #ifdef has apparently never been valid (I didn't go back and
see if we at some point had a config entry for those old CPUs).

And even after I checked *that*, I then looked at 'struct irte' to
check that it's actually properly defined, and it isn't. Considering
that this all requires 16-byte alignment to work, I think that type
should also be marked as being 16-byte aligned.

In fact, I wonder if we should aim to actually force compile-time
checking, because right now we have

        VM_BUG_ON((unsigned long)(p1) % (2 * sizeof(long)));
        VM_BUG_ON((unsigned long)((p1) + 1) != (unsigned long)(p2));

in our x86-64 __cmpxchg_double() macro, and honestly, that first one
might be better as a compile-time check of __alignof__, and the second
one shouldn't exist at all because our interface shouldn't be using
two different pointers when the only possible use is for one single
aligned value.
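
A compile-time variant of that first check could be as simple as this
sketch (assuming the macro still gets a typed pointer to the whole
aligned object):

    BUILD_BUG_ON(__alignof__(*(p1)) < 2 * sizeof(long));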

If somebody actually wants the old m68k type of "DCAS" that did a
cmpxchg on two actually *different* pointers, we should call it
something else (and that's not what any current architecture does).

So honestly, just looking at this trivially correct patch, I got into
a rat's nest of horribly wrong code. Nasty.

               Linus

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 11/13] x86_64: Remove pointless set_64bit() usage
  2022-11-03 20:39         ` Linus Torvalds
@ 2022-11-03 21:06           ` Peter Zijlstra
  2022-11-04 16:01           ` Peter Zijlstra
  1 sibling, 0 replies; 148+ messages in thread
From: Peter Zijlstra @ 2022-11-03 21:06 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Nathan Chancellor, Uros Bizjak, x86, willy, akpm, linux-kernel,
	linux-mm, aarcange, kirill.shutemov, jroedel

On Thu, Nov 03, 2022 at 01:39:17PM -0700, Linus Torvalds wrote:
> On Thu, Nov 3, 2022 at 12:36 PM Nathan Chancellor <nathan@kernel.org> wrote:
> >
> > Thanks, I also realized that only a couple minutes after I sent my
> > initial message. I just got done testing the following diff, which
> > resolves my issue.
> 
> That looks obviously correct.

I'll force-push tip/x86/mm with the fix from Nathan and I'll try and
look into the rest of the trainwreck tomorrow with the brain awake --
after I figure out that other fail Nathan reported :/

Sorry for breaking all your machines, Nate.

^ permalink raw reply	[flat|nested] 148+ messages in thread

* [tip: x86/mm] mm: Convert __HAVE_ARCH_P..P_GET to the new style
  2022-11-01 12:41     ` Peter Zijlstra
  2022-11-01 17:42       ` Linus Torvalds
  2022-11-02  9:12       ` [tip: x86/mm] mm: Convert __HAVE_ARCH_P..P_GET to the new style tip-bot2 for Peter Zijlstra
@ 2022-11-03 21:15       ` tip-bot2 for Peter Zijlstra
  2022-12-17 18:55       ` tip-bot2 for Peter Zijlstra
  3 siblings, 0 replies; 148+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2022-11-03 21:15 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Linus Torvalds, Peter Zijlstra (Intel), x86, linux-kernel

The following commit has been merged into the x86/mm branch of tip:

Commit-ID:     3301badde43dee7c2a013fbd6479c258366519da
Gitweb:        https://git.kernel.org/tip/3301badde43dee7c2a013fbd6479c258366519da
Author:        Peter Zijlstra <peterz@infradead.org>
AuthorDate:    Tue, 01 Nov 2022 12:53:18 +01:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Thu, 03 Nov 2022 22:04:29 +01:00

mm: Convert __HAVE_ARCH_P..P_GET to the new style

Since __HAVE_ARCH_* style guards have been deprecated in favour of
defining the function name onto itself, convert pxxp_get().

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/Y2EUEBlQXNgaJgoI@hirez.programming.kicks-ass.net
---
 arch/powerpc/include/asm/nohash/32/pgtable.h | 2 +-
 include/linux/pgtable.h                      | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/include/asm/nohash/32/pgtable.h b/arch/powerpc/include/asm/nohash/32/pgtable.h
index 0d40b33..cb1ac02 100644
--- a/arch/powerpc/include/asm/nohash/32/pgtable.h
+++ b/arch/powerpc/include/asm/nohash/32/pgtable.h
@@ -263,7 +263,7 @@ static inline pte_basic_t pte_update(struct mm_struct *mm, unsigned long addr, p
 }
 
 #ifdef CONFIG_PPC_16K_PAGES
-#define __HAVE_ARCH_PTEP_GET
+#define ptep_get ptep_get
 static inline pte_t ptep_get(pte_t *ptep)
 {
 	pte_basic_t val = READ_ONCE(ptep->pte);
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 2334852..70e2a7e 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -291,14 +291,14 @@ static inline void ptep_clear(struct mm_struct *mm, unsigned long addr,
 	ptep_get_and_clear(mm, addr, ptep);
 }
 
-#ifndef __HAVE_ARCH_PTEP_GET
+#ifndef ptep_get
 static inline pte_t ptep_get(pte_t *ptep)
 {
 	return READ_ONCE(*ptep);
 }
 #endif
 
-#ifndef __HAVE_ARCH_PMDP_GET
+#ifndef pmdp_get
 static inline pmd_t pmdp_get(pmd_t *pmdp)
 {
 	return READ_ONCE(*pmdp);

^ permalink raw reply related	[flat|nested] 148+ messages in thread

* Re: mm: delay rmap removal until after TLB flush
  2022-10-31 18:43                                                   ` mm: delay rmap removal until after TLB flush Linus Torvalds
                                                                       ` (3 preceding siblings ...)
  2022-11-03  9:52                                                     ` David Hildenbrand
@ 2022-11-04  6:33                                                     ` Alexander Gordeev
  2022-11-04 17:35                                                       ` Linus Torvalds
  4 siblings, 1 reply; 148+ messages in thread
From: Alexander Gordeev @ 2022-11-04  6:33 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, Will Deacon, Aneesh Kumar, Nick Piggin,
	Heiko Carstens, Vasily Gorbik, Christian Borntraeger,
	Sven Schnelle, Nadav Amit, Jann Horn, John Hubbard, X86 ML,
	Matthew Wilcox, Andrew Morton, kernel list, Linux-MM,
	Andrea Arcangeli, Kirill A . Shutemov, Joerg Roedel, Uros Bizjak,
	Alistair Popple, linux-arch

On Mon, Oct 31, 2022 at 11:43:30AM -0700, Linus Torvalds wrote:
[...]
> If people really want to see the patches in email again, I can do
> that, but most of you already have, and the changes are either trivial
> fixes or the s390 updates.
> 
> For the s390 people that I've now added to the participant list maybe
> the git tree is fine - and the fundamental explanation of the problem
> is in that top-most commit (with the three preceding commits being
> prep-work). Or that link to the thread about this all.

I rather have a question about the generic part (had to master the code quoting).

> static void clean_and_free_pages_and_swap_cache(struct encoded_page **pages, unsigned int nr)
> {
> 	for (unsigned int i = 0; i < nr; i++) {
> 		struct encoded_page *encoded = pages[i];
> 		unsigned int flags = encoded_page_flags(encoded);
> 		if (flags) {
> 			/* Clean the flagged pointer in-place */
> 			struct page *page = encoded_page_ptr(encoded);
> 			pages[i] = encode_page(page, 0);
> 
> 			/* The flag bit being set means that we should zap the rmap */

Why is the TLB_ZAP_RMAP bit not checked explicitly here, like in the s390
version?  (I assume that when/if ENCODE_PAGE_BITS is not just TLB_ZAP_RMAP,
calling page_zap_pte_rmap() without such a check would be a bug.)

> 			page_zap_pte_rmap(page);
> 			VM_WARN_ON_ONCE_PAGE(page_mapcount(page) < 0, page);
> 		}
> 	}
> 
> 	/*
> 	 * Now all entries have been un-encoded, and changed to plain
> 	 * page pointers, so we can cast the 'encoded_page' array to
> 	 * a plain page array and free them
> 	 */
> 	free_pages_and_swap_cache((struct page **)pages, nr);
> }

Thanks!

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 11/13] x86_64: Remove pointless set_64bit() usage
  2022-11-03 20:39         ` Linus Torvalds
  2022-11-03 21:06           ` Peter Zijlstra
@ 2022-11-04 16:01           ` Peter Zijlstra
  2022-11-04 17:15             ` Linus Torvalds
  1 sibling, 1 reply; 148+ messages in thread
From: Peter Zijlstra @ 2022-11-04 16:01 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Nathan Chancellor, Uros Bizjak, x86, willy, akpm, linux-kernel,
	linux-mm, aarcange, kirill.shutemov, jroedel

On Thu, Nov 03, 2022 at 01:39:17PM -0700, Linus Torvalds wrote:

> And even after I checked *that*, I then checked the 'struct irte' to
> check that it's actually properly defined, and it isn't. Considering
> that this all requires 16-byte alignment to work, I think that type
> should also be marked as being 16-byte aligned.
> 
> In fact, I wonder if we should aim to actually force compile-time
> checking, because right now we have
> 
>         VM_BUG_ON((unsigned long)(p1) % (2 * sizeof(long)));
>         VM_BUG_ON((unsigned long)((p1) + 1) != (unsigned long)(p2));
> 
> in our x86-64 __cmpxchg_double() macro, and honestly, that first one
> might be better as a compile time check of __alignof__, and the second
> one shouldn't exist at all because our interface shouldn't be using
> two different pointers when the only possible use is for one single
> aligned value.

So cmpxchg_double() does a cmpxchg on a double long value and is
currently supported by: i386, x86_64, arm64 and s390.

On all those, except i386, two longs are u128.

So how about we introduce u128 and cmpxchg128 -- then it directly
mirrors the u64 and cmpxchg64 usage we already have. It then also
naturally imposes the alignment thing.

Afaict the only cmpxchg_double() users are:

  arch/s390/kernel/perf_cpum_sf.c
  drivers/iommu/amd/iommu.c
  drivers/iommu/intel/irq_remapping.c
  mm/slub.c

Of those slub.c is the only one that cares about 32bit and would need
some 'compat' code to pick between cmpxchg64 / cmpxchg128, but it
already has everything wrapped in helpers so that shouldn't be too big
of a worry.

Then we can convert these few users over and simply delete the whole
cmpxchg_double() thing.
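
Just to sketch the shape of that interface (illustrative only -- the
generic fallback below leans on the compiler builtin, while a real
implementation would use cmpxchg16b / casp / cdsg per architecture):

        /* sketch: cmpxchg128() mirroring cmpxchg64(), returns the old value */
        static inline unsigned __int128
        cmpxchg128(volatile unsigned __int128 *ptr,
                   unsigned __int128 old, unsigned __int128 new)
        {
                __atomic_compare_exchange_n(ptr, &old, new, false,
                                            __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
                return old;
        }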


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 11/13] x86_64: Remove pointless set_64bit() usage
  2022-11-04 16:01           ` Peter Zijlstra
@ 2022-11-04 17:15             ` Linus Torvalds
  2022-11-05 13:29               ` Jason A. Donenfeld
  2022-12-19 15:44               ` Peter Zijlstra
  0 siblings, 2 replies; 148+ messages in thread
From: Linus Torvalds @ 2022-11-04 17:15 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Nathan Chancellor, Uros Bizjak, x86, willy, akpm, linux-kernel,
	linux-mm, aarcange, kirill.shutemov, jroedel

On Fri, Nov 4, 2022 at 9:01 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> So cmpxchg_double() does a cmpxchg on a double long value and is
> currently supported by: i386, x86_64, arm64 and s390.
>
> On all those, except i386, two longs are u128.
>
> So how about we introduce u128 and cmpxchg128 -- then it directly
> mirrors the u64 and cmpxchg64 usage we already have. It then also
> naturally imposes the alignment thing.

Ack, except that we might have some "u128" users that do *not*
necessarily want any alignment thing.

But maybe we could at least start with an u128 type that is marked as
being fully aligned, and if some other user comes in down the line
that wants relaxed alignment we can call it "u128_unaligned" or
something.

That would have avoided the pain we have on x86-32 where "u64" in a
UAPI structure is differently aligned than it is on actual 64-bit
architectures.
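
Concretely, such a pair of types could start out as small as this
(sketch; the relaxed variant would borrow the compat_u64 trick of
lowering alignment via a typedef attribute, if and when a user for it
ever shows up):

        typedef unsigned __int128 u128 __attribute__((__aligned__(16)));
        /* hypothetical, only if somebody needs the relaxed layout: */
        typedef unsigned __int128 u128_unaligned __attribute__((__aligned__(8)));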

> Of those slub.c is the only one that cares about 32bit and would need
> some 'compat' code to pick between cmpxchg64 / cmpxchg128, but it
> already has everything wrapped in helpers so that shouldn't be too big
> of a worry.

Ack. Special case, and we can call it something very clear, and in
fact it would clean up the slab code too if we actually used some
explicit type for this.

Right now, slab does

        /* Double-word boundary */
        void *freelist;         /* first free object */
        union {
                unsigned long counters;
                struct {

where the alignment constraints are just a comment, and then it passes
pointers to the two words around manually.

But it should be fairly straightforward to make it use an actual
properly typed thing, and using an unnamed union member, do something
that takes an explicitly aligned "two word" thing instead.

IOW, make the 'compat' case *not* use those stupid two separate
pointers (that have to be consecutive and aligned), but actually do
something like

        struct cmpxchg_double_long {
                unsigned long a,b;
        } __aligned(2*sizeof(long));

and then the slab case can use that union of that and the existing
"freelist+word-union" to make this all not just a comment to people,
but something the compiler sees too.
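
Roughly like this (purely illustrative -- the wrapper name is made up,
the real fields live inside struct slab):

        struct slab_sketch {
                union {
                        struct {
                                void *freelist;         /* first free object */
                                unsigned long counters;
                        };
                        /* the explicitly aligned two-word cmpxchg target */
                        struct cmpxchg_double_long full;
                };
        };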

            Linus

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: mm: delay rmap removal until after TLB flush
  2022-11-04  6:33                                                     ` Alexander Gordeev
@ 2022-11-04 17:35                                                       ` Linus Torvalds
  2022-11-06 21:06                                                         ` Hugh Dickins
  0 siblings, 1 reply; 148+ messages in thread
From: Linus Torvalds @ 2022-11-04 17:35 UTC (permalink / raw)
  To: Alexander Gordeev
  Cc: Peter Zijlstra, Will Deacon, Aneesh Kumar, Nick Piggin,
	Heiko Carstens, Vasily Gorbik, Christian Borntraeger,
	Sven Schnelle, Nadav Amit, Jann Horn, John Hubbard, X86 ML,
	Matthew Wilcox, Andrew Morton, kernel list, Linux-MM,
	Andrea Arcangeli, Kirill A . Shutemov, Joerg Roedel, Uros Bizjak,
	Alistair Popple, linux-arch

On Thu, Nov 3, 2022 at 11:33 PM Alexander Gordeev
<agordeev@linux.ibm.com> wrote:
>
> I rather have a question about the generic part (had to master the code quoting).

Sure.

Although now I think the series is in Andrew's -mm tree, or just about
to get moved in there, so I'm not going to touch my actual branch any
more.

> > static void clean_and_free_pages_and_swap_cache(struct encoded_page **pages, unsigned int nr)
> > {
> >       for (unsigned int i = 0; i < nr; i++) {
> >               struct encoded_page *encoded = pages[i];
> >               unsigned int flags = encoded_page_flags(encoded);
> >               if (flags) {
> >                       /* Clean the flagged pointer in-place */
> >                       struct page *page = encoded_page_ptr(encoded);
> >                       pages[i] = encode_page(page, 0);
> >
> >                       /* The flag bit being set means that we should zap the rmap */
>
> Why is the TLB_ZAP_RMAP bit not checked explicitly here, like in the s390
> version?  (I assume that when/if ENCODE_PAGE_BITS is not just TLB_ZAP_RMAP,
> calling page_zap_pte_rmap() without such a check would be a bug.)

No major reason. This is basically the same issue as the naming, which
I touched on in

  https://lore.kernel.org/all/CAHk-=wiDg_1up8K4PhK4+kzPN7xJG297=nw+tvgrGn7aVgZdqw@mail.gmail.com/

and the follow-up note about how I hope the "encoded page pointer with
flags" thing gets used by the mlock code some day too.

IOW, there's kind of a generic "I have extra flags associated with the
pointer", and then the specific "this case uses this flag", and
depending on which mindset you have at the time, you might do one or
the other.

So in that clean_and_free_pages_and_swap_cache(), the code basically
knows "I have a pointer with extra flags", and it's written that way.
And that's partly historical, because it actually started with the
original code tracking the dirty bit as the extra piece of
information, and then transformed into this "no, the information is
TLB_ZAP_RMAP".

So "unsigned int flags" at one point was "bool dirty" instead, but
then became more of a "I think this whole page pointer with flags is
general", and the naming changed, and I had both cases in mind, and
then the code is perhaps not so specifically named. I'm not sure the
zap_page_range() case will ever use more than one flag, but the mlock
case already has two different flags. So the "encode_page" thing is
very much written to be about more than just the zap_page_range()
case.

But yes, that code could (and maybe should) use "if (flags &
TLB_ZAP_RMAP)" to make it clear that in this case, the single flags
bit is that one bit.

But the "if ()" there actually does two conceptually *separate*
things: it needs to clean the pointer in-place (which happens regardless
of which flag bit is set), and then it does that page_zap_pte_rmap(),
which is just for the TLB_ZAP_RMAP bit.

So to be really explicit about it, you'd have two different tests: one
for "do I have flags that need to be cleaned up" and then an inner
test for each flag. And since there is only one flag in this
particular use case, it's essentially that inner test that I dropped
as pointless.
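
Spelled out with both tests, the loop body quoted above would read
something like (sketch, same names as in the code above):

        if (flags) {
                /* clean the flagged pointer in-place, whatever the flag was */
                struct page *page = encoded_page_ptr(encoded);
                pages[i] = encode_page(page, 0);

                /* then handle each individual flag explicitly */
                if (flags & TLB_ZAP_RMAP)
                        page_zap_pte_rmap(page);
        }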

In contrast, in the s390 version, that bit was never encoded as a
general "flags associated with a page pointer" in the first place, so
there was never any such duality. There is only TLB_ZAP_RMAP.

Hope that explains the thinking.

                 Linus

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 11/13] x86_64: Remove pointless set_64bit() usage
  2022-11-04 17:15             ` Linus Torvalds
@ 2022-11-05 13:29               ` Jason A. Donenfeld
  2022-11-05 15:14                 ` Peter Zijlstra
  2022-12-19 15:44               ` Peter Zijlstra
  1 sibling, 1 reply; 148+ messages in thread
From: Jason A. Donenfeld @ 2022-11-05 13:29 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, Nathan Chancellor, Uros Bizjak, x86, willy, akpm,
	linux-kernel, linux-mm, aarcange, kirill.shutemov, jroedel

On Fri, Nov 04, 2022 at 10:15:08AM -0700, Linus Torvalds wrote:
> On Fri, Nov 4, 2022 at 9:01 AM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > So cmpxchg_double() does a cmpxchg on a double long value and is
> > currently supported by: i386, x86_64, arm64 and s390.
> >
> > On all those, except i386, two longs are u128.
> >
> > So how about we introduce u128 and cmpxchg128 -- then it directly
> > mirrors the u64 and cmpxchg64 usage we already have. It then also
> > > naturally imposes the alignment thing.
> 
> Ack, except that we might have some "u128" users that do *not*
> necessarily want any alignment thing.
> 
> But maybe we could at least start with an u128 type that is marked as
> being fully aligned, and if some other user comes in down the line
> that wants relaxed alignment we can call it "u128_unaligned" or
> something.

Hm, sounds maybe not so nice for another use case: arithmetic code that
makes use of u128 for efficient computations, but otherwise has
no particular alignment requirements. For example, `typedef __uint128_t
u128;` in:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/lib/crypto/poly1305-donna64.c
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/lib/crypto/curve25519-hacl64.c

I always thought it'd be nice to see that typedef alongside the others
in the shared kernel headers, but figured the requirement for 64-bit and
libgcc for some operations on some architectures made it a bit less
general purpose, so I never proposed it.

Jason

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 11/13] x86_64: Remove pointless set_64bit() usage
  2022-11-05 13:29               ` Jason A. Donenfeld
@ 2022-11-05 15:14                 ` Peter Zijlstra
  2022-11-05 20:54                   ` Jason A. Donenfeld
  2022-11-07  9:14                   ` David Laight
  0 siblings, 2 replies; 148+ messages in thread
From: Peter Zijlstra @ 2022-11-05 15:14 UTC (permalink / raw)
  To: Jason A. Donenfeld
  Cc: Linus Torvalds, Nathan Chancellor, Uros Bizjak, x86, willy, akpm,
	linux-kernel, linux-mm, aarcange, kirill.shutemov, jroedel

On Sat, Nov 05, 2022 at 02:29:47PM +0100, Jason A. Donenfeld wrote:
> On Fri, Nov 04, 2022 at 10:15:08AM -0700, Linus Torvalds wrote:
> > On Fri, Nov 4, 2022 at 9:01 AM Peter Zijlstra <peterz@infradead.org> wrote:
> > >
> > > So cmpxchg_double() does a cmpxchg on a double long value and is
> > > currently supported by: i386, x86_64, arm64 and s390.
> > >
> > > On all those, except i386, two longs are u128.
> > >
> > > So how about we introduce u128 and cmpxchg128 -- then it directly
> > > mirrors the u64 and cmpxchg64 usage we already have. It then also
> > > naturally imposes the alignment thing.
> > 
> > Ack, except that we might have some "u128" users that do *not*
> > necessarily want any alignment thing.
> > 
> > But maybe we could at least start with an u128 type that is marked as
> > being fully aligned, and if some other user comes in down the line
> > that wants relaxed alignment we can call it "u128_unaligned" or
> > something.
> 
> Hm, sounds maybe not so nice for another use case: arithmetic code that
> makes use of u128 for efficient computations, but otherwise has
> no particular alignment requirements. For example, `typedef __uint128_t
> u128;` in:

Natural alignment is... natural. Making it unaligned is quite mad. That
whole u64 is not naturally aligned on i386 thing Linus referred to is a
sodding pain in the backside.

If the code has no alignment requirements, natural alignment is as good
as any. And if it does have requirements, you can use u128_unaligned.

Also:

$ ./align
16, 16

---

#include <stdio.h>

int main(int argx, char **argv)
{
	__int128 a;

	printf("%d, %d\n", sizeof(a), __alignof(a));
	return 0;
}

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 11/13] x86_64: Remove pointless set_64bit() usage
  2022-11-05 15:14                 ` Peter Zijlstra
@ 2022-11-05 20:54                   ` Jason A. Donenfeld
  2022-11-07  9:14                   ` David Laight
  1 sibling, 0 replies; 148+ messages in thread
From: Jason A. Donenfeld @ 2022-11-05 20:54 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Nathan Chancellor, Uros Bizjak, x86, willy, akpm,
	linux-kernel, linux-mm, aarcange, kirill.shutemov, jroedel

On Sat, Nov 05, 2022 at 04:14:19PM +0100, Peter Zijlstra wrote:
> Also:
> 
> $ ./align
> 16, 16
> 
> ---
> 
> #include <stdio.h>
> 
> int main(int argx, char **argv)
> {
> 	__int128 a;
> 
> 	printf("%d, %d\n", sizeof(a), __alignof(a));
> 	return 0;
> }

zx2c4@thinkpad /tmp $ x86_64-linux-musl-gcc -O2 a.c -static && qemu-x86_64 ./a.out
16, 16
zx2c4@thinkpad /tmp $ aarch64-linux-musl-gcc -O2 a.c -static && qemu-aarch64 ./a.out
16, 16
zx2c4@thinkpad /tmp $ powerpc64le-linux-musl-gcc -O2 a.c -static && qemu-ppc64le ./a.out
16, 16
zx2c4@thinkpad /tmp $ s390x-linux-musl-gcc -O2 a.c -static && qemu-s390x ./a.out
16, 8
zx2c4@thinkpad /tmp $ riscv64-linux-musl-gcc -O2 a.c -static && qemu-riscv64 ./a.out
16, 16

Er, yea, you're right. Looks like of these, it's just s390x, so
whatever.

Jason

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: mm: delay rmap removal until after TLB flush
  2022-11-04 17:35                                                       ` Linus Torvalds
@ 2022-11-06 21:06                                                         ` Hugh Dickins
  2022-11-06 22:34                                                           ` Linus Torvalds
                                                                             ` (2 more replies)
  0 siblings, 3 replies; 148+ messages in thread
From: Hugh Dickins @ 2022-11-06 21:06 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Johannes Weiner, Stephen Rothwell, Alexander Gordeev,
	Peter Zijlstra, Will Deacon, Aneesh Kumar, Nick Piggin,
	Heiko Carstens, Vasily Gorbik, Christian Borntraeger,
	Sven Schnelle, Nadav Amit, Jann Horn, John Hubbard, X86 ML,
	Matthew Wilcox, Andrew Morton, kernel list, Linux-MM,
	Andrea Arcangeli, Kirill A . Shutemov, Joerg Roedel, Uros Bizjak,
	Alistair Popple, linux-arch

Adding Stephen to Cc, for next-20221107 alert.
Adding Johannes to Cc, particularly for lock_page_memcg discussion.

On Fri, 4 Nov 2022, Linus Torvalds wrote:
> On Thu, Nov 3, 2022 at 11:33 PM Alexander Gordeev
> <agordeev@linux.ibm.com> wrote:
> >
> > I rather have a question about the generic part (had to master the code quoting).
> 
> Sure.
> 
> Although now I think the series is in Andrew's -mm tree, or just about
> to get moved in there, so I'm not going to touch my actual branch any
> more.

Linus, we've been exchanging about my mm/rmap.c mods in private mail,
I need to raise some points about your mods here in public mail.

Four topics - munlock (good), anon_vma (bad), mm-unstable (bad),
lock_page_memcg (uncertain).  I'm asking Andrew here to back your
patchset out of mm-unstable for now, until we have its fixes in:
otherwise I'm worried that next-20221107 will waste everyone's time.

munlock (good)
--------------
You've separated out the munlock_vma_page() from the rest of PTE
remove rmap work.  Worried me at first, but I think that's fine:
bundling it into page_remove_rmap() was mainly so that we wouldn't
forget to do it anywhere (and other rmap funcs already took vma arg).

Certainly during development, I had been using page mapcount somewhere
inside munlock_vma_page(), either for optimization or for sanity check,
I forget; but gave up on that as unnecessary complication, and I think
it became impossible with the pagevec; so not an issue now.

If it were my change, I'd certainly do it by changing the name of the
vma arg in page_remove_rmap() from vma to munlock_vma, not doing the
munlock when that is NULL, and avoid the need for duplicating code:
whereas you're very pleased with cutting out the unnecessary stuff
in slimmed-down page_zap_pte_rmap().  Differing tastes (or perhaps
you'll say taste versus no taste).

anon_vma (bad)
--------------
My first reaction to your patchset was, that it wasn't obvious to me
that delaying the decrementation of mapcount is safe: I needed time to
think about it.  But happily, testing saved me from needing much thought.

The first problem, came immediately in the first iteration of my
huge tmpfs kbuild swapping loads, was BUG at mm/migrate.c:1105!
VM_BUG_ON_FOLIO(folio_test_anon(src) && !folio_test_ksm(src) && !anon_vma, src);
Okay, that's interesting but not necessarily fatal, we can very easily
make it a recoverable condition.  I didn't bother to think on that one,
just patched around it.

The second problem was more significant: came after nearly five hours
of load, BUG NULL dereference (address 8) in down_read_trylock() in
rmap_walk_anon(), while kswapd was doing a folio_referenced().

That one goes right to the heart of the matter, and instinct had been
correct to worry about delaying the decrementation of mapcount, that is,
extending the life of page_mapped() or folio_mapped().

See folio_lock_anon_vma_read(): folio_mapped() plays a key role in
establishing the continued validity of an anon_vma.  See comments
above folio_get_anon_vma(), some by me but most by PeterZ IIRC.

I believe what has happened is that your patchset has, very intentionally,
kept the page as "folio_mapped" until after free_pgtables() does its
unlink_anon_vmas(); but that is telling folio_lock_anon_vma_read()
that the anon_vma is safe to use when actually it has been freed.
(It looked like a page table when I peeped at it.)

I'm not certain, but I think that you made page_zap_pte_rmap() handle
anon as well as file, just for the righteous additional simplification;
but I'm afraid that (without opening a huge anon_vma refcounting can of
worms) that unification has to be reverted, and anon left to go the
same old way it did before.

I didn't look into whether reverting one of your patches would achieve
that, I just adjusted the code in zap_pte_range() to go the old way for
PageAnon; and that has been running successfully, hitting neither BUG,
for 15 hours now.

mm-unstable (bad)
-----------------
Aside from that PageAnon issue, mm-unstable is in an understandably bad
state because you could not have foreseen my subpages_mapcount addition
to page_remove_rmap().  page_zap_pte_rmap() now needs to handle the
PageCompound (but not the "compound") case too.  I rushed you and akpm
an emergency patch for that on Friday night, but you, let's say, had
reservations about it.  So I haven't posted it, and while the PageAnon
issue remains, I think your patchset has to be removed from mm-unstable
and linux-next anyway.

What happens to mm-unstable with page_zap_pte_rmap() not handling
subpages_mapcount?  In my CONFIG_TRANSPARENT_HUGEPAGE_ALWAYS=y case,
I get a "Bad page state" even before reaching first login after boot,
"page dumped because: nonzero subpages_mapcount".  Yes, I make it
worse for myself by having a BUG() in bad_page(); others will limp
along with more and more of those "Bad page" messages.

lock_page_memcg (uncertain)
---------------------------
Johannes, please help! Linus has got quite upset by lock_page_memcg(),
its calls from mm/rmap.c anyway, and most particularly by the way
in which it is called at the start of page_remove_rmap(), before
anyone's critical atomic_add_negative(), yet its use is to guarantee
the stability of page memcg while doing the stats updates, done only
when atomic_add_negative() says so.

I do have one relevant insight on this.  It (or its antecedents under
other names) date from the days when we did "reparenting" of memcg
charges from an LRU: and in those days the lock_page_memcg() before
mapcount adjustment was vital, to pair with the uses of folio_mapped()
or page_mapped() in mem_cgroup_move_account() - those "mapped" checks
are precisely around the stats which the rmap functions affect.

But nowadays mem_cgroup_move_account() is only called, with page table
lock held, on matching pages found in a task's page table: so its
"mapped" checks are redundant - I've sometimes thought in the past of
removing them, but held back, because I always have the notion (not
hope!) that "reparenting" may one day be brought back from the grave.
I'm too out of touch with memcg to know where that notion stands today.

I've gone through a multiverse of opinions on those lock_page_memcg()s
in the last day: I currently believe that Linus is right, that the
lock_page_memcg()s could and should be moved just before the stats
updates.  But I am not 100% certain of that - is there still some
reason why it's important that the page memcg at the instant of the
critical mapcount transition be kept unchanged until the stats are
updated?  I've tried running scenarios through my mind but given up.

(And note that the answer might be different with Linus's changes
than without them: since he delays the mapcount decrementation
until long after pte was removed and page table lock dropped).

And I do wish that we could settle this lock_page_memcg() question
in an entirely separate patch: as it stands, page_zap_pte_rmap()
gets to benefit from Linus's insight (or not), and all the other rmap
functions carry on with the mis?placed lock_page_memcg() as before.

Let's press Send,
Hugh

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: mm: delay rmap removal until after TLB flush
  2022-11-06 21:06                                                         ` Hugh Dickins
@ 2022-11-06 22:34                                                           ` Linus Torvalds
  2022-11-06 23:14                                                             ` Andrew Morton
  2022-11-07  9:12                                                           ` Peter Zijlstra
  2022-11-07 20:07                                                           ` Johannes Weiner
  2 siblings, 1 reply; 148+ messages in thread
From: Linus Torvalds @ 2022-11-06 22:34 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Johannes Weiner, Stephen Rothwell, Alexander Gordeev,
	Peter Zijlstra, Will Deacon, Aneesh Kumar, Nick Piggin,
	Heiko Carstens, Vasily Gorbik, Christian Borntraeger,
	Sven Schnelle, Nadav Amit, Jann Horn, John Hubbard, X86 ML,
	Matthew Wilcox, Andrew Morton, kernel list, Linux-MM,
	Andrea Arcangeli, Kirill A . Shutemov, Joerg Roedel, Uros Bizjak,
	Alistair Popple, linux-arch

[ Editing down to just the bare-bones problem cases ]

On Sun, Nov 6, 2022 at 1:06 PM Hugh Dickins <hughd@google.com> wrote:
>
> anon_vma (bad)
> --------------
>
> See folio_lock_anon_vma_read(): folio_mapped() plays a key role in
> establishing the continued validity of an anon_vma.  See comments
> above folio_get_anon_vma(), some by me but most by PeterZ IIRC.
>
> I believe what has happened is that your patchset has, very intentionally,
> kept the page as "folio_mapped" until after free_pgtables() does its
> unlink_anon_vmas(); but that is telling folio_lock_anon_vma_read()
> that the anon_vma is safe to use when actually it has been freed.
> (It looked like a page table when I peeped at it.)
>
> I'm not certain, but I think that you made page_zap_pte_rmap() handle
> anon as well as file, just for the righteous additional simplification;
> but I'm afraid that (without opening a huge anon_vma refcounting can of
> worms) that unification has to be reverted, and anon left to go the
> same old way it did before.

Indeed. I made them separate initially, just because the only case
that mattered for the dirty bit was the file-mapped case.

But then the two functions ended up being basically the identical
function, so I unified them again.

But the anonvma lifetime issue looks very real, and so doing the
"delay rmap only for file mappings" seems sane.

In fact, I wonder if we should delay it only for *dirty* file
mappings, since it doesn't matter for the clean case.

Hmm.

I already threw away my branch (since Andrew picked the patches up),
so a question for Andrew: do you want me to re-do the branch entirely,
or do you want me to just send you an incremental patch?

To make for minimal changes, I'd drop the 're-unification' patch, and
then small updates to the zap_pte_range() code to keep the anon (and
possibly non-dirty) case synchronous.

And btw, this one is interesting: for anonymous (and non-dirty
file-mapped) pages, we actually can end up delaying the final page
free (and the rmap zapping) all the way to "tlb_finish_mmu()".

Normally we still have the vma's all available, but yes,
free_pgtables() can and does happen before the final TLB flush.

The file-mapped dirty case doesn't have that issue - not just because
it doesn't have an anonvma at all, but because it also does that
"force_flush" thing that just measn that the page freeign never gets
delayed that far in the first place.
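
In zap_pte_range() terms the idea would be roughly this (sketch only,
not the actual patch):

        /*
         * Sketch: only defer the rmap removal for dirty file-mapped pages;
         * anon pages (and clean file pages) stay on the old synchronous
         * page_remove_rmap() path.
         */
        bool delay_rmap = !PageAnon(page) && pte_dirty(ptent);

        if (!delay_rmap)
                page_remove_rmap(page, vma, false);
        /* else: flag the batched pointer and zap the rmap after the flush */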

> mm-unstable (bad)
> -----------------
> Aside from that PageAnon issue, mm-unstable is in an understandably bad
> state because you could not have foreseen my subpages_mapcount addition
> to page_remove_rmap().  page_zap_pte_rmap() now needs to handle the
> PageCompound (but not the "compound") case too.  I rushed you and akpm
> an emergency patch for that on Friday night, but you, let's say, had
> reservations about it.  So I haven't posted it, and while the PageAnon
> issue remains, I think your patchset has to be removed from mm-unstable
> and linux-next anyway.

So I think I'm fine with your patch, I just want to move the memcg
accounting to outside of it.

I can re-do my series on top of mm-unstable, I guess. That's probably
the easiest way to handle this all.

Andrew - can you remove those patches again, and I'll create a new
series for you?

                 Linus

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: mm: delay rmap removal until after TLB flush
  2022-11-06 22:34                                                           ` Linus Torvalds
@ 2022-11-06 23:14                                                             ` Andrew Morton
  2022-11-07  0:06                                                               ` Stephen Rothwell
  2022-11-07 16:19                                                               ` Linus Torvalds
  0 siblings, 2 replies; 148+ messages in thread
From: Andrew Morton @ 2022-11-06 23:14 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Hugh Dickins, Johannes Weiner, Stephen Rothwell,
	Alexander Gordeev, Peter Zijlstra, Will Deacon, Aneesh Kumar,
	Nick Piggin, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, Sven Schnelle, Nadav Amit, Jann Horn,
	John Hubbard, X86 ML, Matthew Wilcox, kernel list, Linux-MM,
	Andrea Arcangeli, Kirill A . Shutemov, Joerg Roedel, Uros Bizjak,
	Alistair Popple, linux-arch

On Sun, 6 Nov 2022 14:34:51 -0800 Linus Torvalds <torvalds@linux-foundation.org> wrote:

> > mm-unstable (bad)
> > -----------------
> > Aside from that PageAnon issue, mm-unstable is in an understandably bad
> > state because you could not have foreseen my subpages_mapcount addition
> > to page_remove_rmap().  page_zap_pte_rmap() now needs to handle the
> > PageCompound (but not the "compound") case too.  I rushed you and akpm
> > an emergency patch for that on Friday night, but you, let's say, had
> > reservations about it.  So I haven't posted it, and while the PageAnon
> > issue remains, I think your patchset has to be removed from mm-unstable
> > and linux-next anyway.
> 
> So I think I'm fine with your patch, I just want to move the memcg
> accounting to outside of it.
> 
> I can re-do my series on top of mm-unstable, I guess. That's probably
> the easiest way to handle this all.
> 
> Andrew - can you remove those patches again, and I'll create a new
> series for you?

Yes, I've removed both serieses and shall push the tree out within half
an hour from now.

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: mm: delay rmap removal until after TLB flush
  2022-11-06 23:14                                                             ` Andrew Morton
@ 2022-11-07  0:06                                                               ` Stephen Rothwell
  2022-11-07 16:19                                                               ` Linus Torvalds
  1 sibling, 0 replies; 148+ messages in thread
From: Stephen Rothwell @ 2022-11-07  0:06 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Hugh Dickins, Johannes Weiner, Alexander Gordeev,
	Peter Zijlstra, Will Deacon, Aneesh Kumar, Nick Piggin,
	Heiko Carstens, Vasily Gorbik, Christian Borntraeger,
	Sven Schnelle, Nadav Amit, Jann Horn, John Hubbard, X86 ML,
	Matthew Wilcox, kernel list, Linux-MM, Andrea Arcangeli,
	Kirill A . Shutemov, Joerg Roedel, Uros Bizjak, Alistair Popple,
	linux-arch

[-- Attachment #1: Type: text/plain, Size: 305 bytes --]

Hi all,

On Sun, 6 Nov 2022 15:14:16 -0800 Andrew Morton <akpm@linux-foundation.org> wrote:
>
> Yes, I've removed both serieses and shall push the tree out within half
> an hour from now.

And I have picked up the new version for today's linux-next.  Thanks all.

-- 
Cheers,
Stephen Rothwell

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: mm: delay rmap removal until after TLB flush
  2022-11-06 21:06                                                         ` Hugh Dickins
  2022-11-06 22:34                                                           ` Linus Torvalds
@ 2022-11-07  9:12                                                           ` Peter Zijlstra
  2022-11-07 20:07                                                           ` Johannes Weiner
  2 siblings, 0 replies; 148+ messages in thread
From: Peter Zijlstra @ 2022-11-07  9:12 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Linus Torvalds, Johannes Weiner, Stephen Rothwell,
	Alexander Gordeev, Will Deacon, Aneesh Kumar, Nick Piggin,
	Heiko Carstens, Vasily Gorbik, Christian Borntraeger,
	Sven Schnelle, Nadav Amit, Jann Horn, John Hubbard, X86 ML,
	Matthew Wilcox, Andrew Morton, kernel list, Linux-MM,
	Andrea Arcangeli, Kirill A . Shutemov, Joerg Roedel, Uros Bizjak,
	Alistair Popple, linux-arch


Hi Hugh!

On Sun, Nov 06, 2022 at 01:06:19PM -0800, Hugh Dickins wrote:
> See folio_lock_anon_vma_read(): folio_mapped() plays a key role in
> establishing the continued validity of an anon_vma.  See comments
> above folio_get_anon_vma(), some by me but most by PeterZ IIRC.

Ohhh, you're quite right. Unfortunately I seem to have completely
forgotten about that :-(



^ permalink raw reply	[flat|nested] 148+ messages in thread

* RE: [PATCH 11/13] x86_64: Remove pointless set_64bit() usage
  2022-11-05 15:14                 ` Peter Zijlstra
  2022-11-05 20:54                   ` Jason A. Donenfeld
@ 2022-11-07  9:14                   ` David Laight
  1 sibling, 0 replies; 148+ messages in thread
From: David Laight @ 2022-11-07  9:14 UTC (permalink / raw)
  To: 'Peter Zijlstra', Jason A. Donenfeld
  Cc: Linus Torvalds, Nathan Chancellor, Uros Bizjak, x86, willy, akpm,
	linux-kernel, linux-mm, aarcange, kirill.shutemov, jroedel

From: Peter Zijlstra
> Sent: 05 November 2022 15:14
> 
> On Sat, Nov 05, 2022 at 02:29:47PM +0100, Jason A. Donenfeld wrote:
> > On Fri, Nov 04, 2022 at 10:15:08AM -0700, Linus Torvalds wrote:
> > > On Fri, Nov 4, 2022 at 9:01 AM Peter Zijlstra <peterz@infradead.org> wrote:
> > > >
> > > > So cmpxchg_double() does a cmpxchg on a double long value and is
> > > > currently supported by: i386, x86_64, arm64 and s390.
> > > >
> > > > On all those, except i386, two longs are u128.
> > > >
> > > > So how about we introduce u128 and cmpxchg128 -- then it directly
> > > > mirrors the u64 and cmpxchg64 usage we already have. It then also
> > > > naturally imposes the alignment thing.
> > >
> > > Ack, except that we might have some "u128" users that do *not*
> > > necessarily want any alignment thing.
> > >
> > > But maybe we could at least start with an u128 type that is marked as
> > > being fully aligned, and if some other user comes in down the line
> > > that wants relaxed alignment we can call it "u128_unaligned" or
> > > something.
> >
> > Hm, sounds maybe not so nice for another use case: arithmetic code that
> > makes use of u128 for efficient computations, but otherwise has
> > no particular alignment requirements. For example, `typedef __uint128_t
> > u128;` in:
> 
> Natural alignment is... natural. Making it unaligned is quite mad. That
> whole u64 is not naturally aligned on i386 thing Linus referred to is a
> sodding pain in the backside.
> 
> If the code has no alignment requirements, natural alignment is as good
> as any. And if it does have requirements, you can use u128_unaligned.
> 
> Also:
> 
> $ ./align
> 16, 16
> 
> ---
> 
> #include <stdio.h>
> 
> int main(int argx, char **argv)
> {
> 	__int128 a;
> 
> 	printf("%d, %d\n", sizeof(a), __alignof(a));
> 	return 0;
> }

Well, __alignof() doesn't return the required value.
(cf 'long long' on 32bit x86).
But the alignment of __int128 is 16 :-)

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: mm: delay rmap removal until after TLB flush
  2022-11-06 23:14                                                             ` Andrew Morton
  2022-11-07  0:06                                                               ` Stephen Rothwell
@ 2022-11-07 16:19                                                               ` Linus Torvalds
  2022-11-07 23:02                                                                 ` Andrew Morton
  1 sibling, 1 reply; 148+ messages in thread
From: Linus Torvalds @ 2022-11-07 16:19 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Hugh Dickins, Johannes Weiner, Stephen Rothwell,
	Alexander Gordeev, Peter Zijlstra, Will Deacon, Aneesh Kumar,
	Nick Piggin, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, Sven Schnelle, Nadav Amit, Jann Horn,
	John Hubbard, X86 ML, Matthew Wilcox, kernel list, Linux-MM,
	Andrea Arcangeli, Kirill A . Shutemov, Joerg Roedel, Uros Bizjak,
	Alistair Popple, linux-arch

On Sun, Nov 6, 2022 at 3:14 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> Yes, I've removed both serieses and shall push the tree out within half
> an hour from now.

Oh, can you please just put Hugh's series back in?

I don't love the mapcount changes, but I'm going to re-do my parts so
that there is no clash with them.

And while I'd much rather see the mapcounts done another way, that's a
completely separate issue.

                Linus

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: mm: delay rmap removal until after TLB flush
  2022-11-06 21:06                                                         ` Hugh Dickins
  2022-11-06 22:34                                                           ` Linus Torvalds
  2022-11-07  9:12                                                           ` Peter Zijlstra
@ 2022-11-07 20:07                                                           ` Johannes Weiner
  2022-11-07 20:29                                                             ` Linus Torvalds
  2 siblings, 1 reply; 148+ messages in thread
From: Johannes Weiner @ 2022-11-07 20:07 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Linus Torvalds, Stephen Rothwell, Alexander Gordeev,
	Peter Zijlstra, Will Deacon, Aneesh Kumar, Nick Piggin,
	Heiko Carstens, Vasily Gorbik, Christian Borntraeger,
	Sven Schnelle, Nadav Amit, Jann Horn, John Hubbard, X86 ML,
	Matthew Wilcox, Andrew Morton, kernel list, Linux-MM,
	Andrea Arcangeli, Kirill A . Shutemov, Joerg Roedel, Uros Bizjak,
	Alistair Popple, linux-arch

Hello,

On Sun, Nov 06, 2022 at 01:06:19PM -0800, Hugh Dickins wrote:
> lock_page_memcg (uncertain)
> ---------------------------
> Johannes, please help! Linus has got quite upset by lock_page_memcg(),
> its calls from mm/rmap.c anyway, and most particularly by the way
> in which it is called at the start of page_remove_rmap(), before
> anyone's critical atomic_add_negative(), yet its use is to guarantee
> the stability of page memcg while doing the stats updates, done only
> when atomic_add_negative() says so.

As you mentioned, the pte lock historically wasn't always taken on the
move side. And so the reason lock_page_memcg() covers the mapcount
update is that move_account() needs to atomically see either a) the
page is mapped and counted, or b) unmapped and uncounted. If we lock
after mapcountdec, move_account() could miss pending updates that need
to be transferred, and it would break the scheme thusly:

memcg1->nr_mapped = 1
memcg2->nr_mapped = 0

page_remove_rmap:                              mem_cgroup_move_account():
  if atomic_add_negative(page->mapcount):
                                                 lock_page_memcg()
                                                 if page->mapcount: // NOT TAKEN
                                                   memcg1->nr_mapped--
                                                   memcg2->nr_mapped++
                                                 page->memcg = memcg2
                                                 unlock_page_memcg()
    lock_page_memcg()
    page->memcg->nr_mapped-- // UNDERFLOW memcg2->nr_mapped
    unlock_page_memcg()

> I do have one relevant insight on this.  It (or its antecedents under
> other names) date from the days when we did "reparenting" of memcg
> charges from an LRU: and in those days the lock_page_memcg() before
> mapcount adjustment was vital, to pair with the uses of folio_mapped()
> or page_mapped() in mem_cgroup_move_account() - those "mapped" checks
> are precisely around the stats which the rmap functions affect.
> 
> But nowadays mem_cgroup_move_account() is only called, with page table
> lock held, on matching pages found in a task's page table: so its
> "mapped" checks are redundant - I've sometimes thought in the past of
> removing them, but held back, because I always have the notion (not
> hope!) that "reparenting" may one day be brought back from the grave.
> I'm too out of touch with memcg to know where that notion stands today.
>
> I've gone through a multiverse of opinions on those lock_page_memcg()s
> in the last day: I currently believe that Linus is right, that the
> lock_page_memcg()s could and should be moved just before the stats
> updates.  But I am not 100% certain of that - is there still some
> reason why it's important that the page memcg at the instant of the
> critical mapcount transition be kept unchanged until the stats are
> updated?  I've tried running scenarios through my mind but given up.

Okay, I think there are two options.

- If we don't want to codify the pte lock requirement on the move
  side, then moving the lock_page_memcg() like that would break the
  locking scheme, as per above.

- If we DO want to codify the pte lock requirement, we should just
  remove the lock_page_memcg() altogether, as it's fully redundant.

I'm leaning toward the second option. If somebody brings back
reparenting they can bring the lock back along with it.

[ If it's even still necessary by then.

  It's conceivable reparenting is brought back only for cgroup2, where
  the race wouldn't matter because of the hierarchical stats. The
  reparenting side wouldn't have to move page state to the parent -
  it's already there. And whether rmap would see the dying child or
  the parent doesn't matter much either: the parent will see the
  update anyway, directly or recursively, and we likely don't care to
  balance the books on a dying cgroup.

  It's then just a matter of lifetime - which should be guaranteed
  also, as long as the pte lock prevents an RCU quiescent state. ]

So how about something like below?

UNTESTED, just for illustration. This is cgroup1 code, which I haven't
looked at too closely in a while. If you can't spot an immediate hole
in it, I'd go ahead and test it and send a proper patch.

---
From 88a32b1b5737630fb981114f6333d8fd057bd8e9 Mon Sep 17 00:00:00 2001
From: Johannes Weiner <hannes@cmpxchg.org>
Date: Mon, 7 Nov 2022 12:05:09 -0500
Subject: [PATCH] mm: remove lock_page_memcg() from rmap

rmap changes (mapping and unmapping) of a page currently take
lock_page_memcg() to serialize 1) update of the mapcount and the
cgroup mapped counter with 2) cgroup moving the page and updating the
old cgroup and the new cgroup counters based on page_mapped().

Before b2052564e66d ("mm: memcontrol: continue cache reclaim from
offlined groups"), we used to reassign all pages that could be found
on a cgroup's LRU list on deletion - something that rmap didn't
naturally serialize against. Since that commit, however, the only
pages that get moved are those mapped into page tables of a task
that's being migrated. In that case, the pte lock is always held (and
we know the page is mapped), which keeps rmap changes at bay already.

The additional lock_page_memcg() by rmap is redundant. Remove it.

NOT-Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/memcontrol.c | 27 ++++++++++++---------------
 mm/rmap.c       | 30 ++++++++++++------------------
 2 files changed, 24 insertions(+), 33 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 2d8549ae1b30..f7716e9038e9 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5666,7 +5666,10 @@ static struct page *mc_handle_file_pte(struct vm_area_struct *vma,
  * @from: mem_cgroup which the page is moved from.
  * @to:	mem_cgroup which the page is moved to. @from != @to.
  *
- * The caller must make sure the page is not on LRU (isolate_page() is useful.)
+ * This function acquires folio_lock() and folio_lock_memcg(). The
+ * caller must exclude all other possible ways of accessing
+ * page->memcg, such as LRU isolation (to lock out isolation) and
+ * having the page mapped and pte-locked (to lock out rmap).
  *
  * This function doesn't do "charge" to new cgroup and doesn't do "uncharge"
  * from old cgroup.
@@ -5685,6 +5688,7 @@ static int mem_cgroup_move_account(struct page *page,
 	VM_BUG_ON(from == to);
 	VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);
 	VM_BUG_ON(compound && !folio_test_large(folio));
+	VM_WARN_ON_ONCE(!folio_mapped(folio));
 
 	/*
 	 * Prevent mem_cgroup_migrate() from looking at
@@ -5705,30 +5709,23 @@ static int mem_cgroup_move_account(struct page *page,
 	folio_memcg_lock(folio);
 
 	if (folio_test_anon(folio)) {
-		if (folio_mapped(folio)) {
-			__mod_lruvec_state(from_vec, NR_ANON_MAPPED, -nr_pages);
-			__mod_lruvec_state(to_vec, NR_ANON_MAPPED, nr_pages);
-			if (folio_test_transhuge(folio)) {
-				__mod_lruvec_state(from_vec, NR_ANON_THPS,
-						   -nr_pages);
-				__mod_lruvec_state(to_vec, NR_ANON_THPS,
-						   nr_pages);
-			}
+		__mod_lruvec_state(from_vec, NR_ANON_MAPPED, -nr_pages);
+		__mod_lruvec_state(to_vec, NR_ANON_MAPPED, nr_pages);
+		if (folio_test_transhuge(folio)) {
+			__mod_lruvec_state(from_vec, NR_ANON_THPS, -nr_pages);
+			__mod_lruvec_state(to_vec, NR_ANON_THPS, nr_pages);
 		}
 	} else {
 		__mod_lruvec_state(from_vec, NR_FILE_PAGES, -nr_pages);
 		__mod_lruvec_state(to_vec, NR_FILE_PAGES, nr_pages);
+		__mod_lruvec_state(from_vec, NR_FILE_MAPPED, -nr_pages);
+		__mod_lruvec_state(to_vec, NR_FILE_MAPPED, nr_pages);
 
 		if (folio_test_swapbacked(folio)) {
 			__mod_lruvec_state(from_vec, NR_SHMEM, -nr_pages);
 			__mod_lruvec_state(to_vec, NR_SHMEM, nr_pages);
 		}
 
-		if (folio_mapped(folio)) {
-			__mod_lruvec_state(from_vec, NR_FILE_MAPPED, -nr_pages);
-			__mod_lruvec_state(to_vec, NR_FILE_MAPPED, nr_pages);
-		}
-
 		if (folio_test_dirty(folio)) {
 			struct address_space *mapping = folio_mapping(folio);
 
diff --git a/mm/rmap.c b/mm/rmap.c
index 2ec925e5fa6a..60c31375f274 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1197,11 +1197,6 @@ void page_add_anon_rmap(struct page *page,
 	bool compound = flags & RMAP_COMPOUND;
 	bool first;
 
-	if (unlikely(PageKsm(page)))
-		lock_page_memcg(page);
-	else
-		VM_BUG_ON_PAGE(!PageLocked(page), page);
-
 	if (compound) {
 		atomic_t *mapcount;
 		VM_BUG_ON_PAGE(!PageLocked(page), page);
@@ -1217,19 +1212,19 @@ void page_add_anon_rmap(struct page *page,
 	if (first) {
 		int nr = compound ? thp_nr_pages(page) : 1;
 		/*
-		 * We use the irq-unsafe __{inc|mod}_zone_page_stat because
-		 * these counters are not modified in interrupt context, and
-		 * pte lock(a spinlock) is held, which implies preemption
-		 * disabled.
+		 * We use the irq-unsafe __{inc|mod}_zone_page_stat
+		 * because these counters are not modified in
+		 * interrupt context, and pte lock(a spinlock) is
+		 * held, which implies preemption disabled.
+		 *
+		 * The pte lock also stabilizes page->memcg wrt
+		 * mem_cgroup_move_account().
 		 */
 		if (compound)
 			__mod_lruvec_page_state(page, NR_ANON_THPS, nr);
 		__mod_lruvec_page_state(page, NR_ANON_MAPPED, nr);
 	}
 
-	if (unlikely(PageKsm(page)))
-		unlock_page_memcg(page);
-
 	/* address might be in next vma when migration races vma_adjust */
 	else if (first)
 		__page_set_anon_rmap(page, vma, address,
@@ -1290,7 +1285,6 @@ void page_add_file_rmap(struct page *page,
 	int i, nr = 0;
 
 	VM_BUG_ON_PAGE(compound && !PageTransHuge(page), page);
-	lock_page_memcg(page);
 	if (compound && PageTransHuge(page)) {
 		int nr_pages = thp_nr_pages(page);
 
@@ -1311,6 +1305,7 @@ void page_add_file_rmap(struct page *page,
 		if (nr == nr_pages && PageDoubleMap(page))
 			ClearPageDoubleMap(page);
 
+		/* The pte lock stabilizes page->memcg */
 		if (PageSwapBacked(page))
 			__mod_lruvec_page_state(page, NR_SHMEM_PMDMAPPED,
 						nr_pages);
@@ -1328,7 +1323,6 @@ void page_add_file_rmap(struct page *page,
 out:
 	if (nr)
 		__mod_lruvec_page_state(page, NR_FILE_MAPPED, nr);
-	unlock_page_memcg(page);
 
 	mlock_vma_page(page, vma, compound);
 }
@@ -1356,6 +1350,7 @@ static void page_remove_file_rmap(struct page *page, bool compound)
 		}
 		if (!atomic_add_negative(-1, compound_mapcount_ptr(page)))
 			goto out;
+		/* The pte lock stabilizes page->memcg */
 		if (PageSwapBacked(page))
 			__mod_lruvec_page_state(page, NR_SHMEM_PMDMAPPED,
 						-nr_pages);
@@ -1423,8 +1418,6 @@ static void page_remove_anon_compound_rmap(struct page *page)
 void page_remove_rmap(struct page *page,
 	struct vm_area_struct *vma, bool compound)
 {
-	lock_page_memcg(page);
-
 	if (!PageAnon(page)) {
 		page_remove_file_rmap(page, compound);
 		goto out;
@@ -1443,6 +1436,9 @@ void page_remove_rmap(struct page *page,
 	 * We use the irq-unsafe __{inc|mod}_zone_page_stat because
 	 * these counters are not modified in interrupt context, and
 	 * pte lock(a spinlock) is held, which implies preemption disabled.
+	 *
+	 * The pte lock also stabilizes page->memcg wrt
+	 * mem_cgroup_move_account().
 	 */
 	__dec_lruvec_page_state(page, NR_ANON_MAPPED);
 
@@ -1459,8 +1455,6 @@ void page_remove_rmap(struct page *page,
 	 * faster for those pages still in swapcache.
 	 */
 out:
-	unlock_page_memcg(page);
-
 	munlock_vma_page(page, vma, compound);
 }
 
-- 
2.38.1

^ permalink raw reply related	[flat|nested] 148+ messages in thread

* Re: mm: delay rmap removal until after TLB flush
  2022-11-07 20:07                                                           ` Johannes Weiner
@ 2022-11-07 20:29                                                             ` Linus Torvalds
  2022-11-07 23:47                                                               ` Linus Torvalds
  0 siblings, 1 reply; 148+ messages in thread
From: Linus Torvalds @ 2022-11-07 20:29 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Hugh Dickins, Stephen Rothwell, Alexander Gordeev,
	Peter Zijlstra, Will Deacon, Aneesh Kumar, Nick Piggin,
	Heiko Carstens, Vasily Gorbik, Christian Borntraeger,
	Sven Schnelle, Nadav Amit, Jann Horn, John Hubbard, X86 ML,
	Matthew Wilcox, Andrew Morton, kernel list, Linux-MM,
	Andrea Arcangeli, Kirill A . Shutemov, Joerg Roedel, Uros Bizjak,
	Alistair Popple, linux-arch

On Mon, Nov 7, 2022 at 12:07 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> - If we DO want to codify the pte lock requirement, we should just
>   remove the lock_page_memcg() altogether, as it's fully redundant.
>
> I'm leaning toward that second option.

The thing is, that's very much the case we do *not* want.

We need to delay the rmap removal until at least after the TLB flush.
At least for dirty filemapped pages - because the page cleaning needs
to see that they exists as mapped entities until all CPU's have
*actually* dropped them.

Now, we do the TLB flush still under the page table lock, so we could
still then do the rmap removal before dropping the lock.

But it would be much cleaner from the TLB flushing standpoint to delay
it until the page freeing, which ends up being delayed until after the
lock is dropped.

That said, if always doing the rmap removal under the page table lock
means that that memcg lock can just be deleted in that whole path, I
will certainly bow to _that_ simplification instead, and just handle
the dirty pages after the TLB flush but before the page table drop.

              Linus

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: mm: delay rmap removal until after TLB flush
  2022-11-07 16:19                                                               ` Linus Torvalds
@ 2022-11-07 23:02                                                                 ` Andrew Morton
  2022-11-07 23:44                                                                   ` Stephen Rothwell
  0 siblings, 1 reply; 148+ messages in thread
From: Andrew Morton @ 2022-11-07 23:02 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Hugh Dickins, Johannes Weiner, Stephen Rothwell,
	Alexander Gordeev, Peter Zijlstra, Will Deacon, Aneesh Kumar,
	Nick Piggin, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, Sven Schnelle, Nadav Amit, Jann Horn,
	John Hubbard, X86 ML, Matthew Wilcox, kernel list, Linux-MM,
	Andrea Arcangeli, Kirill A . Shutemov, Joerg Roedel, Uros Bizjak,
	Alistair Popple, linux-arch

On Mon, 7 Nov 2022 08:19:24 -0800 Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Sun, Nov 6, 2022 at 3:14 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> >
> > Yes, I've removed both serieses and shall push the tree out within half
> > an hour from now.
> 
> Oh, can you please just put Hugh's series back in?
> 

Done, all pushed out to
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm mm-unstable.


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: mm: delay rmap removal until after TLB flush
  2022-11-07 23:02                                                                 ` Andrew Morton
@ 2022-11-07 23:44                                                                   ` Stephen Rothwell
  0 siblings, 0 replies; 148+ messages in thread
From: Stephen Rothwell @ 2022-11-07 23:44 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Hugh Dickins, Johannes Weiner, Alexander Gordeev,
	Peter Zijlstra, Will Deacon, Aneesh Kumar, Nick Piggin,
	Heiko Carstens, Vasily Gorbik, Christian Borntraeger,
	Sven Schnelle, Nadav Amit, Jann Horn, John Hubbard, X86 ML,
	Matthew Wilcox, kernel list, Linux-MM, Andrea Arcangeli,
	Kirill A . Shutemov, Joerg Roedel, Uros Bizjak, Alistair Popple,
	linux-arch

[-- Attachment #1: Type: text/plain, Size: 358 bytes --]

Hi all,

On Mon, 7 Nov 2022 15:02:42 -0800 Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Mon, 7 Nov 2022 08:19:24 -0800 Linus Torvalds <torvalds@linux-foundation.org> wrote:
> 
> Done, all pushed out to
> git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm mm-unstable.

Will be in today's linux-next.

-- 
Cheers,
Stephen Rothwell

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: mm: delay rmap removal until after TLB flush
  2022-11-07 20:29                                                             ` Linus Torvalds
@ 2022-11-07 23:47                                                               ` Linus Torvalds
  2022-11-08  4:28                                                                 ` Linus Torvalds
                                                                                   ` (4 more replies)
  0 siblings, 5 replies; 148+ messages in thread
From: Linus Torvalds @ 2022-11-07 23:47 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Hugh Dickins, Stephen Rothwell, Alexander Gordeev,
	Peter Zijlstra, Will Deacon, Aneesh Kumar, Nick Piggin,
	Heiko Carstens, Vasily Gorbik, Christian Borntraeger,
	Sven Schnelle, Nadav Amit, Jann Horn, John Hubbard, X86 ML,
	Matthew Wilcox, Andrew Morton, kernel list, Linux-MM,
	Andrea Arcangeli, Kirill A . Shutemov, Joerg Roedel, Uros Bizjak,
	Alistair Popple, linux-arch

On Mon, Nov 7, 2022 at 12:29 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> That said, if always doing the rmap removal under the page table lock
> means that that memcg lock can just be deleted in that whole path, I
> will certainly bow to _that_ simplification instead, and just handle
> the dirty pages after the TLB flush but before the page table drop.

Ok, so I think I have a fairly clean way to do this.

Let me try to make that series look reasonable, although it might be
until tomorrow. I'll need to massage my mess into not just prettier
code, but a sane history.

               Linus

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: mm: delay rmap removal until after TLB flush
  2022-11-07 23:47                                                               ` Linus Torvalds
@ 2022-11-08  4:28                                                                 ` Linus Torvalds
  2022-11-08 19:56                                                                   ` Linus Torvalds
  2022-11-08 19:41                                                                 ` [PATCH 1/4] mm: introduce 'encoded' page pointers with embedded extra bits Linus Torvalds
                                                                                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 148+ messages in thread
From: Linus Torvalds @ 2022-11-08  4:28 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Hugh Dickins, Stephen Rothwell, Alexander Gordeev,
	Peter Zijlstra, Will Deacon, Aneesh Kumar, Nick Piggin,
	Heiko Carstens, Vasily Gorbik, Christian Borntraeger,
	Sven Schnelle, Nadav Amit, Jann Horn, John Hubbard, X86 ML,
	Matthew Wilcox, Andrew Morton, kernel list, Linux-MM,
	Andrea Arcangeli, Kirill A . Shutemov, Joerg Roedel, Uros Bizjak,
	Alistair Popple, linux-arch

[-- Attachment #1: Type: text/plain, Size: 3263 bytes --]

On Mon, Nov 7, 2022 at 3:47 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Ok, so I think I have a fairly clean way to do this.
>
> Let me try to make that series look reasonable, although it might be
> until tomorrow. I'll need to massage my mess into not just prettier
> code, but a sane history.

Ugh. Ok, so massaging it into a saner form and splitting it out into a
pretty history took a lot longer than writing the initial ugly
"something like this".

Anyway, the end result looks very different from the previous series,
since I very consciously tried to keep away from rmap changes to not
clash with Hugh's work, and also made sure we still do the
page_remove_rmap() under the page table lock, just *after* the TLB
flush.

It does share some basic stuff - I'm still using that "struct
encoded_page" thing, I just moved it into a separate commit, and moved
it to where it conceptually belongs (I'd love to say that I also made
the mlock.c code use it, but I didn't - that 'pagevec' thing is a
pretty pointless abstraction and I didn't want to fight it).

So you'll see some familiar support structures and scaffolding, but
the actual approach to zap_pte_range() is very different.

The approach taken to the "s390 is very different" issue is also
completely different. It should actually be better for s390: we used
to cause that "force_flush" whenever we hit a dirty shared page, and
s390 doesn't care. The new model makes that "s390 doesn't care" part
of the whole design, so now s390 treats dirty shared pages
basically as if they weren't anything special at all. Which they
aren't on s390.

But, as with the previous case, I don't even try to cross-compile for
s390, so my attempts at handling s390 well are just that: attempts.
The code may be broken.

Of course, the code may be broken on x86-64 too, but at least there
I've compiled it and am running it right now.

Oh, and because I copied large parts of the commit message from the
previous approach (because the problem description is the same), I
noticed that I also kept the "Acked-by:". Those are bogus, because the
code is sufficiently different that any previous acks are just not
valid any more, but I just hadn't fixed that yet.

The meat of it all is in that last patch, the rest is just type system
cleanups to prep for it. But it was also that last patch that I spent
hours just tweaking to look sensible. The *code* was pretty
easy. Making it have sensible naming and a sensible abstraction
interface that worked well for s390 too, that was 90% of the effort.

But I also hope that effort makes the end result reasonably easy to review.

I'm sending this out because I'm stepping away from the keyboard,
because that whole "let's massage it into something legible" was
really somewhat exhausting. You don't see all the small side turns it
took only to go "that's ugly, let's try again" ;)

Anybody interested in giving this a peek?

(Patch 2/4 might make some people pause. It's fairly small and simple.
It's effective and makes it easy to do some of the later changes. And
it's also quite different from our usual model. It was "inspired" by
the signed-vs-unsigned char thread from a few weeks ago. But patch 4/4
is the one that matters).

                  Linus

[-- Attachment #2: 0001-mm-introduce-encoded-page-pointers-with-embedded-ext.patch --]
[-- Type: text/x-patch, Size: 2749 bytes --]

From 675b73aaa7718e93e9f2492a3b9cc417f9e820b4 Mon Sep 17 00:00:00 2001
From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Mon, 7 Nov 2022 15:48:27 -0800
Subject: [PATCH 1/4] mm: introduce 'encoded' page pointers with embedded extra
 bits

We already have this notion in parts of the MM code (see the mlock code
with the LRU_PAGE and NEW_PAGE bits), but I'm going to introduce a new
case, and I refuse to do the same thing we've done before where we just
put bits in the raw pointer and say it's still a normal pointer.

So this introduces a 'struct encoded_page' pointer that cannot be used
for anything else than to encode a real page pointer and a couple of
extra bits in the low bits.  That way the compiler can trivially track
the state of the pointer and you just explicitly encode and decode the
extra bits.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
 include/linux/mm_types.h | 33 ++++++++++++++++++++++++++++++++-
 1 file changed, 32 insertions(+), 1 deletion(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 500e536796ca..b5cffd250784 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -67,7 +67,7 @@ struct mem_cgroup;
 #ifdef CONFIG_HAVE_ALIGNED_STRUCT_PAGE
 #define _struct_page_alignment	__aligned(2 * sizeof(unsigned long))
 #else
-#define _struct_page_alignment
+#define _struct_page_alignment	__aligned(sizeof(unsigned long))
 #endif
 
 struct page {
@@ -241,6 +241,37 @@ struct page {
 #endif
 } _struct_page_alignment;
 
+/**
+ * struct encoded_page - a nonexistent type marking this pointer
+ *
+ * An 'encoded_page' pointer is a pointer to a regular 'struct page', but
+ * with the low bits of the pointer indicating extra context-dependent
+ * information. Not super-common, but happens in mmu_gather and mlock
+ * handling, and this acts as a type system check on that use.
+ *
+ * We only really have two guaranteed bits in general, although you could
+ * play with 'struct page' alignment (see CONFIG_HAVE_ALIGNED_STRUCT_PAGE)
+ * for more.
+ *
+ * Use the supplied helper functions to encode/decode the pointer and bits.
+ */
+struct encoded_page;
+#define ENCODE_PAGE_BITS 3ul
+static inline struct encoded_page *encode_page(struct page *page, unsigned long flags)
+{
+	return (struct encoded_page *)(flags | (unsigned long)page);
+}
+
+static inline bool encoded_page_flags(struct encoded_page *page)
+{
+	return ENCODE_PAGE_BITS & (unsigned long)page;
+}
+
+static inline struct page *encoded_page_ptr(struct encoded_page *page)
+{
+	return (struct page *)(~ENCODE_PAGE_BITS & (unsigned long)page);
+}
+
 /**
  * struct folio - Represents a contiguous set of bytes.
  * @flags: Identical to the page flags.
-- 
2.37.1.289.g45aa1e5c72.dirty


[-- Attachment #3: 0002-mm-teach-release_pages-to-take-an-array-of-encoded-p.patch --]
[-- Type: text/x-patch, Size: 4058 bytes --]

From 31e4135eeedbb6ae12bfb9b17a7f6d9d815ff289 Mon Sep 17 00:00:00 2001
From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Mon, 7 Nov 2022 16:46:16 -0800
Subject: [PATCH 2/4] mm: teach release_pages() to take an array of encoded
 page pointers too

release_pages() already could take either an array of page pointers, or
an array of folio pointers.  Expand it to also accept an array of
encoded page pointers, which is what both the existing mlock() use and
the upcoming mmu_gather use of encoded page pointers wants.

Note that release_pages() won't actually use, or react to, any extra
encoded bits.  Instead, this is very much a case of "I have walked the
array of encoded pages and done everything the extra bits tell me to do,
now release it all".

Also, while the "either page or folio pointers" dual use was handled
with a cast of the pointer in "release_folios()", this takes a slightly
different approach and uses the "transparent union" attribute to
describe the set of arguments to the function:

  https://gcc.gnu.org/onlinedocs/gcc/Common-Type-Attributes.html

which has been supported by gcc forever, but the kernel hasn't used
before.

That allows us to avoid using various wrappers with casts, and just use
the same function regardless of use.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
 include/linux/mm.h | 21 +++++++++++++++++++--
 mm/swap.c          | 16 ++++++++++++----
 2 files changed, 31 insertions(+), 6 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8bbcccbc5565..d9fb5c3e3045 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1179,7 +1179,24 @@ static inline void folio_put_refs(struct folio *folio, int refs)
 		__folio_put(folio);
 }
 
-void release_pages(struct page **pages, int nr);
+/**
+ * release_pages - release an array of pages or folios
+ *
+ * This just releases a simple array of multiple pages, and
+ * accepts various different forms of said page array: either
+ * a regular old boring array of pages, an array of folios, or
+ * an array of encoded page pointers.
+ *
+ * The transparent union syntax for this kind of "any of these
+ * argument types" is all kinds of ugly, so look away.
+ */
+typedef union {
+	struct page **pages;
+	struct folio **folios;
+	struct encoded_page **encoded_pages;
+} release_pages_arg __attribute__ ((__transparent_union__));
+
+void release_pages(release_pages_arg, int nr);
 
 /**
  * folios_put - Decrement the reference count on an array of folios.
@@ -1195,7 +1212,7 @@ void release_pages(struct page **pages, int nr);
  */
 static inline void folios_put(struct folio **folios, unsigned int nr)
 {
-	release_pages((struct page **)folios, nr);
+	release_pages(folios, nr);
 }
 
 static inline void put_page(struct page *page)
diff --git a/mm/swap.c b/mm/swap.c
index 955930f41d20..596ed226ddb8 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -968,22 +968,30 @@ void lru_cache_disable(void)
 
 /**
  * release_pages - batched put_page()
- * @pages: array of pages to release
+ * @arg: array of pages to release
  * @nr: number of pages
  *
- * Decrement the reference count on all the pages in @pages.  If it
+ * Decrement the reference count on all the pages in @arg.  If it
  * fell to zero, remove the page from the LRU and free it.
+ *
+ * Note that the argument can be an array of pages, encoded pages,
+ * or folio pointers. We ignore any encoded bits, and turn any of
+ * them into just a folio that gets free'd.
  */
-void release_pages(struct page **pages, int nr)
+void release_pages(release_pages_arg arg, int nr)
 {
 	int i;
+	struct encoded_page **encoded = arg.encoded_pages;
 	LIST_HEAD(pages_to_free);
 	struct lruvec *lruvec = NULL;
 	unsigned long flags = 0;
 	unsigned int lock_batch;
 
 	for (i = 0; i < nr; i++) {
-		struct folio *folio = page_folio(pages[i]);
+		struct folio *folio;
+
+		/* Turn any of the argument types into a folio */
+		folio = page_folio(encoded_page_ptr(encoded[i]));
 
 		/*
 		 * Make sure the IRQ-safe lock-holding time does not get
-- 
2.37.1.289.g45aa1e5c72.dirty


[-- Attachment #4: 0003-mm-mmu_gather-prepare-to-gather-encoded-page-pointer.patch --]
[-- Type: text/x-patch, Size: 3376 bytes --]

From 0e6863a5389a10e984122c6dca143f9be71da310 Mon Sep 17 00:00:00 2001
From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Mon, 7 Nov 2022 17:36:43 -0800
Subject: [PATCH 3/4] mm: mmu_gather: prepare to gather encoded page pointers
 with flags

This is purely a preparatory patch that makes all the data structures
ready for encoding flags with the mmu_gather page pointers.

The code currently always sets the flag to zero and doesn't use it yet,
but now it's tracking the type state along.  The next step will be to
actually start using it.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
 include/asm-generic/tlb.h |  2 +-
 include/linux/swap.h      |  2 +-
 mm/mmu_gather.c           |  4 ++--
 mm/swap_state.c           | 11 ++++-------
 4 files changed, 8 insertions(+), 11 deletions(-)

diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index 492dce43236e..faca23e87278 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -242,7 +242,7 @@ struct mmu_gather_batch {
 	struct mmu_gather_batch	*next;
 	unsigned int		nr;
 	unsigned int		max;
-	struct page		*pages[];
+	struct encoded_page	*encoded_pages[];
 };
 
 #define MAX_GATHER_BATCH	\
diff --git a/include/linux/swap.h b/include/linux/swap.h
index a18cf4b7c724..40e418e3461b 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -470,7 +470,7 @@ static inline unsigned long total_swapcache_pages(void)
 
 extern void free_swap_cache(struct page *page);
 extern void free_page_and_swap_cache(struct page *);
-extern void free_pages_and_swap_cache(struct page **, int);
+extern void free_pages_and_swap_cache(struct encoded_page **, int);
 /* linux/mm/swapfile.c */
 extern atomic_long_t nr_swap_pages;
 extern long total_swap_pages;
diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
index add4244e5790..57b7850c1b5e 100644
--- a/mm/mmu_gather.c
+++ b/mm/mmu_gather.c
@@ -48,7 +48,7 @@ static void tlb_batch_pages_flush(struct mmu_gather *tlb)
 	struct mmu_gather_batch *batch;
 
 	for (batch = &tlb->local; batch && batch->nr; batch = batch->next) {
-		struct page **pages = batch->pages;
+		struct encoded_page **pages = batch->encoded_pages;
 
 		do {
 			/*
@@ -92,7 +92,7 @@ bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page, int page_
 	 * Add the page and check if we are full. If so
 	 * force a flush.
 	 */
-	batch->pages[batch->nr++] = page;
+	batch->encoded_pages[batch->nr++] = encode_page(page, 0);
 	if (batch->nr == batch->max) {
 		if (!tlb_next_batch(tlb))
 			return true;
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 438d0676c5be..8bf08c313872 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -303,15 +303,12 @@ void free_page_and_swap_cache(struct page *page)
  * Passed an array of pages, drop them all from swapcache and then release
  * them.  They are removed from the LRU and freed if this is their last use.
  */
-void free_pages_and_swap_cache(struct page **pages, int nr)
+void free_pages_and_swap_cache(struct encoded_page **pages, int nr)
 {
-	struct page **pagep = pages;
-	int i;
-
 	lru_add_drain();
-	for (i = 0; i < nr; i++)
-		free_swap_cache(pagep[i]);
-	release_pages(pagep, nr);
+	for (int i = 0; i < nr; i++)
+		free_swap_cache(encoded_page_ptr(pages[i]));
+	release_pages(pages, nr);
 }
 
 static inline bool swap_use_vma_readahead(void)
-- 
2.37.1.289.g45aa1e5c72.dirty


[-- Attachment #5: 0004-mm-delay-page_remove_rmap-until-after-the-TLB-has-be.patch --]
[-- Type: text/x-patch, Size: 11173 bytes --]

From 7ef5220cd5825d6e4a770286d8949b9b838bbc30 Mon Sep 17 00:00:00 2001
From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Mon, 7 Nov 2022 19:38:38 -0800
Subject: [PATCH 4/4] mm: delay page_remove_rmap() until after the TLB has been
 flushed

When we remove a page table entry, we are very careful to only free the
page after we have flushed the TLB, because other CPUs could still be
using the page through stale TLB entries until after the flush.

However, we have removed the rmap entry for that page early, which means
that functions like folio_mkclean() would end up not serializing with
the page table lock because the page had already been made invisible to
rmap.

And that is a problem, because while the TLB entry exists, we could end
up with the following situation:

 (a) one CPU could come in and clean it, never seeing our mapping of the
     page

 (b) another CPU could continue to use the stale and dirty TLB entry and
     continue to write to said page

resulting in a page that has been dirtied, but then marked clean again,
all while another CPU might have dirtied it some more.

End result: possibly lost dirty data.

This extends our current TLB gather infrastructure to optionally track a
"should I do a delayed page_remove_rmap() for this page after flushing
the TLB".  It uses the newly introduced 'encoded page pointer' to do
that without having to keep separate data around.

Note, this is complicated by a couple of issues:

 - s390 has its own mmu_gather model that doesn't delay TLB flushing,
   and as a result also does not want the delayed rmap

 - we want to delay the rmap removal, but not past the page table lock

 - we can track an enormous number of pages in our mmu_gather structure,
   with MAX_GATHER_BATCH_COUNT batches of MAX_TABLE_BATCH pages each,
   all set up to be approximately 10k pending pages.

   We do not want to have a huge number of batched pages that we then
   need to check for delayed rmap handling inside the page table lock.

Particularly that last point results in a noteworthy detail, where the
normal page batch gathering is limited once we have delayed rmaps
pending, in such a way that only the last batch (the so-called "active
batch") in the mmu_gather structure can have any delayed entries.

NOTE! While the "possibly lost dirty data" sounds catastrophic, for this
all to happen you need to have a user thread doing either madvise() with
MADV_DONTNEED or a full re-mmap() of the area concurrently with another
thread continuing to use said mapping.

So arguably this is about user space doing crazy things, but from a VM
consistency standpoint it's better if we track the dirty bit properly
even when user space goes off the rails.

Reported-by: Nadav Amit <nadav.amit@gmail.com>
Link: https://lore.kernel.org/all/B88D3073-440A-41C7-95F4-895D3F657EF2@gmail.com/
Cc: Will Deacon <will@kernel.org>
Cc: Aneesh Kumar <aneesh.kumar@linux.ibm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Nick Piggin <npiggin@gmail.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com> # s390
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
 arch/s390/include/asm/tlb.h | 21 +++++++++++++++++++--
 include/asm-generic/tlb.h   | 21 +++++++++++++++++----
 mm/memory.c                 | 23 +++++++++++++++++------
 mm/mmu_gather.c             | 35 +++++++++++++++++++++++++++++++++--
 4 files changed, 86 insertions(+), 14 deletions(-)

diff --git a/arch/s390/include/asm/tlb.h b/arch/s390/include/asm/tlb.h
index 3a5c8fb590e5..e5903ee2f1ca 100644
--- a/arch/s390/include/asm/tlb.h
+++ b/arch/s390/include/asm/tlb.h
@@ -25,7 +25,8 @@
 void __tlb_remove_table(void *_table);
 static inline void tlb_flush(struct mmu_gather *tlb);
 static inline bool __tlb_remove_page_size(struct mmu_gather *tlb,
-					  struct page *page, int page_size);
+					  struct page *page, int page_size,
+					  unsigned int flags);
 
 #define tlb_flush tlb_flush
 #define pte_free_tlb pte_free_tlb
@@ -36,13 +37,24 @@ static inline bool __tlb_remove_page_size(struct mmu_gather *tlb,
 #include <asm/tlbflush.h>
 #include <asm-generic/tlb.h>
 
+/*
+ * s390 never needs to delay page_remove_rmap, because
+ * the ptep_get_and_clear_full() will have flushed the
+ * TLB across CPUs
+ */
+static inline bool tlb_delay_rmap(struct mmu_gather *tlb)
+{
+	return false;
+}
+
 /*
  * Release the page cache reference for a pte removed by
  * tlb_ptep_clear_flush. In both flush modes the tlb for a page cache page
  * has already been freed, so just do free_page_and_swap_cache.
  */
 static inline bool __tlb_remove_page_size(struct mmu_gather *tlb,
-					  struct page *page, int page_size)
+					  struct page *page, int page_size,
+					  unsigned int flags)
 {
 	free_page_and_swap_cache(page);
 	return false;
@@ -53,6 +65,11 @@ static inline void tlb_flush(struct mmu_gather *tlb)
 	__tlb_flush_mm_lazy(tlb->mm);
 }
 
+static inline void tlb_flush_rmaps(struct mmu_gather *tlb, struct vm_area_struct *vma)
+{
+	/* Nothing to do, s390 does not delay rmaps */
+}
+
 /*
  * pte_free_tlb frees a pte table and clears the CRSTE for the
  * page table from the tlb.
diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index faca23e87278..9df513e5ad28 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -257,7 +257,15 @@ struct mmu_gather_batch {
 #define MAX_GATHER_BATCH_COUNT	(10000UL/MAX_GATHER_BATCH)
 
 extern bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page,
-				   int page_size);
+				   int page_size, unsigned int flags);
+extern void tlb_flush_rmaps(struct mmu_gather *tlb, struct vm_area_struct *vma);
+
+/*
+ * This both sets 'delayed_rmap', and returns true. It would be an inline
+ * function, except we define it before the 'struct mmu_gather'.
+ */
+#define tlb_delay_rmap(tlb) (((tlb)->delayed_rmap = 1), true)
+
 #endif
 
 /*
@@ -290,6 +298,11 @@ struct mmu_gather {
 	 */
 	unsigned int		freed_tables : 1;
 
+	/*
+	 * Do we have pending delayed rmap removals?
+	 */
+	unsigned int		delayed_rmap : 1;
+
 	/*
 	 * at which levels have we cleared entries?
 	 */
@@ -431,13 +444,13 @@ static inline void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb)
 static inline void tlb_remove_page_size(struct mmu_gather *tlb,
 					struct page *page, int page_size)
 {
-	if (__tlb_remove_page_size(tlb, page, page_size))
+	if (__tlb_remove_page_size(tlb, page, page_size, 0))
 		tlb_flush_mmu(tlb);
 }
 
-static inline bool __tlb_remove_page(struct mmu_gather *tlb, struct page *page)
+static inline bool __tlb_remove_page(struct mmu_gather *tlb, struct page *page, unsigned int flags)
 {
-	return __tlb_remove_page_size(tlb, page, PAGE_SIZE);
+	return __tlb_remove_page_size(tlb, page, PAGE_SIZE, flags);
 }
 
 /* tlb_remove_page
diff --git a/mm/memory.c b/mm/memory.c
index f88c351aecd4..60a0f44f6e72 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1432,6 +1432,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 			break;
 
 		if (pte_present(ptent)) {
+			unsigned int delay_rmap;
+
 			page = vm_normal_page(vma, addr, ptent);
 			if (unlikely(!should_zap_page(details, page)))
 				continue;
@@ -1443,20 +1445,26 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 			if (unlikely(!page))
 				continue;
 
+			delay_rmap = 0;
 			if (!PageAnon(page)) {
 				if (pte_dirty(ptent)) {
-					force_flush = 1;
 					set_page_dirty(page);
+					if (tlb_delay_rmap(tlb)) {
+						delay_rmap = 1;
+						force_flush = 1;
+					}
 				}
 				if (pte_young(ptent) &&
 				    likely(!(vma->vm_flags & VM_SEQ_READ)))
 					mark_page_accessed(page);
 			}
 			rss[mm_counter(page)]--;
-			page_remove_rmap(page, vma, false);
-			if (unlikely(page_mapcount(page) < 0))
-				print_bad_pte(vma, addr, ptent, page);
-			if (unlikely(__tlb_remove_page(tlb, page))) {
+			if (!delay_rmap) {
+				page_remove_rmap(page, vma, false);
+				if (unlikely(page_mapcount(page) < 0))
+					print_bad_pte(vma, addr, ptent, page);
+			}
+			if (unlikely(__tlb_remove_page(tlb, page, delay_rmap))) {
 				force_flush = 1;
 				addr += PAGE_SIZE;
 				break;
@@ -1513,8 +1521,11 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 	arch_leave_lazy_mmu_mode();
 
 	/* Do the actual TLB flush before dropping ptl */
-	if (force_flush)
+	if (force_flush) {
 		tlb_flush_mmu_tlbonly(tlb);
+		if (tlb->delayed_rmap)
+			tlb_flush_rmaps(tlb, vma);
+	}
 	pte_unmap_unlock(start_pte, ptl);
 
 	/*
diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
index 57b7850c1b5e..136f5fad43e3 100644
--- a/mm/mmu_gather.c
+++ b/mm/mmu_gather.c
@@ -9,6 +9,7 @@
 #include <linux/rcupdate.h>
 #include <linux/smp.h>
 #include <linux/swap.h>
+#include <linux/rmap.h>
 
 #include <asm/pgalloc.h>
 #include <asm/tlb.h>
@@ -19,6 +20,10 @@ static bool tlb_next_batch(struct mmu_gather *tlb)
 {
 	struct mmu_gather_batch *batch;
 
+	/* No more batching if we have delayed rmaps pending */
+	if (tlb->delayed_rmap)
+		return false;
+
 	batch = tlb->active;
 	if (batch->next) {
 		tlb->active = batch->next;
@@ -43,6 +48,31 @@ static bool tlb_next_batch(struct mmu_gather *tlb)
 	return true;
 }
 
+/**
+ * tlb_flush_rmaps - do pending rmap removals after we have flushed the TLB
+ * @tlb: the current mmu_gather
+ *
+ * Note that because of how tlb_next_batch() above works, we will
+ * never start new batches with pending delayed rmaps, so we only
+ * need to walk through the current active batch.
+ */
+void tlb_flush_rmaps(struct mmu_gather *tlb, struct vm_area_struct *vma)
+{
+	struct mmu_gather_batch *batch;
+
+	batch = tlb->active;
+	for (int i = 0; i < batch->nr; i++) {
+		struct encoded_page *enc = batch->encoded_pages[i];
+
+		if (encoded_page_flags(enc)) {
+			struct page *page = encoded_page_ptr(enc);
+			page_remove_rmap(page, vma, false);
+		}
+	}
+
+	tlb->delayed_rmap = 0;
+}
+
 static void tlb_batch_pages_flush(struct mmu_gather *tlb)
 {
 	struct mmu_gather_batch *batch;
@@ -77,7 +107,7 @@ static void tlb_batch_list_free(struct mmu_gather *tlb)
 	tlb->local.next = NULL;
 }
 
-bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page, int page_size)
+bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page, int page_size, unsigned int flags)
 {
 	struct mmu_gather_batch *batch;
 
@@ -92,7 +122,7 @@ bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page, int page_
 	 * Add the page and check if we are full. If so
 	 * force a flush.
 	 */
-	batch->encoded_pages[batch->nr++] = encode_page(page, 0);
+	batch->encoded_pages[batch->nr++] = encode_page(page, flags);
 	if (batch->nr == batch->max) {
 		if (!tlb_next_batch(tlb))
 			return true;
@@ -286,6 +316,7 @@ static void __tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm,
 	tlb->active     = &tlb->local;
 	tlb->batch_count = 0;
 #endif
+	tlb->delayed_rmap = 0;
 
 	tlb_table_init(tlb);
 #ifdef CONFIG_MMU_GATHER_PAGE_SIZE
-- 
2.37.1.289.g45aa1e5c72.dirty


^ permalink raw reply related	[flat|nested] 148+ messages in thread

* [PATCH 1/4] mm: introduce 'encoded' page pointers with embedded extra bits
  2022-11-07 23:47                                                               ` Linus Torvalds
  2022-11-08  4:28                                                                 ` Linus Torvalds
@ 2022-11-08 19:41                                                                 ` Linus Torvalds
  2022-11-08 20:37                                                                   ` Nadav Amit
  2022-11-09  6:36                                                                   ` Alexander Gordeev
  2022-11-08 19:41                                                                 ` [PATCH 2/4] mm: teach release_pages() to take an array of encoded page pointers too Linus Torvalds
                                                                                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 148+ messages in thread
From: Linus Torvalds @ 2022-11-08 19:41 UTC (permalink / raw)
  To: Hugh Dickins, Johannes Weiner, Andrew Morton; +Cc: linux-kernel, linux-mm

We already have this notion in parts of the MM code (see the mlock code
with the LRU_PAGE and NEW_PAGE bits), but I'm going to introduce a new
case, and I refuse to do the same thing we've done before where we just
put bits in the raw pointer and say it's still a normal pointer.

So this introduces a 'struct encoded_page' pointer that cannot be used
for anything else than to encode a real page pointer and a couple of
extra bits in the low bits.  That way the compiler can trivially track
the state of the pointer and you just explicitly encode and decode the
extra bits.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
 include/linux/mm_types.h | 33 ++++++++++++++++++++++++++++++++-
 1 file changed, 32 insertions(+), 1 deletion(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 500e536796ca..b5cffd250784 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -67,7 +67,7 @@ struct mem_cgroup;
 #ifdef CONFIG_HAVE_ALIGNED_STRUCT_PAGE
 #define _struct_page_alignment	__aligned(2 * sizeof(unsigned long))
 #else
-#define _struct_page_alignment
+#define _struct_page_alignment	__aligned(sizeof(unsigned long))
 #endif
 
 struct page {
@@ -241,6 +241,37 @@ struct page {
 #endif
 } _struct_page_alignment;
 
+/**
+ * struct encoded_page - a nonexistent type marking this pointer
+ *
+ * An 'encoded_page' pointer is a pointer to a regular 'struct page', but
+ * with the low bits of the pointer indicating extra context-dependent
+ * information. Not super-common, but happens in mmu_gather and mlock
+ * handling, and this acts as a type system check on that use.
+ *
+ * We only really have two guaranteed bits in general, although you could
+ * play with 'struct page' alignment (see CONFIG_HAVE_ALIGNED_STRUCT_PAGE)
+ * for more.
+ *
+ * Use the supplied helper functions to encode/decode the pointer and bits.
+ */
+struct encoded_page;
+#define ENCODE_PAGE_BITS 3ul
+static inline struct encoded_page *encode_page(struct page *page, unsigned long flags)
+{
+	return (struct encoded_page *)(flags | (unsigned long)page);
+}
+
+static inline bool encoded_page_flags(struct encoded_page *page)
+{
+	return ENCODE_PAGE_BITS & (unsigned long)page;
+}
+
+static inline struct page *encoded_page_ptr(struct encoded_page *page)
+{
+	return (struct page *)(~ENCODE_PAGE_BITS & (unsigned long)page);
+}
+
 /**
  * struct folio - Represents a contiguous set of bytes.
  * @flags: Identical to the page flags.
-- 
2.38.1.284.gfd9468d787


^ permalink raw reply related	[flat|nested] 148+ messages in thread
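
The low-bit tagging behind 'struct encoded_page' can be shown in
isolation with a small, self-contained userspace C sketch. The names
here (struct obj, encode(), tag_flags(), tag_ptr()) are invented for the
demo, and it assumes the pointed-to objects are aligned enough that the
two low pointer bits are free, mirroring the ENCODE_PAGE_BITS mask above:

#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Incomplete type: a tagged pointer can only be used via the helpers. */
struct tagged_obj;

#define TAG_MASK 3ul	/* two low bits available for flags */

struct obj {
        int value;
} __attribute__((aligned(8)));	/* keep the low pointer bits free */

static struct tagged_obj *encode(struct obj *p, unsigned long flags)
{
        assert(((uintptr_t)p & TAG_MASK) == 0 && (flags & ~TAG_MASK) == 0);
        return (struct tagged_obj *)((uintptr_t)p | flags);
}

static unsigned long tag_flags(struct tagged_obj *t)
{
        return (uintptr_t)t & TAG_MASK;
}

static struct obj *tag_ptr(struct tagged_obj *t)
{
        return (struct obj *)((uintptr_t)t & ~TAG_MASK);
}

int main(void)
{
        struct obj *o = malloc(sizeof(*o));
        struct tagged_obj *t;

        if (!o)
                return 1;
        o->value = 42;
        t = encode(o, 1);	/* carry one extra flag bit alongside the pointer */

        printf("flags=%lu value=%d\n", tag_flags(t), tag_ptr(t)->value);
        free(o);
        return 0;
}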

* [PATCH 2/4] mm: teach release_pages() to take an array of encoded page pointers too
  2022-11-07 23:47                                                               ` Linus Torvalds
  2022-11-08  4:28                                                                 ` Linus Torvalds
  2022-11-08 19:41                                                                 ` [PATCH 1/4] mm: introduce 'encoded' page pointers with embedded extra bits Linus Torvalds
@ 2022-11-08 19:41                                                                 ` Linus Torvalds
  2022-11-08 19:41                                                                 ` [PATCH 3/4] mm: mmu_gather: prepare to gather encoded page pointers with flags Linus Torvalds
  2022-11-08 19:41                                                                 ` [PATCH 4/4] mm: delay page_remove_rmap() until after the TLB has been flushed Linus Torvalds
  4 siblings, 0 replies; 148+ messages in thread
From: Linus Torvalds @ 2022-11-08 19:41 UTC (permalink / raw)
  To: Hugh Dickins, Johannes Weiner, Andrew Morton; +Cc: linux-kernel, linux-mm

release_pages() already could take either an array of page pointers, or
an array of folio pointers.  Expand it to also accept an array of
encoded page pointers, which is what both the existing mlock() use and
the upcoming mmu_gather use of encoded page pointers wants.

Note that release_pages() won't actually use, or react to, any extra
encoded bits.  Instead, this is very much a case of "I have walked the
array of encoded pages and done everything the extra bits tell me to do,
now release it all".

Also, while the "either page or folio pointers" dual use was handled
with a cast of the pointer in "release_folios()", this takes a slightly
different approach and uses the "transparent union" attribute to
describe the set of arguments to the function:

  https://gcc.gnu.org/onlinedocs/gcc/Common-Type-Attributes.html

which has been supported by gcc forever, but the kernel hasn't used
before.

That allows us to avoid using various wrappers with casts, and just use
the same function regardless of use.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
 include/linux/mm.h | 21 +++++++++++++++++++--
 mm/swap.c          | 16 ++++++++++++----
 2 files changed, 31 insertions(+), 6 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8bbcccbc5565..d9fb5c3e3045 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1179,7 +1179,24 @@ static inline void folio_put_refs(struct folio *folio, int refs)
 		__folio_put(folio);
 }
 
-void release_pages(struct page **pages, int nr);
+/**
+ * release_pages - release an array of pages or folios
+ *
+ * This just releases a simple array of multiple pages, and
+ * accepts various different forms of said page array: either
+ * a regular old boring array of pages, an array of folios, or
+ * an array of encoded page pointers.
+ *
+ * The transparent union syntax for this kind of "any of these
+ * argument types" is all kinds of ugly, so look away.
+ */
+typedef union {
+	struct page **pages;
+	struct folio **folios;
+	struct encoded_page **encoded_pages;
+} release_pages_arg __attribute__ ((__transparent_union__));
+
+void release_pages(release_pages_arg, int nr);
 
 /**
  * folios_put - Decrement the reference count on an array of folios.
@@ -1195,7 +1212,7 @@ void release_pages(struct page **pages, int nr);
  */
 static inline void folios_put(struct folio **folios, unsigned int nr)
 {
-	release_pages((struct page **)folios, nr);
+	release_pages(folios, nr);
 }
 
 static inline void put_page(struct page *page)
diff --git a/mm/swap.c b/mm/swap.c
index 955930f41d20..596ed226ddb8 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -968,22 +968,30 @@ void lru_cache_disable(void)
 
 /**
  * release_pages - batched put_page()
- * @pages: array of pages to release
+ * @arg: array of pages to release
  * @nr: number of pages
  *
- * Decrement the reference count on all the pages in @pages.  If it
+ * Decrement the reference count on all the pages in @arg.  If it
  * fell to zero, remove the page from the LRU and free it.
+ *
+ * Note that the argument can be an array of pages, encoded pages,
+ * or folio pointers. We ignore any encoded bits, and turn any of
+ * them into just a folio that gets free'd.
  */
-void release_pages(struct page **pages, int nr)
+void release_pages(release_pages_arg arg, int nr)
 {
 	int i;
+	struct encoded_page **encoded = arg.encoded_pages;
 	LIST_HEAD(pages_to_free);
 	struct lruvec *lruvec = NULL;
 	unsigned long flags = 0;
 	unsigned int lock_batch;
 
 	for (i = 0; i < nr; i++) {
-		struct folio *folio = page_folio(pages[i]);
+		struct folio *folio;
+
+		/* Turn any of the argument types into a folio */
+		folio = page_folio(encoded_page_ptr(encoded[i]));
 
 		/*
 		 * Make sure the IRQ-safe lock-holding time does not get
-- 
2.38.1.284.gfd9468d787


^ permalink raw reply related	[flat|nested] 148+ messages in thread
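
The transparent-union trick can also be exercised on its own in
userspace. This is a toy sketch (gcc/clang extension; the tiny struct
page/struct folio definitions are stand-ins, not the kernel's, and it
relies on the folio embedding the page as its first member, just like
the real layout):

#include <stdio.h>

struct page  { int id; };
struct folio { struct page page; };	/* first member mirrors the kernel layout */

/*
 * One function that accepts either pointer-array type without casts at
 * the call sites, via the transparent_union attribute.
 */
typedef union {
        struct page  **pages;
        struct folio **folios;
} release_arg __attribute__((__transparent_union__));

static void release_all(release_arg arg, int nr)
{
        /* All members share one representation: an array of object pointers. */
        struct page **pages = arg.pages;

        for (int i = 0; i < nr; i++)
                printf("releasing page %d\n", pages[i]->id);
}

int main(void)
{
        struct page p0 = { 0 }, p1 = { 1 };
        struct folio f = { { 7 } };

        struct page  *pages[]  = { &p0, &p1 };
        struct folio *folios[] = { &f };

        release_all(pages, 2);		/* struct page ** passed directly  */
        release_all(folios, 1);		/* struct folio ** passed directly */
        return 0;
}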

* [PATCH 3/4] mm: mmu_gather: prepare to gather encoded page pointers with flags
  2022-11-07 23:47                                                               ` Linus Torvalds
                                                                                   ` (2 preceding siblings ...)
  2022-11-08 19:41                                                                 ` [PATCH 2/4] mm: teach release_pages() to take an array of encoded page pointers too Linus Torvalds
@ 2022-11-08 19:41                                                                 ` Linus Torvalds
  2022-11-08 19:41                                                                 ` [PATCH 4/4] mm: delay page_remove_rmap() until after the TLB has been flushed Linus Torvalds
  4 siblings, 0 replies; 148+ messages in thread
From: Linus Torvalds @ 2022-11-08 19:41 UTC (permalink / raw)
  To: Hugh Dickins, Johannes Weiner, Andrew Morton; +Cc: linux-kernel, linux-mm

This is purely a preparatory patch that makes all the data structures
ready for encoding flags with the mmu_gather page pointers.

The code currently always sets the flag to zero and doesn't use it yet,
but now it's tracking the type state along.  The next step will be to
actually start using it.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
 include/asm-generic/tlb.h |  2 +-
 include/linux/swap.h      |  2 +-
 mm/mmu_gather.c           |  4 ++--
 mm/swap_state.c           | 11 ++++-------
 4 files changed, 8 insertions(+), 11 deletions(-)

diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index 492dce43236e..faca23e87278 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -242,7 +242,7 @@ struct mmu_gather_batch {
 	struct mmu_gather_batch	*next;
 	unsigned int		nr;
 	unsigned int		max;
-	struct page		*pages[];
+	struct encoded_page	*encoded_pages[];
 };
 
 #define MAX_GATHER_BATCH	\
diff --git a/include/linux/swap.h b/include/linux/swap.h
index a18cf4b7c724..40e418e3461b 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -470,7 +470,7 @@ static inline unsigned long total_swapcache_pages(void)
 
 extern void free_swap_cache(struct page *page);
 extern void free_page_and_swap_cache(struct page *);
-extern void free_pages_and_swap_cache(struct page **, int);
+extern void free_pages_and_swap_cache(struct encoded_page **, int);
 /* linux/mm/swapfile.c */
 extern atomic_long_t nr_swap_pages;
 extern long total_swap_pages;
diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
index add4244e5790..57b7850c1b5e 100644
--- a/mm/mmu_gather.c
+++ b/mm/mmu_gather.c
@@ -48,7 +48,7 @@ static void tlb_batch_pages_flush(struct mmu_gather *tlb)
 	struct mmu_gather_batch *batch;
 
 	for (batch = &tlb->local; batch && batch->nr; batch = batch->next) {
-		struct page **pages = batch->pages;
+		struct encoded_page **pages = batch->encoded_pages;
 
 		do {
 			/*
@@ -92,7 +92,7 @@ bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page, int page_
 	 * Add the page and check if we are full. If so
 	 * force a flush.
 	 */
-	batch->pages[batch->nr++] = page;
+	batch->encoded_pages[batch->nr++] = encode_page(page, 0);
 	if (batch->nr == batch->max) {
 		if (!tlb_next_batch(tlb))
 			return true;
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 438d0676c5be..8bf08c313872 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -303,15 +303,12 @@ void free_page_and_swap_cache(struct page *page)
  * Passed an array of pages, drop them all from swapcache and then release
  * them.  They are removed from the LRU and freed if this is their last use.
  */
-void free_pages_and_swap_cache(struct page **pages, int nr)
+void free_pages_and_swap_cache(struct encoded_page **pages, int nr)
 {
-	struct page **pagep = pages;
-	int i;
-
 	lru_add_drain();
-	for (i = 0; i < nr; i++)
-		free_swap_cache(pagep[i]);
-	release_pages(pagep, nr);
+	for (int i = 0; i < nr; i++)
+		free_swap_cache(encoded_page_ptr(pages[i]));
+	release_pages(pages, nr);
 }
 
 static inline bool swap_use_vma_readahead(void)
-- 
2.38.1.284.gfd9468d787


^ permalink raw reply related	[flat|nested] 148+ messages in thread

* [PATCH 4/4] mm: delay page_remove_rmap() until after the TLB has been flushed
  2022-11-07 23:47                                                               ` Linus Torvalds
                                                                                   ` (3 preceding siblings ...)
  2022-11-08 19:41                                                                 ` [PATCH 3/4] mm: mmu_gather: prepare to gather encoded page pointers with flags Linus Torvalds
@ 2022-11-08 19:41                                                                 ` Linus Torvalds
  2022-11-08 20:48                                                                   ` [lkp] [+115 bytes kernel size regression] [i386-tinyconfig] [0309f16088] " kernel test robot
                                                                                     ` (2 more replies)
  4 siblings, 3 replies; 148+ messages in thread
From: Linus Torvalds @ 2022-11-08 19:41 UTC (permalink / raw)
  To: Hugh Dickins, Johannes Weiner, Andrew Morton
  Cc: linux-kernel, linux-mm, Nadav Amit, Will Deacon, Aneesh Kumar,
	Nick Piggin, Heiko Carstens, Vasily Gorbik, Alexander Gordeev,
	Christian Borntraeger, Sven Schnelle, Peter Zijlstra,
	Gerald Schaefer

When we remove a page table entry, we are very careful to only free the
page after we have flushed the TLB, because other CPUs could still be
using the page through stale TLB entries until after the flush.

However, we have removed the rmap entry for that page early, which means
that functions like folio_mkclean() would end up not serializing with
the page table lock because the page had already been made invisible to
rmap.

And that is a problem, because while the TLB entry exists, we could end
up with the following situation:

 (a) one CPU could come in and clean it, never seeing our mapping of the
     page

 (b) another CPU could continue to use the stale and dirty TLB entry and
     continue to write to said page

resulting in a page that has been dirtied, but then marked clean again,
all while another CPU might have dirtied it some more.

End result: possibly lost dirty data.

This extends our current TLB gather infrastructure to optionally track a
"should I do a delayed page_remove_rmap() for this page after flushing
the TLB".  It uses the newly introduced 'encoded page pointer' to do
that without having to keep separate data around.

Note, this is complicated by a couple of issues:

 - s390 has its own mmu_gather model that doesn't delay TLB flushing,
   and as a result also does not want the delayed rmap

 - we want to delay the rmap removal, but not past the page table lock

 - we can track an enormous number of pages in our mmu_gather structure,
   with MAX_GATHER_BATCH_COUNT batches of MAX_TABLE_BATCH pages each,
   all set up to be approximately 10k pending pages.

   We do not want to have a huge number of batched pages that we then
   need to check for delayed rmap handling inside the page table lock.

Particularly that last point results in a noteworthy detail, where the
normal page batch gathering is limited once we have delayed rmaps
pending, in such a way that only the last batch (the so-called "active
batch") in the mmu_gather structure can have any delayed entries.

NOTE! While the "possibly lost dirty data" sounds catastrophic, for this
all to happen you need to have a user thread doing either madvise() with
MADV_DONTNEED or a full re-mmap() of the area concurrently with another
thread continuing to use said mapping.

So arguably this is about user space doing crazy things, but from a VM
consistency standpoint it's better if we track the dirty bit properly
even when user space goes off the rails.

Reported-by: Nadav Amit <nadav.amit@gmail.com>
Link: https://lore.kernel.org/all/B88D3073-440A-41C7-95F4-895D3F657EF2@gmail.com/
Cc: Will Deacon <will@kernel.org>
Cc: Aneesh Kumar <aneesh.kumar@linux.ibm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Nick Piggin <npiggin@gmail.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com> # s390
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
 arch/s390/include/asm/tlb.h | 21 +++++++++++++++++++--
 include/asm-generic/tlb.h   | 21 +++++++++++++++++----
 mm/memory.c                 | 23 +++++++++++++++++------
 mm/mmu_gather.c             | 35 +++++++++++++++++++++++++++++++++--
 4 files changed, 86 insertions(+), 14 deletions(-)

diff --git a/arch/s390/include/asm/tlb.h b/arch/s390/include/asm/tlb.h
index 3a5c8fb590e5..e5903ee2f1ca 100644
--- a/arch/s390/include/asm/tlb.h
+++ b/arch/s390/include/asm/tlb.h
@@ -25,7 +25,8 @@
 void __tlb_remove_table(void *_table);
 static inline void tlb_flush(struct mmu_gather *tlb);
 static inline bool __tlb_remove_page_size(struct mmu_gather *tlb,
-					  struct page *page, int page_size);
+					  struct page *page, int page_size,
+					  unsigned int flags);
 
 #define tlb_flush tlb_flush
 #define pte_free_tlb pte_free_tlb
@@ -36,13 +37,24 @@ static inline bool __tlb_remove_page_size(struct mmu_gather *tlb,
 #include <asm/tlbflush.h>
 #include <asm-generic/tlb.h>
 
+/*
+ * s390 never needs to delay page_remove_rmap, because
+ * the ptep_get_and_clear_full() will have flushed the
+ * TLB across CPUs
+ */
+static inline bool tlb_delay_rmap(struct mmu_gather *tlb)
+{
+	return false;
+}
+
 /*
  * Release the page cache reference for a pte removed by
  * tlb_ptep_clear_flush. In both flush modes the tlb for a page cache page
  * has already been freed, so just do free_page_and_swap_cache.
  */
 static inline bool __tlb_remove_page_size(struct mmu_gather *tlb,
-					  struct page *page, int page_size)
+					  struct page *page, int page_size,
+					  unsigned int flags)
 {
 	free_page_and_swap_cache(page);
 	return false;
@@ -53,6 +65,11 @@ static inline void tlb_flush(struct mmu_gather *tlb)
 	__tlb_flush_mm_lazy(tlb->mm);
 }
 
+static inline void tlb_flush_rmaps(struct mmu_gather *tlb, struct vm_area_struct *vma)
+{
+	/* Nothing to do, s390 does not delay rmaps */
+}
+
 /*
  * pte_free_tlb frees a pte table and clears the CRSTE for the
  * page table from the tlb.
diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index faca23e87278..9df513e5ad28 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -257,7 +257,15 @@ struct mmu_gather_batch {
 #define MAX_GATHER_BATCH_COUNT	(10000UL/MAX_GATHER_BATCH)
 
 extern bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page,
-				   int page_size);
+				   int page_size, unsigned int flags);
+extern void tlb_flush_rmaps(struct mmu_gather *tlb, struct vm_area_struct *vma);
+
+/*
+ * This both sets 'delayed_rmap', and returns true. It would be an inline
+ * function, except we define it before the 'struct mmu_gather'.
+ */
+#define tlb_delay_rmap(tlb) (((tlb)->delayed_rmap = 1), true)
+
 #endif
 
 /*
@@ -290,6 +298,11 @@ struct mmu_gather {
 	 */
 	unsigned int		freed_tables : 1;
 
+	/*
+	 * Do we have pending delayed rmap removals?
+	 */
+	unsigned int		delayed_rmap : 1;
+
 	/*
 	 * at which levels have we cleared entries?
 	 */
@@ -431,13 +444,13 @@ static inline void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb)
 static inline void tlb_remove_page_size(struct mmu_gather *tlb,
 					struct page *page, int page_size)
 {
-	if (__tlb_remove_page_size(tlb, page, page_size))
+	if (__tlb_remove_page_size(tlb, page, page_size, 0))
 		tlb_flush_mmu(tlb);
 }
 
-static inline bool __tlb_remove_page(struct mmu_gather *tlb, struct page *page)
+static inline bool __tlb_remove_page(struct mmu_gather *tlb, struct page *page, unsigned int flags)
 {
-	return __tlb_remove_page_size(tlb, page, PAGE_SIZE);
+	return __tlb_remove_page_size(tlb, page, PAGE_SIZE, flags);
 }
 
 /* tlb_remove_page
diff --git a/mm/memory.c b/mm/memory.c
index f88c351aecd4..60a0f44f6e72 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1432,6 +1432,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 			break;
 
 		if (pte_present(ptent)) {
+			unsigned int delay_rmap;
+
 			page = vm_normal_page(vma, addr, ptent);
 			if (unlikely(!should_zap_page(details, page)))
 				continue;
@@ -1443,20 +1445,26 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 			if (unlikely(!page))
 				continue;
 
+			delay_rmap = 0;
 			if (!PageAnon(page)) {
 				if (pte_dirty(ptent)) {
-					force_flush = 1;
 					set_page_dirty(page);
+					if (tlb_delay_rmap(tlb)) {
+						delay_rmap = 1;
+						force_flush = 1;
+					}
 				}
 				if (pte_young(ptent) &&
 				    likely(!(vma->vm_flags & VM_SEQ_READ)))
 					mark_page_accessed(page);
 			}
 			rss[mm_counter(page)]--;
-			page_remove_rmap(page, vma, false);
-			if (unlikely(page_mapcount(page) < 0))
-				print_bad_pte(vma, addr, ptent, page);
-			if (unlikely(__tlb_remove_page(tlb, page))) {
+			if (!delay_rmap) {
+				page_remove_rmap(page, vma, false);
+				if (unlikely(page_mapcount(page) < 0))
+					print_bad_pte(vma, addr, ptent, page);
+			}
+			if (unlikely(__tlb_remove_page(tlb, page, delay_rmap))) {
 				force_flush = 1;
 				addr += PAGE_SIZE;
 				break;
@@ -1513,8 +1521,11 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 	arch_leave_lazy_mmu_mode();
 
 	/* Do the actual TLB flush before dropping ptl */
-	if (force_flush)
+	if (force_flush) {
 		tlb_flush_mmu_tlbonly(tlb);
+		if (tlb->delayed_rmap)
+			tlb_flush_rmaps(tlb, vma);
+	}
 	pte_unmap_unlock(start_pte, ptl);
 
 	/*
diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
index 57b7850c1b5e..136f5fad43e3 100644
--- a/mm/mmu_gather.c
+++ b/mm/mmu_gather.c
@@ -9,6 +9,7 @@
 #include <linux/rcupdate.h>
 #include <linux/smp.h>
 #include <linux/swap.h>
+#include <linux/rmap.h>
 
 #include <asm/pgalloc.h>
 #include <asm/tlb.h>
@@ -19,6 +20,10 @@ static bool tlb_next_batch(struct mmu_gather *tlb)
 {
 	struct mmu_gather_batch *batch;
 
+	/* No more batching if we have delayed rmaps pending */
+	if (tlb->delayed_rmap)
+		return false;
+
 	batch = tlb->active;
 	if (batch->next) {
 		tlb->active = batch->next;
@@ -43,6 +48,31 @@ static bool tlb_next_batch(struct mmu_gather *tlb)
 	return true;
 }
 
+/**
+ * tlb_flush_rmaps - do pending rmap removals after we have flushed the TLB
+ * @tlb: the current mmu_gather
+ *
+ * Note that because of how tlb_next_batch() above works, we will
+ * never start new batches with pending delayed rmaps, so we only
+ * need to walk through the current active batch.
+ */
+void tlb_flush_rmaps(struct mmu_gather *tlb, struct vm_area_struct *vma)
+{
+	struct mmu_gather_batch *batch;
+
+	batch = tlb->active;
+	for (int i = 0; i < batch->nr; i++) {
+		struct encoded_page *enc = batch->encoded_pages[i];
+
+		if (encoded_page_flags(enc)) {
+			struct page *page = encoded_page_ptr(enc);
+			page_remove_rmap(page, vma, false);
+		}
+	}
+
+	tlb->delayed_rmap = 0;
+}
+
 static void tlb_batch_pages_flush(struct mmu_gather *tlb)
 {
 	struct mmu_gather_batch *batch;
@@ -77,7 +107,7 @@ static void tlb_batch_list_free(struct mmu_gather *tlb)
 	tlb->local.next = NULL;
 }
 
-bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page, int page_size)
+bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page, int page_size, unsigned int flags)
 {
 	struct mmu_gather_batch *batch;
 
@@ -92,7 +122,7 @@ bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page, int page_
 	 * Add the page and check if we are full. If so
 	 * force a flush.
 	 */
-	batch->encoded_pages[batch->nr++] = encode_page(page, 0);
+	batch->encoded_pages[batch->nr++] = encode_page(page, flags);
 	if (batch->nr == batch->max) {
 		if (!tlb_next_batch(tlb))
 			return true;
@@ -286,6 +316,7 @@ static void __tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm,
 	tlb->active     = &tlb->local;
 	tlb->batch_count = 0;
 #endif
+	tlb->delayed_rmap = 0;
 
 	tlb_table_init(tlb);
 #ifdef CONFIG_MMU_GATHER_PAGE_SIZE
-- 
2.38.1.284.gfd9468d787


^ permalink raw reply related	[flat|nested] 148+ messages in thread
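
To make the batching invariant from the commit message more concrete,
here is a compressed userspace C sketch. Everything in it (the two fixed
batches, the tiny BATCH_MAX, the structure and function names) is
invented for illustration and far smaller than the real mmu_gather
machinery; the one property it mirrors is that once a delayed rmap is
recorded no new batch is opened, so the post-flush walk only has to
visit the active batch:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define BATCH_MAX	4
#define FLAG_DELAY_RMAP	1ul

struct gather_batch {
        int nr;
        uintptr_t encoded[BATCH_MAX];	/* low bit = rmap removal still pending */
};

struct gather {
        struct gather_batch *active;
        bool delayed_rmap;
        struct gather_batch local, extra;
};

/*
 * Refuse to open a new batch while delayed rmaps are pending, so the
 * post-flush walk below only ever has to look at the active batch.
 */
static bool next_batch(struct gather *g)
{
        if (g->delayed_rmap)
                return false;
        if (g->active == &g->local) {
                g->active = &g->extra;
                return true;
        }
        return false;
}

/* Returns true when the caller must flush before gathering more pages. */
static bool gather_page(struct gather *g, uintptr_t page, bool delay_rmap)
{
        struct gather_batch *b = g->active;

        if (delay_rmap)
                g->delayed_rmap = true;
        b->encoded[b->nr++] = page | (delay_rmap ? FLAG_DELAY_RMAP : 0);
        return b->nr == BATCH_MAX && !next_batch(g);
}

/*
 * Called after the (simulated) TLB flush: only the active batch can
 * hold entries with the delay bit set.
 */
static void flush_rmaps(struct gather *g)
{
        struct gather_batch *b = g->active;

        for (int i = 0; i < b->nr; i++)
                if (b->encoded[i] & FLAG_DELAY_RMAP)
                        printf("delayed rmap removal for page %#lx\n",
                               (unsigned long)(b->encoded[i] & ~FLAG_DELAY_RMAP));
        g->delayed_rmap = false;
}

int main(void)
{
        struct gather g = { .active = &g.local };

        gather_page(&g, 0x1000, false);
        gather_page(&g, 0x2000, true);	/* dirty file-backed page: delay rmap */
        printf("(TLB flush happens here)\n");
        flush_rmaps(&g);
        return 0;
}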

* Re: mm: delay rmap removal until after TLB flush
  2022-11-08  4:28                                                                 ` Linus Torvalds
@ 2022-11-08 19:56                                                                   ` Linus Torvalds
  2022-11-08 20:03                                                                     ` Konstantin Ryabitsev
  0 siblings, 1 reply; 148+ messages in thread
From: Linus Torvalds @ 2022-11-08 19:56 UTC (permalink / raw)
  To: Johannes Weiner, Konstantin Ryabitsev
  Cc: Hugh Dickins, Stephen Rothwell, Alexander Gordeev,
	Peter Zijlstra, Will Deacon, Aneesh Kumar, Nick Piggin,
	Heiko Carstens, Vasily Gorbik, Christian Borntraeger,
	Sven Schnelle, Nadav Amit, Jann Horn, John Hubbard, X86 ML,
	Matthew Wilcox, Andrew Morton, kernel list, Linux-MM,
	Andrea Arcangeli, Kirill A . Shutemov, Joerg Roedel, Uros Bizjak,
	Alistair Popple, linux-arch

On Mon, Nov 7, 2022 at 8:28 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> I'm sending this out because I'm stepping away from the keyboard,
> because that whole "let's massage it into something legible" was
> really somewhat exhausting. You don't see all the small side turns it
> took only to go "that's ugly, let's try again" ;)

Ok, I actually sent the individual patches with 'git-send-email',
although I only sent them to the mailing list and to people that were
mentioned in the commit descriptions.

I hope that makes review easier.

See

   https://lore.kernel.org/all/20221108194139.57604-1-torvalds@linux-foundation.org

for the series if you weren't mentioned and are interested.

Oh, and because I decided to just use the email in this thread as the
reference and cover letter, it turns out that this all confuses 'b4',
because it actually walks up the whole thread all the way to the
original 13-patch series by PeterZ that started this whole discussion.

I've seen that before with other people's patch series, but now that it
happened to my own, I'm cc'ing Konstantin here too to see if there's
some magic for b4 to say "look, I pointed you to a msg-id that is
clearly a new series, don't walk all the way up and then take patches
from a completely different one".

Oh well. I guess I should just have not been lazy and done a
cover-letter and a whole new thread.

My bad.

Konstantin, please help me look like less of a tool?

                    Linus

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: mm: delay rmap removal until after TLB flush
  2022-11-08 19:56                                                                   ` Linus Torvalds
@ 2022-11-08 20:03                                                                     ` Konstantin Ryabitsev
  2022-11-08 20:18                                                                       ` Linus Torvalds
  0 siblings, 1 reply; 148+ messages in thread
From: Konstantin Ryabitsev @ 2022-11-08 20:03 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Johannes Weiner, Hugh Dickins, Stephen Rothwell,
	Alexander Gordeev, Peter Zijlstra, Will Deacon, Aneesh Kumar,
	Nick Piggin, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, Sven Schnelle, Nadav Amit, Jann Horn,
	John Hubbard, X86 ML, Matthew Wilcox, Andrew Morton, kernel list,
	Linux-MM, Andrea Arcangeli, Kirill A . Shutemov, Joerg Roedel,
	Uros Bizjak, Alistair Popple, linux-arch

On Tue, Nov 08, 2022 at 11:56:13AM -0800, Linus Torvalds wrote:
> On Mon, Nov 7, 2022 at 8:28 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > I'm sending this out because I'm stepping away from the keyboard,
> > because that whole "let's massage it into something legible" was
> > really somewhat exhausting. You don't see all the small side turns it
> > took only to go "that's ugly, let's try again" ;)
> 
> Ok, I actually sent the individual patches with 'git-send-email',
> although I only sent them to the mailing list and to people that were
> mentioned in the commit descriptions.
> 
> I hope that makes review easier.
> 
> See
> 
>    https://lore.kernel.org/all/20221108194139.57604-1-torvalds@linux-foundation.org
> 
> for the series if you weren't mentioned and are interested.
> 
> Oh, and because I decided to just use the email in this thread as the
> reference and cover letter, it turns out that this all confuses 'b4',
> because it actually walks up the whole thread all the way to the
> original 13-patch series by PeterZ that started this whole discussion.
> 
> I've seen that before with other people's patch series, but now that it
> happened to my own, I'm cc'ing Konstantin here too to see if there's
> some magic for b4 to say "look, I pointed you to a msg-id that is
> clearly a new series, don't walk all the way up and then take patches
> from a completely different one".

Yes, --no-parent.

It's slightly more complicated in your case because the patches aren't
threaded to the first patch/cover letter, but you can choose an arbitrary
msgid upthread and tell b4 to ignore anything that came before it. E.g.:

b4 am -o/tmp --no-parent 20221108194139.57604-1-torvalds@linux-foundation.org

-K

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: mm: delay rmap removal until after TLB flush
  2022-11-08 20:03                                                                     ` Konstantin Ryabitsev
@ 2022-11-08 20:18                                                                       ` Linus Torvalds
  0 siblings, 0 replies; 148+ messages in thread
From: Linus Torvalds @ 2022-11-08 20:18 UTC (permalink / raw)
  To: Konstantin Ryabitsev
  Cc: Johannes Weiner, Hugh Dickins, Stephen Rothwell,
	Alexander Gordeev, Peter Zijlstra, Will Deacon, Aneesh Kumar,
	Nick Piggin, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, Sven Schnelle, Nadav Amit, Jann Horn,
	John Hubbard, X86 ML, Matthew Wilcox, Andrew Morton, kernel list,
	Linux-MM, Andrea Arcangeli, Kirill A . Shutemov, Joerg Roedel,
	Uros Bizjak, Alistair Popple, linux-arch

On Tue, Nov 8, 2022 at 12:03 PM Konstantin Ryabitsev <mricon@kernel.org> wrote:
>
> Yes, --no-parent.

Ahh, that's new. I guess I need to update my ancient b4-0.8.0 install..

But yes, with that, and the manual parent lookup (because otherwise
"--no-parent" will fetch *just* the patch itself, not even walking up
the single parent chain), it works.

Maybe a "--single-parent" or "--deadbeat-parent" option would be a good idea?

Anyway, with a more recent b4 version, the command

   b4 am --no-parent CAHk-=wh6MxaCA4pXpt1F5Bn2__6MxCq0Dr-rES4i=MOL9ibjpg@mail.gmail.com

gets that series and only that series.

              Linus

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 1/4] mm: introduce 'encoded' page pointers with embedded extra bits
  2022-11-08 19:41                                                                 ` [PATCH 1/4] mm: introduce 'encoded' page pointers with embedded extra bits Linus Torvalds
@ 2022-11-08 20:37                                                                   ` Nadav Amit
  2022-11-08 20:46                                                                     ` Linus Torvalds
  2022-11-09  6:36                                                                   ` Alexander Gordeev
  1 sibling, 1 reply; 148+ messages in thread
From: Nadav Amit @ 2022-11-08 20:37 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Hugh Dickins, Johannes Weiner, Andrew Morton, kernel list, linux-mm

On Nov 8, 2022, at 11:41 AM, Linus Torvalds <torvalds@linux-foundation.org> wrote:

> We already have this notion in parts of the MM code (see the mlock code
> with the LRU_PAGE and NEW_PAGE bits), but I'm going to introduce a new
> case, and I refuse to do the same thing we've done before where we just
> put bits in the raw pointer and say it's still a normal pointer.
> 
> So this introduces a 'struct encoded_page' pointer that cannot be used
> for anything else than to encode a real page pointer and a couple of
> extra bits in the low bits.  That way the compiler can trivially track
> the state of the pointer and you just explicitly encode and decode the
> extra bits.

I tested all of the patches again with the PoC. They pass.

> 
> +struct encoded_page;
> +#define ENCODE_PAGE_BITS 3ul
> +static inline struct encoded_page *encode_page(struct page *page, unsigned long flags)
> +{
> +	return (struct encoded_page *)(flags | (unsigned long)page);
> +}
> +
> +static inline bool encoded_page_flags(struct encoded_page *page)
> +{
> +	return ENCODE_PAGE_BITS & (unsigned long)page;
> +}

I think this one wants to return some unsigned type, as otherwise why
have ENCODE_PAGE_BITS as 3ul?
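
I.e. just the return type, roughly like the sketch below (illustrative
only; whatever actually lands may differ):

   /* Return the actual flag bits instead of collapsing them to a bool. */
   static inline unsigned long encoded_page_flags(struct encoded_page *page)
   {
	return ENCODE_PAGE_BITS & (unsigned long)page;
   }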


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 1/4] mm: introduce 'encoded' page pointers with embedded extra bits
  2022-11-08 20:37                                                                   ` Nadav Amit
@ 2022-11-08 20:46                                                                     ` Linus Torvalds
  0 siblings, 0 replies; 148+ messages in thread
From: Linus Torvalds @ 2022-11-08 20:46 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Hugh Dickins, Johannes Weiner, Andrew Morton, kernel list, linux-mm

On Tue, Nov 8, 2022 at 12:37 PM Nadav Amit <nadav.amit@gmail.com> wrote:
>
> > +static inline bool encoded_page_flags(struct encoded_page *page)
> > +{
> > +     return ENCODE_PAGE_BITS & (unsigned long)page;
> > +}
>
> I think this one wants to return some unsigned type, as otherwise why
> have ENCODE_PAGE_BITS as 3ul?

Right you are. That came from my old old version where this was just
"bool dirty".

Will fix.

Doesn't matter for the TLB flushing case, but I really did hope that
we could use this for mlock too, and that case needs both bits.

I did look at converting mlock (and it's why I wanted to make
release_pages() take that whole encoded thing in general, rather than
make some special case for it), but the mlock code uses that "struct
pagevec" abstraction that seems entirely pointless ("pvec->nr" becomes
"pagevec_count(pvec)", which really doesn't seem to be any clearer at
all), but whatever.

               Linus

^ permalink raw reply	[flat|nested] 148+ messages in thread

* [lkp] [+115 bytes kernel size regression] [i386-tinyconfig] [0309f16088] mm: delay page_remove_rmap() until after the TLB has been flushed
  2022-11-08 19:41                                                                 ` [PATCH 4/4] mm: delay page_remove_rmap() until after the TLB has been flushed Linus Torvalds
@ 2022-11-08 20:48                                                                   ` kernel test robot
  2022-11-08 21:01                                                                     ` Linus Torvalds
  2022-11-08 21:05                                                                   ` [PATCH 4/4] " Nadav Amit
  2022-11-09 15:53                                                                   ` Johannes Weiner
  2 siblings, 1 reply; 148+ messages in thread
From: kernel test robot @ 2022-11-08 20:48 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: oe-kbuild-all, lkp


FYI, we noticed a +115 bytes kernel size regression due to commit:

commit: 0309f160889482100fd90a49f93b0ef49d58f128 (mm: delay page_remove_rmap() until after the TLB has been flushed)
url: https://github.com/intel-lab-lkp/linux/commits/Linus-Torvalds/mm-introduce-encoded-page-pointers-with-embedded-extra-bits/20221109-034318
base: https://git.kernel.org/cgit/linux/kernel/git/akpm/mm.git mm-everything
patch subject: [PATCH 4/4] mm: delay page_remove_rmap() until after the TLB has been flushed


Details as below (size data is obtained by `nm --size-sort vmlinux`):

9b569da9: mm: mmu_gather: prepare to gather encoded page pointers with flags
0309f160: mm: delay page_remove_rmap() until after the TLB has been flushed

+------------------------------+----------+----------+-------+
|            symbol            | 9b569da9 | 0309f160 | delta |
+------------------------------+----------+----------+-------+
| bzImage                      | 491712   | 491840   | 128   |
| nm.T.tlb_flush_rmaps         | 0        | 50       | 50    |
| nm.t.zap_pte_range           | 561      | 610      | 49    |
| nm.T.__tlb_remove_page_size  | 98       | 106      | 8     |
| nm.t.tlb_flush_mmu_tlbonly   | 124      | 129      | 5     |
| nm.T.tlb_flush_mmu           | 168      | 173      | 5     |
| nm.T.___pte_free_tlb         | 48       | 51       | 3     |
| nm.t.change_protection_range | 416      | 414      | -2    |
| nm.T.unmap_page_range        | 268      | 265      | -3    |
+------------------------------+----------+----------+-------+



Thanks



^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [lkp] [+115 bytes kernel size regression] [i386-tinyconfig] [0309f16088] mm: delay page_remove_rmap() until after the TLB has been flushed
  2022-11-08 20:48                                                                   ` [lkp] [+115 bytes kernel size regression] [i386-tinyconfig] [0309f16088] " kernel test robot
@ 2022-11-08 21:01                                                                     ` Linus Torvalds
  0 siblings, 0 replies; 148+ messages in thread
From: Linus Torvalds @ 2022-11-08 21:01 UTC (permalink / raw)
  To: kernel test robot; +Cc: oe-kbuild-all

On Tue, Nov 8, 2022 at 12:49 PM kernel test robot <lkp@intel.com> wrote:
>
> FYI, we noticed a +115 bytes kernel size regression due to commit:

Heh. I don't think I personally care, but just for posterity and in
case somebody else does care, we could make the UP case not delay
rmaps at all, matching the S390 case.

The reason is simply that rmap delay is only needed for MP TLB
coherency. So not important for small UP systems (which is presumably
the only case that would care about those 115 bytes).

But I'll leave that as an exercise for the reader who cares.
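
For the record, that exercise is presumably just a few lines; an untested
sketch (helper names assumed from this patch series) could compile the
delay out on !SMP the same way the s390/MMU_GATHER_NO_GATHER case does:

   /* Untested sketch: UP has no remote TLBs, so there is nothing to delay. */
   #ifndef CONFIG_SMP
   #define tlb_delay_rmap(tlb)		(false)
   static inline void tlb_flush_rmaps(struct mmu_gather *tlb,
				      struct vm_area_struct *vma) { }
   #endif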

                 Linus

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 4/4] mm: delay page_remove_rmap() until after the TLB has been flushed
  2022-11-08 19:41                                                                 ` [PATCH 4/4] mm: delay page_remove_rmap() until after the TLB has been flushed Linus Torvalds
  2022-11-08 20:48                                                                   ` [lkp] [+115 bytes kernel size regression] [i386-tinyconfig] [0309f16088] " kernel test robot
@ 2022-11-08 21:05                                                                   ` Nadav Amit
  2022-11-09 15:53                                                                   ` Johannes Weiner
  2 siblings, 0 replies; 148+ messages in thread
From: Nadav Amit @ 2022-11-08 21:05 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Hugh Dickins, Johannes Weiner, Andrew Morton, kernel list,
	Linux-MM, Will Deacon, Aneesh Kumar, Nick Piggin, Heiko Carstens,
	Vasily Gorbik, Alexander Gordeev, Christian Borntraeger,
	Sven Schnelle, Peter Zijlstra, Gerald Schaefer,
	David Hildenbrand

On Nov 8, 2022, at 11:41 AM, Linus Torvalds <torvalds@linux-foundation.org> wrote:

> When we remove a page table entry, we are very careful to only free the
> page after we have flushed the TLB, because other CPUs could still be
> using the page through stale TLB entries until after the flush.

The patches (all 4) look fine to me.

I mean there are minor issues here and there, like s390's tlb_flush_rmaps()
that can have VM_WARN_ON(1); the generic tlb_flush_rmaps() that is missing an
empty line after the 'page' variable definition; or perhaps using __bitwise
for sparse (as David pointed out) -- but it can all be addressed later.


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 1/4] mm: introduce 'encoded' page pointers with embedded extra bits
  2022-11-08 19:41                                                                 ` [PATCH 1/4] mm: introduce 'encoded' page pointers with embedded extra bits Linus Torvalds
  2022-11-08 20:37                                                                   ` Nadav Amit
@ 2022-11-09  6:36                                                                   ` Alexander Gordeev
  2022-11-09 18:00                                                                     ` Linus Torvalds
  1 sibling, 1 reply; 148+ messages in thread
From: Alexander Gordeev @ 2022-11-09  6:36 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Hugh Dickins, Johannes Weiner, Andrew Morton, linux-kernel, linux-mm

On Tue, Nov 08, 2022 at 11:41:36AM -0800, Linus Torvalds wrote:

Hi Linus,

[...]

> +struct encoded_page;
> +#define ENCODE_PAGE_BITS 3ul
> +static inline struct encoded_page *encode_page(struct page *page, unsigned long flags)
> +{

Any reaction in case ((flags & ~ENCODE_PAGE_BITS) != 0)?

> +	return (struct encoded_page *)(flags | (unsigned long)page);
> +}

Thanks!

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 4/4] mm: delay page_remove_rmap() until after the TLB has been flushed
  2022-11-08 19:41                                                                 ` [PATCH 4/4] mm: delay page_remove_rmap() until after the TLB has been flushed Linus Torvalds
  2022-11-08 20:48                                                                   ` [lkp] [+115 bytes kernel size regression] [i386-tinyconfig] [0309f16088] " kernel test robot
  2022-11-08 21:05                                                                   ` [PATCH 4/4] " Nadav Amit
@ 2022-11-09 15:53                                                                   ` Johannes Weiner
  2022-11-09 19:31                                                                     ` Hugh Dickins
  2 siblings, 1 reply; 148+ messages in thread
From: Johannes Weiner @ 2022-11-09 15:53 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Hugh Dickins, Andrew Morton, linux-kernel, linux-mm, Nadav Amit,
	Will Deacon, Aneesh Kumar, Nick Piggin, Heiko Carstens,
	Vasily Gorbik, Alexander Gordeev, Christian Borntraeger,
	Sven Schnelle, Peter Zijlstra, Gerald Schaefer

All 4 patches look good to me from an MM and cgroup point of view.

And with the pte still locked over rmap, we can continue with the
removal of the cgroup-specific locking and rely on native MM
synchronization, which is great as well.

Thanks,
Johannes

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 1/4] mm: introduce 'encoded' page pointers with embedded extra bits
  2022-11-09  6:36                                                                   ` Alexander Gordeev
@ 2022-11-09 18:00                                                                     ` Linus Torvalds
  2022-11-09 20:02                                                                       ` Linus Torvalds
  0 siblings, 1 reply; 148+ messages in thread
From: Linus Torvalds @ 2022-11-09 18:00 UTC (permalink / raw)
  To: Alexander Gordeev
  Cc: Hugh Dickins, Johannes Weiner, Andrew Morton, linux-kernel, linux-mm

On Tue, Nov 8, 2022 at 10:38 PM Alexander Gordeev
<agordeev@linux.ibm.com> wrote:
>
> On Tue, Nov 08, 2022 at 11:41:36AM -0800, Linus Torvalds wrote:
>
> > +static inline struct encoded_page *encode_page(struct page *page, unsigned long flags)
> > +{
>
> Any reaction in case ((flags & ~ENCODE_PAGE_BITS) != 0)?

Heh. I've actually had three different implementations for that during
the development series, and I think I even posted them all at one
point or another (although usually just as attachments). And none of
them are good.

Those three trivial versions are: (a) use VM_BUG_ON(), (b) just
silently mask the bits and (c) just silently add them.

And (c) is the least annoying option that this latest patch uses,
because both (a) and (b) are just nasty.

Basically, all users are locally trivial to verify statically, so
VM_BUG_ON() is just conceptually wrong and generates extra pointless
code. And the silent masking - if it makes any difference - is just
another version of "just silently add the bits": regardless of whether
it clears them or not, it does the wrong thing if the bits don't fit.

So there are three bad options, I've gone back and forth between them
all, and I chose the least offensive one that is "invisible", in that
it at least doesn't do any extra pointless work.
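
Spelled out, the three encode_page() bodies would look roughly like this
(sketches only, to show the difference):

   /* (a) runtime check: extra code for something statically obvious */
   VM_BUG_ON(flags & ~ENCODE_PAGE_BITS);
   return (struct encoded_page *)(flags | (unsigned long)page);

   /* (b) silently mask: still wrong whenever the bits didn't fit */
   return (struct encoded_page *)((flags & ENCODE_PAGE_BITS) | (unsigned long)page);

   /* (c) silently add the bits (what the posted patch does) */
   return (struct encoded_page *)(flags | (unsigned long)page);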

Now, there are two non-offensive options too, which I actually
considered but never implemented. They both fix the problem
properly, by making it a *build-time* check, but they have other
issues.

There are two ways to just make it a build-time check, and it's
annoyingly _close_ to being usable, but not quite there.

One is simply to require that the flags argument is always a plain
constant, and then just use BUILD_BUG_ON().

I actually almost went down that path - one of the things I considered
was to not add a 'flags' argument to __tlb_remove_page() at all, but
instead just have separate __tlb_remove_page() and
__tlb_remove_page_dirty() functions.

That would have meant that the argument to __tlb_remove_page_size
would have always been a build-time constant, and then it would be
trivial to just have that BUILD_BUG_ON(). Problem solved.

But it turns out that it's just nasty, particularly with different
configurations wanting different rules for what the dirty bit is. So
forcing it to some constant value was really not acceptable.

The thing that I actually *wanted* to do, but didn't actually dare,
was to just say "I will trust the compiler to do the value range
tracking".

Because *technically* our BUILD_BUG_ON() doesn't need a compile-time
constant. Because our implementation of BUILD_BUG_ON() is not the
garbage that the compiler gives us in "_Static_assert()" that really
requires a syntactically pure integer constant expression.

So the kernel version of BUILD_BUG_ON() is actually something much
smarter: it depends on the compiler actually *optimizing* the
expression, and it's only that optimized value that needs to be
determined at compile-time to be either true or false. You can use
things like inline functions etc, just as long as the end result is
obvious enough that the compiler ends up saying "ok, that's never the
case".

And *if* the compiler does any kind of reasonable range analysis, then a

        BUILD_BUG_ON(flags > ENCODE_PAGE_BITS);

should actually work. In theory.

In practice? Not so much.

Because while the argument isn't constant (not even in the caller),
the compiler *should* be smart enough to see that in the use in
mm/memory.c, 'flags' is always that

        unsigned int delay_rmap;

which then gets initialized to

        delay_rmap = 0;

and conditionally set to '1' later. So it's not a *constant*, but the
compiler can see that the value of flags is clearly never larger than
ENCODE_PAGE_BITS.

But right now the compiler cannot track that over the non-inline
function in __tlb_remove_page_size().

Maybe if the 'encode_page()' was done in the caller, and
__tlb_remove_page_size() were to just take an encoded_page as the
argument, then the compiler would always only see this all through
inlined functions, and it would work.
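
I.e. something along these lines (just a sketch of the reshuffle; the
names follow this series, the details may well differ):

   /* The out-of-line function only ever sees an already-encoded pointer. */
   bool __tlb_remove_page_size(struct mmu_gather *tlb,
			       struct encoded_page *page, int page_size);

   /*
    * The encoding -- and any BUILD_BUG_ON() on the flag bits -- happens
    * in the always-inlined caller, where the compiler can still see the
    * value range of 'flags'.
    */
   static __always_inline bool __tlb_remove_page(struct mmu_gather *tlb,
						 struct page *page,
						 unsigned int flags)
   {
	return __tlb_remove_page_size(tlb, encode_page(page, flags), PAGE_SIZE);
   }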

But even if it were to work for me (I never tried), I'd have been much
too worried that some other compiler version, with some other config
options, on some other architecture, wouldn't make the required
optimizations.

We do require compiler optimizations to be on for 'BUILD_BUG_ON()' to
do anything at all:

   #ifdef __OPTIMIZE__
   # define __compiletime_assert(condition, msg, prefix, suffix)           \
   ..
   #else
   # define __compiletime_assert(condition, msg, prefix, suffix) do { } while (0)
   #endif

and we have a lot of places that depend on BUILD_BUG_ON() to do basic
constant folding and other fairly simple optimizations.

But while I think a BUILD_BUG_ON() would be the right thing to do
here, I do not feel confident enough to really put that to the test.

              Linus

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 4/4] mm: delay page_remove_rmap() until after the TLB has been flushed
  2022-11-09 15:53                                                                   ` Johannes Weiner
@ 2022-11-09 19:31                                                                     ` Hugh Dickins
  0 siblings, 0 replies; 148+ messages in thread
From: Hugh Dickins @ 2022-11-09 19:31 UTC (permalink / raw)
  To: Linus Torvalds, Johannes Weiner
  Cc: Hugh Dickins, Andrew Morton, linux-kernel, linux-mm, Nadav Amit,
	Will Deacon, Aneesh Kumar, Nick Piggin, Heiko Carstens,
	Vasily Gorbik, Alexander Gordeev, Christian Borntraeger,
	Sven Schnelle, Peter Zijlstra, Gerald Schaefer

On Wed, 9 Nov 2022, Johannes Weiner wrote:

> All 4 patches look good to me from an MM and cgroup point of view.

Yes, same here from me.

I was running my load on them (applied to 6.1-rc4) overnight, intending
to go for 20 hours.  It stopped just a few minutes short, for some fork
ENOMEM reason I've never (or not in a long time) seen before; but I don't
often run for that long, and I think if there were some new error in the
page freeing in the patches, it would have shown up very much quicker.

So I'd guess the failure was 99.9% likely unrelated, and please go ahead
with getting the patches into mm-unstable.

> 
> And with the pte still locked over rmap, we can continue with the
> removal of the cgroup-specific locking and rely on native MM
> synchronization, which is great as well.

Yes, please go ahead with that Johannes: and many thanks for coming to
the rescue with your input on the other thread.  But you'll find that
the mm/rmap.c source in mm-unstable is a bit different from 6.1-rc,
so your outlined patch will need some changes - or pass it over to
me for that if you prefer.  (And I do have one more patch to that,
hope to post later today: just rearranging the order of tests as
Linus preferred.)

Hugh

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 1/4] mm: introduce 'encoded' page pointers with embedded extra bits
  2022-11-09 18:00                                                                     ` Linus Torvalds
@ 2022-11-09 20:02                                                                       ` Linus Torvalds
  0 siblings, 0 replies; 148+ messages in thread
From: Linus Torvalds @ 2022-11-09 20:02 UTC (permalink / raw)
  To: Alexander Gordeev
  Cc: Hugh Dickins, Johannes Weiner, Andrew Morton, linux-kernel, linux-mm

On Wed, Nov 9, 2022 at 10:00 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> But while I think a BUILD_BUG_ON() would be the right thing to do
> here, I do not feel confident enough to really put that to the test.

Oh, what the hell.

Just writing that whole explanation out made me just go "let's try to
re-organize it a bit so that we *can* inline everything, and see how
well it works".

And it does actually work to use BUILD_BUG_ON(), both with gcc and clang.

At least that's the case with the versions of gcc and clang _I_ use,
and in the configurations I tested.

So now I have a slightly massaged version of the patches (I did have
to move the 'encode_page()' around a bit), which has that
BUILD_BUG_ON() in it, and it passes for me.

And I find that I really like seeing that whole page pointer encoding
be so obviously much stricter. That was obviously the point of the
whole separate type system checking; now it does bit value validity
checking too.
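
The end result is roughly the sketch below (the reposted series has the
authoritative hunk):

   static __always_inline struct encoded_page *encode_page(struct page *page,
							    unsigned long flags)
   {
	/* Callers must pass a value the compiler can prove fits in the low bits. */
	BUILD_BUG_ON(flags > ENCODE_PAGE_BITS);
	return (struct encoded_page *)(flags | (unsigned long)page);
   }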

So I'll walk through my patches one more time to check for it, but
I'll post it as a git branch and send out a new series (and do it in a
separate thread with a cover letter, to not confuse the little mind of
'b4' again).

If it turns out that some other compiler version or configuration
doesn't deal with the BUILD_BUG_ON() gracefully, it's easy enough to
remove, and it will hopefully show up in linux-next when Andrew picks
it up.

                  Linus

^ permalink raw reply	[flat|nested] 148+ messages in thread

* [tip: x86/mm] mm: Convert __HAVE_ARCH_P..P_GET to the new style
  2022-11-01 12:41     ` Peter Zijlstra
                         ` (2 preceding siblings ...)
  2022-11-03 21:15       ` tip-bot2 for Peter Zijlstra
@ 2022-12-17 18:55       ` tip-bot2 for Peter Zijlstra
  3 siblings, 0 replies; 148+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2022-12-17 18:55 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Linus Torvalds, Peter Zijlstra (Intel), x86, linux-kernel

The following commit has been merged into the x86/mm branch of tip:

Commit-ID:     2dff2c359e829245bc3d80e42e296876d1f0cf8e
Gitweb:        https://git.kernel.org/tip/2dff2c359e829245bc3d80e42e296876d1f0cf8e
Author:        Peter Zijlstra <peterz@infradead.org>
AuthorDate:    Tue, 01 Nov 2022 12:53:18 +01:00
Committer:     Dave Hansen <dave.hansen@linux.intel.com>
CommitterDate: Thu, 15 Dec 2022 10:37:27 -08:00

mm: Convert __HAVE_ARCH_P..P_GET to the new style

Since __HAVE_ARCH_* style guards have been deprecated in favour of
defining the function name onto itself, convert pxxp_get().

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/Y2EUEBlQXNgaJgoI@hirez.programming.kicks-ass.net
---
 arch/powerpc/include/asm/nohash/32/pgtable.h | 2 +-
 include/linux/pgtable.h                      | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/include/asm/nohash/32/pgtable.h b/arch/powerpc/include/asm/nohash/32/pgtable.h
index 0d40b33..cb1ac02 100644
--- a/arch/powerpc/include/asm/nohash/32/pgtable.h
+++ b/arch/powerpc/include/asm/nohash/32/pgtable.h
@@ -263,7 +263,7 @@ static inline pte_basic_t pte_update(struct mm_struct *mm, unsigned long addr, p
 }
 
 #ifdef CONFIG_PPC_16K_PAGES
-#define __HAVE_ARCH_PTEP_GET
+#define ptep_get ptep_get
 static inline pte_t ptep_get(pte_t *ptep)
 {
 	pte_basic_t val = READ_ONCE(ptep->pte);
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 2334852..70e2a7e 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -291,14 +291,14 @@ static inline void ptep_clear(struct mm_struct *mm, unsigned long addr,
 	ptep_get_and_clear(mm, addr, ptep);
 }
 
-#ifndef __HAVE_ARCH_PTEP_GET
+#ifndef ptep_get
 static inline pte_t ptep_get(pte_t *ptep)
 {
 	return READ_ONCE(*ptep);
 }
 #endif
 
-#ifndef __HAVE_ARCH_PMDP_GET
+#ifndef pmdp_get
 static inline pmd_t pmdp_get(pmd_t *pmdp)
 {
 	return READ_ONCE(*pmdp);

^ permalink raw reply related	[flat|nested] 148+ messages in thread

* Re: [PATCH 11/13] x86_64: Remove pointless set_64bit() usage
  2022-11-04 17:15             ` Linus Torvalds
  2022-11-05 13:29               ` Jason A. Donenfeld
@ 2022-12-19 15:44               ` Peter Zijlstra
  1 sibling, 0 replies; 148+ messages in thread
From: Peter Zijlstra @ 2022-12-19 15:44 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Nathan Chancellor, Uros Bizjak, x86, willy, akpm, linux-kernel,
	linux-mm, aarcange, kirill.shutemov, jroedel

On Fri, Nov 04, 2022 at 10:15:08AM -0700, Linus Torvalds wrote:
> On Fri, Nov 4, 2022 at 9:01 AM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > So cmpxchg_double() does a cmpxchg on a double long value and is
> > currently supported by: i386, x86_64, arm64 and s390.
> >
> > On all those, except i386, two longs are u128.
> >
> > So how about we introduce u128 and cmpxchg128 -- then it directly
> > mirrors the u64 and cmpxchg64 usage we already have. It then also
> > naturally imposses the alignment thing.
> 
> Ack

Out now: https://lkml.kernel.org/r/20221219153525.632521981@infradead.org

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH 03/13] sh/mm: Make pmd_t similar to pte_t
  2022-10-22 11:14 ` [PATCH 03/13] sh/mm: " Peter Zijlstra
@ 2022-12-21 13:54   ` Guenter Roeck
  0 siblings, 0 replies; 148+ messages in thread
From: Guenter Roeck @ 2022-12-21 13:54 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: x86, willy, torvalds, akpm, linux-kernel, linux-mm, aarcange,
	kirill.shutemov, jroedel, ubizjak

On Sat, Oct 22, 2022 at 01:14:06PM +0200, Peter Zijlstra wrote:
> Just like 64bit pte_t, have a low/high split in pmd_t.
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>

This patch causes a compile error when trying to build sh:defconfig.

In function 'follow_pmd_mask',
    inlined from 'follow_pud_mask' at mm/gup.c:735:9,
    inlined from 'follow_p4d_mask' at mm/gup.c:752:9,
    inlined from 'follow_page_mask' at mm/gup.c:809:9:
include/linux/compiler_types.h:358:45: error: call to '__compiletime_assert_263' declared with attribute error: Unsupported access size for {READ,WRITE}_ONCE().
...
mm/gup.c:661:18: note: in expansion of macro 'READ_ONCE'
  661 |         pmdval = READ_ONCE(*pmd);

> ---
>  arch/sh/include/asm/pgtable-3level.h |   10 ++++++++--
>  1 file changed, 8 insertions(+), 2 deletions(-)
> 
> --- a/arch/sh/include/asm/pgtable-3level.h
> +++ b/arch/sh/include/asm/pgtable-3level.h
> @@ -28,9 +28,15 @@
>  #define pmd_ERROR(e) \
>  	printk("%s:%d: bad pmd %016llx.\n", __FILE__, __LINE__, pmd_val(e))
>  
> -typedef struct { unsigned long long pmd; } pmd_t;
> +typedef struct {

Was this supposed to be "union"?

Guenter

> +	struct {
> +		unsigned long pmd_low;
> +		unsigned long pmd_high;
> +	};
> +	unsigned long long pmd;
> +} pmd_t;
>  #define pmd_val(x)	((x).pmd)
> -#define __pmd(x)	((pmd_t) { (x) } )
> +#define __pmd(x)	((pmd_t) { .pmd = (x) } )
>  
>  static inline pmd_t *pud_pgtable(pud_t pud)
>  {
> 
> 
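
For reference, the intended definition is presumably a union rather than a
struct, mirroring the PAE-style pte_t layout (with the plain struct above,
pmd_t grows to 16 bytes, which is what trips the READ_ONCE() size check).
A sketch:

   typedef union {
	struct {
		unsigned long pmd_low;
		unsigned long pmd_high;
	};
	unsigned long long pmd;
   } pmd_t;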

^ permalink raw reply	[flat|nested] 148+ messages in thread

end of thread, other threads:[~2022-12-21 13:54 UTC | newest]

Thread overview: 148+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-10-22 11:14 [PATCH 00/13] Clean up pmd_get_atomic() and i386-PAE Peter Zijlstra
2022-10-22 11:14 ` [PATCH 01/13] mm: Update ptep_get_lockless()s comment Peter Zijlstra
2022-10-24  5:42   ` John Hubbard
2022-10-24  8:00     ` Peter Zijlstra
2022-10-24 19:58       ` Jann Horn
2022-10-24 20:19         ` Linus Torvalds
2022-10-24 20:23           ` Jann Horn
2022-10-24 20:36             ` Linus Torvalds
2022-10-25  3:21             ` Matthew Wilcox
2022-10-25  7:54               ` Alistair Popple
2022-10-25 13:33                 ` Peter Zijlstra
2022-10-25 13:44                 ` Jann Horn
2022-10-26  0:45                   ` Alistair Popple
2022-10-25 14:02         ` Peter Zijlstra
2022-10-25 14:18           ` Jann Horn
2022-10-25 15:06             ` Peter Zijlstra
2022-10-26 16:45               ` Jann Horn
2022-10-27  7:08                 ` Peter Zijlstra
2022-10-27 18:13                   ` Linus Torvalds
2022-10-27 19:35                     ` Peter Zijlstra
2022-10-27 19:43                       ` Linus Torvalds
2022-10-27 20:15                     ` Nadav Amit
2022-10-27 20:31                       ` Linus Torvalds
2022-10-27 21:44                         ` Nadav Amit
2022-10-28 23:57                           ` Nadav Amit
2022-10-29  0:42                             ` Linus Torvalds
2022-10-29 18:05                               ` Nadav Amit
2022-10-29 18:36                                 ` Linus Torvalds
2022-10-29 18:58                                   ` Linus Torvalds
2022-10-29 19:14                                     ` Linus Torvalds
2022-10-29 19:28                                       ` Nadav Amit
2022-10-30  0:18                                       ` Nadav Amit
2022-10-30  2:17                                     ` Nadav Amit
2022-10-30 18:19                                       ` Linus Torvalds
2022-10-30 18:51                                         ` Linus Torvalds
2022-10-30 22:47                                           ` Linus Torvalds
2022-10-31  1:47                                             ` Linus Torvalds
2022-10-31  4:09                                               ` Nadav Amit
2022-10-31  4:55                                                 ` Nadav Amit
2022-10-31  5:00                                                 ` Linus Torvalds
2022-10-31 15:43                                                   ` Nadav Amit
2022-10-31 17:32                                                     ` Linus Torvalds
2022-10-31  9:36                                               ` Peter Zijlstra
2022-10-31 17:28                                                 ` Linus Torvalds
2022-10-31 18:43                                                   ` mm: delay rmap removal until after TLB flush Linus Torvalds
2022-11-02  9:14                                                     ` Christian Borntraeger
2022-11-02  9:23                                                       ` Christian Borntraeger
2022-11-02 17:55                                                       ` Linus Torvalds
2022-11-02 18:28                                                         ` Linus Torvalds
2022-11-02 22:29                                                         ` Gerald Schaefer
2022-11-02 12:45                                                     ` Peter Zijlstra
2022-11-02 22:31                                                     ` Gerald Schaefer
2022-11-02 23:13                                                       ` Linus Torvalds
2022-11-03  9:52                                                     ` David Hildenbrand
2022-11-03 16:54                                                       ` Linus Torvalds
2022-11-03 17:09                                                         ` Linus Torvalds
2022-11-03 17:36                                                           ` David Hildenbrand
2022-11-04  6:33                                                     ` Alexander Gordeev
2022-11-04 17:35                                                       ` Linus Torvalds
2022-11-06 21:06                                                         ` Hugh Dickins
2022-11-06 22:34                                                           ` Linus Torvalds
2022-11-06 23:14                                                             ` Andrew Morton
2022-11-07  0:06                                                               ` Stephen Rothwell
2022-11-07 16:19                                                               ` Linus Torvalds
2022-11-07 23:02                                                                 ` Andrew Morton
2022-11-07 23:44                                                                   ` Stephen Rothwell
2022-11-07  9:12                                                           ` Peter Zijlstra
2022-11-07 20:07                                                           ` Johannes Weiner
2022-11-07 20:29                                                             ` Linus Torvalds
2022-11-07 23:47                                                               ` Linus Torvalds
2022-11-08  4:28                                                                 ` Linus Torvalds
2022-11-08 19:56                                                                   ` Linus Torvalds
2022-11-08 20:03                                                                     ` Konstantin Ryabitsev
2022-11-08 20:18                                                                       ` Linus Torvalds
2022-11-08 19:41                                                                 ` [PATCH 1/4] mm: introduce 'encoded' page pointers with embedded extra bits Linus Torvalds
2022-11-08 20:37                                                                   ` Nadav Amit
2022-11-08 20:46                                                                     ` Linus Torvalds
2022-11-09  6:36                                                                   ` Alexander Gordeev
2022-11-09 18:00                                                                     ` Linus Torvalds
2022-11-09 20:02                                                                       ` Linus Torvalds
2022-11-08 19:41                                                                 ` [PATCH 2/4] mm: teach release_pages() to take an array of encoded page pointers too Linus Torvalds
2022-11-08 19:41                                                                 ` [PATCH 3/4] mm: mmu_gather: prepare to gather encoded page pointers with flags Linus Torvalds
2022-11-08 19:41                                                                 ` [PATCH 4/4] mm: delay page_remove_rmap() until after the TLB has been flushed Linus Torvalds
2022-11-08 20:48                                                                   ` [lkp] [+115 bytes kernel size regression] [i386-tinyconfig] [0309f16088] " kernel test robot
2022-11-08 21:01                                                                     ` Linus Torvalds
2022-11-08 21:05                                                                   ` [PATCH 4/4] " Nadav Amit
2022-11-09 15:53                                                                   ` Johannes Weiner
2022-11-09 19:31                                                                     ` Hugh Dickins
2022-10-31  9:39                                               ` [PATCH 01/13] mm: Update ptep_get_lockless()s comment Peter Zijlstra
2022-10-31 17:22                                                 ` Linus Torvalds
2022-10-31  9:46                                               ` Peter Zijlstra
2022-10-31  9:28                                             ` Peter Zijlstra
2022-10-31 17:19                                               ` Linus Torvalds
2022-10-30 19:34                                         ` Nadav Amit
2022-10-29 19:39                                   ` John Hubbard
2022-10-29 20:15                                     ` Linus Torvalds
2022-10-29 20:30                                       ` Linus Torvalds
2022-10-29 20:42                                         ` John Hubbard
2022-10-29 20:56                                       ` Nadav Amit
2022-10-29 21:03                                         ` Nadav Amit
2022-10-29 21:12                                         ` Linus Torvalds
2022-10-29 20:59                                       ` Theodore Ts'o
2022-10-26 19:43               ` Nadav Amit
2022-10-27  7:27                 ` Peter Zijlstra
2022-10-27 17:30                   ` Nadav Amit
2022-10-22 11:14 ` [PATCH 02/13] x86/mm/pae: Make pmd_t similar to pte_t Peter Zijlstra
2022-10-22 11:14 ` [PATCH 03/13] sh/mm: " Peter Zijlstra
2022-12-21 13:54   ` Guenter Roeck
2022-10-22 11:14 ` [PATCH 04/13] mm: Fix pmd_read_atomic() Peter Zijlstra
2022-10-22 17:30   ` Linus Torvalds
2022-10-24  8:09     ` Peter Zijlstra
2022-11-01 12:41     ` Peter Zijlstra
2022-11-01 17:42       ` Linus Torvalds
2022-11-02  9:12       ` [tip: x86/mm] mm: Convert __HAVE_ARCH_P..P_GET to the new style tip-bot2 for Peter Zijlstra
2022-11-03 21:15       ` tip-bot2 for Peter Zijlstra
2022-12-17 18:55       ` tip-bot2 for Peter Zijlstra
2022-10-22 11:14 ` [PATCH 05/13] mm: Rename GUP_GET_PTE_LOW_HIGH Peter Zijlstra
2022-10-22 11:14 ` [PATCH 06/13] mm: Rename pmd_read_atomic() Peter Zijlstra
2022-10-22 11:14 ` [PATCH 07/13] mm/gup: Fix the lockless PMD access Peter Zijlstra
2022-10-23  0:42   ` Hugh Dickins
2022-10-24  7:42     ` Peter Zijlstra
2022-10-25  3:58       ` Hugh Dickins
2022-10-22 11:14 ` [PATCH 08/13] x86/mm/pae: Dont (ab)use atomic64 Peter Zijlstra
2022-10-22 11:14 ` [PATCH 09/13] x86/mm/pae: Use WRITE_ONCE() Peter Zijlstra
2022-10-22 17:42   ` Linus Torvalds
2022-10-24 10:21     ` Peter Zijlstra
2022-10-22 11:14 ` [PATCH 10/13] x86/mm/pae: Be consistent with pXXp_get_and_clear() Peter Zijlstra
2022-10-22 17:53   ` Linus Torvalds
2022-10-24 11:13     ` Peter Zijlstra
2022-10-22 11:14 ` [PATCH 11/13] x86_64: Remove pointless set_64bit() usage Peter Zijlstra
2022-10-22 17:55   ` Linus Torvalds
2022-11-03 19:09   ` Nathan Chancellor
2022-11-03 19:23     ` Uros Bizjak
2022-11-03 19:35       ` Nathan Chancellor
2022-11-03 20:39         ` Linus Torvalds
2022-11-03 21:06           ` Peter Zijlstra
2022-11-04 16:01           ` Peter Zijlstra
2022-11-04 17:15             ` Linus Torvalds
2022-11-05 13:29               ` Jason A. Donenfeld
2022-11-05 15:14                 ` Peter Zijlstra
2022-11-05 20:54                   ` Jason A. Donenfeld
2022-11-07  9:14                   ` David Laight
2022-12-19 15:44               ` Peter Zijlstra
2022-10-22 11:14 ` [PATCH 12/13] x86/mm/pae: Get rid of set_64bit() Peter Zijlstra
2022-10-22 11:14 ` [PATCH 13/13] mm: Remove pointless barrier() after pmdp_get_lockless() Peter Zijlstra
2022-10-22 19:59   ` Yu Zhao
2022-10-22 17:57 ` [PATCH 00/13] Clean up pmd_get_atomic() and i386-PAE Linus Torvalds
2022-10-29 12:21 ` Peter Zijlstra
