* [PATCH 0/8] membarrier cleanups
@ 2021-06-16  3:21 Andy Lutomirski
  2021-06-16  3:21 ` [PATCH 1/8] membarrier: Document why membarrier() works Andy Lutomirski
                   ` (7 more replies)
  0 siblings, 8 replies; 165+ messages in thread
From: Andy Lutomirski @ 2021-06-16  3:21 UTC (permalink / raw)
  To: x86; +Cc: Dave Hansen, LKML, linux-mm, Andrew Morton, Andy Lutomirski

membarrier() is unnecessarily tangled with the core scheduler.  Clean it
up.  While we're at it, improve the documentation and drop the pretense that
SYNC_CORE can ever be a well-defined cross-arch operation.

Andy Lutomirski (8):
  membarrier: Document why membarrier() works
  x86/mm: Handle unlazying membarrier core sync in the arch code
  membarrier: Remove membarrier_arch_switch_mm() prototype in core code
  membarrier: Make the post-switch-mm barrier explicit
  membarrier, kthread: Use _ONCE accessors for task->mm
  powerpc/membarrier: Remove special barrier on mm switch
  membarrier: Remove arm (32) support for SYNC_CORE
  membarrier: Rewrite sync_core_before_usermode() and improve
    documentation

 .../membarrier-sync-core/arch-support.txt     | 68 +++++------------
 arch/arm/Kconfig                              |  1 -
 arch/arm64/include/asm/sync_core.h            | 19 +++++
 arch/powerpc/include/asm/membarrier.h         | 27 -------
 arch/powerpc/include/asm/sync_core.h          | 14 ++++
 arch/powerpc/mm/mmu_context.c                 |  2 -
 arch/x86/Kconfig                              |  1 -
 arch/x86/include/asm/sync_core.h              |  7 +-
 arch/x86/kernel/alternative.c                 |  2 +-
 arch/x86/kernel/cpu/mce/core.c                |  2 +-
 arch/x86/mm/tlb.c                             | 54 ++++++++++---
 drivers/misc/sgi-gru/grufault.c               |  2 +-
 drivers/misc/sgi-gru/gruhandles.c             |  2 +-
 drivers/misc/sgi-gru/grukservices.c           |  2 +-
 fs/exec.c                                     |  2 +-
 include/linux/sched/mm.h                      | 42 +++++-----
 include/linux/sync_core.h                     | 21 -----
 init/Kconfig                                  |  3 -
 kernel/kthread.c                              | 16 +---
 kernel/sched/core.c                           | 44 +++--------
 kernel/sched/membarrier.c                     | 76 +++++++++++++++++--
 21 files changed, 210 insertions(+), 197 deletions(-)
 create mode 100644 arch/arm64/include/asm/sync_core.h
 delete mode 100644 arch/powerpc/include/asm/membarrier.h
 create mode 100644 arch/powerpc/include/asm/sync_core.h
 delete mode 100644 include/linux/sync_core.h

-- 
2.31.1


* [PATCH 1/8] membarrier: Document why membarrier() works
  2021-06-16  3:21 [PATCH 0/8] membarrier cleanups Andy Lutomirski
@ 2021-06-16  3:21 ` Andy Lutomirski
  2021-06-16  4:00   ` Nicholas Piggin
  2021-06-16  3:21 ` [PATCH 2/8] x86/mm: Handle unlazying membarrier core sync in the arch code Andy Lutomirski
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 165+ messages in thread
From: Andy Lutomirski @ 2021-06-16  3:21 UTC (permalink / raw)
  To: x86
  Cc: Dave Hansen, LKML, linux-mm, Andrew Morton, Andy Lutomirski,
	Mathieu Desnoyers, Nicholas Piggin, Peter Zijlstra

We had a nice comment at the top of membarrier.c explaining why membarrier
worked in a handful of scenarios, but that consisted more of a list of
things not to forget than an actual description of the algorithm and why it
should be expected to work.

Add a comment explaining my understanding of the algorithm.  This exposes a
couple of implementation issues that I will hopefully fix up in subsequent
patches.
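
For reference, the userspace half of this is just a plain membarrier()
call; a minimal sketch (assuming <linux/membarrier.h>, with error handling
and the MEMBARRIER_CMD_QUERY capability check omitted) looks like:

	#include <linux/membarrier.h>
	#include <sys/syscall.h>
	#include <unistd.h>

	/* glibc provides no wrapper; membarrier() is raw-syscall only. */
	static long membarrier(int cmd, unsigned int flags, int cpu_id)
	{
		return syscall(__NR_membarrier, cmd, flags, cpu_id);
	}

	int main(void)
	{
		/* Register once per process before using the expedited cmd. */
		membarrier(MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED, 0, 0);

		/*
		 * Acts as smp_mb() with respect to every CPU currently
		 * running a thread of this process, via either the IPI
		 * path or the no-IPI path described in the comment below.
		 */
		membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED, 0, 0);
		return 0;
	}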

Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 kernel/sched/membarrier.c | 55 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 55 insertions(+)

diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c
index b5add64d9698..3173b063d358 100644
--- a/kernel/sched/membarrier.c
+++ b/kernel/sched/membarrier.c
@@ -7,6 +7,61 @@
 #include "sched.h"
 
 /*
+ * The basic principle behind the regular memory barrier mode of membarrier()
+ * is as follows.  For each CPU, membarrier() operates in one of two
+ * modes.  Either it sends an IPI or it does not. If membarrier() sends an
+ * IPI, then we have the following sequence of events:
+ *
+ * 1. membarrier() does smp_mb().
+ * 2. membarrier() does a store (the IPI request payload) that is observed by
+ *    the target CPU.
+ * 3. The target CPU does smp_mb().
+ * 4. The target CPU does a store (the completion indication) that is observed
+ *    by membarrier()'s wait-for-IPIs-to-finish request.
+ * 5. membarrier() does smp_mb().
+ *
+ * So all pre-membarrier() local accesses are visible after the IPI on the
+ * target CPU and all pre-IPI remote accesses are visible after
+ * membarrier(). IOW membarrier() has synchronized both ways with the target
+ * CPU.
+ *
+ * (This has the caveat that membarrier() does not interrupt the CPU that it's
+ * running on at the time it sends the IPIs. However, if that is the CPU on
+ * which membarrier() starts and/or finishes, membarrier() does smp_mb() and,
+ * if not, then membarrier() was scheduled, and scheduling had better include a
+ * full barrier somewhere for basic correctness regardless of membarrier.)
+ *
+ * If membarrier() does not send an IPI, this means that membarrier() reads
+ * cpu_rq(cpu)->curr->mm and that the result is not equal to the target
+ * mm.  Let's assume for now that tasks never change their mm field.  The
+ * sequence of events is:
+ *
+ * 1. Target CPU switches away from the target mm (or goes lazy or has never
+ *    run the target mm in the first place). This involves smp_mb() followed
+ *    by a write to cpu_rq(cpu)->curr.
+ * 2. membarrier() does smp_mb(). (This is NOT synchronized with any action
+ *    done by the target.)
+ * 3. membarrier() observes the value written in step 1 and does *not* observe
+ *    the value written in step 5.
+ * 4. membarrier() does smp_mb().
+ * 5. Target CPU switches back to the target mm and writes to
+ *    cpu_rq(cpu)->curr. (This is NOT synchronized with any action on
+ *    membarrier()'s part.)
+ * 6. Target CPU executes smp_mb().
+ *
+ * All pre-schedule accesses on the remote CPU are visible after membarrier()
+ * because they all precede the target's write in step 1 and are synchronized
+ * to the local CPU by steps 3 and 4.  All pre-membarrier() accesses on the
+ * local CPU are visible on the remote CPU after scheduling because they
+ * happen before the smp_mb(); read in steps 2 and 3 and that read precedes
+ * the target's smp_mb() in step 6.
+ *
+ * However, tasks can change their ->mm, e.g., via kthread_use_mm().  So
+ * tasks that switch their ->mm must follow the same rules as the scheduler
+ * changing rq->curr, and the membarrier() code needs to do both dereferences
+ * carefully.
+ *
+ *
  * For documentation purposes, here are some membarrier ordering
  * scenarios to keep in mind:
  *
-- 
2.31.1


* [PATCH 2/8] x86/mm: Handle unlazying membarrier core sync in the arch code
  2021-06-16  3:21 [PATCH 0/8] membarrier cleanups Andy Lutomirski
  2021-06-16  3:21 ` [PATCH 1/8] membarrier: Document why membarrier() works Andy Lutomirski
@ 2021-06-16  3:21 ` Andy Lutomirski
  2021-06-16  4:25   ` Nicholas Piggin
  2021-06-16 17:49     ` Mathieu Desnoyers
  2021-06-16  3:21 ` [PATCH 3/8] membarrier: Remove membarrier_arch_switch_mm() prototype in core code Andy Lutomirski
                   ` (5 subsequent siblings)
  7 siblings, 2 replies; 165+ messages in thread
From: Andy Lutomirski @ 2021-06-16  3:21 UTC (permalink / raw)
  To: x86
  Cc: Dave Hansen, LKML, linux-mm, Andrew Morton, Andy Lutomirski,
	Mathieu Desnoyers, Nicholas Piggin, Peter Zijlstra

The core scheduler isn't a great place for
membarrier_mm_sync_core_before_usermode() -- the core scheduler
doesn't actually know whether we are lazy.  With the old code, if a
CPU is running a membarrier-registered task, goes idle, gets unlazied
via a TLB shootdown IPI, and switches back to the
membarrier-registered task, it will do an unnecessary core sync.

Conveniently, x86 is the only architecture that does anything in this
sync_core_before_usermode(), so membarrier_mm_sync_core_before_usermode()
is a no-op on all other architectures and we can just move the code.

(I am not claiming that the SYNC_CORE code was correct before or after this
 change on any non-x86 architecture.  I merely claim that this change
 improves readability, is correct on x86, and makes no change on any other
 architecture.)

Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/mm/tlb.c        | 53 +++++++++++++++++++++++++++++++---------
 include/linux/sched/mm.h | 13 ----------
 kernel/sched/core.c      | 13 ++++------
 3 files changed, 46 insertions(+), 33 deletions(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 78804680e923..59488d663e68 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -8,6 +8,7 @@
 #include <linux/export.h>
 #include <linux/cpu.h>
 #include <linux/debugfs.h>
+#include <linux/sched/mm.h>
 
 #include <asm/tlbflush.h>
 #include <asm/mmu_context.h>
@@ -473,16 +474,24 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 		this_cpu_write(cpu_tlbstate_shared.is_lazy, false);
 
 	/*
-	 * The membarrier system call requires a full memory barrier and
-	 * core serialization before returning to user-space, after
-	 * storing to rq->curr, when changing mm.  This is because
-	 * membarrier() sends IPIs to all CPUs that are in the target mm
-	 * to make them issue memory barriers.  However, if another CPU
-	 * switches to/from the target mm concurrently with
-	 * membarrier(), it can cause that CPU not to receive an IPI
-	 * when it really should issue a memory barrier.  Writing to CR3
-	 * provides that full memory barrier and core serializing
-	 * instruction.
+	 * membarrier() support requires that, when we change rq->curr->mm:
+	 *
+	 *  - If next->mm has membarrier registered, a full memory barrier
+	 *    after writing rq->curr (or rq->curr->mm if we switched the mm
+	 *    without switching tasks) and before returning to user mode.
+	 *
+	 *  - If next->mm has SYNC_CORE registered, then we sync core before
+	 *    returning to user mode.
+	 *
+	 * In the case where prev->mm == next->mm, membarrier() uses an IPI
+	 * instead, and no particular barriers are needed while context
+	 * switching.
+	 *
+	 * x86 gets all of this as a side-effect of writing to CR3 except
+	 * in the case where we unlazy without flushing.
+	 *
+	 * All other architectures are civilized and do all of this implicitly
+	 * when transitioning from kernel to user mode.
 	 */
 	if (real_prev == next) {
 		VM_WARN_ON(this_cpu_read(cpu_tlbstate.ctxs[prev_asid].ctx_id) !=
@@ -500,7 +509,8 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 		/*
 		 * If the CPU is not in lazy TLB mode, we are just switching
 		 * from one thread in a process to another thread in the same
-		 * process. No TLB flush required.
+		 * process. No TLB flush or membarrier() synchronization
+		 * is required.
 		 */
 		if (!was_lazy)
 			return;
@@ -510,16 +520,35 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 		 * If the TLB is up to date, just use it.
 		 * The barrier synchronizes with the tlb_gen increment in
 		 * the TLB shootdown code.
+		 *
+		 * As a future optimization opportunity, it's plausible
+		 * that the x86 memory model is strong enough that this
+		 * smp_mb() isn't needed.
 		 */
 		smp_mb();
 		next_tlb_gen = atomic64_read(&next->context.tlb_gen);
 		if (this_cpu_read(cpu_tlbstate.ctxs[prev_asid].tlb_gen) ==
-				next_tlb_gen)
+		    next_tlb_gen) {
+#ifdef CONFIG_MEMBARRIER
+			/*
+			 * We switched logical mm but we're not going to
+			 * write to CR3.  We already did smp_mb() above,
+			 * but membarrier() might require a sync_core()
+			 * as well.
+			 */
+			if (unlikely(atomic_read(&next->membarrier_state) &
+				     MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE))
+				sync_core_before_usermode();
+#endif
+
 			return;
+		}
 
 		/*
 		 * TLB contents went out of date while we were in lazy
 		 * mode. Fall through to the TLB switching code below.
+		 * No need for an explicit membarrier invocation -- the CR3
+		 * write will serialize.
 		 */
 		new_asid = prev_asid;
 		need_flush = true;
diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index e24b1fe348e3..24d97d1b6252 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -345,16 +345,6 @@ enum {
 #include <asm/membarrier.h>
 #endif
 
-static inline void membarrier_mm_sync_core_before_usermode(struct mm_struct *mm)
-{
-	if (current->mm != mm)
-		return;
-	if (likely(!(atomic_read(&mm->membarrier_state) &
-		     MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE)))
-		return;
-	sync_core_before_usermode();
-}
-
 extern void membarrier_exec_mmap(struct mm_struct *mm);
 
 extern void membarrier_update_current_mm(struct mm_struct *next_mm);
@@ -370,9 +360,6 @@ static inline void membarrier_arch_switch_mm(struct mm_struct *prev,
 static inline void membarrier_exec_mmap(struct mm_struct *mm)
 {
 }
-static inline void membarrier_mm_sync_core_before_usermode(struct mm_struct *mm)
-{
-}
 static inline void membarrier_update_current_mm(struct mm_struct *next_mm)
 {
 }
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5226cc26a095..e4c122f8bf21 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4220,22 +4220,19 @@ static struct rq *finish_task_switch(struct task_struct *prev)
 	kmap_local_sched_in();
 
 	fire_sched_in_preempt_notifiers(current);
+
 	/*
 	 * When switching through a kernel thread, the loop in
 	 * membarrier_{private,global}_expedited() may have observed that
 	 * kernel thread and not issued an IPI. It is therefore possible to
 	 * schedule between user->kernel->user threads without passing though
 	 * switch_mm(). Membarrier requires a barrier after storing to
-	 * rq->curr, before returning to userspace, so provide them here:
-	 *
-	 * - a full memory barrier for {PRIVATE,GLOBAL}_EXPEDITED, implicitly
-	 *   provided by mmdrop(),
-	 * - a sync_core for SYNC_CORE.
+	 * rq->curr, before returning to userspace, and mmdrop() provides
+	 * this barrier.
 	 */
-	if (mm) {
-		membarrier_mm_sync_core_before_usermode(mm);
+	if (mm)
 		mmdrop(mm);
-	}
+
 	if (unlikely(prev_state == TASK_DEAD)) {
 		if (prev->sched_class->task_dead)
 			prev->sched_class->task_dead(prev);
-- 
2.31.1


* [PATCH 3/8] membarrier: Remove membarrier_arch_switch_mm() prototype in core code
  2021-06-16  3:21 [PATCH 0/8] membarrier cleanups Andy Lutomirski
  2021-06-16  3:21 ` [PATCH 1/8] membarrier: Document why membarrier() works Andy Lutomirski
  2021-06-16  3:21 ` [PATCH 2/8] x86/mm: Handle unlazying membarrier core sync in the arch code Andy Lutomirski
@ 2021-06-16  3:21 ` Andy Lutomirski
  2021-06-16  4:26   ` Nicholas Piggin
  2021-06-16 17:52     ` Mathieu Desnoyers
  2021-06-16  3:21 ` [PATCH 4/8] membarrier: Make the post-switch-mm barrier explicit Andy Lutomirski
                   ` (4 subsequent siblings)
  7 siblings, 2 replies; 165+ messages in thread
From: Andy Lutomirski @ 2021-06-16  3:21 UTC (permalink / raw)
  To: x86
  Cc: Dave Hansen, LKML, linux-mm, Andrew Morton, Andy Lutomirski,
	Mathieu Desnoyers, Nicholas Piggin, Peter Zijlstra

membarrier_arch_switch_mm()'s sole implementation and caller are in
arch/powerpc.  Having a fallback implementation in include/linux is
confusing -- remove it.

It's still mentioned in a comment, but a subsequent patch will remove
it.

Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 include/linux/sched/mm.h | 7 -------
 1 file changed, 7 deletions(-)

diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index 24d97d1b6252..10aace21d25e 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -350,13 +350,6 @@ extern void membarrier_exec_mmap(struct mm_struct *mm);
 extern void membarrier_update_current_mm(struct mm_struct *next_mm);
 
 #else
-#ifdef CONFIG_ARCH_HAS_MEMBARRIER_CALLBACKS
-static inline void membarrier_arch_switch_mm(struct mm_struct *prev,
-					     struct mm_struct *next,
-					     struct task_struct *tsk)
-{
-}
-#endif
 static inline void membarrier_exec_mmap(struct mm_struct *mm)
 {
 }
-- 
2.31.1


* [PATCH 4/8] membarrier: Make the post-switch-mm barrier explicit
  2021-06-16  3:21 [PATCH 0/8] membarrier cleanups Andy Lutomirski
                   ` (2 preceding siblings ...)
  2021-06-16  3:21 ` [PATCH 3/8] membarrier: Remove membarrier_arch_switch_mm() prototype in core code Andy Lutomirski
@ 2021-06-16  3:21 ` Andy Lutomirski
  2021-06-16  4:19   ` Nicholas Piggin
  2021-06-16  3:21 ` [PATCH 5/8] membarrier, kthread: Use _ONCE accessors for task->mm Andy Lutomirski
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 165+ messages in thread
From: Andy Lutomirski @ 2021-06-16  3:21 UTC (permalink / raw)
  To: x86
  Cc: Dave Hansen, LKML, linux-mm, Andrew Morton, Andy Lutomirski,
	Mathieu Desnoyers, Nicholas Piggin, Peter Zijlstra

membarrier() needs a barrier after any CPU changes mm.  There is currently
a comment explaining why this barrier probably exists in all cases.  This
is very fragile -- any change to the relevant parts of the scheduler
might get rid of these barriers, and it's not really clear to me that
the barrier actually exists in all necessary cases.

Simplify the logic by adding an explicit barrier, and allow architectures
to override it as an optimization if they want to.

One of the deleted comments in this patch said "It is therefore
possible to schedule between user->kernel->user threads without
passing through switch_mm()".  It is possible to do this without, say,
writing to CR3 on x86, but the core scheduler indeed calls
switch_mm_irqs_off() to tell the arch code to go back from lazy mode
to no-lazy mode.
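
As an illustration of the override hook (not part of this patch), an
architecture whose mm switch already implies a full barrier could provide
something like the following from an asm header, assuming that header is
pulled in before the generic definition:

	/* Hypothetical arch override -- a sketch, not real kernel code. */
	#define membarrier_finish_switch_mm membarrier_finish_switch_mm
	static inline void membarrier_finish_switch_mm(int membarrier_state)
	{
		/*
		 * Nothing to do: switching mms on this architecture already
		 * executes a full memory barrier, so the generic smp_mb()
		 * would be redundant.
		 */
	}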

Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 include/linux/sched/mm.h | 21 +++++++++++++++++++++
 kernel/kthread.c         | 12 +-----------
 kernel/sched/core.c      | 35 +++++++++--------------------------
 3 files changed, 31 insertions(+), 37 deletions(-)

diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index 10aace21d25e..c6eebbafadb0 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -341,6 +341,27 @@ enum {
 	MEMBARRIER_FLAG_RSEQ		= (1U << 1),
 };
 
+#ifdef CONFIG_MEMBARRIER
+
+/*
+ * Called by the core scheduler after calling switch_mm_irqs_off().
+ * Architectures that have implicit barriers when switching mms can
+ * override this as an optimization.
+ */
+#ifndef membarrier_finish_switch_mm
+static inline void membarrier_finish_switch_mm(int membarrier_state)
+{
+	if (membarrier_state & (MEMBARRIER_STATE_GLOBAL_EXPEDITED | MEMBARRIER_STATE_PRIVATE_EXPEDITED))
+		smp_mb();
+}
+#endif
+
+#else
+
+static inline void membarrier_finish_switch_mm(int membarrier_state) {}
+
+#endif
+
 #ifdef CONFIG_ARCH_HAS_MEMBARRIER_CALLBACKS
 #include <asm/membarrier.h>
 #endif
diff --git a/kernel/kthread.c b/kernel/kthread.c
index fe3f2a40d61e..8275b415acec 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -1325,25 +1325,15 @@ void kthread_use_mm(struct mm_struct *mm)
 	tsk->mm = mm;
 	membarrier_update_current_mm(mm);
 	switch_mm_irqs_off(active_mm, mm, tsk);
+	membarrier_finish_switch_mm(atomic_read(&mm->membarrier_state));
 	local_irq_enable();
 	task_unlock(tsk);
 #ifdef finish_arch_post_lock_switch
 	finish_arch_post_lock_switch();
 #endif
 
-	/*
-	 * When a kthread starts operating on an address space, the loop
-	 * in membarrier_{private,global}_expedited() may not observe
-	 * that tsk->mm, and not issue an IPI. Membarrier requires a
-	 * memory barrier after storing to tsk->mm, before accessing
-	 * user-space memory. A full memory barrier for membarrier
-	 * {PRIVATE,GLOBAL}_EXPEDITED is implicitly provided by
-	 * mmdrop(), or explicitly with smp_mb().
-	 */
 	if (active_mm != mm)
 		mmdrop(active_mm);
-	else
-		smp_mb();
 
 	to_kthread(tsk)->oldfs = force_uaccess_begin();
 }
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e4c122f8bf21..329a6d2a4e13 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4221,15 +4221,6 @@ static struct rq *finish_task_switch(struct task_struct *prev)
 
 	fire_sched_in_preempt_notifiers(current);
 
-	/*
-	 * When switching through a kernel thread, the loop in
-	 * membarrier_{private,global}_expedited() may have observed that
-	 * kernel thread and not issued an IPI. It is therefore possible to
-	 * schedule between user->kernel->user threads without passing though
-	 * switch_mm(). Membarrier requires a barrier after storing to
-	 * rq->curr, before returning to userspace, and mmdrop() provides
-	 * this barrier.
-	 */
 	if (mm)
 		mmdrop(mm);
 
@@ -4311,15 +4302,14 @@ context_switch(struct rq *rq, struct task_struct *prev,
 			prev->active_mm = NULL;
 	} else {                                        // to user
 		membarrier_switch_mm(rq, prev->active_mm, next->mm);
+		switch_mm_irqs_off(prev->active_mm, next->mm, next);
+
 		/*
 		 * sys_membarrier() requires an smp_mb() between setting
-		 * rq->curr / membarrier_switch_mm() and returning to userspace.
-		 *
-		 * The below provides this either through switch_mm(), or in
-		 * case 'prev->active_mm == next->mm' through
-		 * finish_task_switch()'s mmdrop().
+		 * rq->curr->mm to a membarrier-enabled mm and returning
+		 * to userspace.
 		 */
-		switch_mm_irqs_off(prev->active_mm, next->mm, next);
+		membarrier_finish_switch_mm(rq->membarrier_state);
 
 		if (!prev->mm) {                        // from kernel
 			/* will mmdrop() in finish_task_switch(). */
@@ -5121,17 +5111,10 @@ static void __sched notrace __schedule(bool preempt)
 		RCU_INIT_POINTER(rq->curr, next);
 		/*
 		 * The membarrier system call requires each architecture
-		 * to have a full memory barrier after updating
-		 * rq->curr, before returning to user-space.
-		 *
-		 * Here are the schemes providing that barrier on the
-		 * various architectures:
-		 * - mm ? switch_mm() : mmdrop() for x86, s390, sparc, PowerPC.
-		 *   switch_mm() rely on membarrier_arch_switch_mm() on PowerPC.
-		 * - finish_lock_switch() for weakly-ordered
-		 *   architectures where spin_unlock is a full barrier,
-		 * - switch_to() for arm64 (weakly-ordered, spin_unlock
-		 *   is a RELEASE barrier),
+		 * to have a full memory barrier before and after updating
+		 * rq->curr->mm, before returning to userspace.  This
+		 * is provided by membarrier_finish_switch_mm().  Architectures
+		 * that want to optimize this can override that function.
 		 */
 		++*switch_count;
 
-- 
2.31.1


* [PATCH 5/8] membarrier, kthread: Use _ONCE accessors for task->mm
  2021-06-16  3:21 [PATCH 0/8] membarrier cleanups Andy Lutomirski
                   ` (3 preceding siblings ...)
  2021-06-16  3:21 ` [PATCH 4/8] membarrier: Make the post-switch-mm barrier explicit Andy Lutomirski
@ 2021-06-16  3:21 ` Andy Lutomirski
  2021-06-16  4:28   ` Nicholas Piggin
  2021-06-16 18:08     ` Mathieu Desnoyers
  2021-06-16  3:21   ` Andy Lutomirski
                   ` (2 subsequent siblings)
  7 siblings, 2 replies; 165+ messages in thread
From: Andy Lutomirski @ 2021-06-16  3:21 UTC (permalink / raw)
  To: x86
  Cc: Dave Hansen, LKML, linux-mm, Andrew Morton, Andy Lutomirski,
	Mathieu Desnoyers, Nicholas Piggin, Peter Zijlstra

membarrier reads cpu_rq(remote cpu)->curr->mm without locking.  Use
READ_ONCE() and WRITE_ONCE() to remove the data races.

Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 fs/exec.c                 | 2 +-
 kernel/kthread.c          | 4 ++--
 kernel/sched/membarrier.c | 6 +++---
 3 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 18594f11c31f..2e63dea83411 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1007,7 +1007,7 @@ static int exec_mmap(struct mm_struct *mm)
 	local_irq_disable();
 	active_mm = tsk->active_mm;
 	tsk->active_mm = mm;
-	tsk->mm = mm;
+	WRITE_ONCE(tsk->mm, mm);  /* membarrier reads this without locks */
 	/*
 	 * This prevents preemption while active_mm is being loaded and
 	 * it and mm are being updated, which could cause problems for
diff --git a/kernel/kthread.c b/kernel/kthread.c
index 8275b415acec..4962794e02d5 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -1322,7 +1322,7 @@ void kthread_use_mm(struct mm_struct *mm)
 		mmgrab(mm);
 		tsk->active_mm = mm;
 	}
-	tsk->mm = mm;
+	WRITE_ONCE(tsk->mm, mm);  /* membarrier reads this without locks */
 	membarrier_update_current_mm(mm);
 	switch_mm_irqs_off(active_mm, mm, tsk);
 	membarrier_finish_switch_mm(atomic_read(&mm->membarrier_state));
@@ -1363,7 +1363,7 @@ void kthread_unuse_mm(struct mm_struct *mm)
 	smp_mb__after_spinlock();
 	sync_mm_rss(mm);
 	local_irq_disable();
-	tsk->mm = NULL;
+	WRITE_ONCE(tsk->mm, NULL);  /* membarrier reads this without locks */
 	membarrier_update_current_mm(NULL);
 	/* active_mm is still 'mm' */
 	enter_lazy_tlb(mm, tsk);
diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c
index 3173b063d358..c32c32a2441e 100644
--- a/kernel/sched/membarrier.c
+++ b/kernel/sched/membarrier.c
@@ -410,7 +410,7 @@ static int membarrier_private_expedited(int flags, int cpu_id)
 			goto out;
 		rcu_read_lock();
 		p = rcu_dereference(cpu_rq(cpu_id)->curr);
-		if (!p || p->mm != mm) {
+		if (!p || READ_ONCE(p->mm) != mm) {
 			rcu_read_unlock();
 			goto out;
 		}
@@ -423,7 +423,7 @@ static int membarrier_private_expedited(int flags, int cpu_id)
 			struct task_struct *p;
 
 			p = rcu_dereference(cpu_rq(cpu)->curr);
-			if (p && p->mm == mm)
+			if (p && READ_ONCE(p->mm) == mm)
 				__cpumask_set_cpu(cpu, tmpmask);
 		}
 		rcu_read_unlock();
@@ -521,7 +521,7 @@ static int sync_runqueues_membarrier_state(struct mm_struct *mm)
 		struct task_struct *p;
 
 		p = rcu_dereference(rq->curr);
-		if (p && p->mm == mm)
+		if (p && READ_ONCE(p->mm) == mm)
 			__cpumask_set_cpu(cpu, tmpmask);
 	}
 	rcu_read_unlock();
-- 
2.31.1


* [PATCH 6/8] powerpc/membarrier: Remove special barrier on mm switch
  2021-06-16  3:21 [PATCH 0/8] membarrier cleanups Andy Lutomirski
@ 2021-06-16  3:21   ` Andy Lutomirski
  2021-06-16  3:21 ` [PATCH 2/8] x86/mm: Handle unlazying membarrier core sync in the arch code Andy Lutomirski
                     ` (6 subsequent siblings)
  7 siblings, 0 replies; 165+ messages in thread
From: Andy Lutomirski @ 2021-06-16  3:21 UTC (permalink / raw)
  To: x86
  Cc: Dave Hansen, LKML, linux-mm, Andrew Morton, Andy Lutomirski,
	Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras,
	linuxppc-dev, Nicholas Piggin, Mathieu Desnoyers, Peter Zijlstra

powerpc did the following on some, but not all, paths through
switch_mm_irqs_off():

       /*
        * Only need the full barrier when switching between processes.
        * Barrier when switching from kernel to userspace is not
        * required here, given that it is implied by mmdrop(). Barrier
        * when switching from userspace to kernel is not needed after
        * store to rq->curr.
        */
       if (likely(!(atomic_read(&next->membarrier_state) &
                    (MEMBARRIER_STATE_PRIVATE_EXPEDITED |
                     MEMBARRIER_STATE_GLOBAL_EXPEDITED)) || !prev))
               return;

This is puzzling: if !prev, then one might expect that we are switching
from kernel to user, not user to kernel, which is inconsistent with the
comment.  But this is all nonsense, because the one and only caller would
never have prev == NULL and would, in fact, OOPS if prev == NULL.

In any event, this code is unnecessary, since the new generic
membarrier_finish_switch_mm() provides the same barrier without arch help.

Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/powerpc/include/asm/membarrier.h | 27 ---------------------------
 arch/powerpc/mm/mmu_context.c         |  2 --
 2 files changed, 29 deletions(-)
 delete mode 100644 arch/powerpc/include/asm/membarrier.h

diff --git a/arch/powerpc/include/asm/membarrier.h b/arch/powerpc/include/asm/membarrier.h
deleted file mode 100644
index 6e20bb5c74ea..000000000000
--- a/arch/powerpc/include/asm/membarrier.h
+++ /dev/null
@@ -1,27 +0,0 @@
-#ifndef _ASM_POWERPC_MEMBARRIER_H
-#define _ASM_POWERPC_MEMBARRIER_H
-
-static inline void membarrier_arch_switch_mm(struct mm_struct *prev,
-					     struct mm_struct *next,
-					     struct task_struct *tsk)
-{
-	/*
-	 * Only need the full barrier when switching between processes.
-	 * Barrier when switching from kernel to userspace is not
-	 * required here, given that it is implied by mmdrop(). Barrier
-	 * when switching from userspace to kernel is not needed after
-	 * store to rq->curr.
-	 */
-	if (likely(!(atomic_read(&next->membarrier_state) &
-		     (MEMBARRIER_STATE_PRIVATE_EXPEDITED |
-		      MEMBARRIER_STATE_GLOBAL_EXPEDITED)) || !prev))
-		return;
-
-	/*
-	 * The membarrier system call requires a full memory barrier
-	 * after storing to rq->curr, before going back to user-space.
-	 */
-	smp_mb();
-}
-
-#endif /* _ASM_POWERPC_MEMBARRIER_H */
diff --git a/arch/powerpc/mm/mmu_context.c b/arch/powerpc/mm/mmu_context.c
index a857af401738..8daa95b3162b 100644
--- a/arch/powerpc/mm/mmu_context.c
+++ b/arch/powerpc/mm/mmu_context.c
@@ -85,8 +85,6 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 
 	if (new_on_cpu)
 		radix_kvm_prefetch_workaround(next);
-	else
-		membarrier_arch_switch_mm(prev, next, tsk);
 
 	/*
 	 * The actual HW switching method differs between the various
-- 
2.31.1


* [PATCH 7/8] membarrier: Remove arm (32) support for SYNC_CORE
  2021-06-16  3:21 [PATCH 0/8] membarrier cleanups Andy Lutomirski
@ 2021-06-16  3:21   ` Andy Lutomirski
  2021-06-16  3:21 ` [PATCH 2/8] x86/mm: Handle unlazying membarrier core sync in the arch code Andy Lutomirski
                     ` (6 subsequent siblings)
  7 siblings, 0 replies; 165+ messages in thread
From: Andy Lutomirski @ 2021-06-16  3:21 UTC (permalink / raw)
  To: x86
  Cc: Dave Hansen, LKML, linux-mm, Andrew Morton, Andy Lutomirski,
	Mathieu Desnoyers, Nicholas Piggin, Peter Zijlstra, Russell King,
	linux-arm-kernel

On arm32, the only way to safely flush icache from usermode is to call
cacheflush(2).  This also handles any required pipeline flushes, so
membarrier's SYNC_CORE feature is useless on arm.  Remove it.
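
For reference, a sketch of what an arm32 JIT does instead (the helper name
is made up; __builtin___clear_cache() is the GCC/Clang wrapper around the
kernel's cacheflush() facility):

	#include <string.h>

	/* Publish freshly written code; no membarrier() involved. */
	static void jit_publish(void *dst, const void *src, size_t len)
	{
		memcpy(dst, src, len);

		/*
		 * Flush the D-cache/I-cache for the range and handle any
		 * required pipeline synchronization via cacheflush(2).
		 */
		__builtin___clear_cache((char *)dst, (char *)dst + len);
	}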

Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: linux-arm-kernel@lists.infradead.org
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/arm/Kconfig | 1 -
 1 file changed, 1 deletion(-)

diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index 24804f11302d..89a885fba724 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -10,7 +10,6 @@ config ARM
 	select ARCH_HAS_FORTIFY_SOURCE
 	select ARCH_HAS_KEEPINITRD
 	select ARCH_HAS_KCOV
-	select ARCH_HAS_MEMBARRIER_SYNC_CORE
 	select ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE
 	select ARCH_HAS_PTE_SPECIAL if ARM_LPAE
 	select ARCH_HAS_PHYS_TO_DMA
-- 
2.31.1


* [PATCH 8/8] membarrier: Rewrite sync_core_before_usermode() and improve documentation
  2021-06-16  3:21 [PATCH 0/8] membarrier cleanups Andy Lutomirski
  2021-06-16  3:21 ` [PATCH 1/8] membarrier: Document why membarrier() works Andy Lutomirski
@ 2021-06-16  3:21   ` Andy Lutomirski
  2021-06-16  3:21 ` [PATCH 3/8] membarrier: Remove membarrier_arch_switch_mm() prototype in core code Andy Lutomirski
                     ` (5 subsequent siblings)
  7 siblings, 0 replies; 165+ messages in thread
From: Andy Lutomirski @ 2021-06-16  3:21 UTC (permalink / raw)
  To: x86
  Cc: Dave Hansen, LKML, linux-mm, Andrew Morton, Andy Lutomirski,
	Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras,
	linuxppc-dev, Nicholas Piggin, Catalin Marinas, Will Deacon,
	linux-arm-kernel, Mathieu Desnoyers, Peter Zijlstra, stable

The old sync_core_before_usermode() comments suggested that a non-icache-syncing
return-to-usermode instruction is x86-specific and that all other
architectures automatically notice cross-modified code on return to
userspace.

This is misleading.  The incantation needed to modify code from one
CPU and execute it on another CPU is highly architecture dependent.
On x86, according to the SDM, one must modify the code, issue SFENCE
if the modification was WC or nontemporal, and then issue a "serializing
instruction" on the CPU that will execute the code.  membarrier() can do
the latter.

On arm64 and powerpc, one must flush the icache and then flush the pipeline
on the target CPU, although the CPU manuals don't necessarily use this
language.

So let's drop any pretense that we can have a generic way to define or
implement membarrier's SYNC_CORE operation and instead require all
architectures to define the helper and supply their own documentation as to
how to use it.  This means x86, arm64, and powerpc for now.  Let's also
rename the function from sync_core_before_usermode() to
membarrier_sync_core_before_usermode() because the precise flushing details
may very well be specific to membarrier, and even the concept of
"sync_core" in the kernel is mostly an x86-ism.

(It may well be the case that, on real x86 processors, synchronizing the
 icache (which requires no action at all) and "flushing the pipeline" is
 sufficient, but trying to use this language would be confusing at best.
 LFENCE does something awfully like "flushing the pipeline", but the SDM
 does not permit LFENCE as an alternative to a "serializing instruction"
 for this purpose.)
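
Concretely, the intended x86 usage is roughly the following sketch (the
helper names are made up and error handling is omitted):

	#include <linux/membarrier.h>
	#include <string.h>
	#include <sys/syscall.h>
	#include <unistd.h>

	static long membarrier(int cmd, unsigned int flags, int cpu_id)
	{
		return syscall(__NR_membarrier, cmd, flags, cpu_id);
	}

	/* Once, early in the life of the process: */
	static void jit_init(void)
	{
		membarrier(MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_SYNC_CORE,
			   0, 0);
	}

	/* Cross-modify code that another thread of this process may run: */
	static void jit_publish(void *dst, const void *src, size_t len)
	{
		memcpy(dst, src, len);	/* plain stores: no SFENCE needed */

		/*
		 * Force a serializing instruction on every CPU that might
		 * run this process before it can execute the new code.
		 */
		membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE, 0, 0);
	}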

Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: linux-arm-kernel@lists.infradead.org
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: x86@kernel.org
Cc: stable@vger.kernel.org
Fixes: 70216e18e519 ("membarrier: Provide core serializing command, *_SYNC_CORE")
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 .../membarrier-sync-core/arch-support.txt     | 68 ++++++-------------
 arch/arm64/include/asm/sync_core.h            | 19 ++++++
 arch/powerpc/include/asm/sync_core.h          | 14 ++++
 arch/x86/Kconfig                              |  1 -
 arch/x86/include/asm/sync_core.h              |  7 +-
 arch/x86/kernel/alternative.c                 |  2 +-
 arch/x86/kernel/cpu/mce/core.c                |  2 +-
 arch/x86/mm/tlb.c                             |  3 +-
 drivers/misc/sgi-gru/grufault.c               |  2 +-
 drivers/misc/sgi-gru/gruhandles.c             |  2 +-
 drivers/misc/sgi-gru/grukservices.c           |  2 +-
 include/linux/sched/mm.h                      |  1 -
 include/linux/sync_core.h                     | 21 ------
 init/Kconfig                                  |  3 -
 kernel/sched/membarrier.c                     | 15 ++--
 15 files changed, 75 insertions(+), 87 deletions(-)
 create mode 100644 arch/arm64/include/asm/sync_core.h
 create mode 100644 arch/powerpc/include/asm/sync_core.h
 delete mode 100644 include/linux/sync_core.h

diff --git a/Documentation/features/sched/membarrier-sync-core/arch-support.txt b/Documentation/features/sched/membarrier-sync-core/arch-support.txt
index 883d33b265d6..41c9ebcb275f 100644
--- a/Documentation/features/sched/membarrier-sync-core/arch-support.txt
+++ b/Documentation/features/sched/membarrier-sync-core/arch-support.txt
@@ -5,51 +5,25 @@
 #
 # Architecture requirements
 #
-# * arm/arm64/powerpc
 #
-# Rely on implicit context synchronization as a result of exception return
-# when returning from IPI handler, and when returning to user-space.
-#
-# * x86
-#
-# x86-32 uses IRET as return from interrupt, which takes care of the IPI.
-# However, it uses both IRET and SYSEXIT to go back to user-space. The IRET
-# instruction is core serializing, but not SYSEXIT.
-#
-# x86-64 uses IRET as return from interrupt, which takes care of the IPI.
-# However, it can return to user-space through either SYSRETL (compat code),
-# SYSRETQ, or IRET.
-#
-# Given that neither SYSRET{L,Q}, nor SYSEXIT, are core serializing, we rely
-# instead on write_cr3() performed by switch_mm() to provide core serialization
-# after changing the current mm, and deal with the special case of kthread ->
-# uthread (temporarily keeping current mm into active_mm) by issuing a
-# sync_core_before_usermode() in that specific case.
-#
-    -----------------------
-    |         arch |status|
-    -----------------------
-    |       alpha: | TODO |
-    |         arc: | TODO |
-    |         arm: |  ok  |
-    |       arm64: |  ok  |
-    |        csky: | TODO |
-    |       h8300: | TODO |
-    |     hexagon: | TODO |
-    |        ia64: | TODO |
-    |        m68k: | TODO |
-    |  microblaze: | TODO |
-    |        mips: | TODO |
-    |       nds32: | TODO |
-    |       nios2: | TODO |
-    |    openrisc: | TODO |
-    |      parisc: | TODO |
-    |     powerpc: |  ok  |
-    |       riscv: | TODO |
-    |        s390: | TODO |
-    |          sh: | TODO |
-    |       sparc: | TODO |
-    |          um: | TODO |
-    |         x86: |  ok  |
-    |      xtensa: | TODO |
-    -----------------------
+# An architecture that wants to support
+# MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE needs to define precisely what it
+# is supposed to do and implement membarrier_sync_core_before_usermode() to
+# make it do that.  Once an architecture meets these requirements, it can
+# select ARCH_HAS_MEMBARRIER_SYNC_CORE via Kconfig.  Unfortunately,
+# MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE is not a fantastic API and may
+# not make sense on all architectures.
+#
+# On x86, a program can safely modify code, issue
+# MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE, and then execute that code, via
+# the modified address or an alias, from any thread in the calling process.
+#
+# On arm64, a program can modify code, flush the icache as needed, and issue
+# MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE to force a "context synchronizing
+# event", aka pipeline flush on all CPUs that might run the calling process.
+# Then the program can execute the modified code as long as it is executed
+# from an address consistent with the icache flush and the CPU's cache type.
+#
+# On powerpc, a program can use MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE
+# similarly to arm64.  It would be nice if the powerpc maintainers could
+# add a clearer explanation.
diff --git a/arch/arm64/include/asm/sync_core.h b/arch/arm64/include/asm/sync_core.h
new file mode 100644
index 000000000000..74996bf533bb
--- /dev/null
+++ b/arch/arm64/include/asm/sync_core.h
@@ -0,0 +1,19 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_ARM64_SYNC_CORE_H
+#define _ASM_ARM64_SYNC_CORE_H
+
+#include <asm/barrier.h>
+
+/*
+ * On arm64, anyone trying to use membarrier() to handle JIT code is
+ * required to first flush the icache and then do SYNC_CORE.  All that's
+ * needed after the icache flush is to execute a "context synchronization
+ * event".  Right now, ERET does this, and we are guaranteed to ERET before
+ * any user code runs.  If Linux ever programs the CPU to make ERET stop
+ * being a context synchronizing event, then this will need to be adjusted.
+ */
+static inline void membarrier_sync_core_before_usermode(void)
+{
+}
+
+#endif /* _ASM_ARM64_SYNC_CORE_H */
diff --git a/arch/powerpc/include/asm/sync_core.h b/arch/powerpc/include/asm/sync_core.h
new file mode 100644
index 000000000000..589fdb34beab
--- /dev/null
+++ b/arch/powerpc/include/asm/sync_core.h
@@ -0,0 +1,14 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_POWERPC_SYNC_CORE_H
+#define _ASM_POWERPC_SYNC_CORE_H
+
+#include <asm/barrier.h>
+
+/*
+ * XXX: can a powerpc person put an appropriate comment here?
+ */
+static inline void membarrier_sync_core_before_usermode(void)
+{
+}
+
+#endif /* _ASM_POWERPC_SYNC_CORE_H */
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 0045e1b44190..f010897a1e8a 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -89,7 +89,6 @@ config X86
 	select ARCH_HAS_SET_DIRECT_MAP
 	select ARCH_HAS_STRICT_KERNEL_RWX
 	select ARCH_HAS_STRICT_MODULE_RWX
-	select ARCH_HAS_SYNC_CORE_BEFORE_USERMODE
 	select ARCH_HAS_SYSCALL_WRAPPER
 	select ARCH_HAS_UBSAN_SANITIZE_ALL
 	select ARCH_HAS_DEBUG_WX
diff --git a/arch/x86/include/asm/sync_core.h b/arch/x86/include/asm/sync_core.h
index ab7382f92aff..c665b453969a 100644
--- a/arch/x86/include/asm/sync_core.h
+++ b/arch/x86/include/asm/sync_core.h
@@ -89,11 +89,10 @@ static inline void sync_core(void)
 }
 
 /*
- * Ensure that a core serializing instruction is issued before returning
- * to user-mode. x86 implements return to user-space through sysexit,
- * sysrel, and sysretq, which are not core serializing.
+ * Ensure that the CPU notices any instruction changes before the next time
+ * it returns to usermode.
  */
-static inline void sync_core_before_usermode(void)
+static inline void membarrier_sync_core_before_usermode(void)
 {
 	/* With PTI, we unconditionally serialize before running user code. */
 	if (static_cpu_has(X86_FEATURE_PTI))
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 6974b5174495..52ead5f4fcdc 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -17,7 +17,7 @@
 #include <linux/kprobes.h>
 #include <linux/mmu_context.h>
 #include <linux/bsearch.h>
-#include <linux/sync_core.h>
+#include <asm/sync_core.h>
 #include <asm/text-patching.h>
 #include <asm/alternative.h>
 #include <asm/sections.h>
diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index bf7fe87a7e88..4a577980d4d1 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -41,12 +41,12 @@
 #include <linux/irq_work.h>
 #include <linux/export.h>
 #include <linux/set_memory.h>
-#include <linux/sync_core.h>
 #include <linux/task_work.h>
 #include <linux/hardirq.h>
 
 #include <asm/intel-family.h>
 #include <asm/processor.h>
+#include <asm/sync_core.h>
 #include <asm/traps.h>
 #include <asm/tlbflush.h>
 #include <asm/mce.h>
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 59488d663e68..35b622fd2ed1 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -11,6 +11,7 @@
 #include <linux/sched/mm.h>
 
 #include <asm/tlbflush.h>
+#include <asm/sync_core.h>
 #include <asm/mmu_context.h>
 #include <asm/nospec-branch.h>
 #include <asm/cache.h>
@@ -538,7 +539,7 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 			 */
 			if (unlikely(atomic_read(&next->membarrier_state) &
 				     MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE))
-				sync_core_before_usermode();
+				membarrier_sync_core_before_usermode();
 #endif
 
 			return;
diff --git a/drivers/misc/sgi-gru/grufault.c b/drivers/misc/sgi-gru/grufault.c
index 723825524ea0..48fd5b101de1 100644
--- a/drivers/misc/sgi-gru/grufault.c
+++ b/drivers/misc/sgi-gru/grufault.c
@@ -20,8 +20,8 @@
 #include <linux/io.h>
 #include <linux/uaccess.h>
 #include <linux/security.h>
-#include <linux/sync_core.h>
 #include <linux/prefetch.h>
+#include <asm/sync_core.h>
 #include "gru.h"
 #include "grutables.h"
 #include "grulib.h"
diff --git a/drivers/misc/sgi-gru/gruhandles.c b/drivers/misc/sgi-gru/gruhandles.c
index 1d75d5e540bc..c8cba1c1b00f 100644
--- a/drivers/misc/sgi-gru/gruhandles.c
+++ b/drivers/misc/sgi-gru/gruhandles.c
@@ -16,7 +16,7 @@
 #define GRU_OPERATION_TIMEOUT	(((cycles_t) local_cpu_data->itc_freq)*10)
 #define CLKS2NSEC(c)		((c) *1000000000 / local_cpu_data->itc_freq)
 #else
-#include <linux/sync_core.h>
+#include <asm/sync_core.h>
 #include <asm/tsc.h>
 #define GRU_OPERATION_TIMEOUT	((cycles_t) tsc_khz*10*1000)
 #define CLKS2NSEC(c)		((c) * 1000000 / tsc_khz)
diff --git a/drivers/misc/sgi-gru/grukservices.c b/drivers/misc/sgi-gru/grukservices.c
index 0ea923fe6371..ce03ff3f7c3a 100644
--- a/drivers/misc/sgi-gru/grukservices.c
+++ b/drivers/misc/sgi-gru/grukservices.c
@@ -16,10 +16,10 @@
 #include <linux/miscdevice.h>
 #include <linux/proc_fs.h>
 #include <linux/interrupt.h>
-#include <linux/sync_core.h>
 #include <linux/uaccess.h>
 #include <linux/delay.h>
 #include <linux/export.h>
+#include <asm/sync_core.h>
 #include <asm/io_apic.h>
 #include "gru.h"
 #include "grulib.h"
diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index c6eebbafadb0..845db11190cd 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -7,7 +7,6 @@
 #include <linux/sched.h>
 #include <linux/mm_types.h>
 #include <linux/gfp.h>
-#include <linux/sync_core.h>
 
 /*
  * Routines for handling mm_structs
diff --git a/include/linux/sync_core.h b/include/linux/sync_core.h
deleted file mode 100644
index 013da4b8b327..000000000000
--- a/include/linux/sync_core.h
+++ /dev/null
@@ -1,21 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 */
-#ifndef _LINUX_SYNC_CORE_H
-#define _LINUX_SYNC_CORE_H
-
-#ifdef CONFIG_ARCH_HAS_SYNC_CORE_BEFORE_USERMODE
-#include <asm/sync_core.h>
-#else
-/*
- * This is a dummy sync_core_before_usermode() implementation that can be used
- * on all architectures which return to user-space through core serializing
- * instructions.
- * If your architecture returns to user-space through non-core-serializing
- * instructions, you need to write your own functions.
- */
-static inline void sync_core_before_usermode(void)
-{
-}
-#endif
-
-#endif /* _LINUX_SYNC_CORE_H */
-
diff --git a/init/Kconfig b/init/Kconfig
index 1ea12c64e4c9..e5d552b0823e 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -2377,9 +2377,6 @@ source "kernel/Kconfig.locks"
 config ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE
 	bool
 
-config ARCH_HAS_SYNC_CORE_BEFORE_USERMODE
-	bool
-
 # It may be useful for an architecture to override the definitions of the
 # SYSCALL_DEFINE() and __SYSCALL_DEFINEx() macros in <linux/syscalls.h>
 # and the COMPAT_ variants in <linux/compat.h>, in particular to use a
diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c
index c32c32a2441e..f72a6ab3fac2 100644
--- a/kernel/sched/membarrier.c
+++ b/kernel/sched/membarrier.c
@@ -5,6 +5,9 @@
  * membarrier system call
  */
 #include "sched.h"
+#ifdef CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE
+#include <asm/sync_core.h>
+#endif
 
 /*
  * The basic principle behind the regular memory barrier mode of membarrier()
@@ -221,6 +224,7 @@ static void ipi_mb(void *info)
 	smp_mb();	/* IPIs should be serializing but paranoid. */
 }
 
+#ifdef CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE
 static void ipi_sync_core(void *info)
 {
 	/*
@@ -230,13 +234,14 @@ static void ipi_sync_core(void *info)
 	 * the big comment at the top of this file.
 	 *
 	 * A sync_core() would provide this guarantee, but
-	 * sync_core_before_usermode() might end up being deferred until
-	 * after membarrier()'s smp_mb().
+	 * membarrier_sync_core_before_usermode() might end up being deferred
+	 * until after membarrier()'s smp_mb().
 	 */
 	smp_mb();	/* IPIs should be serializing but paranoid. */
 
-	sync_core_before_usermode();
+	membarrier_sync_core_before_usermode();
 }
+#endif
 
 static void ipi_rseq(void *info)
 {
@@ -368,12 +373,14 @@ static int membarrier_private_expedited(int flags, int cpu_id)
 	smp_call_func_t ipi_func = ipi_mb;
 
 	if (flags == MEMBARRIER_FLAG_SYNC_CORE) {
-		if (!IS_ENABLED(CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE))
+#ifndef CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE
 			return -EINVAL;
+#else
 		if (!(atomic_read(&mm->membarrier_state) &
 		      MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE_READY))
 			return -EPERM;
 		ipi_func = ipi_sync_core;
+#endif
 	} else if (flags == MEMBARRIER_FLAG_RSEQ) {
 		if (!IS_ENABLED(CONFIG_RSEQ))
 			return -EINVAL;
-- 
2.31.1


-    |        s390: | TODO |
-    |          sh: | TODO |
-    |       sparc: | TODO |
-    |          um: | TODO |
-    |         x86: |  ok  |
-    |      xtensa: | TODO |
-    -----------------------
+# An architecture that wants to support
+# MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE needs to define precisely what it
+# is supposed to do and implement membarrier_sync_core_before_usermode() to
+# make it do that.  Then it can select ARCH_HAS_MEMBARRIER_SYNC_CORE via
+# Kconfig.  Unfortunately, MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE is not a
+# fantastic API and may not make sense on all architectures.  Once an
+# architecture meets these requirements, the command behaves as follows:
+#
+# On x86, a program can safely modify code, issue
+# MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE, and then execute that code, via
+# the modified address or an alias, from any thread in the calling process.
+#
+# On arm64, a program can modify code, flush the icache as needed, and issue
+# MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE to force a "context synchronizing
+# event", aka pipeline flush on all CPUs that might run the calling process.
+# Then the program can execute the modified code as long as it is executed
+# from an address consistent with the icache flush and the CPU's cache type.
+#
+# On powerpc, a program can use MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE
+# similarly to arm64.  It would be nice if the powerpc maintainers could
+# add a clearer explanation.
diff --git a/arch/arm64/include/asm/sync_core.h b/arch/arm64/include/asm/sync_core.h
new file mode 100644
index 000000000000..74996bf533bb
--- /dev/null
+++ b/arch/arm64/include/asm/sync_core.h
@@ -0,0 +1,19 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_ARM64_SYNC_CORE_H
+#define _ASM_ARM64_SYNC_CORE_H
+
+#include <asm/barrier.h>
+
+/*
+ * On arm64, anyone trying to use membarrier() to handle JIT code is
+ * required to first flush the icache and then do SYNC_CORE.  All that's
+ * needed after the icache flush is to execute a "context synchronization
+ * event".  Right now, ERET does this, and we are guaranteed to ERET before
+ * any user code runs.  If Linux ever programs the CPU to make ERET stop
+ * being a context synchronizing event, then this will need to be adjusted.
+ */
+static inline void membarrier_sync_core_before_usermode(void)
+{
+}
+
+#endif /* _ASM_ARM64_SYNC_CORE_H */
diff --git a/arch/powerpc/include/asm/sync_core.h b/arch/powerpc/include/asm/sync_core.h
new file mode 100644
index 000000000000..589fdb34beab
--- /dev/null
+++ b/arch/powerpc/include/asm/sync_core.h
@@ -0,0 +1,14 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_POWERPC_SYNC_CORE_H
+#define _ASM_POWERPC_SYNC_CORE_H
+
+#include <asm/barrier.h>
+
+/*
+ * XXX: can a powerpc person put an appropriate comment here?
+ */
+static inline void membarrier_sync_core_before_usermode(void)
+{
+}
+
+#endif /* _ASM_POWERPC_SYNC_CORE_H */
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 0045e1b44190..f010897a1e8a 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -89,7 +89,6 @@ config X86
 	select ARCH_HAS_SET_DIRECT_MAP
 	select ARCH_HAS_STRICT_KERNEL_RWX
 	select ARCH_HAS_STRICT_MODULE_RWX
-	select ARCH_HAS_SYNC_CORE_BEFORE_USERMODE
 	select ARCH_HAS_SYSCALL_WRAPPER
 	select ARCH_HAS_UBSAN_SANITIZE_ALL
 	select ARCH_HAS_DEBUG_WX
diff --git a/arch/x86/include/asm/sync_core.h b/arch/x86/include/asm/sync_core.h
index ab7382f92aff..c665b453969a 100644
--- a/arch/x86/include/asm/sync_core.h
+++ b/arch/x86/include/asm/sync_core.h
@@ -89,11 +89,10 @@ static inline void sync_core(void)
 }
 
 /*
- * Ensure that a core serializing instruction is issued before returning
- * to user-mode. x86 implements return to user-space through sysexit,
- * sysrel, and sysretq, which are not core serializing.
+ * Ensure that the CPU notices any instruction changes before the next time
+ * it returns to usermode.
  */
-static inline void sync_core_before_usermode(void)
+static inline void membarrier_sync_core_before_usermode(void)
 {
 	/* With PTI, we unconditionally serialize before running user code. */
 	if (static_cpu_has(X86_FEATURE_PTI))
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 6974b5174495..52ead5f4fcdc 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -17,7 +17,7 @@
 #include <linux/kprobes.h>
 #include <linux/mmu_context.h>
 #include <linux/bsearch.h>
-#include <linux/sync_core.h>
+#include <asm/sync_core.h>
 #include <asm/text-patching.h>
 #include <asm/alternative.h>
 #include <asm/sections.h>
diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index bf7fe87a7e88..4a577980d4d1 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -41,12 +41,12 @@
 #include <linux/irq_work.h>
 #include <linux/export.h>
 #include <linux/set_memory.h>
-#include <linux/sync_core.h>
 #include <linux/task_work.h>
 #include <linux/hardirq.h>
 
 #include <asm/intel-family.h>
 #include <asm/processor.h>
+#include <asm/sync_core.h>
 #include <asm/traps.h>
 #include <asm/tlbflush.h>
 #include <asm/mce.h>
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 59488d663e68..35b622fd2ed1 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -11,6 +11,7 @@
 #include <linux/sched/mm.h>
 
 #include <asm/tlbflush.h>
+#include <asm/sync_core.h>
 #include <asm/mmu_context.h>
 #include <asm/nospec-branch.h>
 #include <asm/cache.h>
@@ -538,7 +539,7 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 			 */
 			if (unlikely(atomic_read(&next->membarrier_state) &
 				     MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE))
-				sync_core_before_usermode();
+				membarrier_sync_core_before_usermode();
 #endif
 
 			return;
diff --git a/drivers/misc/sgi-gru/grufault.c b/drivers/misc/sgi-gru/grufault.c
index 723825524ea0..48fd5b101de1 100644
--- a/drivers/misc/sgi-gru/grufault.c
+++ b/drivers/misc/sgi-gru/grufault.c
@@ -20,8 +20,8 @@
 #include <linux/io.h>
 #include <linux/uaccess.h>
 #include <linux/security.h>
-#include <linux/sync_core.h>
 #include <linux/prefetch.h>
+#include <asm/sync_core.h>
 #include "gru.h"
 #include "grutables.h"
 #include "grulib.h"
diff --git a/drivers/misc/sgi-gru/gruhandles.c b/drivers/misc/sgi-gru/gruhandles.c
index 1d75d5e540bc..c8cba1c1b00f 100644
--- a/drivers/misc/sgi-gru/gruhandles.c
+++ b/drivers/misc/sgi-gru/gruhandles.c
@@ -16,7 +16,7 @@
 #define GRU_OPERATION_TIMEOUT	(((cycles_t) local_cpu_data->itc_freq)*10)
 #define CLKS2NSEC(c)		((c) *1000000000 / local_cpu_data->itc_freq)
 #else
-#include <linux/sync_core.h>
+#include <asm/sync_core.h>
 #include <asm/tsc.h>
 #define GRU_OPERATION_TIMEOUT	((cycles_t) tsc_khz*10*1000)
 #define CLKS2NSEC(c)		((c) * 1000000 / tsc_khz)
diff --git a/drivers/misc/sgi-gru/grukservices.c b/drivers/misc/sgi-gru/grukservices.c
index 0ea923fe6371..ce03ff3f7c3a 100644
--- a/drivers/misc/sgi-gru/grukservices.c
+++ b/drivers/misc/sgi-gru/grukservices.c
@@ -16,10 +16,10 @@
 #include <linux/miscdevice.h>
 #include <linux/proc_fs.h>
 #include <linux/interrupt.h>
-#include <linux/sync_core.h>
 #include <linux/uaccess.h>
 #include <linux/delay.h>
 #include <linux/export.h>
+#include <asm/sync_core.h>
 #include <asm/io_apic.h>
 #include "gru.h"
 #include "grulib.h"
diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index c6eebbafadb0..845db11190cd 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -7,7 +7,6 @@
 #include <linux/sched.h>
 #include <linux/mm_types.h>
 #include <linux/gfp.h>
-#include <linux/sync_core.h>
 
 /*
  * Routines for handling mm_structs
diff --git a/include/linux/sync_core.h b/include/linux/sync_core.h
deleted file mode 100644
index 013da4b8b327..000000000000
--- a/include/linux/sync_core.h
+++ /dev/null
@@ -1,21 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 */
-#ifndef _LINUX_SYNC_CORE_H
-#define _LINUX_SYNC_CORE_H
-
-#ifdef CONFIG_ARCH_HAS_SYNC_CORE_BEFORE_USERMODE
-#include <asm/sync_core.h>
-#else
-/*
- * This is a dummy sync_core_before_usermode() implementation that can be used
- * on all architectures which return to user-space through core serializing
- * instructions.
- * If your architecture returns to user-space through non-core-serializing
- * instructions, you need to write your own functions.
- */
-static inline void sync_core_before_usermode(void)
-{
-}
-#endif
-
-#endif /* _LINUX_SYNC_CORE_H */
-
diff --git a/init/Kconfig b/init/Kconfig
index 1ea12c64e4c9..e5d552b0823e 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -2377,9 +2377,6 @@ source "kernel/Kconfig.locks"
 config ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE
 	bool
 
-config ARCH_HAS_SYNC_CORE_BEFORE_USERMODE
-	bool
-
 # It may be useful for an architecture to override the definitions of the
 # SYSCALL_DEFINE() and __SYSCALL_DEFINEx() macros in <linux/syscalls.h>
 # and the COMPAT_ variants in <linux/compat.h>, in particular to use a
diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c
index c32c32a2441e..f72a6ab3fac2 100644
--- a/kernel/sched/membarrier.c
+++ b/kernel/sched/membarrier.c
@@ -5,6 +5,9 @@
  * membarrier system call
  */
 #include "sched.h"
+#ifdef CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE
+#include <asm/sync_core.h>
+#endif
 
 /*
  * The basic principle behind the regular memory barrier mode of membarrier()
@@ -221,6 +224,7 @@ static void ipi_mb(void *info)
 	smp_mb();	/* IPIs should be serializing but paranoid. */
 }
 
+#ifdef CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE
 static void ipi_sync_core(void *info)
 {
 	/*
@@ -230,13 +234,14 @@ static void ipi_sync_core(void *info)
 	 * the big comment at the top of this file.
 	 *
 	 * A sync_core() would provide this guarantee, but
-	 * sync_core_before_usermode() might end up being deferred until
-	 * after membarrier()'s smp_mb().
+	 * membarrier_sync_core_before_usermode() might end up being deferred
+	 * until after membarrier()'s smp_mb().
 	 */
 	smp_mb();	/* IPIs should be serializing but paranoid. */
 
-	sync_core_before_usermode();
+	membarrier_sync_core_before_usermode();
 }
+#endif
 
 static void ipi_rseq(void *info)
 {
@@ -368,12 +373,14 @@ static int membarrier_private_expedited(int flags, int cpu_id)
 	smp_call_func_t ipi_func = ipi_mb;
 
 	if (flags == MEMBARRIER_FLAG_SYNC_CORE) {
-		if (!IS_ENABLED(CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE))
+#ifndef CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE
 			return -EINVAL;
+#else
 		if (!(atomic_read(&mm->membarrier_state) &
 		      MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE_READY))
 			return -EPERM;
 		ipi_func = ipi_sync_core;
+#endif
 	} else if (flags == MEMBARRIER_FLAG_RSEQ) {
 		if (!IS_ENABLED(CONFIG_RSEQ))
 			return -EINVAL;
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 165+ messages in thread

* Re: [PATCH 1/8] membarrier: Document why membarrier() works
  2021-06-16  3:21 ` [PATCH 1/8] membarrier: Document why membarrier() works Andy Lutomirski
@ 2021-06-16  4:00   ` Nicholas Piggin
  2021-06-16  7:30     ` Peter Zijlstra
  0 siblings, 1 reply; 165+ messages in thread
From: Nicholas Piggin @ 2021-06-16  4:00 UTC (permalink / raw)
  To: Andy Lutomirski, x86
  Cc: Andrew Morton, Dave Hansen, LKML, linux-mm, Mathieu Desnoyers,
	Peter Zijlstra

Excerpts from Andy Lutomirski's message of June 16, 2021 1:21 pm:
> We had a nice comment at the top of membarrier.c explaining why membarrier
> worked in a handful of scenarios, but that consisted more of a list of
> things not to forget than an actual description of the algorithm and why it
> should be expected to work.
> 
> Add a comment explaining my understanding of the algorithm.  This exposes a
> couple of implementation issues that I will hopefully fix up in subsequent
> patches.
> 
> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> Cc: Nicholas Piggin <npiggin@gmail.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> ---
>  kernel/sched/membarrier.c | 55 +++++++++++++++++++++++++++++++++++++++
>  1 file changed, 55 insertions(+)
> 
> diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c
> index b5add64d9698..3173b063d358 100644
> --- a/kernel/sched/membarrier.c
> +++ b/kernel/sched/membarrier.c
> @@ -7,6 +7,61 @@
>  #include "sched.h"
>  

Precisely describing the orderings is great, but I'm not a fan of the style
of the comment.

>  /*
> + * The basic principle behind the regular memory barrier mode of membarrier()
> + * is as follows.  For each CPU, membarrier() operates in one of two
> + * modes.

membarrier(2) is called by one CPU, and it iterates over target CPUs, 
and for each of them it...

> Either it sends an IPI or it does not. If membarrier() sends an
> + * IPI, then we have the following sequence of events:
> + *
> + * 1. membarrier() does smp_mb().
> + * 2. membarrier() does a store (the IPI request payload) that is observed by
> + *    the target CPU.
> + * 3. The target CPU does smp_mb().
> + * 4. The target CPU does a store (the completion indication) that is observed
> + *    by membarrier()'s wait-for-IPIs-to-finish request.
> + * 5. membarrier() does smp_mb().
> + *
> + * So all pre-membarrier() local accesses are visible after the IPI on the
> + * target CPU and all pre-IPI remote accesses are visible after
> + * membarrier(). IOW membarrier() has synchronized both ways with the target
> + * CPU.
> + *
> + * (This has the caveat that membarrier() does not interrupt the CPU that it's
> + * running on at the time it sends the IPIs. However, if that is the CPU on
> + * which membarrier() starts and/or finishes, membarrier() does smp_mb() and,
> + * if not, then membarrier() scheduled, and scheduling had better include a
> + * full barrier somewhere for basic correctness regardless of membarrier.)
> + *
> + * If membarrier() does not send an IPI, this means that membarrier() reads
> + * cpu_rq(cpu)->curr->mm and that the result is not equal to the target
> + * mm.

If membarrier(2) reads cpu_rq(target)->curr->mm and finds it !=
current->mm, this means it doesn't send an IPI. Even "had read" would at
least make it past tense. I know what you mean; it just sounds backwards as
worded.

> Let's assume for now that tasks never change their mm field.  The
> + * sequence of events is:
> + *
> + * 1. Target CPU switches away from the target mm (or goes lazy or has never
> + *    run the target mm in the first place). This involves smp_mb() followed
> + *    by a write to cpu_rq(cpu)->curr.
> + * 2. membarrier() does smp_mb(). (This is NOT synchronized with any action
> + *    done by the target.)
> + * 3. membarrier() observes the value written in step 1 and does *not* observe
> + *    the value written in step 5.
> + * 4. membarrier() does smp_mb().
> + * 5. Target CPU switches back to the target mm and writes to
> + *    cpu_rq(cpu)->curr. (This is NOT synchronized with any action on
> + *    membarrier()'s part.)
> + * 6. Target CPU executes smp_mb()
> + *
> + * All pre-schedule accesses on the remote CPU are visible after membarrier()
> + * because they all precede the target's write in step 1 and are synchronized
> + * to the local CPU by steps 3 and 4.  All pre-membarrier() accesses on the
> + * local CPU are visible on the remote CPU after scheduling because they
> + * happen before the smp_mb(); read in steps 2 and 3 and that read precedes
> + * the target's smp_mb() in step 6.
> + *
> + * However, tasks can change their ->mm, e.g., via kthread_use_mm().  So
> + * tasks that switch their ->mm must follow the same rules as the scheduler
> + * changing rq->curr, and the membarrier() code needs to do both dereferences
> + * carefully.

I would prefer the memory accesses, barriers, and post-conditions to be
written in a more precise style, like the rest of the comments. I think it's
a good idea to break down the higher-level choices and treat a single
target CPU at a time, but it can be done in the same style:

   p = rcu_dereference(rq->curr);
   if (p->mm == current->mm)
     // IPI case
   else
     // No IPI case

   // IPI case:
   ...

   // No IPI case:
   ...
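
Something like this, say (just a rough sketch of how the two branches could
be filled in, reusing the steps quoted above; not final wording):

   // IPI case:
   //   membarrier():  smp_mb(); store the IPI request
   //   target CPU:    smp_mb(); store the completion flag
   //   membarrier():  observes completion; smp_mb()
   //   => each side's prior accesses are visible to the other afterwards.

   // No IPI case:
   //   target CPU:    smp_mb(); store rq->curr (switched away or lazy)
   //   membarrier():  smp_mb(); reads rq->curr->mm != mm; smp_mb()
   //   target CPU:    store rq->curr (switches back); smp_mb()
   //   => the target's pre-switch accesses are visible after membarrier(),
   //      and pre-membarrier() accesses are visible once the target is
   //      back in the mm.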

> + *
> + *
>   * For documentation purposes, here are some membarrier ordering
>   * scenarios to keep in mind:

And I think it really needs to be integrated somehow with the rest of
the comments that follow. For example, your IPI case and the A/B cases
treat the same subject, just at slightly different levels of detail.

Thanks,
Nick

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH 4/8] membarrier: Make the post-switch-mm barrier explicit
  2021-06-16  3:21 ` [PATCH 4/8] membarrier: Make the post-switch-mm barrier explicit Andy Lutomirski
@ 2021-06-16  4:19   ` Nicholas Piggin
  2021-06-16  7:35     ` Peter Zijlstra
  0 siblings, 1 reply; 165+ messages in thread
From: Nicholas Piggin @ 2021-06-16  4:19 UTC (permalink / raw)
  To: Andy Lutomirski, x86
  Cc: Andrew Morton, Dave Hansen, LKML, linux-mm, Mathieu Desnoyers,
	Peter Zijlstra

Excerpts from Andy Lutomirski's message of June 16, 2021 1:21 pm:
> membarrier() needs a barrier after any CPU changes mm.  There is currently
> a comment explaining why this barrier probably exists in all cases.  This
> is very fragile -- any change to the relevant parts of the scheduler
> might get rid of these barriers, and it's not really clear to me that
> the barrier actually exists in all necessary cases.

The comments and barriers in the mmdrop() hunks? I don't see what is 
fragile or maybe-buggy about this. The barrier definitely exists.

And any change can change anything; that doesn't make it fragile. My
lazy tlb refcounting change avoids the mmdrop() in some cases, but it
replaces it with an smp_mb(), for example.

If you have some later changes that require this, can you post them
or move this patch to them?

> 
> Simplify the logic by adding an explicit barrier, and allow architectures
> to override it as an optimization if they want to.
> 
> One of the deleted comments in this patch said "It is therefore
> possible to schedule between user->kernel->user threads without
> passing through switch_mm()".  It is possible to do this without, say,
> writing to CR3 on x86, but the core scheduler indeed calls
> switch_mm_irqs_off() to tell the arch code to go back from lazy mode
> to no-lazy mode.

Context switching between threads provides a barrier as well, so that comment
at least probably stands to be improved.

Thanks,
Nick

> 
> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> Cc: Nicholas Piggin <npiggin@gmail.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> ---
>  include/linux/sched/mm.h | 21 +++++++++++++++++++++
>  kernel/kthread.c         | 12 +-----------
>  kernel/sched/core.c      | 35 +++++++++--------------------------
>  3 files changed, 31 insertions(+), 37 deletions(-)
> 
> diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
> index 10aace21d25e..c6eebbafadb0 100644
> --- a/include/linux/sched/mm.h
> +++ b/include/linux/sched/mm.h
> @@ -341,6 +341,27 @@ enum {
>  	MEMBARRIER_FLAG_RSEQ		= (1U << 1),
>  };
>  
> +#ifdef CONFIG_MEMBARRIER
> +
> +/*
> + * Called by the core scheduler after calling switch_mm_irqs_off().
> + * Architectures that have implicit barriers when switching mms can
> + * override this as an optimization.
> + */
> +#ifndef membarrier_finish_switch_mm
> +static inline void membarrier_finish_switch_mm(int membarrier_state)
> +{
> +	if (membarrier_state & (MEMBARRIER_STATE_GLOBAL_EXPEDITED | MEMBARRIER_STATE_PRIVATE_EXPEDITED))
> +		smp_mb();
> +}
> +#endif
> +
> +#else
> +
> +static inline void membarrier_finish_switch_mm(int membarrier_state) {}
> +
> +#endif
> +
>  #ifdef CONFIG_ARCH_HAS_MEMBARRIER_CALLBACKS
>  #include <asm/membarrier.h>
>  #endif
> diff --git a/kernel/kthread.c b/kernel/kthread.c
> index fe3f2a40d61e..8275b415acec 100644
> --- a/kernel/kthread.c
> +++ b/kernel/kthread.c
> @@ -1325,25 +1325,15 @@ void kthread_use_mm(struct mm_struct *mm)
>  	tsk->mm = mm;
>  	membarrier_update_current_mm(mm);
>  	switch_mm_irqs_off(active_mm, mm, tsk);
> +	membarrier_finish_switch_mm(atomic_read(&mm->membarrier_state));
>  	local_irq_enable();
>  	task_unlock(tsk);
>  #ifdef finish_arch_post_lock_switch
>  	finish_arch_post_lock_switch();
>  #endif
>  
> -	/*
> -	 * When a kthread starts operating on an address space, the loop
> -	 * in membarrier_{private,global}_expedited() may not observe
> -	 * that tsk->mm, and not issue an IPI. Membarrier requires a
> -	 * memory barrier after storing to tsk->mm, before accessing
> -	 * user-space memory. A full memory barrier for membarrier
> -	 * {PRIVATE,GLOBAL}_EXPEDITED is implicitly provided by
> -	 * mmdrop(), or explicitly with smp_mb().
> -	 */
>  	if (active_mm != mm)
>  		mmdrop(active_mm);
> -	else
> -		smp_mb();
>  
>  	to_kthread(tsk)->oldfs = force_uaccess_begin();
>  }
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index e4c122f8bf21..329a6d2a4e13 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4221,15 +4221,6 @@ static struct rq *finish_task_switch(struct task_struct *prev)
>  
>  	fire_sched_in_preempt_notifiers(current);
>  
> -	/*
> -	 * When switching through a kernel thread, the loop in
> -	 * membarrier_{private,global}_expedited() may have observed that
> -	 * kernel thread and not issued an IPI. It is therefore possible to
> -	 * schedule between user->kernel->user threads without passing though
> -	 * switch_mm(). Membarrier requires a barrier after storing to
> -	 * rq->curr, before returning to userspace, and mmdrop() provides
> -	 * this barrier.
> -	 */
>  	if (mm)
>  		mmdrop(mm);
>  
> @@ -4311,15 +4302,14 @@ context_switch(struct rq *rq, struct task_struct *prev,
>  			prev->active_mm = NULL;
>  	} else {                                        // to user
>  		membarrier_switch_mm(rq, prev->active_mm, next->mm);
> +		switch_mm_irqs_off(prev->active_mm, next->mm, next);
> +
>  		/*
>  		 * sys_membarrier() requires an smp_mb() between setting
> -		 * rq->curr / membarrier_switch_mm() and returning to userspace.
> -		 *
> -		 * The below provides this either through switch_mm(), or in
> -		 * case 'prev->active_mm == next->mm' through
> -		 * finish_task_switch()'s mmdrop().
> +		 * rq->curr->mm to a membarrier-enabled mm and returning
> +		 * to userspace.
>  		 */
> -		switch_mm_irqs_off(prev->active_mm, next->mm, next);
> +		membarrier_finish_switch_mm(rq->membarrier_state);
>  
>  		if (!prev->mm) {                        // from kernel
>  			/* will mmdrop() in finish_task_switch(). */
> @@ -5121,17 +5111,10 @@ static void __sched notrace __schedule(bool preempt)
>  		RCU_INIT_POINTER(rq->curr, next);
>  		/*
>  		 * The membarrier system call requires each architecture
> -		 * to have a full memory barrier after updating
> -		 * rq->curr, before returning to user-space.
> -		 *
> -		 * Here are the schemes providing that barrier on the
> -		 * various architectures:
> -		 * - mm ? switch_mm() : mmdrop() for x86, s390, sparc, PowerPC.
> -		 *   switch_mm() rely on membarrier_arch_switch_mm() on PowerPC.
> -		 * - finish_lock_switch() for weakly-ordered
> -		 *   architectures where spin_unlock is a full barrier,
> -		 * - switch_to() for arm64 (weakly-ordered, spin_unlock
> -		 *   is a RELEASE barrier),
> +		 * to have a full memory barrier before and after updating
> +		 * rq->curr->mm, before returning to userspace.  This
> +		 * is provided by membarrier_finish_switch_mm().  Architectures
> +		 * that want to optimize this can override that function.
>  		 */
>  		++*switch_count;
>  
> -- 
> 2.31.1
> 
> 

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH 2/8] x86/mm: Handle unlazying membarrier core sync in the arch code
  2021-06-16  3:21 ` [PATCH 2/8] x86/mm: Handle unlazying membarrier core sync in the arch code Andy Lutomirski
@ 2021-06-16  4:25   ` Nicholas Piggin
  2021-06-16 18:31     ` Andy Lutomirski
  2021-06-16 17:49     ` Mathieu Desnoyers
  1 sibling, 1 reply; 165+ messages in thread
From: Nicholas Piggin @ 2021-06-16  4:25 UTC (permalink / raw)
  To: Andy Lutomirski, x86
  Cc: Andrew Morton, Dave Hansen, LKML, linux-mm, Mathieu Desnoyers,
	Peter Zijlstra

Excerpts from Andy Lutomirski's message of June 16, 2021 1:21 pm:
> The core scheduler isn't a great place for
> membarrier_mm_sync_core_before_usermode() -- the core scheduler
> doesn't actually know whether we are lazy.  With the old code, if a
> CPU is running a membarrier-registered task, goes idle, gets unlazied
> via a TLB shootdown IPI, and switches back to the
> membarrier-registered task, it will do an unnecessary core sync.

I don't really mind, but ARM64 at least hints that it might be needed
at some point. They can always add it back then, but let's check.

> Conveniently, x86 is the only architecture that does anything in this
> sync_core_before_usermode(), so membarrier_mm_sync_core_before_usermode()
> is a no-op on all other architectures and we can just move the code.

If ARM64 does want it (now, or adds it back later), x86 can always make
membarrier_mm_sync_core_before_usermode() a no-op with a comment
explaining where it executes the serializing instruction.

I'm fine with the patch though, except I would leave the comment in the
core sched code saying that an arch-specific sequence to deal with
SYNC_CORE is required for that case.

Thanks,
Nick

> 
> (I am not claiming that the SYNC_CORE code was correct before or after this
>  change on any non-x86 architecture.  I merely claim that this change
>  improves readability, is correct on x86, and makes no change on any other
>  architecture.)
> 
> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> Cc: Nicholas Piggin <npiggin@gmail.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> ---
>  arch/x86/mm/tlb.c        | 53 +++++++++++++++++++++++++++++++---------
>  include/linux/sched/mm.h | 13 ----------
>  kernel/sched/core.c      | 13 ++++------
>  3 files changed, 46 insertions(+), 33 deletions(-)
> 
> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
> index 78804680e923..59488d663e68 100644
> --- a/arch/x86/mm/tlb.c
> +++ b/arch/x86/mm/tlb.c
> @@ -8,6 +8,7 @@
>  #include <linux/export.h>
>  #include <linux/cpu.h>
>  #include <linux/debugfs.h>
> +#include <linux/sched/mm.h>
>  
>  #include <asm/tlbflush.h>
>  #include <asm/mmu_context.h>
> @@ -473,16 +474,24 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
>  		this_cpu_write(cpu_tlbstate_shared.is_lazy, false);
>  
>  	/*
> -	 * The membarrier system call requires a full memory barrier and
> -	 * core serialization before returning to user-space, after
> -	 * storing to rq->curr, when changing mm.  This is because
> -	 * membarrier() sends IPIs to all CPUs that are in the target mm
> -	 * to make them issue memory barriers.  However, if another CPU
> -	 * switches to/from the target mm concurrently with
> -	 * membarrier(), it can cause that CPU not to receive an IPI
> -	 * when it really should issue a memory barrier.  Writing to CR3
> -	 * provides that full memory barrier and core serializing
> -	 * instruction.
> +	 * membarrier() support requires that, when we change rq->curr->mm:
> +	 *
> +	 *  - If next->mm has membarrier registered, a full memory barrier
> +	 *    after writing rq->curr (or rq->curr->mm if we switched the mm
> +	 *    without switching tasks) and before returning to user mode.
> +	 *
> +	 *  - If next->mm has SYNC_CORE registered, then we sync core before
> +	 *    returning to user mode.
> +	 *
> +	 * In the case where prev->mm == next->mm, membarrier() uses an IPI
> +	 * instead, and no particular barriers are needed while context
> +	 * switching.
> +	 *
> +	 * x86 gets all of this as a side-effect of writing to CR3 except
> +	 * in the case where we unlazy without flushing.
> +	 *
> +	 * All other architectures are civilized and do all of this implicitly
> +	 * when transitioning from kernel to user mode.
>  	 */
>  	if (real_prev == next) {
>  		VM_WARN_ON(this_cpu_read(cpu_tlbstate.ctxs[prev_asid].ctx_id) !=
> @@ -500,7 +509,8 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
>  		/*
>  		 * If the CPU is not in lazy TLB mode, we are just switching
>  		 * from one thread in a process to another thread in the same
> -		 * process. No TLB flush required.
> +		 * process. No TLB flush or membarrier() synchronization
> +		 * is required.
>  		 */
>  		if (!was_lazy)
>  			return;
> @@ -510,16 +520,35 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
>  		 * If the TLB is up to date, just use it.
>  		 * The barrier synchronizes with the tlb_gen increment in
>  		 * the TLB shootdown code.
> +		 *
> +		 * As a future optimization opportunity, it's plausible
> +		 * that the x86 memory model is strong enough that this
> +		 * smp_mb() isn't needed.
>  		 */
>  		smp_mb();
>  		next_tlb_gen = atomic64_read(&next->context.tlb_gen);
>  		if (this_cpu_read(cpu_tlbstate.ctxs[prev_asid].tlb_gen) ==
> -				next_tlb_gen)
> +		    next_tlb_gen) {
> +#ifdef CONFIG_MEMBARRIER
> +			/*
> +			 * We switched logical mm but we're not going to
> +			 * write to CR3.  We already did smp_mb() above,
> +			 * but membarrier() might require a sync_core()
> +			 * as well.
> +			 */
> +			if (unlikely(atomic_read(&next->membarrier_state) &
> +				     MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE))
> +				sync_core_before_usermode();
> +#endif
> +
>  			return;
> +		}
>  
>  		/*
>  		 * TLB contents went out of date while we were in lazy
>  		 * mode. Fall through to the TLB switching code below.
> +		 * No need for an explicit membarrier invocation -- the CR3
> +		 * write will serialize.
>  		 */
>  		new_asid = prev_asid;
>  		need_flush = true;
> diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
> index e24b1fe348e3..24d97d1b6252 100644
> --- a/include/linux/sched/mm.h
> +++ b/include/linux/sched/mm.h
> @@ -345,16 +345,6 @@ enum {
>  #include <asm/membarrier.h>
>  #endif
>  
> -static inline void membarrier_mm_sync_core_before_usermode(struct mm_struct *mm)
> -{
> -	if (current->mm != mm)
> -		return;
> -	if (likely(!(atomic_read(&mm->membarrier_state) &
> -		     MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE)))
> -		return;
> -	sync_core_before_usermode();
> -}
> -
>  extern void membarrier_exec_mmap(struct mm_struct *mm);
>  
>  extern void membarrier_update_current_mm(struct mm_struct *next_mm);
> @@ -370,9 +360,6 @@ static inline void membarrier_arch_switch_mm(struct mm_struct *prev,
>  static inline void membarrier_exec_mmap(struct mm_struct *mm)
>  {
>  }
> -static inline void membarrier_mm_sync_core_before_usermode(struct mm_struct *mm)
> -{
> -}
>  static inline void membarrier_update_current_mm(struct mm_struct *next_mm)
>  {
>  }
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 5226cc26a095..e4c122f8bf21 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4220,22 +4220,19 @@ static struct rq *finish_task_switch(struct task_struct *prev)
>  	kmap_local_sched_in();
>  
>  	fire_sched_in_preempt_notifiers(current);
> +
>  	/*
>  	 * When switching through a kernel thread, the loop in
>  	 * membarrier_{private,global}_expedited() may have observed that
>  	 * kernel thread and not issued an IPI. It is therefore possible to
>  	 * schedule between user->kernel->user threads without passing though
>  	 * switch_mm(). Membarrier requires a barrier after storing to
> -	 * rq->curr, before returning to userspace, so provide them here:
> -	 *
> -	 * - a full memory barrier for {PRIVATE,GLOBAL}_EXPEDITED, implicitly
> -	 *   provided by mmdrop(),
> -	 * - a sync_core for SYNC_CORE.
> +	 * rq->curr, before returning to userspace, and mmdrop() provides
> +	 * this barrier.
>  	 */
> -	if (mm) {
> -		membarrier_mm_sync_core_before_usermode(mm);
> +	if (mm)
>  		mmdrop(mm);
> -	}
> +
>  	if (unlikely(prev_state == TASK_DEAD)) {
>  		if (prev->sched_class->task_dead)
>  			prev->sched_class->task_dead(prev);
> -- 
> 2.31.1
> 
> 

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH 3/8] membarrier: Remove membarrier_arch_switch_mm() prototype in core code
  2021-06-16  3:21 ` [PATCH 3/8] membarrier: Remove membarrier_arch_switch_mm() prototype in core code Andy Lutomirski
@ 2021-06-16  4:26   ` Nicholas Piggin
  2021-06-16 17:52     ` Mathieu Desnoyers
  1 sibling, 0 replies; 165+ messages in thread
From: Nicholas Piggin @ 2021-06-16  4:26 UTC (permalink / raw)
  To: Andy Lutomirski, x86
  Cc: Andrew Morton, Dave Hansen, LKML, linux-mm, Mathieu Desnoyers,
	Peter Zijlstra

Excerpts from Andy Lutomirski's message of June 16, 2021 1:21 pm:
> membarrier_arch_switch_mm()'s sole implementation and caller are in
> arch/powerpc.  Having a fallback implementation in include/linux is
> confusing -- remove it.
> 
> It's still mentioned in a comment, but a subsequent patch will remove
> it.
> 
> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> Cc: Nicholas Piggin <npiggin@gmail.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Andy Lutomirski <luto@kernel.org>

Acked-by: Nicholas Piggin <npiggin@gmail.com>

> ---
>  include/linux/sched/mm.h | 7 -------
>  1 file changed, 7 deletions(-)
> 
> diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
> index 24d97d1b6252..10aace21d25e 100644
> --- a/include/linux/sched/mm.h
> +++ b/include/linux/sched/mm.h
> @@ -350,13 +350,6 @@ extern void membarrier_exec_mmap(struct mm_struct *mm);
>  extern void membarrier_update_current_mm(struct mm_struct *next_mm);
>  
>  #else
> -#ifdef CONFIG_ARCH_HAS_MEMBARRIER_CALLBACKS
> -static inline void membarrier_arch_switch_mm(struct mm_struct *prev,
> -					     struct mm_struct *next,
> -					     struct task_struct *tsk)
> -{
> -}
> -#endif
>  static inline void membarrier_exec_mmap(struct mm_struct *mm)
>  {
>  }
> -- 
> 2.31.1
> 
> 

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH 5/8] membarrier, kthread: Use _ONCE accessors for task->mm
  2021-06-16  3:21 ` [PATCH 5/8] membarrier, kthread: Use _ONCE accessors for task->mm Andy Lutomirski
@ 2021-06-16  4:28   ` Nicholas Piggin
  2021-06-16 18:08     ` Mathieu Desnoyers
  1 sibling, 0 replies; 165+ messages in thread
From: Nicholas Piggin @ 2021-06-16  4:28 UTC (permalink / raw)
  To: Andy Lutomirski, x86
  Cc: Andrew Morton, Dave Hansen, LKML, linux-mm, Mathieu Desnoyers,
	Peter Zijlstra

Excerpts from Andy Lutomirski's message of June 16, 2021 1:21 pm:
> membarrier reads cpu_rq(remote cpu)->curr->mm without locking.  Use
> READ_ONCE() and WRITE_ONCE() to remove the data races.
> 
> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> Cc: Nicholas Piggin <npiggin@gmail.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> ---
>  fs/exec.c                 | 2 +-
>  kernel/kthread.c          | 4 ++--
>  kernel/sched/membarrier.c | 6 +++---
>  3 files changed, 6 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/exec.c b/fs/exec.c
> index 18594f11c31f..2e63dea83411 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1007,7 +1007,7 @@ static int exec_mmap(struct mm_struct *mm)
>  	local_irq_disable();
>  	active_mm = tsk->active_mm;
>  	tsk->active_mm = mm;
> -	tsk->mm = mm;
> +	WRITE_ONCE(tsk->mm, mm);  /* membarrier reads this without locks */
>  	/*
>  	 * This prevents preemption while active_mm is being loaded and
>  	 * it and mm are being updated, which could cause problems for
> diff --git a/kernel/kthread.c b/kernel/kthread.c
> index 8275b415acec..4962794e02d5 100644
> --- a/kernel/kthread.c
> +++ b/kernel/kthread.c
> @@ -1322,7 +1322,7 @@ void kthread_use_mm(struct mm_struct *mm)
>  		mmgrab(mm);
>  		tsk->active_mm = mm;
>  	}
> -	tsk->mm = mm;
> +	WRITE_ONCE(tsk->mm, mm);  /* membarrier reads this without locks */
>  	membarrier_update_current_mm(mm);
>  	switch_mm_irqs_off(active_mm, mm, tsk);
>  	membarrier_finish_switch_mm(atomic_read(&mm->membarrier_state));
> @@ -1363,7 +1363,7 @@ void kthread_unuse_mm(struct mm_struct *mm)
>  	smp_mb__after_spinlock();
>  	sync_mm_rss(mm);
>  	local_irq_disable();
> -	tsk->mm = NULL;
> +	WRITE_ONCE(tsk->mm, NULL);  /* membarrier reads this without locks */
>  	membarrier_update_current_mm(NULL);
>  	/* active_mm is still 'mm' */
>  	enter_lazy_tlb(mm, tsk);
> diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c
> index 3173b063d358..c32c32a2441e 100644
> --- a/kernel/sched/membarrier.c
> +++ b/kernel/sched/membarrier.c
> @@ -410,7 +410,7 @@ static int membarrier_private_expedited(int flags, int cpu_id)
>  			goto out;
>  		rcu_read_lock();
>  		p = rcu_dereference(cpu_rq(cpu_id)->curr);
> -		if (!p || p->mm != mm) {
> +		if (!p || READ_ONCE(p->mm) != mm) {
>  			rcu_read_unlock();
>  			goto out;
>  		}
> @@ -423,7 +423,7 @@ static int membarrier_private_expedited(int flags, int cpu_id)
>  			struct task_struct *p;
>  
>  			p = rcu_dereference(cpu_rq(cpu)->curr);
> -			if (p && p->mm == mm)
> +			if (p && READ_ONCE(p->mm) == mm)

/* exec, kthread_use_mm write this without locks */ ?
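
(Illustrative only, not part of the patch: with that comment added, the
reader side would look something like

	p = rcu_dereference(cpu_rq(cpu)->curr);
	/* exec, kthread_use_mm write this without locks */
	if (p && READ_ONCE(p->mm) == mm)
		__cpumask_set_cpu(cpu, tmpmask);

so the pairing between the WRITE_ONCE() writers and these lockless
readers is spelled out at both ends.)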

Seems good to me.

Acked-by: Nicholas Piggin <npiggin@gmail.com>

>  				__cpumask_set_cpu(cpu, tmpmask);
>  		}
>  		rcu_read_unlock();
> @@ -521,7 +521,7 @@ static int sync_runqueues_membarrier_state(struct mm_struct *mm)
>  		struct task_struct *p;
>  
>  		p = rcu_dereference(rq->curr);
> -		if (p && p->mm == mm)
> +		if (p && READ_ONCE(p->mm) == mm)
>  			__cpumask_set_cpu(cpu, tmpmask);
>  	}
>  	rcu_read_unlock();
> -- 
> 2.31.1
> 
> 

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH 6/8] powerpc/membarrier: Remove special barrier on mm switch
  2021-06-16  3:21   ` Andy Lutomirski
@ 2021-06-16  4:36     ` Nicholas Piggin
  -1 siblings, 0 replies; 165+ messages in thread
From: Nicholas Piggin @ 2021-06-16  4:36 UTC (permalink / raw)
  To: Andy Lutomirski, x86
  Cc: Andrew Morton, Benjamin Herrenschmidt, Dave Hansen, LKML,
	linux-mm, linuxppc-dev, Mathieu Desnoyers, Michael Ellerman,
	Paul Mackerras, Peter Zijlstra

Excerpts from Andy Lutomirski's message of June 16, 2021 1:21 pm:
> powerpc did the following on some, but not all, paths through
> switch_mm_irqs_off():
> 
>        /*
>         * Only need the full barrier when switching between processes.
>         * Barrier when switching from kernel to userspace is not
>         * required here, given that it is implied by mmdrop(). Barrier
>         * when switching from userspace to kernel is not needed after
>         * store to rq->curr.
>         */
>        if (likely(!(atomic_read(&next->membarrier_state) &
>                     (MEMBARRIER_STATE_PRIVATE_EXPEDITED |
>                      MEMBARRIER_STATE_GLOBAL_EXPEDITED)) || !prev))
>                return;
> 
> This is puzzling: if !prev, then one might expect that we are switching
> from kernel to user, not user to kernel, which is inconsistent with the
> comment.  But this is all nonsense, because the one and only caller would
> never have prev == NULL and would, in fact, OOPS if prev == NULL.

Yeah that's strange, code definitely doesn't match comment. Good catch.

> 
> In any event, this code is unnecessary, since the new generic
> membarrier_finish_switch_mm() provides the same barrier without arch help.

If that's merged then I think this could be too. I'll do a bit more 
digging into this too.

Thanks,
Nick

> 
> Cc: Michael Ellerman <mpe@ellerman.id.au>
> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> Cc: Paul Mackerras <paulus@samba.org>
> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: Nicholas Piggin <npiggin@gmail.com>
> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> ---
>  arch/powerpc/include/asm/membarrier.h | 27 ---------------------------
>  arch/powerpc/mm/mmu_context.c         |  2 --
>  2 files changed, 29 deletions(-)
>  delete mode 100644 arch/powerpc/include/asm/membarrier.h
> 
> diff --git a/arch/powerpc/include/asm/membarrier.h b/arch/powerpc/include/asm/membarrier.h
> deleted file mode 100644
> index 6e20bb5c74ea..000000000000
> --- a/arch/powerpc/include/asm/membarrier.h
> +++ /dev/null
> @@ -1,27 +0,0 @@
> -#ifndef _ASM_POWERPC_MEMBARRIER_H
> -#define _ASM_POWERPC_MEMBARRIER_H
> -
> -static inline void membarrier_arch_switch_mm(struct mm_struct *prev,
> -					     struct mm_struct *next,
> -					     struct task_struct *tsk)
> -{
> -	/*
> -	 * Only need the full barrier when switching between processes.
> -	 * Barrier when switching from kernel to userspace is not
> -	 * required here, given that it is implied by mmdrop(). Barrier
> -	 * when switching from userspace to kernel is not needed after
> -	 * store to rq->curr.
> -	 */
> -	if (likely(!(atomic_read(&next->membarrier_state) &
> -		     (MEMBARRIER_STATE_PRIVATE_EXPEDITED |
> -		      MEMBARRIER_STATE_GLOBAL_EXPEDITED)) || !prev))
> -		return;
> -
> -	/*
> -	 * The membarrier system call requires a full memory barrier
> -	 * after storing to rq->curr, before going back to user-space.
> -	 */
> -	smp_mb();
> -}
> -
> -#endif /* _ASM_POWERPC_MEMBARRIER_H */
> diff --git a/arch/powerpc/mm/mmu_context.c b/arch/powerpc/mm/mmu_context.c
> index a857af401738..8daa95b3162b 100644
> --- a/arch/powerpc/mm/mmu_context.c
> +++ b/arch/powerpc/mm/mmu_context.c
> @@ -85,8 +85,6 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
>  
>  	if (new_on_cpu)
>  		radix_kvm_prefetch_workaround(next);
> -	else
> -		membarrier_arch_switch_mm(prev, next, tsk);
>  
>  	/*
>  	 * The actual HW switching method differs between the various
> -- 
> 2.31.1
> 
> 

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH 8/8] membarrier: Rewrite sync_core_before_usermode() and improve documentation
  2021-06-16  3:21   ` Andy Lutomirski
@ 2021-06-16  4:45     ` Nicholas Piggin
  -1 siblings, 0 replies; 165+ messages in thread
From: Nicholas Piggin @ 2021-06-16  4:45 UTC (permalink / raw)
  To: Andy Lutomirski, x86
  Cc: Andrew Morton, Benjamin Herrenschmidt, Catalin Marinas,
	Dave Hansen, linux-arm-kernel, LKML, linux-mm, linuxppc-dev,
	Mathieu Desnoyers, Michael Ellerman, Paul Mackerras,
	Peter Zijlstra, stable, Will Deacon

Excerpts from Andy Lutomirski's message of June 16, 2021 1:21 pm:
> The old sync_core_before_usermode() comments suggested that a non-icache-syncing
> return-to-usermode instruction is x86-specific and that all other
> architectures automatically notice cross-modified code on return to
> userspace.
> 
> This is misleading.  The incantation needed to modify code from one
> CPU and execute it on another CPU is highly architecture dependent.
> On x86, according to the SDM, one must modify the code, issue SFENCE
> if the modification was WC or nontemporal, and then issue a "serializing
> instruction" on the CPU that will execute the code.  membarrier() can do
> the latter.
> 
> On arm64 and powerpc, one must flush the icache and then flush the pipeline
> on the target CPU, although the CPU manuals don't necessarily use this
> language.
> 
> So let's drop any pretense that we can have a generic way to define or
> implement membarrier's SYNC_CORE operation and instead require all
> architectures to define the helper and supply their own documentation as to
> how to use it.  This means x86, arm64, and powerpc for now.  Let's also
> rename the function from sync_core_before_usermode() to
> membarrier_sync_core_before_usermode() because the precise flushing details
> may very well be specific to membarrier, and even the concept of
> "sync_core" in the kernel is mostly an x86-ism.
> 
> (It may well be the case that, on real x86 processors, synchronizing the
>  icache (which requires no action at all) and "flushing the pipeline" is
>  sufficient, but trying to use this language would be confusing at best.
>  LFENCE does something awfully like "flushing the pipeline", but the SDM
>  does not permit LFENCE as an alternative to a "serializing instruction"
>  for this purpose.)
> 
> Cc: Michael Ellerman <mpe@ellerman.id.au>
> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> Cc: Paul Mackerras <paulus@samba.org>
> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: Nicholas Piggin <npiggin@gmail.com>
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Will Deacon <will@kernel.org>
> Cc: linux-arm-kernel@lists.infradead.org
> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> Cc: Nicholas Piggin <npiggin@gmail.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: x86@kernel.org
> Cc: stable@vger.kernel.org
> Fixes: 70216e18e519 ("membarrier: Provide core serializing command, *_SYNC_CORE")
> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> ---
>  .../membarrier-sync-core/arch-support.txt     | 68 ++++++-------------
>  arch/arm64/include/asm/sync_core.h            | 19 ++++++
>  arch/powerpc/include/asm/sync_core.h          | 14 ++++
>  arch/x86/Kconfig                              |  1 -
>  arch/x86/include/asm/sync_core.h              |  7 +-
>  arch/x86/kernel/alternative.c                 |  2 +-
>  arch/x86/kernel/cpu/mce/core.c                |  2 +-
>  arch/x86/mm/tlb.c                             |  3 +-
>  drivers/misc/sgi-gru/grufault.c               |  2 +-
>  drivers/misc/sgi-gru/gruhandles.c             |  2 +-
>  drivers/misc/sgi-gru/grukservices.c           |  2 +-
>  include/linux/sched/mm.h                      |  1 -
>  include/linux/sync_core.h                     | 21 ------
>  init/Kconfig                                  |  3 -
>  kernel/sched/membarrier.c                     | 15 ++--
>  15 files changed, 75 insertions(+), 87 deletions(-)
>  create mode 100644 arch/arm64/include/asm/sync_core.h
>  create mode 100644 arch/powerpc/include/asm/sync_core.h
>  delete mode 100644 include/linux/sync_core.h
> 
> diff --git a/Documentation/features/sched/membarrier-sync-core/arch-support.txt b/Documentation/features/sched/membarrier-sync-core/arch-support.txt
> index 883d33b265d6..41c9ebcb275f 100644
> --- a/Documentation/features/sched/membarrier-sync-core/arch-support.txt
> +++ b/Documentation/features/sched/membarrier-sync-core/arch-support.txt
> @@ -5,51 +5,25 @@
>  #
>  # Architecture requirements
>  #
> -# * arm/arm64/powerpc
>  #
> -# Rely on implicit context synchronization as a result of exception return
> -# when returning from IPI handler, and when returning to user-space.
> -#
> -# * x86
> -#
> -# x86-32 uses IRET as return from interrupt, which takes care of the IPI.
> -# However, it uses both IRET and SYSEXIT to go back to user-space. The IRET
> -# instruction is core serializing, but not SYSEXIT.
> -#
> -# x86-64 uses IRET as return from interrupt, which takes care of the IPI.
> -# However, it can return to user-space through either SYSRETL (compat code),
> -# SYSRETQ, or IRET.
> -#
> -# Given that neither SYSRET{L,Q}, nor SYSEXIT, are core serializing, we rely
> -# instead on write_cr3() performed by switch_mm() to provide core serialization
> -# after changing the current mm, and deal with the special case of kthread ->
> -# uthread (temporarily keeping current mm into active_mm) by issuing a
> -# sync_core_before_usermode() in that specific case.
> -#
> -    -----------------------
> -    |         arch |status|
> -    -----------------------
> -    |       alpha: | TODO |
> -    |         arc: | TODO |
> -    |         arm: |  ok  |
> -    |       arm64: |  ok  |
> -    |        csky: | TODO |
> -    |       h8300: | TODO |
> -    |     hexagon: | TODO |
> -    |        ia64: | TODO |
> -    |        m68k: | TODO |
> -    |  microblaze: | TODO |
> -    |        mips: | TODO |
> -    |       nds32: | TODO |
> -    |       nios2: | TODO |
> -    |    openrisc: | TODO |
> -    |      parisc: | TODO |
> -    |     powerpc: |  ok  |
> -    |       riscv: | TODO |
> -    |        s390: | TODO |
> -    |          sh: | TODO |
> -    |       sparc: | TODO |
> -    |          um: | TODO |
> -    |         x86: |  ok  |
> -    |      xtensa: | TODO |
> -    -----------------------
> +# An architecture that wants to support
> +# MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE needs to define precisely what it
> +# is supposed to do and implement membarrier_sync_core_before_usermode() to
> +# make it do that.  Then it can select ARCH_HAS_MEMBARRIER_SYNC_CORE via
> +# Kconfig.  Unfortunately, MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE is not a
> +# fantastic API and may not make sense on all architectures.  Once an
> +# architecture meets these requirements:
> +#
> +# On x86, a program can safely modify code, issue
> +# MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE, and then execute that code, via
> +# the modified address or an alias, from any thread in the calling process.
> +#
> +# On arm64, a program can modify code, flush the icache as needed, and issue
> +# MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE to force a "context synchronizing
> +# event", aka pipeline flush on all CPUs that might run the calling process.
> +# Then the program can execute the modified code as long as it is executed
> +# from an address consistent with the icache flush and the CPU's cache type.
> +#
> +# On powerpc, a program can use MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE
> +# similarly to arm64.  It would be nice if the powerpc maintainers could
> +# add a clearer explanation.
> diff --git a/arch/arm64/include/asm/sync_core.h b/arch/arm64/include/asm/sync_core.h
> new file mode 100644
> index 000000000000..74996bf533bb
> --- /dev/null
> +++ b/arch/arm64/include/asm/sync_core.h
> @@ -0,0 +1,19 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _ASM_ARM64_SYNC_CORE_H
> +#define _ASM_ARM64_SYNC_CORE_H
> +
> +#include <asm/barrier.h>
> +
> +/*
> + * On arm64, anyone trying to use membarrier() to handle JIT code is
> + * required to first flush the icache and then do SYNC_CORE.  All that's
> + * needed after the icache flush is to execute a "context synchronization
> + * event".  Right now, ERET does this, and we are guaranteed to ERET before
> + * any user code runs.  If Linux ever programs the CPU to make ERET stop
> + * being a context synchronizing event, then this will need to be adjusted.
> + */
> +static inline void membarrier_sync_core_before_usermode(void)
> +{
> +}
> +
> +#endif /* _ASM_ARM64_SYNC_CORE_H */
> diff --git a/arch/powerpc/include/asm/sync_core.h b/arch/powerpc/include/asm/sync_core.h
> new file mode 100644
> index 000000000000..589fdb34beab
> --- /dev/null
> +++ b/arch/powerpc/include/asm/sync_core.h
> @@ -0,0 +1,14 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _ASM_POWERPC_SYNC_CORE_H
> +#define _ASM_POWERPC_SYNC_CORE_H
> +
> +#include <asm/barrier.h>
> +
> +/*
> + * XXX: can a powerpc person put an appropriate comment here?
> + */
> +static inline void membarrier_sync_core_before_usermode(void)
> +{
> +}
> +
> +#endif /* _ASM_POWERPC_SYNC_CORE_H */

powerpc's can just go in asm/membarrier.h

/*
 * The RFI family of instructions are context synchronising, and
 * that is how we return to userspace, so nothing is required here.
 */
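
Folded into the new stub from the patch (whichever header it ends up
in), that would read, just as a sketch:

/*
 * The RFI family of instructions are context synchronising, and
 * that is how we return to userspace, so nothing is required here.
 */
static inline void membarrier_sync_core_before_usermode(void)
{
}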

> diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c
> index c32c32a2441e..f72a6ab3fac2 100644
> --- a/kernel/sched/membarrier.c
> +++ b/kernel/sched/membarrier.c
> @@ -5,6 +5,9 @@
>   * membarrier system call
>   */
>  #include "sched.h"
> +#ifdef CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE
> +#include <asm/sync_core.h>
> +#endif

Can you

#else
static inline void membarrier_sync_core_before_usermode(void)
{
 /* this gets constant folded out */
}
#endif

And avoid adding the ifdefs in the following code?
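
That way membarrier_private_expedited() could keep the existing
IS_ENABLED() test rather than growing #ifdefs -- roughly (a sketch, not
a replacement hunk):

	if (flags == MEMBARRIER_FLAG_SYNC_CORE) {
		/*
		 * The stub keeps ipi_sync_core() buildable everywhere;
		 * the constant condition lets the compiler discard the
		 * SYNC_CORE path on architectures that don't select it.
		 */
		if (!IS_ENABLED(CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE))
			return -EINVAL;
		if (!(atomic_read(&mm->membarrier_state) &
		      MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE_READY))
			return -EPERM;
		ipi_func = ipi_sync_core;
	} else if (flags == MEMBARRIER_FLAG_RSEQ) {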

Otherwise I think this is good.

Acked-by: Nicholas Piggin <npiggin@gmail.com>

Thanks,
Nick

>  
>  /*
>   * The basic principle behind the regular memory barrier mode of membarrier()
> @@ -221,6 +224,7 @@ static void ipi_mb(void *info)
>  	smp_mb();	/* IPIs should be serializing but paranoid. */
>  }
>  
> +#ifdef CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE
>  static void ipi_sync_core(void *info)
>  {
>  	/*
> @@ -230,13 +234,14 @@ static void ipi_sync_core(void *info)
>  	 * the big comment at the top of this file.
>  	 *
>  	 * A sync_core() would provide this guarantee, but
> -	 * sync_core_before_usermode() might end up being deferred until
> -	 * after membarrier()'s smp_mb().
> +	 * membarrier_sync_core_before_usermode() might end up being deferred
> +	 * until after membarrier()'s smp_mb().
>  	 */
>  	smp_mb();	/* IPIs should be serializing but paranoid. */
>  
> -	sync_core_before_usermode();
> +	membarrier_sync_core_before_usermode();
>  }
> +#endif
>  
>  static void ipi_rseq(void *info)
>  {
> @@ -368,12 +373,14 @@ static int membarrier_private_expedited(int flags, int cpu_id)
>  	smp_call_func_t ipi_func = ipi_mb;
>  
>  	if (flags == MEMBARRIER_FLAG_SYNC_CORE) {
> -		if (!IS_ENABLED(CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE))
> +#ifndef CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE
>  			return -EINVAL;
> +#else
>  		if (!(atomic_read(&mm->membarrier_state) &
>  		      MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE_READY))
>  			return -EPERM;
>  		ipi_func = ipi_sync_core;
> +#endif
>  	} else if (flags == MEMBARRIER_FLAG_RSEQ) {
>  		if (!IS_ENABLED(CONFIG_RSEQ))
>  			return -EINVAL;
> -- 
> 2.31.1
> 
> 

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH 1/8] membarrier: Document why membarrier() works
  2021-06-16  4:00   ` Nicholas Piggin
@ 2021-06-16  7:30     ` Peter Zijlstra
  2021-06-17 23:45       ` Andy Lutomirski
  0 siblings, 1 reply; 165+ messages in thread
From: Peter Zijlstra @ 2021-06-16  7:30 UTC (permalink / raw)
  To: Nicholas Piggin
  Cc: Andy Lutomirski, x86, Andrew Morton, Dave Hansen, LKML, linux-mm,
	Mathieu Desnoyers

On Wed, Jun 16, 2021 at 02:00:37PM +1000, Nicholas Piggin wrote:
> Excerpts from Andy Lutomirski's message of June 16, 2021 1:21 pm:
> > We had a nice comment at the top of membarrier.c explaining why membarrier
> > worked in a handful of scenarios, but that consisted more of a list of
> > things not to forget than an actual description of the algorithm and why it
> > should be expected to work.
> > 
> > Add a comment explaining my understanding of the algorithm.  This exposes a
> > couple of implementation issues that I will hopefully fix up in subsequent
> > patches.
> > 
> > Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> > Cc: Nicholas Piggin <npiggin@gmail.com>
> > Cc: Peter Zijlstra <peterz@infradead.org>
> > Signed-off-by: Andy Lutomirski <luto@kernel.org>
> > ---
> >  kernel/sched/membarrier.c | 55 +++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 55 insertions(+)
> > 
> > diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c
> > index b5add64d9698..3173b063d358 100644
> > --- a/kernel/sched/membarrier.c
> > +++ b/kernel/sched/membarrier.c
> > @@ -7,6 +7,61 @@
> >  #include "sched.h"
> >  
> 
> Precisely describing the orderings is great, not a fan of the style of the
> comment though.

I'm with Nick on that; I can't read it :/ It only makes things more
confusing. If you want precision, English (or any natural language) is
your enemy.

To describe ordering use the diagrams and/or litmus tests.
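
Something along the lines of the store-buffering test in
tools/memory-model/litmus-tests (SB+fencembonceonces.litmus), with one
of the smp_mb()s notionally supplied by membarrier() on the remote
side, already says more than a page of prose:

C SB+fencembonceonces

{}

P0(int *x, int *y)
{
	int r0;

	WRITE_ONCE(*x, 1);
	smp_mb();
	r0 = READ_ONCE(*y);
}

P1(int *x, int *y)
{
	int r1;

	WRITE_ONCE(*y, 1);
	smp_mb();
	r1 = READ_ONCE(*x);
}

exists (0:r0=0 /\ 1:r1=0)

(Only the generic shape, not a membarrier-specific test; herd7 reports
the "exists" state as never reached here.)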

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH 4/8] membarrier: Make the post-switch-mm barrier explicit
  2021-06-16  4:19   ` Nicholas Piggin
@ 2021-06-16  7:35     ` Peter Zijlstra
  2021-06-16 18:41       ` Andy Lutomirski
  0 siblings, 1 reply; 165+ messages in thread
From: Peter Zijlstra @ 2021-06-16  7:35 UTC (permalink / raw)
  To: Nicholas Piggin
  Cc: Andy Lutomirski, x86, Andrew Morton, Dave Hansen, LKML, linux-mm,
	Mathieu Desnoyers

On Wed, Jun 16, 2021 at 02:19:49PM +1000, Nicholas Piggin wrote:
> Excerpts from Andy Lutomirski's message of June 16, 2021 1:21 pm:
> > membarrier() needs a barrier after any CPU changes mm.  There is currently
> > a comment explaining why this barrier probably exists in all cases.  This
> > is very fragile -- any change to the relevant parts of the scheduler
> > might get rid of these barriers, and it's not really clear to me that
> > the barrier actually exists in all necessary cases.
> 
> The comments and barriers in the mmdrop() hunks? I don't see what is 
> fragile or maybe-buggy about this. The barrier definitely exists.
> 
> And any change can change anything, that doesn't make it fragile. My
> lazy tlb refcounting change avoids the mmdrop in some cases, but it
> replaces it with smp_mb for example.

I'm with Nick again on this. You're adding extra barriers for no
discernible reason; that's not generally encouraged, seeing how extra
barriers are extra slow.

Both mmdrop() itself and the callsite have comments saying how
membarrier relies on the implied barrier; what's fragile about that?
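
For reference, the callee in include/linux/sched/mm.h is roughly:

static inline void mmdrop(struct mm_struct *mm)
{
	/*
	 * The implicit full barrier implied by atomic_dec_and_test() is
	 * required by the membarrier system call before returning to
	 * user-space, after storing to rq->curr.
	 */
	if (unlikely(atomic_dec_and_test(&mm->mm_count)))
		__mmdrop(mm);
}

i.e. the ordering membarrier needs is documented right next to the
atomic that provides it.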

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH 7/8] membarrier: Remove arm (32) support for SYNC_CORE
  2021-06-16  3:21   ` Andy Lutomirski
@ 2021-06-16  9:28     ` Russell King (Oracle)
  -1 siblings, 0 replies; 165+ messages in thread
From: Russell King (Oracle) @ 2021-06-16  9:28 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: x86, Dave Hansen, LKML, linux-mm, Andrew Morton,
	Mathieu Desnoyers, Nicholas Piggin, Peter Zijlstra,
	linux-arm-kernel

On Tue, Jun 15, 2021 at 08:21:12PM -0700, Andy Lutomirski wrote:
> On arm32, the only way to safely flush icache from usermode is to call
> cacheflush(2).  This also handles any required pipeline flushes, so
> membarrier's SYNC_CORE feature is useless on arm.  Remove it.

Yay! About time too.

Acked-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH 7/8] membarrier: Remove arm (32) support for SYNC_CORE
  2021-06-16  3:21   ` Andy Lutomirski
@ 2021-06-16 10:16     ` Peter Zijlstra
  -1 siblings, 0 replies; 165+ messages in thread
From: Peter Zijlstra @ 2021-06-16 10:16 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: x86, Dave Hansen, LKML, linux-mm, Andrew Morton,
	Mathieu Desnoyers, Nicholas Piggin, Russell King,
	linux-arm-kernel, Will Deacon

On Tue, Jun 15, 2021 at 08:21:12PM -0700, Andy Lutomirski wrote:
> On arm32, the only way to safely flush icache from usermode is to call
> cacheflush(2).  This also handles any required pipeline flushes, so
> membarrier's SYNC_CORE feature is useless on arm.  Remove it.

So SYNC_CORE is there to help an architecture that needs to do something
per CPU. If I$ invalidation is broadcast and I$ invalidation also
triggers the flush of any uarch caches derived from it (if there are
any).

Now arm_syscall() NR(cacheflush) seems to do flush_icache_user_range(),
which, if I read things right, ends up in arch/arm/mm/*.S, but that
doesn't consider cache_ops_need_broadcast().

Will suggests that perhaps ARM 11MPCore might need this due to their I$
flush maybe not being broadcast
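
(For anyone following along: the usermode sequence under discussion is
just the ARM-private cacheflush syscall, typically reached via the
compiler builtin -- illustrative, not from the patch:

	/* after writing instructions into [buf, buf + len) */
	__builtin___clear_cache((char *)buf, (char *)buf + len);

which on arm32 ends up in cacheflush(2), doing the I$ maintenance and,
per the commit message, the pipeline flush as well.)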

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH 7/8] membarrier: Remove arm (32) support for SYNC_CORE
  2021-06-16 10:16     ` Peter Zijlstra
@ 2021-06-16 10:20       ` Peter Zijlstra
  -1 siblings, 0 replies; 165+ messages in thread
From: Peter Zijlstra @ 2021-06-16 10:20 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: x86, Dave Hansen, LKML, linux-mm, Andrew Morton,
	Mathieu Desnoyers, Nicholas Piggin, Russell King,
	linux-arm-kernel, Will Deacon

On Wed, Jun 16, 2021 at 12:16:27PM +0200, Peter Zijlstra wrote:
> On Tue, Jun 15, 2021 at 08:21:12PM -0700, Andy Lutomirski wrote:
> > On arm32, the only way to safely flush icache from usermode is to call
> > cacheflush(2).  This also handles any required pipeline flushes, so
> > membarrier's SYNC_CORE feature is useless on arm.  Remove it.
> 
> So SYNC_CORE is there to help an architecture that needs to do something
> per CPU. If I$ invalidation is broadcast and I$ invalidation also
> triggers the flush of any uarch caches derived from it (if there are
> any).

Incomplete sentence there: + then we don't need SYNC_CORE.

> Now arm_syscall() NR(cacheflush) seems to do flush_icache_user_range(),
> which, if I read things right, ends up in arch/arm/mm/*.S, but that
> doesn't consider cache_ops_need_broadcast().
> 
> Will suggests that perhaps ARM 11MPCore might need this due to their I$
> flush maybe not being broadcast


^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH 8/8] membarrier: Rewrite sync_core_before_usermode() and improve documentation
  2021-06-16  3:21   ` Andy Lutomirski
@ 2021-06-16 10:20     ` Will Deacon
  -1 siblings, 0 replies; 165+ messages in thread
From: Will Deacon @ 2021-06-16 10:20 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: x86, Dave Hansen, LKML, linux-mm, Andrew Morton,
	Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras,
	linuxppc-dev, Nicholas Piggin, Catalin Marinas, linux-arm-kernel,
	Mathieu Desnoyers, Peter Zijlstra, stable

On Tue, Jun 15, 2021 at 08:21:13PM -0700, Andy Lutomirski wrote:
> The old sync_core_before_usermode() comments suggested that a non-icache-syncing
> return-to-usermode instruction is x86-specific and that all other
> architectures automatically notice cross-modified code on return to
> userspace.
> 
> This is misleading.  The incantation needed to modify code from one
> CPU and execute it on another CPU is highly architecture dependent.
> On x86, according to the SDM, one must modify the code, issue SFENCE
> if the modification was WC or nontemporal, and then issue a "serializing
> instruction" on the CPU that will execute the code.  membarrier() can do
> the latter.
> 
> On arm64 and powerpc, one must flush the icache and then flush the pipeline
> on the target CPU, although the CPU manuals don't necessarily use this
> language.
> 
> So let's drop any pretense that we can have a generic way to define or
> implement membarrier's SYNC_CORE operation and instead require all
> architectures to define the helper and supply their own documentation as to
> how to use it.  This means x86, arm64, and powerpc for now.  Let's also
> rename the function from sync_core_before_usermode() to
> membarrier_sync_core_before_usermode() because the precise flushing details
> may very well be specific to membarrier, and even the concept of
> "sync_core" in the kernel is mostly an x86-ism.
> 
> (It may well be the case that, on real x86 processors, synchronizing the
>  icache (which requires no action at all) and "flushing the pipeline" is
>  sufficient, but trying to use this language would be confusing at best.
>  LFENCE does something awfully like "flushing the pipeline", but the SDM
>  does not permit LFENCE as an alternative to a "serializing instruction"
>  for this purpose.)
> 
> Cc: Michael Ellerman <mpe@ellerman.id.au>
> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> Cc: Paul Mackerras <paulus@samba.org>
> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: Nicholas Piggin <npiggin@gmail.com>
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Will Deacon <will@kernel.org>
> Cc: linux-arm-kernel@lists.infradead.org
> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> Cc: Nicholas Piggin <npiggin@gmail.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: x86@kernel.org
> Cc: stable@vger.kernel.org
> Fixes: 70216e18e519 ("membarrier: Provide core serializing command, *_SYNC_CORE")
> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> ---
>  .../membarrier-sync-core/arch-support.txt     | 68 ++++++-------------
>  arch/arm64/include/asm/sync_core.h            | 19 ++++++
>  arch/powerpc/include/asm/sync_core.h          | 14 ++++
>  arch/x86/Kconfig                              |  1 -
>  arch/x86/include/asm/sync_core.h              |  7 +-
>  arch/x86/kernel/alternative.c                 |  2 +-
>  arch/x86/kernel/cpu/mce/core.c                |  2 +-
>  arch/x86/mm/tlb.c                             |  3 +-
>  drivers/misc/sgi-gru/grufault.c               |  2 +-
>  drivers/misc/sgi-gru/gruhandles.c             |  2 +-
>  drivers/misc/sgi-gru/grukservices.c           |  2 +-
>  include/linux/sched/mm.h                      |  1 -
>  include/linux/sync_core.h                     | 21 ------
>  init/Kconfig                                  |  3 -
>  kernel/sched/membarrier.c                     | 15 ++--
>  15 files changed, 75 insertions(+), 87 deletions(-)
>  create mode 100644 arch/arm64/include/asm/sync_core.h
>  create mode 100644 arch/powerpc/include/asm/sync_core.h
>  delete mode 100644 include/linux/sync_core.h

For the arm64 bits (docs and asm/sync_core.h):

Acked-by: Will Deacon <will@kernel.org>

Will

^ permalink raw reply	[flat|nested] 165+ messages in thread
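
A minimal userspace sketch of the x86 sequence the quoted commit message
describes: patch the code with plain stores, then use membarrier()'s
SYNC_CORE command so every other thread of the process executes a
serializing instruction before it can run the new bytes. The commands are
the ones from <linux/membarrier.h>; write_insns() and code_ready are
made-up names used purely for illustration.

  #include <linux/membarrier.h>
  #include <stdatomic.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  static atomic_int code_ready;         /* illustrative publication flag */

  static int membarrier(int cmd, unsigned int flags)
  {
          return syscall(__NR_membarrier, cmd, flags, 0);
  }

  /* Once at startup, before the first SYNC_CORE request. */
  static void jit_init(void)
  {
          membarrier(MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_SYNC_CORE, 0);
  }

  /* Writer thread: patch, core-serialize everyone else, then publish.
   * Executing threads acquire-load code_ready and jump to the new code. */
  static void publish_code(void (*write_insns)(void))
  {
          write_insns();    /* plain stores; SFENCE first if they were WC/NT */
          membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE, 0);
          atomic_store_explicit(&code_ready, 1, memory_order_release);
  }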

* Re: [PATCH 7/8] membarrier: Remove arm (32) support for SYNC_CORE
  2021-06-16 10:20       ` Peter Zijlstra
@ 2021-06-16 10:34         ` Russell King (Oracle)
  -1 siblings, 0 replies; 165+ messages in thread
From: Russell King (Oracle) @ 2021-06-16 10:34 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andy Lutomirski, x86, Dave Hansen, LKML, linux-mm, Andrew Morton,
	Mathieu Desnoyers, Nicholas Piggin, linux-arm-kernel,
	Will Deacon

On Wed, Jun 16, 2021 at 12:20:06PM +0200, Peter Zijlstra wrote:
> On Wed, Jun 16, 2021 at 12:16:27PM +0200, Peter Zijlstra wrote:
> > On Tue, Jun 15, 2021 at 08:21:12PM -0700, Andy Lutomirski wrote:
> > > On arm32, the only way to safely flush icache from usermode is to call
> > > cacheflush(2).  This also handles any required pipeline flushes, so
> > > membarrier's SYNC_CORE feature is useless on arm.  Remove it.
> > 
> > So SYNC_CORE is there to help an architecture that needs to do something
> > per CPU. If I$ invalidation is broadcast and I$ invalidation also
> > triggers the flush of any uarch caches derived from it (if there are
> > any).
> 
> Incomplete sentence there: + then we don't need SYNC_CORE.
> 
> > Now arm_syscall() NR(cacheflush) seems to do flush_icache_user_range(),
> > which, if I read things right, end up in arch/arm/mm/*.S, but that
> > doesn't consider cache_ops_need_broadcast().
> > 
> > Will suggests that perhaps ARM 11MPCore might need this due to their I$
> > flush maybe not being broadcast

If it leaves other cores with incoherent I cache, then that's already
a problem for SMP cores, since there could be no guarantee that the
modifications made by one core will be visible to some other core that
ends up running that code - and there is little option for userspace to
work around that except by pinning the thread making the modifications
and subsequently executing the code to a core.

The same is also true of flush_icache_range() - which is used when
loading a kernel module. In the case Will is referring to, these alias
to the same code.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!

^ permalink raw reply	[flat|nested] 165+ messages in thread
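
As a concrete illustration of the pinning workaround Russell describes, a
JIT without working broadcast maintenance would have to keep the writes,
the cacheflush(2) call and the execution of the new code on one CPU. A
sketch only; patch_and_run() and write_insns() are invented names.

  #define _GNU_SOURCE
  #include <sched.h>

  /* Pin the calling thread to a single CPU (0 here for simplicity). */
  static int pin_to_cpu(int cpu)
  {
          cpu_set_t set;

          CPU_ZERO(&set);
          CPU_SET(cpu, &set);
          return sched_setaffinity(0, sizeof(set), &set);  /* 0 == this thread */
  }

  static void patch_and_run(void (*write_insns)(void), void (*entry)(void))
  {
          pin_to_cpu(0);
          write_insns();                  /* modify the code */
          /* cacheflush(2) over the patched range would go here */
          entry();                        /* run it on the same core */
  }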

* Re: [PATCH 7/8] membarrier: Remove arm (32) support for SYNC_CORE
  2021-06-16 10:34         ` Russell King (Oracle)
@ 2021-06-16 11:10           ` Peter Zijlstra
  -1 siblings, 0 replies; 165+ messages in thread
From: Peter Zijlstra @ 2021-06-16 11:10 UTC (permalink / raw)
  To: Russell King (Oracle)
  Cc: Andy Lutomirski, x86, Dave Hansen, LKML, linux-mm, Andrew Morton,
	Mathieu Desnoyers, Nicholas Piggin, linux-arm-kernel,
	Will Deacon

On Wed, Jun 16, 2021 at 11:34:46AM +0100, Russell King (Oracle) wrote:
> On Wed, Jun 16, 2021 at 12:20:06PM +0200, Peter Zijlstra wrote:
> > On Wed, Jun 16, 2021 at 12:16:27PM +0200, Peter Zijlstra wrote:
> > > On Tue, Jun 15, 2021 at 08:21:12PM -0700, Andy Lutomirski wrote:
> > > > On arm32, the only way to safely flush icache from usermode is to call
> > > > cacheflush(2).  This also handles any required pipeline flushes, so
> > > > membarrier's SYNC_CORE feature is useless on arm.  Remove it.
> > > 
> > > So SYNC_CORE is there to help an architecture that needs to do something
> > > per CPU. If I$ invalidation is broadcast and I$ invalidation also
> > > triggers the flush of any uarch caches derived from it (if there are
> > > any).
> > 
> > Incomplete sentence there: + then we don't need SYNC_CORE.
> > 
> > > Now arm_syscall() NR(cacheflush) seems to do flush_icache_user_range(),
> > > which, if I read things right, end up in arch/arm/mm/*.S, but that
> > > doesn't consider cache_ops_need_broadcast().
> > > 
> > > Will suggests that perhaps ARM 11MPCore might need this due to their I$
> > > flush maybe not being broadcast
> 
> If it leaves other cores with incoherent I cache, then that's already
> a problem for SMP cores, since there could be no guarantee that the
> modifications made by one core will be visible to some other core that
> ends up running that code - and there is little option for userspace to
> work around that except by pinning the thread making the modifications
> and subsequently executing the code to a core.

That's where SYNC_CORE can help. Or you make sys_cacheflush() do a
system wide IPI.

> The same is also true of flush_icache_range() - which is used when
> loading a kernel module. In the case Will is referring to, these alias
> to the same code.

Yes, cache_ops_need_broadcast() seems to be missing in more places.

^ permalink raw reply	[flat|nested] 165+ messages in thread
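
A sketch of the "system wide IPI" alternative, with invented helper names
(this is not the actual arm implementation): do the local maintenance as
today, then, when cache_ops_need_broadcast() says the hardware will not
propagate it, IPI every CPU to invalidate its own I-cache.

  static void ipi_flush_icache(void *unused)
  {
          __flush_icache_all();           /* invalidate this CPU's I-cache */
  }

  static void broadcast_cacheflush(unsigned long start, unsigned long end)
  {
          do_local_cacheflush(start, end);  /* stand-in for the existing per-CPU flush */
          if (cache_ops_need_broadcast())
                  on_each_cpu(ipi_flush_icache, NULL, 1);
  }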

* Re: [PATCH 7/8] membarrier: Remove arm (32) support for SYNC_CORE
  2021-06-16 11:10           ` Peter Zijlstra
@ 2021-06-16 13:22             ` Russell King (Oracle)
  -1 siblings, 0 replies; 165+ messages in thread
From: Russell King (Oracle) @ 2021-06-16 13:22 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andy Lutomirski, x86, Dave Hansen, LKML, linux-mm, Andrew Morton,
	Mathieu Desnoyers, Nicholas Piggin, linux-arm-kernel,
	Will Deacon

On Wed, Jun 16, 2021 at 01:10:58PM +0200, Peter Zijlstra wrote:
> On Wed, Jun 16, 2021 at 11:34:46AM +0100, Russell King (Oracle) wrote:
> > On Wed, Jun 16, 2021 at 12:20:06PM +0200, Peter Zijlstra wrote:
> > > On Wed, Jun 16, 2021 at 12:16:27PM +0200, Peter Zijlstra wrote:
> > > > On Tue, Jun 15, 2021 at 08:21:12PM -0700, Andy Lutomirski wrote:
> > > > > On arm32, the only way to safely flush icache from usermode is to call
> > > > > cacheflush(2).  This also handles any required pipeline flushes, so
> > > > > membarrier's SYNC_CORE feature is useless on arm.  Remove it.
> > > > 
> > > > So SYNC_CORE is there to help an architecture that needs to do something
> > > > per CPU. If I$ invalidation is broadcast and I$ invalidation also
> > > > triggers the flush of any uarch caches derived from it (if there are
> > > > any).
> > > 
> > > Incomplete sentence there: + then we don't need SYNC_CORE.
> > > 
> > > > Now arm_syscall() NR(cacheflush) seems to do flush_icache_user_range(),
> > > > which, if I read things right, end up in arch/arm/mm/*.S, but that
> > > > doesn't consider cache_ops_need_broadcast().
> > > > 
> > > > Will suggests that perhaps ARM 11MPCore might need this due to their I$
> > > > flush maybe not being broadcast
> > 
> > If it leaves other cores with incoherent I cache, then that's already
> > a problem for SMP cores, since there could be no guarantee that the
> > modifications made by one core will be visible to some other core that
> > ends up running that code - and there is little option for userspace to
> > work around that except by pinning the thread making the modifications
> > and subsequently executing the code to a core.
> 
> That's where SYNC_CORE can help. Or you make sys_cacheflush() do a
> system wide IPI.

If it's a problem, then it needs fixing. sys_cacheflush() is used to
implement GCC's __builtin___clear_cache(). I'm not sure who added this
to gcc.

> > The same is also true of flush_icache_range() - which is used when
> > loading a kernel module. In the case Will is referring to, these alias
> > to the same code.
> 
> Yes, cache_ops_need_broadcast() seems to be missing in more places.

Likely only in places where we care about I/D coherency - as the data
cache is required to be PIPT on these SMP platforms.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH 7/8] membarrier: Remove arm (32) support for SYNC_CORE
  2021-06-16 13:22             ` Russell King (Oracle)
@ 2021-06-16 15:04               ` Catalin Marinas
  -1 siblings, 0 replies; 165+ messages in thread
From: Catalin Marinas @ 2021-06-16 15:04 UTC (permalink / raw)
  To: Russell King (Oracle)
  Cc: Peter Zijlstra, Andy Lutomirski, x86, Dave Hansen, LKML,
	linux-mm, Andrew Morton, Mathieu Desnoyers, Nicholas Piggin,
	linux-arm-kernel, Will Deacon

On Wed, Jun 16, 2021 at 02:22:27PM +0100, Russell King wrote:
> On Wed, Jun 16, 2021 at 01:10:58PM +0200, Peter Zijlstra wrote:
> > On Wed, Jun 16, 2021 at 11:34:46AM +0100, Russell King (Oracle) wrote:
> > > On Wed, Jun 16, 2021 at 12:20:06PM +0200, Peter Zijlstra wrote:
> > > > On Wed, Jun 16, 2021 at 12:16:27PM +0200, Peter Zijlstra wrote:
> > > > > On Tue, Jun 15, 2021 at 08:21:12PM -0700, Andy Lutomirski wrote:
> > > > > > On arm32, the only way to safely flush icache from usermode is to call
> > > > > > cacheflush(2).  This also handles any required pipeline flushes, so
> > > > > > membarrier's SYNC_CORE feature is useless on arm.  Remove it.
> > > > > 
> > > > > So SYNC_CORE is there to help an architecture that needs to do something
> > > > > per CPU. If I$ invalidation is broadcast and I$ invalidation also
> > > > > triggers the flush of any uarch caches derived from it (if there are
> > > > > any).
> > > > 
> > > > Incomplete sentence there: + then we don't need SYNC_CORE.
> > > > 
> > > > > Now arm_syscall() NR(cacheflush) seems to do flush_icache_user_range(),
> > > > > which, if I read things right, end up in arch/arm/mm/*.S, but that
> > > > > doesn't consider cache_ops_need_broadcast().
> > > > > 
> > > > > Will suggests that perhaps ARM 11MPCore might need this due to their I$
> > > > > flush maybe not being broadcast
> > > 
> > > If it leaves other cores with incoherent I cache, then that's already
> > > a problem for SMP cores, since there could be no guarantee that the
> > > modifications made by one core will be visible to some other core that
> > > ends up running that code - and there is little option for userspace to
> > > work around that except by pinning the thread making the modifications
> > > and subsequently executing the code to a core.
> > 
> > That's where SYNC_CORE can help. Or you make sys_cacheflush() do a
> > system wide IPI.
> 
> If it's a problem, then it needs fixing. sys_cacheflush() is used to
> implement GCC's __builtin___clear_cache(). I'm not sure who added this
> to gcc.

I'm surprised that it works. I guess it's just luck that the thread
doing the code writing doesn't migrate before the sys_cacheflush() call.

> > > The same is also true of flush_icache_range() - which is used when
> > > loading a kernel module. In the case Will is referring to, these alias
> > > to the same code.
> > 
> > Yes, cache_ops_need_broadcast() seems to be missing in more places.
> 
> Likely only in places where we care about I/D coherency - as the data
> cache is required to be PIPT on these SMP platforms.

We had similar issue with the cache maintenance for DMA. The hack we
employed (in cache.S) is relying on the MESI protocol internals and
forcing a read/write for ownership before the D-cache maintenance.
Luckily ARM11MPCore doesn't do speculative data loads to trigger some
migration back.

The simpler fix for flush_icache_range() is to disable preemption, read
a word in a cacheline to force any dirty lines on another CPU to be
evicted and then issue the D-cache maintenance (for those cache lines
which are still dirty on the current CPU).

It's a hack that only works on ARM11MPCore. Newer MP cores are saner.

-- 
Catalin

^ permalink raw reply	[flat|nested] 165+ messages in thread
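
A rough sketch of the "read a word per line, then clean" fix Catalin
outlines, where clean_dcache_line_to_unification() is an invented stand-in
for the real cache maintenance instruction:

  static void rwfo_clean_range(unsigned long start, unsigned long end)
  {
          unsigned long addr;

          preempt_disable();
          for (addr = start & ~(L1_CACHE_BYTES - 1); addr < end;
               addr += L1_CACHE_BYTES) {
                  /* Read for ownership: pulls a line that is dirty on
                   * another CPU over to this one... */
                  (void)*(volatile unsigned long *)addr;
                  /* ...so the local clean actually writes it out. */
                  clean_dcache_line_to_unification(addr);
          }
          preempt_enable();
          /* I-cache invalidation still has to be handled separately. */
  }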

* Re: [PATCH 7/8] membarrier: Remove arm (32) support for SYNC_CORE
  2021-06-16 15:04               ` Catalin Marinas
@ 2021-06-16 15:23                 ` Russell King (Oracle)
  -1 siblings, 0 replies; 165+ messages in thread
From: Russell King (Oracle) @ 2021-06-16 15:23 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Peter Zijlstra, Andy Lutomirski, x86, Dave Hansen, LKML,
	linux-mm, Andrew Morton, Mathieu Desnoyers, Nicholas Piggin,
	linux-arm-kernel, Will Deacon

On Wed, Jun 16, 2021 at 04:04:56PM +0100, Catalin Marinas wrote:
> On Wed, Jun 16, 2021 at 02:22:27PM +0100, Russell King wrote:
> > If it's a problem, then it needs fixing. sys_cacheflush() is used to
> > implement GCC's __builtin___clear_cache(). I'm not sure who added this
> > to gcc.
> 
> I'm surprised that it works. I guess it's just luck that the thread
> doing the code writing doesn't migrate before the sys_cacheflush() call.

Maybe the platforms that use ARM MPCore avoid the issue somehow (maybe
by not using self-modifying code?)

> > Likely only in places where we care about I/D coherency - as the data
> > cache is required to be PIPT on these SMP platforms.
> 
> We had similar issue with the cache maintenance for DMA. The hack we
> employed (in cache.S) is relying on the MESI protocol internals and
> forcing a read/write for ownership before the D-cache maintenance.
> Luckily ARM11MPCore doesn't do speculative data loads to trigger some
> migration back.

That's very similar to the hack that was originally implemented for
MPCore DMA - see the DMA_CACHE_RWFO configuration option.

An interesting point here is that cache_ops_need_broadcast() reads
MMFR3 bits 12..15, which in the MPCore TRM has nothing to do with cache
operation broadcasting - but luckily is documented as containing zero.
So, cache_ops_need_broadcast() returns correctly (true) here.

> The simpler fix for flush_icache_range() is to disable preemption, read
> a word in a cacheline to force any dirty lines on another CPU to be
> evicted and then issue the D-cache maintenance (for those cache lines
> which are still dirty on the current CPU).

Is just reading sufficient? If so, why do we do a read-then-write in
the MPCore DMA cache ops? Don't we need the write to force exclusive
ownership? If we don't have exclusive ownership of the dirty line,
how can we be sure to write it out of the caches?

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!

^ permalink raw reply	[flat|nested] 165+ messages in thread
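
For reference, the check under discussion amounts to reading the
maintenance-broadcast field, ID_MMFR3[15:12], and treating zero as "not
broadcast". Roughly (paraphrased, not a verbatim copy of the arm header):

  static inline int cache_ops_need_broadcast(void)
  {
          if (!IS_ENABLED(CONFIG_SMP) || __LINUX_ARM_ARCH__ >= 7)
                  return 0;       /* ARMv7+ broadcasts maintenance in hardware */

          /* Field is read-as-zero on pre-v7 cores such as ARM11MPCore. */
          return ((read_cpuid_ext(CPUID_EXT_MMFR3) >> 12) & 0xf) < 1;
  }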

* Re: [PATCH 7/8] membarrier: Remove arm (32) support for SYNC_CORE
  2021-06-16 15:23                 ` Russell King (Oracle)
@ 2021-06-16 15:45                   ` Catalin Marinas
  -1 siblings, 0 replies; 165+ messages in thread
From: Catalin Marinas @ 2021-06-16 15:45 UTC (permalink / raw)
  To: Russell King (Oracle)
  Cc: Peter Zijlstra, Andy Lutomirski, x86, Dave Hansen, LKML,
	linux-mm, Andrew Morton, Mathieu Desnoyers, Nicholas Piggin,
	linux-arm-kernel, Will Deacon

On Wed, Jun 16, 2021 at 04:23:26PM +0100, Russell King wrote:
> On Wed, Jun 16, 2021 at 04:04:56PM +0100, Catalin Marinas wrote:
> > On Wed, Jun 16, 2021 at 02:22:27PM +0100, Russell King wrote:
> > > If it's a problem, then it needs fixing. sys_cacheflush() is used to
> > > implement GCC's __builtin___clear_cache(). I'm not sure who added this
> > > to gcc.
> > 
> > I'm surprised that it works. I guess it's just luck that the thread
> > doing the code writing doesn't migrate before the sys_cacheflush() call.
> 
> Maybe the platforms that use ARM MPCore avoid the issue somehow (maybe
> by not using self-modifying code?)

Not sure how widely it is/was used with JITs. In general, I think the
systems at the time were quite tolerant to missing I-cache maintenance
(maybe small caches?). We ran Linux for a while without 826cbdaff297
("[ARM] 5092/1: Fix the I-cache invalidation on ARMv6 and later CPUs").

> > > Likely only in places where we care about I/D coherency - as the data
> > > cache is required to be PIPT on these SMP platforms.
> > 
> > We had similar issue with the cache maintenance for DMA. The hack we
> > employed (in cache.S) is relying on the MESI protocol internals and
> > forcing a read/write for ownership before the D-cache maintenance.
> > Luckily ARM11MPCore doesn't do speculative data loads to trigger some
> > migration back.
> 
> That's very similar to the hack that was originally implemented for
> MPCore DMA - see the DMA_CACHE_RWFO configuration option.

Well, yes, that's what I wrote above ;) (I added the hack and config
option IIRC).

> An interesting point here is that cache_ops_need_broadcast() reads
> MMFR3 bits 12..15, which in the MPCore TRM has nothing to do with cache
> operation broadcasting - but luckily is documented as containing zero.
> So, cache_ops_need_broadcast() returns correctly (true) here.

That's typical with any new feature. The 12..15 field was added in ARMv7
stating that cache maintenance is broadcast in hardware. Prior to this,
the field was read-as-zero. So it's not luck but we could have avoided
negating the meaning here, i.e. call it cache_ops_are_broadcast().

> > The simpler fix for flush_icache_range() is to disable preemption, read
> > a word in a cacheline to force any dirty lines on another CPU to be
> > evicted and then issue the D-cache maintenance (for those cache lines
> > which are still dirty on the current CPU).
> 
> Is just reading sufficient? If so, why do we do a read-then-write in
> the MPCore DMA cache ops? Don't we need the write to force exclusive
> ownership? If we don't have exclusive ownership of the dirty line,
> how can we be sure to write it out of the caches?

For cleaning (which is the case for I/D coherency), we only need reading
since we are fine with clean lines being left in the D-cache on other
CPUs. For invalidation, we indeed need to force the exclusive ownership,
hence the write.

-- 
Catalin

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH 7/8] membarrier: Remove arm (32) support for SYNC_CORE
  2021-06-16 15:45                   ` Catalin Marinas
@ 2021-06-16 16:00                     ` Catalin Marinas
  -1 siblings, 0 replies; 165+ messages in thread
From: Catalin Marinas @ 2021-06-16 16:00 UTC (permalink / raw)
  To: Russell King (Oracle)
  Cc: Peter Zijlstra, Andy Lutomirski, x86, Dave Hansen, LKML,
	linux-mm, Andrew Morton, Mathieu Desnoyers, Nicholas Piggin,
	linux-arm-kernel, Will Deacon

On Wed, Jun 16, 2021 at 04:45:29PM +0100, Catalin Marinas wrote:
> On Wed, Jun 16, 2021 at 04:23:26PM +0100, Russell King wrote:
> > On Wed, Jun 16, 2021 at 04:04:56PM +0100, Catalin Marinas wrote:
> > > The simpler fix for flush_icache_range() is to disable preemption, read
> > > a word in a cacheline to force any dirty lines on another CPU to be
> > > evicted and then issue the D-cache maintenance (for those cache lines
> > > which are still dirty on the current CPU).
> > 
> > Is just reading sufficient? If so, why do we do a read-then-write in
> > the MPCore DMA cache ops? Don't we need the write to force exclusive
> > ownership? If we don't have exclusive ownership of the dirty line,
> > how can we be sure to write it out of the caches?
> 
> For cleaning (which is the case for I/D coherency), we only need reading
> since we are fine with clean lines being left in the D-cache on other
> CPUs. For invalidation, we indeed need to force the exclusive ownership,
> hence the write.

Ah, I'm not sure the I-cache is broadcast in hardware on ARM11MPCore
either. So fixing the D side won't be sufficient.

-- 
Catalin

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH 7/8] membarrier: Remove arm (32) support for SYNC_CORE
  2021-06-16 16:00                     ` Catalin Marinas
@ 2021-06-16 16:27                       ` Russell King (Oracle)
  -1 siblings, 0 replies; 165+ messages in thread
From: Russell King (Oracle) @ 2021-06-16 16:27 UTC (permalink / raw)
  To: Catalin Marinas, Linus Walleij, Krzysztof Halasa, Neil Armstrong
  Cc: Peter Zijlstra, Andy Lutomirski, x86, Dave Hansen, LKML,
	linux-mm, Andrew Morton, Mathieu Desnoyers, Nicholas Piggin,
	linux-arm-kernel, Will Deacon

On Wed, Jun 16, 2021 at 05:00:51PM +0100, Catalin Marinas wrote:
> On Wed, Jun 16, 2021 at 04:45:29PM +0100, Catalin Marinas wrote:
> > On Wed, Jun 16, 2021 at 04:23:26PM +0100, Russell King wrote:
> > > On Wed, Jun 16, 2021 at 04:04:56PM +0100, Catalin Marinas wrote:
> > > > The simpler fix for flush_icache_range() is to disable preemption, read
> > > > a word in a cacheline to force any dirty lines on another CPU to be
> > > > evicted and then issue the D-cache maintenance (for those cache lines
> > > > which are still dirty on the current CPU).
> > > 
> > > Is just reading sufficient? If so, why do we do a read-then-write in
> > > the MPCore DMA cache ops? Don't we need the write to force exclusive
> > > ownership? If we don't have exclusive ownership of the dirty line,
> > > how can we be sure to write it out of the caches?
> > 
> > For cleaning (which is the case for I/D coherency), we only need reading
> > since we are fine with clean lines being left in the D-cache on other
> > CPUs. For invalidation, we indeed need to force the exclusive ownership,
> > hence the write.
> 
> Ah, I'm not sure the I-cache is broadcast in hardware on ARM11MPCore
> either. So fixing the D side won't be sufficient.

The other question is... do we bother to fix this.

Arnd tells me that the current remaining ARM11MPCore users are:
- CNS3xxx (where there is some marginal interest in the Gateworks
  Laguna platform)
- Similar for OXNAS
- There used to be the Realview MPCore tile - I haven't turned that on
  in ages, and it may be that the 3V cell that backs up the encryption
  keys is dead so it may not even boot.
- Not sure about the story with QEMU - Arnd doesn't think there would
  be a problem there as it may not model caches.

So it seems to come down to a question about CNS3xxx and OXNAS. If
these aren't being used, maybe we can drop ARM11MPCore support and
the associated platforms?

Linus, Krzysztof, Neil, any input?

Thanks.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH 2/8] x86/mm: Handle unlazying membarrier core sync in the arch code
  2021-06-16  3:21 ` [PATCH 2/8] x86/mm: Handle unlazying membarrier core sync in the arch code Andy Lutomirski
@ 2021-06-16 17:49     ` Mathieu Desnoyers
  2021-06-16 17:49     ` Mathieu Desnoyers
  1 sibling, 0 replies; 165+ messages in thread
From: Mathieu Desnoyers @ 2021-06-16 17:49 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: x86, Dave Hansen, linux-kernel, linux-mm, Andrew Morton,
	Nicholas Piggin, Peter Zijlstra

----- On Jun 15, 2021, at 11:21 PM, Andy Lutomirski luto@kernel.org wrote:
[...]
> @@ -473,16 +474,24 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct
> mm_struct *next,

[...]

> @@ -510,16 +520,35 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct
> mm_struct *next,
> 		 * If the TLB is up to date, just use it.
> 		 * The barrier synchronizes with the tlb_gen increment in
> 		 * the TLB shootdown code.
> +		 *
> +		 * As a future optimization opportunity, it's plausible
> +		 * that the x86 memory model is strong enough that this
> +		 * smp_mb() isn't needed.
> 		 */
> 		smp_mb();
> 		next_tlb_gen = atomic64_read(&next->context.tlb_gen);
> 		if (this_cpu_read(cpu_tlbstate.ctxs[prev_asid].tlb_gen) ==
> -				next_tlb_gen)
> +		    next_tlb_gen) {
> +#ifdef CONFIG_MEMBARRIER
> +			/*
> +			 * We switched logical mm but we're not going to
> +			 * write to CR3.  We already did smp_mb() above,
> +			 * but membarrier() might require a sync_core()
> +			 * as well.
> +			 */
> +			if (unlikely(atomic_read(&next->membarrier_state) &
> +				     MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE))
> +				sync_core_before_usermode();
> +#endif
> +
> 			return;
> +		}

[...]

I find that mixing up preprocessor #ifdef and code logic hurts readability.
Can you lift this into a static function within the same compile unit, and
provide an empty implementation for the #else case?
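
Something along these lines, perhaps (untested sketch only; the helper
name below is made up for illustration, it is not taken from your patch):

#ifdef CONFIG_MEMBARRIER
static void membarrier_unlazied_sync_core(struct mm_struct *next)
{
	/*
	 * We switched the logical mm but will not write to CR3;
	 * membarrier() might still require a sync_core().
	 */
	if (unlikely(atomic_read(&next->membarrier_state) &
		     MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE))
		sync_core_before_usermode();
}
#else
static void membarrier_unlazied_sync_core(struct mm_struct *next)
{
}
#endif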

Thanks,

Mathieu

	prev->sched_class->task_dead(prev);



-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH 3/8] membarrier: Remove membarrier_arch_switch_mm() prototype in core code
  2021-06-16  3:21 ` [PATCH 3/8] membarrier: Remove membarrier_arch_switch_mm() prototype in core code Andy Lutomirski
@ 2021-06-16 17:52     ` Mathieu Desnoyers
  2021-06-16 17:52     ` Mathieu Desnoyers
  1 sibling, 0 replies; 165+ messages in thread
From: Mathieu Desnoyers @ 2021-06-16 17:52 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: x86, Dave Hansen, linux-kernel, linux-mm, Andrew Morton,
	Nicholas Piggin, Peter Zijlstra

----- On Jun 15, 2021, at 11:21 PM, Andy Lutomirski luto@kernel.org wrote:

> membarrier_arch_switch_mm()'s sole implementation and caller are in
> arch/powerpc.  Having a fallback implementation in include/linux is
> confusing -- remove it.
> 
> It's still mentioned in a comment, but a subsequent patch will remove
> it.
> 

Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>

> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> Cc: Nicholas Piggin <npiggin@gmail.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> ---
> include/linux/sched/mm.h | 7 -------
> 1 file changed, 7 deletions(-)
> 
> diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
> index 24d97d1b6252..10aace21d25e 100644
> --- a/include/linux/sched/mm.h
> +++ b/include/linux/sched/mm.h
> @@ -350,13 +350,6 @@ extern void membarrier_exec_mmap(struct mm_struct *mm);
> extern void membarrier_update_current_mm(struct mm_struct *next_mm);
> 
> #else
> -#ifdef CONFIG_ARCH_HAS_MEMBARRIER_CALLBACKS
> -static inline void membarrier_arch_switch_mm(struct mm_struct *prev,
> -					     struct mm_struct *next,
> -					     struct task_struct *tsk)
> -{
> -}
> -#endif
> static inline void membarrier_exec_mmap(struct mm_struct *mm)
> {
> }
> --
> 2.31.1

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH 5/8] membarrier, kthread: Use _ONCE accessors for task->mm
  2021-06-16  3:21 ` [PATCH 5/8] membarrier, kthread: Use _ONCE accessors for task->mm Andy Lutomirski
@ 2021-06-16 18:08     ` Mathieu Desnoyers
  2021-06-16 18:08     ` Mathieu Desnoyers
  1 sibling, 0 replies; 165+ messages in thread
From: Mathieu Desnoyers @ 2021-06-16 18:08 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: x86, Dave Hansen, linux-kernel, linux-mm, Andrew Morton,
	Nicholas Piggin, Peter Zijlstra

----- On Jun 15, 2021, at 11:21 PM, Andy Lutomirski luto@kernel.org wrote:

> membarrier reads cpu_rq(remote cpu)->curr->mm without locking.  Use
> READ_ONCE() and WRITE_ONCE() to remove the data races.

I notice that kernel/exit.c:exit_mm() also has:

        current->mm = NULL;

I suspect you may want to add a WRITE_ONCE() there as well?
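
i.e. something like this (sketch only):

-	current->mm = NULL;
+	WRITE_ONCE(current->mm, NULL);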

Thanks,

Mathieu


-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH 2/8] x86/mm: Handle unlazying membarrier core sync in the arch code
  2021-06-16 17:49     ` Mathieu Desnoyers
  (?)
@ 2021-06-16 18:31     ` Andy Lutomirski
  -1 siblings, 0 replies; 165+ messages in thread
From: Andy Lutomirski @ 2021-06-16 18:31 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: x86, Dave Hansen, linux-kernel, linux-mm, Andrew Morton,
	Nicholas Piggin, Peter Zijlstra

On 6/16/21 10:49 AM, Mathieu Desnoyers wrote:
> ----- On Jun 15, 2021, at 11:21 PM, Andy Lutomirski luto@kernel.org wrote:
> [...]
>> @@ -473,16 +474,24 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct
>> mm_struct *next,
> 
> [...]
> 
>> @@ -510,16 +520,35 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct
>> mm_struct *next,
>> 		 * If the TLB is up to date, just use it.
>> 		 * The barrier synchronizes with the tlb_gen increment in
>> 		 * the TLB shootdown code.
>> +		 *
>> +		 * As a future optimization opportunity, it's plausible
>> +		 * that the x86 memory model is strong enough that this
>> +		 * smp_mb() isn't needed.
>> 		 */
>> 		smp_mb();
>> 		next_tlb_gen = atomic64_read(&next->context.tlb_gen);
>> 		if (this_cpu_read(cpu_tlbstate.ctxs[prev_asid].tlb_gen) ==
>> -				next_tlb_gen)
>> +		    next_tlb_gen) {
>> +#ifdef CONFIG_MEMBARRIER
>> +			/*
>> +			 * We switched logical mm but we're not going to
>> +			 * write to CR3.  We already did smp_mb() above,
>> +			 * but membarrier() might require a sync_core()
>> +			 * as well.
>> +			 */
>> +			if (unlikely(atomic_read(&next->membarrier_state) &
>> +				     MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE))
>> +				sync_core_before_usermode();
>> +#endif
>> +
>> 			return;
>> +		}
> 
> [...]
> 
> I find that mixing up preprocessor #ifdef and code logic hurts readability.
> Can you lift this into a static function within the same compile unit, and
> provide an empty implementation for the #else case?

Done.

> 
> Thanks,
> 
> Mathieu
> 
> 	prev->sched_class->task_dead(prev);
> 
> 
> 


^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH 2/8] x86/mm: Handle unlazying membarrier core sync in the arch code
  2021-06-16  4:25   ` Nicholas Piggin
@ 2021-06-16 18:31     ` Andy Lutomirski
  0 siblings, 0 replies; 165+ messages in thread
From: Andy Lutomirski @ 2021-06-16 18:31 UTC (permalink / raw)
  To: Nicholas Piggin, x86
  Cc: Andrew Morton, Dave Hansen, LKML, linux-mm, Mathieu Desnoyers,
	Peter Zijlstra

On 6/15/21 9:25 PM, Nicholas Piggin wrote:
> Excerpts from Andy Lutomirski's message of June 16, 2021 1:21 pm:

> I'm fine with the patch though, except I would leave the comment in the
> core sched code saying any arch specific sequence to deal with
> SYNC_CORE is required for that case.

Done.

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH 4/8] membarrier: Make the post-switch-mm barrier explicit
  2021-06-16  7:35     ` Peter Zijlstra
@ 2021-06-16 18:41       ` Andy Lutomirski
  2021-06-17  1:37         ` Nicholas Piggin
  2021-06-17  8:45         ` Peter Zijlstra
  0 siblings, 2 replies; 165+ messages in thread
From: Andy Lutomirski @ 2021-06-16 18:41 UTC (permalink / raw)
  To: Peter Zijlstra, Nicholas Piggin
  Cc: x86, Andrew Morton, Dave Hansen, LKML, linux-mm, Mathieu Desnoyers

On 6/16/21 12:35 AM, Peter Zijlstra wrote:
> On Wed, Jun 16, 2021 at 02:19:49PM +1000, Nicholas Piggin wrote:
>> Excerpts from Andy Lutomirski's message of June 16, 2021 1:21 pm:
>>> membarrier() needs a barrier after any CPU changes mm.  There is currently
>>> a comment explaining why this barrier probably exists in all cases.  This
>>> is very fragile -- any change to the relevant parts of the scheduler
>>> might get rid of these barriers, and it's not really clear to me that
>>> the barrier actually exists in all necessary cases.
>>
>> The comments and barriers in the mmdrop() hunks? I don't see what is 
>> fragile or maybe-buggy about this. The barrier definitely exists.
>>
>> And any change can change anything, that doesn't make it fragile. My
>> lazy tlb refcounting change avoids the mmdrop in some cases, but it
>> replaces it with smp_mb for example.
> 
> I'm with Nick again, on this. You're adding extra barriers for no
> discernible reason, that's not generally encouraged, seeing how extra
> barriers is extra slow.
> 
> Both mmdrop() itself, as well as the callsite have comments saying how
> membarrier relies on the implied barrier, what's fragile about that?
> 

My real motivation is that mmgrab() and mmdrop() don't actually need to
be full barriers.  The current implementation has them being full
barriers, and the current implementation is quite slow.  So let's try
that commit message again:

membarrier() needs a barrier after any CPU changes mm.  There is currently
a comment explaining why this barrier probably exists in all cases. The
logic is based on ensuring that the barrier exists on every control flow
path through the scheduler.  It also relies on mmgrab() and mmdrop() being
full barriers.

mmgrab() and mmdrop() would be better if they were not full barriers.  As a
trivial optimization, mmgrab() could use a relaxed atomic and mmdrop()
could use a release on architectures that have these operations.  Larger
optimizations are also in the works.  Doing any of these optimizations
while preserving an unnecessary barrier will complicate the code and
penalize non-membarrier-using tasks.

Simplify the logic by adding an explicit barrier, and allow architectures
to override it as an optimization if they want to.

One of the deleted comments in this patch said "It is therefore
possible to schedule between user->kernel->user threads without
passing through switch_mm()".  It is possible to do this without, say,
writing to CR3 on x86, but the core scheduler indeed calls
switch_mm_irqs_off() to tell the arch code to go back from lazy mode
to no-lazy mode.
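
(For illustration only, the "explicit barrier that architectures can
override" shape could look roughly like the following; the helper name
below is made up here, it is not the one used in the patch:

#ifndef membarrier_finish_switch_mm
/* Generic fallback: full barrier after updating rq->curr and switching mm. */
static inline void membarrier_finish_switch_mm(struct mm_struct *mm)
{
	if (atomic_read(&mm->membarrier_state) &
	    (MEMBARRIER_STATE_GLOBAL_EXPEDITED |
	     MEMBARRIER_STATE_PRIVATE_EXPEDITED))
		smp_mb();
}
#endif

An architecture whose mm-switch or return-to-user path already implies
the needed ordering can then define it to something cheaper or empty.)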

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH 5/8] membarrier, kthread: Use _ONCE accessors for task->mm
  2021-06-16 18:08     ` Mathieu Desnoyers
  (?)
@ 2021-06-16 18:45     ` Andy Lutomirski
  -1 siblings, 0 replies; 165+ messages in thread
From: Andy Lutomirski @ 2021-06-16 18:45 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: x86, Dave Hansen, linux-kernel, linux-mm, Andrew Morton,
	Nicholas Piggin, Peter Zijlstra

On 6/16/21 11:08 AM, Mathieu Desnoyers wrote:
> ----- On Jun 15, 2021, at 11:21 PM, Andy Lutomirski luto@kernel.org wrote:
> 
>> membarrier reads cpu_rq(remote cpu)->curr->mm without locking.  Use
>> READ_ONCE() and WRITE_ONCE() to remove the data races.
> 
> I notice that kernel/exit.c:exit_mm() also has:
> 
>         current->mm = NULL;
> 
> I suspect you may want to add a WRITE_ONCE() there as well?

Good catch.  I was thinking that exit_mm() couldn't execute concurrently
with membarrier(), but that's wrong.

--Andy

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH 8/8] membarrier: Rewrite sync_core_before_usermode() and improve documentation
  2021-06-16  4:45     ` Nicholas Piggin
  (?)
@ 2021-06-16 18:52       ` Andy Lutomirski
  -1 siblings, 0 replies; 165+ messages in thread
From: Andy Lutomirski @ 2021-06-16 18:52 UTC (permalink / raw)
  To: Nicholas Piggin, x86
  Cc: Andrew Morton, Benjamin Herrenschmidt, Catalin Marinas,
	Dave Hansen, linux-arm-kernel, LKML, linux-mm, linuxppc-dev,
	Mathieu Desnoyers, Michael Ellerman, Paul Mackerras,
	Peter Zijlstra, stable, Will Deacon

On 6/15/21 9:45 PM, Nicholas Piggin wrote:
> Excerpts from Andy Lutomirski's message of June 16, 2021 1:21 pm:
>> The old sync_core_before_usermode() comments suggested that a non-icache-syncing
>> return-to-usermode instruction is x86-specific and that all other
>> architectures automatically notice cross-modified code on return to
>> userspace.

>> +/*
>> + * XXX: can a powerpc person put an appropriate comment here?
>> + */
>> +static inline void membarrier_sync_core_before_usermode(void)
>> +{
>> +}
>> +
>> +#endif /* _ASM_POWERPC_SYNC_CORE_H */
> 
> powerpc's can just go in asm/membarrier.h

$ ls arch/powerpc/include/asm/membarrier.h
ls: cannot access 'arch/powerpc/include/asm/membarrier.h': No such file
or directory


> 
> /*
>  * The RFI family of instructions are context synchronising, and
>  * that is how we return to userspace, so nothing is required here.
>  */

Thanks!

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH 8/8] membarrier: Rewrite sync_core_before_usermode() and improve documentation
  2021-06-16 18:52       ` Andy Lutomirski
  (?)
@ 2021-06-16 23:48         ` Andy Lutomirski
  -1 siblings, 0 replies; 165+ messages in thread
From: Andy Lutomirski @ 2021-06-16 23:48 UTC (permalink / raw)
  To: Nicholas Piggin, the arch/x86 maintainers
  Cc: Andrew Morton, Benjamin Herrenschmidt, Catalin Marinas,
	Dave Hansen, linux-arm-kernel, Linux Kernel Mailing List,
	linux-mm, linuxppc-dev, Mathieu Desnoyers, Michael Ellerman,
	Paul Mackerras, Peter Zijlstra (Intel),
	stable, Will Deacon

On Wed, Jun 16, 2021, at 11:52 AM, Andy Lutomirski wrote:
> On 6/15/21 9:45 PM, Nicholas Piggin wrote:
> > Excerpts from Andy Lutomirski's message of June 16, 2021 1:21 pm:
> >> The old sync_core_before_usermode() comments suggested that a non-icache-syncing
> >> return-to-usermode instruction is x86-specific and that all other
> >> architectures automatically notice cross-modified code on return to
> >> userspace.
> 
> >> +/*
> >> + * XXX: can a powerpc person put an appropriate comment here?
> >> + */
> >> +static inline void membarrier_sync_core_before_usermode(void)
> >> +{
> >> +}
> >> +
> >> +#endif /* _ASM_POWERPC_SYNC_CORE_H */
> > 
> > powerpc's can just go in asm/membarrier.h
> 
> $ ls arch/powerpc/include/asm/membarrier.h
> ls: cannot access 'arch/powerpc/include/asm/membarrier.h': No such file
> or directory

Which is because I deleted it.  Duh.  I'll clean this up.

> 
> 
> > 
> > /*
> >  * The RFI family of instructions are context synchronising, and
> >  * that is how we return to userspace, so nothing is required here.
> >  */
> 
> Thanks!
> 

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH 8/8] membarrier: Rewrite sync_core_before_usermode() and improve documentation
  2021-06-16 10:20     ` Will Deacon
  (?)
@ 2021-06-16 23:58       ` Andy Lutomirski
  -1 siblings, 0 replies; 165+ messages in thread
From: Andy Lutomirski @ 2021-06-16 23:58 UTC (permalink / raw)
  To: Will Deacon
  Cc: the arch/x86 maintainers, Dave Hansen, Linux Kernel Mailing List,
	linux-mm, Andrew Morton, Michael Ellerman,
	Benjamin Herrenschmidt, Paul Mackerras, linuxppc-dev,
	Nicholas Piggin, Catalin Marinas, linux-arm-kernel,
	Mathieu Desnoyers, Peter Zijlstra (Intel),
	stable

On Wed, Jun 16, 2021, at 3:20 AM, Will Deacon wrote:
> 
> For the arm64 bits (docs and asm/sync_core.h):
> 
> Acked-by: Will Deacon <will@kernel.org>
> 

Thanks.

Per Nick's suggestion, I renamed the header to membarrier.h.  Unless I hear otherwise, I'll keep the ack.

> Will
> 

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH 4/8] membarrier: Make the post-switch-mm barrier explicit
  2021-06-16 18:41       ` Andy Lutomirski
@ 2021-06-17  1:37         ` Nicholas Piggin
  2021-06-17  2:57           ` Andy Lutomirski
  2021-06-17  8:45         ` Peter Zijlstra
  1 sibling, 1 reply; 165+ messages in thread
From: Nicholas Piggin @ 2021-06-17  1:37 UTC (permalink / raw)
  To: Andy Lutomirski, Peter Zijlstra
  Cc: Andrew Morton, Dave Hansen, LKML, linux-mm, Mathieu Desnoyers, x86

Excerpts from Andy Lutomirski's message of June 17, 2021 4:41 am:
> On 6/16/21 12:35 AM, Peter Zijlstra wrote:
>> On Wed, Jun 16, 2021 at 02:19:49PM +1000, Nicholas Piggin wrote:
>>> Excerpts from Andy Lutomirski's message of June 16, 2021 1:21 pm:
>>>> membarrier() needs a barrier after any CPU changes mm.  There is currently
>>>> a comment explaining why this barrier probably exists in all cases.  This
>>>> is very fragile -- any change to the relevant parts of the scheduler
>>>> might get rid of these barriers, and it's not really clear to me that
>>>> the barrier actually exists in all necessary cases.
>>>
>>> The comments and barriers in the mmdrop() hunks? I don't see what is 
>>> fragile or maybe-buggy about this. The barrier definitely exists.
>>>
>>> And any change can change anything, that doesn't make it fragile. My
>>> lazy tlb refcounting change avoids the mmdrop in some cases, but it
>>> replaces it with smp_mb for example.
>> 
>> I'm with Nick again, on this. You're adding extra barriers for no
>> discernible reason, that's not generally encouraged, seeing how extra
>> barriers is extra slow.
>> 
>> Both mmdrop() itself, as well as the callsite have comments saying how
>> membarrier relies on the implied barrier, what's fragile about that?
>> 
> 
> My real motivation is that mmgrab() and mmdrop() don't actually need to
> be full barriers.  The current implementation has them being full
> barriers, and the current implementation is quite slow.  So let's try
> that commit message again:
> 
> membarrier() needs a barrier after any CPU changes mm.  There is currently
> a comment explaining why this barrier probably exists in all cases. The
> logic is based on ensuring that the barrier exists on every control flow
> path through the scheduler.  It also relies on mmgrab() and mmdrop() being
> full barriers.
> 
> mmgrab() and mmdrop() would be better if they were not full barriers.  As a
> trivial optimization, mmgrab() could use a relaxed atomic and mmdrop()
> could use a release on architectures that have these operations.

I'm not against the idea, I've looked at something similar before (not
for mmdrop but a different primitive). Also my lazy tlb shootdown series 
could possibly take advantage of this, I might cherry pick it and test 
performance :)

I don't think it belongs in this series though. Should go together with
something that takes advantage of it.

Thanks,
Nick

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH 4/8] membarrier: Make the post-switch-mm barrier explicit
  2021-06-17  1:37         ` Nicholas Piggin
@ 2021-06-17  2:57           ` Andy Lutomirski
  2021-06-17  5:32             ` Andy Lutomirski
  0 siblings, 1 reply; 165+ messages in thread
From: Andy Lutomirski @ 2021-06-17  2:57 UTC (permalink / raw)
  To: Nicholas Piggin, Peter Zijlstra (Intel)
  Cc: Andrew Morton, Dave Hansen, Linux Kernel Mailing List, linux-mm,
	Mathieu Desnoyers, the arch/x86 maintainers



On Wed, Jun 16, 2021, at 6:37 PM, Nicholas Piggin wrote:
> Excerpts from Andy Lutomirski's message of June 17, 2021 4:41 am:
> > On 6/16/21 12:35 AM, Peter Zijlstra wrote:
> >> On Wed, Jun 16, 2021 at 02:19:49PM +1000, Nicholas Piggin wrote:
> >>> Excerpts from Andy Lutomirski's message of June 16, 2021 1:21 pm:
> >>>> membarrier() needs a barrier after any CPU changes mm.  There is currently
> >>>> a comment explaining why this barrier probably exists in all cases.  This
> >>>> is very fragile -- any change to the relevant parts of the scheduler
> >>>> might get rid of these barriers, and it's not really clear to me that
> >>>> the barrier actually exists in all necessary cases.
> >>>
> >>> The comments and barriers in the mmdrop() hunks? I don't see what is 
> >>> fragile or maybe-buggy about this. The barrier definitely exists.
> >>>
> >>> And any change can change anything, that doesn't make it fragile. My
> >>> lazy tlb refcounting change avoids the mmdrop in some cases, but it
> >>> replaces it with smp_mb for example.
> >> 
> >> I'm with Nick again, on this. You're adding extra barriers for no
> >> discernible reason, that's not generally encouraged, seeing how extra
> >> barriers is extra slow.
> >> 
> >> Both mmdrop() itself, as well as the callsite have comments saying how
> >> membarrier relies on the implied barrier, what's fragile about that?
> >> 
> > 
> > My real motivation is that mmgrab() and mmdrop() don't actually need to
> > be full barriers.  The current implementation has them being full
> > barriers, and the current implementation is quite slow.  So let's try
> > that commit message again:
> > 
> > membarrier() needs a barrier after any CPU changes mm.  There is currently
> > a comment explaining why this barrier probably exists in all cases. The
> > logic is based on ensuring that the barrier exists on every control flow
> > path through the scheduler.  It also relies on mmgrab() and mmdrop() being
> > full barriers.
> > 
> > mmgrab() and mmdrop() would be better if they were not full barriers.  As a
> > trivial optimization, mmgrab() could use a relaxed atomic and mmdrop()
> > could use a release on architectures that have these operations.
> 
> I'm not against the idea, I've looked at something similar before (not
> for mmdrop but a different primitive). Also my lazy tlb shootdown series 
> could possibly take advantage of this, I might cherry pick it and test 
> performance :)
> 
> I don't think it belongs in this series though. Should go together with
> something that takes advantage of it.

I’m going to see if I can get hazard pointers into shape quickly.

> 
> Thanks,
> Nick
> 

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH 4/8] membarrier: Make the post-switch-mm barrier explicit
  2021-06-17  2:57           ` Andy Lutomirski
@ 2021-06-17  5:32             ` Andy Lutomirski
  2021-06-17  6:51               ` Nicholas Piggin
                                 ` (2 more replies)
  0 siblings, 3 replies; 165+ messages in thread
From: Andy Lutomirski @ 2021-06-17  5:32 UTC (permalink / raw)
  To: Nicholas Piggin, Peter Zijlstra (Intel), Rik van Riel
  Cc: Andrew Morton, Dave Hansen, Linux Kernel Mailing List, linux-mm,
	Mathieu Desnoyers, the arch/x86 maintainers, Paul E. McKenney

On Wed, Jun 16, 2021, at 7:57 PM, Andy Lutomirski wrote:
> 
> 
> On Wed, Jun 16, 2021, at 6:37 PM, Nicholas Piggin wrote:
> > Excerpts from Andy Lutomirski's message of June 17, 2021 4:41 am:
> > > On 6/16/21 12:35 AM, Peter Zijlstra wrote:
> > >> On Wed, Jun 16, 2021 at 02:19:49PM +1000, Nicholas Piggin wrote:
> > >>> Excerpts from Andy Lutomirski's message of June 16, 2021 1:21 pm:
> > >>>> membarrier() needs a barrier after any CPU changes mm.  There is currently
> > >>>> a comment explaining why this barrier probably exists in all cases.  This
> > >>>> is very fragile -- any change to the relevant parts of the scheduler
> > >>>> might get rid of these barriers, and it's not really clear to me that
> > >>>> the barrier actually exists in all necessary cases.
> > >>>
> > >>> The comments and barriers in the mmdrop() hunks? I don't see what is 
> > >>> fragile or maybe-buggy about this. The barrier definitely exists.
> > >>>
> > >>> And any change can change anything, that doesn't make it fragile. My
> > >>> lazy tlb refcounting change avoids the mmdrop in some cases, but it
> > >>> replaces it with smp_mb for example.
> > >> 
> > >> I'm with Nick again, on this. You're adding extra barriers for no
> > >> discernible reason, that's not generally encouraged, seeing how extra
> > >> barriers is extra slow.
> > >> 
> > >> Both mmdrop() itself, as well as the callsite have comments saying how
> > >> membarrier relies on the implied barrier, what's fragile about that?
> > >> 
> > > 
> > > My real motivation is that mmgrab() and mmdrop() don't actually need to
> > > be full barriers.  The current implementation has them being full
> > > barriers, and the current implementation is quite slow.  So let's try
> > > that commit message again:
> > > 
> > > membarrier() needs a barrier after any CPU changes mm.  There is currently
> > > a comment explaining why this barrier probably exists in all cases. The
> > > logic is based on ensuring that the barrier exists on every control flow
> > > path through the scheduler.  It also relies on mmgrab() and mmdrop() being
> > > full barriers.
> > > 
> > > mmgrab() and mmdrop() would be better if they were not full barriers.  As a
> > > trivial optimization, mmgrab() could use a relaxed atomic and mmdrop()
> > > could use a release on architectures that have these operations.
> > 
> > I'm not against the idea, I've looked at something similar before (not
> > for mmdrop but a different primitive). Also my lazy tlb shootdown series 
> > could possibly take advantage of this, I might cherry pick it and test 
> > performance :)
> > 
> > I don't think it belongs in this series though. Should go together with
> > something that takes advantage of it.
> 
> I’m going to see if I can get hazard pointers into shape quickly.

Here it is.  Not even boot tested!

https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=sched/lazymm&id=ecc3992c36cb88087df9c537e2326efb51c95e31

Nick, I think you can accomplish much the same thing as your patch by:

#define for_each_possible_lazymm_cpu while (false)

although a more clever definition might be even more performant.

I would appreciate everyone's thoughts as to whether this scheme is sane.

Paul, I'm adding you for two reasons.  First, you seem to enjoy bizarre locking schemes.  Secondly, because maybe RCU could actually work here.  The basic idea is that we want to keep an mm_struct from being freed at an inopportune time.  The problem with naively using RCU is that each CPU can use one single mm_struct while in an idle extended quiescent state (but not a user extended quiescent state).  So rcu_read_lock() is right out.  If RCU could understand this concept, then maybe it could help us, but this seems a bit out of scope for RCU.
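
In miniature, the scheme is (heavily simplified; the real patch also
shuffles mm_count references around so nothing gets dropped twice):

	/* context switch, user task -> kernel thread: publish a hazard pointer */
	WRITE_ONCE(rq->lazy_mm, next->active_mm);

	/* context switch, kernel thread -> user task: retire it */
	smp_store_release(&rq->lazy_mm, NULL);

	/* __mmput(): mm_users hit zero, chase down any remaining lazy users */
	for_each_possible_lazymm_cpu(cpu, mm) {
		if (smp_load_acquire(&cpu_rq(cpu)->lazy_mm) != mm)
			continue;
		/* hand that CPU a reference; it will mmdrop() it at its next switch */
		xchg(&cpu_rq(cpu)->drop_mm, mm);
	}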

--Andy

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH 4/8] membarrier: Make the post-switch-mm barrier explicit
  2021-06-17  5:32             ` Andy Lutomirski
@ 2021-06-17  6:51               ` Nicholas Piggin
  2021-06-17 23:49                 ` Andy Lutomirski
  2021-06-17  9:08               ` [RFC][PATCH] sched: Use lightweight hazard pointers to grab lazy mms Peter Zijlstra
  2021-06-17 15:02               ` [PATCH 4/8] membarrier: Make the post-switch-mm barrier explicit Paul E. McKenney
  2 siblings, 1 reply; 165+ messages in thread
From: Nicholas Piggin @ 2021-06-17  6:51 UTC (permalink / raw)
  To: Andy Lutomirski, Peter Zijlstra (Intel), Rik van Riel
  Cc: Andrew Morton, Dave Hansen, Linux Kernel Mailing List, linux-mm,
	Mathieu Desnoyers, Paul E. McKenney, the arch/x86 maintainers

Excerpts from Andy Lutomirski's message of June 17, 2021 3:32 pm:
> On Wed, Jun 16, 2021, at 7:57 PM, Andy Lutomirski wrote:
>> 
>> 
>> On Wed, Jun 16, 2021, at 6:37 PM, Nicholas Piggin wrote:
>> > Excerpts from Andy Lutomirski's message of June 17, 2021 4:41 am:
>> > > On 6/16/21 12:35 AM, Peter Zijlstra wrote:
>> > >> On Wed, Jun 16, 2021 at 02:19:49PM +1000, Nicholas Piggin wrote:
>> > >>> Excerpts from Andy Lutomirski's message of June 16, 2021 1:21 pm:
>> > >>>> membarrier() needs a barrier after any CPU changes mm.  There is currently
>> > >>>> a comment explaining why this barrier probably exists in all cases.  This
>> > >>>> is very fragile -- any change to the relevant parts of the scheduler
>> > >>>> might get rid of these barriers, and it's not really clear to me that
>> > >>>> the barrier actually exists in all necessary cases.
>> > >>>
>> > >>> The comments and barriers in the mmdrop() hunks? I don't see what is 
>> > >>> fragile or maybe-buggy about this. The barrier definitely exists.
>> > >>>
>> > >>> And any change can change anything, that doesn't make it fragile. My
>> > >>> lazy tlb refcounting change avoids the mmdrop in some cases, but it
>> > >>> replaces it with smp_mb for example.
>> > >> 
>> > >> I'm with Nick again, on this. You're adding extra barriers for no
>> > >> discernible reason, that's not generally encouraged, seeing how extra
>> > >> barriers is extra slow.
>> > >> 
>> > >> Both mmdrop() itself, as well as the callsite have comments saying how
>> > >> membarrier relies on the implied barrier, what's fragile about that?
>> > >> 
>> > > 
>> > > My real motivation is that mmgrab() and mmdrop() don't actually need to
>> > > be full barriers.  The current implementation has them being full
>> > > barriers, and the current implementation is quite slow.  So let's try
>> > > that commit message again:
>> > > 
>> > > membarrier() needs a barrier after any CPU changes mm.  There is currently
>> > > a comment explaining why this barrier probably exists in all cases. The
>> > > logic is based on ensuring that the barrier exists on every control flow
>> > > path through the scheduler.  It also relies on mmgrab() and mmdrop() being
>> > > full barriers.
>> > > 
>> > > mmgrab() and mmdrop() would be better if they were not full barriers.  As a
>> > > trivial optimization, mmgrab() could use a relaxed atomic and mmdrop()
>> > > could use a release on architectures that have these operations.
>> > 
>> > I'm not against the idea, I've looked at something similar before (not
>> > for mmdrop but a different primitive). Also my lazy tlb shootdown series 
>> > could possibly take advantage of this, I might cherry pick it and test 
>> > performance :)
>> > 
>> > I don't think it belongs in this series though. Should go together with
>> > something that takes advantage of it.
>> 
>> I’m going to see if I can get hazard pointers into shape quickly.
> 
> Here it is.  Not even boot tested!
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=sched/lazymm&id=ecc3992c36cb88087df9c537e2326efb51c95e31
> 
> Nick, I think you can accomplish much the same thing as your patch by:
> 
> #define for_each_possible_lazymm_cpu while (false)

I'm not sure what you mean? For powerpc, other CPUs can be using the mm 
as lazy at this point. I must be missing something.

> 
> although a more clever definition might be even more performant.
> 
> I would appreciate everyone's thoughts as to whether this scheme is sane.

powerpc has no use for it. After the series in akpm's tree there's just
a small change required for radix TLB flushing to make the final flush
IPI also purge lazies, and then the shootdown scheme runs with zero
additional IPIs, so essentially no benefit to the hazard pointer games.
I have found the additional IPIs aren't bad anyway, so it's not something
we'd bother trying to optimise away on hash, which is slowly being
de-prioritized.

I must say, I still see active_mm featuring prominently in the patch,
which comes as a surprise. I would have thought some preparation and
cleanup work to fix the x86 deficiencies you were talking about should
go in first; I'm eager to see those. But either way I don't see
a fundamental reason this couldn't be done to support archs for which 
the standard or shootdown refcounting options aren't sufficient.

IIRC I didn't see a fundamental hole in it last time you posted the
idea but I admittedly didn't go through it super carefully.

Thanks,
Nick

> 
> Paul, I'm adding you for two reasons.  First, you seem to enjoy bizarre locking schemes.  Secondly, because maybe RCU could actually work here.  The basic idea is that we want to keep an mm_struct from being freed at an inopportune time.  The problem with naively using RCU is that each CPU can use one single mm_struct while in an idle extended quiescent state (but not a user extended quiescent state).  So rcu_read_lock() is right out.  If RCU could understand this concept, then maybe it could help us, but this seems a bit out of scope for RCU.
> 
> --Andy
> 

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH 4/8] membarrier: Make the post-switch-mm barrier explicit
  2021-06-16 18:41       ` Andy Lutomirski
  2021-06-17  1:37         ` Nicholas Piggin
@ 2021-06-17  8:45         ` Peter Zijlstra
  1 sibling, 0 replies; 165+ messages in thread
From: Peter Zijlstra @ 2021-06-17  8:45 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Nicholas Piggin, x86, Andrew Morton, Dave Hansen, LKML, linux-mm,
	Mathieu Desnoyers

On Wed, Jun 16, 2021 at 11:41:19AM -0700, Andy Lutomirski wrote:
> mmgrab() and mmdrop() would be better if they were not full barriers.  As a
> trivial optimization,

> mmgrab() could use a relaxed atomic and mmdrop()
> could use a release on architectures that have these operations.

mmgrab() *is* relaxed, mmdrop() is a full barrier but could trivially be
made weaker once membarrier stops caring about it.

static inline void mmdrop(struct mm_struct *mm)
{
	unsigned int val = atomic_dec_return_release(&mm->mm_count);
	if (unlikely(!val)) {
		/* Provide REL+ACQ ordering for free() */
		smp_acquire__after_ctrl_dep();
		__mmdrop(mm);
	}
}

It's slightly less optimal for not being able to use the flags from the
decrement. Or convert the whole thing to refcount_t (if appropriate)
which already does something like the above.
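
With refcount_t (assuming mm_count were converted; sketch only) it would
be roughly:

static inline void mmdrop(struct mm_struct *mm)
{
	if (refcount_dec_and_test(&mm->mm_count))
		__mmdrop(mm);
}

since refcount_dec_and_test() already provides release ordering plus
acquire ordering when the count hits zero.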


^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH 7/8] membarrier: Remove arm (32) support for SYNC_CORE
  2021-06-16 16:27                       ` Russell King (Oracle)
  (?)
@ 2021-06-17  8:55                         ` Krzysztof Hałasa
  -1 siblings, 0 replies; 165+ messages in thread
From: Krzysztof Hałasa @ 2021-06-17  8:55 UTC (permalink / raw)
  To: Russell King (Oracle)
  Cc: Catalin Marinas, Linus Walleij, Neil Armstrong, Peter Zijlstra,
	Andy Lutomirski, x86, Dave Hansen, LKML, linux-mm, Andrew Morton,
	Mathieu Desnoyers, Nicholas Piggin, linux-arm-kernel,
	Will Deacon

"Russell King (Oracle)" <linux@armlinux.org.uk> writes:

> So it seems to come down to a question about CNS3xxx and OXNAS. If
> these aren't being used, maybe we can drop ARM11MPCore support and
> the associated platforms?

Well, it appears we haven't updated software on our Gateworks Lagunas
(CNS3xxx dual core) for 4 years. This is old stuff, pre-DTB and all. We
have replacement setups (i.MX6 + mPCIe to mPCI bridge) which we don't
use either (due to lack of interest in mPCI - the old parallel, not the
express).

I don't have a problem with the CNS3xxx being dropped. In fact, we don't
use anything (ARM) older than v7 here.

Chris.

-- 
Krzysztof Hałasa

Sieć Badawcza Łukasiewicz
Przemysłowy Instytut Automatyki i Pomiarów PIAP
Al. Jerozolimskie 202, 02-486 Warszawa

^ permalink raw reply	[flat|nested] 165+ messages in thread

* [RFC][PATCH] sched: Use lightweight hazard pointers to grab lazy mms
  2021-06-17  5:32             ` Andy Lutomirski
  2021-06-17  6:51               ` Nicholas Piggin
@ 2021-06-17  9:08               ` Peter Zijlstra
  2021-06-17  9:10                 ` Peter Zijlstra
                                   ` (4 more replies)
  2021-06-17 15:02               ` [PATCH 4/8] membarrier: Make the post-switch-mm barrier explicit Paul E. McKenney
  2 siblings, 5 replies; 165+ messages in thread
From: Peter Zijlstra @ 2021-06-17  9:08 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Nicholas Piggin, Rik van Riel, Andrew Morton, Dave Hansen,
	Linux Kernel Mailing List, linux-mm, Mathieu Desnoyers,
	the arch/x86 maintainers, Paul E. McKenney

On Wed, Jun 16, 2021 at 10:32:15PM -0700, Andy Lutomirski wrote:
> Here it is.  Not even boot tested!

It is now, it even builds a kernel.. so it must be perfect :-)

> https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=sched/lazymm&id=ecc3992c36cb88087df9c537e2326efb51c95e31

Since I had to turn it into a patch to post, so that I could comment on
it, I've cleaned it up a little for you.

I'll reply to self with some notes, but I think I like it.

---
 arch/x86/include/asm/mmu.h |   5 ++
 include/linux/sched/mm.h   |   3 +
 kernel/fork.c              |   2 +
 kernel/sched/core.c        | 138 ++++++++++++++++++++++++++++++++++++---------
 kernel/sched/sched.h       |  10 +++-
 5 files changed, 130 insertions(+), 28 deletions(-)

diff --git a/arch/x86/include/asm/mmu.h b/arch/x86/include/asm/mmu.h
index 5d7494631ea9..ce94162168c2 100644
--- a/arch/x86/include/asm/mmu.h
+++ b/arch/x86/include/asm/mmu.h
@@ -66,4 +66,9 @@ typedef struct {
 void leave_mm(int cpu);
 #define leave_mm leave_mm
 
+/* On x86, mm_cpumask(mm) contains all CPUs that might be lazily using mm */
+#define for_each_possible_lazymm_cpu(cpu, mm) \
+	for_each_cpu((cpu), mm_cpumask((mm)))
+
+
 #endif /* _ASM_X86_MMU_H */
diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index e24b1fe348e3..5c7eafee6fea 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -77,6 +77,9 @@ static inline bool mmget_not_zero(struct mm_struct *mm)
 
 /* mmput gets rid of the mappings and all user-space */
 extern void mmput(struct mm_struct *);
+
+extern void mm_unlazy_mm_count(struct mm_struct *mm);
+
 #ifdef CONFIG_MMU
 /* same as above but performs the slow path from the async context. Can
  * be called from the atomic context as well
diff --git a/kernel/fork.c b/kernel/fork.c
index e595e77913eb..57415cca088c 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1104,6 +1104,8 @@ static inline void __mmput(struct mm_struct *mm)
 	}
 	if (mm->binfmt)
 		module_put(mm->binfmt->module);
+
+	mm_unlazy_mm_count(mm);
 	mmdrop(mm);
 }
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 8ac693d542f6..e102ec53c2f6 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -19,6 +19,7 @@
 
 #include <asm/switch_to.h>
 #include <asm/tlb.h>
+#include <asm/mmu.h>
 
 #include "../workqueue_internal.h"
 #include "../../fs/io-wq.h"
@@ -4501,6 +4502,81 @@ prepare_task_switch(struct rq *rq, struct task_struct *prev,
 	prepare_arch_switch(next);
 }
 
+static void mmdrop_lazy(struct rq *rq)
+{
+	struct mm_struct *old_mm;
+
+	if (likely(!READ_ONCE(rq->drop_mm)))
+		return;
+
+	/*
+	 * Slow path.  This only happens when we recently stopped using
+	 * an mm that is exiting.
+	 */
+	old_mm = xchg(&rq->drop_mm, NULL);
+	if (old_mm)
+		mmdrop(old_mm);
+}
+
+#ifndef for_each_possible_lazymm_cpu
+#define for_each_possible_lazymm_cpu(cpu, mm) for_each_online_cpu((cpu))
+#endif
+
+/*
+ * This converts all lazy_mm references to mm into mm_count refcounts.  Our
+ * caller holds an mm_count reference, so we don't need to worry about mm
+ * being freed out from under us.
+ */
+void mm_unlazy_mm_count(struct mm_struct *mm)
+{
+	unsigned int drop_count = num_possible_cpus();
+	int cpu;
+
+	/*
+	 * mm_users is zero, so no cpu will set its rq->lazy_mm to mm.
+	 */
+	WARN_ON_ONCE(atomic_read(&mm->mm_users) != 0);
+
+	/* Grab enough references for the rest of this function. */
+	atomic_add(drop_count, &mm->mm_count);
+
+	for_each_possible_lazymm_cpu(cpu, mm) {
+		struct rq *rq = cpu_rq(cpu);
+		struct mm_struct *old_mm;
+
+		if (smp_load_acquire(&rq->lazy_mm) != mm)
+			continue;
+
+		drop_count--;	/* grab a reference; cpu will drop it later. */
+
+		old_mm = xchg(&rq->drop_mm, mm);
+
+		/*
+		 * We know that old_mm != mm: when we did the xchg(), we were
+		 * the only cpu to be putting mm into any drop_mm variable.
+		 */
+		WARN_ON_ONCE(old_mm == mm);
+		if (unlikely(old_mm)) {
+			/*
+			 * We just stole an mm reference from the target CPU.
+			 *
+			 * drop_mm was set to old by another call to
+			 * mm_unlazy_mm_count().  After that call xchg'd old
+			 * into drop_mm, the target CPU did:
+			 *
+			 *  smp_store_release(&rq->lazy_mm, mm);
+			 *
+			 * which synchronized with our smp_load_acquire()
+			 * above, so we know that the target CPU is done with
+			 * old. Drop old on its behalf.
+			 */
+			mmdrop(old_mm);
+		}
+	}
+
+	atomic_sub(drop_count, &mm->mm_count);
+}
+
 /**
  * finish_task_switch - clean up after a task-switch
  * @prev: the thread we just switched away from.
@@ -4524,7 +4600,6 @@ static struct rq *finish_task_switch(struct task_struct *prev)
 	__releases(rq->lock)
 {
 	struct rq *rq = this_rq();
-	struct mm_struct *mm = rq->prev_mm;
 	long prev_state;
 
 	/*
@@ -4543,8 +4618,6 @@ static struct rq *finish_task_switch(struct task_struct *prev)
 		      current->comm, current->pid, preempt_count()))
 		preempt_count_set(FORK_PREEMPT_COUNT);
 
-	rq->prev_mm = NULL;
-
 	/*
 	 * A task struct has one reference for the use as "current".
 	 * If a task dies, then it sets TASK_DEAD in tsk->state and calls
@@ -4574,22 +4647,16 @@ static struct rq *finish_task_switch(struct task_struct *prev)
 	kmap_local_sched_in();
 
 	fire_sched_in_preempt_notifiers(current);
+
 	/*
-	 * When switching through a kernel thread, the loop in
-	 * membarrier_{private,global}_expedited() may have observed that
-	 * kernel thread and not issued an IPI. It is therefore possible to
-	 * schedule between user->kernel->user threads without passing though
-	 * switch_mm(). Membarrier requires a barrier after storing to
-	 * rq->curr, before returning to userspace, so provide them here:
-	 *
-	 * - a full memory barrier for {PRIVATE,GLOBAL}_EXPEDITED, implicitly
-	 *   provided by mmdrop(),
-	 * - a sync_core for SYNC_CORE.
+	 * Do this unconditionally.  There's a race in which a remote CPU
+	 * sees rq->lazy_mm != NULL and gives us an extra mm ref while we
+	 * are executing this code and we don't notice.  Instead of letting
+	 * that ref sit around until the next time we unlazy, do it on every
+	 * context switch.
 	 */
-	if (mm) {
-		membarrier_mm_sync_core_before_usermode(mm);
-		mmdrop(mm);
-	}
+	mmdrop_lazy(rq);
+
 	if (unlikely(prev_state == TASK_DEAD)) {
 		if (prev->sched_class->task_dead)
 			prev->sched_class->task_dead(prev);
@@ -4652,25 +4719,32 @@ context_switch(struct rq *rq, struct task_struct *prev,
 
 	/*
 	 * kernel -> kernel   lazy + transfer active
-	 *   user -> kernel   lazy + mmgrab() active
+	 *   user -> kernel   lazy + lazy_mm grab active
 	 *
-	 * kernel ->   user   switch + mmdrop() active
+	 * kernel ->   user   switch + lazy_mm release active
 	 *   user ->   user   switch
 	 */
 	if (!next->mm) {                                // to kernel
 		enter_lazy_tlb(prev->active_mm, next);
 
 		next->active_mm = prev->active_mm;
-		if (prev->mm)                           // from user
-			mmgrab(prev->active_mm);
-		else
+		if (prev->mm) {                         // from user
+			SCHED_WARN_ON(rq->lazy_mm);
+
+			/*
+			 * Acquire a lazy_mm reference to the active
+			 * (lazy) mm.  No explicit barrier needed: we still
+			 * hold an explicit (mm_users) reference.  __mmput()
+			 * can't be called until we call mmput() to drop
+			 * our reference, and __mmput() is a release barrier.
+			 */
+			WRITE_ONCE(rq->lazy_mm, next->active_mm);
+		} else {
 			prev->active_mm = NULL;
+		}
 	} else {                                        // to user
 		membarrier_switch_mm(rq, prev->active_mm, next->mm);
 		/*
-		 * sys_membarrier() requires an smp_mb() between setting
-		 * rq->curr / membarrier_switch_mm() and returning to userspace.
-		 *
 		 * The below provides this either through switch_mm(), or in
 		 * case 'prev->active_mm == next->mm' through
 		 * finish_task_switch()'s mmdrop().
@@ -4678,9 +4752,19 @@ context_switch(struct rq *rq, struct task_struct *prev,
 		switch_mm_irqs_off(prev->active_mm, next->mm, next);
 
 		if (!prev->mm) {                        // from kernel
-			/* will mmdrop() in finish_task_switch(). */
-			rq->prev_mm = prev->active_mm;
+			/*
+			 * Even though nothing should reference ->active_mm
+			 * for a non-current task, don't leave a stale pointer
+			 * to an mm that might be freed.
+			 */
 			prev->active_mm = NULL;
+
+			/*
+			 * Drop our lazy_mm reference to the old lazy mm.
+			 * After this, any CPU may free it if it is
+			 * unreferenced.
+			 */
+			smp_store_release(&rq->lazy_mm, NULL);
 		}
 	}
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 8f0194cee0ba..703d95a4abd0 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -966,7 +966,15 @@ struct rq {
 	struct task_struct	*idle;
 	struct task_struct	*stop;
 	unsigned long		next_balance;
-	struct mm_struct	*prev_mm;
+
+	/*
+	 * Fast refcounting scheme for lazy mm.  lazy_mm is a hazard pointer:
+	 * setting it to point to a lazily used mm keeps that mm from being
+	 * freed.  drop_mm points to an mm that needs an mmdrop() call
+	 * after the CPU owning the rq is done with it.
+	 */
+	struct mm_struct	*lazy_mm;
+	struct mm_struct	*drop_mm;
 
 	unsigned int		clock_update_flags;
 	u64			clock;

^ permalink raw reply related	[flat|nested] 165+ messages in thread

* Re: [RFC][PATCH] sched: Use lightweight hazard pointers to grab lazy mms
  2021-06-17  9:08               ` [RFC][PATCH] sched: Use lightweight hazard pointers to grab lazy mms Peter Zijlstra
@ 2021-06-17  9:10                 ` Peter Zijlstra
  2021-06-17 10:00                   ` Nicholas Piggin
  2021-06-17  9:13                 ` Peter Zijlstra
                                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 165+ messages in thread
From: Peter Zijlstra @ 2021-06-17  9:10 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Nicholas Piggin, Rik van Riel, Andrew Morton, Dave Hansen,
	Linux Kernel Mailing List, linux-mm, Mathieu Desnoyers,
	the arch/x86 maintainers, Paul E. McKenney

On Thu, Jun 17, 2021 at 11:08:03AM +0200, Peter Zijlstra wrote:
> On Wed, Jun 16, 2021 at 10:32:15PM -0700, Andy Lutomirski wrote:

> --- a/arch/x86/include/asm/mmu.h
> +++ b/arch/x86/include/asm/mmu.h
> @@ -66,4 +66,9 @@ typedef struct {
>  void leave_mm(int cpu);
>  #define leave_mm leave_mm
>  
> +/* On x86, mm_cpumask(mm) contains all CPUs that might be lazily using mm */
> +#define for_each_possible_lazymm_cpu(cpu, mm) \
> +	for_each_cpu((cpu), mm_cpumask((mm)))
> +
> +
>  #endif /* _ASM_X86_MMU_H */

> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 8ac693d542f6..e102ec53c2f6 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -19,6 +19,7 @@
>  

> +
> +#ifndef for_each_possible_lazymm_cpu
> +#define for_each_possible_lazymm_cpu(cpu, mm) for_each_online_cpu((cpu))
> +#endif
> +

Why can't the x86 implementation be the default? IIRC the problem with
mm_cpumask() is that (some) architectures don't clear bits, but IIRC
they all should be setting bits, or were there archs that didn't even do
that?


^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [RFC][PATCH] sched: Use lightweight hazard pointers to grab lazy mms
  2021-06-17  9:08               ` [RFC][PATCH] sched: Use lightweight hazard pointers to grab lazy mms Peter Zijlstra
  2021-06-17  9:10                 ` Peter Zijlstra
@ 2021-06-17  9:13                 ` Peter Zijlstra
  2021-06-17 14:06                   ` Andy Lutomirski
  2021-06-17  9:28                 ` Peter Zijlstra
                                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 165+ messages in thread
From: Peter Zijlstra @ 2021-06-17  9:13 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Nicholas Piggin, Rik van Riel, Andrew Morton, Dave Hansen,
	Linux Kernel Mailing List, linux-mm, Mathieu Desnoyers,
	the arch/x86 maintainers, Paul E. McKenney

On Thu, Jun 17, 2021 at 11:08:03AM +0200, Peter Zijlstra wrote:
> +static void mmdrop_lazy(struct rq *rq)
> +{
> +	struct mm_struct *old_mm;
> +
> +	if (likely(!READ_ONCE(rq->drop_mm)))
> +		return;
> +
> +	/*
> +	 * Slow path.  This only happens when we recently stopped using
> +	 * an mm that is exiting.
> +	 */
> +	old_mm = xchg(&rq->drop_mm, NULL);
> +	if (old_mm)
> +		mmdrop(old_mm);
> +}

AFAICT if we observe a !NULL value on the load, the xchg() *MUST* also
see !NULL (although it might see a different !NULL value). So do we want
to write it something like so instead?

static void mmdrop_lazy(struct rq *rq)
{
	struct mm_struct *old_mm;

	if (likely(!READ_ONCE(rq->drop_mm)))
		return;

	/*
	 * Slow path.  This only happens when we recently stopped using
	 * an mm that is exiting.
	 */
	old_mm = xchg(&rq->drop_mm, NULL);
	if (WARN_ON_ONCE(!old_mm))
		return;

	mmdrop(old_mm);
}

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [RFC][PATCH] sched: Use lightweight hazard pointers to grab lazy mms
  2021-06-17  9:08               ` [RFC][PATCH] sched: Use lightweight hazard pointers to grab lazy mms Peter Zijlstra
  2021-06-17  9:10                 ` Peter Zijlstra
  2021-06-17  9:13                 ` Peter Zijlstra
@ 2021-06-17  9:28                 ` Peter Zijlstra
  2021-06-17 14:03                   ` Andy Lutomirski
  2021-06-17 14:10                 ` Andy Lutomirski
  2021-06-18  3:29                 ` Paul E. McKenney
  4 siblings, 1 reply; 165+ messages in thread
From: Peter Zijlstra @ 2021-06-17  9:28 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Nicholas Piggin, Rik van Riel, Andrew Morton, Dave Hansen,
	Linux Kernel Mailing List, linux-mm, Mathieu Desnoyers,
	the arch/x86 maintainers, Paul E. McKenney

On Thu, Jun 17, 2021 at 11:08:03AM +0200, Peter Zijlstra wrote:

> diff --git a/kernel/fork.c b/kernel/fork.c
> index e595e77913eb..57415cca088c 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1104,6 +1104,8 @@ static inline void __mmput(struct mm_struct *mm)
>  	}
>  	if (mm->binfmt)
>  		module_put(mm->binfmt->module);
> +
> +	mm_unlazy_mm_count(mm);
>  	mmdrop(mm);
>  }
>  
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 8ac693d542f6..e102ec53c2f6 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -19,6 +19,7 @@

> +/*
> + * This converts all lazy_mm references to mm to mm_count refcounts.  Our
> + * caller holds an mm_count reference, so we don't need to worry about mm
> + * being freed out from under us.
> + */
> +void mm_unlazy_mm_count(struct mm_struct *mm)
> +{
> +	unsigned int drop_count = num_possible_cpus();
> +	int cpu;
> +
> +	/*
> +	 * mm_users is zero, so no cpu will set its rq->lazy_mm to mm.
> +	 */
> +	WARN_ON_ONCE(atomic_read(&mm->mm_users) != 0);
> +
> +	/* Grab enough references for the rest of this function. */
> +	atomic_add(drop_count, &mm->mm_count);

So that had me puzzled for a little while. Would something like this be
a better comment?

	/*
	 * Because this can race with mmdrop_lazy(), mm_count must be
	 * incremented before setting any rq->drop_mm value, otherwise
	 * it is possible to free mm early.
	 */

> +
> +	for_each_possible_lazymm_cpu(cpu, mm) {
> +		struct rq *rq = cpu_rq(cpu);
> +		struct mm_struct *old_mm;
> +
> +		if (smp_load_acquire(&rq->lazy_mm) != mm)
> +			continue;
> +
> +		drop_count--;	/* grab a reference; cpu will drop it later. */

Totally confusing comment that :-)

> +

And with that, we rely on xchg() here to be at least RELEASE, such
that that mm_count increment must be visible when drop_mm is seen.

> +		old_mm = xchg(&rq->drop_mm, mm);

Similarly, we rely on it being at least ACQUIRE for the !NULL return
case I think.

> +
> +		/*
> +		 * We know that old_mm != mm: when we did the xchg(), we were
> +		 * the only cpu to be putting mm into any drop_mm variable.
> +		 */
> +		WARN_ON_ONCE(old_mm == mm);
> +		if (unlikely(old_mm)) {
> +			/*
> +			 * We just stole an mm reference from the target CPU.
> +			 *
> +			 * drop_mm was set to old by another call to
> +			 * mm_unlazy_mm_count().  After that call xchg'd old
> +			 * into drop_mm, the target CPU did:
> +			 *
> +			 *  smp_store_release(&rq->lazy_mm, mm);
> +			 *
> +			 * which synchronized with our smp_load_acquire()
> +			 * above, so we know that the target CPU is done with
> +			 * old. Drop old on its behalf.
> +			 */
> +			mmdrop(old_mm);
> +		}
> +	}
> +
> +	atomic_sub(drop_count, &mm->mm_count);
> +}




^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [RFC][PATCH] sched: Use lightweight hazard pointers to grab lazy mms
  2021-06-17  9:10                 ` Peter Zijlstra
@ 2021-06-17 10:00                   ` Nicholas Piggin
  0 siblings, 0 replies; 165+ messages in thread
From: Nicholas Piggin @ 2021-06-17 10:00 UTC (permalink / raw)
  To: Andy Lutomirski, Peter Zijlstra
  Cc: Andrew Morton, Dave Hansen, Linux Kernel Mailing List, linux-mm,
	Mathieu Desnoyers, Paul E. McKenney, Rik van Riel,
	the arch/x86 maintainers

Excerpts from Peter Zijlstra's message of June 17, 2021 7:10 pm:
> On Thu, Jun 17, 2021 at 11:08:03AM +0200, Peter Zijlstra wrote:
>> On Wed, Jun 16, 2021 at 10:32:15PM -0700, Andy Lutomirski wrote:
> 
>> --- a/arch/x86/include/asm/mmu.h
>> +++ b/arch/x86/include/asm/mmu.h
>> @@ -66,4 +66,9 @@ typedef struct {
>>  void leave_mm(int cpu);
>>  #define leave_mm leave_mm
>>  
>> +/* On x86, mm_cpumask(mm) contains all CPUs that might be lazily using mm */
>> +#define for_each_possible_lazymm_cpu(cpu, mm) \
>> +	for_each_cpu((cpu), mm_cpumask((mm)))
>> +
>> +
>>  #endif /* _ASM_X86_MMU_H */
> 
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 8ac693d542f6..e102ec53c2f6 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -19,6 +19,7 @@
>>  
> 
>> +
>> +#ifndef for_each_possible_lazymm_cpu
>> +#define for_each_possible_lazymm_cpu(cpu, mm) for_each_online_cpu((cpu))
>> +#endif
>> +
> 
> Why can't the x86 implementation be the default? IIRC the problem with
> mm_cpumask() is that (some) architectures don't clear bits, but IIRC
> they all should be setting bits, or were there archs that didn't even do
> that?

There are. alpha, arm64, hexagon (of the SMP supporting ones), AFAICT.

I have a patch for alpha though (it's 2 lines :))

Thanks,
Nick

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH 7/8] membarrier: Remove arm (32) support for SYNC_CORE
  2021-06-16  3:21   ` Andy Lutomirski
@ 2021-06-17 10:40     ` Mark Rutland
  -1 siblings, 0 replies; 165+ messages in thread
From: Mark Rutland @ 2021-06-17 10:40 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: x86, Dave Hansen, LKML, linux-mm, Andrew Morton,
	Mathieu Desnoyers, Nicholas Piggin, Peter Zijlstra, Russell King,
	linux-arm-kernel

On Tue, Jun 15, 2021 at 08:21:12PM -0700, Andy Lutomirski wrote:
> On arm32, the only way to safely flush icache from usermode is to call
> cacheflush(2).  This also handles any required pipeline flushes, so
> membarrier's SYNC_CORE feature is useless on arm.  Remove it.

Unfortunately, it's a bit more complicated than that, and these days
SYNC_CORE is equally necessary on arm as on arm64. This is something
that changed in the architecture over time, but since ARMv7 we generally
need both the cache maintenance *and* a context synchronization event
(the latter must occur on the CPU which will execute the instructions).

If you look at the latest ARMv7-AR manual (ARM DDI 406C.d), section
A3.5.4 "Concurrent modification and execution of instructions" covers
this. That manual can be found at:

	https://developer.arm.com/documentation/ddi0406/latest/

Likewise for ARMv8-A; the latest manual (ARM DDI 0487G.a) covers this in
sections B2.2.5 and E2.3.5. That manual can be found at:

	https://developer.arm.com/documentation/ddi0487/ga

I am not sure exactly what's required for 11MPCore, since that's
somewhat of a special case, being the only SMP design predating
ARMv7-A's mandate of broadcast maintenance.

For intuition's sake, one reason for this is that once a CPU has fetched
an instruction from an instruction cache into its pipeline and that
instruction is "in-flight", changes to that instruction cache are not
guaranteed to affect the "in-flight" copy (which e.g. could be
decomposed into micro-ops and so on). While these parts of a CPU aren't
necessarily designed as caches, they effectively transiently cache a
stale copy of the instruction while it is being executed.

This is more pronounced on newer designs with more complex execution
pipelines (e.g. with bigger windows for out-of-order execution and
speculation), and generally it's unlikely for this to be noticed on
smaller/simpler designs.

As above, modifying instructions requires two things:

1) Making sure that *subsequent* instruction fetches will see the new
   instructions. This is what cacheflush(2) does, and this is similar to
   what SW does on arm64 with DC CVAU + IC IVAU instructions and
   associated memory barriers.

2) Making sure that a CPU fetches the instructions *after* the cache
   maintenance is complete. There are a few ways to do this:

   * A context synchronization event (e.g. an ISB or exception return)
     on the CPU that will execute the instructions. This is what
     membarrier(SYNC_CORE) does.

   * In ARMv8-A there are some restrictions on the order in which
     modified instructions are guaranteed to be observed (e.g. if you
     publish a function, then subsequently install a branch to that new
     function), where an ISB may not be necessary. In the latest ARMv8-A
     manual as linked above, those are described in sections:

     - B2.3.8 "Ordering of instruction fetches" (for 64-bit)
     - E2.3.8 "Ordering of instruction fetches" (for 32-bit)

   * Where we can guarantee that a CPU cannot possibly have an
     instruction in-flight (e.g. due to a lack of a mapping to fetch
     instructions from), nothing is necessary. This is what we rely on
     when faulting in code pages. In these cases, the CPU is liable to
     take fault on the missing translation anyway.

Thanks,
Mark.

> 
> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> Cc: Nicholas Piggin <npiggin@gmail.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Russell King <linux@armlinux.org.uk>
> Cc: linux-arm-kernel@lists.infradead.org
> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> ---
>  arch/arm/Kconfig | 1 -
>  1 file changed, 1 deletion(-)
> 
> diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
> index 24804f11302d..89a885fba724 100644
> --- a/arch/arm/Kconfig
> +++ b/arch/arm/Kconfig
> @@ -10,7 +10,6 @@ config ARM
>  	select ARCH_HAS_FORTIFY_SOURCE
>  	select ARCH_HAS_KEEPINITRD
>  	select ARCH_HAS_KCOV
> -	select ARCH_HAS_MEMBARRIER_SYNC_CORE
>  	select ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE
>  	select ARCH_HAS_PTE_SPECIAL if ARM_LPAE
>  	select ARCH_HAS_PHYS_TO_DMA
> -- 
> 2.31.1
> 

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH 7/8] membarrier: Remove arm (32) support for SYNC_CORE
  2021-06-17 10:40     ` Mark Rutland
@ 2021-06-17 11:23       ` Russell King (Oracle)
  -1 siblings, 0 replies; 165+ messages in thread
From: Russell King (Oracle) @ 2021-06-17 11:23 UTC (permalink / raw)
  To: Mark Rutland
  Cc: Andy Lutomirski, x86, Dave Hansen, LKML, linux-mm, Andrew Morton,
	Mathieu Desnoyers, Nicholas Piggin, Peter Zijlstra,
	linux-arm-kernel

On Thu, Jun 17, 2021 at 11:40:46AM +0100, Mark Rutland wrote:
> On Tue, Jun 15, 2021 at 08:21:12PM -0700, Andy Lutomirski wrote:
> > On arm32, the only way to safely flush icache from usermode is to call
> > cacheflush(2).  This also handles any required pipeline flushes, so
> > membarrier's SYNC_CORE feature is useless on arm.  Remove it.
> 
> Unfortunately, it's a bit more complicated than that, and these days
> SYNC_CORE is equally necessary on arm as on arm64. This is something
> that changed in the architecture over time, but since ARMv7 we generally
> need both the cache maintenance *and* a context synchronization event
> (the latter must occur on the CPU which will execute the instructions).
> 
> If you look at the latest ARMv7-AR manual (ARM DDI 406C.d), section
> A3.5.4 "Concurrent modification and execution of instructions" covers
> this. That manual can be found at:
> 
> 	https://developer.arm.com/documentation/ddi0406/latest/

Looking at that, sys_cacheflush() meets this. The manual details a
series of cache maintenance calls in "step 1" that the modifying thread
must issue - this is exactly what sys_cacheflush() does. The same is
true for ARMv6, except the "ISB" terminology is replaced by a
"PrefetchFlush" terminology. (I checked DDI0100I).

"step 2" requires an ISB on the "other CPU" prior to executing that
code. As I understand it, in ARMv7, userspace can issue an ISB itself.

For ARMv6K, it doesn't have ISB, but instead has a CP15 instruction
for this that isn't available to userspace. This is where we come to
the situation about ARM 11MPCore, and whether we continue to support
it or not.

So, I think we're completely fine with ARMv7 under 32-bit ARM kernels
as userspace has everything that's required. ARMv6K is a different
matter as we've already identified for several reasons.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH 7/8] membarrier: Remove arm (32) support for SYNC_CORE
  2021-06-17 11:23       ` Russell King (Oracle)
@ 2021-06-17 11:33         ` Mark Rutland
  -1 siblings, 0 replies; 165+ messages in thread
From: Mark Rutland @ 2021-06-17 11:33 UTC (permalink / raw)
  To: Russell King (Oracle)
  Cc: Andy Lutomirski, x86, Dave Hansen, LKML, linux-mm, Andrew Morton,
	Mathieu Desnoyers, Nicholas Piggin, Peter Zijlstra,
	linux-arm-kernel

On Thu, Jun 17, 2021 at 12:23:05PM +0100, Russell King (Oracle) wrote:
> On Thu, Jun 17, 2021 at 11:40:46AM +0100, Mark Rutland wrote:
> > On Tue, Jun 15, 2021 at 08:21:12PM -0700, Andy Lutomirski wrote:
> > > On arm32, the only way to safely flush icache from usermode is to call
> > > cacheflush(2).  This also handles any required pipeline flushes, so
> > > membarrier's SYNC_CORE feature is useless on arm.  Remove it.
> > 
> > Unfortunately, it's a bit more complicated than that, and these days
> > SYNC_CORE is equally necessary on arm as on arm64. This is something
> > that changed in the architecture over time, but since ARMv7 we generally
> > need both the cache maintenance *and* a context synchronization event
> > (the latter must occur on the CPU which will execute the instructions).
> > 
> > If you look at the latest ARMv7-AR manual (ARM DDI 406C.d), section
> > A3.5.4 "Concurrent modification and execution of instructions" covers
> > this. That manual can be found at:
> > 
> > 	https://developer.arm.com/documentation/ddi0406/latest/
> 
> Looking at that, sys_cacheflush() meets this. The manual details a
> series of cache maintenance calls in "step 1" that the modifying thread
> must issue - this is exactly what sys_cacheflush() does. The same is
> true for ARMv6, except the "ISB" terminology is replaced by a
> "PrefetchFlush" terminology. (I checked DDI0100I).
> 
> "step 2" requires an ISB on the "other CPU" prior to executing that
> code. As I understand it, in ARMv7, userspace can issue an ISB itself.
> 
> For ARMv6K, it doesn't have ISB, but instead has a CP15 instruction
> for this that isn't available to userspace. This is where we come to
> the situation about ARM 11MPCore, and whether we continue to support
> it or not.
> 
> So, I think we're completely fine with ARMv7 under 32-bit ARM kernels
> as userspace has everything that's required. ARMv6K is a different
> matter as we've already identified for several reasons.

Sure, and I agree we should not change cacheflush().

The point of membarrier(SYNC_CORE) is that you can move the cost of that
ISB out of the fast-path in the executing thread(s) and into the
slow-path on the thread which generated the code.

So e.g. rather than an executing thread always having to do:

	LDR	<reg>, [<funcptr>]
	ISB	// in case funcptr was just updated
	BLR	<reg>

... you have the thread generating the code use membarrier(SYNC_CORE)
prior to publishing the funcptr, and the fast-path on all the executing
threads can be:

	LDR	<reg> [<funcptr>]
	BLR	<reg>

... and thus I think we still want membarrier(SYNC_CORE) so that people
can do this, even if there are other means to achieve the same
functionality.
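
As a rough userspace sketch of that split (illustrative only: the names
are made up, the process is assumed to have already registered with
MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_SYNC_CORE, and the cache
maintenance is shown via the compiler builtin rather than a raw
cacheflush(2) call):

	#include <linux/membarrier.h>
	#include <sys/syscall.h>
	#include <unistd.h>
	#include <string.h>
	#include <stdatomic.h>

	typedef void (*func_t)(void);
	static _Atomic(func_t) funcptr;

	/* Slow path: the thread generating the code. */
	static void publish_code(void *code_buf, const void *insns, size_t len)
	{
		memcpy(code_buf, insns, len);
		/* step 1: cache maintenance */
		__builtin___clear_cache((char *)code_buf, (char *)code_buf + len);
		/* step 2: context synchronization on all threads of the process */
		syscall(__NR_membarrier,
			MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE, 0);
		atomic_store_explicit(&funcptr, (func_t)code_buf,
				      memory_order_release);
	}

	/* Fast path: executing threads, no ISB. */
	static void call_published(void)
	{
		func_t f = atomic_load_explicit(&funcptr, memory_order_acquire);
		f();
	}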

Thanks,
Mark.

> 
> -- 
> RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
> FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH 7/8] membarrier: Remove arm (32) support for SYNC_CORE
  2021-06-17 11:33         ` Mark Rutland
@ 2021-06-17 13:41           ` Andy Lutomirski
  -1 siblings, 0 replies; 165+ messages in thread
From: Andy Lutomirski @ 2021-06-17 13:41 UTC (permalink / raw)
  To: Mark Rutland, Russell King (Oracle)
  Cc: the arch/x86 maintainers, Dave Hansen, Linux Kernel Mailing List,
	linux-mm, Andrew Morton, Mathieu Desnoyers, Nicholas Piggin,
	Peter Zijlstra (Intel),
	linux-arm-kernel



On Thu, Jun 17, 2021, at 4:33 AM, Mark Rutland wrote:
> On Thu, Jun 17, 2021 at 12:23:05PM +0100, Russell King (Oracle) wrote:
> > On Thu, Jun 17, 2021 at 11:40:46AM +0100, Mark Rutland wrote:
> > > On Tue, Jun 15, 2021 at 08:21:12PM -0700, Andy Lutomirski wrote:
> > > > On arm32, the only way to safely flush icache from usermode is to call
> > > > cacheflush(2).  This also handles any required pipeline flushes, so
> > > > membarrier's SYNC_CORE feature is useless on arm.  Remove it.
> > > 
> > > Unfortunately, it's a bit more complicated than that, and these days
> > > SYNC_CORE is equally necessary on arm as on arm64. This is something
> > > that changed in the architecture over time, but since ARMv7 we generally
> > > need both the cache maintenance *and* a context synchronization event
> > > (the latter must occur on the CPU which will execute the instructions).
> > > 
> > > If you look at the latest ARMv7-AR manual (ARM DDI 406C.d), section
> > > A3.5.4 "Concurrent modification and execution of instructions" covers
> > > this. That manual can be found at:
> > > 
> > > 	https://developer.arm.com/documentation/ddi0406/latest/
> > 
> > Looking at that, sys_cacheflush() meets this. The manual details a
> > series of cache maintenance calls in "step 1" that the modifying thread
> > must issue - this is exactly what sys_cacheflush() does. The same is
> > true for ARMv6, except the "ISB" terminology is replaced by a
> > "PrefetchFlush" terminology. (I checked DDI0100I).
> > 
> > "step 2" requires an ISB on the "other CPU" prior to executing that
> > code. As I understand it, in ARMv7, userspace can issue an ISB itself.
> > 
> > For ARMv6K, it doesn't have ISB, but instead has a CP15 instruction
> > for this that isn't available to userspace. This is where we come to
> > the situation about ARM 11MPCore, and whether we continue to support
> > it or not.
> > 
> > So, I think we're completely fine with ARMv7 under 32-bit ARM kernels
> > as userspace has everything that's required. ARMv6K is a different
> > matter as we've already identified for several reasons.
> 
> Sure, and I agree we should not change cacheflush().
> 
> The point of membarrier(SYNC_CORE) is that you can move the cost of that
> ISB out of the fast-path in the executing thread(s) and into the
> slow-path on the thread which generated the code.
> 
> So e.g. rather than an executing thread always having to do:
> 
> 	LDR	<reg>, [<funcptr>]
> 	ISB	// in case funcptr was just updated
> 	BLR	<reg>
> 
> ... you have the thread generating the code use membarrier(SYNC_CORE)
> prior to publishing the funcptr, and the fast-path on all the executing
> threads can be:
> 
> 	LDR	<reg> [<funcptr>]
> 	BLR	<reg>
> 
> ... and thus I think we still want membarrier(SYNC_CORE) so that people
> can do this, even if there are other means to achieve the same
> functionality.

I had the impression that sys_cacheflush() did that.  Am I wrong?

In any event, I’m even more convinced that no new SYNC_CORE arches should be added. We need a new API that just does the right thing. 

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH 7/8] membarrier: Remove arm (32) support for SYNC_CORE
  2021-06-17 13:41           ` Andy Lutomirski
@ 2021-06-17 13:51             ` Mark Rutland
  -1 siblings, 0 replies; 165+ messages in thread
From: Mark Rutland @ 2021-06-17 13:51 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Russell King (Oracle),
	the arch/x86 maintainers, Dave Hansen, Linux Kernel Mailing List,
	linux-mm, Andrew Morton, Mathieu Desnoyers, Nicholas Piggin,
	Peter Zijlstra (Intel),
	linux-arm-kernel

On Thu, Jun 17, 2021 at 06:41:41AM -0700, Andy Lutomirski wrote:
> 
> 
> On Thu, Jun 17, 2021, at 4:33 AM, Mark Rutland wrote:
> > On Thu, Jun 17, 2021 at 12:23:05PM +0100, Russell King (Oracle) wrote:
> > > On Thu, Jun 17, 2021 at 11:40:46AM +0100, Mark Rutland wrote:
> > > > On Tue, Jun 15, 2021 at 08:21:12PM -0700, Andy Lutomirski wrote:
> > > > > On arm32, the only way to safely flush icache from usermode is to call
> > > > > cacheflush(2).  This also handles any required pipeline flushes, so
> > > > > membarrier's SYNC_CORE feature is useless on arm.  Remove it.
> > > > 
> > > > Unfortunately, it's a bit more complicated than that, and these days
> > > > SYNC_CORE is equally necessary on arm as on arm64. This is something
> > > > that changed in the architecture over time, but since ARMv7 we generally
> > > > need both the cache maintenance *and* a context synchronization event
> > > > (the latter must occur on the CPU which will execute the instructions).
> > > > 
> > > > If you look at the latest ARMv7-AR manual (ARM DDI 406C.d), section
> > > > A3.5.4 "Concurrent modification and execution of instructions" covers
> > > > this. That manual can be found at:
> > > > 
> > > > 	https://developer.arm.com/documentation/ddi0406/latest/
> > > 
> > > Looking at that, sys_cacheflush() meets this. The manual details a
> > > series of cache maintenance calls in "step 1" that the modifying thread
> > > must issue - this is exactly what sys_cacheflush() does. The same is
> > > true for ARMv6, except the "ISB" terminology is replaced by a
> > > "PrefetchFlush" terminology. (I checked DDI0100I).
> > > 
> > > "step 2" requires an ISB on the "other CPU" prior to executing that
> > > code. As I understand it, in ARMv7, userspace can issue an ISB itself.
> > > 
> > > For ARMv6K, it doesn't have ISB, but instead has a CP15 instruction
> > > for this that isn't available to userspace. This is where we come to
> > > the situation about ARM 11MPCore, and whether we continue to support
> > > it or not.
> > > 
> > > So, I think we're completely fine with ARMv7 under 32-bit ARM kernels
> > > as userspace has everything that's required. ARMv6K is a different
> > > matter as we've already identified for several reasons.
> > 
> > Sure, and I agree we should not change cacheflush().
> > 
> > The point of membarrier(SYNC_CORE) is that you can move the cost of that
> > ISB out of the fast-path in the executing thread(s) and into the
> > slow-path on the thread which generated the code.
> > 
> > So e.g. rather than an executing thread always having to do:
> > 
> > 	LDR	<reg>, [<funcptr>]
> > 	ISB	// in case funcptr was just updated
> > 	BLR	<reg>
> > 
> > ... you have the thread generating the code use membarrier(SYNC_CORE)
> > prior to publishing the funcptr, and the fast-path on all the executing
> > threads can be:
> > 
> > 	LDR	<reg> [<funcptr>]
> > 	BLR	<reg>
> > 
> > ... and thus I think we still want membarrier(SYNC_CORE) so that people
> > can do this, even if there are other means to achieve the same
> > functionality.
> 
> I had the impression that sys_cacheflush() did that.  Am I wrong?

Currently sys_cacheflush() doesn't do this, and IIUC it has never done
remote context synchronization even for architectures that need that
(e.g. x86 requiring a serializing instruction).

> In any event, I’m even more convinced that no new SYNC_CORE arches
> should be added. We need a new API that just does the right thing. 

My intuition is the other way around, and that this is a generally
useful thing for architectures that require context synchronization.

It's not clear to me what "the right thing" would mean specifically, and
on architectures with userspace cache maintenance, JITs can usually do
the most optimal maintenance, and only need help for the context
synchronization.

Thanks,
Mark.

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH 7/8] membarrier: Remove arm (32) support for SYNC_CORE
  2021-06-17 13:51             ` Mark Rutland
@ 2021-06-17 14:00               ` Andy Lutomirski
  -1 siblings, 0 replies; 165+ messages in thread
From: Andy Lutomirski @ 2021-06-17 14:00 UTC (permalink / raw)
  To: Mark Rutland
  Cc: Russell King (Oracle),
	the arch/x86 maintainers, Dave Hansen, Linux Kernel Mailing List,
	linux-mm, Andrew Morton, Mathieu Desnoyers, Nicholas Piggin,
	Peter Zijlstra (Intel),
	linux-arm-kernel



On Thu, Jun 17, 2021, at 6:51 AM, Mark Rutland wrote:
> On Thu, Jun 17, 2021 at 06:41:41AM -0700, Andy Lutomirski wrote:

> > In any event, I’m even more convinced that no new SYNC_CORE arches
> > should be added. We need a new API that just does the right thing. 
> 
> My intuition is the other way around, and that this is a generally
> useful thing for architectures that require context synchronization.

Except that you can't use it in a generic way.  You have to know the specific rules for your arch.

> 
> It's not clear to me what "the right thing" would mean specifically, and
> on architectures with userspace cache maintenance JITs can usually do
> the most optimal maintenance, and only need help for the context
> synchronization.
> 

This I simply don't believe -- I doubt that any sane architecture really works like this.  I wrote an email about it to Intel that apparently generated internal discussion but no results.  Consider:

mmap(some shared library, some previously unmapped address);

this does no heavyweight synchronization, at least on x86.  There is no "serializing" instruction in the fast path, and it *works* despite anything the SDM may or may not say.

We can and, IMO, should develop a sane way for user programs to install instructions into VMAs, for security-conscious software to verify them (by splitting the read and write sides?), and for their consumers to execute them, without knowing any arch details.  And I think this can be done with no IPIs except for possible TLB flushing when needed, at least on most architectures.  It would require a nontrivial amount of design work, and it would not resemble sys_cacheflush() or SYNC_CORE.
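
(For the read/write split specifically, think of something along the lines
of today's dual-mapping trick -- sketched below purely to illustrate the
idea, not as the API I'm describing; the helper name is made up:)

	#define _GNU_SOURCE
	#include <sys/mman.h>
	#include <unistd.h>

	/* One writable view for the generator/verifier, one executable view
	 * for the consumers; neither mapping is ever writable+executable. */
	static int make_dual_mapping(size_t len, void **wview, void **xview)
	{
		int fd = memfd_create("jit", MFD_CLOEXEC);

		if (fd < 0)
			return -1;
		if (ftruncate(fd, len)) {
			close(fd);
			return -1;
		}
		*wview = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
		*xview = mmap(NULL, len, PROT_READ | PROT_EXEC,  MAP_SHARED, fd, 0);
		close(fd);	/* the mappings keep the memory alive */
		return (*wview == MAP_FAILED || *xview == MAP_FAILED) ? -1 : 0;
	}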

--Andy

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [RFC][PATCH] sched: Use lightweight hazard pointers to grab lazy mms
  2021-06-17  9:28                 ` Peter Zijlstra
@ 2021-06-17 14:03                   ` Andy Lutomirski
  0 siblings, 0 replies; 165+ messages in thread
From: Andy Lutomirski @ 2021-06-17 14:03 UTC (permalink / raw)
  To: Peter Zijlstra (Intel)
  Cc: Nicholas Piggin, Rik van Riel, Andrew Morton, Dave Hansen,
	Linux Kernel Mailing List, linux-mm, Mathieu Desnoyers,
	the arch/x86 maintainers, Paul E. McKenney



On Thu, Jun 17, 2021, at 2:28 AM, Peter Zijlstra wrote:
> On Thu, Jun 17, 2021 at 11:08:03AM +0200, Peter Zijlstra wrote:
> 
> > diff --git a/kernel/fork.c b/kernel/fork.c
> > index e595e77913eb..57415cca088c 100644
> > --- a/kernel/fork.c
> > +++ b/kernel/fork.c
> > @@ -1104,6 +1104,8 @@ static inline void __mmput(struct mm_struct *mm)
> >  	}
> >  	if (mm->binfmt)
> >  		module_put(mm->binfmt->module);
> > +
> > +	mm_unlazy_mm_count(mm);
> >  	mmdrop(mm);
> >  }
> >  
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 8ac693d542f6..e102ec53c2f6 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -19,6 +19,7 @@
> 
> > +/*
> > + * This converts all lazy_mm references to mm to mm_count refcounts.  Our
> > + * caller holds an mm_count reference, so we don't need to worry about mm
> > + * being freed out from under us.
> > + */
> > +void mm_unlazy_mm_count(struct mm_struct *mm)
> > +{
> > +	unsigned int drop_count = num_possible_cpus();
> > +	int cpu;
> > +
> > +	/*
> > +	 * mm_users is zero, so no cpu will set its rq->lazy_mm to mm.
> > +	 */
> > +	WARN_ON_ONCE(atomic_read(&mm->mm_users) != 0);
> > +
> > +	/* Grab enough references for the rest of this function. */
> > +	atomic_add(drop_count, &mm->mm_count);
> 
> So that had me puzzled for a little while. Would something like this be
> a better comment?
> 
> 	/*
> 	 * Because this can race with mmdrop_lazy(), mm_count must be
> 	 * incremented before setting any rq->drop_mm value, otherwise
> 	 * it is possible to free mm early.
> 	 */

Nope, because the caller already did it.  It's an optimization, but maybe it's a poorly done optimization -- I'd rather do two atomic ops than many.

How about:

drop_count = 0;

...

if (!drop_count) {
	/* Collect lots of references.  We'll drop the ones we don't use. */
	drop_count = num_possible_cpus();
	atomic_add(drop_count, &mm->mm_count);
}
drop_count--;


^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH 7/8] membarrier: Remove arm (32) support for SYNC_CORE
  2021-06-17 13:41           ` Andy Lutomirski
@ 2021-06-17 14:05             ` Peter Zijlstra
  -1 siblings, 0 replies; 165+ messages in thread
From: Peter Zijlstra @ 2021-06-17 14:05 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Mark Rutland, Russell King (Oracle),
	the arch/x86 maintainers, Dave Hansen, Linux Kernel Mailing List,
	linux-mm, Andrew Morton, Mathieu Desnoyers, Nicholas Piggin,
	linux-arm-kernel

On Thu, Jun 17, 2021 at 06:41:41AM -0700, Andy Lutomirski wrote:
> On Thu, Jun 17, 2021, at 4:33 AM, Mark Rutland wrote:

> > Sure, and I agree we should not change cacheflush().
> > 
> > The point of membarrier(SYNC_CORE) is that you can move the cost of that
> > ISB out of the fast-path in the executing thread(s) and into the
> > slow-path on the thread which generated the code.
> > 
> > So e.g. rather than an executing thread always having to do:
> > 
> > 	LDR	<reg>, [<funcptr>]
> > 	ISB	// in case funcptr was just updated
> > 	BLR	<reg>
> > 
> > ... you have the thread generating the code use membarrier(SYNC_CORE)
> > prior to publishing the funcptr, and the fast-path on all the executing
> > threads can be:
> > 
> > 	LDR	<reg> [<funcptr>]
> > 	BLR	<reg>
> > 
> > ... and thus I think we still want membarrier(SYNC_CORE) so that people
> > can do this, even if there are other means to achieve the same
> > functionality.
> 
> I had the impression that sys_cacheflush() did that.  Am I wrong?

Yes, sys_cacheflush() only does what it says on the tin (and only
correctly for hardware broadcast -- everything except 11mpcore).

It only invalidates the caches, but not the per CPU derived state like
prefetch buffers and micro-op buffers, and certainly not instructions
already in flight.

So anything OoO needs at the very least a complete pipeline stall
injected, but probably something stronger to make it flush the buffers.

> In any event, I’m even more convinced that no new SYNC_CORE arches
> should be added. We need a new API that just does the right thing. 

I really don't understand why you hate the thing so much; SYNC_CORE is a
means of injecting whatever instruction is required to flush all uarch
state related to instructions on all threads (not all CPUs) of a process
as efficiently as possible.

The alternative is sending signals to all threads (including the
non-running ones), which is known to scale very poorly indeed, or, as
Mark suggests above, having very expensive instructions unconditionally in
the instruction stream, which is also undesired.
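
To make the intended usage concrete, a userspace sketch of the pattern Mark
describes above (the function names, code buffer and release/acquire
publication are illustrative, not from this thread; only the membarrier
commands and __builtin___clear_cache() are existing interfaces):

	#include <linux/membarrier.h>
	#include <stdatomic.h>
	#include <stddef.h>
	#include <sys/syscall.h>
	#include <unistd.h>

	static _Atomic(void (*)(void)) funcptr;

	/* One-time setup in the process that will generate code. */
	static void jit_setup(void)
	{
		syscall(__NR_membarrier,
			MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_SYNC_CORE, 0, 0);
	}

	/* Slow path: the thread that just wrote new code into 'buf'. */
	static void publish_code(char *buf, size_t len, void (*entry)(void))
	{
		/* Cache maintenance for the modified range. */
		__builtin___clear_cache(buf, buf + len);
		/* Context synchronization on every thread of the process. */
		syscall(__NR_membarrier,
			MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE, 0, 0);
		atomic_store_explicit(&funcptr, entry, memory_order_release);
	}

	/* Fast path: executing threads need no ISB/serializing instruction. */
	static void call_published(void)
	{
		void (*fn)(void) = atomic_load_explicit(&funcptr, memory_order_acquire);

		if (fn)
			fn();
	}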


^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [RFC][PATCH] sched: Use lightweight hazard pointers to grab lazy mms
  2021-06-17  9:13                 ` Peter Zijlstra
@ 2021-06-17 14:06                   ` Andy Lutomirski
  0 siblings, 0 replies; 165+ messages in thread
From: Andy Lutomirski @ 2021-06-17 14:06 UTC (permalink / raw)
  To: Peter Zijlstra (Intel)
  Cc: Nicholas Piggin, Rik van Riel, Andrew Morton, Dave Hansen,
	Linux Kernel Mailing List, linux-mm, Mathieu Desnoyers,
	the arch/x86 maintainers, Paul E. McKenney



On Thu, Jun 17, 2021, at 2:13 AM, Peter Zijlstra wrote:
> On Thu, Jun 17, 2021 at 11:08:03AM +0200, Peter Zijlstra wrote:
> > +static void mmdrop_lazy(struct rq *rq)
> > +{
> > +	struct mm_struct *old_mm;
> > +
> > +	if (likely(!READ_ONCE(rq->drop_mm)))
> > +		return;
> > +
> > +	/*
> > +	 * Slow path.  This only happens when we recently stopped using
> > +	 * an mm that is exiting.
> > +	 */
> > +	old_mm = xchg(&rq->drop_mm, NULL);
> > +	if (old_mm)
> > +		mmdrop(old_mm);
> > +}
> 
> AFAICT if we observe a !NULL value on the load, the xchg() *MUST* also
> see !NULL (although it might see a different !NULL value). So do we want
> to write it something like so instead?

Like so?

> 
> static void mmdrop_lazy(struct rq *rq)
> {
> 	struct mm_struct *old_mm;
> 
> 	if (likely(!READ_ONCE(rq->drop_mm)))
> 		return;
> 
> 	/*
> 	 * Slow path.  This only happens when we recently stopped using
> 	 * an mm that is exiting.

* This xchg is the only thing that can change rq->drop_mm from non-NULL to NULL, and
* multiple mmdrop_lazy() calls can't run concurrently on the same CPU.

> 	 */
> 	old_mm = xchg(&rq->drop_mm, NULL);
> 	if (WARN_ON_ONCE(!old_mm))
> 		return;
> 
> 	mmdrop(old_mm);
> }
> 

--Andy

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [RFC][PATCH] sched: Use lightweight hazard pointers to grab lazy mms
  2021-06-17  9:08               ` [RFC][PATCH] sched: Use lightweight hazard pointers to grab lazy mms Peter Zijlstra
                                   ` (2 preceding siblings ...)
  2021-06-17  9:28                 ` Peter Zijlstra
@ 2021-06-17 14:10                 ` Andy Lutomirski
  2021-06-17 15:45                   ` Peter Zijlstra
  2021-06-18  3:29                 ` Paul E. McKenney
  4 siblings, 1 reply; 165+ messages in thread
From: Andy Lutomirski @ 2021-06-17 14:10 UTC (permalink / raw)
  To: Peter Zijlstra (Intel)
  Cc: Nicholas Piggin, Rik van Riel, Andrew Morton, Dave Hansen,
	Linux Kernel Mailing List, linux-mm, Mathieu Desnoyers,
	the arch/x86 maintainers, Paul E. McKenney

On Thu, Jun 17, 2021, at 2:08 AM, Peter Zijlstra wrote:
> On Wed, Jun 16, 2021 at 10:32:15PM -0700, Andy Lutomirski wrote:
> > Here it is.  Not even boot tested!
> 
> It is now, it even builds a kernel.. so it must be perfect :-)
> 
> > https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=sched/lazymm&id=ecc3992c36cb88087df9c537e2326efb51c95e31
> 
> Since I had to turn it into a patch to post, so that I could comment on
> it, I've cleaned it up a little for you.
> 
> I'll reply to self with some notes, but I think I like it.
> 
> ---
>  arch/x86/include/asm/mmu.h |   5 ++
>  include/linux/sched/mm.h   |   3 +
>  kernel/fork.c              |   2 +
>  kernel/sched/core.c        | 138 ++++++++++++++++++++++++++++++++++++---------
>  kernel/sched/sched.h       |  10 +++-
>  5 files changed, 130 insertions(+), 28 deletions(-)
> 
> diff --git a/arch/x86/include/asm/mmu.h b/arch/x86/include/asm/mmu.h
> index 5d7494631ea9..ce94162168c2 100644
> --- a/arch/x86/include/asm/mmu.h
> +++ b/arch/x86/include/asm/mmu.h
> @@ -66,4 +66,9 @@ typedef struct {
>  void leave_mm(int cpu);
>  #define leave_mm leave_mm
>  
> +/* On x86, mm_cpumask(mm) contains all CPUs that might be lazily using mm */
> +#define for_each_possible_lazymm_cpu(cpu, mm) \
> +	for_each_cpu((cpu), mm_cpumask((mm)))
> +
> +
>  #endif /* _ASM_X86_MMU_H */
> diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
> index e24b1fe348e3..5c7eafee6fea 100644
> --- a/include/linux/sched/mm.h
> +++ b/include/linux/sched/mm.h
> @@ -77,6 +77,9 @@ static inline bool mmget_not_zero(struct mm_struct *mm)
>  
>  /* mmput gets rid of the mappings and all user-space */
>  extern void mmput(struct mm_struct *);
> +
> +extern void mm_unlazy_mm_count(struct mm_struct *mm);

You didn't like mm_many_words_in_the_name_of_the_function()? :)

> -	if (mm) {
> -		membarrier_mm_sync_core_before_usermode(mm);
> -		mmdrop(mm);
> -	}

What happened here?

I think that my membarrier work should land before this patch.  Specifically, I want the scheduler to be in a state where nothing depends on the barrier-ness of mmdrop() so that we can change the mmdrop() calls to stop being barriers without our brains exploding trying to understand two different fancy synchronization schemes at the same time.

Other than that I like your cleanups.

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH 7/8] membarrier: Remove arm (32) support for SYNC_CORE
  2021-06-17 13:51             ` Mark Rutland
@ 2021-06-17 14:16               ` Mathieu Desnoyers
  -1 siblings, 0 replies; 165+ messages in thread
From: Mathieu Desnoyers @ 2021-06-17 14:16 UTC (permalink / raw)
  To: Mark Rutland
  Cc: Andy Lutomirski, Russell King (Oracle),
	the arch/x86 maintainers, Dave Hansen, Linux Kernel Mailing List,
	linux-mm, Andrew Morton, Nicholas Piggin, Peter Zijlstra (Intel),
	linux-arm-kernel

On 17-Jun-2021 02:51:33 PM, Mark Rutland wrote:
> On Thu, Jun 17, 2021 at 06:41:41AM -0700, Andy Lutomirski wrote:
> > 
> > 
> > On Thu, Jun 17, 2021, at 4:33 AM, Mark Rutland wrote:
> > > On Thu, Jun 17, 2021 at 12:23:05PM +0100, Russell King (Oracle) wrote:
> > > > On Thu, Jun 17, 2021 at 11:40:46AM +0100, Mark Rutland wrote:
> > > > > On Tue, Jun 15, 2021 at 08:21:12PM -0700, Andy Lutomirski wrote:
> > > > > > On arm32, the only way to safely flush icache from usermode is to call
> > > > > > cacheflush(2).  This also handles any required pipeline flushes, so
> > > > > > membarrier's SYNC_CORE feature is useless on arm.  Remove it.
> > > > > 
> > > > > Unfortunately, it's a bit more complicated than that, and these days
> > > > > SYNC_CORE is equally necessary on arm as on arm64. This is something
> > > > > that changed in the architecture over time, but since ARMv7 we generally
> > > > > need both the cache maintenance *and* a context synchronization event
> > > > > (the latter must occur on the CPU which will execute the instructions).
> > > > > 
> > > > > If you look at the latest ARMv7-AR manual (ARM DDI 406C.d), section
> > > > > A3.5.4 "Concurrent modification and execution of instructions" covers
> > > > > this. That manual can be found at:
> > > > > 
> > > > > 	https://developer.arm.com/documentation/ddi0406/latest/
> > > > 
> > > > Looking at that, sys_cacheflush() meets this. The manual details a
> > > > series of cache maintenance calls in "step 1" that the modifying thread
> > > > must issue - this is exactly what sys_cacheflush() does. The same is
> > > > true for ARMv6, except the "ISB" terminology is replaced by a
> > > > "PrefetchFlush" terminology. (I checked DDI0100I).
> > > > 
> > > > "step 2" requires an ISB on the "other CPU" prior to executing that
> > > > code. As I understand it, in ARMv7, userspace can issue an ISB itself.
> > > > 
> > > > For ARMv6K, it doesn't have ISB, but instead has a CP15 instruction
> > > > for this that isn't available to userspace. This is where we come to
> > > > the situation about ARM 11MPCore, and whether we continue to support
> > > > it or not.
> > > > 
> > > > So, I think we're completely fine with ARMv7 under 32-bit ARM kernels
> > > > as userspace has everything that's required. ARMv6K is a different
> > > > matter as we've already identified for several reasons.
> > > 
> > > Sure, and I agree we should not change cacheflush().
> > > 
> > > The point of membarrier(SYNC_CORE) is that you can move the cost of that
> > > ISB out of the fast-path in the executing thread(s) and into the
> > > slow-path on the thread which generated the code.
> > > 
> > > So e.g. rather than an executing thread always having to do:
> > > 
> > > 	LDR	<reg>, [<funcptr>]
> > > 	ISB	// in case funcptr was just updated
> > > 	BLR	<reg>
> > > 
> > > ... you have the thread generating the code use membarrier(SYNC_CORE)
> > > prior to publishing the funcptr, and the fast-path on all the executing
> > > threads can be:
> > > 
> > > 	LDR	<reg> [<funcptr>]
> > > 	BLR	<reg>
> > > 
> > > ... and thus I think we still want membarrier(SYNC_CORE) so that people
> > > can do this, even if there are other means to achieve the same
> > > functionality.
> > 
> > I had the impression that sys_cacheflush() did that.  Am I wrong?
> 
> Currently sys_cacheflush() doesn't do this, and IIUC it has never done
> remote context synchronization even for architectures that need that
> (e.g. x86 requiring a serializing instruction).
> 
> > In any event, I’m even more convinced that no new SYNC_CORE arches
> > should be added. We need a new API that just does the right thing. 
> 
> My intuition is the other way around, and that this is a generally
> useful thing for architectures that require context synchronization.
> 
> It's not clear to me what "the right thing" would mean specifically, and
> on architectures with userspace cache maintenance JITs can usually do
> the most optimal maintenance, and only need help for the context
> synchronization.

If I can attempt to summarize the current situation for ARMv7:

- In addition to the cache flushing on the core doing the code update,
  the architecture requires every core to perform a context synchronizing
  instruction before executing the updated code.

- sys_cacheflush() doesn't do this core sync on every core. It also takes a
  single address range as a parameter.

- ARM, ARM64, powerpc, powerpc64, x86, x86-64 all currently handle the
  context synchronization requirement for updating user-space code on
  SMP with sys_membarrier SYNC_CORE. It's not, however, meant to replace
  explicit cache flushing operations if those are needed.

So removing membarrier SYNC_CORE from ARM would be a step backward here.
On ARMv7, the SYNC_CORE is needed _in addition_ to sys_cacheflush.

Adding a sync-core operation at the end of sys_cacheflush would be
inefficient for common GC use-cases where a rather large set of address
ranges are invalidated in one go: for this, we either want the GC to:

- Invoke sys_cacheflush for each targeted range, and then issue a single
  sys_membarrier SYNC_CORE (see the sketch after this list), or

- Implement a new "sys_cacheflush_iov" which takes an iovec input. There
  I see that it could indeed invalidate all relevant cache lines *and*
  issue the SYNC_CORE at the end.
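
A sketch of the first option (the range array is illustrative; the usual
membarrier SYNC_CORE registration and the <sys/syscall.h>, <unistd.h> and
<linux/membarrier.h> includes are assumed, and __builtin___clear_cache()
stands in for the arch-specific per-range cache maintenance such as
sys_cacheflush on ARM):

	struct code_range { char *start; char *end; };

	static void gc_sync_ranges(const struct code_range *ranges, size_t nranges)
	{
		size_t i;

		for (i = 0; i < nranges; i++)
			__builtin___clear_cache(ranges[i].start, ranges[i].end);

		/* A single process-wide context synchronization for all ranges. */
		syscall(__NR_membarrier,
			MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE, 0, 0);
	}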

But shoehorning the SYNC_CORE in the pre-existing sys_cacheflush after
the fact seems like a bad idea.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH 7/8] membarrier: Remove arm (32) support for SYNC_CORE
  2021-06-17 14:00               ` Andy Lutomirski
@ 2021-06-17 14:20                 ` Mark Rutland
  -1 siblings, 0 replies; 165+ messages in thread
From: Mark Rutland @ 2021-06-17 14:20 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Russell King (Oracle),
	the arch/x86 maintainers, Dave Hansen, Linux Kernel Mailing List,
	linux-mm, Andrew Morton, Mathieu Desnoyers, Nicholas Piggin,
	Peter Zijlstra (Intel),
	linux-arm-kernel

On Thu, Jun 17, 2021 at 07:00:26AM -0700, Andy Lutomirski wrote:
> 
> 
> On Thu, Jun 17, 2021, at 6:51 AM, Mark Rutland wrote:
> > On Thu, Jun 17, 2021 at 06:41:41AM -0700, Andy Lutomirski wrote:
> 
> > > In any event, I’m even more convinced that no new SYNC_CORE arches
> > > should be added. We need a new API that just does the right thing. 
> > 
> > My intuition is the other way around, and that this is a generally
> > useful thing for architectures that require context synchronization.
> 
> Except that you can't use it in a generic way.  You have to know the
> specific rules for your arch.

That's generally true for modifying instruction streams though? The man
page for cacheflush(2) calls out that it is not portable.

I think what's necessary here is some mandatory per-arch documentation?

> > It's not clear to me what "the right thing" would mean specifically, and
> > on architectures with userspace cache maintenance JITs can usually do
> > the most optimal maintenance, and only need help for the context
> > synchronization.
> > 
> 
> This I simply don't believe -- I doubt that any sane architecture
> really works like this.  I wrote an email about it to Intel that
> apparently generated internal discussion but no results.  Consider:
> 
> mmap(some shared library, some previously unmapped address);
> 
> this does no heavyweight synchronization, at least on x86.  There is
> no "serializing" instruction in the fast path, and it *works* despite
> anything the SDM may or may not say.

Sure, and I called this case out specifically when I said:

|   * Where we can guarantee that a CPU cannot possibly have an
|     instruction in-flight (e.g. due to a lack of a mapping to fetch
|     instructions from), nothing is necessary. This is what we rely on
|     when faulting in code pages. In these cases, the CPU is liable to
|     take fault on the missing translation anyway.

... what really matters is whether the CPU had the opportunity to fetch
something stale; the context synchronization is necessary to discard
that.

Bear in mind that in many cases where this could occur in theory, we
don't hit in practice because CPUs don't happen to predict/speculate as
aggressively as they are permitted to. On arm/arm64 it's obvious that
this is a problem because the documentation clearly defines the
boundaries of what a CPU is permitted to do, whereas on other
architectures the documentation is not necessarily as clear about whether
this is permitted or whether the architecture mandates additional guarantees.

> We can and, IMO, should develop a sane way for user programs to
> install instructions into VMAs, for security-conscious software to
> verify them (by splitting the read and write sides?), and for their
> consumers to execute them, without knowing any arch details.  And I
> think this can be done with no IPIs except for possible TLB flushing
> when needed, at least on most architectures.  It would require a
> nontrivial amount of design work, and it would not resemble
> sys_cacheflush() or SYNC_CORE.

I'm not opposed to adding new interfaces for stuff like that, but I
don't think that membarrier(SYNC_CORE) or cacheflush(2) are necessarily
wrong as-is.

Thanks,
Mark.

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH 8/8] membarrier: Rewrite sync_core_before_usermode() and improve documentation
  2021-06-16  3:21   ` Andy Lutomirski
  (?)
  (?)
@ 2021-06-17 14:47     ` Mathieu Desnoyers
  -1 siblings, 0 replies; 165+ messages in thread
From: Mathieu Desnoyers @ 2021-06-17 14:47 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: x86, Dave Hansen, linux-kernel, linux-mm, Andrew Morton,
	Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras,
	linuxppc-dev, Nicholas Piggin, Catalin Marinas, Will Deacon,
	linux-arm-kernel, Peter Zijlstra, stable

----- On Jun 15, 2021, at 11:21 PM, Andy Lutomirski luto@kernel.org wrote:

> The old sync_core_before_usermode() comments suggested that a non-icache-syncing
> return-to-usermode instruction is x86-specific and that all other
> architectures automatically notice cross-modified code on return to
> userspace.
> 
> This is misleading.  The incantation needed to modify code from one
> CPU and execute it on another CPU is highly architecture dependent.
> On x86, according to the SDM, one must modify the code, issue SFENCE
> if the modification was WC or nontemporal, and then issue a "serializing
> instruction" on the CPU that will execute the code.  membarrier() can do
> the latter.
> 
> On arm64 and powerpc, one must flush the icache and then flush the pipeline
> on the target CPU, although the CPU manuals don't necessarily use this
> language.
> 
> So let's drop any pretense that we can have a generic way to define or
> implement membarrier's SYNC_CORE operation and instead require all
> architectures to define the helper and supply their own documentation as to
> how to use it.

Agreed. Documentation of the sequence of operations that need to be performed
when cross-modifying code on SMP should be per-architecture. The documentation
of the architectural effects of membarrier sync-core should be per-arch as well.

> This means x86, arm64, and powerpc for now.

And also arm32, as discussed in the other leg of the patchset's email thread.

> Let's also
> rename the function from sync_core_before_usermode() to
> membarrier_sync_core_before_usermode() because the precise flushing details
> may very well be specific to membarrier, and even the concept of
> "sync_core" in the kernel is mostly an x86-ism.

OK

> 
[...]
> 
> static void ipi_rseq(void *info)
> {
> @@ -368,12 +373,14 @@ static int membarrier_private_expedited(int flags, int
> cpu_id)
> 	smp_call_func_t ipi_func = ipi_mb;
> 
> 	if (flags == MEMBARRIER_FLAG_SYNC_CORE) {
> -		if (!IS_ENABLED(CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE))
> +#ifndef CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE
> 			return -EINVAL;
> +#else
> 		if (!(atomic_read(&mm->membarrier_state) &
> 		      MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE_READY))
> 			return -EPERM;
> 		ipi_func = ipi_sync_core;
> +#endif

Please change this #ifndef / #else / #endif within the function back to:

if (!IS_ENABLED(CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE)) {
  ...
} else {
  ...
}

I don't think mixing up preprocessor and code logic makes it more readable.
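
Concretely, keeping it as C logic, the hunk would read roughly as follows
(reconstructed from the quoted diff rather than taken from a tree, so treat
it as a sketch):

	if (flags == MEMBARRIER_FLAG_SYNC_CORE) {
		if (!IS_ENABLED(CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE))
			return -EINVAL;
		if (!(atomic_read(&mm->membarrier_state) &
		      MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE_READY))
			return -EPERM;
		ipi_func = ipi_sync_core;
	}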

Thanks,

Mathieu

> 	} else if (flags == MEMBARRIER_FLAG_RSEQ) {
> 		if (!IS_ENABLED(CONFIG_RSEQ))
> 			return -EINVAL;
> --
> 2.31.1

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH 7/8] membarrier: Remove arm (32) support for SYNC_CORE
  2021-06-17 14:00               ` Andy Lutomirski
@ 2021-06-17 15:01                 ` Peter Zijlstra
  -1 siblings, 0 replies; 165+ messages in thread
From: Peter Zijlstra @ 2021-06-17 15:01 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Mark Rutland, Russell King (Oracle),
	the arch/x86 maintainers, Dave Hansen, Linux Kernel Mailing List,
	linux-mm, Andrew Morton, Mathieu Desnoyers, Nicholas Piggin,
	linux-arm-kernel

On Thu, Jun 17, 2021 at 07:00:26AM -0700, Andy Lutomirski wrote:
> On Thu, Jun 17, 2021, at 6:51 AM, Mark Rutland wrote:

> > It's not clear to me what "the right thing" would mean specifically, and
> > on architectures with userspace cache maintenance JITs can usually do
> > the most optimal maintenance, and only need help for the context
> > synchronization.
> > 
> 
> This I simply don't believe -- I doubt that any sane architecture
> really works like this.  I wrote an email about it to Intel that
> apparently generated internal discussion but no results.  Consider:
> 
> mmap(some shared library, some previously unmapped address);
> 
> this does no heavyweight synchronization, at least on x86.  There is
> no "serializing" instruction in the fast path, and it *works* despite
> anything the SDM may or may not say.

I'm confused; why do you think that is relevant?

The only way to get into a memory address space is CR3 write, which is
serializing and will flush everything. Since there wasn't anything
mapped, nothing could be 'cached' from that location.

So that has to work...

> We can and, IMO, should develop a sane way for user programs to
> install instructions into VMAs, for security-conscious software to
> verify them (by splitting the read and write sides?), and for their
> consumers to execute them, without knowing any arch details.  And I
> think this can be done with no IPIs except for possible TLB flushing
> when needed, at least on most architectures.  It would require a
> nontrivial amount of design work, and it would not resemble
> sys_cacheflush() or SYNC_CORE.

The interesting use-case is where we modify code that is under active
execution in a multi-threaded process; where CPU0 runs code and doesn't
make any syscalls at all, while CPU1 modifies code that is visible to
CPU0.

In that case CPU0 can have various internal state still reflecting the
old instructions that no longer exist in memory -- presumably.

We also need to inject at least a full memory barrier and pipeline
flush, to create a proper 'before' and 'after' the modification, so we
can reason about when the *other* threads will be able to observe the
new code.

Now, the SDM documents that prefetch and trace buffers are not flushed
on i$ invalidate (actual implementations might of course differ) and
doing this requires the SERIALIZE instruction or one of the many
instructions that implies this, one of which is IRET.

For the cross-modifying case, I really don't see how you can not send an
IPI and expect behaviour one can reason about, irrespective of any
non-coherent behaviour.

Now, the SDM documents non-coherent behaviour and requires SERIALIZE,
while at the same time any IPI already implies IRET which implies
SERIALIZE -- except some Luto guy was having plans to optimize the IRET
paths so we couldn't rely on that.

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH 4/8] membarrier: Make the post-switch-mm barrier explicit
  2021-06-17  5:32             ` Andy Lutomirski
  2021-06-17  6:51               ` Nicholas Piggin
  2021-06-17  9:08               ` [RFC][PATCH] sched: Use lightweight hazard pointers to grab lazy mms Peter Zijlstra
@ 2021-06-17 15:02               ` Paul E. McKenney
  2021-06-18  0:06                 ` Andy Lutomirski
  2 siblings, 1 reply; 165+ messages in thread
From: Paul E. McKenney @ 2021-06-17 15:02 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Nicholas Piggin, Peter Zijlstra (Intel),
	Rik van Riel, Andrew Morton, Dave Hansen,
	Linux Kernel Mailing List, linux-mm, Mathieu Desnoyers,
	the arch/x86 maintainers

On Wed, Jun 16, 2021 at 10:32:15PM -0700, Andy Lutomirski wrote:
> On Wed, Jun 16, 2021, at 7:57 PM, Andy Lutomirski wrote:
> > On Wed, Jun 16, 2021, at 6:37 PM, Nicholas Piggin wrote:
> > > Excerpts from Andy Lutomirski's message of June 17, 2021 4:41 am:
> > > > On 6/16/21 12:35 AM, Peter Zijlstra wrote:
> > > >> On Wed, Jun 16, 2021 at 02:19:49PM +1000, Nicholas Piggin wrote:
> > > >>> Excerpts from Andy Lutomirski's message of June 16, 2021 1:21 pm:
> > > >>>> membarrier() needs a barrier after any CPU changes mm.  There is currently
> > > >>>> a comment explaining why this barrier probably exists in all cases.  This
> > > >>>> is very fragile -- any change to the relevant parts of the scheduler
> > > >>>> might get rid of these barriers, and it's not really clear to me that
> > > >>>> the barrier actually exists in all necessary cases.
> > > >>>
> > > >>> The comments and barriers in the mmdrop() hunks? I don't see what is 
> > > >>> fragile or maybe-buggy about this. The barrier definitely exists.
> > > >>>
> > > >>> And any change can change anything, that doesn't make it fragile. My
> > > >>> lazy tlb refcounting change avoids the mmdrop in some cases, but it
> > > >>> replaces it with smp_mb for example.
> > > >> 
> > > >> I'm with Nick again, on this. You're adding extra barriers for no
> > > >> discernible reason, that's not generally encouraged, seeing how extra
> > > >> barriers is extra slow.
> > > >> 
> > > >> Both mmdrop() itself, as well as the callsite have comments saying how
> > > >> membarrier relies on the implied barrier, what's fragile about that?
> > > >> 
> > > > 
> > > > My real motivation is that mmgrab() and mmdrop() don't actually need to
> > > > be full barriers.  The current implementation has them being full
> > > > barriers, and the current implementation is quite slow.  So let's try
> > > > that commit message again:
> > > > 
> > > > membarrier() needs a barrier after any CPU changes mm.  There is currently
> > > > a comment explaining why this barrier probably exists in all cases. The
> > > > logic is based on ensuring that the barrier exists on every control flow
> > > > path through the scheduler.  It also relies on mmgrab() and mmdrop() being
> > > > full barriers.
> > > > 
> > > > mmgrab() and mmdrop() would be better if they were not full barriers.  As a
> > > > trivial optimization, mmgrab() could use a relaxed atomic and mmdrop()
> > > > could use a release on architectures that have these operations.
> > > 
> > > I'm not against the idea, I've looked at something similar before (not
> > > for mmdrop but a different primitive). Also my lazy tlb shootdown series 
> > > could possibly take advantage of this, I might cherry pick it and test 
> > > performance :)
> > > 
> > > I don't think it belongs in this series though. Should go together with
> > > something that takes advantage of it.
> > 
> > I’m going to see if I can get hazard pointers into shape quickly.

One textbook C implementation is in perfbook CodeSamples/defer/hazptr.[hc]
git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/perfbook.git

A production-tested C++ implementation is in the folly library:

https://github.com/facebook/folly/blob/master/folly/synchronization/Hazptr.h

However, the hazard-pointers get-a-reference operation requires a full
barrier.  There are ways to optimize this away in some special cases,
one of which is used in the folly-library hash-map code.
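
For reference, the get-a-reference step mentioned above typically looks
something like this (a generic sketch in kernel style, not taken from either
of those implementations):

	/* Publish a hazard pointer to *pptr, then re-validate it. */
	static void *hp_acquire(void **pptr, void **hazard_slot)
	{
		void *p;

		do {
			p = READ_ONCE(*pptr);
			WRITE_ONCE(*hazard_slot, p);
			/*
			 * The full barrier: order the hazard-pointer store
			 * before re-reading the source pointer.
			 */
			smp_mb();
		} while (READ_ONCE(*pptr) != p);

		return p;
	}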

> Here it is.  Not even boot tested!
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=sched/lazymm&id=ecc3992c36cb88087df9c537e2326efb51c95e31
> 
> Nick, I think you can accomplish much the same thing as your patch by:
> 
> #define for_each_possible_lazymm_cpu while (false)
> 
> although a more clever definition might be even more performant.
> 
> I would appreciate everyone's thoughts as to whether this scheme is sane.
> 
> Paul, I'm adding you for two reasons.  First, you seem to enjoy bizarre locking schemes.  Secondly, because maybe RCU could actually work here.  The basic idea is that we want to keep an mm_struct from being freed at an inopportune time.  The problem with naively using RCU is that each CPU can use one single mm_struct while in an idle extended quiescent state (but not a user extended quiescent state).  So rcu_read_lock() is right out.  If RCU could understand this concept, then maybe it could help us, but this seems a bit out of scope for RCU.

OK, I should look at your patch, but that will be after morning meetings.

On RCU and idle, much of the idle code now allows rcu_read_lock() to be
used directly, thanks to Peter's recent work.  Any sort of interrupt or NMI
from idle can also use rcu_read_lock(), including the IPIs that are now
done directly from idle.  RCU_NONIDLE() makes RCU pay attention to the
code supplied as its sole argument.
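
For example (a trivial sketch, not from this thread):

	/* From code that RCU would otherwise consider idle: */
	RCU_NONIDLE({
		rcu_read_lock();
		/* ... access RCU-protected data ... */
		rcu_read_unlock();
	});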

Or is your patch really having the CPU expect a mm_struct to stick around
across the full idle sojourn, and without the assistance of mmgrab()
and mmdrop()?

Anyway, off to meetings...  Hope this helps in the meantime.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH 7/8] membarrier: Remove arm (32) support for SYNC_CORE
  2021-06-17 15:01                 ` Peter Zijlstra
@ 2021-06-17 15:13                   ` Peter Zijlstra
  -1 siblings, 0 replies; 165+ messages in thread
From: Peter Zijlstra @ 2021-06-17 15:13 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Mark Rutland, Russell King (Oracle),
	the arch/x86 maintainers, Dave Hansen, Linux Kernel Mailing List,
	linux-mm, Andrew Morton, Mathieu Desnoyers, Nicholas Piggin,
	linux-arm-kernel

On Thu, Jun 17, 2021 at 05:01:53PM +0200, Peter Zijlstra wrote:
> On Thu, Jun 17, 2021 at 07:00:26AM -0700, Andy Lutomirski wrote:
> > On Thu, Jun 17, 2021, at 6:51 AM, Mark Rutland wrote:
> 
> > > It's not clear to me what "the right thing" would mean specifically, and
> > > on architectures with userspace cache maintenance JITs can usually do
> > > the most optimal maintenance, and only need help for the context
> > > synchronization.
> > > 
> > 
> > This I simply don't believe -- I doubt that any sane architecture
> > really works like this.  I wrote an email about it to Intel that
> > apparently generated internal discussion but no results.  Consider:
> > 
> > mmap(some shared library, some previously unmapped address);
> > 
> > this does no heavyweight synchronization, at least on x86.  There is
> > no "serializing" instruction in the fast path, and it *works* despite
> > anything the SDM may or may not say.
> 
> I'm confused; why do you think that is relevant?
> 
> The only way to get into a memory address space is CR3 write, which is
> serializing and will flush everything. Since there wasn't anything
> mapped, nothing could be 'cached' from that location.
> 
> So that has to work...

Ooh, you mean mmap() where there was something mmap'ed before.  Not virgin
address space, so to speak.

But in that case, the unmap() would've caused a TLB invalidate, which on
x86 means shootdown IPIs, and returning from each IPI is an IRET, a
serializing instruction.

Other architectures include I/D cache flushes in their TLB
invalidations -- but as noted elsewhere in the thread, that might not be
sufficient on its own.

But yes, I think TLBI has to imply flushing micro-arch instruction
related buffers for any of that to work.
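
In userspace terms, the case being argued about is roughly the sketch
below.  All names (code, len, new_insns, new_len) are placeholders, and
whether the final call is architecturally guaranteed to see the new
instructions without an explicit core sync is exactly the open question.

#include <string.h>
#include <sys/mman.h>

/* Thread A: replace an existing executable mapping with new code. */
static void repatch(void *code, size_t len, const void *new_insns, size_t new_len)
{
	munmap(code, len);		/* TLB shootdown -> IPIs on x86 */
	mmap(code, len, PROT_READ | PROT_WRITE | PROT_EXEC,
	     MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
	memcpy(code, new_insns, new_len);
}

/* Thread B, on another CPU, later jumps to the patched code.  The claim
 * above is that this CPU took the shootdown IPI for the munmap() and
 * returned from it via IRET, a serializing instruction, so it cannot be
 * executing stale instructions from the old mapping. */
static void run(void *code)
{
	((void (*)(void))code)();
}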

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH 8/8] membarrier: Rewrite sync_core_before_usermode() and improve documentation
  2021-06-16  3:21   ` Andy Lutomirski
@ 2021-06-17 15:16     ` Mathieu Desnoyers
  -1 siblings, 0 replies; 165+ messages in thread
From: Mathieu Desnoyers @ 2021-06-17 15:16 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: x86, Dave Hansen, linux-kernel, linux-mm, Andrew Morton,
	Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras,
	linuxppc-dev, Nicholas Piggin, Catalin Marinas, Will Deacon,
	linux-arm-kernel, Peter Zijlstra, stable

----- On Jun 15, 2021, at 11:21 PM, Andy Lutomirski luto@kernel.org wrote:

[...]

> +# An architecture that wants to support
> +# MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE needs to define precisely what it
> +# is supposed to do and implement membarrier_sync_core_before_usermode() to
> +# make it do that.  Then it can select ARCH_HAS_MEMBARRIER_SYNC_CORE via
> +# Kconfig.  Unfortunately, MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE is not a
> +# fantastic API and may not make sense on all architectures.  Once an
> +# architecture meets these requirements,

Can we please remove the editorial comment about the quality of the membarrier
sync-core's API ?

At least it's better than having all of userspace rely on undocumented mprotect()
side-effects to perform something which typically works, until it doesn't, or until
it prevents mprotect()'s implementation from being improved because improving it would
start breaking JITs all over the place.

We can simply state that the definition of what membarrier sync-core does is defined
per-architecture, and document the sequence of operations to perform when doing
cross-modifying code specifically for each architecture.

Now if there are weird architectures where membarrier is an odd fit (I've heard that
riscv might need address ranges to which the core sync needs to apply), then those
might need to implement their own arch-specific system call, which is all fine.

> +#
> +# On x86, a program can safely modify code, issue
> +# MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE, and then execute that code, via
> +# the modified address or an alias, from any thread in the calling process.
> +#
> +# On arm64, a program can modify code, flush the icache as needed, and issue
> +# MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE to force a "context synchronizing
> +# event", aka pipeline flush on all CPUs that might run the calling process.
> +# Then the program can execute the modified code as long as it is executed
> +# from an address consistent with the icache flush and the CPU's cache type.
> +#
> +# On powerpc, a program can use MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE
> +# similarly to arm64.  It would be nice if the powerpc maintainers could
> +# add a more clear explanation.
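
Concretely, the x86 sequence described above amounts to something like the
sketch below for a user-space JIT (names and error handling are
illustrative, not taken from the patch; the process must also perform the
one-time MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_SYNC_CORE registration
before the expedited command is allowed):

#include <linux/membarrier.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

static int membarrier(int cmd, unsigned int flags)
{
	return syscall(__NR_membarrier, cmd, flags);
}

static void jit_init(void)
{
	/* Once per process, before using the expedited command. */
	membarrier(MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_SYNC_CORE, 0);
}

static void publish_code(void *buf, const void *insns, size_t len)
{
	memcpy(buf, insns, len);	/* modify the code */
	/* Core-serialize every CPU that might run this process... */
	membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE, 0);
	/* ...after which any thread may execute buf, directly or via an alias. */
}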

We should document the requirements on ARMv7 as well.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [RFC][PATCH] sched: Use lightweight hazard pointers to grab lazy mms
  2021-06-17 14:10                 ` Andy Lutomirski
@ 2021-06-17 15:45                   ` Peter Zijlstra
  0 siblings, 0 replies; 165+ messages in thread
From: Peter Zijlstra @ 2021-06-17 15:45 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Nicholas Piggin, Rik van Riel, Andrew Morton, Dave Hansen,
	Linux Kernel Mailing List, linux-mm, Mathieu Desnoyers,
	the arch/x86 maintainers, Paul E. McKenney

On Thu, Jun 17, 2021 at 07:10:41AM -0700, Andy Lutomirski wrote:
> On Thu, Jun 17, 2021, at 2:08 AM, Peter Zijlstra wrote:
> > +extern void mm_unlazy_mm_count(struct mm_struct *mm);
> 
> You didn't like mm_many_words_in_the_name_of_the_function()? :)

:-)

> > -	if (mm) {
> > -		membarrier_mm_sync_core_before_usermode(mm);
> > -		mmdrop(mm);
> > -	}
> 
> What happened here?
> 

I forced your patch on top of tip/master without bothering about the
membarrier cleanups and figured I could live without that call for a
little while.

But yes, that needs cleaning up.

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH 1/8] membarrier: Document why membarrier() works
  2021-06-16  7:30     ` Peter Zijlstra
@ 2021-06-17 23:45       ` Andy Lutomirski
  0 siblings, 0 replies; 165+ messages in thread
From: Andy Lutomirski @ 2021-06-17 23:45 UTC (permalink / raw)
  To: Peter Zijlstra, Nicholas Piggin
  Cc: x86, Andrew Morton, Dave Hansen, LKML, linux-mm, Mathieu Desnoyers

On 6/16/21 12:30 AM, Peter Zijlstra wrote:
> On Wed, Jun 16, 2021 at 02:00:37PM +1000, Nicholas Piggin wrote:
>> Excerpts from Andy Lutomirski's message of June 16, 2021 1:21 pm:
>>> We had a nice comment at the top of membarrier.c explaining why membarrier
>>> worked in a handful of scenarios, but that consisted more of a list of
>>> things not to forget than an actual description of the algorithm and why it
>>> should be expected to work.
>>>
>>> Add a comment explaining my understanding of the algorithm.  This exposes a
>>> couple of implementation issues that I will hopefully fix up in subsequent
>>> patches.
>>>
>>> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
>>> Cc: Nicholas Piggin <npiggin@gmail.com>
>>> Cc: Peter Zijlstra <peterz@infradead.org>
>>> Signed-off-by: Andy Lutomirski <luto@kernel.org>
>>> ---
>>>  kernel/sched/membarrier.c | 55 +++++++++++++++++++++++++++++++++++++++
>>>  1 file changed, 55 insertions(+)
>>>
>>> diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c
>>> index b5add64d9698..3173b063d358 100644
>>> --- a/kernel/sched/membarrier.c
>>> +++ b/kernel/sched/membarrier.c
>>> @@ -7,6 +7,61 @@
>>>  #include "sched.h"
>>>  
>>
>> Precisely describing the orderings is great, not a fan of the style of the
>> comment though.
> 
> I'm with Nick on that; I can't read it :/ It only makes things more
> confusing. If you want precision, English (or any natural language) is
> your enemy.
> 
> To describe ordering use the diagrams and/or litmus tests.
> 

I made some changes.  Maybe it's better now.

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH 4/8] membarrier: Make the post-switch-mm barrier explicit
  2021-06-17  6:51               ` Nicholas Piggin
@ 2021-06-17 23:49                 ` Andy Lutomirski
  2021-06-19  2:53                   ` Nicholas Piggin
  0 siblings, 1 reply; 165+ messages in thread
From: Andy Lutomirski @ 2021-06-17 23:49 UTC (permalink / raw)
  To: Nicholas Piggin, Peter Zijlstra (Intel), Rik van Riel
  Cc: Andrew Morton, Dave Hansen, Linux Kernel Mailing List, linux-mm,
	Mathieu Desnoyers, Paul E. McKenney, the arch/x86 maintainers

On 6/16/21 11:51 PM, Nicholas Piggin wrote:
> Excerpts from Andy Lutomirski's message of June 17, 2021 3:32 pm:
>> On Wed, Jun 16, 2021, at 7:57 PM, Andy Lutomirski wrote:
>>>
>>>
>>> On Wed, Jun 16, 2021, at 6:37 PM, Nicholas Piggin wrote:
>>>> Excerpts from Andy Lutomirski's message of June 17, 2021 4:41 am:
>>>>> On 6/16/21 12:35 AM, Peter Zijlstra wrote:
>>>>>> On Wed, Jun 16, 2021 at 02:19:49PM +1000, Nicholas Piggin wrote:
>>>>>>> Excerpts from Andy Lutomirski's message of June 16, 2021 1:21 pm:
>>>>>>>> membarrier() needs a barrier after any CPU changes mm.  There is currently
>>>>>>>> a comment explaining why this barrier probably exists in all cases.  This
>>>>>>>> is very fragile -- any change to the relevant parts of the scheduler
>>>>>>>> might get rid of these barriers, and it's not really clear to me that
>>>>>>>> the barrier actually exists in all necessary cases.
>>>>>>>
>>>>>>> The comments and barriers in the mmdrop() hunks? I don't see what is 
>>>>>>> fragile or maybe-buggy about this. The barrier definitely exists.
>>>>>>>
>>>>>>> And any change can change anything, that doesn't make it fragile. My
>>>>>>> lazy tlb refcounting change avoids the mmdrop in some cases, but it
>>>>>>> replaces it with smp_mb for example.
>>>>>>
>>>>>> I'm with Nick again, on this. You're adding extra barriers for no
>>>>>> discernible reason, that's not generally encouraged, seeing how extra
>>>>>> barriers is extra slow.
>>>>>>
>>>>>> Both mmdrop() itself, as well as the callsite have comments saying how
>>>>>> membarrier relies on the implied barrier, what's fragile about that?
>>>>>>
>>>>>
>>>>> My real motivation is that mmgrab() and mmdrop() don't actually need to
>>>>> be full barriers.  The current implementation has them being full
>>>>> barriers, and the current implementation is quite slow.  So let's try
>>>>> that commit message again:
>>>>>
>>>>> membarrier() needs a barrier after any CPU changes mm.  There is currently
>>>>> a comment explaining why this barrier probably exists in all cases. The
>>>>> logic is based on ensuring that the barrier exists on every control flow
>>>>> path through the scheduler.  It also relies on mmgrab() and mmdrop() being
>>>>> full barriers.
>>>>>
>>>>> mmgrab() and mmdrop() would be better if they were not full barriers.  As a
>>>>> trivial optimization, mmgrab() could use a relaxed atomic and mmdrop()
>>>>> could use a release on architectures that have these operations.
>>>>
>>>> I'm not against the idea, I've looked at something similar before (not
>>>> for mmdrop but a different primitive). Also my lazy tlb shootdown series 
>>>> could possibly take advantage of this, I might cherry pick it and test 
>>>> performance :)
>>>>
>>>> I don't think it belongs in this series though. Should go together with
>>>> something that takes advantage of it.
>>>
>>> I’m going to see if I can get hazard pointers into shape quickly.
>>
>> Here it is.  Not even boot tested!
>>
>> https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=sched/lazymm&id=ecc3992c36cb88087df9c537e2326efb51c95e31
>>
>> Nick, I think you can accomplish much the same thing as your patch by:
>>
>> #define for_each_possible_lazymm_cpu while (false)
> 
> I'm not sure what you mean? For powerpc, other CPUs can be using the mm 
> as lazy at this point. I must be missing something.

What I mean is: if you want to shoot down lazies instead of doing the
hazard pointer trick to track them, you could do:

#define for_each_possible_lazymm_cpu while (false)

which would promise to the core code that you don't have any lazies left
by the time exit_mmap() is done.  You might need a new hook in
exit_mmap() depending on exactly how you implement the lazy shootdown.
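
As a sketch of what that promise could look like in an arch header (purely
hypothetical, and with the iterator arguments kept so the core-code caller
still compiles):

/* arch/<arch>/include/asm/mmu.h -- hypothetical */

/*
 * This architecture shoots down and drops every lazy user of an mm before
 * exit_mmap() completes, so by the time the core code would walk lazy
 * references there is nothing left to visit.
 */
#define for_each_possible_lazymm_cpu(cpu, mm)	while (false)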

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH 4/8] membarrier: Make the post-switch-mm barrier explicit
  2021-06-17 15:02               ` [PATCH 4/8] membarrier: Make the post-switch-mm barrier explicit Paul E. McKenney
@ 2021-06-18  0:06                 ` Andy Lutomirski
  2021-06-18  3:35                   ` Paul E. McKenney
  0 siblings, 1 reply; 165+ messages in thread
From: Andy Lutomirski @ 2021-06-18  0:06 UTC (permalink / raw)
  To: paulmck
  Cc: Nicholas Piggin, Peter Zijlstra (Intel),
	Rik van Riel, Andrew Morton, Dave Hansen,
	Linux Kernel Mailing List, linux-mm, Mathieu Desnoyers,
	the arch/x86 maintainers

On 6/17/21 8:02 AM, Paul E. McKenney wrote:
> On Wed, Jun 16, 2021 at 10:32:15PM -0700, Andy Lutomirski wrote:
>> I would appreciate everyone's thoughts as to whether this scheme is sane.
>>
>> Paul, I'm adding you for two reasons.  First, you seem to enjoy bizarre locking schemes.  Secondly, because maybe RCU could actually work here.  The basic idea is that we want to keep an mm_struct from being freed at an inopportune time.  The problem with naively using RCU is that each CPU can use one single mm_struct while in an idle extended quiescent state (but not a user extended quiescent state).  So rcu_read_lock() is right out.  If RCU could understand this concept, then maybe it could help us, but this seems a bit out of scope for RCU.
> 
> OK, I should look at your patch, but that will be after morning meetings.
> 
> On RCU and idle, much of the idle code now allows rcu_read_lock() to be
> directly, thanks to Peter's recent work.  Any sort of interrupt or NMI
> from idle can also use rcu_read_lock(), including the IPIs that are now
> done directly from idle.  RCU_NONIDLE() makes RCU pay attention to the
> code supplied as its sole argument.
> 
> Or is your patch really having the CPU expect a mm_struct to stick around
> across the full idle sojourn, and without the assistance of mmgrab()
> and mmdrop()?

I really do expect it to stick around across the full idle sojourn.
Unless RCU is more magical than I think it is, this means I can't use RCU.

--Andy

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH 7/8] membarrier: Remove arm (32) support for SYNC_CORE
  2021-06-16  3:21   ` Andy Lutomirski
@ 2021-06-18  0:07     ` Andy Lutomirski
  -1 siblings, 0 replies; 165+ messages in thread
From: Andy Lutomirski @ 2021-06-18  0:07 UTC (permalink / raw)
  To: x86
  Cc: Dave Hansen, LKML, linux-mm, Andrew Morton, Mathieu Desnoyers,
	Nicholas Piggin, Peter Zijlstra, Russell King, linux-arm-kernel

On 6/15/21 8:21 PM, Andy Lutomirski wrote:
> On arm32, the only way to safely flush icache from usermode is to call
> cacheflush(2).  This also handles any required pipeline flushes, so
> membarrier's SYNC_CORE feature is useless on arm.  Remove it.

After all the discussion, I'm dropping this patch.

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH 8/8] membarrier: Rewrite sync_core_before_usermode() and improve documentation
  2021-06-17 14:47     ` Mathieu Desnoyers
@ 2021-06-18  0:12       ` Andy Lutomirski
  -1 siblings, 0 replies; 165+ messages in thread
From: Andy Lutomirski @ 2021-06-18  0:12 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: x86, Dave Hansen, linux-kernel, linux-mm, Andrew Morton,
	Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras,
	linuxppc-dev, Nicholas Piggin, Catalin Marinas, Will Deacon,
	linux-arm-kernel, Peter Zijlstra, stable

On 6/17/21 7:47 AM, Mathieu Desnoyers wrote:

> Please change back this #ifndef / #else / #endif within function for
> 
> if (!IS_ENABLED(CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE)) {
>   ...
> } else {
>   ...
> }
> 
> I don't think mixing up preprocessor and code logic makes it more readable.

I agree, but I don't know how to make the result work well.
membarrier_sync_core_before_usermode() isn't defined in the !IS_ENABLED
case, so either I need to fake up a definition or use #ifdef.

If I faked up a definition, I would want to assert, at build time, that
it isn't called.  I don't think we can do:

static void membarrier_sync_core_before_usermode()
{
    BUILD_BUG_IF_REACHABLE();
}
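
One pattern that comes close (just a sketch of the usual trick, with an
illustrative caller name, not what the series ended up doing) is to declare
the helper unconditionally and rely on dead-code elimination under
IS_ENABLED(), so that an unexpectedly reachable call fails at link time
rather than needing a build-time assert:

/* Always declared; only defined when the arch opts in. */
void membarrier_sync_core_before_usermode(void);

#ifdef CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE
void membarrier_sync_core_before_usermode(void)
{
	sync_core_before_usermode();	/* the arch-provided primitive */
}
#endif

static void membarrier_finish(void)	/* illustrative caller */
{
	if (IS_ENABLED(CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE))
		membarrier_sync_core_before_usermode();
	/*
	 * With the config off, the compiler discards the call above, so the
	 * missing definition is harmless; a call that survives anyway shows
	 * up as an undefined reference at link time instead of silently
	 * doing nothing.
	 */
}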


^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH 8/8] membarrier: Rewrite sync_core_before_usermode() and improve documentation
  2021-06-17 15:16     ` Mathieu Desnoyers
@ 2021-06-18  0:13       ` Andy Lutomirski
  -1 siblings, 0 replies; 165+ messages in thread
From: Andy Lutomirski @ 2021-06-18  0:13 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: x86, Dave Hansen, linux-kernel, linux-mm, Andrew Morton,
	Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras,
	linuxppc-dev, Nicholas Piggin, Catalin Marinas, Will Deacon,
	linux-arm-kernel, Peter Zijlstra, stable

On 6/17/21 8:16 AM, Mathieu Desnoyers wrote:
> ----- On Jun 15, 2021, at 11:21 PM, Andy Lutomirski luto@kernel.org wrote:
> 
> [...]
> 
>> +# An architecture that wants to support
>> +# MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE needs to define precisely what it
>> +# is supposed to do and implement membarrier_sync_core_before_usermode() to
>> +# make it do that.  Then it can select ARCH_HAS_MEMBARRIER_SYNC_CORE via
>> +# Kconfig.  Unfortunately, MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE is not a
>> +# fantastic API and may not make sense on all architectures.  Once an
>> +# architecture meets these requirements,
> 
> Can we please remove the editorial comment about the quality of the membarrier
> sync-core's API ?

Done
>> +#
>> +# On x86, a program can safely modify code, issue
>> +# MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE, and then execute that code, via
>> +# the modified address or an alias, from any thread in the calling process.
>> +#
>> +# On arm64, a program can modify code, flush the icache as needed, and issue
>> +# MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE to force a "context synchronizing
>> +# event", aka pipeline flush on all CPUs that might run the calling process.
>> +# Then the program can execute the modified code as long as it is executed
>> +# from an address consistent with the icache flush and the CPU's cache type.
>> +#
>> +# On powerpc, a program can use MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE
>> +# similarly to arm64.  It would be nice if the powerpc maintainers could
>> +# add a more clear explanation.
> 
> We should document the requirements on ARMv7 as well.

Done.

> 
> Thanks,
> 
> Mathieu
> 


^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [RFC][PATCH] sched: Use lightweight hazard pointers to grab lazy mms
  2021-06-17  9:08               ` [RFC][PATCH] sched: Use lightweight hazard pointers to grab lazy mms Peter Zijlstra
                                   ` (3 preceding siblings ...)
  2021-06-17 14:10                 ` Andy Lutomirski
@ 2021-06-18  3:29                 ` Paul E. McKenney
  2021-06-18  5:04                   ` Andy Lutomirski
  4 siblings, 1 reply; 165+ messages in thread
From: Paul E. McKenney @ 2021-06-18  3:29 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andy Lutomirski, Nicholas Piggin, Rik van Riel, Andrew Morton,
	Dave Hansen, Linux Kernel Mailing List, linux-mm,
	Mathieu Desnoyers, the arch/x86 maintainers

On Thu, Jun 17, 2021 at 11:08:03AM +0200, Peter Zijlstra wrote:
> On Wed, Jun 16, 2021 at 10:32:15PM -0700, Andy Lutomirski wrote:
> > Here it is.  Not even boot tested!
> 
> It is now, it even builds a kernel.. so it must be perfect :-)
> 
> > https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=sched/lazymm&id=ecc3992c36cb88087df9c537e2326efb51c95e31
> 
> Since I had to turn it into a patch to post, so that I could comment on
> it, I've cleaned it up a little for you.
> 
> I'll reply to self with some notes, but I think I like it.

But rcutorture isn't too happy with it when applied to current
mainline:

------------------------------------------------------------------------
[   32.559192] ------------[ cut here ]------------
[   32.559528] WARNING: CPU: 0 PID: 175 at kernel/fork.c:686 __mmdrop+0x9f/0xb0
[   32.560197] Modules linked in:
[   32.560470] CPU: 0 PID: 175 Comm: torture_onoff Not tainted 5.13.0-rc6+ #23
[   32.561077] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-1ubuntu1.1 04/01/2014
[   32.561809] RIP: 0010:__mmdrop+0x9f/0xb0
[   32.562179] Code: fb 20 75 e6 48 8b 45 68 48 85 c0 0f 85 1e 48 ad 00 48 8b 3d 93 e0 c3 01 5b 48 89 ee 5d 41 5c e9 97 45 18 00 0f 0b 0f 0b eb 87 <0f> 0b eb 95 48 89 ef e8 a5 f1 17 00 eb a9 0f 1f 00 48 81 ef c0 03
[   32.563822] RSP: 0018:ffff944c40623d68 EFLAGS: 00010246
[   32.564331] RAX: ffff8e84c2339c00 RBX: ffff8e84df5572e0 RCX: 00000000fffffffa
[   32.564978] RDX: 0000000000000000 RSI: 0000000000000033 RDI: ffff8e84c29a0000
[   32.565648] RBP: ffff8e84c29a0000 R08: ffff8e84c11c774a R09: 0000000000000001
[   32.566256] R10: ffff8e85411c773f R11: ffff8e84c11c774a R12: 0000000000000057
[   32.566909] R13: 0000000000000000 R14: ffffffffb0e487f8 R15: 000000000000000d
[   32.567584] FS:  0000000000000000(0000) GS:ffff8e84df200000(0000) knlGS:0000000000000000
[   32.568321] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   32.568860] CR2: 0000000000000000 CR3: 00000000029ec000 CR4: 00000000000006f0
[   32.569559] Call Trace:
[   32.569804]  ? takedown_cpu+0xd0/0xd0
[   32.570123]  finish_cpu+0x2e/0x40
[   32.570449]  cpuhp_invoke_callback+0xf6/0x3f0
[   32.570755]  cpuhp_invoke_callback_range+0x3b/0x80
[   32.571137]  _cpu_down+0xdf/0x2a0
[   32.571467]  cpu_down+0x2a/0x50
[   32.571771]  device_offline+0x80/0xb0
[   32.572101]  remove_cpu+0x1a/0x30
[   32.572393]  torture_offline+0x80/0x140
[   32.572730]  torture_onoff+0x147/0x260
[   32.573068]  ? torture_kthread_stopping+0xa0/0xa0
[   32.573488]  kthread+0xf9/0x130
[   32.573777]  ? kthread_park+0x80/0x80
[   32.574119]  ret_from_fork+0x22/0x30
[   32.574418] ---[ end trace b77effd8aab7f902 ]---
[   32.574819] BUG: Bad rss-counter state mm:00000000bccc5a55 type:MM_ANONPAGES val:1
[   32.575450] BUG: non-zero pgtables_bytes on freeing mm: 24576
------------------------------------------------------------------------

Are we absolutely sure that the mmdrop()s are balanced in all cases?

							Thanx, Paul

> ---
>  arch/x86/include/asm/mmu.h |   5 ++
>  include/linux/sched/mm.h   |   3 +
>  kernel/fork.c              |   2 +
>  kernel/sched/core.c        | 138 ++++++++++++++++++++++++++++++++++++---------
>  kernel/sched/sched.h       |  10 +++-
>  5 files changed, 130 insertions(+), 28 deletions(-)
> 
> diff --git a/arch/x86/include/asm/mmu.h b/arch/x86/include/asm/mmu.h
> index 5d7494631ea9..ce94162168c2 100644
> --- a/arch/x86/include/asm/mmu.h
> +++ b/arch/x86/include/asm/mmu.h
> @@ -66,4 +66,9 @@ typedef struct {
>  void leave_mm(int cpu);
>  #define leave_mm leave_mm
>  
> +/* On x86, mm_cpumask(mm) contains all CPUs that might be lazily using mm */
> +#define for_each_possible_lazymm_cpu(cpu, mm) \
> +	for_each_cpu((cpu), mm_cpumask((mm)))
> +
> +
>  #endif /* _ASM_X86_MMU_H */
> diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
> index e24b1fe348e3..5c7eafee6fea 100644
> --- a/include/linux/sched/mm.h
> +++ b/include/linux/sched/mm.h
> @@ -77,6 +77,9 @@ static inline bool mmget_not_zero(struct mm_struct *mm)
>  
>  /* mmput gets rid of the mappings and all user-space */
>  extern void mmput(struct mm_struct *);
> +
> +extern void mm_unlazy_mm_count(struct mm_struct *mm);
> +
>  #ifdef CONFIG_MMU
>  /* same as above but performs the slow path from the async context. Can
>   * be called from the atomic context as well
> diff --git a/kernel/fork.c b/kernel/fork.c
> index e595e77913eb..57415cca088c 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1104,6 +1104,8 @@ static inline void __mmput(struct mm_struct *mm)
>  	}
>  	if (mm->binfmt)
>  		module_put(mm->binfmt->module);
> +
> +	mm_unlazy_mm_count(mm);
>  	mmdrop(mm);
>  }
>  
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 8ac693d542f6..e102ec53c2f6 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -19,6 +19,7 @@
>  
>  #include <asm/switch_to.h>
>  #include <asm/tlb.h>
> +#include <asm/mmu.h>
>  
>  #include "../workqueue_internal.h"
>  #include "../../fs/io-wq.h"
> @@ -4501,6 +4502,81 @@ prepare_task_switch(struct rq *rq, struct task_struct *prev,
>  	prepare_arch_switch(next);
>  }
>  
> +static void mmdrop_lazy(struct rq *rq)
> +{
> +	struct mm_struct *old_mm;
> +
> +	if (likely(!READ_ONCE(rq->drop_mm)))
> +		return;
> +
> +	/*
> +	 * Slow path.  This only happens when we recently stopped using
> +	 * an mm that is exiting.
> +	 */
> +	old_mm = xchg(&rq->drop_mm, NULL);
> +	if (old_mm)
> +		mmdrop(old_mm);
> +}
> +
> +#ifndef for_each_possible_lazymm_cpu
> +#define for_each_possible_lazymm_cpu(cpu, mm) for_each_online_cpu((cpu))
> +#endif
> +
> +/*
> + * This converts all lazy_mm references to mm to mm_count refcounts.  Our
> + * caller holds an mm_count reference, so we don't need to worry about mm
> + * being freed out from under us.
> + */
> +void mm_unlazy_mm_count(struct mm_struct *mm)
> +{
> +	unsigned int drop_count = num_possible_cpus();
> +	int cpu;
> +
> +	/*
> +	 * mm_users is zero, so no cpu will set its rq->lazy_mm to mm.
> +	 */
> +	WARN_ON_ONCE(atomic_read(&mm->mm_users) != 0);
> +
> +	/* Grab enough references for the rest of this function. */
> +	atomic_add(drop_count, &mm->mm_count);
> +
> +	for_each_possible_lazymm_cpu(cpu, mm) {
> +		struct rq *rq = cpu_rq(cpu);
> +		struct mm_struct *old_mm;
> +
> +		if (smp_load_acquire(&rq->lazy_mm) != mm)
> +			continue;
> +
> +		drop_count--;	/* grab a reference; cpu will drop it later. */
> +
> +		old_mm = xchg(&rq->drop_mm, mm);
> +
> +		/*
> +		 * We know that old_mm != mm: when we did the xchg(), we were
> +		 * the only cpu to be putting mm into any drop_mm variable.
> +		 */
> +		WARN_ON_ONCE(old_mm == mm);
> +		if (unlikely(old_mm)) {
> +			/*
> +			 * We just stole an mm reference from the target CPU.
> +			 *
> +			 * drop_mm was set to old by another call to
> +			 * mm_unlazy_mm_count().  After that call xchg'd old
> +			 * into drop_mm, the target CPU did:
> +			 *
> +			 *  smp_store_release(&rq->lazy_mm, mm);
> +			 *
> +			 * which synchronized with our smp_load_acquire()
> +			 * above, so we know that the target CPU is done with
> +			 * old. Drop old on its behalf.
> +			 */
> +			mmdrop(old_mm);
> +		}
> +	}
> +
> +	atomic_sub(drop_count, &mm->mm_count);
> +}
> +
>  /**
>   * finish_task_switch - clean up after a task-switch
>   * @prev: the thread we just switched away from.
> @@ -4524,7 +4600,6 @@ static struct rq *finish_task_switch(struct task_struct *prev)
>  	__releases(rq->lock)
>  {
>  	struct rq *rq = this_rq();
> -	struct mm_struct *mm = rq->prev_mm;
>  	long prev_state;
>  
>  	/*
> @@ -4543,8 +4618,6 @@ static struct rq *finish_task_switch(struct task_struct *prev)
>  		      current->comm, current->pid, preempt_count()))
>  		preempt_count_set(FORK_PREEMPT_COUNT);
>  
> -	rq->prev_mm = NULL;
> -
>  	/*
>  	 * A task struct has one reference for the use as "current".
>  	 * If a task dies, then it sets TASK_DEAD in tsk->state and calls
> @@ -4574,22 +4647,16 @@ static struct rq *finish_task_switch(struct task_struct *prev)
>  	kmap_local_sched_in();
>  
>  	fire_sched_in_preempt_notifiers(current);
> +
>  	/*
> -	 * When switching through a kernel thread, the loop in
> -	 * membarrier_{private,global}_expedited() may have observed that
> -	 * kernel thread and not issued an IPI. It is therefore possible to
> -	 * schedule between user->kernel->user threads without passing though
> -	 * switch_mm(). Membarrier requires a barrier after storing to
> -	 * rq->curr, before returning to userspace, so provide them here:
> -	 *
> -	 * - a full memory barrier for {PRIVATE,GLOBAL}_EXPEDITED, implicitly
> -	 *   provided by mmdrop(),
> -	 * - a sync_core for SYNC_CORE.
> +	 * Do this unconditionally.  There's a race in which a remote CPU
> +	 * sees rq->lazy_mm != NULL and gives us an extra mm ref while we
> +	 * are executing this code and we don't notice.  Instead of letting
> +	 * that ref sit around until the next time we unlazy, do it on every
> +	 * context switch.
>  	 */
> -	if (mm) {
> -		membarrier_mm_sync_core_before_usermode(mm);
> -		mmdrop(mm);
> -	}
> +	mmdrop_lazy(rq);
> +
>  	if (unlikely(prev_state == TASK_DEAD)) {
>  		if (prev->sched_class->task_dead)
>  			prev->sched_class->task_dead(prev);
> @@ -4652,25 +4719,32 @@ context_switch(struct rq *rq, struct task_struct *prev,
>  
>  	/*
>  	 * kernel -> kernel   lazy + transfer active
> -	 *   user -> kernel   lazy + mmgrab() active
> +	 *   user -> kernel   lazy + lazy_mm grab active
>  	 *
> -	 * kernel ->   user   switch + mmdrop() active
> +	 * kernel ->   user   switch + lazy_mm release active
>  	 *   user ->   user   switch
>  	 */
>  	if (!next->mm) {                                // to kernel
>  		enter_lazy_tlb(prev->active_mm, next);
>  
>  		next->active_mm = prev->active_mm;
> -		if (prev->mm)                           // from user
> -			mmgrab(prev->active_mm);
> -		else
> +		if (prev->mm) {                         // from user
> +			SCHED_WARN_ON(rq->lazy_mm);
> +
> +			/*
> +			 * Acquire a lazy_mm reference to the active
> +			 * (lazy) mm.  No explicit barrier needed: we still
> +			 * hold an explicit (mm_users) reference.  __mmput()
> +			 * can't be called until we call mmput() to drop
> +			 * our reference, and __mmput() is a release barrier.
> +			 */
> +			WRITE_ONCE(rq->lazy_mm, next->active_mm);
> +		} else {
>  			prev->active_mm = NULL;
> +		}
>  	} else {                                        // to user
>  		membarrier_switch_mm(rq, prev->active_mm, next->mm);
>  		/*
> -		 * sys_membarrier() requires an smp_mb() between setting
> -		 * rq->curr / membarrier_switch_mm() and returning to userspace.
> -		 *
>  		 * The below provides this either through switch_mm(), or in
>  		 * case 'prev->active_mm == next->mm' through
>  		 * finish_task_switch()'s mmdrop().
> @@ -4678,9 +4752,19 @@ context_switch(struct rq *rq, struct task_struct *prev,
>  		switch_mm_irqs_off(prev->active_mm, next->mm, next);
>  
>  		if (!prev->mm) {                        // from kernel
> -			/* will mmdrop() in finish_task_switch(). */
> -			rq->prev_mm = prev->active_mm;
> +			/*
> +			 * Even though nothing should reference ->active_mm
> +			 * for a non-current task, don't leave a stale pointer
> +			 * to an mm that might be freed.
> +			 */
>  			prev->active_mm = NULL;
> +
> +			/*
> +			 * Drop our lazy_mm reference to the old lazy mm.
> +			 * After this, any CPU may free it if it is
> +			 * unreferenced.
> +			 */
> +			smp_store_release(&rq->lazy_mm, NULL);
>  		}
>  	}
>  
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 8f0194cee0ba..703d95a4abd0 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -966,7 +966,15 @@ struct rq {
>  	struct task_struct	*idle;
>  	struct task_struct	*stop;
>  	unsigned long		next_balance;
> -	struct mm_struct	*prev_mm;
> +
> +	/*
> +	 * Fast refcounting scheme for lazy mm.  lazy_mm is a hazard pointer:
> +	 * setting it to point to a lazily used mm keeps that mm from being
> +	 * freed.  drop_mm points to an mm that needs an mmdrop() call
> +	 * after the CPU owning the rq is done with it.
> +	 */
> +	struct mm_struct	*lazy_mm;
> +	struct mm_struct	*drop_mm;
>  
>  	unsigned int		clock_update_flags;
>  	u64			clock;

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH 4/8] membarrier: Make the post-switch-mm barrier explicit
  2021-06-18  0:06                 ` Andy Lutomirski
@ 2021-06-18  3:35                   ` Paul E. McKenney
  0 siblings, 0 replies; 165+ messages in thread
From: Paul E. McKenney @ 2021-06-18  3:35 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Nicholas Piggin, Peter Zijlstra (Intel),
	Rik van Riel, Andrew Morton, Dave Hansen,
	Linux Kernel Mailing List, linux-mm, Mathieu Desnoyers,
	the arch/x86 maintainers

On Thu, Jun 17, 2021 at 05:06:02PM -0700, Andy Lutomirski wrote:
> On 6/17/21 8:02 AM, Paul E. McKenney wrote:
> > On Wed, Jun 16, 2021 at 10:32:15PM -0700, Andy Lutomirski wrote:
> >> I would appreciate everyone's thoughts as to whether this scheme is sane.
> >>
> >> Paul, I'm adding you for two reasons.  First, you seem to enjoy bizarre locking schemes.  Secondly, because maybe RCU could actually work here.  The basic idea is that we want to keep an mm_struct from being freed at an inopportune time.  The problem with naively using RCU is that each CPU can use one single mm_struct while in an idle extended quiescent state (but not a user extended quiescent state).  So rcu_read_lock() is right out.  If RCU could understand this concept, then maybe it could help us, but this seems a bit out of scope for RCU.
> > 
> > OK, I should look at your patch, but that will be after morning meetings.
> > 
> > On RCU and idle, much of the idle code now allows rcu_read_lock() to be
> > directly, thanks to Peter's recent work.  Any sort of interrupt or NMI
> > from idle can also use rcu_read_lock(), including the IPIs that are now
> > done directly from idle.  RCU_NONIDLE() makes RCU pay attention to the
> > code supplied as its sole argument.
> > 
> > Or is your patch really having the CPU expect a mm_struct to stick around
> > across the full idle sojourn, and without the assistance of mmgrab()
> > and mmdrop()?
> 
> I really do expect it to stick around across the full idle sojourn.
> Unless RCU is more magical than I think it is, this means I can't use RCU.

You are quite correct.  And unfortunately, making RCU pay attention
across the full idle sojourn would make the battery-powered embedded
guys quite annoyed.  And would result in OOM.  You could use something
like percpu_ref, but at a large memory expense.  You could use something
like SRCU or Tasks Trace RCU, but this would increase the overhead of
freeing mm_struct structures.

Your use of per-CPU pointers seems sound in principle, but I am uncertain
of some of the corner cases.  And either current mainline gained an
mmdrop-balance bug or rcutorture is also uncertain of those corner cases.
But again, the overall concept looks quite good.  Just some bugs to
be found and fixed, whether in this patch or in current mainline.
As always...  ;-)

						Thanx, Paul

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [RFC][PATCH] sched: Use lightweight hazard pointers to grab lazy mms
  2021-06-18  3:29                 ` Paul E. McKenney
@ 2021-06-18  5:04                   ` Andy Lutomirski
  0 siblings, 0 replies; 165+ messages in thread
From: Andy Lutomirski @ 2021-06-18  5:04 UTC (permalink / raw)
  To: Paul E. McKenney, Peter Zijlstra (Intel)
  Cc: Nicholas Piggin, Rik van Riel, Andrew Morton, Dave Hansen,
	Linux Kernel Mailing List, linux-mm, Mathieu Desnoyers,
	the arch/x86 maintainers

On Thu, Jun 17, 2021, at 8:29 PM, Paul E. McKenney wrote:
> On Thu, Jun 17, 2021 at 11:08:03AM +0200, Peter Zijlstra wrote:
> > On Wed, Jun 16, 2021 at 10:32:15PM -0700, Andy Lutomirski wrote:
> > > Here it is.  Not even boot tested!
> > 
> > It is now, it even builds a kernel.. so it must be perfect :-)
> > 
> > > https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=sched/lazymm&id=ecc3992c36cb88087df9c537e2326efb51c95e31
> > 
> > Since I had to turn it into a patch to post, so that I could comment on
> > it, I've cleaned it up a little for you.
> > 
> > I'll reply to self with some notes, but I think I like it.
> 
> But rcutorture isn't too happy with it when applied to current
> mainline:
> 
> ------------------------------------------------------------------------
> [   32.559192] ------------[ cut here ]------------
> [   32.559528] WARNING: CPU: 0 PID: 175 at kernel/fork.c:686 
> __mmdrop+0x9f/0xb0
> [   32.560197] Modules linked in:
> [   32.560470] CPU: 0 PID: 175 Comm: torture_onoff Not tainted 
> 5.13.0-rc6+ #23
> [   32.561077] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 
> 1.13.0-1ubuntu1.1 04/01/2014
> [   32.561809] RIP: 0010:__mmdrop+0x9f/0xb0
> [   32.562179] Code: fb 20 75 e6 48 8b 45 68 48 85 c0 0f 85 1e 48 ad 00 
> 48 8b 3d 93 e0 c3 01 5b 48 89 ee 5d 41 5c e9 97 45 18 00 0f 0b 0f 0b eb 
> 87 <0f> 0b eb 95 48 89 ef e8 a5 f1 17 00 eb a9 0f 1f 00 48 81 ef c0 03
> [   32.563822] RSP: 0018:ffff944c40623d68 EFLAGS: 00010246
> [   32.564331] RAX: ffff8e84c2339c00 RBX: ffff8e84df5572e0 RCX: 
> 00000000fffffffa
> [   32.564978] RDX: 0000000000000000 RSI: 0000000000000033 RDI: 
> ffff8e84c29a0000
> [   32.565648] RBP: ffff8e84c29a0000 R08: ffff8e84c11c774a R09: 
> 0000000000000001
> [   32.566256] R10: ffff8e85411c773f R11: ffff8e84c11c774a R12: 
> 0000000000000057
> [   32.566909] R13: 0000000000000000 R14: ffffffffb0e487f8 R15: 
> 000000000000000d
> [   32.567584] FS:  0000000000000000(0000) GS:ffff8e84df200000(0000) 
> knlGS:0000000000000000
> [   32.568321] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   32.568860] CR2: 0000000000000000 CR3: 00000000029ec000 CR4: 
> 00000000000006f0
> [   32.569559] Call Trace:
> [   32.569804]  ? takedown_cpu+0xd0/0xd0
> [   32.570123]  finish_cpu+0x2e/0x40

Whoops, good catch!  The interaction between idle_thread_exit() and finish_cpu() is utter nonsense with my patch applied.  I need to figure out why it works the way it does in the first place and rework it to do the right thing.

> [   32.570449]  cpuhp_invoke_callback+0xf6/0x3f0
> [   32.570755]  cpuhp_invoke_callback_range+0x3b/0x80
> [   32.571137]  _cpu_down+0xdf/0x2a0
> [   32.571467]  cpu_down+0x2a/0x50
> [   32.571771]  device_offline+0x80/0xb0
> [   32.572101]  remove_cpu+0x1a/0x30
> [   32.572393]  torture_offline+0x80/0x140
> [   32.572730]  torture_onoff+0x147/0x260
> [   32.573068]  ? torture_kthread_stopping+0xa0/0xa0
> [   32.573488]  kthread+0xf9/0x130
> [   32.573777]  ? kthread_park+0x80/0x80
> [   32.574119]  ret_from_fork+0x22/0x30
> [   32.574418] ---[ end trace b77effd8aab7f902 ]---
> [   32.574819] BUG: Bad rss-counter state mm:00000000bccc5a55 
> type:MM_ANONPAGES val:1
> [   32.575450] BUG: non-zero pgtables_bytes on freeing mm: 24576
> ------------------------------------------------------------------------
> 
> Are we absolutely sure that the mmdrop()s are balanced in all cases?
> 
> 							Thanx, Paul
> 
> > ---
> >  arch/x86/include/asm/mmu.h |   5 ++
> >  include/linux/sched/mm.h   |   3 +
> >  kernel/fork.c              |   2 +
> >  kernel/sched/core.c        | 138 ++++++++++++++++++++++++++++++++++++---------
> >  kernel/sched/sched.h       |  10 +++-
> >  5 files changed, 130 insertions(+), 28 deletions(-)
> > 
> > diff --git a/arch/x86/include/asm/mmu.h b/arch/x86/include/asm/mmu.h
> > index 5d7494631ea9..ce94162168c2 100644
> > --- a/arch/x86/include/asm/mmu.h
> > +++ b/arch/x86/include/asm/mmu.h
> > @@ -66,4 +66,9 @@ typedef struct {
> >  void leave_mm(int cpu);
> >  #define leave_mm leave_mm
> >  
> > +/* On x86, mm_cpumask(mm) contains all CPUs that might be lazily using mm */
> > +#define for_each_possible_lazymm_cpu(cpu, mm) \
> > +	for_each_cpu((cpu), mm_cpumask((mm)))
> > +
> > +
> >  #endif /* _ASM_X86_MMU_H */
> > diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
> > index e24b1fe348e3..5c7eafee6fea 100644
> > --- a/include/linux/sched/mm.h
> > +++ b/include/linux/sched/mm.h
> > @@ -77,6 +77,9 @@ static inline bool mmget_not_zero(struct mm_struct *mm)
> >  
> >  /* mmput gets rid of the mappings and all user-space */
> >  extern void mmput(struct mm_struct *);
> > +
> > +extern void mm_unlazy_mm_count(struct mm_struct *mm);
> > +
> >  #ifdef CONFIG_MMU
> >  /* same as above but performs the slow path from the async context. Can
> >   * be called from the atomic context as well
> > diff --git a/kernel/fork.c b/kernel/fork.c
> > index e595e77913eb..57415cca088c 100644
> > --- a/kernel/fork.c
> > +++ b/kernel/fork.c
> > @@ -1104,6 +1104,8 @@ static inline void __mmput(struct mm_struct *mm)
> >  	}
> >  	if (mm->binfmt)
> >  		module_put(mm->binfmt->module);
> > +
> > +	mm_unlazy_mm_count(mm);
> >  	mmdrop(mm);
> >  }
> >  
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 8ac693d542f6..e102ec53c2f6 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -19,6 +19,7 @@
> >  
> >  #include <asm/switch_to.h>
> >  #include <asm/tlb.h>
> > +#include <asm/mmu.h>
> >  
> >  #include "../workqueue_internal.h"
> >  #include "../../fs/io-wq.h"
> > @@ -4501,6 +4502,81 @@ prepare_task_switch(struct rq *rq, struct task_struct *prev,
> >  	prepare_arch_switch(next);
> >  }
> >  
> > +static void mmdrop_lazy(struct rq *rq)
> > +{
> > +	struct mm_struct *old_mm;
> > +
> > +	if (likely(!READ_ONCE(rq->drop_mm)))
> > +		return;
> > +
> > +	/*
> > +	 * Slow path.  This only happens when we recently stopped using
> > +	 * an mm that is exiting.
> > +	 */
> > +	old_mm = xchg(&rq->drop_mm, NULL);
> > +	if (old_mm)
> > +		mmdrop(old_mm);
> > +}
> > +
> > +#ifndef for_each_possible_lazymm_cpu
> > +#define for_each_possible_lazymm_cpu(cpu, mm) for_each_online_cpu((cpu))
> > +#endif
> > +
> > +/*
> > + * This converts all lazy_mm references to mm to mm_count refcounts.  Our
> > + * caller holds an mm_count reference, so we don't need to worry about mm
> > + * being freed out from under us.
> > + */
> > +void mm_unlazy_mm_count(struct mm_struct *mm)
> > +{
> > +	unsigned int drop_count = num_possible_cpus();
> > +	int cpu;
> > +
> > +	/*
> > +	 * mm_users is zero, so no cpu will set its rq->lazy_mm to mm.
> > +	 */
> > +	WARN_ON_ONCE(atomic_read(&mm->mm_users) != 0);
> > +
> > +	/* Grab enough references for the rest of this function. */
> > +	atomic_add(drop_count, &mm->mm_count);
> > +
> > +	for_each_possible_lazymm_cpu(cpu, mm) {
> > +		struct rq *rq = cpu_rq(cpu);
> > +		struct mm_struct *old_mm;
> > +
> > +		if (smp_load_acquire(&rq->lazy_mm) != mm)
> > +			continue;
> > +
> > +		drop_count--;	/* grab a reference; cpu will drop it later. */
> > +
> > +		old_mm = xchg(&rq->drop_mm, mm);
> > +
> > +		/*
> > +		 * We know that old_mm != mm: when we did the xchg(), we were
> > +		 * the only cpu to be putting mm into any drop_mm variable.
> > +		 */
> > +		WARN_ON_ONCE(old_mm == mm);
> > +		if (unlikely(old_mm)) {
> > +			/*
> > +			 * We just stole an mm reference from the target CPU.
> > +			 *
> > +			 * drop_mm was set to old by another call to
> > +			 * mm_unlazy_mm_count().  After that call xchg'd old
> > +			 * into drop_mm, the target CPU did:
> > +			 *
> > +			 *  smp_store_release(&rq->lazy_mm, mm);
> > +			 *
> > +			 * which synchronized with our smp_load_acquire()
> > +			 * above, so we know that the target CPU is done with
> > +			 * old. Drop old on its behalf.
> > +			 */
> > +			mmdrop(old_mm);
> > +		}
> > +	}
> > +
> > +	atomic_sub(drop_count, &mm->mm_count);
> > +}
> > +
> >  /**
> >   * finish_task_switch - clean up after a task-switch
> >   * @prev: the thread we just switched away from.
> > @@ -4524,7 +4600,6 @@ static struct rq *finish_task_switch(struct task_struct *prev)
> >  	__releases(rq->lock)
> >  {
> >  	struct rq *rq = this_rq();
> > -	struct mm_struct *mm = rq->prev_mm;
> >  	long prev_state;
> >  
> >  	/*
> > @@ -4543,8 +4618,6 @@ static struct rq *finish_task_switch(struct task_struct *prev)
> >  		      current->comm, current->pid, preempt_count()))
> >  		preempt_count_set(FORK_PREEMPT_COUNT);
> >  
> > -	rq->prev_mm = NULL;
> > -
> >  	/*
> >  	 * A task struct has one reference for the use as "current".
> >  	 * If a task dies, then it sets TASK_DEAD in tsk->state and calls
> > @@ -4574,22 +4647,16 @@ static struct rq *finish_task_switch(struct task_struct *prev)
> >  	kmap_local_sched_in();
> >  
> >  	fire_sched_in_preempt_notifiers(current);
> > +
> >  	/*
> > -	 * When switching through a kernel thread, the loop in
> > -	 * membarrier_{private,global}_expedited() may have observed that
> > -	 * kernel thread and not issued an IPI. It is therefore possible to
> > -	 * schedule between user->kernel->user threads without passing though
> > -	 * switch_mm(). Membarrier requires a barrier after storing to
> > -	 * rq->curr, before returning to userspace, so provide them here:
> > -	 *
> > -	 * - a full memory barrier for {PRIVATE,GLOBAL}_EXPEDITED, implicitly
> > -	 *   provided by mmdrop(),
> > -	 * - a sync_core for SYNC_CORE.
> > +	 * Do this unconditionally.  There's a race in which a remote CPU
> > +	 * sees rq->lazy_mm != NULL and gives us an extra mm ref while we
> > +	 * are executing this code and we don't notice.  Instead of letting
> > +	 * that ref sit around until the next time we unlazy, do it on every
> > +	 * context switch.
> >  	 */
> > -	if (mm) {
> > -		membarrier_mm_sync_core_before_usermode(mm);
> > -		mmdrop(mm);
> > -	}
> > +	mmdrop_lazy(rq);
> > +
> >  	if (unlikely(prev_state == TASK_DEAD)) {
> >  		if (prev->sched_class->task_dead)
> >  			prev->sched_class->task_dead(prev);
> > @@ -4652,25 +4719,32 @@ context_switch(struct rq *rq, struct task_struct *prev,
> >  
> >  	/*
> >  	 * kernel -> kernel   lazy + transfer active
> > -	 *   user -> kernel   lazy + mmgrab() active
> > +	 *   user -> kernel   lazy + lazy_mm grab active
> >  	 *
> > -	 * kernel ->   user   switch + mmdrop() active
> > +	 * kernel ->   user   switch + lazy_mm release active
> >  	 *   user ->   user   switch
> >  	 */
> >  	if (!next->mm) {                                // to kernel
> >  		enter_lazy_tlb(prev->active_mm, next);
> >  
> >  		next->active_mm = prev->active_mm;
> > -		if (prev->mm)                           // from user
> > -			mmgrab(prev->active_mm);
> > -		else
> > +		if (prev->mm) {                         // from user
> > +			SCHED_WARN_ON(rq->lazy_mm);
> > +
> > +			/*
> > +			 * Acquire a lazy_mm reference to the active
> > +			 * (lazy) mm.  No explicit barrier needed: we still
> > +			 * hold an explicit (mm_users) reference.  __mmput()
> > +			 * can't be called until we call mmput() to drop
> > +			 * our reference, and __mmput() is a release barrier.
> > +			 */
> > +			WRITE_ONCE(rq->lazy_mm, next->active_mm);
> > +		} else {
> >  			prev->active_mm = NULL;
> > +		}
> >  	} else {                                        // to user
> >  		membarrier_switch_mm(rq, prev->active_mm, next->mm);
> >  		/*
> > -		 * sys_membarrier() requires an smp_mb() between setting
> > -		 * rq->curr / membarrier_switch_mm() and returning to userspace.
> > -		 *
> >  		 * The below provides this either through switch_mm(), or in
> >  		 * case 'prev->active_mm == next->mm' through
> >  		 * finish_task_switch()'s mmdrop().
> > @@ -4678,9 +4752,19 @@ context_switch(struct rq *rq, struct task_struct *prev,
> >  		switch_mm_irqs_off(prev->active_mm, next->mm, next);
> >  
> >  		if (!prev->mm) {                        // from kernel
> > -			/* will mmdrop() in finish_task_switch(). */
> > -			rq->prev_mm = prev->active_mm;
> > +			/*
> > +			 * Even though nothing should reference ->active_mm
> > +			 * for a non-current task, don't leave a stale pointer
> > +			 * to an mm that might be freed.
> > +			 */
> >  			prev->active_mm = NULL;
> > +
> > +			/*
> > +			 * Drop our lazy_mm reference to the old lazy mm.
> > +			 * After this, any CPU may free it if it is
> > +			 * unreferenced.
> > +			 */
> > +			smp_store_release(&rq->lazy_mm, NULL);
> >  		}
> >  	}
> >  
> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > index 8f0194cee0ba..703d95a4abd0 100644
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -966,7 +966,15 @@ struct rq {
> >  	struct task_struct	*idle;
> >  	struct task_struct	*stop;
> >  	unsigned long		next_balance;
> > -	struct mm_struct	*prev_mm;
> > +
> > +	/*
> > +	 * Fast refcounting scheme for lazy mm.  lazy_mm is a hazard pointer:
> > +	 * setting it to point to a lazily used mm keeps that mm from being
> > +	 * freed.  drop_mm points to an mm that needs an mmdrop() call
> > +	 * after the CPU owning the rq is done with it.
> > +	 */
> > +	struct mm_struct	*lazy_mm;
> > +	struct mm_struct	*drop_mm;
> >  
> >  	unsigned int		clock_update_flags;
> >  	u64			clock;
> 

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH 7/8] membarrier: Remove arm (32) support for SYNC_CORE
  2021-06-16 16:27                       ` Russell King (Oracle)
  (?)
@ 2021-06-18 12:54                         ` Linus Walleij
  -1 siblings, 0 replies; 165+ messages in thread
From: Linus Walleij @ 2021-06-18 12:54 UTC (permalink / raw)
  To: Russell King (Oracle)
  Cc: Catalin Marinas, Krzysztof Halasa, Neil Armstrong,
	Peter Zijlstra, Andy Lutomirski,
	maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Dave Hansen, LKML, Linux Memory Management List, Andrew Morton,
	Mathieu Desnoyers, Nicholas Piggin, Linux ARM, Will Deacon

On Wed, Jun 16, 2021 at 6:27 PM Russell King (Oracle)
<linux@armlinux.org.uk> wrote:

> Arnd tells me that the current remaining ARM11MPCore users are:
> - CNS3xxx (where there is some marginal interest in the Gateworks
>   Laguna platform)
> - Similar for OXNAS
> - There used to be the Realview MPCore tile - I haven't turned that on
>   in ages, and it may be that the 3V cell that backs up the encryption
>   keys is dead so it may not even boot.

I have this machine with 4 x ARM11 MPCore, it works like a charm.
I use it to test exactly this kind of stuff, I know if a kernel works
on ARM11MPCore it works on anything because of how fragile
it is.

> So it seems to come down to a question about CNS3xxx and OXNAS. If
> these aren't being used, maybe we can drop ARM11MPCore support and
> the associated platforms?
>
> Linus, Krzysztof, Neil, any input?

I don't especially need to keep the ARM11MPCore machine alive,
it is just a testchip after all. The Oxnas is another story, that has wide
deployment and was contributed recently (2016) and has excellent
support in OpenWrt so I wouldn't really want
to axe that.

Yours,
Linus Walleij

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH 7/8] membarrier: Remove arm (32) support for SYNC_CORE
  2021-06-18 12:54                         ` Linus Walleij
@ 2021-06-18 13:19                           ` Russell King (Oracle)
  -1 siblings, 0 replies; 165+ messages in thread
From: Russell King (Oracle) @ 2021-06-18 13:19 UTC (permalink / raw)
  To: Linus Walleij
  Cc: Catalin Marinas, Krzysztof Halasa, Neil Armstrong,
	Peter Zijlstra, Andy Lutomirski,
	maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Dave Hansen, LKML, Linux Memory Management List, Andrew Morton,
	Mathieu Desnoyers, Nicholas Piggin, Linux ARM, Will Deacon

On Fri, Jun 18, 2021 at 02:54:05PM +0200, Linus Walleij wrote:
> On Wed, Jun 16, 2021 at 6:27 PM Russell King (Oracle)
> <linux@armlinux.org.uk> wrote:
> 
> > Arnd tells me that the current remaining ARM11MPCore users are:
> > - CNS3xxx (where there is some marginal interest in the Gateworks
> >   Laguna platform)
> > - Similar for OXNAS
> > - There used to be the Realview MPCore tile - I haven't turned that on
> >   in ages, and it may be that the 3V cell that backs up the encryption
> >   keys is dead so it may not even boot.
> 
> I have this machine with 4 x ARM11 MPCore, it works like a charm.
> I use it to test exactly this kind of stuff, I know if a kernel works
> on ARM11MPCore it works on anything because of how fragile
> it is.
> 
> > So it seems to come down to a question about CNS3xxx and OXNAS. If
> > these aren't being used, maybe we can drop ARM11MPCore support and
> > the associated platforms?
> >
> > Linus, Krzysztof, Neil, any input?
> 
> I don't especially need to keep the ARM11MPCore machine alive,
> it is just a testchip after all. The Oxnas is another story, that has wide
> deployment and was contributed recently (2016) and has excellent
> support in OpenWrt so I wouldn't really want
> to axe that.

So I suppose the next question is... are these issues (with userland
self-modifying code and kernel module loading) entirely theoretical
or can they be produced on real hardware?

If they can't be produced on real hardware and we attempt to fix them,
how do we know that the fix has worked...

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH 7/8] membarrier: Remove arm (32) support for SYNC_CORE
  2021-06-18 12:54                         ` Linus Walleij
  (?)
@ 2021-06-18 13:36                           ` Arnd Bergmann
  -1 siblings, 0 replies; 165+ messages in thread
From: Arnd Bergmann @ 2021-06-18 13:36 UTC (permalink / raw)
  To: Linus Walleij
  Cc: Russell King (Oracle),
	Catalin Marinas, Krzysztof Halasa, Neil Armstrong,
	Peter Zijlstra, Andy Lutomirski,
	maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Dave Hansen, LKML, Linux Memory Management List, Andrew Morton,
	Mathieu Desnoyers, Nicholas Piggin, Linux ARM, Will Deacon,
	Daniel Golle

On Fri, Jun 18, 2021 at 2:54 PM Linus Walleij <linus.walleij@linaro.org> wrote:
> On Wed, Jun 16, 2021 at 6:27 PM Russell King (Oracle) <linux@armlinux.org.uk> wrote:

> > So it seems to come down to a question about CNS3xxx and OXNAS. If
> > these aren't being used, maybe we can drop ARM11MPCore support and
> > the associated platforms?
> >
> > Linus, Krzysztof, Neil, any input?
>
> I don't especially need to keep the ARM11MPCore machine alive,
> it is just a testchip after all. The Oxnas is another story, that has wide
> deployment and was contributed recently (2016) and has excellent
> support in OpenWrt so I wouldn't really want to axe that.

Agreed, as long as oxnas and/or cns3xxx are around, we should just keep
the realview 11mpcore support, but if both of the commercial platforms
are gone, then the realview can be retired as far as I'm concerned.

Regarding oxnas, I see that OpenWRT has a number of essential
device drivers (sata, pcie, usb and reset) that look like they could just
be merged upstream, but that effort appears to have stalled: no
device support was added to the dts files since the original 2016
merge. While the support in OpenWRT may be excellent, the platform
support in the mainline kernel is limited to ethernet, nand, uart
and gpio.

       Arnd

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH 8/8] membarrier: Rewrite sync_core_before_usermode() and improve documentation
  2021-06-16 18:52       ` Andy Lutomirski
  (?)
@ 2021-06-18 15:27         ` Christophe Leroy
  -1 siblings, 0 replies; 165+ messages in thread
From: Christophe Leroy @ 2021-06-18 15:27 UTC (permalink / raw)
  To: Andy Lutomirski, Nicholas Piggin, x86
  Cc: Will Deacon, linux-mm, Peter Zijlstra, LKML, stable, Dave Hansen,
	Mathieu Desnoyers, Catalin Marinas, Paul Mackerras,
	Andrew Morton, linuxppc-dev, linux-arm-kernel



On 16/06/2021 at 20:52, Andy Lutomirski wrote:
> On 6/15/21 9:45 PM, Nicholas Piggin wrote:
>> Excerpts from Andy Lutomirski's message of June 16, 2021 1:21 pm:
>>> The old sync_core_before_usermode() comments suggested that a non-icache-syncing
>>> return-to-usermode instruction is x86-specific and that all other
>>> architectures automatically notice cross-modified code on return to
>>> userspace.
> 
>>> +/*
>>> + * XXX: can a powerpc person put an appropriate comment here?
>>> + */
>>> +static inline void membarrier_sync_core_before_usermode(void)
>>> +{
>>> +}
>>> +
>>> +#endif /* _ASM_POWERPC_SYNC_CORE_H */
>>
>> powerpc's can just go in asm/membarrier.h
> 
> $ ls arch/powerpc/include/asm/membarrier.h
> ls: cannot access 'arch/powerpc/include/asm/membarrier.h': No such file
> or directory

https://github.com/torvalds/linux/blob/master/arch/powerpc/include/asm/membarrier.h


Was added by https://github.com/torvalds/linux/commit/3ccfebedd8cf54e291c809c838d8ad5cc00f5688

> 
> 
>>
>> /*
>>   * The RFI family of instructions are context synchronising, and
>>   * that is how we return to userspace, so nothing is required here.
>>   */
> 
> Thanks!
> 

^ permalink raw reply	[flat|nested] 165+ messages in thread

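Putting the quoted pieces together, the powerpc stub from Andy's patch would
presumably just carry Nick's wording, along these lines (an illustrative
sketch only, independent of whether it ends up in asm/sync_core.h or the
existing asm/membarrier.h):

#ifndef _ASM_POWERPC_SYNC_CORE_H
#define _ASM_POWERPC_SYNC_CORE_H

/*
 * The RFI family of instructions are context synchronising, and
 * that is how we return to userspace, so nothing is required here.
 */
static inline void membarrier_sync_core_before_usermode(void)
{
}

#endif /* _ASM_POWERPC_SYNC_CORE_H */
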
* Re: [PATCH 8/8] membarrier: Rewrite sync_core_before_usermode() and improve documentation
  2021-06-18  0:12       ` Andy Lutomirski
  (?)
  (?)
@ 2021-06-18 16:31         ` Mathieu Desnoyers
  -1 siblings, 0 replies; 165+ messages in thread
From: Mathieu Desnoyers @ 2021-06-18 16:31 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: x86, Dave Hansen, linux-kernel, linux-mm, Andrew Morton,
	Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras,
	linuxppc-dev, Nicholas Piggin, Catalin Marinas, Will Deacon,
	linux-arm-kernel, Peter Zijlstra, stable

----- On Jun 17, 2021, at 8:12 PM, Andy Lutomirski luto@kernel.org wrote:

> On 6/17/21 7:47 AM, Mathieu Desnoyers wrote:
> 
>> Please change back this #ifndef / #else / #endif within function for
>> 
>> if (!IS_ENABLED(CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE)) {
>>   ...
>> } else {
>>   ...
>> }
>> 
>> I don't think mixing up preprocessor and code logic makes it more readable.
> 
> I agree, but I don't know how to make the result work well.
> membarrier_sync_core_before_usermode() isn't defined in the !IS_ENABLED
> case, so either I need to fake up a definition or use #ifdef.
> 
> If I faked up a definition, I would want to assert, at build time, that
> it isn't called.  I don't think we can do:
> 
> static void membarrier_sync_core_before_usermode()
> {
>    BUILD_BUG_IF_REACHABLE();
> }

Let's look at the context here:

static void ipi_sync_core(void *info)
{
    [....]
    membarrier_sync_core_before_usermode()
}

^ this can be within #ifdef / #endif

static int membarrier_private_expedited(int flags, int cpu_id)
[...]
               if (!IS_ENABLED(CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE))
                        return -EINVAL;
                if (!(atomic_read(&mm->membarrier_state) &
                      MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE_READY))
                        return -EPERM;
                ipi_func = ipi_sync_core;

All we need to make the line above work is to define an empty ipi_sync_core
function in the #else case after the ipi_sync_core() function definition.

Or am I missing your point ?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 165+ messages in thread

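Spelled out, the arrangement Mathieu describes looks roughly like this (the
barrier comment in the enabled arm is abbreviated here; the empty stub in the
#else arm is the part Andy pushes back on below):

#ifdef CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE
static void ipi_sync_core(void *info)
{
	/*
	 * Order prior memory accesses before syncing the core; IPIs are
	 * expected to be serializing, but be paranoid.
	 */
	smp_mb();
	membarrier_sync_core_before_usermode();
}
#else
/*
 * Never reached at runtime: membarrier_private_expedited() returns
 * -EINVAL for the SYNC_CORE case before ipi_func can be pointed here.
 */
static void ipi_sync_core(void *info)
{
}
#endif
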
* Re: [PATCH 8/8] membarrier: Rewrite sync_core_before_usermode() and improve documentation
  2021-06-18 16:31         ` Mathieu Desnoyers
  (?)
@ 2021-06-18 19:58           ` Andy Lutomirski
  -1 siblings, 0 replies; 165+ messages in thread
From: Andy Lutomirski @ 2021-06-18 19:58 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: the arch/x86 maintainers, Dave Hansen, Linux Kernel Mailing List,
	linux-mm, Andrew Morton, Michael Ellerman,
	Benjamin Herrenschmidt, Paul Mackerras, linuxppc-dev,
	Nicholas Piggin, Catalin Marinas, Will Deacon, linux-arm-kernel,
	Peter Zijlstra (Intel),
	stable



On Fri, Jun 18, 2021, at 9:31 AM, Mathieu Desnoyers wrote:
> ----- On Jun 17, 2021, at 8:12 PM, Andy Lutomirski luto@kernel.org wrote:
> 
> > On 6/17/21 7:47 AM, Mathieu Desnoyers wrote:
> > 
> >> Please change back this #ifndef / #else / #endif within function for
> >> 
> >> if (!IS_ENABLED(CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE)) {
> >>   ...
> >> } else {
> >>   ...
> >> }
> >> 
> >> I don't think mixing up preprocessor and code logic makes it more readable.
> > 
> > I agree, but I don't know how to make the result work well.
> > membarrier_sync_core_before_usermode() isn't defined in the !IS_ENABLED
> > case, so either I need to fake up a definition or use #ifdef.
> > 
> > If I faked up a definition, I would want to assert, at build time, that
> > it isn't called.  I don't think we can do:
> > 
> > static void membarrier_sync_core_before_usermode()
> > {
> >    BUILD_BUG_IF_REACHABLE();
> > }
> 
> Let's look at the context here:
> 
> static void ipi_sync_core(void *info)
> {
>     [....]
>     membarrier_sync_core_before_usermode()
> }
> 
> ^ this can be within #ifdef / #endif
> 
> static int membarrier_private_expedited(int flags, int cpu_id)
> [...]
>                if (!IS_ENABLED(CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE))
>                         return -EINVAL;
>                 if (!(atomic_read(&mm->membarrier_state) &
>                       MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE_READY))
>                         return -EPERM;
>                 ipi_func = ipi_sync_core;
> 
> All we need to make the line above work is to define an empty ipi_sync_core
> function in the #else case after the ipi_sync_core() function definition.
> 
> Or am I missing your point ?

Maybe?

My objection is that an empty ipi_sync_core is a lie — it doesn’t sync the core.  I would be fine with that if I could have the compiler statically verify that it’s not called, but I’m uncomfortable having it there if the implementation is actively incorrect.

^ permalink raw reply	[flat|nested] 165+ messages in thread

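As an aside, the kernel's existing BUILD_BUG() (include/linux/build_bug.h)
gets close to the check Andy is asking for: with optimization enabled it
breaks the build unless the compiler can prove the call site dead, so a stub
like the sketch below only compiles if the IS_ENABLED() test really does
eliminate every reference to it.  A sketch, not a tested proposal:

static void ipi_sync_core(void *info)
{
	/*
	 * Build-time assertion that this stub is never referenced: if
	 * "ipi_func = ipi_sync_core" survives dead-code elimination when
	 * CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE is off, the build fails
	 * here instead of silently skipping the core sync.
	 */
	BUILD_BUG();
}
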
* Re: [PATCH 8/8] membarrier: Rewrite sync_core_before_usermode() and improve documentation
  2021-06-18 19:58           ` Andy Lutomirski
  (?)
  (?)
@ 2021-06-18 20:09             ` Mathieu Desnoyers
  -1 siblings, 0 replies; 165+ messages in thread
From: Mathieu Desnoyers @ 2021-06-18 20:09 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: x86, Dave Hansen, linux-kernel, linux-mm, Andrew Morton,
	Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras,
	linuxppc-dev, Nicholas Piggin, Catalin Marinas, Will Deacon,
	linux-arm-kernel, Peter Zijlstra, stable

----- On Jun 18, 2021, at 3:58 PM, Andy Lutomirski luto@kernel.org wrote:

> On Fri, Jun 18, 2021, at 9:31 AM, Mathieu Desnoyers wrote:
>> ----- On Jun 17, 2021, at 8:12 PM, Andy Lutomirski luto@kernel.org wrote:
>> 
>> > On 6/17/21 7:47 AM, Mathieu Desnoyers wrote:
>> > 
>> >> Please change back this #ifndef / #else / #endif within function for
>> >> 
>> >> if (!IS_ENABLED(CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE)) {
>> >>   ...
>> >> } else {
>> >>   ...
>> >> }
>> >> 
>> >> I don't think mixing up preprocessor and code logic makes it more readable.
>> > 
>> > I agree, but I don't know how to make the result work well.
>> > membarrier_sync_core_before_usermode() isn't defined in the !IS_ENABLED
>> > case, so either I need to fake up a definition or use #ifdef.
>> > 
>> > If I faked up a definition, I would want to assert, at build time, that
>> > it isn't called.  I don't think we can do:
>> > 
>> > static void membarrier_sync_core_before_usermode()
>> > {
>> >    BUILD_BUG_IF_REACHABLE();
>> > }
>> 
>> Let's look at the context here:
>> 
>> static void ipi_sync_core(void *info)
>> {
>>     [....]
>>     membarrier_sync_core_before_usermode()
>> }
>> 
>> ^ this can be within #ifdef / #endif
>> 
>> static int membarrier_private_expedited(int flags, int cpu_id)
>> [...]
>>                if (!IS_ENABLED(CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE))
>>                         return -EINVAL;
>>                 if (!(atomic_read(&mm->membarrier_state) &
>>                       MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE_READY))
>>                         return -EPERM;
>>                 ipi_func = ipi_sync_core;
>> 
>> All we need to make the line above work is to define an empty ipi_sync_core
>> function in the #else case after the ipi_sync_core() function definition.
>> 
>> Or am I missing your point ?
> 
> Maybe?
> 
> My objection is that an empty ipi_sync_core is a lie — it doesn’t sync the core.
> I would be fine with that if I could have the compiler statically verify that
> it’s not called, but I’m uncomfortable having it there if the implementation is
> actively incorrect.

I see. Another approach would be to implement a "setter" function to populate
"ipi_func". That setter function would return -EINVAL in its #ifndef CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE
implementation.

Would that be better ?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 165+ messages in thread

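A rough sketch of that setter idea (the helper name and signature here are
made up purely for illustration):

#ifdef CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE
static int membarrier_get_sync_core_ipi(smp_call_func_t *ipi_func)
{
	*ipi_func = ipi_sync_core;
	return 0;
}
#else
static int membarrier_get_sync_core_ipi(smp_call_func_t *ipi_func)
{
	/* SYNC_CORE is not supported on this architecture. */
	return -EINVAL;
}
#endif

membarrier_private_expedited() would then fail the SYNC_CORE case with
whatever this helper returns, and ipi_sync_core() itself could live entirely
inside the #ifdef.
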
* Re: [PATCH 4/8] membarrier: Make the post-switch-mm barrier explicit
  2021-06-17 23:49                 ` Andy Lutomirski
@ 2021-06-19  2:53                   ` Nicholas Piggin
  2021-06-19  3:20                     ` Andy Lutomirski
  0 siblings, 1 reply; 165+ messages in thread
From: Nicholas Piggin @ 2021-06-19  2:53 UTC (permalink / raw)
  To: Andy Lutomirski, Peter Zijlstra (Intel), Rik van Riel
  Cc: Andrew Morton, Dave Hansen, Linux Kernel Mailing List, linux-mm,
	Mathieu Desnoyers, Paul E. McKenney, the arch/x86 maintainers

Excerpts from Andy Lutomirski's message of June 18, 2021 9:49 am:
> On 6/16/21 11:51 PM, Nicholas Piggin wrote:
>> Excerpts from Andy Lutomirski's message of June 17, 2021 3:32 pm:
>>> On Wed, Jun 16, 2021, at 7:57 PM, Andy Lutomirski wrote:
>>>>
>>>>
>>>> On Wed, Jun 16, 2021, at 6:37 PM, Nicholas Piggin wrote:
>>>>> Excerpts from Andy Lutomirski's message of June 17, 2021 4:41 am:
>>>>>> On 6/16/21 12:35 AM, Peter Zijlstra wrote:
>>>>>>> On Wed, Jun 16, 2021 at 02:19:49PM +1000, Nicholas Piggin wrote:
>>>>>>>> Excerpts from Andy Lutomirski's message of June 16, 2021 1:21 pm:
>>>>>>>>> membarrier() needs a barrier after any CPU changes mm.  There is currently
>>>>>>>>> a comment explaining why this barrier probably exists in all cases.  This
>>>>>>>>> is very fragile -- any change to the relevant parts of the scheduler
>>>>>>>>> might get rid of these barriers, and it's not really clear to me that
>>>>>>>>> the barrier actually exists in all necessary cases.
>>>>>>>>
>>>>>>>> The comments and barriers in the mmdrop() hunks? I don't see what is 
>>>>>>>> fragile or maybe-buggy about this. The barrier definitely exists.
>>>>>>>>
>>>>>>>> And any change can change anything, that doesn't make it fragile. My
>>>>>>>> lazy tlb refcounting change avoids the mmdrop in some cases, but it
>>>>>>>> replaces it with smp_mb for example.
>>>>>>>
>>>>>>> I'm with Nick again, on this. You're adding extra barriers for no
>>>>>>> discernible reason, that's not generally encouraged, seeing how extra
>>>>>>> barriers is extra slow.
>>>>>>>
>>>>>>> Both mmdrop() itself, as well as the callsite have comments saying how
>>>>>>> membarrier relies on the implied barrier, what's fragile about that?
>>>>>>>
>>>>>>
>>>>>> My real motivation is that mmgrab() and mmdrop() don't actually need to
>>>>>> be full barriers.  The current implementation has them being full
>>>>>> barriers, and the current implementation is quite slow.  So let's try
>>>>>> that commit message again:
>>>>>>
>>>>>> membarrier() needs a barrier after any CPU changes mm.  There is currently
>>>>>> a comment explaining why this barrier probably exists in all cases. The
>>>>>> logic is based on ensuring that the barrier exists on every control flow
>>>>>> path through the scheduler.  It also relies on mmgrab() and mmdrop() being
>>>>>> full barriers.
>>>>>>
>>>>>> mmgrab() and mmdrop() would be better if they were not full barriers.  As a
>>>>>> trivial optimization, mmgrab() could use a relaxed atomic and mmdrop()
>>>>>> could use a release on architectures that have these operations.
>>>>>
>>>>> I'm not against the idea, I've looked at something similar before (not
>>>>> for mmdrop but a different primitive). Also my lazy tlb shootdown series 
>>>>> could possibly take advantage of this, I might cherry pick it and test 
>>>>> performance :)
>>>>>
>>>>> I don't think it belongs in this series though. Should go together with
>>>>> something that takes advantage of it.
>>>>
>>>> I’m going to see if I can get hazard pointers into shape quickly.
>>>
>>> Here it is.  Not even boot tested!
>>>
>>> https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=sched/lazymm&id=ecc3992c36cb88087df9c537e2326efb51c95e31
>>>
>>> Nick, I think you can accomplish much the same thing as your patch by:
>>>
>>> #define for_each_possible_lazymm_cpu while (false)
>> 
>> I'm not sure what you mean? For powerpc, other CPUs can be using the mm 
>> as lazy at this point. I must be missing something.
> 
> What I mean is: if you want to shoot down lazies instead of doing the
> hazard pointer trick to track them, you could do:
> 
> #define for_each_possible_lazymm_cpu while (false)
> 
> which would promise to the core code that you don't have any lazies left
> by the time exit_mmap() is done.  You might need a new hook in
> exit_mmap() depending on exactly how you implement the lazy shootdown.

Oh for configuring it away entirely. I'll have to see how it falls out, 
I suspect we'd want to just no-op that entire function and avoid the 2 
atomics if we are taking care of our lazy mms with shootdowns.
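
(Sketched purely for illustration, using the bare define quoted above;
the helper name and loop body are placeholders, not code from either
series:)

/* Arch promises exit_mmap() leaves no lazy users behind: */
#define for_each_possible_lazymm_cpu	while (false)

static void mm_release_lazy_refs(struct mm_struct *mm)
{
	for_each_possible_lazymm_cpu {
		/* Dead under the definition above: the compiler drops
		 * this block, and its atomics, entirely. */
		mmdrop(mm);
	}
}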

The more important thing would be the context switch fast path, but even 
there, there's really no reason why the two approaches couldn't be made 
to both work with some careful helper functions or structuring of the 
code.

Thanks,
Nick

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH 4/8] membarrier: Make the post-switch-mm barrier explicit
  2021-06-19  2:53                   ` Nicholas Piggin
@ 2021-06-19  3:20                     ` Andy Lutomirski
  2021-06-19  4:27                       ` Nicholas Piggin
  0 siblings, 1 reply; 165+ messages in thread
From: Andy Lutomirski @ 2021-06-19  3:20 UTC (permalink / raw)
  To: Nicholas Piggin, Peter Zijlstra (Intel), Rik van Riel
  Cc: Andrew Morton, Dave Hansen, Linux Kernel Mailing List, linux-mm,
	Mathieu Desnoyers, Paul E. McKenney, the arch/x86 maintainers



On Fri, Jun 18, 2021, at 7:53 PM, Nicholas Piggin wrote:
> Excerpts from Andy Lutomirski's message of June 18, 2021 9:49 am:
> > On 6/16/21 11:51 PM, Nicholas Piggin wrote:
> >> Excerpts from Andy Lutomirski's message of June 17, 2021 3:32 pm:
> >>> On Wed, Jun 16, 2021, at 7:57 PM, Andy Lutomirski wrote:
> >>>>
> >>>>
> >>>> On Wed, Jun 16, 2021, at 6:37 PM, Nicholas Piggin wrote:
> >>>>> Excerpts from Andy Lutomirski's message of June 17, 2021 4:41 am:
> >>>>>> On 6/16/21 12:35 AM, Peter Zijlstra wrote:
> >>>>>>> On Wed, Jun 16, 2021 at 02:19:49PM +1000, Nicholas Piggin wrote:
> >>>>>>>> Excerpts from Andy Lutomirski's message of June 16, 2021 1:21 pm:
> >>>>>>>>> membarrier() needs a barrier after any CPU changes mm.  There is currently
> >>>>>>>>> a comment explaining why this barrier probably exists in all cases.  This
> >>>>>>>>> is very fragile -- any change to the relevant parts of the scheduler
> >>>>>>>>> might get rid of these barriers, and it's not really clear to me that
> >>>>>>>>> the barrier actually exists in all necessary cases.
> >>>>>>>>
> >>>>>>>> The comments and barriers in the mmdrop() hunks? I don't see what is 
> >>>>>>>> fragile or maybe-buggy about this. The barrier definitely exists.
> >>>>>>>>
> >>>>>>>> And any change can change anything, that doesn't make it fragile. My
> >>>>>>>> lazy tlb refcounting change avoids the mmdrop in some cases, but it
> >>>>>>>> replaces it with smp_mb for example.
> >>>>>>>
> >>>>>>> I'm with Nick again, on this. You're adding extra barriers for no
> >>>>>>> discernible reason, that's not generally encouraged, seeing how extra
> >>>>>>> barriers is extra slow.
> >>>>>>>
> >>>>>>> Both mmdrop() itself, as well as the callsite have comments saying how
> >>>>>>> membarrier relies on the implied barrier, what's fragile about that?
> >>>>>>>
> >>>>>>
> >>>>>> My real motivation is that mmgrab() and mmdrop() don't actually need to
> >>>>>> be full barriers.  The current implementation has them being full
> >>>>>> barriers, and the current implementation is quite slow.  So let's try
> >>>>>> that commit message again:
> >>>>>>
> >>>>>> membarrier() needs a barrier after any CPU changes mm.  There is currently
> >>>>>> a comment explaining why this barrier probably exists in all cases. The
> >>>>>> logic is based on ensuring that the barrier exists on every control flow
> >>>>>> path through the scheduler.  It also relies on mmgrab() and mmdrop() being
> >>>>>> full barriers.
> >>>>>>
> >>>>>> mmgrab() and mmdrop() would be better if they were not full barriers.  As a
> >>>>>> trivial optimization, mmgrab() could use a relaxed atomic and mmdrop()
> >>>>>> could use a release on architectures that have these operations.
> >>>>>
> >>>>> I'm not against the idea, I've looked at something similar before (not
> >>>>> for mmdrop but a different primitive). Also my lazy tlb shootdown series 
> >>>>> could possibly take advantage of this, I might cherry pick it and test 
> >>>>> performance :)
> >>>>>
> >>>>> I don't think it belongs in this series though. Should go together with
> >>>>> something that takes advantage of it.
> >>>>
> >>>> I’m going to see if I can get hazard pointers into shape quickly.
> >>>
> >>> Here it is.  Not even boot tested!
> >>>
> >>> https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=sched/lazymm&id=ecc3992c36cb88087df9c537e2326efb51c95e31
> >>>
> >>> Nick, I think you can accomplish much the same thing as your patch by:
> >>>
> >>> #define for_each_possible_lazymm_cpu while (false)
> >> 
> >> I'm not sure what you mean? For powerpc, other CPUs can be using the mm 
> >> as lazy at this point. I must be missing something.
> > 
> > What I mean is: if you want to shoot down lazies instead of doing the
> > hazard pointer trick to track them, you could do:
> > 
> > #define for_each_possible_lazymm_cpu while (false)
> > 
> > which would promise to the core code that you don't have any lazies left
> > by the time exit_mmap() is done.  You might need a new hook in
> > exit_mmap() depending on exactly how you implement the lazy shootdown.
> 
> Oh for configuring it away entirely. I'll have to see how it falls out, 
> I suspect we'd want to just no-op that entire function and avoid the 2 
> atomics if we are taking care of our lazy mms with shootdowns.

Do you mean the smp_store_release()?  On x86 and similar architectures, that’s almost free.  I’m also not convinced it needs to be a real release.

> 
> The more important thing would be the context switch fast path, but even 
> there, there's really no reason why the two approaches couldn't be made 
> to both work with some careful helper functions or structuring of the 
> code.
> 
> Thanks,
> Nick
> 

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH 4/8] membarrier: Make the post-switch-mm barrier explicit
  2021-06-19  3:20                     ` Andy Lutomirski
@ 2021-06-19  4:27                       ` Nicholas Piggin
  0 siblings, 0 replies; 165+ messages in thread
From: Nicholas Piggin @ 2021-06-19  4:27 UTC (permalink / raw)
  To: Andy Lutomirski, Peter Zijlstra (Intel), Rik van Riel
  Cc: Andrew Morton, Dave Hansen, Linux Kernel Mailing List, linux-mm,
	Mathieu Desnoyers, Paul E. McKenney, the arch/x86 maintainers

Excerpts from Andy Lutomirski's message of June 19, 2021 1:20 pm:
> 
> 
> On Fri, Jun 18, 2021, at 7:53 PM, Nicholas Piggin wrote:
>> Excerpts from Andy Lutomirski's message of June 18, 2021 9:49 am:
>> > On 6/16/21 11:51 PM, Nicholas Piggin wrote:
>> >> Excerpts from Andy Lutomirski's message of June 17, 2021 3:32 pm:
>> >>> On Wed, Jun 16, 2021, at 7:57 PM, Andy Lutomirski wrote:
>> >>>>
>> >>>>
>> >>>> On Wed, Jun 16, 2021, at 6:37 PM, Nicholas Piggin wrote:
>> >>>>> Excerpts from Andy Lutomirski's message of June 17, 2021 4:41 am:
>> >>>>>> On 6/16/21 12:35 AM, Peter Zijlstra wrote:
>> >>>>>>> On Wed, Jun 16, 2021 at 02:19:49PM +1000, Nicholas Piggin wrote:
>> >>>>>>>> Excerpts from Andy Lutomirski's message of June 16, 2021 1:21 pm:
>> >>>>>>>>> membarrier() needs a barrier after any CPU changes mm.  There is currently
>> >>>>>>>>> a comment explaining why this barrier probably exists in all cases.  This
>> >>>>>>>>> is very fragile -- any change to the relevant parts of the scheduler
>> >>>>>>>>> might get rid of these barriers, and it's not really clear to me that
>> >>>>>>>>> the barrier actually exists in all necessary cases.
>> >>>>>>>>
>> >>>>>>>> The comments and barriers in the mmdrop() hunks? I don't see what is 
>> >>>>>>>> fragile or maybe-buggy about this. The barrier definitely exists.
>> >>>>>>>>
>> >>>>>>>> And any change can change anything, that doesn't make it fragile. My
>> >>>>>>>> lazy tlb refcounting change avoids the mmdrop in some cases, but it
>> >>>>>>>> replaces it with smp_mb for example.
>> >>>>>>>
>> >>>>>>> I'm with Nick again, on this. You're adding extra barriers for no
>> >>>>>>> discernible reason, that's not generally encouraged, seeing how extra
>> >>>>>>> barriers is extra slow.
>> >>>>>>>
>> >>>>>>> Both mmdrop() itself, as well as the callsite have comments saying how
>> >>>>>>> membarrier relies on the implied barrier, what's fragile about that?
>> >>>>>>>
>> >>>>>>
>> >>>>>> My real motivation is that mmgrab() and mmdrop() don't actually need to
>> >>>>>> be full barriers.  The current implementation has them being full
>> >>>>>> barriers, and the current implementation is quite slow.  So let's try
>> >>>>>> that commit message again:
>> >>>>>>
>> >>>>>> membarrier() needs a barrier after any CPU changes mm.  There is currently
>> >>>>>> a comment explaining why this barrier probably exists in all cases. The
>> >>>>>> logic is based on ensuring that the barrier exists on every control flow
>> >>>>>> path through the scheduler.  It also relies on mmgrab() and mmdrop() being
>> >>>>>> full barriers.
>> >>>>>>
>> >>>>>> mmgrab() and mmdrop() would be better if they were not full barriers.  As a
>> >>>>>> trivial optimization, mmgrab() could use a relaxed atomic and mmdrop()
>> >>>>>> could use a release on architectures that have these operations.
>> >>>>>
>> >>>>> I'm not against the idea, I've looked at something similar before (not
>> >>>>> for mmdrop but a different primitive). Also my lazy tlb shootdown series 
>> >>>>> could possibly take advantage of this, I might cherry pick it and test 
>> >>>>> performance :)
>> >>>>>
>> >>>>> I don't think it belongs in this series though. Should go together with
>> >>>>> something that takes advantage of it.
>> >>>>
>> >>>> I’m going to see if I can get hazard pointers into shape quickly.
>> >>>
>> >>> Here it is.  Not even boot tested!
>> >>>
>> >>> https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=sched/lazymm&id=ecc3992c36cb88087df9c537e2326efb51c95e31
>> >>>
>> >>> Nick, I think you can accomplish much the same thing as your patch by:
>> >>>
>> >>> #define for_each_possible_lazymm_cpu while (false)
>> >> 
>> >> I'm not sure what you mean? For powerpc, other CPUs can be using the mm 
>> >> as lazy at this point. I must be missing something.
>> > 
>> > What I mean is: if you want to shoot down lazies instead of doing the
>> > hazard pointer trick to track them, you could do:
>> > 
>> > #define for_each_possible_lazymm_cpu while (false)
>> > 
>> > which would promise to the core code that you don't have any lazies left
>> > by the time exit_mmap() is done.  You might need a new hook in
>> > exit_mmap() depending on exactly how you implement the lazy shootdown.
>> 
>> Oh for configuring it away entirely. I'll have to see how it falls out, 
>> I suspect we'd want to just no-op that entire function and avoid the 2 
>> atomics if we are taking care of our lazy mms with shootdowns.
> 
> Do you mean the smp_store_release()?  On x86 and similar architectures, that’s almost free.  I’m also not convinced it needs to be a real release.

Probably the shoot lazies code would compile that stuff out entirely so 
not that as such, but the entire thing including the change to the 
membarrier barrier (which as I said, shoot lazies could possibly take 
advantage of anyway).

My point is I haven't seen how everything goes together or looked at 
generated code so I can't exactly say yes to your question, but that
there's no reason it couldn't be made to nicely fold away based on
config option so I'm not too concerned about that issue.

Thanks,
Nick

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH 8/8] membarrier: Rewrite sync_core_before_usermode() and improve documentation
  2021-06-18 20:09             ` Mathieu Desnoyers
@ 2021-06-19  6:02               ` Nicholas Piggin
  -1 siblings, 0 replies; 165+ messages in thread
From: Nicholas Piggin @ 2021-06-19  6:02 UTC (permalink / raw)
  To: Andy Lutomirski, Mathieu Desnoyers
  Cc: Andrew Morton, Benjamin Herrenschmidt, Catalin Marinas,
	Dave Hansen, linux-arm-kernel, linux-kernel, linux-mm,
	linuxppc-dev, Michael Ellerman, Paul Mackerras, Peter Zijlstra,
	stable, Will Deacon, x86

Excerpts from Mathieu Desnoyers's message of June 19, 2021 6:09 am:
> ----- On Jun 18, 2021, at 3:58 PM, Andy Lutomirski luto@kernel.org wrote:
> 
>> On Fri, Jun 18, 2021, at 9:31 AM, Mathieu Desnoyers wrote:
>>> ----- On Jun 17, 2021, at 8:12 PM, Andy Lutomirski luto@kernel.org wrote:
>>> 
>>> > On 6/17/21 7:47 AM, Mathieu Desnoyers wrote:
>>> > 
>>> >> Please change back this #ifndef / #else / #endif within function for
>>> >> 
>>> >> if (!IS_ENABLED(CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE)) {
>>> >>   ...
>>> >> } else {
>>> >>   ...
>>> >> }
>>> >> 
>>> >> I don't think mixing up preprocessor and code logic makes it more readable.
>>> > 
>>> > I agree, but I don't know how to make the result work well.
>>> > membarrier_sync_core_before_usermode() isn't defined in the !IS_ENABLED
>>> > case, so either I need to fake up a definition or use #ifdef.
>>> > 
>>> > If I faked up a definition, I would want to assert, at build time, that
>>> > it isn't called.  I don't think we can do:
>>> > 
>>> > static void membarrier_sync_core_before_usermode()
>>> > {
>>> >    BUILD_BUG_IF_REACHABLE();
>>> > }
>>> 
>>> Let's look at the context here:
>>> 
>>> static void ipi_sync_core(void *info)
>>> {
>>>     [....]
>>>     membarrier_sync_core_before_usermode()
>>> }
>>> 
>>> ^ this can be within #ifdef / #endif
>>> 
>>> static int membarrier_private_expedited(int flags, int cpu_id)
>>> [...]
>>>                if (!IS_ENABLED(CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE))
>>>                         return -EINVAL;
>>>                 if (!(atomic_read(&mm->membarrier_state) &
>>>                       MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE_READY))
>>>                         return -EPERM;
>>>                 ipi_func = ipi_sync_core;
>>> 
>>> All we need to make the line above work is to define an empty ipi_sync_core
>>> function in the #else case after the ipi_sync_core() function definition.
>>> 
>>> Or am I missing your point ?
>> 
>> Maybe?
>> 
>> My objection is that an empty ipi_sync_core is a lie — it doesn’t sync the core.
>> I would be fine with that if I could have the compiler statically verify that
>> it’s not called, but I’m uncomfortable having it there if the implementation is
>> actively incorrect.
> 
> I see. Another approach would be to implement a "setter" function to populate
> "ipi_func". That setter function would return -EINVAL in its #ifndef CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE
> implementation.

I still don't get the problem with my suggestion. Sure the 
ipi is a "lie", but it doesn't get used. That's how a lot of
ifdef folding works out. E.g.,

diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c
index b5add64d9698..54cb32d064af 100644
--- a/kernel/sched/membarrier.c
+++ b/kernel/sched/membarrier.c
@@ -5,6 +5,15 @@
  * membarrier system call
  */
 #include "sched.h"
+#ifdef CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE
+#include <asm/sync_core.h>
+#else
+static inline void membarrier_sync_core_before_usermode(void)
+{
+	compiletime_assert(0, "architecture does not implement membarrier_sync_core_before_usermode");
+}
+
+#endif
 
 /*
  * For documentation purposes, here are some membarrier ordering

^ permalink raw reply related	[flat|nested] 165+ messages in thread

* Re: [PATCH 8/8] membarrier: Rewrite sync_core_before_usermode() and improve documentation
  2021-06-19  6:02               ` Nicholas Piggin
@ 2021-06-19 15:50                 ` Andy Lutomirski
  -1 siblings, 0 replies; 165+ messages in thread
From: Andy Lutomirski @ 2021-06-19 15:50 UTC (permalink / raw)
  To: Nicholas Piggin, Mathieu Desnoyers
  Cc: Andrew Morton, Benjamin Herrenschmidt, Catalin Marinas,
	Dave Hansen, linux-arm-kernel, Linux Kernel Mailing List,
	linux-mm, linuxppc-dev, Michael Ellerman, Paul Mackerras,
	Peter Zijlstra (Intel),
	stable, Will Deacon, the arch/x86 maintainers



On Fri, Jun 18, 2021, at 11:02 PM, Nicholas Piggin wrote:
> Excerpts from Mathieu Desnoyers's message of June 19, 2021 6:09 am:
> > ----- On Jun 18, 2021, at 3:58 PM, Andy Lutomirski luto@kernel.org wrote:
> > 
> >> On Fri, Jun 18, 2021, at 9:31 AM, Mathieu Desnoyers wrote:
> >>> ----- On Jun 17, 2021, at 8:12 PM, Andy Lutomirski luto@kernel.org wrote:
> >>> 
> >>> > On 6/17/21 7:47 AM, Mathieu Desnoyers wrote:
> >>> > 
> >>> >> Please change back this #ifndef / #else / #endif within function for
> >>> >> 
> >>> >> if (!IS_ENABLED(CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE)) {
> >>> >>   ...
> >>> >> } else {
> >>> >>   ...
> >>> >> }
> >>> >> 
> >>> >> I don't think mixing up preprocessor and code logic makes it more readable.
> >>> > 
> >>> > I agree, but I don't know how to make the result work well.
> >>> > membarrier_sync_core_before_usermode() isn't defined in the !IS_ENABLED
> >>> > case, so either I need to fake up a definition or use #ifdef.
> >>> > 
> >>> > If I faked up a definition, I would want to assert, at build time, that
> >>> > it isn't called.  I don't think we can do:
> >>> > 
> >>> > static void membarrier_sync_core_before_usermode()
> >>> > {
> >>> >    BUILD_BUG_IF_REACHABLE();
> >>> > }
> >>> 
> >>> Let's look at the context here:
> >>> 
> >>> static void ipi_sync_core(void *info)
> >>> {
> >>>     [....]
> >>>     membarrier_sync_core_before_usermode()
> >>> }
> >>> 
> >>> ^ this can be within #ifdef / #endif
> >>> 
> >>> static int membarrier_private_expedited(int flags, int cpu_id)
> >>> [...]
> >>>                if (!IS_ENABLED(CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE))
> >>>                         return -EINVAL;
> >>>                 if (!(atomic_read(&mm->membarrier_state) &
> >>>                       MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE_READY))
> >>>                         return -EPERM;
> >>>                 ipi_func = ipi_sync_core;
> >>> 
> >>> All we need to make the line above work is to define an empty ipi_sync_core
> >>> function in the #else case after the ipi_sync_core() function definition.
> >>> 
> >>> Or am I missing your point ?
> >> 
> >> Maybe?
> >> 
> >> My objection is that an empty ipi_sync_core is a lie — it doesn’t sync the core.
> >> I would be fine with that if I could have the compiler statically verify that
> >> it’s not called, but I’m uncomfortable having it there if the implementation is
> >> actively incorrect.
> > 
> > I see. Another approach would be to implement a "setter" function to populate
> > "ipi_func". That setter function would return -EINVAL in its #ifndef CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE
> > implementation.
> 
> I still don't get the problem with my suggestion. Sure the 
> ipi is a "lie", but it doesn't get used. That's how a lot of
> ifdef folding works out. E.g.,
> 
> diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c
> index b5add64d9698..54cb32d064af 100644
> --- a/kernel/sched/membarrier.c
> +++ b/kernel/sched/membarrier.c
> @@ -5,6 +5,15 @@
>   * membarrier system call
>   */
>  #include "sched.h"
> +#ifdef CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE
> +#include <asm/sync_core.h>
> +#else
> +static inline void membarrier_sync_core_before_usermode(void)
> +{
> +	compiletime_assert(0, "architecture does not implement 
> membarrier_sync_core_before_usermode");
> +}
> +

With the assert there, I’m fine with this. Let me see if the result builds.

> +#endif
>  
>  /*
>   * For documentation purposes, here are some membarrier ordering
> 

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH 8/8] membarrier: Rewrite sync_core_before_usermode() and improve documentation
  2021-06-19 15:50                 ` Andy Lutomirski
@ 2021-06-20  2:10                   ` Nicholas Piggin
  -1 siblings, 0 replies; 165+ messages in thread
From: Nicholas Piggin @ 2021-06-20  2:10 UTC (permalink / raw)
  To: Andy Lutomirski, Mathieu Desnoyers
  Cc: Andrew Morton, Benjamin Herrenschmidt, Catalin Marinas,
	Dave Hansen, linux-arm-kernel, Linux Kernel Mailing List,
	linux-mm, linuxppc-dev, Michael Ellerman, Paul Mackerras,
	Peter Zijlstra (Intel),
	stable, Will Deacon, the arch/x86 maintainers

Excerpts from Andy Lutomirski's message of June 20, 2021 1:50 am:
> 
> 
> On Fri, Jun 18, 2021, at 11:02 PM, Nicholas Piggin wrote:
>> Excerpts from Mathieu Desnoyers's message of June 19, 2021 6:09 am:
>> > ----- On Jun 18, 2021, at 3:58 PM, Andy Lutomirski luto@kernel.org wrote:
>> > 
>> >> On Fri, Jun 18, 2021, at 9:31 AM, Mathieu Desnoyers wrote:
>> >>> ----- On Jun 17, 2021, at 8:12 PM, Andy Lutomirski luto@kernel.org wrote:
>> >>> 
>> >>> > On 6/17/21 7:47 AM, Mathieu Desnoyers wrote:
>> >>> > 
>> >>> >> Please change back this #ifndef / #else / #endif within function for
>> >>> >> 
>> >>> >> if (!IS_ENABLED(CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE)) {
>> >>> >>   ...
>> >>> >> } else {
>> >>> >>   ...
>> >>> >> }
>> >>> >> 
>> >>> >> I don't think mixing up preprocessor and code logic makes it more readable.
>> >>> > 
>> >>> > I agree, but I don't know how to make the result work well.
>> >>> > membarrier_sync_core_before_usermode() isn't defined in the !IS_ENABLED
>> >>> > case, so either I need to fake up a definition or use #ifdef.
>> >>> > 
>> >>> > If I faked up a definition, I would want to assert, at build time, that
>> >>> > it isn't called.  I don't think we can do:
>> >>> > 
>> >>> > static void membarrier_sync_core_before_usermode()
>> >>> > {
>> >>> >    BUILD_BUG_IF_REACHABLE();
>> >>> > }
>> >>> 
>> >>> Let's look at the context here:
>> >>> 
>> >>> static void ipi_sync_core(void *info)
>> >>> {
>> >>>     [....]
>> >>>     membarrier_sync_core_before_usermode()
>> >>> }
>> >>> 
>> >>> ^ this can be within #ifdef / #endif
>> >>> 
>> >>> static int membarrier_private_expedited(int flags, int cpu_id)
>> >>> [...]
>> >>>                if (!IS_ENABLED(CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE))
>> >>>                         return -EINVAL;
>> >>>                 if (!(atomic_read(&mm->membarrier_state) &
>> >>>                       MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE_READY))
>> >>>                         return -EPERM;
>> >>>                 ipi_func = ipi_sync_core;
>> >>> 
>> >>> All we need to make the line above work is to define an empty ipi_sync_core
>> >>> function in the #else case after the ipi_sync_core() function definition.
>> >>> 
>> >>> Or am I missing your point ?
>> >> 
>> >> Maybe?
>> >> 
>> >> My objection is that an empty ipi_sync_core is a lie — it doesn’t sync the core.
>> >> I would be fine with that if I could have the compiler statically verify that
>> >> it’s not called, but I’m uncomfortable having it there if the implementation is
>> >> actively incorrect.
>> > 
>> > I see. Another approach would be to implement a "setter" function to populate
>> > "ipi_func". That setter function would return -EINVAL in its #ifndef CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE
>> > implementation.
>> 
>> I still don't get the problem with my suggestion. Sure the 
>> ipi is a "lie", but it doesn't get used. That's how a lot of
>> ifdef folding works out. E.g.,
>> 
>> diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c
>> index b5add64d9698..54cb32d064af 100644
>> --- a/kernel/sched/membarrier.c
>> +++ b/kernel/sched/membarrier.c
>> @@ -5,6 +5,15 @@
>>   * membarrier system call
>>   */
>>  #include "sched.h"
>> +#ifdef CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE
>> +#include <asm/sync_core.h>
>> +#else
>> +static inline void membarrier_sync_core_before_usermode(void)
>> +{
>> +	compiletime_assert(0, "architecture does not implement 
>> membarrier_sync_core_before_usermode");
>> +}
>> +
> 
> With the assert there, I’m fine with this. Let me see if the result builds.

It had better, because compiletime_assert already relies on a similar 
level of code elimination to work.

I think it's fine to use for now, but it may not be quite the 
logically correct primitive if we want to be really clean, because a 
valid compiletime_assert implementation should be able to fire even for 
code that is never linked. We would want something like this to be clean 
IMO:

#ifdef CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE
#include <asm/sync_core.h>
#else
extern void membarrier_sync_core_before_usermode(void) __compiletime_error("architecture does not implement membarrier_sync_core_before_usermode");
#endif

However that does not have the ifdef for optimising compile so AFAIKS it 
could break with a false positive in some cases.

Something like compiletime_assert_not_called("msg") that either compiles
to a noop or a __compiletime_error depending on __OPTIMIZE__ would be 
the way to go IMO. I don't know if anything exists that fits, but it's
certainly not a unique thing in the kernel so I may not be looking hard
enough.
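
(One possible shape for that helper, sketched for illustration; nothing
with this name exists in the tree, and __never_called is a made-up
placeholder:)

#ifdef __OPTIMIZE__
#define compiletime_assert_not_called(msg)				\
	do {								\
		extern void __never_called(void) __compiletime_error(msg); \
		__never_called();					\
	} while (0)
#else
#define compiletime_assert_not_called(msg) do { } while (0)
#endif

If the optimiser cannot prove the call site dead, the __error__
attribute on the dummy declaration turns the build into a failure; on a
non-optimising compile the macro folds to a nop, so it cannot produce
the false positives mentioned above.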

Thanks,
Nick


^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH 8/8] membarrier: Rewrite sync_core_before_usermode() and improve documentation
@ 2021-06-20  2:10                   ` Nicholas Piggin
  0 siblings, 0 replies; 165+ messages in thread
From: Nicholas Piggin @ 2021-06-20  2:10 UTC (permalink / raw)
  To: Andy Lutomirski, Mathieu Desnoyers
  Cc: Will Deacon, linux-mm, Peter Zijlstra (Intel),
	the arch/x86 maintainers, Linux Kernel Mailing List, stable,
	Dave Hansen, Paul Mackerras, Catalin Marinas, Andrew Morton,
	linuxppc-dev, linux-arm-kernel

Excerpts from Andy Lutomirski's message of June 20, 2021 1:50 am:
> 
> 
> On Fri, Jun 18, 2021, at 11:02 PM, Nicholas Piggin wrote:
>> Excerpts from Mathieu Desnoyers's message of June 19, 2021 6:09 am:
>> > ----- On Jun 18, 2021, at 3:58 PM, Andy Lutomirski luto@kernel.org wrote:
>> > 
>> >> On Fri, Jun 18, 2021, at 9:31 AM, Mathieu Desnoyers wrote:
>> >>> ----- On Jun 17, 2021, at 8:12 PM, Andy Lutomirski luto@kernel.org wrote:
>> >>> 
>> >>> > On 6/17/21 7:47 AM, Mathieu Desnoyers wrote:
>> >>> > 
>> >>> >> Please change back this #ifndef / #else / #endif within function for
>> >>> >> 
>> >>> >> if (!IS_ENABLED(CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE)) {
>> >>> >>   ...
>> >>> >> } else {
>> >>> >>   ...
>> >>> >> }
>> >>> >> 
>> >>> >> I don't think mixing up preprocessor and code logic makes it more readable.
>> >>> > 
>> >>> > I agree, but I don't know how to make the result work well.
>> >>> > membarrier_sync_core_before_usermode() isn't defined in the !IS_ENABLED
>> >>> > case, so either I need to fake up a definition or use #ifdef.
>> >>> > 
>> >>> > If I faked up a definition, I would want to assert, at build time, that
>> >>> > it isn't called.  I don't think we can do:
>> >>> > 
>> >>> > static void membarrier_sync_core_before_usermode()
>> >>> > {
>> >>> >    BUILD_BUG_IF_REACHABLE();
>> >>> > }
>> >>> 
>> >>> Let's look at the context here:
>> >>> 
>> >>> static void ipi_sync_core(void *info)
>> >>> {
>> >>>     [....]
>> >>>     membarrier_sync_core_before_usermode()
>> >>> }
>> >>> 
>> >>> ^ this can be within #ifdef / #endif
>> >>> 
>> >>> static int membarrier_private_expedited(int flags, int cpu_id)
>> >>> [...]
>> >>>                if (!IS_ENABLED(CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE))
>> >>>                         return -EINVAL;
>> >>>                 if (!(atomic_read(&mm->membarrier_state) &
>> >>>                       MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE_READY))
>> >>>                         return -EPERM;
>> >>>                 ipi_func = ipi_sync_core;
>> >>> 
>> >>> All we need to make the line above work is to define an empty ipi_sync_core
>> >>> function in the #else case after the ipi_sync_core() function definition.
>> >>> 
>> >>> Or am I missing your point ?
>> >> 
>> >> Maybe?
>> >> 
>> >> My objection is that an empty ipi_sync_core is a lie — it doesn’t sync the core.
>> >> I would be fine with that if I could have the compiler statically verify that
>> >> it’s not called, but I’m uncomfortable having it there if the implementation is
>> >> actively incorrect.
>> > 
>> > I see. Another approach would be to implement a "setter" function to populate
>> > "ipi_func". That setter function would return -EINVAL in its #ifndef CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE
>> > implementation.
>> 
>> I still don't get the problem with my suggestion. Sure the 
>> ipi is a "lie", but it doesn't get used. That's how a lot of
>> ifdef folding works out. E.g.,
>> 
>> diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c
>> index b5add64d9698..54cb32d064af 100644
>> --- a/kernel/sched/membarrier.c
>> +++ b/kernel/sched/membarrier.c
>> @@ -5,6 +5,15 @@
>>   * membarrier system call
>>   */
>>  #include "sched.h"
>> +#ifdef CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE
>> +#include <asm/sync_core.h>
>> +#else
>> +static inline void membarrier_sync_core_before_usermode(void)
>> +{
>> +	compiletime_assert(0, "architecture does not implement 
>> membarrier_sync_core_before_usermode");
>> +}
>> +
> 
> With the assert there, I’m fine with this. Let me see if the result builds.

It had better, because compiletime_assert already relies on a similar 
level of code elimination to work.

I think it's fine to use for now, but it may not be quite the the 
logically correct primitive if we want to be really clean, because a 
valid compiletime_assert implementation should be able to fire even for 
code that is never linked. We would want something like to be clean 
IMO:

#ifdef CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE
#include <asm/sync_core.h>
#else
extern void membarrier_sync_core_before_usermode(void) __compiletime_error("architecture does not implement membarrier_sync_core_before_usermode");
#endif

However that does not have the ifdef for optimising compile so AFAIKS it 
could break with a false positive in some cases.

Something like compiletime_assert_not_called("msg") that either compiles
to a noop or a __compiletime_error depending on __OPTIMIZE__ would be 
the way to go IMO. I don't know if anything exists that fits, but it's
certainly not a unique thing in the kernel so I may not be looking hard
enough.

Thanks,
Nick


^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH 8/8] membarrier: Rewrite sync_core_before_usermode() and improve documentation
@ 2021-06-20  2:10                   ` Nicholas Piggin
  0 siblings, 0 replies; 165+ messages in thread
From: Nicholas Piggin @ 2021-06-20  2:10 UTC (permalink / raw)
  To: Andy Lutomirski, Mathieu Desnoyers
  Cc: Andrew Morton, Benjamin Herrenschmidt, Catalin Marinas,
	Dave Hansen, linux-arm-kernel, Linux Kernel Mailing List,
	linux-mm, linuxppc-dev, Michael Ellerman, Paul Mackerras,
	Peter Zijlstra (Intel),
	stable, Will Deacon, the arch/x86 maintainers

Excerpts from Andy Lutomirski's message of June 20, 2021 1:50 am:
> 
> 
> On Fri, Jun 18, 2021, at 11:02 PM, Nicholas Piggin wrote:
>> Excerpts from Mathieu Desnoyers's message of June 19, 2021 6:09 am:
>> > ----- On Jun 18, 2021, at 3:58 PM, Andy Lutomirski luto@kernel.org wrote:
>> > 
>> >> On Fri, Jun 18, 2021, at 9:31 AM, Mathieu Desnoyers wrote:
>> >>> ----- On Jun 17, 2021, at 8:12 PM, Andy Lutomirski luto@kernel.org wrote:
>> >>> 
>> >>> > On 6/17/21 7:47 AM, Mathieu Desnoyers wrote:
>> >>> > 
>> >>> >> Please change back this #ifndef / #else / #endif within function for
>> >>> >> 
>> >>> >> if (!IS_ENABLED(CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE)) {
>> >>> >>   ...
>> >>> >> } else {
>> >>> >>   ...
>> >>> >> }
>> >>> >> 
>> >>> >> I don't think mixing up preprocessor and code logic makes it more readable.
>> >>> > 
>> >>> > I agree, but I don't know how to make the result work well.
>> >>> > membarrier_sync_core_before_usermode() isn't defined in the !IS_ENABLED
>> >>> > case, so either I need to fake up a definition or use #ifdef.
>> >>> > 
>> >>> > If I faked up a definition, I would want to assert, at build time, that
>> >>> > it isn't called.  I don't think we can do:
>> >>> > 
>> >>> > static void membarrier_sync_core_before_usermode()
>> >>> > {
>> >>> >    BUILD_BUG_IF_REACHABLE();
>> >>> > }
>> >>> 
>> >>> Let's look at the context here:
>> >>> 
>> >>> static void ipi_sync_core(void *info)
>> >>> {
>> >>>     [....]
>> >>>     membarrier_sync_core_before_usermode()
>> >>> }
>> >>> 
>> >>> ^ this can be within #ifdef / #endif
>> >>> 
>> >>> static int membarrier_private_expedited(int flags, int cpu_id)
>> >>> [...]
>> >>>                if (!IS_ENABLED(CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE))
>> >>>                         return -EINVAL;
>> >>>                 if (!(atomic_read(&mm->membarrier_state) &
>> >>>                       MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE_READY))
>> >>>                         return -EPERM;
>> >>>                 ipi_func = ipi_sync_core;
>> >>> 
>> >>> All we need to make the line above work is to define an empty ipi_sync_core
>> >>> function in the #else case after the ipi_sync_core() function definition.
>> >>> 
>> >>> Or am I missing your point ?
>> >> 
>> >> Maybe?
>> >> 
>> >> My objection is that an empty ipi_sync_core is a lie — it doesn’t sync the core.
>> >> I would be fine with that if I could have the compiler statically verify that
>> >> it’s not called, but I’m uncomfortable having it there if the implementation is
>> >> actively incorrect.
>> > 
>> > I see. Another approach would be to implement a "setter" function to populate
>> > "ipi_func". That setter function would return -EINVAL in its #ifndef CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE
>> > implementation.
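As an illustration of the setter idea quoted above (a sketch only: the
function name, signature, and return convention here are hypothetical
and not from the posted series; it assumes ipi_func has the kernel's
smp_call_func_t type, as it does in kernel/sched/membarrier.c):

#ifdef CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE
static int membarrier_get_sync_core_ipi(smp_call_func_t *ipi_func)
{
	*ipi_func = ipi_sync_core;
	return 0;
}
#else
static int membarrier_get_sync_core_ipi(smp_call_func_t *ipi_func)
{
	/* Arch does not implement SYNC_CORE: reject the request. */
	return -EINVAL;
}
#endif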
>> 
>> I still don't get the problem with my suggestion. Sure the 
>> ipi is a "lie", but it doesn't get used. That's how a lot of
>> ifdef folding works out. E.g.,
>> 
>> diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c
>> index b5add64d9698..54cb32d064af 100644
>> --- a/kernel/sched/membarrier.c
>> +++ b/kernel/sched/membarrier.c
>> @@ -5,6 +5,15 @@
>>   * membarrier system call
>>   */
>>  #include "sched.h"
>> +#ifdef CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE
>> +#include <asm/sync_core.h>
>> +#else
>> +static inline void membarrier_sync_core_before_usermode(void)
>> +{
>> +	compiletime_assert(0, "architecture does not implement membarrier_sync_core_before_usermode");
>> +}
>> +
> 
> With the assert there, I’m fine with this. Let me see if the result builds.

It had better, because compiletime_assert already relies on a similar 
level of code elimination to work.
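For reference, the elimination pattern being referred to looks roughly
like the following (a paraphrase of __compiletime_assert from
include/linux/compiler_types.h, not the exact upstream text): the build
only fails if the call to the error-attributed function survives
optimization.

#ifdef __OPTIMIZE__
# define __compiletime_assert(condition, msg, prefix, suffix)		\
	do {								\
		/* Never defined anywhere; calling it is a build error. */ \
		extern void prefix ## suffix(void) __compiletime_error(msg); \
		if (!(condition))					\
			prefix ## suffix();				\
	} while (0)
#else
# define __compiletime_assert(condition, msg, prefix, suffix) do { } while (0)
#endif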

I think it's fine to use for now, but it may not be quite the
logically correct primitive if we want to be really clean, because a
valid compiletime_assert implementation should be able to fire even for
code that is never linked. We would want something like the following
to be clean IMO:

#ifdef CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE
#include <asm/sync_core.h>
#else
extern void membarrier_sync_core_before_usermode(void) __compiletime_error("architecture does not implement membarrier_sync_core_before_usermode");
#endif

However, that lacks the #ifdef guard on __OPTIMIZE__ for non-optimising
compiles, so AFAIKS it could break with a false positive in some cases.

Something like compiletime_assert_not_called("msg") that either compiles
to a noop or a __compiletime_error depending on __OPTIMIZE__ would be 
the way to go IMO. I don't know if anything exists that fits, but it's
certainly not a unique thing in the kernel so I may not be looking hard
enough.
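A sketch of what such a helper could look like, assuming the
hypothetical name compiletime_assert_not_called(); nothing with this
name exists in the tree, and the helper symbol below is purely
illustrative:

#ifdef __OPTIMIZE__
# define compiletime_assert_not_called(msg)				\
	do {								\
		/* Build error if this call site survives optimization. */ \
		extern void __not_called_assert(void)			\
			__compiletime_error(msg);			\
		__not_called_assert();					\
	} while (0)
#else
/* Without optimization, dead code is not eliminated: compile to a no-op. */
# define compiletime_assert_not_called(msg) do { } while (0)
#endif

/* Possible usage in the !CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE stub: */
static inline void membarrier_sync_core_before_usermode(void)
{
	compiletime_assert_not_called("architecture does not implement membarrier_sync_core_before_usermode");
}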

Thanks,
Nick


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 165+ messages in thread

end of thread, other threads:[~2021-06-20  2:12 UTC | newest]

Thread overview: 165+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-06-16  3:21 [PATCH 0/8] membarrier cleanups Andy Lutomirski
2021-06-16  3:21 ` [PATCH 1/8] membarrier: Document why membarrier() works Andy Lutomirski
2021-06-16  4:00   ` Nicholas Piggin
2021-06-16  7:30     ` Peter Zijlstra
2021-06-17 23:45       ` Andy Lutomirski
2021-06-16  3:21 ` [PATCH 2/8] x86/mm: Handle unlazying membarrier core sync in the arch code Andy Lutomirski
2021-06-16  4:25   ` Nicholas Piggin
2021-06-16 18:31     ` Andy Lutomirski
2021-06-16 17:49   ` Mathieu Desnoyers
2021-06-16 17:49     ` Mathieu Desnoyers
2021-06-16 18:31     ` Andy Lutomirski
2021-06-16  3:21 ` [PATCH 3/8] membarrier: Remove membarrier_arch_switch_mm() prototype in core code Andy Lutomirski
2021-06-16  4:26   ` Nicholas Piggin
2021-06-16 17:52   ` Mathieu Desnoyers
2021-06-16 17:52     ` Mathieu Desnoyers
2021-06-16  3:21 ` [PATCH 4/8] membarrier: Make the post-switch-mm barrier explicit Andy Lutomirski
2021-06-16  4:19   ` Nicholas Piggin
2021-06-16  7:35     ` Peter Zijlstra
2021-06-16 18:41       ` Andy Lutomirski
2021-06-17  1:37         ` Nicholas Piggin
2021-06-17  2:57           ` Andy Lutomirski
2021-06-17  5:32             ` Andy Lutomirski
2021-06-17  6:51               ` Nicholas Piggin
2021-06-17 23:49                 ` Andy Lutomirski
2021-06-19  2:53                   ` Nicholas Piggin
2021-06-19  3:20                     ` Andy Lutomirski
2021-06-19  4:27                       ` Nicholas Piggin
2021-06-17  9:08               ` [RFC][PATCH] sched: Use lightweight hazard pointers to grab lazy mms Peter Zijlstra
2021-06-17  9:10                 ` Peter Zijlstra
2021-06-17 10:00                   ` Nicholas Piggin
2021-06-17  9:13                 ` Peter Zijlstra
2021-06-17 14:06                   ` Andy Lutomirski
2021-06-17  9:28                 ` Peter Zijlstra
2021-06-17 14:03                   ` Andy Lutomirski
2021-06-17 14:10                 ` Andy Lutomirski
2021-06-17 15:45                   ` Peter Zijlstra
2021-06-18  3:29                 ` Paul E. McKenney
2021-06-18  5:04                   ` Andy Lutomirski
2021-06-17 15:02               ` [PATCH 4/8] membarrier: Make the post-switch-mm barrier explicit Paul E. McKenney
2021-06-18  0:06                 ` Andy Lutomirski
2021-06-18  3:35                   ` Paul E. McKenney
2021-06-17  8:45         ` Peter Zijlstra
2021-06-16  3:21 ` [PATCH 5/8] membarrier, kthread: Use _ONCE accessors for task->mm Andy Lutomirski
2021-06-16  4:28   ` Nicholas Piggin
2021-06-16 18:08   ` Mathieu Desnoyers
2021-06-16 18:08     ` Mathieu Desnoyers
2021-06-16 18:45     ` Andy Lutomirski
2021-06-16  3:21 ` [PATCH 6/8] powerpc/membarrier: Remove special barrier on mm switch Andy Lutomirski
2021-06-16  3:21   ` Andy Lutomirski
2021-06-16  4:36   ` Nicholas Piggin
2021-06-16  4:36     ` Nicholas Piggin
2021-06-16  3:21 ` [PATCH 7/8] membarrier: Remove arm (32) support for SYNC_CORE Andy Lutomirski
2021-06-16  3:21   ` Andy Lutomirski
2021-06-16  9:28   ` Russell King (Oracle)
2021-06-16  9:28     ` Russell King (Oracle)
2021-06-16 10:16   ` Peter Zijlstra
2021-06-16 10:16     ` Peter Zijlstra
2021-06-16 10:20     ` Peter Zijlstra
2021-06-16 10:20       ` Peter Zijlstra
2021-06-16 10:34       ` Russell King (Oracle)
2021-06-16 10:34         ` Russell King (Oracle)
2021-06-16 11:10         ` Peter Zijlstra
2021-06-16 11:10           ` Peter Zijlstra
2021-06-16 13:22           ` Russell King (Oracle)
2021-06-16 13:22             ` Russell King (Oracle)
2021-06-16 15:04             ` Catalin Marinas
2021-06-16 15:04               ` Catalin Marinas
2021-06-16 15:23               ` Russell King (Oracle)
2021-06-16 15:23                 ` Russell King (Oracle)
2021-06-16 15:45                 ` Catalin Marinas
2021-06-16 15:45                   ` Catalin Marinas
2021-06-16 16:00                   ` Catalin Marinas
2021-06-16 16:00                     ` Catalin Marinas
2021-06-16 16:27                     ` Russell King (Oracle)
2021-06-16 16:27                       ` Russell King (Oracle)
2021-06-17  8:55                       ` Krzysztof Hałasa
2021-06-17  8:55                         ` Krzysztof Hałasa
2021-06-17  8:55                         ` Krzysztof Hałasa
2021-06-18 12:54                       ` Linus Walleij
2021-06-18 12:54                         ` Linus Walleij
2021-06-18 12:54                         ` Linus Walleij
2021-06-18 13:19                         ` Russell King (Oracle)
2021-06-18 13:19                           ` Russell King (Oracle)
2021-06-18 13:36                         ` Arnd Bergmann
2021-06-18 13:36                           ` Arnd Bergmann
2021-06-18 13:36                           ` Arnd Bergmann
2021-06-17 10:40   ` Mark Rutland
2021-06-17 10:40     ` Mark Rutland
2021-06-17 11:23     ` Russell King (Oracle)
2021-06-17 11:23       ` Russell King (Oracle)
2021-06-17 11:33       ` Mark Rutland
2021-06-17 11:33         ` Mark Rutland
2021-06-17 13:41         ` Andy Lutomirski
2021-06-17 13:41           ` Andy Lutomirski
2021-06-17 13:51           ` Mark Rutland
2021-06-17 13:51             ` Mark Rutland
2021-06-17 14:00             ` Andy Lutomirski
2021-06-17 14:00               ` Andy Lutomirski
2021-06-17 14:20               ` Mark Rutland
2021-06-17 14:20                 ` Mark Rutland
2021-06-17 15:01               ` Peter Zijlstra
2021-06-17 15:01                 ` Peter Zijlstra
2021-06-17 15:13                 ` Peter Zijlstra
2021-06-17 15:13                   ` Peter Zijlstra
2021-06-17 14:16             ` Mathieu Desnoyers
2021-06-17 14:16               ` Mathieu Desnoyers
2021-06-17 14:05           ` Peter Zijlstra
2021-06-17 14:05             ` Peter Zijlstra
2021-06-18  0:07   ` Andy Lutomirski
2021-06-18  0:07     ` Andy Lutomirski
2021-06-16  3:21 ` [PATCH 8/8] membarrier: Rewrite sync_core_before_usermode() and improve documentation Andy Lutomirski
2021-06-16  3:21   ` Andy Lutomirski
2021-06-16  3:21   ` Andy Lutomirski
2021-06-16  4:45   ` Nicholas Piggin
2021-06-16  4:45     ` Nicholas Piggin
2021-06-16  4:45     ` Nicholas Piggin
2021-06-16 18:52     ` Andy Lutomirski
2021-06-16 18:52       ` Andy Lutomirski
2021-06-16 18:52       ` Andy Lutomirski
2021-06-16 23:48       ` Andy Lutomirski
2021-06-16 23:48         ` Andy Lutomirski
2021-06-16 23:48         ` Andy Lutomirski
2021-06-18 15:27       ` Christophe Leroy
2021-06-18 15:27         ` Christophe Leroy
2021-06-18 15:27         ` Christophe Leroy
2021-06-16 10:20   ` Will Deacon
2021-06-16 10:20     ` Will Deacon
2021-06-16 10:20     ` Will Deacon
2021-06-16 23:58     ` Andy Lutomirski
2021-06-16 23:58       ` Andy Lutomirski
2021-06-16 23:58       ` Andy Lutomirski
2021-06-17 14:47   ` Mathieu Desnoyers
2021-06-17 14:47     ` Mathieu Desnoyers
2021-06-17 14:47     ` Mathieu Desnoyers
2021-06-17 14:47     ` Mathieu Desnoyers
2021-06-18  0:12     ` Andy Lutomirski
2021-06-18  0:12       ` Andy Lutomirski
2021-06-18  0:12       ` Andy Lutomirski
2021-06-18 16:31       ` Mathieu Desnoyers
2021-06-18 16:31         ` Mathieu Desnoyers
2021-06-18 16:31         ` Mathieu Desnoyers
2021-06-18 16:31         ` Mathieu Desnoyers
2021-06-18 19:58         ` Andy Lutomirski
2021-06-18 19:58           ` Andy Lutomirski
2021-06-18 19:58           ` Andy Lutomirski
2021-06-18 20:09           ` Mathieu Desnoyers
2021-06-18 20:09             ` Mathieu Desnoyers
2021-06-18 20:09             ` Mathieu Desnoyers
2021-06-18 20:09             ` Mathieu Desnoyers
2021-06-19  6:02             ` Nicholas Piggin
2021-06-19  6:02               ` Nicholas Piggin
2021-06-19  6:02               ` Nicholas Piggin
2021-06-19 15:50               ` Andy Lutomirski
2021-06-19 15:50                 ` Andy Lutomirski
2021-06-19 15:50                 ` Andy Lutomirski
2021-06-20  2:10                 ` Nicholas Piggin
2021-06-20  2:10                   ` Nicholas Piggin
2021-06-20  2:10                   ` Nicholas Piggin
2021-06-17 15:16   ` Mathieu Desnoyers
2021-06-17 15:16     ` Mathieu Desnoyers
2021-06-17 15:16     ` Mathieu Desnoyers
2021-06-17 15:16     ` Mathieu Desnoyers
2021-06-18  0:13     ` Andy Lutomirski
2021-06-18  0:13       ` Andy Lutomirski
2021-06-18  0:13       ` Andy Lutomirski
