* [PATCH 0/3] membarrier fixes
@ 2020-11-30 17:50 Andy Lutomirski
  2020-11-30 17:50 ` [PATCH 1/3] x86/membarrier: Get rid of a dubious optimization Andy Lutomirski
                   ` (2 more replies)
  0 siblings, 3 replies; 18+ messages in thread
From: Andy Lutomirski @ 2020-11-30 17:50 UTC (permalink / raw)
  To: x86, Mathieu Desnoyers
  Cc: LKML, Nicholas Piggin, Arnd Bergmann, Anton Blanchard, Andy Lutomirski

Hi all-

x86's sync_core_before_usermode() was bogus.  Without the other
patches applied, it would never be called in a problematic context,
but that's about to change.  In any event, sync_core_before_usermode()
should be correct.

The second patch fixes a minor issue, but it also makes the third patch
nicer.

The third patch is the biggie.  The old code looped over all CPUs
without disabling migration, and it skipped the current CPU.  There
were comments about how the scheduler's barriers made this okay.  This
may well be true, but it was a mess, and it's considerably simpler to
treat the current CPU just like all other CPUs.

The messy skip-the-current-CPU code masked what seems to be a couple
of much bigger issues: if the membarrier() syscall ran all the way
through _without_ being preempted, it completely failed to operate on
the calling thread.  The smp_mb() calls sprinkled through the function
would mask this problem for the normal barrier mode, but they wouldn't
help for the core-serializing mode or rseq_preempt mode.  In other
words, modifying some code, calling
membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE, 0), and running
that modified code was not actually safe.  This seems to rather defeat
the purpose.
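
For concreteness, the pattern at stake looks roughly like the sketch
below.  This is only an illustration, not code from this series; it
assumes the process already registered with
MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_SYNC_CORE, and that code_w
and code_x are a writable and an executable mapping of the same pages:

#include <linux/membarrier.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

static void patch_and_run(unsigned char *code_w, void (*code_x)(void),
			  const unsigned char *insns, size_t len)
{
	memcpy(code_w, insns, len);	/* modify via the writable alias */

	/*
	 * Every currently-running thread of this mm must core-serialize
	 * before it may execute the new instructions.
	 */
	syscall(__NR_membarrier,
		MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE, 0, 0);

	code_x();			/* run via the executable alias */
}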

Some or all of this might be -stable material.

The global_expedited code looks similarly nasty.  Any volunteers to
clean it up?

Andy Lutomirski (3):
  x86/membarrier: Get rid of a dubious optimization
  membarrier: Add an actual barrier before rseq_preempt()
  membarrier: Propagate SYNC_CORE and RSEQ actions more carefully

 arch/x86/include/asm/sync_core.h |   9 +--
 arch/x86/mm/tlb.c                |   6 +-
 kernel/sched/membarrier.c        | 102 ++++++++++++++++++++++---------
 3 files changed, 83 insertions(+), 34 deletions(-)

-- 
2.28.0



* [PATCH 1/3] x86/membarrier: Get rid of a dubious optimization
  2020-11-30 17:50 [PATCH 0/3] membarrier fixes Andy Lutomirski
@ 2020-11-30 17:50 ` Andy Lutomirski
  2020-12-01 14:39   ` Mathieu Desnoyers
  2020-11-30 17:50 ` [PATCH 2/3] membarrier: Add an actual barrier before rseq_preempt() Andy Lutomirski
  2020-11-30 17:50 ` [PATCH 3/3] membarrier: Propagate SYNC_CORE and RSEQ actions more carefully Andy Lutomirski
  2 siblings, 1 reply; 18+ messages in thread
From: Andy Lutomirski @ 2020-11-30 17:50 UTC (permalink / raw)
  To: x86, Mathieu Desnoyers
  Cc: LKML, Nicholas Piggin, Arnd Bergmann, Anton Blanchard, Andy Lutomirski

sync_core_before_usermode() had an incorrect optimization.  If we're
in an IRQ, we can get to usermode without IRET -- we just have to
schedule to a different task in the same mm and do SYSRET.
Fortunately, there were no callers of sync_core_before_usermode()
that could have had in_irq() or in_nmi() equal to true, because it's
only ever called from the scheduler.

While we're at it, clarify a related comment.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/include/asm/sync_core.h | 9 +++++----
 arch/x86/mm/tlb.c                | 6 ++++--
 2 files changed, 9 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/sync_core.h b/arch/x86/include/asm/sync_core.h
index 0fd4a9dfb29c..ab7382f92aff 100644
--- a/arch/x86/include/asm/sync_core.h
+++ b/arch/x86/include/asm/sync_core.h
@@ -98,12 +98,13 @@ static inline void sync_core_before_usermode(void)
 	/* With PTI, we unconditionally serialize before running user code. */
 	if (static_cpu_has(X86_FEATURE_PTI))
 		return;
+
 	/*
-	 * Return from interrupt and NMI is done through iret, which is core
-	 * serializing.
+	 * Even if we're in an interrupt, we might reschedule before returning,
+	 * in which case we could switch to a different thread in the same mm
+	 * and return using SYSRET or SYSEXIT.  Instead of trying to keep
+	 * track of our need to sync the core, just sync right away.
 	 */
-	if (in_irq() || in_nmi())
-		return;
 	sync_core();
 }
 
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 11666ba19b62..dabe683ab076 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -474,8 +474,10 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 	/*
 	 * The membarrier system call requires a full memory barrier and
 	 * core serialization before returning to user-space, after
-	 * storing to rq->curr. Writing to CR3 provides that full
-	 * memory barrier and core serializing instruction.
+	 * storing to rq->curr, when changing mm.  This is because membarrier()
+	 * sends IPIs to all CPUs that are in the target mm, but another
+	 * CPU switch to the target mm in the mean time.  Writing to CR3
+	 * provides that full memory barrier and core serializing instruction.
 	 */
 	if (real_prev == next) {
 		VM_WARN_ON(this_cpu_read(cpu_tlbstate.ctxs[prev_asid].ctx_id) !=
-- 
2.28.0



* [PATCH 2/3] membarrier: Add an actual barrier before rseq_preempt()
  2020-11-30 17:50 [PATCH 0/3] membarrier fixes Andy Lutomirski
  2020-11-30 17:50 ` [PATCH 1/3] x86/membarrier: Get rid of a dubious optimization Andy Lutomirski
@ 2020-11-30 17:50 ` Andy Lutomirski
  2020-12-01 10:06   ` Peter Zijlstra
  2020-11-30 17:50 ` [PATCH 3/3] membarrier: Propagate SYNC_CORE and RSEQ actions more carefully Andy Lutomirski
  2 siblings, 1 reply; 18+ messages in thread
From: Andy Lutomirski @ 2020-11-30 17:50 UTC (permalink / raw)
  To: x86, Mathieu Desnoyers
  Cc: LKML, Nicholas Piggin, Arnd Bergmann, Anton Blanchard, Andy Lutomirski

It seems to be that most RSEQ membarrier users will expect any
stores done before the membarrier() syscall to be visible to the
target task(s).  While this is extremely likely to be true in
practice, nothing actually guarantees it by a strict reading of the
x86 manuals.  Rather than providing this guarantee by accident and
potentially causing a problem down the road, just add an explicit
barrier.
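
As an illustration of that expectation (not part of the patch -- the
data structure and names below are made up, and the caller is assumed
to have registered with MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_RSEQ):

#include <linux/membarrier.h>
#include <sys/syscall.h>
#include <unistd.h>

struct slot {
	unsigned long generation;	/* illustrative field */
};

static void publish_and_restart(struct slot *s, int target_cpu)
{
	/* The store that the restarted rseq critical section must see. */
	__atomic_fetch_add(&s->generation, 1, __ATOMIC_RELAXED);

	/* Abort/restart any rseq critical section running on target_cpu. */
	syscall(__NR_membarrier, MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ,
		MEMBARRIER_CMD_FLAG_CPU, target_cpu);
}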

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 kernel/sched/membarrier.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c
index e23e74d52db5..7d98ef5d3bcd 100644
--- a/kernel/sched/membarrier.c
+++ b/kernel/sched/membarrier.c
@@ -40,6 +40,14 @@ static void ipi_mb(void *info)
 
 static void ipi_rseq(void *info)
 {
+	/*
+	 * Ensure that all stores done by the calling thread are visible
+	 * to the current task before the current task resumes.  We could
+	 * probably optimize this away on most architectures, but by the
+	 * time we've already sent an IPI, the cost of the extra smp_mb()
+	 * is negligible.
+	 */
+	smp_mb();
 	rseq_preempt(current);
 }
 
-- 
2.28.0



* [PATCH 3/3] membarrier: Propagate SYNC_CORE and RSEQ actions more carefully
  2020-11-30 17:50 [PATCH 0/3] membarrier fixes Andy Lutomirski
  2020-11-30 17:50 ` [PATCH 1/3] x86/membarrier: Get rid of a dubious optimization Andy Lutomirski
  2020-11-30 17:50 ` [PATCH 2/3] membarrier: Add an actual barrier before rseq_preempt() Andy Lutomirski
@ 2020-11-30 17:50 ` Andy Lutomirski
  2020-12-01 10:16   ` Peter Zijlstra
  2 siblings, 1 reply; 18+ messages in thread
From: Andy Lutomirski @ 2020-11-30 17:50 UTC (permalink / raw)
  To: x86, Mathieu Desnoyers
  Cc: LKML, Nicholas Piggin, Arnd Bergmann, Anton Blanchard, Andy Lutomirski

membarrier() carefully propagates SYNC_CORE and RSEQ actions to all
other CPUs, but there are two issues.

 - membarrier() does not sync_core() or rseq_preempt() the calling
   CPU.  Aside from the logic being mind-bending, this also means
   that it may not be safe to modify user code through an alias,
   call membarrier(), and then jump to a different executable alias
   of the same code.

 - membarrier() does not explicitly sync_core() remote CPUs either;
   instead, it relies on the assumption that an IPI will result in a
   core sync.  On x86, I think this may be true in practice, but
   it's not architecturally reliable.  In particular, the SDM and
   APM do not appear to guarantee that interrupt delivery is
   serializing.  On a preemptible kernel, IPI return can schedule,
   thereby switching to another task in the same mm that was
   sleeping in a syscall.  The new task could then SYSRET back to
   usermode without ever executing IRET.

This patch simplifies the code to treat the calling CPU just like
all other CPUs, and explicitly sync_core() on all target CPUs.  This
eliminates the need for the smp_mb() at the end of the function
except in the special case of a targeted remote membarrier().  This
patch updates that code and the comments accordingly.
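
Spelled out, the second problem looks roughly like this for a SYNC_CORE
membarrier() targeting CPU 1 (a sketch of the sequence, not taken from
a trace):

  caller (CPU 0)                      CPU 1: task A, running in usermode
  --------------                      ----------------------------------
  modifies code
  membarrier(SYNC_CORE) sends
    an IPI to CPU 1       ---->       IPI handler runs, signals completion
                                      irq exit: preempt_schedule_irq()
                                        switches to task B (same mm, was
                                        sleeping in a syscall)
  membarrier() returns;               task B returns to usermode via
  caller assumes CPU 1 has              SYSRET -- no IRET, no core
    core-serialized                     serialization -- and may execute
                                        stale instructions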

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 kernel/sched/membarrier.c | 94 +++++++++++++++++++++++++++------------
 1 file changed, 66 insertions(+), 28 deletions(-)

diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c
index 7d98ef5d3bcd..40266aa7ec39 100644
--- a/kernel/sched/membarrier.c
+++ b/kernel/sched/membarrier.c
@@ -38,6 +38,11 @@ static void ipi_mb(void *info)
 	smp_mb();	/* IPIs should be serializing but paranoid. */
 }
 
+static void ipi_sync_core(void *info)
+{
+	sync_core_before_usermode();
+}
+
 static void ipi_rseq(void *info)
 {
 	/*
@@ -162,6 +167,7 @@ static int membarrier_private_expedited(int flags, int cpu_id)
 		if (!(atomic_read(&mm->membarrier_state) &
 		      MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE_READY))
 			return -EPERM;
+		ipi_func = ipi_sync_core;
 	} else if (flags == MEMBARRIER_FLAG_RSEQ) {
 		if (!IS_ENABLED(CONFIG_RSEQ))
 			return -EINVAL;
@@ -180,10 +186,46 @@ static int membarrier_private_expedited(int flags, int cpu_id)
 		return 0;
 
 	/*
-	 * Matches memory barriers around rq->curr modification in
-	 * scheduler.
+	 * Consider the following scenario:
+	 *
+	 * Initially, x != y, *x == 0, *y == 0
+	 *
+	 * Our thread, on CPU 0:
+	 * ===========
+	 * WRITE_ONCE(*x, 1);
+	 * membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED, 0);
+	 * a = READ_ONCE(*y);
+	 *
+	 * Another thread:
+	 * ============
+	 * WRITE_ONCE(*y, 1);
+	 * barrier(); -- should synchronize with membarrier()
+	 * b = READ_ONCE(*x);
+	 *
+	 * Now suppose the other thread schedules out before writing 1
+	 * to *y.  Then it schedules back in on CPU 1 concurrently
+	 * with the membarrier() call.  If our thread did not have any
+	 * barrier here in the membarrier() syscall and if system call
+	 * entries were not barriers, then CPU 0 could do this:
+	 *
+	 *  WRITE_ONCE(*x, 1);
+	 *  [no barrier here]
+	 *  observe cpu_rq(1)->curr->mm != current->mm
+	 *
+	 * and we would not send an IPI to CPU 1.  But, again because there is
+	 * no barrier, the write to *x might not be visible to CPU 1
+	 * until after CPU 1 returns to usermode and reads *x == 0.  And, since
+	 * nothing forces CPU1's write to *y to become globally visible
+	 * at any point, CPU0 can see *y == 0.
+	 *
+	 * This sequence of events gives a == b == 0, which is a violation
+	 * of the semantics of membarrier().  In effect, the WRITE_ONCE(*x, 1)
+	 * got reordered with the membarrier(), and that shouldn't happen.
+	 *
+	 * This barrier synchronizes with the barrier after the
+	 * rq->curr modification in the scheduler.
 	 */
-	smp_mb();	/* system call entry is not a mb. */
+	smp_mb();
 
 	if (cpu_id < 0 && !zalloc_cpumask_var(&tmpmask, GFP_KERNEL))
 		return -ENOMEM;
@@ -195,8 +237,6 @@ static int membarrier_private_expedited(int flags, int cpu_id)
 
 		if (cpu_id >= nr_cpu_ids || !cpu_online(cpu_id))
 			goto out;
-		if (cpu_id == raw_smp_processor_id())
-			goto out;
 		rcu_read_lock();
 		p = rcu_dereference(cpu_rq(cpu_id)->curr);
 		if (!p || p->mm != mm) {
@@ -211,16 +251,6 @@ static int membarrier_private_expedited(int flags, int cpu_id)
 		for_each_online_cpu(cpu) {
 			struct task_struct *p;
 
-			/*
-			 * Skipping the current CPU is OK even through we can be
-			 * migrated at any point. The current CPU, at the point
-			 * where we read raw_smp_processor_id(), is ensured to
-			 * be in program order with respect to the caller
-			 * thread. Therefore, we can skip this CPU from the
-			 * iteration.
-			 */
-			if (cpu == raw_smp_processor_id())
-				continue;
 			p = rcu_dereference(cpu_rq(cpu)->curr);
 			if (p && p->mm == mm)
 				__cpumask_set_cpu(cpu, tmpmask);
@@ -228,25 +258,33 @@ static int membarrier_private_expedited(int flags, int cpu_id)
 		rcu_read_unlock();
 	}
 
-	preempt_disable();
-	if (cpu_id >= 0)
-		smp_call_function_single(cpu_id, ipi_func, NULL, 1);
-	else
-		smp_call_function_many(tmpmask, ipi_func, NULL, 1);
-	preempt_enable();
+	if (cpu_id >= 0) {
+		int cpu = get_cpu();
+
+		if (cpu_id == cpu) {
+			ipi_func(NULL);
+		} else {
+			smp_call_function_single(cpu_id, ipi_func, NULL, 1);
+			/*
+			 * This is analogous to the smp_mb() at the beginning
+			 * of the function -- exit from a system call is not a
+			 * barrier.  We only need this if we're targeting a
+			 * specific remote CPU, though -- otherwise ipi_func()
+			 * would serve the same purpose.
+			 */
+			smp_mb();
+		}
+
+		put_cpu();
+	} else {
+		on_each_cpu_mask(tmpmask, ipi_func, NULL, true);
+	}
 
 out:
 	if (cpu_id < 0)
 		free_cpumask_var(tmpmask);
 	cpus_read_unlock();
 
-	/*
-	 * Memory barrier on the caller thread _after_ we finished
-	 * waiting for the last IPI. Matches memory barriers around
-	 * rq->curr modification in scheduler.
-	 */
-	smp_mb();	/* exit from system call is not a mb */
-
 	return 0;
 }
 
-- 
2.28.0



* Re: [PATCH 2/3] membarrier: Add an actual barrier before rseq_preempt()
  2020-11-30 17:50 ` [PATCH 2/3] membarrier: Add an actual barrier before rseq_preempt() Andy Lutomirski
@ 2020-12-01 10:06   ` Peter Zijlstra
  2020-12-01 14:31     ` Mathieu Desnoyers
  0 siblings, 1 reply; 18+ messages in thread
From: Peter Zijlstra @ 2020-12-01 10:06 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: x86, Mathieu Desnoyers, LKML, Nicholas Piggin, Arnd Bergmann,
	Anton Blanchard, Linus Torvalds

On Mon, Nov 30, 2020 at 09:50:34AM -0800, Andy Lutomirski wrote:
> It seems to be that most RSEQ membarrier users will expect any
> stores done before the membarrier() syscall to be visible to the
> target task(s).  While this is extremely likely to be true in
> practice, nothing actually guarantees it by a strict reading of the
> x86 manuals.  Rather than providing this guarantee by accident and
> potentially causing a problem down the road, just add an explicit
> barrier.

A very long time ago; when Jens introduced smp_call_function(), we had
this discussion. At the time Linus said that receiving an interrupt had
better be ordering, and if it is not, then it's up to the architecture
to handle that before it gets into the common code.

  https://lkml.kernel.org/r/alpine.LFD.2.00.0902180744520.21686@localhost.localdomain

Maybe we want to revisit this now, but there might be a fair amount of
code relying on all this by now.

Documenting it better might help.

> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> ---
>  kernel/sched/membarrier.c | 8 ++++++++
>  1 file changed, 8 insertions(+)
> 
> diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c
> index e23e74d52db5..7d98ef5d3bcd 100644
> --- a/kernel/sched/membarrier.c
> +++ b/kernel/sched/membarrier.c
> @@ -40,6 +40,14 @@ static void ipi_mb(void *info)
>  
>  static void ipi_rseq(void *info)
>  {
> +	/*
> +	 * Ensure that all stores done by the calling thread are visible
> +	 * to the current task before the current task resumes.  We could
> +	 * probably optimize this away on most architectures, but by the
> +	 * time we've already sent an IPI, the cost of the extra smp_mb()
> +	 * is negligible.
> +	 */
> +	smp_mb();
>  	rseq_preempt(current);
>  }

So I think this really isn't right.


* Re: [PATCH 3/3] membarrier: Propagate SYNC_CORE and RSEQ actions more carefully
  2020-11-30 17:50 ` [PATCH 3/3] membarrier: Propagate SYNC_CORE and RSEQ actions more carefully Andy Lutomirski
@ 2020-12-01 10:16   ` Peter Zijlstra
  2020-12-01 14:28     ` Mathieu Desnoyers
  2020-12-01 18:09     ` Andy Lutomirski
  0 siblings, 2 replies; 18+ messages in thread
From: Peter Zijlstra @ 2020-12-01 10:16 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: x86, Mathieu Desnoyers, LKML, Nicholas Piggin, Arnd Bergmann,
	Anton Blanchard

On Mon, Nov 30, 2020 at 09:50:35AM -0800, Andy Lutomirski wrote:
> membarrier() carefully propagates SYNC_CORE and RSEQ actions to all
> other CPUs, but there are two issues.
> 
>  - membarrier() does not sync_core() or rseq_preempt() the calling
>    CPU.  Aside from the logic being mind-bending, this also means
>    that it may not be safe to modify user code through an alias,
>    call membarrier(), and then jump to a different executable alias
>    of the same code.

I always understood this to be on purpose. The calling CPU can fix up
itself just fine. The pain point is fixing up the other CPUs, and that's
where membarrier() helps.

That said, I don't mind including self, these aren't fast calls by any
means.

>  - membarrier() does not explicitly sync_core() remote CPUs either;
>    instead, it relies on the assumption that an IPI will result in a
>    core sync.  On x86, I think this may be true in practice, but
>    it's not architecturally reliable.  In particular, the SDM and
>    APM do not appear to guarantee that interrupt delivery is
>    serializing.

Right, I don't think we rely on that, we do rely on interrupt delivery
providing order though -- as per the previous email.

>    On a preemptible kernel, IPI return can schedule,
>    thereby switching to another task in the same mm that was
>    sleeping in a syscall.  The new task could then SYSRET back to
>    usermode without ever executing IRET.

This; I think we all overlooked this scenario.

> This patch simplifies the code to treat the calling CPU just like
> all other CPUs, and explicitly sync_core() on all target CPUs.  This
> eliminates the need for the smp_mb() at the end of the function
> except in the special case of a targeted remote membarrier().  This
> patch updates that code and the comments accordingly.
> 
> Signed-off-by: Andy Lutomirski <luto@kernel.org>

> @@ -228,25 +258,33 @@ static int membarrier_private_expedited(int flags, int cpu_id)
>  		rcu_read_unlock();
>  	}
>  
> -	preempt_disable();
> -	if (cpu_id >= 0)
> -		smp_call_function_single(cpu_id, ipi_func, NULL, 1);
> -	else
> -		smp_call_function_many(tmpmask, ipi_func, NULL, 1);
> -	preempt_enable();
> +	if (cpu_id >= 0) {
> +		int cpu = get_cpu();
> +
> +		if (cpu_id == cpu) {
> +			ipi_func(NULL);
> +		} else {
> +			smp_call_function_single(cpu_id, ipi_func, NULL, 1);
> +			/*
> +			 * This is analogous to the smp_mb() at the beginning
> +			 * of the function -- exit from a system call is not a
> +			 * barrier.  We only need this if we're targeting a
> +			 * specific remote CPU, though -- otherwise ipi_func()
> +			 * would serves the same purpose.
> +			 */
> +			smp_mb();

smp_call_function_single(.wait=1) already orders against completion of
the IPI. Do we really need more?

> +		}
> +
> +		put_cpu();
> +	} else {
> +		on_each_cpu_mask(tmpmask, ipi_func, NULL, true);
> +	}
>  
>  out:
>  	if (cpu_id < 0)
>  		free_cpumask_var(tmpmask);
>  	cpus_read_unlock();
>  
> -	/*
> -	 * Memory barrier on the caller thread _after_ we finished
> -	 * waiting for the last IPI. Matches memory barriers around
> -	 * rq->curr modification in scheduler.
> -	 */
> -	smp_mb();	/* exit from system call is not a mb */
> -
>  	return 0;
>  }
>  
> -- 
> 2.28.0
> 


* Re: [PATCH 3/3] membarrier: Propagate SYNC_CORE and RSEQ actions more carefully
  2020-12-01 10:16   ` Peter Zijlstra
@ 2020-12-01 14:28     ` Mathieu Desnoyers
  2020-12-01 18:12       ` Andy Lutomirski
  2020-12-01 18:09     ` Andy Lutomirski
  1 sibling, 1 reply; 18+ messages in thread
From: Mathieu Desnoyers @ 2020-12-01 14:28 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski
  Cc: x86, linux-kernel, Nicholas Piggin, Arnd Bergmann, Anton Blanchard

----- On Dec 1, 2020, at 5:16 AM, Peter Zijlstra peterz@infradead.org wrote:

> On Mon, Nov 30, 2020 at 09:50:35AM -0800, Andy Lutomirski wrote:
>> membarrier() carefully propagates SYNC_CORE and RSEQ actions to all
>> other CPUs, but there are two issues.
>> 
>>  - membarrier() does not sync_core() or rseq_preempt() the calling
>>    CPU.  Aside from the logic being mind-bending, this also means
>>    that it may not be safe to modify user code through an alias,
>>    call membarrier(), and then jump to a different executable alias
>>    of the same code.
> 
> I always understood this to be on purpose. The calling CPU can fix up
> itself just fine. The pain point is fixing up the other CPUs, and that's
> where membarrier() helps.

Indeed, as documented in the man page:

       MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE (since Linux 4.16)
              In  addition  to  providing  the  memory ordering guarantees de‐
              scribed in MEMBARRIER_CMD_PRIVATE_EXPEDITED,  upon  return  from
              system call the calling thread has a guarantee that all its run‐
              ning thread siblings have executed a core  serializing  instruc‐
              tion.   This  guarantee is provided only for threads in the same
              process as the calling thread.

membarrier sync core guarantees a core serializing instruction on the siblings,
not on the caller thread. This has been done on purpose given that the caller
thread can always issue its core serializing instruction from user-space on
its own.
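
(For example, on x86 user-space can core-serialize itself with
something like the sketch below -- an illustration only, using CPUID,
which the SDM documents as a serializing instruction:)

static inline void self_sync_core(void)
{
	unsigned int eax = 0, ebx, ecx = 0, edx;

	/* CPUID is an architecturally serializing instruction. */
	asm volatile("cpuid"
		     : "+a" (eax), "=b" (ebx), "+c" (ecx), "=d" (edx)
		     : : "memory");
}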

> 
> That said, I don't mind including self, these aren't fast calls by any
> means.

I don't mind including self either, but this would require documentation
updates, including man pages, to state that starting from kernel Y this
is the guaranteed behavior. It's then tricky for user-space to query what
the behavior is unless we introduce a new membarrier command for it. So this
could introduce issues if software written for the newer kernels runs on older
kernels.

> 
>>  - membarrier() does not explicitly sync_core() remote CPUs either;
>>    instead, it relies on the assumption that an IPI will result in a
>>    core sync.  On x86, I think this may be true in practice, but
>>    it's not architecturally reliable.  In particular, the SDM and
>>    APM do not appear to guarantee that interrupt delivery is
>>    serializing.
> 
> Right, I don't think we rely on that, we do rely on interrupt delivery
> providing order though -- as per the previous email.
> 
>>    On a preemptible kernel, IPI return can schedule,
>>    thereby switching to another task in the same mm that was
>>    sleeping in a syscall.  The new task could then SYSRET back to
>>    usermode without ever executing IRET.
> 
> This; I think we all overlooked this scenario.

Indeed, this is an issue which needs to be fixed.

> 
>> This patch simplifies the code to treat the calling CPU just like
>> all other CPUs, and explicitly sync_core() on all target CPUs.  This
>> eliminates the need for the smp_mb() at the end of the function
>> except in the special case of a targeted remote membarrier().  This
>> patch updates that code and the comments accordingly.

I am not confident that removing the smp_mb at the end of membarrier is
an appropriate change, nor that it simplifies the model.

This changes things from a model where we have a barrier at the beginning
and end of the membarrier system call, which nicely orders things happening
before/after the system call with respect to anything that is observed within
the system call (including the scheduler activity updating the runqueue's
current task), to a model where the memory barrier for the current thread
will be conditionally executed after we have sent the IPIs, and unconditionally
when issuing smp_call_function* on self.

About the documentation of the membarrier scenario, I think it is redundant
with a documentation patch I already have sitting in -tip (scenario A):

https://git.kernel.org/tip/25595eb6aaa9fbb31330f1e0b400642694bc6574

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


* Re: [PATCH 2/3] membarrier: Add an actual barrier before rseq_preempt()
  2020-12-01 10:06   ` Peter Zijlstra
@ 2020-12-01 14:31     ` Mathieu Desnoyers
  2020-12-01 17:55       ` Andy Lutomirski
  0 siblings, 1 reply; 18+ messages in thread
From: Mathieu Desnoyers @ 2020-12-01 14:31 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andy Lutomirski, x86, linux-kernel, Nicholas Piggin,
	Arnd Bergmann, Anton Blanchard, Linus Torvalds

----- On Dec 1, 2020, at 5:06 AM, Peter Zijlstra peterz@infradead.org wrote:

> On Mon, Nov 30, 2020 at 09:50:34AM -0800, Andy Lutomirski wrote:
>> It seems to be that most RSEQ membarrier users will expect any
>> stores done before the membarrier() syscall to be visible to the
>> target task(s).  While this is extremely likely to be true in
>> practice, nothing actually guarantees it by a strict reading of the
>> x86 manuals.  Rather than providing this guarantee by accident and
>> potentially causing a problem down the road, just add an explicit
>> barrier.
> 
> A very long time ago; when Jens introduced smp_call_function(), we had
> this discussion. At the time Linus said that receiving an interrupt had
> better be ordering, and if it is not, then it's up to the architecture
> to handle that before it gets into the common code.
> 
>  https://lkml.kernel.org/r/alpine.LFD.2.00.0902180744520.21686@localhost.localdomain
> 
> Maybe we want to revisit this now, but there might be a fair amount of
> code relying on all this by now.
> 
> Documenting it better might help.

Considering that we already have this in membarrier ipi_mb :

static void ipi_mb(void *info)
{
        smp_mb();       /* IPIs should be serializing but paranoid. */
}

I think it makes sense to add this same smp_mb() in the ipi_rseq if the expected
behavior is to order memory accesses as well, and have the same level of paranoia as
the ipi_mb.

Thanks,

Mathieu


> 
>> Signed-off-by: Andy Lutomirski <luto@kernel.org>
>> ---
>>  kernel/sched/membarrier.c | 8 ++++++++
>>  1 file changed, 8 insertions(+)
>> 
>> diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c
>> index e23e74d52db5..7d98ef5d3bcd 100644
>> --- a/kernel/sched/membarrier.c
>> +++ b/kernel/sched/membarrier.c
>> @@ -40,6 +40,14 @@ static void ipi_mb(void *info)
>>  
>>  static void ipi_rseq(void *info)
>>  {
>> +	/*
>> +	 * Ensure that all stores done by the calling thread are visible
>> +	 * to the current task before the current task resumes.  We could
>> +	 * probably optimize this away on most architectures, but by the
>> +	 * time we've already sent an IPI, the cost of the extra smp_mb()
>> +	 * is negligible.
>> +	 */
>> +	smp_mb();
>>  	rseq_preempt(current);
>>  }
> 
> So I think this really isn't right.

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


* Re: [PATCH 1/3] x86/membarrier: Get rid of a dubious optimization
  2020-11-30 17:50 ` [PATCH 1/3] x86/membarrier: Get rid of a dubious optimization Andy Lutomirski
@ 2020-12-01 14:39   ` Mathieu Desnoyers
  2020-12-01 17:47     ` Andy Lutomirski
  0 siblings, 1 reply; 18+ messages in thread
From: Mathieu Desnoyers @ 2020-12-01 14:39 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: x86, linux-kernel, Nicholas Piggin, Arnd Bergmann, Anton Blanchard

----- On Nov 30, 2020, at 12:50 PM, Andy Lutomirski luto@kernel.org wrote:
[...]
> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
> index 11666ba19b62..dabe683ab076 100644
> --- a/arch/x86/mm/tlb.c
> +++ b/arch/x86/mm/tlb.c
> @@ -474,8 +474,10 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct
> mm_struct *next,
> 	/*
> 	 * The membarrier system call requires a full memory barrier and
> 	 * core serialization before returning to user-space, after
> -	 * storing to rq->curr. Writing to CR3 provides that full
> -	 * memory barrier and core serializing instruction.
> +	 * storing to rq->curr, when changing mm.  This is because membarrier()
> +	 * sends IPIs to all CPUs that are in the target mm, but another
> +	 * CPU switch to the target mm in the mean time.

The sentence "This is because membarrier() sends IPIs to all CPUs that are in
the target mm, but another CPU switch to the target mm in the mean time." seems
rather unclear. Could be clarified with e.g.:

"This is because membarrier() sends IPIs to all CPUs that are in the target mm
to make them issue memory barriers. However, if another CPU switches to/from the
target mm concurrently with membarrier(), it can cause that CPU not to receive the
IPI when it really should issue a memory barrier."

For the rest of this patch:

Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>

Thanks!

Mathieu


-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


* Re: [PATCH 1/3] x86/membarrier: Get rid of a dubious optimization
  2020-12-01 14:39   ` Mathieu Desnoyers
@ 2020-12-01 17:47     ` Andy Lutomirski
  0 siblings, 0 replies; 18+ messages in thread
From: Andy Lutomirski @ 2020-12-01 17:47 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Andy Lutomirski, x86, linux-kernel, Nicholas Piggin,
	Arnd Bergmann, Anton Blanchard

On Tue, Dec 1, 2020 at 6:39 AM Mathieu Desnoyers
<mathieu.desnoyers@efficios.com> wrote:
>
> ----- On Nov 30, 2020, at 12:50 PM, Andy Lutomirski luto@kernel.org wrote:
> [...]
> > diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
> > index 11666ba19b62..dabe683ab076 100644
> > --- a/arch/x86/mm/tlb.c
> > +++ b/arch/x86/mm/tlb.c
> > @@ -474,8 +474,10 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct
> > mm_struct *next,
> >       /*
> >        * The membarrier system call requires a full memory barrier and
> >        * core serialization before returning to user-space, after
> > -      * storing to rq->curr. Writing to CR3 provides that full
> > -      * memory barrier and core serializing instruction.
> > +      * storing to rq->curr, when changing mm.  This is because membarrier()
> > +      * sends IPIs to all CPUs that are in the target mm, but another
> > +      * CPU switch to the target mm in the mean time.
>
> The sentence "This is because membarrier() sends IPIs to all CPUs that are in
> the target mm, but another CPU switch to the target mm in the mean time." seems
> rather unclear. Could be clarified with e.g.:
>
> "This is because membarrier() sends IPIs to all CPUs that are in the target mm
> to make them issue memory barriers. However, if another CPU switches to/from the
> target mm concurrently with membarrier(), it can cause that CPU not to receive the
> IPI when it really should issue a memory barrier."
>
> For the rest of this patch:
>
> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
>
> Thanks!

Done, thanks!

>
> Mathieu
>
>
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com


* Re: [PATCH 2/3] membarrier: Add an actual barrier before rseq_preempt()
  2020-12-01 14:31     ` Mathieu Desnoyers
@ 2020-12-01 17:55       ` Andy Lutomirski
  0 siblings, 0 replies; 18+ messages in thread
From: Andy Lutomirski @ 2020-12-01 17:55 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Peter Zijlstra, Andy Lutomirski, x86, linux-kernel,
	Nicholas Piggin, Arnd Bergmann, Anton Blanchard, Linus Torvalds

On Tue, Dec 1, 2020 at 6:31 AM Mathieu Desnoyers
<mathieu.desnoyers@efficios.com> wrote:
>
> ----- On Dec 1, 2020, at 5:06 AM, Peter Zijlstra peterz@infradead.org wrote:
>
> > On Mon, Nov 30, 2020 at 09:50:34AM -0800, Andy Lutomirski wrote:
> >> It seems to be that most RSEQ membarrier users will expect any
> >> stores done before the membarrier() syscall to be visible to the
> >> target task(s).  While this is extremely likely to be true in
> >> practice, nothing actually guarantees it by a strict reading of the
> >> x86 manuals.  Rather than providing this guarantee by accident and
> >> potentially causing a problem down the road, just add an explicit
> >> barrier.
> >
> > A very long time ago; when Jens introduced smp_call_function(), we had
> > this discussion. At the time Linus said that receiving an interrupt had
> > better be ordering, and if it is not, then it's up to the architecture
> > to handle that before it gets into the common code.
> >
> >  https://lkml.kernel.org/r/alpine.LFD.2.00.0902180744520.21686@localhost.localdomain
> >
> > Maybe we want to revisit this now, but there might be a fair amount of
> > code relying on all this by now.
> >
> > Documenting it better might help.
>
> Considering that we already have this in membarrier ipi_mb :
>
> static void ipi_mb(void *info)
> {
>         smp_mb();       /* IPIs should be serializing but paranoid. */
> }
>
> I think it makes sense to add this same smp_mb() in the ipi_rseq if the expected
> behavior is to order memory accesses as well, and have the same level of paranoia as
> the ipi_mb.

That was my reasoning.


* Re: [PATCH 3/3] membarrier: Propagate SYNC_CORE and RSEQ actions more carefully
  2020-12-01 10:16   ` Peter Zijlstra
  2020-12-01 14:28     ` Mathieu Desnoyers
@ 2020-12-01 18:09     ` Andy Lutomirski
  2020-12-01 18:53       ` Peter Zijlstra
  2020-12-01 18:55       ` Peter Zijlstra
  1 sibling, 2 replies; 18+ messages in thread
From: Andy Lutomirski @ 2020-12-01 18:09 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andy Lutomirski, X86 ML, Mathieu Desnoyers, LKML,
	Nicholas Piggin, Arnd Bergmann, Anton Blanchard

On Tue, Dec 1, 2020 at 2:16 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Mon, Nov 30, 2020 at 09:50:35AM -0800, Andy Lutomirski wrote:
> > membarrier() carefully propagates SYNC_CORE and RSEQ actions to all
> > other CPUs, but there are two issues.
> >
> >  - membarrier() does not sync_core() or rseq_preempt() the calling
> >    CPU.  Aside from the logic being mind-bending, this also means
> >    that it may not be safe to modify user code through an alias,
> >    call membarrier(), and then jump to a different executable alias
> >    of the same code.
>
> I always understood this to be on purpose. The calling CPU can fix up
> itself just fine. The pain point is fixing up the other CPUs, and that's
> where membarrier() helps.
>
> That said, I don't mind including self, these aren't fast calls by any
> means.

Honestly, I mostly did this because IMO it's a nice cleanup.  It took
quite some reading of the code, and of the comments explaining why it's
OK not to target the calling CPU, to convince myself that it was
correct, and I think the new code is considerably clearer.  If we want
to skip the calling CPU, then I think we should still use the new code
but use smp_call_function() instead, which will itself skip the caller
-- see the sketch below.  Thoughts?
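
(Something along these lines -- just a sketch of that alternative,
using the mask variant, not a tested change:)

	if (cpu_id >= 0) {
		int cpu = get_cpu();

		if (cpu_id != cpu)
			smp_call_function_single(cpu_id, ipi_func, NULL, 1);
		put_cpu();
	} else {
		/* smp_call_function_many() never IPIs the calling CPU. */
		preempt_disable();
		smp_call_function_many(tmpmask, ipi_func, NULL, 1);
		preempt_enable();
	}

	/* The trailing smp_mb() would have to stay, since self was skipped. */
	smp_mb();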

>
> >  - membarrier() does not explicitly sync_core() remote CPUs either;
> >    instead, it relies on the assumption that an IPI will result in a
> >    core sync.  On x86, I think this may be true in practice, but
> >    it's not architecturally reliable.  In particular, the SDM and
> >    APM do not appear to guarantee that interrupt delivery is
> >    serializing.
>
> Right, I don't think we rely on that, we do rely on interrupt delivery
> providing order though -- as per the previous email.

I looked for a bit, and I couldn't find anything in the SDM or APM to
support this, and I would be rather surprised if other architectures
synchronize their instruction streams on interrupt delivery.  On
architectures without hardware I$ coherency and with actual fast
interrupts, I would be surprised if interrupts ensured I$ coherency
with prior writes from other cores.

On x86, interrupt *return* via IRET is definitely serializing but, as
mentioned in patch 1, we don't actually guarantee that we'll execute
IRET on the way out of an IPI before running user code.

So I stand by this part of my patch.

>
> >    On a preemptible kernel, IPI return can schedule,
> >    thereby switching to another task in the same mm that was
> >    sleeping in a syscall.  The new task could then SYSRET back to
> >    usermode without ever executing IRET.
>
> This; I think we all overlooked this scenario.
>
> > This patch simplifies the code to treat the calling CPU just like
> > all other CPUs, and explicitly sync_core() on all target CPUs.  This
> > eliminates the need for the smp_mb() at the end of the function
> > except in the special case of a targeted remote membarrier().  This
> > patch updates that code and the comments accordingly.
> >
> > Signed-off-by: Andy Lutomirski <luto@kernel.org>
>
> > @@ -228,25 +258,33 @@ static int membarrier_private_expedited(int flags, int cpu_id)
> >               rcu_read_unlock();
> >       }
> >
> > -     preempt_disable();
> > -     if (cpu_id >= 0)
> > -             smp_call_function_single(cpu_id, ipi_func, NULL, 1);
> > -     else
> > -             smp_call_function_many(tmpmask, ipi_func, NULL, 1);
> > -     preempt_enable();
> > +     if (cpu_id >= 0) {
> > +             int cpu = get_cpu();
> > +
> > +             if (cpu_id == cpu) {
> > +                     ipi_func(NULL);
> > +             } else {
> > +                     smp_call_function_single(cpu_id, ipi_func, NULL, 1);
> > +                     /*
> > +                      * This is analogous to the smp_mb() at the beginning
> > +                      * of the function -- exit from a system call is not a
> > +                      * barrier.  We only need this if we're targeting a
> > +                      * specific remote CPU, though -- otherwise ipi_func()
> > +                      * would serve the same purpose.
> > +                      */
> > +                     smp_mb();
>
> smp_call_function_single(.wait=1) already orders against completion of
> the IPI. Do we really need more?

What kind of order does it provide?  A quick skim of the code suggests
that it's an acquire barrier, but I think we need a full sequential
consistency barrier, at least on sufficiently weakly ordered
architectures.  On x86, loads are ordered and this is probably
irrelevant.  Also, this barrier was already there (it's the one I
deleted below), and I think that removing it should be its own patch
if we want to go that route.


* Re: [PATCH 3/3] membarrier: Propagate SYNC_CORE and RSEQ actions more carefully
  2020-12-01 14:28     ` Mathieu Desnoyers
@ 2020-12-01 18:12       ` Andy Lutomirski
  2020-12-01 18:29         ` Mathieu Desnoyers
  0 siblings, 1 reply; 18+ messages in thread
From: Andy Lutomirski @ 2020-12-01 18:12 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Peter Zijlstra, Andy Lutomirski, x86, linux-kernel,
	Nicholas Piggin, Arnd Bergmann, Anton Blanchard

On Tue, Dec 1, 2020 at 6:28 AM Mathieu Desnoyers
<mathieu.desnoyers@efficios.com> wrote:
>
> ----- On Dec 1, 2020, at 5:16 AM, Peter Zijlstra peterz@infradead.org wrote:
>
> > On Mon, Nov 30, 2020 at 09:50:35AM -0800, Andy Lutomirski wrote:
> >> membarrier() carefully propagates SYNC_CORE and RSEQ actions to all
> >> other CPUs, but there are two issues.
> >>
> >>  - membarrier() does not sync_core() or rseq_preempt() the calling
> >>    CPU.  Aside from the logic being mind-bending, this also means
> >>    that it may not be safe to modify user code through an alias,
> >>    call membarrier(), and then jump to a different executable alias
> >>    of the same code.
> >
> > I always understood this to be on purpose. The calling CPU can fix up
> > itself just fine. The pain point is fixing up the other CPUs, and that's
> > where membarrier() helps.
>
> Indeed, as documented in the man page:
>
>        MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE (since Linux 4.16)
>               In  addition  to  providing  the  memory ordering guarantees de‐
>               scribed in MEMBARRIER_CMD_PRIVATE_EXPEDITED,  upon  return  from
>               system call the calling thread has a guarantee that all its run‐
>               ning thread siblings have executed a core  serializing  instruc‐
>               tion.   This  guarantee is provided only for threads in the same
>               process as the calling thread.
>
> membarrier sync core guarantees a core serializing instruction on the siblings,
> not on the caller thread. This has been done on purpose given that the caller
> thread can always issue its core serializing instruction from user-space on
> its own.
>
> >
> > That said, I don't mind including self, these aren't fast calls by any
> > means.
>
> I don't mind including self either, but this would require documentation
> updates, including man pages, to state that starting from kernel Y this
> is the guaranteed behavior. It's then tricky for user-space to query what
> the behavior is unless we introduce a new membarrier command for it. So this
> could introduce issues if software written for the newer kernels runs on older
> kernels.

For rseq at least, if we do this now we don't have this issue -- I
don't think any released kernel has the rseq mode.

>
> >
> >>  - membarrier() does not explicitly sync_core() remote CPUs either;
> >>    instead, it relies on the assumption that an IPI will result in a
> >>    core sync.  On x86, I think this may be true in practice, but
> >>    it's not architecturally reliable.  In particular, the SDM and
> >>    APM do not appear to guarantee that interrupt delivery is
> >>    serializing.
> >
> > Right, I don't think we rely on that, we do rely on interrupt delivery
> > providing order though -- as per the previous email.
> >
> >>    On a preemptible kernel, IPI return can schedule,
> >>    thereby switching to another task in the same mm that was
> >>    sleeping in a syscall.  The new task could then SYSRET back to
> >>    usermode without ever executing IRET.
> >
> > This; I think we all overlooked this scenario.
>
> Indeed, this is an issue which needs to be fixed.
>
> >
> >> This patch simplifies the code to treat the calling CPU just like
> >> all other CPUs, and explicitly sync_core() on all target CPUs.  This
> >> eliminates the need for the smp_mb() at the end of the function
> >> except in the special case of a targeted remote membarrier().  This
> >> patch updates that code and the comments accordingly.
>
> I am not confident that removing the smp_mb at the end of membarrier is
> an appropriate change, nor that it simplifies the model.

Ah, but I didn't remove it.  I carefully made sure that every possible
path through the function does an smp_mb() or stronger after all the
cpu_rq reads.  ipi_func(), on_each_cpu(), and the explicit smp_mb()
cover the three cases.

That being said, if you prefer, I can make the change to skip the
calling CPU, in which case I'll leave the smp_mb() at the end alone.


* Re: [PATCH 3/3] membarrier: Propagate SYNC_CORE and RSEQ actions more carefully
  2020-12-01 18:12       ` Andy Lutomirski
@ 2020-12-01 18:29         ` Mathieu Desnoyers
  2020-12-01 18:48           ` Andy Lutomirski
  0 siblings, 1 reply; 18+ messages in thread
From: Mathieu Desnoyers @ 2020-12-01 18:29 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Peter Zijlstra, x86, linux-kernel, Nicholas Piggin,
	Arnd Bergmann, Anton Blanchard

----- On Dec 1, 2020, at 1:12 PM, Andy Lutomirski luto@kernel.org wrote:

> On Tue, Dec 1, 2020 at 6:28 AM Mathieu Desnoyers
> <mathieu.desnoyers@efficios.com> wrote:
>>
>> ----- On Dec 1, 2020, at 5:16 AM, Peter Zijlstra peterz@infradead.org wrote:
>>
>> > On Mon, Nov 30, 2020 at 09:50:35AM -0800, Andy Lutomirski wrote:
>> >> membarrier() carefully propagates SYNC_CORE and RSEQ actions to all
>> >> other CPUs, but there are two issues.
>> >>
>> >>  - membarrier() does not sync_core() or rseq_preempt() the calling
>> >>    CPU.  Aside from the logic being mind-bending, this also means
>> >>    that it may not be safe to modify user code through an alias,
>> >>    call membarrier(), and then jump to a different executable alias
>> >>    of the same code.
>> >
>> > I always understood this to be on purpose. The calling CPU can fix up
>> > itself just fine. The pain point is fixing up the other CPUs, and that's
>> > where membarrier() helps.
>>
>> Indeed, as documented in the man page:
>>
>>        MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE (since Linux 4.16)
>>               In  addition  to  providing  the  memory ordering guarantees de‐
>>               scribed in MEMBARRIER_CMD_PRIVATE_EXPEDITED,  upon  return  from
>>               system call the calling thread has a guarantee that all its run‐
>>               ning thread siblings have executed a core  serializing  instruc‐
>>               tion.   This  guarantee is provided only for threads in the same
>>               process as the calling thread.
>>
>> membarrier sync core guarantees a core serializing instruction on the siblings,
>> not on the caller thread. This has been done on purpose given that the caller
>> thread can always issue its core serializing instruction from user-space on
>> its own.
>>
>> >
>> > That said, I don't mind including self, these aren't fast calls by any
>> > means.
>>
>> I don't mind including self either, but this would require documentation
>> updates, including man pages, to state that starting from kernel Y this
>> is the guaranteed behavior. It's then tricky for user-space to query what
>> the behavior is unless we introduce a new membarrier command for it. So this
>> could introduce issues if software written for the newer kernels runs on older
>> kernels.
> 
> For rseq at least, if we do this now we don't have this issue -- I
> don't think any released kernel has the rseq mode.

But for rseq, there is no core-sync. And considering that it is invalid
to issue a system call within an rseq critical section (including membarrier),
I don't see what we gain by doing an rseq barrier on self?

The only case where it really changes the semantics is for core-sync, I think.
And in this case, it would be adding an additional core-sync on self. I
am OK with doing that considering that it will simplify use of the system
call. I'm just wondering how we should document this change in the man page.

> 
>>
>> >
>> >>  - membarrier() does not explicitly sync_core() remote CPUs either;
>> >>    instead, it relies on the assumption that an IPI will result in a
>> >>    core sync.  On x86, I think this may be true in practice, but
>> >>    it's not architecturally reliable.  In particular, the SDM and
>> >>    APM do not appear to guarantee that interrupt delivery is
>> >>    serializing.
>> >
>> > Right, I don't think we rely on that, we do rely on interrupt delivery
>> > providing order though -- as per the previous email.
>> >
>> >>    On a preemptible kernel, IPI return can schedule,
>> >>    thereby switching to another task in the same mm that was
>> >>    sleeping in a syscall.  The new task could then SYSRET back to
>> >>    usermode without ever executing IRET.
>> >
>> > This; I think we all overlooked this scenario.
>>
>> Indeed, this is an issue which needs to be fixed.
>>
>> >
>> >> This patch simplifies the code to treat the calling CPU just like
>> >> all other CPUs, and explicitly sync_core() on all target CPUs.  This
>> >> eliminates the need for the smp_mb() at the end of the function
>> >> except in the special case of a targeted remote membarrier().  This
>> >> patch updates that code and the comments accordingly.
>>
>> I am not confident that removing the smp_mb at the end of membarrier is
>> an appropriate change, nor that it simplifies the model.
> 
> Ah, but I didn't remove it.  I carefully made sure that every possible
> path through the function does an smp_mb() or stronger after all the
> cpu_rq reads.  ipi_func(), on_each_cpu(), and the explicit smp_mb()
> cover the three cases.
> 
> That being said, if you prefer, I can make the change to skip the
> calling CPU, in which case I'll leave the smp_mb() at the end alone.

For the memory barrier commands, I prefer skipping self and leaving the
smp_mb at the very beginning/end of the system call. Those are the key
before/after points we are synchronizing against, and those are simple
to document.

For rseq, I would not mind calling the ipi callback on self, but I don't
see how useful it would be considering that issuing system calls within
rseq c.s. is invalid.

For core-sync, I agree that doing a core-sync on self would be an actual
simplification for users, but we should think about how we plan to update
the man page to convey that behavior change to end users.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


* Re: [PATCH 3/3] membarrier: Propagate SYNC_CORE and RSEQ actions more carefully
  2020-12-01 18:29         ` Mathieu Desnoyers
@ 2020-12-01 18:48           ` Andy Lutomirski
  2020-12-01 20:51             ` Mathieu Desnoyers
  0 siblings, 1 reply; 18+ messages in thread
From: Andy Lutomirski @ 2020-12-01 18:48 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Andy Lutomirski, Peter Zijlstra, x86, linux-kernel,
	Nicholas Piggin, Arnd Bergmann, Anton Blanchard

On Tue, Dec 1, 2020 at 10:29 AM Mathieu Desnoyers
<mathieu.desnoyers@efficios.com> wrote:
>
> ----- On Dec 1, 2020, at 1:12 PM, Andy Lutomirski luto@kernel.org wrote:
>
> > On Tue, Dec 1, 2020 at 6:28 AM Mathieu Desnoyers
> > <mathieu.desnoyers@efficios.com> wrote:
> >>
> >> ----- On Dec 1, 2020, at 5:16 AM, Peter Zijlstra peterz@infradead.org wrote:
> >>
> >> > On Mon, Nov 30, 2020 at 09:50:35AM -0800, Andy Lutomirski wrote:
> >> >> membarrier() carefully propagates SYNC_CORE and RSEQ actions to all
> >> >> other CPUs, but there are two issues.
> >> >>
> >> >>  - membarrier() does not sync_core() or rseq_preempt() the calling
> >> >>    CPU.  Aside from the logic being mind-bending, this also means
> >> >>    that it may not be safe to modify user code through an alias,
> >> >>    call membarrier(), and then jump to a different executable alias
> >> >>    of the same code.
> >> >
> >> > I always understood this to be on purpose. The calling CPU can fix up
> >> > itself just fine. The pain point is fixing up the other CPUs, and that's
> >> > where membarrier() helps.
> >>
> >> Indeed, as documented in the man page:
> >>
> >>        MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE (since Linux 4.16)
> >>               In  addition  to  providing  the  memory ordering guarantees de‐
> >>               scribed in MEMBARRIER_CMD_PRIVATE_EXPEDITED,  upon  return  from
> >>               system call the calling thread has a guarantee that all its run‐
> >>               ning thread siblings have executed a core  serializing  instruc‐
> >>               tion.   This  guarantee is provided only for threads in the same
> >>               process as the calling thread.
> >>
> >> membarrier sync core guarantees a core serializing instruction on the siblings,
> >> not on the caller thread. This has been done on purpose given that the caller
> >> thread can always issue its core serializing instruction from user-space on
> >> its own.
> >>
> >> >
> >> > That said, I don't mind including self, these aren't fast calls by any
> >> > means.
> >>
> >> I don't mind including self either, but this would require documentation
> >> updates, including man pages, to state that starting from kernel Y this
> >> is the guaranteed behavior. It's then tricky for user-space to query what
> >> the behavior is unless we introduce a new membarrier command for it. So this
> >> could introduce issues if software written for the newer kernels runs on older
> >> kernels.
> >
> > For rseq at least, if we do this now we don't have this issue -- I
> > don't think any released kernel has the rseq mode.
>
> But for rseq, there is no core-sync. And considering that it is invalid
> to issue a system call within an rseq critical section (including membarrier),
> I don't see what we gain by doing a rseq barrier on self ?
>
> The only case where it really changes the semantic is for core-sync I think.
> And in this case, it would be adding an additional core-sync on self. I
> am OK with doing that considering that it will simplify use of the system
> call. I'm just wondering how we should document this change in the man page.
>
> >
> >>
> >> >
> >> >>  - membarrier() does not explicitly sync_core() remote CPUs either;
> >> >>    instead, it relies on the assumption that an IPI will result in a
> >> >>    core sync.  On x86, I think this may be true in practice, but
> >> >>    it's not architecturally reliable.  In particular, the SDM and
> >> >>    APM do not appear to guarantee that interrupt delivery is
> >> >>    serializing.
> >> >
> >> > Right, I don't think we rely on that, we do rely on interrupt delivery
> >> > providing order though -- as per the previous email.
> >> >
> >> >>    On a preemptible kernel, IPI return can schedule,
> >> >>    thereby switching to another task in the same mm that was
> >> >>    sleeping in a syscall.  The new task could then SYSRET back to
> >> >>    usermode without ever executing IRET.
> >> >
> >> > This; I think we all overlooked this scenario.
> >>
> >> Indeed, this is an issue which needs to be fixed.
> >>
> >> >
> >> >> This patch simplifies the code to treat the calling CPU just like
> >> >> all other CPUs, and explicitly sync_core() on all target CPUs.  This
> >> >> eliminates the need for the smp_mb() at the end of the function
> >> >> except in the special case of a targeted remote membarrier().  This
> >> >> patch updates that code and the comments accordingly.
> >>
> >> I am not confident that removing the smp_mb at the end of membarrier is
> >> an appropriate change, nor that it simplifies the model.
> >
> > Ah, but I didn't remove it.  I carefully made sure that every possible
> > path through the function does an smp_mb() or stronger after all the
> > cpu_rq reads.  ipi_func(), on_each_cpu(), and the explicit smp_mb()
> > cover the three cases.
> >
> > That being said, if you prefer, I can make the change to skip the
> > calling CPU, in which case I'll leave the smp_mb() at the end alone.
>
> For the memory barrier commands, I prefer skipping self and leaving the
> smp_mb at the very beginning/end of the system call. Those are the key
> before/after points we are synchronizing against, and those are simple
> to document.
>

Is there a reason that doing the barrier at the very end could make an
observable difference?  The two models are:

membarrier() {
 smp_mb();
 read a bunch of cpu_rq memory and make decisions;
 execute smp_mb() on relevant cpus including self;
}

versus

membarrier() {
 smp_mb();
 read a bunch of cpu_rq memory and make decisions;
 execute smp_mb() on relevant non-self cpus;
 wait for that to finish (acquire-style on the local cpu);
 smp_mb();
}

Is the idea that, on a sufficiently weakly ordered architecture, some
remote CPU could do a store before the IPI and a local load after the
membarrier() syscall might not observe that store unless the local
smp_mb() is after the remote smp_mb()?  If so, I'm not entirely
convinced that this is observably different from the store simply
occurring after the IPI, but maybe there are some gnarly situations in
which this could happen.

If your concern is something along these lines, I could try to write
up an appropriate comment, and I'll rework the patch.


* Re: [PATCH 3/3] membarrier: Propagate SYNC_CORE and RSEQ actions more carefully
  2020-12-01 18:09     ` Andy Lutomirski
@ 2020-12-01 18:53       ` Peter Zijlstra
  2020-12-01 18:55       ` Peter Zijlstra
  1 sibling, 0 replies; 18+ messages in thread
From: Peter Zijlstra @ 2020-12-01 18:53 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: X86 ML, Mathieu Desnoyers, LKML, Nicholas Piggin, Arnd Bergmann,
	Anton Blanchard

On Tue, Dec 01, 2020 at 10:09:22AM -0800, Andy Lutomirski wrote:
> On Tue, Dec 1, 2020 at 2:16 AM Peter Zijlstra <peterz@infradead.org> wrote:
> > >  - membarrier() does not explicitly sync_core() remote CPUs either;
> > >    instead, it relies on the assumption that an IPI will result in a
> > >    core sync.  On x86, I think this may be true in practice, but
> > >    it's not architecturally reliable.  In particular, the SDM and
> > >    APM do not appear to guarantee that interrupt delivery is
> > >    serializing.
> >
> > Right, I don't think we rely on that, we do rely on interrupt delivery
> > providing order though -- as per the previous email.

order, not serializing.

> I looked for a bit, and I couldn't find anything in the SDM or APM to
> support this, and I would be rather surprised if other architectures
> synchronize their instruction streams on interrupt delivery.  On
> architectures without hardware I$ coherency and with actual fast
> interrupts, I would be surprised if interrupts ensured I$ coherency
> with prior writes from other cores.

Data, not I$. smp_mb() has nothing on I$. The claim is that smp_mb() at
the start of an IPI is pointless (as a means of ordering against the CPU
raising the IPI).

Doing smp_mb() before raising the IPI does make sense and is actually
done IIRC.
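
Roughly (sketch from memory, not the exact membarrier code):

	smp_mb();	/* order the caller's prior accesses ... */
	smp_call_function_single(cpu, ipi_func, NULL, 1);	/* ... before raising the IPI, .wait=1 */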



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 3/3] membarrier: Propagate SYNC_CORE and RSEQ actions more carefully
  2020-12-01 18:09     ` Andy Lutomirski
  2020-12-01 18:53       ` Peter Zijlstra
@ 2020-12-01 18:55       ` Peter Zijlstra
  1 sibling, 0 replies; 18+ messages in thread
From: Peter Zijlstra @ 2020-12-01 18:55 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: X86 ML, Mathieu Desnoyers, LKML, Nicholas Piggin, Arnd Bergmann,
	Anton Blanchard

On Tue, Dec 01, 2020 at 10:09:22AM -0800, Andy Lutomirski wrote:
> > smp_call_function_single(.wait=1) already orders against completion of
> > the IPI. Do we really need more?
> 
> What kind of order does it provide?  A quick skim of the code suggests
> that it's an acquire barrier, but I think we need a full sequential
> consistency barrier, at least on sufficiently weakly ordered
> architectures.  On x86, loads are ordered and this is probably
> irrelevant.  Also, this barrier was already there (it's the one I
> deleted below), and I think that removing it should be its own patch
> if we want to go that route.

	smp_mb()
	raise-IPI ---->
			<IPI>
			  /* do crud */
			  STORE-RELEASE csd->lock, 0;
			</IPI>

	LOAD-ACQUIRE csd->lock
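
In code that's roughly (illustrative sketch only; the field name follows
the diagram above rather than the exact csd layout):

	/* sender */
	smp_mb();
	arch_send_call_function_single_ipi(cpu);
	smp_cond_load_acquire(&csd->lock, !VAL);	/* spin until released */

	/* target, inside the IPI handler */
	csd->func(csd->info);				/* do crud */
	smp_store_release(&csd->lock, 0);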



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 3/3] membarrier: Propagate SYNC_CORE and RSEQ actions more carefully
  2020-12-01 18:48           ` Andy Lutomirski
@ 2020-12-01 20:51             ` Mathieu Desnoyers
  0 siblings, 0 replies; 18+ messages in thread
From: Mathieu Desnoyers @ 2020-12-01 20:51 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Peter Zijlstra, x86, linux-kernel, Nicholas Piggin,
	Arnd Bergmann, Anton Blanchard

----- On Dec 1, 2020, at 1:48 PM, Andy Lutomirski luto@kernel.org wrote:

> On Tue, Dec 1, 2020 at 10:29 AM Mathieu Desnoyers
> <mathieu.desnoyers@efficios.com> wrote:
>>
>> ----- On Dec 1, 2020, at 1:12 PM, Andy Lutomirski luto@kernel.org wrote:
>>
>> > On Tue, Dec 1, 2020 at 6:28 AM Mathieu Desnoyers
>> > <mathieu.desnoyers@efficios.com> wrote:
>> >>
>> >> ----- On Dec 1, 2020, at 5:16 AM, Peter Zijlstra peterz@infradead.org wrote:
>> >>
>> >> > On Mon, Nov 30, 2020 at 09:50:35AM -0800, Andy Lutomirski wrote:
>> >> >> membarrier() carefully propagates SYNC_CORE and RSEQ actions to all
>> >> >> other CPUs, but there are two issues.
>> >> >>
>> >> >>  - membarrier() does not sync_core() or rseq_preempt() the calling
>> >> >>    CPU.  Aside from the logic being mind-bending, this also means
>> >> >>    that it may not be safe to modify user code through an alias,
>> >> >>    call membarrier(), and then jump to a different executable alias
>> >> >>    of the same code.
>> >> >
>> >> > I always understood this to be on purpose. The calling CPU can fix up
>> >> > itself just fine. The pain point is fixing up the other CPUs, and that's
>> >> > where membarrier() helps.
>> >>
>> >> Indeed, as documented in the man page:
>> >>
>> >>        MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE (since Linux 4.16)
>> >>               In addition to providing the memory ordering guarantees
>> >>               described in MEMBARRIER_CMD_PRIVATE_EXPEDITED, upon return
>> >>               from system call the calling thread has a guarantee that all
>> >>               its running thread siblings have executed a core serializing
>> >>               instruction.  This guarantee is provided only for threads in
>> >>               the same process as the calling thread.
>> >>
>> >> membarrier sync core guarantees a core serializing instruction on the siblings,
>> >> not on the caller thread. This has been done on purpose given that the caller
>> >> thread can always issue its core serializing instruction from user-space on
>> >> its own.
>> >>
>> >> >
>> >> > That said, I don't mind including self, these aren't fast calls by any
>> >> > means.
>> >>
>> >> I don't mind including self either, but this would require documentation
>> >> updates, including man pages, to state that starting from kernel Y this
>> >> is the guaranteed behavior. It's then tricky for user-space to query what
>> >> the behavior is unless we introduce a new membarrier command for it. So this
>> >> could introduce issues if software written for the newer kernels runs on older
>> >> kernels.
>> >
>> > For rseq at least, if we do this now we don't have this issue -- I
>> > don't think any released kernel has the rseq mode.
>>
>> But for rseq, there is no core-sync. And considering that it is invalid
>> to issue a system call within an rseq critical section (including membarrier),
>> I don't see what we gain by doing an rseq barrier on self.
>>
>> The only case where it really changes the semantics is core-sync, I think.
>> And in that case, it would be adding an additional core-sync on self. I
>> am OK with doing that, considering that it will simplify use of the system
>> call. I'm just wondering how we should document this change in the man page.
>>
>> >
>> >>
>> >> >
>> >> >>  - membarrier() does not explicitly sync_core() remote CPUs either;
>> >> >>    instead, it relies on the assumption that an IPI will result in a
>> >> >>    core sync.  On x86, I think this may be true in practice, but
>> >> >>    it's not architecturally reliable.  In particular, the SDM and
>> >> >>    APM do not appear to guarantee that interrupt delivery is
>> >> >>    serializing.
>> >> >
>> >> > Right, I don't think we rely on that, we do rely on interrupt delivery
>> >> > providing order though -- as per the previous email.
>> >> >
>> >> >>    On a preemptible kernel, IPI return can schedule,
>> >> >>    thereby switching to another task in the same mm that was
>> >> >>    sleeping in a syscall.  The new task could then SYSRET back to
>> >> >>    usermode without ever executing IRET.
>> >> >
>> >> > This; I think we all overlooked this scenario.
>> >>
>> >> Indeed, this is an issue which needs to be fixed.
>> >>
>> >> >
>> >> >> This patch simplifies the code to treat the calling CPU just like
>> >> >> all other CPUs, and explicitly sync_core() on all target CPUs.  This
>> >> >> eliminates the need for the smp_mb() at the end of the function
>> >> >> except in the special case of a targeted remote membarrier().  This
>> >> >> patch updates that code and the comments accordingly.
>> >>
>> >> I am not confident that removing the smp_mb at the end of membarrier is
>> >> an appropriate change, nor that it simplifies the model.
>> >
>> > Ah, but I didn't remove it.  I carefully made sure that every possible
>> > path through the function does an smp_mb() or stronger after all the
>> > cpu_rq reads.  ipi_func(), on_each_cpu(), and the explicit smp_mb()
>> > cover the three cases.
>> >
>> > That being said, if you prefer, I can make the change to skip the
>> > calling CPU, in which case I'll leave the smp_mb() at the end alone.
>>
>> For the memory barrier commands, I prefer skipping self and leaving the
>> smp_mb at the very beginning/end of the system call. Those are the key
>> before/after points we are synchronizing against, and those are simple
>> to document.
>>
> 
> Is there a reason that doing the barrier at the very end could make an
> observable difference?  The two models are:
> 
> membarrier() {
> smp_mb();
> read a bunch of cpu_rq memory and make decisions;
> execute smp_mb() on relevant cpus including self;

You're forgetting that you also add an smp_mb() after each individual
IPI returns, which is why you can get away with removing the smp_mb()
at the end of the membarrier syscall without introducing issues.

> }
> 
> versus
> 
> membarrier() {
> smp_mb();
> read a bunch of cpu_rq memory and make decisions;
> execute smp_mb() on relevant non-self cpus;
> wait for that to finish (acquire-style on the local cpu);
> smp_mb();
> }
> 
> Is the idea that, on a sufficiently weakly ordered architecture, some
> remote CPU could do a store before the IPI, and a local load after the
> membarrier() syscall returns might not observe that store unless the
> local smp_mb() comes after the remote smp_mb()?  If so, I'm not entirely
> convinced that this is observably different from the store simply
> occurring after the IPI, but maybe there are some gnarly situations in
> which this could happen.
> 
> If your concern is something along these lines, I could try to write
> up an appropriate comment, and I'll rework the patch.

This is already documented in the scenarios I added as comments in the
patch sitting in the tip tree:

https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/commit/?id=25595eb6aaa9fbb31330f1e0b400642694bc6574

See "Scenario B Userspace thread execution before IPI vs membarrier's memory barrier after completing the IPI"

I think the change you proposed would still be technically OK:

- The callback on self issuing an smp_mb() would take care of ensuring
  that at least one memory barrier is issued after loading the rq->curr
  state of each cpu.
- The smp_mb() after each IPI return would ensure we have a full barrier
  between the smp_mb() in the IPI handler and the point where the
  membarrier system call returns to userspace.

But then rather than having one clear spot where the smp_mb needs to be
placed and documented, we have a more complex maze of conditions we need
to consider. Hence my preference for keeping the smp_mb at the beginning/end
of the membarrier system call.
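
I.e. the shape I would like to keep is roughly (a simplified sketch,
not the actual membarrier_private_expedited() code):

	smp_mb();	/* A: syscall entry is not guaranteed to be a mb */
	/* read rq->curr for each cpu, skip self, build tmpmask ... */
	smp_call_function_many(tmpmask, ipi_func, NULL, 1);
	smp_mb();	/* B: single, documented exit barrier */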

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2020-12-01 20:52 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-11-30 17:50 [PATCH 0/3] membarrier fixes Andy Lutomirski
2020-11-30 17:50 ` [PATCH 1/3] x86/membarrier: Get rid of a dubious optimization Andy Lutomirski
2020-12-01 14:39   ` Mathieu Desnoyers
2020-12-01 17:47     ` Andy Lutomirski
2020-11-30 17:50 ` [PATCH 2/3] membarrier: Add an actual barrier before rseq_preempt() Andy Lutomirski
2020-12-01 10:06   ` Peter Zijlstra
2020-12-01 14:31     ` Mathieu Desnoyers
2020-12-01 17:55       ` Andy Lutomirski
2020-11-30 17:50 ` [PATCH 3/3] membarrier: Propagate SYNC_CORE and RSEQ actions more carefully Andy Lutomirski
2020-12-01 10:16   ` Peter Zijlstra
2020-12-01 14:28     ` Mathieu Desnoyers
2020-12-01 18:12       ` Andy Lutomirski
2020-12-01 18:29         ` Mathieu Desnoyers
2020-12-01 18:48           ` Andy Lutomirski
2020-12-01 20:51             ` Mathieu Desnoyers
2020-12-01 18:09     ` Andy Lutomirski
2020-12-01 18:53       ` Peter Zijlstra
2020-12-01 18:55       ` Peter Zijlstra
