linuxppc-dev.lists.ozlabs.org archive mirror
* [PATCH for 4.16 v7 02/11] powerpc: membarrier: Skip memory barrier in switch_mm()
       [not found] <20180129202020.8515-1-mathieu.desnoyers@efficios.com>
@ 2018-01-29 20:20 ` Mathieu Desnoyers
  2018-02-05 20:22   ` Ingo Molnar
  2021-06-18 17:13   ` Christophe Leroy
  0 siblings, 2 replies; 9+ messages in thread
From: Mathieu Desnoyers @ 2018-01-29 20:20 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Thomas Gleixner
  Cc: linux-kernel, linux-api, Andy Lutomirski, Paul E . McKenney,
	Boqun Feng, Andrew Hunter, Maged Michael, Avi Kivity,
	Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Dave Watson, H . Peter Anvin, Andrea Parri, Russell King,
	Greg Hackmann, Will Deacon, David Sehr, Linus Torvalds, x86,
	Mathieu Desnoyers, Alan Stern, Alexander Viro, Nicholas Piggin,
	linuxppc-dev, linux-arch

Allow PowerPC to skip the full memory barrier in switch_mm(), and
only issue the barrier when scheduling into a task belonging to a
process that has registered to use private expedited membarrier
commands.

Threads targeting the same VM but belonging to different thread
groups are a tricky case. It has a few consequences:

It turns out that we cannot rely on get_nr_threads(p) to count the
number of threads using a VM. We can use
(atomic_read(&mm->mm_users) == 1 && get_nr_threads(p) == 1)
instead to skip the synchronize_sched() for cases where the VM only has
a single user, and that user only has a single thread.

It also turns out that we cannot use for_each_thread() to set
thread flags in all threads using a VM, as it only iterates on the
thread group.

Therefore, test the membarrier state variable directly rather than
relying on thread flags. This means
membarrier_register_private_expedited() needs to set the
MEMBARRIER_STATE_PRIVATE_EXPEDITED flag, issue synchronize_sched(), and
only then set MEMBARRIER_STATE_PRIVATE_EXPEDITED_READY which allows
private expedited membarrier commands to succeed.
membarrier_arch_switch_mm() now tests for the
MEMBARRIER_STATE_PRIVATE_EXPEDITED flag.
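
For context, a minimal user-space sketch of the command flow this
enables (assuming the 4.16 uapi command names in <linux/membarrier.h>;
error handling kept minimal):

#include <linux/membarrier.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

static int membarrier(int cmd, int flags)
{
	return syscall(__NR_membarrier, cmd, flags);
}

int main(void)
{
	/* Registration sets MEMBARRIER_STATE_PRIVATE_EXPEDITED, may issue
	 * synchronize_sched(), and only then sets the *_READY bit. */
	if (membarrier(MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED, 0))
		perror("register");
	/* Succeeds only once the *_READY bit is set for this mm. */
	if (membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED, 0))
		perror("membarrier");
	return 0;
}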

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
CC: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
CC: Boqun Feng <boqun.feng@gmail.com>
CC: Andrew Hunter <ahh@google.com>
CC: Maged Michael <maged.michael@gmail.com>
CC: Avi Kivity <avi@scylladb.com>
CC: Benjamin Herrenschmidt <benh@kernel.crashing.org>
CC: Paul Mackerras <paulus@samba.org>
CC: Michael Ellerman <mpe@ellerman.id.au>
CC: Dave Watson <davejwatson@fb.com>
CC: Alan Stern <stern@rowland.harvard.edu>
CC: Will Deacon <will.deacon@arm.com>
CC: Andy Lutomirski <luto@kernel.org>
CC: Ingo Molnar <mingo@redhat.com>
CC: Alexander Viro <viro@zeniv.linux.org.uk>
CC: Nicholas Piggin <npiggin@gmail.com>
CC: linuxppc-dev@lists.ozlabs.org
CC: linux-arch@vger.kernel.org
---
Changes since v1:
- Use test_ti_thread_flag(next, ...) instead of test_thread_flag() in
  powerpc membarrier_arch_sched_in(), given that we want to specifically
  check the next thread state.
- Add missing ARCH_HAS_MEMBARRIER_HOOKS in Kconfig.
- Use task_thread_info() to pass thread_info from task to
  *_ti_thread_flag().

Changes since v2:
- Move membarrier_arch_sched_in() call to finish_task_switch().
- Check for NULL t->mm in membarrier_arch_fork().
- Use membarrier_sched_in() in generic code, which invokes the
  arch-specific membarrier_arch_sched_in(). This fixes allnoconfig
  build on PowerPC.
- Move asm/membarrier.h include under CONFIG_MEMBARRIER, fixing
  allnoconfig build on PowerPC.
- Build and runtime tested on PowerPC.

Changes since v3:
- Simply rely on copy_mm() to copy the membarrier_private_expedited mm
  field on fork.
- powerpc: test thread flag instead of reading
  membarrier_private_expedited in membarrier_arch_fork().
- powerpc: skip memory barrier in membarrier_arch_sched_in() if coming
  from kernel thread, since mmdrop() implies a full barrier.
- Set membarrier_private_expedited to 1 only after arch registration
  code, thus eliminating a race where concurrent commands could succeed
  when they should fail if issued concurrently with process
  registration.
- Use READ_ONCE() for membarrier_private_expedited field access in
  membarrier_private_expedited. Matches WRITE_ONCE() performed in
  process registration.

Changes since v4:
- Move powerpc hook from sched_in() to switch_mm(), based on feedback
  from Nicholas Piggin.

Changes since v5:
- Rebase on v4.14-rc6.
- Fold "Fix: membarrier: Handle CLONE_VM + !CLONE_THREAD correctly on
  powerpc (v2)"

Changes since v6:
- Rename MEMBARRIER_STATE_SWITCH_MM to MEMBARRIER_STATE_PRIVATE_EXPEDITED.
---
 MAINTAINERS                           |  1 +
 arch/powerpc/Kconfig                  |  1 +
 arch/powerpc/include/asm/membarrier.h | 26 ++++++++++++++++++++++++++
 arch/powerpc/mm/mmu_context.c         |  7 +++++++
 include/linux/sched/mm.h              | 13 ++++++++++++-
 init/Kconfig                          |  3 +++
 kernel/sched/core.c                   | 10 ----------
 kernel/sched/membarrier.c             |  8 ++++++++
 8 files changed, 58 insertions(+), 11 deletions(-)
 create mode 100644 arch/powerpc/include/asm/membarrier.h

diff --git a/MAINTAINERS b/MAINTAINERS
index 845fc25812f1..34c1ecd5a5d1 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -8929,6 +8929,7 @@ L:	linux-kernel@vger.kernel.org
 S:	Supported
 F:	kernel/sched/membarrier.c
 F:	include/uapi/linux/membarrier.h
+F:	arch/powerpc/include/asm/membarrier.h
 
 MEMORY MANAGEMENT
 L:	linux-mm@kvack.org
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 2ed525a44734..09b02180b8a0 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -140,6 +140,7 @@ config PPC
 	select ARCH_HAS_FORTIFY_SOURCE
 	select ARCH_HAS_GCOV_PROFILE_ALL
 	select ARCH_HAS_PMEM_API                if PPC64
+	select ARCH_HAS_MEMBARRIER_HOOKS
 	select ARCH_HAS_SCALED_CPUTIME		if VIRT_CPU_ACCOUNTING_NATIVE
 	select ARCH_HAS_SG_CHAIN
 	select ARCH_HAS_TICK_BROADCAST		if GENERIC_CLOCKEVENTS_BROADCAST
diff --git a/arch/powerpc/include/asm/membarrier.h b/arch/powerpc/include/asm/membarrier.h
new file mode 100644
index 000000000000..98ff4f1fcf2b
--- /dev/null
+++ b/arch/powerpc/include/asm/membarrier.h
@@ -0,0 +1,26 @@
+#ifndef _ASM_POWERPC_MEMBARRIER_H
+#define _ASM_POWERPC_MEMBARRIER_H
+
+static inline void membarrier_arch_switch_mm(struct mm_struct *prev,
+					     struct mm_struct *next,
+					     struct task_struct *tsk)
+{
+	/*
+	 * Only need the full barrier when switching between processes.
+	 * Barrier when switching from kernel to userspace is not
+	 * required here, given that it is implied by mmdrop(). Barrier
+	 * when switching from userspace to kernel is not needed after
+	 * store to rq->curr.
+	 */
+	if (likely(!(atomic_read(&next->membarrier_state) &
+		     MEMBARRIER_STATE_PRIVATE_EXPEDITED) || !prev))
+		return;
+
+	/*
+	 * The membarrier system call requires a full memory barrier
+	 * after storing to rq->curr, before going back to user-space.
+	 */
+	smp_mb();
+}
+
+#endif /* _ASM_POWERPC_MEMBARRIER_H */
diff --git a/arch/powerpc/mm/mmu_context.c b/arch/powerpc/mm/mmu_context.c
index d60a62bf4fc7..0ab297c4cfad 100644
--- a/arch/powerpc/mm/mmu_context.c
+++ b/arch/powerpc/mm/mmu_context.c
@@ -12,6 +12,7 @@
 
 #include <linux/mm.h>
 #include <linux/cpu.h>
+#include <linux/sched/mm.h>
 
 #include <asm/mmu_context.h>
 
@@ -58,6 +59,10 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 		 *
 		 * On the read side the barrier is in pte_xchg(), which orders
 		 * the store to the PTE vs the load of mm_cpumask.
+		 *
+		 * This full barrier is needed by membarrier when switching
+		 * between processes after store to rq->curr, before user-space
+		 * memory accesses.
 		 */
 		smp_mb();
 
@@ -80,6 +85,8 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 
 	if (new_on_cpu)
 		radix_kvm_prefetch_workaround(next);
+	else
+		membarrier_arch_switch_mm(prev, next, tsk);
 
 	/*
 	 * The actual HW switching method differs between the various
diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index 3d49b91b674d..1754396795f6 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -215,14 +215,25 @@ static inline void memalloc_noreclaim_restore(unsigned int flags)
 #ifdef CONFIG_MEMBARRIER
 enum {
 	MEMBARRIER_STATE_PRIVATE_EXPEDITED_READY	= (1U << 0),
-	MEMBARRIER_STATE_SWITCH_MM			= (1U << 1),
+	MEMBARRIER_STATE_PRIVATE_EXPEDITED		= (1U << 1),
 };
 
+#ifdef CONFIG_ARCH_HAS_MEMBARRIER_HOOKS
+#include <asm/membarrier.h>
+#endif
+
 static inline void membarrier_execve(struct task_struct *t)
 {
 	atomic_set(&t->mm->membarrier_state, 0);
 }
 #else
+#ifdef CONFIG_ARCH_HAS_MEMBARRIER_HOOKS
+static inline void membarrier_arch_switch_mm(struct mm_struct *prev,
+					     struct mm_struct *next,
+					     struct task_struct *tsk)
+{
+}
+#endif
 static inline void membarrier_execve(struct task_struct *t)
 {
 }
diff --git a/init/Kconfig b/init/Kconfig
index a9a2e2c86671..2d118b6adee2 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1412,6 +1412,9 @@ config USERFAULTFD
 	  Enable the userfaultfd() system call that allows to intercept and
 	  handle page faults in userland.
 
+config ARCH_HAS_MEMBARRIER_HOOKS
+	bool
+
 config EMBEDDED
 	bool "Embedded system"
 	option allnoconfig_y
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a7bf32aabfda..c7e06dfa804b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2653,16 +2653,6 @@ static struct rq *finish_task_switch(struct task_struct *prev)
 	prev_state = prev->state;
 	vtime_task_switch(prev);
 	perf_event_task_sched_in(prev, current);
-	/*
-	 * The membarrier system call requires a full memory barrier
-	 * after storing to rq->curr, before going back to user-space.
-	 *
-	 * TODO: This smp_mb__after_unlock_lock can go away if PPC end
-	 * up adding a full barrier to switch_mm(), or we should figure
-	 * out if a smp_mb__after_unlock_lock is really the proper API
-	 * to use.
-	 */
-	smp_mb__after_unlock_lock();
 	finish_lock_switch(rq, prev);
 	finish_arch_post_lock_switch();
 
diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c
index 9bcbacba82a8..678577267a9a 100644
--- a/kernel/sched/membarrier.c
+++ b/kernel/sched/membarrier.c
@@ -118,6 +118,14 @@ static void membarrier_register_private_expedited(void)
 	if (atomic_read(&mm->membarrier_state)
 			& MEMBARRIER_STATE_PRIVATE_EXPEDITED_READY)
 		return;
+	atomic_or(MEMBARRIER_STATE_PRIVATE_EXPEDITED, &mm->membarrier_state);
+	if (!(atomic_read(&mm->mm_users) == 1 && get_nr_threads(p) == 1)) {
+		/*
+		 * Ensure all future scheduler executions will observe the
+		 * new thread flag state for this process.
+		 */
+		synchronize_sched();
+	}
 	atomic_or(MEMBARRIER_STATE_PRIVATE_EXPEDITED_READY,
 			&mm->membarrier_state);
 }
-- 
2.11.0


* Re: [PATCH for 4.16 v7 02/11] powerpc: membarrier: Skip memory barrier in switch_mm()
  2018-01-29 20:20 ` [PATCH for 4.16 v7 02/11] powerpc: membarrier: Skip memory barrier in switch_mm() Mathieu Desnoyers
@ 2018-02-05 20:22   ` Ingo Molnar
  2018-02-05 20:32     ` Mathieu Desnoyers
  2021-06-18 17:13   ` Christophe Leroy
  1 sibling, 1 reply; 9+ messages in thread
From: Ingo Molnar @ 2018-02-05 20:22 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Ingo Molnar, Peter Zijlstra, Thomas Gleixner, linux-kernel,
	linux-api, Andy Lutomirski, Paul E . McKenney, Boqun Feng,
	Andrew Hunter, Maged Michael, Avi Kivity, Benjamin Herrenschmidt,
	Paul Mackerras, Michael Ellerman, Dave Watson, H . Peter Anvin,
	Andrea Parri, Russell King, Greg Hackmann, Will Deacon,
	David Sehr, Linus Torvalds, x86, Alan Stern, Alexander Viro,
	Nicholas Piggin, linuxppc-dev, linux-arch


* Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:

>  
> +config ARCH_HAS_MEMBARRIER_HOOKS
> +	bool

Yeah, so I have renamed this to ARCH_HAS_MEMBARRIER_CALLBACKS, and propagated it 
through the rest of the patches. "Callback" is the canonical name, and I also 
cringe every time I see 'hook'.

Please let me know if there are any big objections against this minor cleanup.

Thanks,

	Ingo


* Re: [PATCH for 4.16 v7 02/11] powerpc: membarrier: Skip memory barrier in switch_mm()
  2018-02-05 20:22   ` Ingo Molnar
@ 2018-02-05 20:32     ` Mathieu Desnoyers
  0 siblings, 0 replies; 9+ messages in thread
From: Mathieu Desnoyers @ 2018-02-05 20:32 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Ingo Molnar, Peter Zijlstra, Thomas Gleixner, linux-kernel,
	linux-api, Andy Lutomirski, Paul E. McKenney, Boqun Feng,
	Andrew Hunter, maged michael, Avi Kivity, Benjamin Herrenschmidt,
	Paul Mackerras, Michael Ellerman, Dave Watson, H. Peter Anvin,
	Andrea Parri, Russell King, ARM Linux, Greg Hackmann,
	Will Deacon, David Sehr, Linus Torvalds, x86, Alan Stern,
	Alexander Viro, Nicholas Piggin, linuxppc-dev, linux-arch

----- On Feb 5, 2018, at 3:22 PM, Ingo Molnar mingo@kernel.org wrote:

> * Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> 
>>  
>> +config ARCH_HAS_MEMBARRIER_HOOKS
>> +	bool
> 
> Yeah, so I have renamed this to ARCH_HAS_MEMBARRIER_CALLBACKS, and propagated it
> through the rest of the patches. "Callback" is the canonical name, and I also
> cringe every time I see 'hook'.
> 
> Please let me know if there are any big objections against this minor cleanup.

No objection at all,

Thanks!

Mathieu

> 
> Thanks,
> 
> 	Ingo

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


* Re: [PATCH for 4.16 v7 02/11] powerpc: membarrier: Skip memory barrier in switch_mm()
  2018-01-29 20:20 ` [PATCH for 4.16 v7 02/11] powerpc: membarrier: Skip memory barrier in switch_mm() Mathieu Desnoyers
  2018-02-05 20:22   ` Ingo Molnar
@ 2021-06-18 17:13   ` Christophe Leroy
  2021-06-18 17:26     ` Mathieu Desnoyers
  1 sibling, 1 reply; 9+ messages in thread
From: Christophe Leroy @ 2021-06-18 17:13 UTC (permalink / raw)
  To: Mathieu Desnoyers, Ingo Molnar, Peter Zijlstra, Thomas Gleixner
  Cc: Maged Michael, Dave Watson, Will Deacon, Andrew Hunter,
	David Sehr, Paul Mackerras, H . Peter Anvin, linux-arch, x86,
	Russell King, Greg Hackmann, Linus Torvalds, Alan Stern,
	Paul E . McKenney, Andrea Parri, Avi Kivity, Boqun Feng,
	Nicholas Piggin, Alexander Viro, Andy Lutomirski, linux-api,
	linux-kernel, linuxppc-dev



On 29/01/2018 at 21:20, Mathieu Desnoyers wrote:
> Allow PowerPC to skip the full memory barrier in switch_mm(), and
> only issue the barrier when scheduling into a task belonging to a
> process that has registered to use private expedited membarrier
> commands.
> 
> Threads targeting the same VM but belonging to different thread
> groups are a tricky case. It has a few consequences:
> 
> It turns out that we cannot rely on get_nr_threads(p) to count the
> number of threads using a VM. We can use
> (atomic_read(&mm->mm_users) == 1 && get_nr_threads(p) == 1)
> instead to skip the synchronize_sched() for cases where the VM only has
> a single user, and that user only has a single thread.
> 
> It also turns out that we cannot use for_each_thread() to set
> thread flags in all threads using a VM, as it only iterates on the
> thread group.
> 
> Therefore, test the membarrier state variable directly rather than
> relying on thread flags. This means
> membarrier_register_private_expedited() needs to set the
> MEMBARRIER_STATE_PRIVATE_EXPEDITED flag, issue synchronize_sched(), and
> only then set MEMBARRIER_STATE_PRIVATE_EXPEDITED_READY which allows
> private expedited membarrier commands to succeed.
> membarrier_arch_switch_mm() now tests for the
> MEMBARRIER_STATE_PRIVATE_EXPEDITED flag.

Looking at switch_mm_irqs_off(), I found it more complex than expected, and it turns out this patch is 
the reason for that complexity.

Before the patch (ie in kernel 4.14), we have:

00000000 <switch_mm_irqs_off>:
    0:	81 24 01 c8 	lwz     r9,456(r4)
    4:	71 29 00 01 	andi.   r9,r9,1
    8:	40 82 00 1c 	bne     24 <switch_mm_irqs_off+0x24>
    c:	39 24 01 c8 	addi    r9,r4,456
   10:	39 40 00 01 	li      r10,1
   14:	7d 00 48 28 	lwarx   r8,0,r9
   18:	7d 08 53 78 	or      r8,r8,r10
   1c:	7d 00 49 2d 	stwcx.  r8,0,r9
   20:	40 c2 ff f4 	bne-    14 <switch_mm_irqs_off+0x14>
   24:	7c 04 18 40 	cmplw   r4,r3
   28:	81 24 00 24 	lwz     r9,36(r4)
   2c:	91 25 04 4c 	stw     r9,1100(r5)
   30:	4d 82 00 20 	beqlr
   34:	48 00 00 00 	b       34 <switch_mm_irqs_off+0x34>
			34: R_PPC_REL24	switch_mmu_context


After the patch (ie in 5.13-rc6), that now is:

00000000 <switch_mm_irqs_off>:
    0:	81 24 02 18 	lwz     r9,536(r4)
    4:	71 29 00 01 	andi.   r9,r9,1
    8:	41 82 00 24 	beq     2c <switch_mm_irqs_off+0x2c>
    c:	7c 04 18 40 	cmplw   r4,r3
   10:	81 24 00 24 	lwz     r9,36(r4)
   14:	91 25 04 d0 	stw     r9,1232(r5)
   18:	4d 82 00 20 	beqlr
   1c:	81 24 00 28 	lwz     r9,40(r4)
   20:	71 29 00 0a 	andi.   r9,r9,10
   24:	40 82 00 34 	bne     58 <switch_mm_irqs_off+0x58>
   28:	48 00 00 00 	b       28 <switch_mm_irqs_off+0x28>
			28: R_PPC_REL24	switch_mmu_context
   2c:	39 24 02 18 	addi    r9,r4,536
   30:	39 40 00 01 	li      r10,1
   34:	7d 00 48 28 	lwarx   r8,0,r9
   38:	7d 08 53 78 	or      r8,r8,r10
   3c:	7d 00 49 2d 	stwcx.  r8,0,r9
   40:	40 a2 ff f4 	bne     34 <switch_mm_irqs_off+0x34>
   44:	7c 04 18 40 	cmplw   r4,r3
   48:	81 24 00 24 	lwz     r9,36(r4)
   4c:	91 25 04 d0 	stw     r9,1232(r5)
   50:	4d 82 00 20 	beqlr
   54:	48 00 00 00 	b       54 <switch_mm_irqs_off+0x54>
			54: R_PPC_REL24	switch_mmu_context
   58:	2c 03 00 00 	cmpwi   r3,0
   5c:	41 82 ff cc 	beq     28 <switch_mm_irqs_off+0x28>
   60:	48 00 00 00 	b       60 <switch_mm_irqs_off+0x60>
			60: R_PPC_REL24	switch_mmu_context


In particular, the comparison of 'prev' to 0 is pointless, as both cases end up just branching to 
'switch_mmu_context'.

I don't understand all that complexity to just replace a simple 'smp_mb__after_unlock_lock()'.

#define smp_mb__after_unlock_lock()	smp_mb()
#define smp_mb()	barrier()
# define barrier() __asm__ __volatile__("": : :"memory")


Am I missing some subtlety?

Thanks
Christophe


* Re: [PATCH for 4.16 v7 02/11] powerpc: membarrier: Skip memory barrier in switch_mm()
  2021-06-18 17:13   ` Christophe Leroy
@ 2021-06-18 17:26     ` Mathieu Desnoyers
  2021-06-19  9:35       ` Christophe Leroy
  0 siblings, 1 reply; 9+ messages in thread
From: Mathieu Desnoyers @ 2021-06-18 17:26 UTC (permalink / raw)
  To: Christophe Leroy
  Cc: maged michael, Peter Zijlstra, Dave Watson, Will Deacon,
	Andrew Hunter, David Sehr, Paul Mackerras, H. Peter Anvin,
	linux-arch, x86, Russell King, ARM Linux, Greg Hackmann,
	Linus Torvalds, Ingo Molnar, Alan Stern, Paul, Andrea Parri,
	Avi Kivity, Boqun Feng, Nicholas Piggin, Alexander Viro,
	Andy Lutomirski, Thomas Gleixner, linux-api, linux-kernel,
	linuxppc-dev

----- On Jun 18, 2021, at 1:13 PM, Christophe Leroy christophe.leroy@csgroup.eu wrote:
[...]
> 
> I don't understand all that complexity to just replace a simple
> 'smp_mb__after_unlock_lock()'.
> 
> #define smp_mb__after_unlock_lock()	smp_mb()
> #define smp_mb()	barrier()
> # define barrier() __asm__ __volatile__("": : :"memory")
> 
> 
> Am I missing some subtlety?

On powerpc CONFIG_SMP, smp_mb() is actually defined as:

#define smp_mb()        __smp_mb()
#define __smp_mb()      mb()
#define mb()   __asm__ __volatile__ ("sync" : : : "memory")

So the original motivation here was to skip a "sync" instruction whenever
switching between threads which are part of the same process. But based on
recent discussions, I suspect my implementation may be doing so
inaccurately.
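
Concretely, the arch hook added by the patch boils down to this sketch
(paraphrasing membarrier_arch_switch_mm() above, not new code):

	if (atomic_read(&next->membarrier_state) &
	    MEMBARRIER_STATE_PRIVATE_EXPEDITED)
		smp_mb();	/* expands to "sync" on powerpc SMP */

so the "sync" is only paid when the incoming mm has registered for
private expedited membarrier.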

Thanks,

Mathieu


> 
> Thanks
> Christophe

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


* Re: [PATCH for 4.16 v7 02/11] powerpc: membarrier: Skip memory barrier in switch_mm()
  2021-06-18 17:26     ` Mathieu Desnoyers
@ 2021-06-19  9:35       ` Christophe Leroy
  2021-06-19 15:02         ` Segher Boessenkool
  0 siblings, 1 reply; 9+ messages in thread
From: Christophe Leroy @ 2021-06-19  9:35 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: maged michael, Peter Zijlstra, Dave Watson, Will Deacon,
	Andrew Hunter, David Sehr, Paul Mackerras, H. Peter Anvin,
	linux-arch, x86, Russell King, ARM Linux, Greg Hackmann,
	Linus Torvalds, Ingo Molnar, Alan Stern, Paul, Andrea Parri,
	Avi Kivity, Boqun Feng, Nicholas Piggin, Alexander Viro,
	Andy Lutomirski, Thomas Gleixner, linux-api, linux-kernel,
	linuxppc-dev



On 18/06/2021 at 19:26, Mathieu Desnoyers wrote:
> ----- On Jun 18, 2021, at 1:13 PM, Christophe Leroy christophe.leroy@csgroup.eu wrote:
> [...]
>>
>> I don't understand all that complexity to just replace a simple
>> 'smp_mb__after_unlock_lock()'.
>>
>> #define smp_mb__after_unlock_lock()	smp_mb()
>> #define smp_mb()	barrier()
>> # define barrier() __asm__ __volatile__("": : :"memory")
>>
>>
>> Am I missing some subtlety?
> 
> On powerpc CONFIG_SMP, smp_mb() is actually defined as:
> 
> #define smp_mb()        __smp_mb()
> #define __smp_mb()      mb()
> #define mb()   __asm__ __volatile__ ("sync" : : : "memory")
> 
> So the original motivation here was to skip a "sync" instruction whenever
> switching between threads which are part of the same process. But based on
> recent discussions, I suspect my implementation may be doing so
> inaccurately.
> 

I see.

Then, if you think a 'sync' is a concern, shouldn't we try and remove the forest of 'sync' in the 
I/O accessors?

I can't really understand why we need all those 'sync' and 'isync' and 'twi' around the accesses 
whereas I/O memory is usually mapped as 'Guarded', so memory access ordering is already guaranteed.

I'm sure we'll save a lot with that.

Christophe


* Re: [PATCH for 4.16 v7 02/11] powerpc: membarrier: Skip memory barrier in switch_mm()
  2021-06-19  9:35       ` Christophe Leroy
@ 2021-06-19 15:02         ` Segher Boessenkool
  2021-06-21 14:11           ` Christophe Leroy
  0 siblings, 1 reply; 9+ messages in thread
From: Segher Boessenkool @ 2021-06-19 15:02 UTC (permalink / raw)
  To: Christophe Leroy
  Cc: maged michael, Peter Zijlstra, Dave Watson, Will Deacon,
	Russell King, ARM Linux, David Sehr, Paul Mackerras,
	H. Peter Anvin, linux-arch, x86, Andrew Hunter, Greg Hackmann,
	Ingo Molnar, Alan Stern, Paul, Andrea Parri, Avi Kivity,
	Boqun Feng, linuxppc-dev, Nicholas Piggin, Mathieu Desnoyers,
	Alexander Viro, Andy Lutomirski, Thomas Gleixner, linux-api,
	linux-kernel, Linus Torvalds

On Sat, Jun 19, 2021 at 11:35:34AM +0200, Christophe Leroy wrote:
> 
> 
> On 18/06/2021 at 19:26, Mathieu Desnoyers wrote:
> >----- On Jun 18, 2021, at 1:13 PM, Christophe Leroy 
> >christophe.leroy@csgroup.eu wrote:
> >[...]
> >>
> >>I don't understand all that complexity to just replace a simple
> >>'smp_mb__after_unlock_lock()'.
> >>
> >>#define smp_mb__after_unlock_lock()	smp_mb()
> >>#define smp_mb()	barrier()
> >># define barrier() __asm__ __volatile__("": : :"memory")
> >>
> >>
> >>Am I missing some subtlety?
> >
> >On powerpc CONFIG_SMP, smp_mb() is actually defined as:
> >
> >#define smp_mb()        __smp_mb()
> >#define __smp_mb()      mb()
> >#define mb()   __asm__ __volatile__ ("sync" : : : "memory")
> >
> >So the original motivation here was to skip a "sync" instruction whenever
> >switching between threads which are part of the same process. But based on
> >recent discussions, I suspect my implementation may be doing so
> >inaccurately.
> >
> 
> I see.
> 
> Then, if you think a 'sync' is a concern, shouldn't we try and remove the 
> forest of 'sync' in the I/O accessors?
> 
> I can't really understand why we need all those 'sync' and 'isync' and 
> 'twi' around the accesses whereas I/O memory is usually mapped as 'Guarded', 
> so memory access ordering is already guaranteed.
> 
> I'm sure we'll save a lot with that.

The point of the twi in the I/O accessors was to make things easier to
debug if the accesses fail: for the twi insn to complete the load will
have to have completed as well.  On a correctly working system you never
should need this (until something fails ;-) )

Without the twi you might need to enforce ordering in some cases still.
The twi is a very heavy hammer, but some of what it gives us is no
doubt actually needed.


Segher


* Re: [PATCH for 4.16 v7 02/11] powerpc: membarrier: Skip memory barrier in switch_mm()
  2021-06-19 15:02         ` Segher Boessenkool
@ 2021-06-21 14:11           ` Christophe Leroy
  2021-06-22  0:15             ` Segher Boessenkool
  0 siblings, 1 reply; 9+ messages in thread
From: Christophe Leroy @ 2021-06-21 14:11 UTC (permalink / raw)
  To: Segher Boessenkool
  Cc: maged michael, Peter Zijlstra, Dave Watson, Will Deacon,
	Russell King, ARM Linux, David Sehr, Paul Mackerras,
	H. Peter Anvin, linux-arch, x86, Andrew Hunter, Greg Hackmann,
	Ingo Molnar, Alan Stern, Paul, Andrea Parri, Avi Kivity,
	Boqun Feng, linuxppc-dev, Nicholas Piggin, Mathieu Desnoyers,
	Alexander Viro, Andy Lutomirski, Thomas Gleixner, linux-api,
	linux-kernel, Linus Torvalds



On 19/06/2021 at 17:02, Segher Boessenkool wrote:
> On Sat, Jun 19, 2021 at 11:35:34AM +0200, Christophe Leroy wrote:
>>
>>
>> On 18/06/2021 at 19:26, Mathieu Desnoyers wrote:
>>> ----- On Jun 18, 2021, at 1:13 PM, Christophe Leroy
>>> christophe.leroy@csgroup.eu wrote:
>>> [...]
>>>>
>>>> I don't understand all that complexity to just replace a simple
>>>> 'smp_mb__after_unlock_lock()'.
>>>>
>>>> #define smp_mb__after_unlock_lock()	smp_mb()
>>>> #define smp_mb()	barrier()
>>>> # define barrier() __asm__ __volatile__("": : :"memory")
>>>>
>>>>
>>>> Am I missing some subtlety?
>>>
>>> On powerpc CONFIG_SMP, smp_mb() is actually defined as:
>>>
>>> #define smp_mb()        __smp_mb()
>>> #define __smp_mb()      mb()
>>> #define mb()   __asm__ __volatile__ ("sync" : : : "memory")
>>>
>>> So the original motivation here was to skip a "sync" instruction whenever
>>> switching between threads which are part of the same process. But based on
>>> recent discussions, I suspect my implementation may be doing so
>>> inaccurately.
>>>
>>
>> I see.
>>
>> Then, if you think a 'sync' is a concern, shouldn't we try and remove the
>> forest of 'sync' in the I/O accessors?
>>
>> I can't really understand why we need all those 'sync' and 'isync' and
>> 'twi' around the accesses whereas I/O memory is usually mapped as 'Guarded',
>> so memory access ordering is already guaranteed.
>>
>> I'm sure we'll save a lot with that.
> 
> The point of the twi in the I/O accessors was to make things easier to
> debug if the accesses fail: for the twi insn to complete the load will
> have to have completed as well.  On a correctly working system you never
> should need this (until something fails ;-) )
> 
> Without the twi you might need to enforce ordering in some cases still.
> The twi is a very heavy hammer, but some of what it gives us is no
> doubt actually needed.
> 

Well, I've always been quite perplexed about that. According to the documentation of the 8xx, if a bus 
error or something happens on an I/O access, the exception will be accounted on the instruction 
which does the access. But based on the following function, I understand that some versions of 
powerpc generate the trap on the instruction which was being executed at the time the I/O access 
failed, not the instruction that does the access itself?

/*
  * I/O accesses can cause machine checks on powermacs.
  * Check if the NIP corresponds to the address of a sync
  * instruction for which there is an entry in the exception
  * table.
  *  -- paulus.
  */
static inline int check_io_access(struct pt_regs *regs)
{
#ifdef CONFIG_PPC32
	unsigned long msr = regs->msr;
	const struct exception_table_entry *entry;
	unsigned int *nip = (unsigned int *)regs->nip;

	if (((msr & 0xffff0000) == 0 || (msr & (0x80000 | 0x40000)))
	    && (entry = search_exception_tables(regs->nip)) != NULL) {
		/*
		 * Check that it's a sync instruction, or somewhere
		 * in the twi; isync; nop sequence that inb/inw/inl uses.
		 * As the address is in the exception table
		 * we should be able to read the instr there.
		 * For the debug message, we look at the preceding
		 * load or store.
		 */
		if (*nip == PPC_INST_NOP)
			nip -= 2;
		else if (*nip == PPC_INST_ISYNC)
			--nip;
		if (*nip == PPC_INST_SYNC || (*nip >> 26) == OP_TRAP) {
			unsigned int rb;

			--nip;
			rb = (*nip >> 11) & 0x1f;
			printk(KERN_DEBUG "%s bad port %lx at %p\n",
			       (*nip & 0x100)? "OUT to": "IN from",
			       regs->gpr[rb] - _IO_BASE, nip);
			regs->msr |= MSR_RI;
			regs->nip = extable_fixup(entry);
			return 1;
		}
	}
#endif /* CONFIG_PPC32 */
	return 0;
}

Am I right?

It is not only the twi which bothers me in the I/O accessors but also the sync/isync and stuff.

A write typically is

	sync
	stw

A read is

	sync
	lwz
	twi
	isync

Taking into account that HW ordering is guaranteed by the fact that __iomem is guarded, isn't the 
'memory' clobber enough as a barrier?

Thanks
Christophe


* Re: [PATCH for 4.16 v7 02/11] powerpc: membarrier: Skip memory barrier in switch_mm()
  2021-06-21 14:11           ` Christophe Leroy
@ 2021-06-22  0:15             ` Segher Boessenkool
  0 siblings, 0 replies; 9+ messages in thread
From: Segher Boessenkool @ 2021-06-22  0:15 UTC (permalink / raw)
  To: Christophe Leroy
  Cc: maged michael, Peter Zijlstra, Dave Watson, Will Deacon,
	Russell King, ARM Linux, David Sehr, Paul Mackerras,
	H. Peter Anvin, linux-arch, x86, Andrew Hunter, Greg Hackmann,
	Ingo Molnar, Alan Stern, Paul, Andrea Parri, Avi Kivity,
	Boqun Feng, linuxppc-dev, Nicholas Piggin, Mathieu Desnoyers,
	Alexander Viro, Andy Lutomirski, Thomas Gleixner, linux-api,
	linux-kernel, Linus Torvalds

Hi!

On Mon, Jun 21, 2021 at 04:11:04PM +0200, Christophe Leroy wrote:
> On 19/06/2021 at 17:02, Segher Boessenkool wrote:
> >The point of the twi in the I/O accessors was to make things easier to
> >debug if the accesses fail: for the twi insn to complete the load will
> >have to have completed as well.  On a correctly working system you never
> >should need this (until something fails ;-) )
> >
> >Without the twi you might need to enforce ordering in some cases still.
> >The twi is a very heavy hammer, but some of what it gives us is no
> >doubt actually needed.
> 
> Well, I've always been quite perplexed about that. According to the 
> documentation of the 8xx, if a bus error or something happens on an I/O 
> access, the exception will be accounted on the instruction which does the 
> access. But based on the following function, I understand that some versions 
> of powerpc generate the trap on the instruction which was being executed 
> at the time the I/O access failed, not the instruction that does the access 
> itself?

Trap instructions are never speculated (this may not be architectural,
but it is true on all existing implementations).  So the instructions
after the twi;isync will not execute until the twi itself has finished,
and that cannot happen before the preceding load has (because it uses
the loaded register).

Now, some I/O accesses can cause machine checks.  Machine checks are
asynchronous and can be hard to correlate to specific load insns, and
worse, you may not even have the address loaded from in architected
registers anymore.  Since I/O accesses often take *long*, tens or even
hundreds of cycles is not unusual, this can be a challenge.

To recover from machine checks you typically need special debug hardware
and/or software.  For the Apple machines those are not so easy to come
by.  This "twi after loads" thing made it pretty easy to figure out
where your code was going wrong.

And it isn't as slow as it may sound: typically you really need to have
the result of the load before you can go on to do useful work anyway, and
loads from I/O are slow non-posted things.
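
As a simplified sketch of the accessor pattern under discussion (the
kernel's real accessors in arch/powerpc/include/asm/io.h are
macro-generated; mmio_read32 is a made-up name):

static inline unsigned int mmio_read32(const volatile unsigned int *addr)
{
	unsigned int val;

	__asm__ __volatile__(
		"sync\n\t"		/* order vs. preceding accesses */
		"lwz %0,0(%1)\n\t"	/* the actual MMIO load */
		"twi 0,%0,0\n\t"	/* never-taken trap on the loaded
					   value: cannot complete before
					   the load data has returned */
		"isync"			/* no later insn starts early */
		: "=r" (val) : "b" (addr) : "memory");
	return val;
}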

> /*
>  * I/O accesses can cause machine checks on powermacs.
>  * Check if the NIP corresponds to the address of a sync
>  * instruction for which there is an entry in the exception
>  * table.
>  *  -- paulus.
>  */

I suspect this is from before the twi thing was added?

> It is not only the twi which bothers me in the I/O accessors but also the 
> sync/isync and stuff.
> 
> A write typically is
> 
> 	sync
> 	stw
> 
> A read is
> 
> 	sync
> 	lwz
> 	twi
> 	isync
> 
> Taking into account that HW ordering is guaranteed by the fact that __iomem 
> is guarded,

Yes.  But machine checks are asynchronous :-)

> isn't the 'memory' clobber enough as a barrier?

A "memory" clobber isn't a barrier of any kind.  "Compiler barriers" do
not exist.

The only thing such a clobber does is it tells the compiler that this
inline asm can access some memory, and we do not say at what address.
So the compiler cannot reorder this asm with other memory accesses.  It
has no other effects, no magical effects, and it is not comparable to
actual barrier instructions (that actually tell the hardware that a
certain ordering is required).

"Compiler barrier" is a harmful misnomer: language shapes thoughts,
using misleading names causes misguided thoughts.
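
To make the distinction concrete, a small sketch (x and y are
hypothetical shared variables):

	WRITE_ONCE(x, 1);
	barrier();	/* the compiler keeps program order here ... */
	WRITE_ONCE(y, 1);	/* ... but the CPU may still make the two
				   stores visible out of order */

	WRITE_ONCE(x, 1);
	smp_mb();	/* "sync": now the hardware must order them too */
	WRITE_ONCE(y, 1);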

Anyway :-)

The isync is simply to make sure the code after it does not start before
the code before it has completed.  As for the sync before it, I am not sure.


Segher

