* [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier
@ 2010-01-07  4:40 Mathieu Desnoyers
  2010-01-07  5:02 ` Paul E. McKenney
                   ` (5 more replies)
  0 siblings, 6 replies; 107+ messages in thread
From: Mathieu Desnoyers @ 2010-01-07  4:40 UTC (permalink / raw)
  To: linux-kernel
  Cc: Paul E. McKenney, Ingo Molnar, akpm, josh, tglx, peterz, rostedt,
	Valdis.Kletnieks, dhowells, laijs, dipankar

Here is an implementation of a new system call, sys_membarrier(), which
executes a memory barrier on all threads of the current process.

It aims to greatly simplify and enhance the current signal-based
liburcu userspace RCU synchronize_rcu() implementation
(found at http://lttng.org/urcu).

Both the signal-based and the sys_membarrier userspace RCU schemes
permit us to remove the memory barrier from the userspace RCU
rcu_read_lock() and rcu_read_unlock() primitives, thus significantly
accelerating them. These memory barriers are replaced by compiler
barriers on the read-side, and all matching memory barriers on the
write-side are turned into an invocation of a memory barrier on all
active threads in the process. By letting the kernel perform this
synchronization rather than dumbly sending a signal to every process
thread (as we currently do), we reduce the number of unnecessary
wakeups and only issue the memory barriers on active threads.
Non-running threads do not need to execute such a barrier anyway,
because it is implied by the scheduler context switches.
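
For illustration, here is a rough sketch of what the read-side and
write-side primitives look like once the smp_mb() calls are replaced as
described above. This is not the actual liburcu code: the per-thread
reader_active flag and the grace-period detection are greatly simplified
placeholders, and the syscall number is the x86_64 one proposed by this
patch.

#define _GNU_SOURCE
#include <unistd.h>
#include <sys/syscall.h>

#define __NR_membarrier	299	/* x86_64 number proposed below */

/* Compiler barrier only: no fence instruction is emitted. */
#define barrier()	__asm__ __volatile__("" : : : "memory")

static inline void sys_membarrier(void)
{
	syscall(__NR_membarrier);
}

/* Greatly simplified per-thread reader state; real liburcu uses a
 * per-thread counter nested inside a global grace-period counter. */
static __thread volatile int reader_active;

static inline void rcu_read_lock(void)
{
	reader_active = 1;
	barrier();		/* was smp_mb() before this change */
}

static inline void rcu_read_unlock(void)
{
	barrier();		/* was smp_mb() before this change */
	reader_active = 0;
}

/* Write-side: each smp_mb() becomes a process-wide barrier. */
static void synchronize_rcu(void)
{
	sys_membarrier();	/* was smp_mb() */
	/* ... wait until no thread has reader_active set ... */
	sys_membarrier();	/* was smp_mb() */
}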

To explain the benefit of this scheme, let's introduce two example threads:

Thread A (non-frequent, e.g. executing liburcu synchronize_rcu())
Thread B (frequent, e.g. executing liburcu rcu_read_lock()/rcu_read_unlock())

In a scheme where all smp_mb() in thread A's synchronize_rcu() order
memory accesses with respect to the smp_mb() present in
rcu_read_lock/unlock(), we can change every smp_mb() in
synchronize_rcu() into a call to sys_membarrier() and every smp_mb() in
rcu_read_lock/unlock() into a compiler barrier, barrier().

Before the change, we had, for each smp_mb() pair:

Thread A                    Thread B
prev mem accesses           prev mem accesses
smp_mb()                    smp_mb()
follow mem accesses         follow mem accesses

After the change, these pairs become:

Thread A                    Thread B
prev mem accesses           prev mem accesses
sys_membarrier()            barrier()
follow mem accesses         follow mem accesses

As we can see, there are two possible scenarios: either Thread B memory
accesses do not happen concurrently with Thread A accesses (1), or they
do (2).

1) Non-concurrent Thread A vs Thread B accesses:

Thread A                    Thread B
prev mem accesses
sys_membarrier()
follow mem accesses
                            prev mem accesses
                            barrier()
                            follow mem accesses

In this case, thread B accesses will be weakly ordered. This is OK,
because at that point, thread A is not particularly interested in
ordering them with respect to its own accesses.

2) Concurrent Thread A vs Thread B accesses

Thread A                    Thread B
prev mem accesses           prev mem accesses
sys_membarrier()            barrier()
follow mem accesses         follow mem accesses

In this case, thread B accesses, which are ensured to be in program
order thanks to the compiler barrier, will be "upgraded" to full
smp_mb() semantics thanks to the IPIs executing memory barriers on each
active CPU. Non-running process threads are intrinsically serialized by
the scheduler.
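
To make the pairing concrete, here is a small message-passing example.
It is a sketch for illustration only: it assumes the sys_membarrier() /
barrier() pairing provides the smp_mb()-like ordering argued above, and
it invokes the system call directly with the x86_64 number from this
patch.

#define _GNU_SOURCE
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>

#define __NR_membarrier	299	/* x86_64 number from this patch */
#define barrier()	__asm__ __volatile__("" : : : "memory")

static int data;
static volatile int ready;

/* Thread A: infrequent path, pays for the process-wide barrier. */
static void *writer(void *arg)
{
	data = 42;
	syscall(__NR_membarrier);	/* was smp_mb() */
	ready = 1;
	return NULL;
}

/* Thread B: frequent path, pays only a compiler barrier. */
static void *reader(void *arg)
{
	while (!ready)
		;			/* spin until the flag is visible */
	barrier();			/* was smp_mb() */
	printf("data = %d\n", data);	/* expected to print 42 */
	return NULL;
}

int main(void)
{
	pthread_t a, b;

	pthread_create(&b, NULL, reader, NULL);
	pthread_create(&a, NULL, writer, NULL);
	pthread_join(a, NULL);
	pthread_join(b, NULL);
	return 0;
}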

The current implementation simply executes a memory barrier in an IPI
handler on each active cpu. Going through the hassle of taking run queue
locks and checking whether the thread running on each online CPU belongs
to the current process seems more heavyweight than the cost of the IPI
itself (not measured, though).

The system call number is only assigned for x86_64 in this RFC patch.
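
Until other architectures are wired up and a libc wrapper exists,
userspace would probe for the system call at run time and keep the
signal-based scheme as a fallback. A minimal sketch (using the x86_64
number assigned below):

#define _GNU_SOURCE
#include <unistd.h>
#include <sys/syscall.h>

#ifndef __NR_membarrier
#define __NR_membarrier	299	/* x86_64 number proposed in this patch */
#endif

/*
 * Returns 1 if sys_membarrier() is usable on the running kernel,
 * 0 if liburcu should keep using the signal-based fallback.
 */
static int membarrier_available(void)
{
	return syscall(__NR_membarrier) == 0;
}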

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
CC: mingo@elte.hu
CC: laijs@cn.fujitsu.com
CC: dipankar@in.ibm.com
CC: akpm@linux-foundation.org
CC: josh@joshtriplett.org
CC: dvhltc@us.ibm.com
CC: niv@us.ibm.com
CC: tglx@linutronix.de
CC: peterz@infradead.org
CC: rostedt@goodmis.org
CC: Valdis.Kletnieks@vt.edu
CC: dhowells@redhat.com
---
 arch/x86/include/asm/unistd_64.h |    2 ++
 kernel/sched.c                   |   30 ++++++++++++++++++++++++++++++
 2 files changed, 32 insertions(+)

Index: linux-2.6-lttng/arch/x86/include/asm/unistd_64.h
===================================================================
--- linux-2.6-lttng.orig/arch/x86/include/asm/unistd_64.h	2010-01-06 22:11:32.000000000 -0500
+++ linux-2.6-lttng/arch/x86/include/asm/unistd_64.h	2010-01-06 22:11:50.000000000 -0500
@@ -661,6 +661,8 @@ __SYSCALL(__NR_pwritev, sys_pwritev)
 __SYSCALL(__NR_rt_tgsigqueueinfo, sys_rt_tgsigqueueinfo)
 #define __NR_perf_event_open			298
 __SYSCALL(__NR_perf_event_open, sys_perf_event_open)
+#define __NR_membarrier				299
+__SYSCALL(__NR_membarrier, sys_membarrier)
 
 #ifndef __NO_STUBS
 #define __ARCH_WANT_OLD_READDIR
Index: linux-2.6-lttng/kernel/sched.c
===================================================================
--- linux-2.6-lttng.orig/kernel/sched.c	2010-01-06 22:11:32.000000000 -0500
+++ linux-2.6-lttng/kernel/sched.c	2010-01-06 23:20:42.000000000 -0500
@@ -10822,6 +10822,36 @@ struct cgroup_subsys cpuacct_subsys = {
 };
 #endif	/* CONFIG_CGROUP_CPUACCT */
 
+/*
+ * Execute a memory barrier on all CPUs on SMP systems.
+ * Do not rely on implicit barriers in smp_call_function(), just in case they
+ * are ever relaxed in the future.
+ */
+static void membarrier_ipi(void *unused)
+{
+	smp_mb();
+}
+
+/*
+ * sys_membarrier - issue a memory barrier on the threads of the current process
+ *
+ * Execute a memory barrier on all running threads of the current process.
+ * Upon return, the calling thread is guaranteed that all threads of the
+ * process have passed through a state where memory accesses match program
+ * order. (Non-running threads are de facto in such a state.)
+ *
+ * The current implementation simply executes a memory barrier in an IPI
+ * handler on each active cpu. Going through the hassle of taking run queue
+ * locks and checking whether the thread running on each online CPU belongs
+ * to the current process seems more heavyweight than the cost of the IPI.
+ */
+SYSCALL_DEFINE0(membarrier)
+{
+	on_each_cpu(membarrier_ipi, NULL, 1);
+
+	return 0;
+}
+
 #ifndef CONFIG_SMP
 
 int rcu_expedited_torture_stats(char *page)

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


* Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier
  2010-01-07  4:40 [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier Mathieu Desnoyers
@ 2010-01-07  5:02 ` Paul E. McKenney
  2010-01-07  5:39   ` Mathieu Desnoyers
  2010-01-07  8:32   ` Peter Zijlstra
  2010-01-07  5:28 ` Josh Triplett
                   ` (4 subsequent siblings)
  5 siblings, 2 replies; 107+ messages in thread
From: Paul E. McKenney @ 2010-01-07  5:02 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: linux-kernel, Ingo Molnar, akpm, josh, tglx, peterz, rostedt,
	Valdis.Kletnieks, dhowells, laijs, dipankar

On Wed, Jan 06, 2010 at 11:40:07PM -0500, Mathieu Desnoyers wrote:
> Here is an implementation of a new system call, sys_membarrier(), which
> executes a memory barrier on all threads of the current process.
> 
> It aims at greatly simplifying and enhancing the current signal-based
> liburcu userspace RCU synchronize_rcu() implementation.
> (found at http://lttng.org/urcu)
> 
> Both the signal-based and the sys_membarrier userspace RCU schemes
> permit us to remove the memory barrier from the userspace RCU
> rcu_read_lock() and rcu_read_unlock() primitives, thus significantly
> accelerating them. These memory barriers are replaced by compiler
> barriers on the read-side, and all matching memory barriers on the
> write-side are turned into an invokation of a memory barrier on all
> active threads in the process. By letting the kernel perform this
> synchronization rather than dumbly sending a signal to every process
> threads (as we currently do), we diminish the number of unnecessary wake
> ups and only issue the memory barriers on active threads. Non-running
> threads do not need to execute such barrier anyway, because these are
> implied by the scheduler context switches.
> 
> To explain the benefit of this scheme, let's introduce two example threads:
> 
> Thread A (non-frequent, e.g. executing liburcu synchronize_rcu())
> Thread B (frequent, e.g. executing liburcu rcu_read_lock()/rcu_read_unlock())
> 
> In a scheme where all smp_mb() in thread A synchronize_rcu() are
> ordering memory accesses with respect to smp_mb() present in
> rcu_read_lock/unlock(), we can change all smp_mb() from
> synchronize_rcu() into calls to sys_membarrier() and all smp_mb() from
> rcu_read_lock/unlock() into compiler barriers "barrier()".
> 
> Before the change, we had, for each smp_mb() pairs:
> 
> Thread A                    Thread B
> prev mem accesses           prev mem accesses
> smp_mb()                    smp_mb()
> follow mem accesses         follow mem accesses
> 
> After the change, these pairs become:
> 
> Thread A                    Thread B
> prev mem accesses           prev mem accesses
> sys_membarrier()            barrier()
> follow mem accesses         follow mem accesses
> 
> As we can see, there are two possible scenarios: either Thread B memory
> accesses do not happen concurrently with Thread A accesses (1), or they
> do (2).
> 
> 1) Non-concurrent Thread A vs Thread B accesses:
> 
> Thread A                    Thread B
> prev mem accesses
> sys_membarrier()
> follow mem accesses
>                             prev mem accesses
>                             barrier()
>                             follow mem accesses
> 
> In this case, thread B accesses will be weakly ordered. This is OK,
> because at that point, thread A is not particularly interested in
> ordering them with respect to its own accesses.
> 
> 2) Concurrent Thread A vs Thread B accesses
> 
> Thread A                    Thread B
> prev mem accesses           prev mem accesses
> sys_membarrier()            barrier()
> follow mem accesses         follow mem accesses
> 
> In this case, thread B accesses, which are ensured to be in program
> order thanks to the compiler barrier, will be "upgraded" to full
> smp_mb() thanks to the IPIs executing memory barriers on each active
> system threads. Each non-running process threads are intrinsically
> serialized by the scheduler.
> 
> The current implementation simply executes a memory barrier in an IPI
> handler on each active cpu. Going through the hassle of taking run queue
> locks and checking if the thread running on each online CPU belongs to
> the current thread seems more heavyweight than the cost of the IPI
> itself (not measured though).
> 
> The system call number is only assigned for x86_64 in this RFC patch.

Beats the heck out of user-mode signal handlers!!!  And it is hard
to imagine groveling through runqueues ever being a win, even on very
large systems.  The only reasonable optimization I can imagine is to
turn this into a no-op for a single-threaded process, but there are
other ways to do that optimization.

Reviewed-by: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>

> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
> CC: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> CC: mingo@elte.hu
> CC: laijs@cn.fujitsu.com
> CC: dipankar@in.ibm.com
> CC: akpm@linux-foundation.org
> CC: josh@joshtriplett.org
> CC: dvhltc@us.ibm.com
> CC: niv@us.ibm.com
> CC: tglx@linutronix.de
> CC: peterz@infradead.org
> CC: rostedt@goodmis.org
> CC: Valdis.Kletnieks@vt.edu
> CC: dhowells@redhat.com
> ---
>  arch/x86/include/asm/unistd_64.h |    2 ++
>  kernel/sched.c                   |   30 ++++++++++++++++++++++++++++++
>  2 files changed, 32 insertions(+)
> 
> Index: linux-2.6-lttng/arch/x86/include/asm/unistd_64.h
> ===================================================================
> --- linux-2.6-lttng.orig/arch/x86/include/asm/unistd_64.h	2010-01-06 22:11:32.000000000 -0500
> +++ linux-2.6-lttng/arch/x86/include/asm/unistd_64.h	2010-01-06 22:11:50.000000000 -0500
> @@ -661,6 +661,8 @@ __SYSCALL(__NR_pwritev, sys_pwritev)
>  __SYSCALL(__NR_rt_tgsigqueueinfo, sys_rt_tgsigqueueinfo)
>  #define __NR_perf_event_open			298
>  __SYSCALL(__NR_perf_event_open, sys_perf_event_open)
> +#define __NR_membarrier				299
> +__SYSCALL(__NR_membarrier, sys_membarrier)
> 
>  #ifndef __NO_STUBS
>  #define __ARCH_WANT_OLD_READDIR
> Index: linux-2.6-lttng/kernel/sched.c
> ===================================================================
> --- linux-2.6-lttng.orig/kernel/sched.c	2010-01-06 22:11:32.000000000 -0500
> +++ linux-2.6-lttng/kernel/sched.c	2010-01-06 23:20:42.000000000 -0500
> @@ -10822,6 +10822,36 @@ struct cgroup_subsys cpuacct_subsys = {
>  };
>  #endif	/* CONFIG_CGROUP_CPUACCT */
> 
> +/*
> + * Execute a memory barrier on all CPUs on SMP systems.
> + * Do not rely on implicit barriers in smp_call_function(), just in case they
> + * are ever relaxed in the future.
> + */
> +static void membarrier_ipi(void *unused)
> +{
> +	smp_mb();
> +}
> +
> +/*
> + * sys_membarrier - issue memory barrier on current process running threads
> + *
> + * Execute a memory barrier on all running threads of the current process.
> + * Upon completion, the caller thread is ensured that all process threads
> + * have passed through a state where memory accesses match program order.
> + * (non-running threads are de facto in such a state)
> + *
> + * The current implementation simply executes a memory barrier in an IPI handler
> + * on each active cpu. Going through the hassle of taking run queue locks and
> + * checking if the thread running on each online CPU belongs to the current
> + * thread seems more heavyweight than the cost of the IPI itself.
> + */
> +SYSCALL_DEFINE0(membarrier)
> +{
> +	on_each_cpu(membarrier_ipi, NULL, 1);
> +
> +	return 0;
> +}
> +
>  #ifndef CONFIG_SMP
> 
>  int rcu_expedited_torture_stats(char *page)
> 
> -- 
> Mathieu Desnoyers
> OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


* Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier
  2010-01-07  4:40 [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier Mathieu Desnoyers
  2010-01-07  5:02 ` Paul E. McKenney
@ 2010-01-07  5:28 ` Josh Triplett
  2010-01-07  6:04   ` Mathieu Desnoyers
  2010-01-07  5:40 ` Steven Rostedt
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 107+ messages in thread
From: Josh Triplett @ 2010-01-07  5:28 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: linux-kernel, Paul E. McKenney, Ingo Molnar, akpm, tglx, peterz,
	rostedt, Valdis.Kletnieks, dhowells, laijs, dipankar

On Wed, Jan 06, 2010 at 11:40:07PM -0500, Mathieu Desnoyers wrote:
> Here is an implementation of a new system call, sys_membarrier(), which
> executes a memory barrier on all threads of the current process.
> 
> It aims at greatly simplifying and enhancing the current signal-based
> liburcu userspace RCU synchronize_rcu() implementation.
> (found at http://lttng.org/urcu)
> 
> Both the signal-based and the sys_membarrier userspace RCU schemes
> permit us to remove the memory barrier from the userspace RCU
> rcu_read_lock() and rcu_read_unlock() primitives, thus significantly
> accelerating them. These memory barriers are replaced by compiler
> barriers on the read-side, and all matching memory barriers on the
> write-side are turned into an invokation of a memory barrier on all
> active threads in the process. By letting the kernel perform this
> synchronization rather than dumbly sending a signal to every process
> threads (as we currently do), we diminish the number of unnecessary wake
> ups and only issue the memory barriers on active threads. Non-running
> threads do not need to execute such barrier anyway, because these are
> implied by the scheduler context switches.
[...]
> The current implementation simply executes a memory barrier in an IPI
> handler on each active cpu. Going through the hassle of taking run queue
> locks and checking if the thread running on each online CPU belongs to
> the current thread seems more heavyweight than the cost of the IPI
> itself (not measured though).

> --- linux-2.6-lttng.orig/kernel/sched.c	2010-01-06 22:11:32.000000000 -0500
> +++ linux-2.6-lttng/kernel/sched.c	2010-01-06 23:20:42.000000000 -0500
> @@ -10822,6 +10822,36 @@ struct cgroup_subsys cpuacct_subsys = {
>  };
>  #endif	/* CONFIG_CGROUP_CPUACCT */
>  
> +/*
> + * Execute a memory barrier on all CPUs on SMP systems.
> + * Do not rely on implicit barriers in smp_call_function(), just in case they
> + * are ever relaxed in the future.
> + */
> +static void membarrier_ipi(void *unused)
> +{
> +	smp_mb();
> +}
> +
> +/*
> + * sys_membarrier - issue memory barrier on current process running threads
> + *
> + * Execute a memory barrier on all running threads of the current process.
> + * Upon completion, the caller thread is ensured that all process threads
> + * have passed through a state where memory accesses match program order.
> + * (non-running threads are de facto in such a state)
> + *
> + * The current implementation simply executes a memory barrier in an IPI handler
> + * on each active cpu. Going through the hassle of taking run queue locks and
> + * checking if the thread running on each online CPU belongs to the current
> + * thread seems more heavyweight than the cost of the IPI itself.
> + */
> +SYSCALL_DEFINE0(membarrier)
> +{
> +	on_each_cpu(membarrier_ipi, NULL, 1);
> +
> +	return 0;
> +}
> +

Nice idea.  A few things come immediately to mind:

- If !CONFIG_SMP, this syscall should become (more of) a no-op.  Ideally
  even if CONFIG_SMP but running with one CPU; see the sketch after this
  list.  (If you really wanted to go nuts, you could make it a vsyscall
  that did nothing with 1 CPU, to avoid the syscall overhead, but that
  seems like entirely too much trouble.)

- Have you tested what happens if a process does "while(1)
  membarrier();"?  By running on every CPU, including those not owned by
  the current process, this has the potential to make DoS easier,
  particularly on systems with many CPUs.  That gets even worse if a
  process forks multiple threads running that same loop.  Also consider
  that executing an IPI will do work even on a CPU currently running a
  real-time task.

- Rather than groveling through runqueues, could you somehow remotely
  check the value of current?  In theory, a race in doing so wouldn't
  matter; finding something other than the current process should mean
  you don't need to do a barrier, and finding the current process means
  you might need to do a barrier.

- Part of me thinks this ought to become slightly more general, and just
  deliver a signal that the receiving thread could handle as it likes.
  However, that would certainly prove more expensive than this, and I
  don't know that the generality would buy anything.

- Could you somehow register reader threads with the kernel, in a way
  that makes them easy to detect remotely?
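
Regarding the first point, a minimal sketch of the fast path (an
illustration of the idea only, not tested; it reuses the membarrier_ipi()
helper from the patch):

SYSCALL_DEFINE0(membarrier)
{
#ifdef CONFIG_SMP
	/*
	 * With a single online CPU there is no other CPU whose memory
	 * accesses need ordering, so the system call becomes a no-op.
	 */
	if (num_online_cpus() > 1)
		on_each_cpu(membarrier_ipi, NULL, 1);
#endif
	return 0;
}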


- Josh Triplett


* Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier
  2010-01-07  5:02 ` Paul E. McKenney
@ 2010-01-07  5:39   ` Mathieu Desnoyers
  2010-01-07  8:32   ` Peter Zijlstra
  1 sibling, 0 replies; 107+ messages in thread
From: Mathieu Desnoyers @ 2010-01-07  5:39 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, Ingo Molnar, akpm, josh, tglx, peterz, rostedt,
	Valdis.Kletnieks, dhowells, laijs, dipankar

* Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> On Wed, Jan 06, 2010 at 11:40:07PM -0500, Mathieu Desnoyers wrote:
> > Here is an implementation of a new system call, sys_membarrier(), which
> > executes a memory barrier on all threads of the current process.
> > 
> > It aims at greatly simplifying and enhancing the current signal-based
> > liburcu userspace RCU synchronize_rcu() implementation.
> > (found at http://lttng.org/urcu)
> > 
> > Both the signal-based and the sys_membarrier userspace RCU schemes
> > permit us to remove the memory barrier from the userspace RCU
> > rcu_read_lock() and rcu_read_unlock() primitives, thus significantly
> > accelerating them. These memory barriers are replaced by compiler
> > barriers on the read-side, and all matching memory barriers on the
> > write-side are turned into an invokation of a memory barrier on all
> > active threads in the process. By letting the kernel perform this
> > synchronization rather than dumbly sending a signal to every process
> > threads (as we currently do), we diminish the number of unnecessary wake
> > ups and only issue the memory barriers on active threads. Non-running
> > threads do not need to execute such barrier anyway, because these are
> > implied by the scheduler context switches.
> > 
> > To explain the benefit of this scheme, let's introduce two example threads:
> > 
> > Thread A (non-frequent, e.g. executing liburcu synchronize_rcu())
> > Thread B (frequent, e.g. executing liburcu rcu_read_lock()/rcu_read_unlock())
> > 
> > In a scheme where all smp_mb() in thread A synchronize_rcu() are
> > ordering memory accesses with respect to smp_mb() present in
> > rcu_read_lock/unlock(), we can change all smp_mb() from
> > synchronize_rcu() into calls to sys_membarrier() and all smp_mb() from
> > rcu_read_lock/unlock() into compiler barriers "barrier()".
> > 
> > Before the change, we had, for each smp_mb() pairs:
> > 
> > Thread A                    Thread B
> > prev mem accesses           prev mem accesses
> > smp_mb()                    smp_mb()
> > follow mem accesses         follow mem accesses
> > 
> > After the change, these pairs become:
> > 
> > Thread A                    Thread B
> > prev mem accesses           prev mem accesses
> > sys_membarrier()            barrier()
> > follow mem accesses         follow mem accesses
> > 
> > As we can see, there are two possible scenarios: either Thread B memory
> > accesses do not happen concurrently with Thread A accesses (1), or they
> > do (2).
> > 
> > 1) Non-concurrent Thread A vs Thread B accesses:
> > 
> > Thread A                    Thread B
> > prev mem accesses
> > sys_membarrier()
> > follow mem accesses
> >                             prev mem accesses
> >                             barrier()
> >                             follow mem accesses
> > 
> > In this case, thread B accesses will be weakly ordered. This is OK,
> > because at that point, thread A is not particularly interested in
> > ordering them with respect to its own accesses.
> > 
> > 2) Concurrent Thread A vs Thread B accesses
> > 
> > Thread A                    Thread B
> > prev mem accesses           prev mem accesses
> > sys_membarrier()            barrier()
> > follow mem accesses         follow mem accesses
> > 
> > In this case, thread B accesses, which are ensured to be in program
> > order thanks to the compiler barrier, will be "upgraded" to full
> > smp_mb() thanks to the IPIs executing memory barriers on each active
> > system threads. Each non-running process threads are intrinsically
> > serialized by the scheduler.
> > 
> > The current implementation simply executes a memory barrier in an IPI
> > handler on each active cpu. Going through the hassle of taking run queue
> > locks and checking if the thread running on each online CPU belongs to
> > the current thread seems more heavyweight than the cost of the IPI
> > itself (not measured though).
> > 
> > The system call number is only assigned for x86_64 in this RFC patch.
> 
> Beats the heck out of user-mode signal handlers!!!  And it is hard
> to imagine groveling through runqueues ever being a win, even on very
> large systems.  The only reasonable optimization I can imagine is to
> turn this into a no-op for a single-threaded process, but there are
> other ways to do that optimization.
> 

I'll cook something using thread_group_empty(current) for the next
version.
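
Something along these lines, presumably (just a sketch of the idea,
reusing the membarrier_ipi() helper from the patch):

SYSCALL_DEFINE0(membarrier)
{
	/*
	 * Single-threaded process: there is no other thread whose
	 * memory accesses could need ordering, so skip the IPIs.
	 */
	if (thread_group_empty(current))
		return 0;

	on_each_cpu(membarrier_ipi, NULL, 1);
	return 0;
}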

Thanks !

Mathieu

> Reviewed-by: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> 
> > Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
> > CC: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> > CC: mingo@elte.hu
> > CC: laijs@cn.fujitsu.com
> > CC: dipankar@in.ibm.com
> > CC: akpm@linux-foundation.org
> > CC: josh@joshtriplett.org
> > CC: dvhltc@us.ibm.com
> > CC: niv@us.ibm.com
> > CC: tglx@linutronix.de
> > CC: peterz@infradead.org
> > CC: rostedt@goodmis.org
> > CC: Valdis.Kletnieks@vt.edu
> > CC: dhowells@redhat.com
> > ---
> >  arch/x86/include/asm/unistd_64.h |    2 ++
> >  kernel/sched.c                   |   30 ++++++++++++++++++++++++++++++
> >  2 files changed, 32 insertions(+)
> > 
> > Index: linux-2.6-lttng/arch/x86/include/asm/unistd_64.h
> > ===================================================================
> > --- linux-2.6-lttng.orig/arch/x86/include/asm/unistd_64.h	2010-01-06 22:11:32.000000000 -0500
> > +++ linux-2.6-lttng/arch/x86/include/asm/unistd_64.h	2010-01-06 22:11:50.000000000 -0500
> > @@ -661,6 +661,8 @@ __SYSCALL(__NR_pwritev, sys_pwritev)
> >  __SYSCALL(__NR_rt_tgsigqueueinfo, sys_rt_tgsigqueueinfo)
> >  #define __NR_perf_event_open			298
> >  __SYSCALL(__NR_perf_event_open, sys_perf_event_open)
> > +#define __NR_membarrier				299
> > +__SYSCALL(__NR_membarrier, sys_membarrier)
> > 
> >  #ifndef __NO_STUBS
> >  #define __ARCH_WANT_OLD_READDIR
> > Index: linux-2.6-lttng/kernel/sched.c
> > ===================================================================
> > --- linux-2.6-lttng.orig/kernel/sched.c	2010-01-06 22:11:32.000000000 -0500
> > +++ linux-2.6-lttng/kernel/sched.c	2010-01-06 23:20:42.000000000 -0500
> > @@ -10822,6 +10822,36 @@ struct cgroup_subsys cpuacct_subsys = {
> >  };
> >  #endif	/* CONFIG_CGROUP_CPUACCT */
> > 
> > +/*
> > + * Execute a memory barrier on all CPUs on SMP systems.
> > + * Do not rely on implicit barriers in smp_call_function(), just in case they
> > + * are ever relaxed in the future.
> > + */
> > +static void membarrier_ipi(void *unused)
> > +{
> > +	smp_mb();
> > +}
> > +
> > +/*
> > + * sys_membarrier - issue memory barrier on current process running threads
> > + *
> > + * Execute a memory barrier on all running threads of the current process.
> > + * Upon completion, the caller thread is ensured that all process threads
> > + * have passed through a state where memory accesses match program order.
> > + * (non-running threads are de facto in such a state)
> > + *
> > + * The current implementation simply executes a memory barrier in an IPI handler
> > + * on each active cpu. Going through the hassle of taking run queue locks and
> > + * checking if the thread running on each online CPU belongs to the current
> > + * thread seems more heavyweight than the cost of the IPI itself.
> > + */
> > +SYSCALL_DEFINE0(membarrier)
> > +{
> > +	on_each_cpu(membarrier_ipi, NULL, 1);
> > +
> > +	return 0;
> > +}
> > +
> >  #ifndef CONFIG_SMP
> > 
> >  int rcu_expedited_torture_stats(char *page)
> > 
> > -- 
> > Mathieu Desnoyers
> > OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


* Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier
  2010-01-07  4:40 [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier Mathieu Desnoyers
  2010-01-07  5:02 ` Paul E. McKenney
  2010-01-07  5:28 ` Josh Triplett
@ 2010-01-07  5:40 ` Steven Rostedt
  2010-01-07  6:19   ` Mathieu Desnoyers
  2010-01-07 16:49   ` Paul E. McKenney
  2010-01-07  8:27 ` Peter Zijlstra
                   ` (2 subsequent siblings)
  5 siblings, 2 replies; 107+ messages in thread
From: Steven Rostedt @ 2010-01-07  5:40 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: linux-kernel, Paul E. McKenney, Ingo Molnar, akpm, josh, tglx,
	peterz, Valdis.Kletnieks, dhowells, laijs, dipankar

On Wed, 2010-01-06 at 23:40 -0500, Mathieu Desnoyers wrote:
> Here is an implementation of a new system call, sys_membarrier(), which
> executes a memory barrier on all threads of the current process.
> 
> It aims at greatly simplifying and enhancing the current signal-based
> liburcu userspace RCU synchronize_rcu() implementation.
> (found at http://lttng.org/urcu)
> 

Nice.

> Both the signal-based and the sys_membarrier userspace RCU schemes
> permit us to remove the memory barrier from the userspace RCU
> rcu_read_lock() and rcu_read_unlock() primitives, thus significantly
> accelerating them. These memory barriers are replaced by compiler
> barriers on the read-side, and all matching memory barriers on the
> write-side are turned into an invokation of a memory barrier on all
> active threads in the process. By letting the kernel perform this
> synchronization rather than dumbly sending a signal to every process
> threads (as we currently do), we diminish the number of unnecessary wake
> ups and only issue the memory barriers on active threads. Non-running
> threads do not need to execute such barrier anyway, because these are
> implied by the scheduler context switches.
> 
> To explain the benefit of this scheme, let's introduce two example threads:
> 
> Thread A (non-frequent, e.g. executing liburcu synchronize_rcu())
> Thread B (frequent, e.g. executing liburcu rcu_read_lock()/rcu_read_unlock())
> 
> In a scheme where all smp_mb() in thread A synchronize_rcu() are
> ordering memory accesses with respect to smp_mb() present in
> rcu_read_lock/unlock(), we can change all smp_mb() from
> synchronize_rcu() into calls to sys_membarrier() and all smp_mb() from
> rcu_read_lock/unlock() into compiler barriers "barrier()".
> 
> Before the change, we had, for each smp_mb() pairs:
> 
> Thread A                    Thread B
> prev mem accesses           prev mem accesses
> smp_mb()                    smp_mb()
> follow mem accesses         follow mem accesses
> 
> After the change, these pairs become:
> 
> Thread A                    Thread B
> prev mem accesses           prev mem accesses
> sys_membarrier()            barrier()
> follow mem accesses         follow mem accesses
> 
> As we can see, there are two possible scenarios: either Thread B memory
> accesses do not happen concurrently with Thread A accesses (1), or they
> do (2).
> 
> 1) Non-concurrent Thread A vs Thread B accesses:
> 
> Thread A                    Thread B
> prev mem accesses
> sys_membarrier()
> follow mem accesses
>                             prev mem accesses
>                             barrier()
>                             follow mem accesses
> 
> In this case, thread B accesses will be weakly ordered. This is OK,
> because at that point, thread A is not particularly interested in
> ordering them with respect to its own accesses.
> 
> 2) Concurrent Thread A vs Thread B accesses
> 
> Thread A                    Thread B
> prev mem accesses           prev mem accesses
> sys_membarrier()            barrier()
> follow mem accesses         follow mem accesses
> 
> In this case, thread B accesses, which are ensured to be in program
> order thanks to the compiler barrier, will be "upgraded" to full
> smp_mb() thanks to the IPIs executing memory barriers on each active
> system threads. Each non-running process threads are intrinsically
> serialized by the scheduler.
> 
> The current implementation simply executes a memory barrier in an IPI
> handler on each active cpu. Going through the hassle of taking run queue
> locks and checking if the thread running on each online CPU belongs to
> the current thread seems more heavyweight than the cost of the IPI
> itself (not measured though).
> 


I don't think you need to grab any locks. Doing an rcu_read_lock()
should prevent tasks from disappearing (since destruction of tasks uses
RCU). You may still need to grab the tasklist_lock under read_lock().

So what you could do, is find each task that is a thread of the calling
task, and then just check task_rq(task)->curr != task. Just send the
IPI's to those tasks that pass the test.

If the task->rq changes, or the task->rq->curr changes, and that makes
the condition fail (or even pass), the events that cause those changes
probably provide enough ordering that the explicit smp_mb() is not
needed anyway.
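
In code, that idea might look roughly like the sketch below. It assumes
the syscall still lives in kernel/sched.c (so the cpu_curr() helper is in
scope), glosses over the tasklist_lock question raised above, and has not
been tested:

SYSCALL_DEFINE0(membarrier)
{
	struct task_struct *t = current;
	cpumask_var_t mask;

	if (!alloc_cpumask_var(&mask, GFP_KERNEL))
		return -ENOMEM;
	cpumask_clear(mask);

	rcu_read_lock();
	/* Tasks cannot vanish under rcu_read_lock(), since task_struct
	 * is freed via RCU. */
	do {
		int cpu = task_cpu(t);

		/* Racy check, but as argued above the race is harmless. */
		if (cpu_curr(cpu) == t)
			cpumask_set_cpu(cpu, mask);
	} while_each_thread(current, t);
	rcu_read_unlock();

	smp_mb();	/* order the caller itself, as on_each_cpu() did */

	preempt_disable();
	smp_call_function_many(mask, membarrier_ipi, NULL, 1);
	preempt_enable();

	free_cpumask_var(mask);
	return 0;
}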

-- Steve



> The system call number is only assigned for x86_64 in this RFC patch.




* Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier
  2010-01-07  5:28 ` Josh Triplett
@ 2010-01-07  6:04   ` Mathieu Desnoyers
  2010-01-07  6:32     ` Josh Triplett
  2010-01-07 16:46     ` Paul E. McKenney
  0 siblings, 2 replies; 107+ messages in thread
From: Mathieu Desnoyers @ 2010-01-07  6:04 UTC (permalink / raw)
  To: Josh Triplett
  Cc: linux-kernel, Paul E. McKenney, Ingo Molnar, akpm, tglx, peterz,
	rostedt, Valdis.Kletnieks, dhowells, laijs, dipankar

* Josh Triplett (josh@joshtriplett.org) wrote:
> On Wed, Jan 06, 2010 at 11:40:07PM -0500, Mathieu Desnoyers wrote:
> > Here is an implementation of a new system call, sys_membarrier(), which
> > executes a memory barrier on all threads of the current process.
> > 
> > It aims at greatly simplifying and enhancing the current signal-based
> > liburcu userspace RCU synchronize_rcu() implementation.
> > (found at http://lttng.org/urcu)
> > 
> > Both the signal-based and the sys_membarrier userspace RCU schemes
> > permit us to remove the memory barrier from the userspace RCU
> > rcu_read_lock() and rcu_read_unlock() primitives, thus significantly
> > accelerating them. These memory barriers are replaced by compiler
> > barriers on the read-side, and all matching memory barriers on the
> > write-side are turned into an invokation of a memory barrier on all
> > active threads in the process. By letting the kernel perform this
> > synchronization rather than dumbly sending a signal to every process
> > threads (as we currently do), we diminish the number of unnecessary wake
> > ups and only issue the memory barriers on active threads. Non-running
> > threads do not need to execute such barrier anyway, because these are
> > implied by the scheduler context switches.
> [...]
> > The current implementation simply executes a memory barrier in an IPI
> > handler on each active cpu. Going through the hassle of taking run queue
> > locks and checking if the thread running on each online CPU belongs to
> > the current thread seems more heavyweight than the cost of the IPI
> > itself (not measured though).
> 
> > --- linux-2.6-lttng.orig/kernel/sched.c	2010-01-06 22:11:32.000000000 -0500
> > +++ linux-2.6-lttng/kernel/sched.c	2010-01-06 23:20:42.000000000 -0500
> > @@ -10822,6 +10822,36 @@ struct cgroup_subsys cpuacct_subsys = {
> >  };
> >  #endif	/* CONFIG_CGROUP_CPUACCT */
> >  
> > +/*
> > + * Execute a memory barrier on all CPUs on SMP systems.
> > + * Do not rely on implicit barriers in smp_call_function(), just in case they
> > + * are ever relaxed in the future.
> > + */
> > +static void membarrier_ipi(void *unused)
> > +{
> > +	smp_mb();
> > +}
> > +
> > +/*
> > + * sys_membarrier - issue memory barrier on current process running threads
> > + *
> > + * Execute a memory barrier on all running threads of the current process.
> > + * Upon completion, the caller thread is ensured that all process threads
> > + * have passed through a state where memory accesses match program order.
> > + * (non-running threads are de facto in such a state)
> > + *
> > + * The current implementation simply executes a memory barrier in an IPI handler
> > + * on each active cpu. Going through the hassle of taking run queue locks and
> > + * checking if the thread running on each online CPU belongs to the current
> > + * thread seems more heavyweight than the cost of the IPI itself.
> > + */
> > +SYSCALL_DEFINE0(membarrier)
> > +{
> > +	on_each_cpu(membarrier_ipi, NULL, 1);
> > +
> > +	return 0;
> > +}
> > +
> 
> Nice idea.  A few things come immediately to mind:
> 
> - If !CONFIG_SMP, this syscall should become (more of) a no-op.  Ideally
>   even if CONFIG_SMP but running with one CPU.  (If you really wanted to
>   go nuts, you could make it a vsyscall that did nothing with 1 CPU, to
>   avoid the syscall overhead, but that seems like entirely too much
>   trouble.)
> 

Sure, will do.

> - Have you tested what happens if a process does "while(1)
>   membarrier();"?  By running on every CPU, including those not owned by
>   the current process, this has the potential to make DoS easier,
>   particularly on systems with many CPUs.  That gets even worse if a
>   process forks multiple threads running that same loop.  Also consider
>   that executing an IPI will do work even on a CPU currently running a
>   real-time task.

Just tried it with a 10,000,000-iteration loop.

The thread doing the system call loop takes 2.0% user time and 98%
system time. All other CPUs are nearly 100.0% idle. To give a bit more
info about my test setup: I also have a thread sitting on a CPU,
busy-waiting for the loop to complete. This thread takes 97.7% user
time (it is really just there to make sure we are indeed sending the
IPIs rather than skipping them via the thread_group_empty(current)
test). If I remove this thread, the execution time of the test program
shrinks from 32 seconds down to 1.9 seconds. So yes, the IPIs are
actually being sent, because removing the extra thread accelerates the
loop tremendously. I used an 8-core Xeon to test.

> 
> - Rather than groveling through runqueues, could you somehow remotely
>   check the value of current?  In theory, a race in doing so wouldn't
>   matter; finding something other than the current process should mean
>   you don't need to do a barrier, and finding the current process means
>   you might need to do a barrier.

Well, the thing is that sending an IPI to all processors can be done
very efficiently on a lot of architectures because it uses an IPI
broadcast. If we have to select a few processors and send them the IPI
individually, I fear that the solution will scale poorly on systems
where CPUs are densely used by threads belonging to the current
process.

So if we go down the route of sending an IPI broadcast as I did, the
performance improvement of skipping the smp_mb() on some CPUs seems
insignificant compared to the IPI itself. In addition, it would require
adding some preparation code and exchanging cache lines (containing the
process ID), which would actually slow down the non-parallel portion of
the system call (to accelerate the parallelizable portion on only some
of the CPUs).

So I don't think this would buy us anything. However, if we had a
per-process count of the number of threads in the thread group, we
could switch to per-CPU IPIs rather than a broadcast when we detect
that there are many fewer threads than CPUs.
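
Roughly, the policy could look like the sketch below. It is only an
illustration: membarrier_ipi_running_threads() is a hypothetical helper
standing in for the runqueue walk discussed elsewhere in this thread,
signal->count is used as an approximation of the thread-group size
(a dedicated counter would be cleaner), and the factor 4 is the
arbitrary threshold mentioned above:

SYSCALL_DEFINE0(membarrier)
{
	/* Approximate number of threads in the group (assumption). */
	unsigned int nr_threads = atomic_read(&current->signal->count);

	if (thread_group_empty(current))
		return 0;

	if (nr_threads * 4 >= num_online_cpus()) {
		/* Densely threaded process: one broadcast IPI is cheapest. */
		on_each_cpu(membarrier_ipi, NULL, 1);
	} else {
		/* Few threads on a big machine: IPI only the CPUs that
		 * currently run our threads (hypothetical helper). */
		membarrier_ipi_running_threads(current);
	}
	return 0;
}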

> 
> - Part of me thinks this ought to become slightly more general, and just
>   deliver a signal that the receiving thread could handle as it likes.
>   However, that would certainly prove more expensive than this, and I
>   don't know that the generality would buy anything.

A general scheme would have to reach every thread, even those which are
not running. This system call is a particular case where we can forget
about non-running threads, because the memory barrier is implied by the
scheduler activity that took them off the CPU. So I really don't see how
we could use this IPI scheme for anything other than this kind of
synchronization.

> 
> - Could you somehow register reader threads with the kernel, in a way
>   that makes them easy to detect remotely?

There are two ways I can see to do this. One would imply adding extra
shared data between kernel and userspace (which I'd like to avoid, to
keep coupling low). The alternative would be to add per-task_struct
information about this, plus new system calls. The added per-task_struct
information would use up cache lines (which are very precious,
especially in the task_struct), and an added system call in
rcu_read_lock/unlock() would simply kill performance.

Thanks,

Mathieu

> 
> 
> - Josh Triplett

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


* Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier
  2010-01-07  5:40 ` Steven Rostedt
@ 2010-01-07  6:19   ` Mathieu Desnoyers
  2010-01-07  6:35     ` Josh Triplett
  2010-01-07 14:27     ` Steven Rostedt
  2010-01-07 16:49   ` Paul E. McKenney
  1 sibling, 2 replies; 107+ messages in thread
From: Mathieu Desnoyers @ 2010-01-07  6:19 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, Paul E. McKenney, Ingo Molnar, akpm, josh, tglx,
	peterz, Valdis.Kletnieks, dhowells, laijs, dipankar

* Steven Rostedt (rostedt@goodmis.org) wrote:
> On Wed, 2010-01-06 at 23:40 -0500, Mathieu Desnoyers wrote:
> > Here is an implementation of a new system call, sys_membarrier(), which
> > executes a memory barrier on all threads of the current process.
> > 
> > It aims at greatly simplifying and enhancing the current signal-based
> > liburcu userspace RCU synchronize_rcu() implementation.
> > (found at http://lttng.org/urcu)
> > 
> 
> Nice.
> 
> > Both the signal-based and the sys_membarrier userspace RCU schemes
> > permit us to remove the memory barrier from the userspace RCU
> > rcu_read_lock() and rcu_read_unlock() primitives, thus significantly
> > accelerating them. These memory barriers are replaced by compiler
> > barriers on the read-side, and all matching memory barriers on the
> > write-side are turned into an invokation of a memory barrier on all
> > active threads in the process. By letting the kernel perform this
> > synchronization rather than dumbly sending a signal to every process
> > threads (as we currently do), we diminish the number of unnecessary wake
> > ups and only issue the memory barriers on active threads. Non-running
> > threads do not need to execute such barrier anyway, because these are
> > implied by the scheduler context switches.
> > 
> > To explain the benefit of this scheme, let's introduce two example threads:
> > 
> > Thread A (non-frequent, e.g. executing liburcu synchronize_rcu())
> > Thread B (frequent, e.g. executing liburcu rcu_read_lock()/rcu_read_unlock())
> > 
> > In a scheme where all smp_mb() in thread A synchronize_rcu() are
> > ordering memory accesses with respect to smp_mb() present in
> > rcu_read_lock/unlock(), we can change all smp_mb() from
> > synchronize_rcu() into calls to sys_membarrier() and all smp_mb() from
> > rcu_read_lock/unlock() into compiler barriers "barrier()".
> > 
> > Before the change, we had, for each smp_mb() pairs:
> > 
> > Thread A                    Thread B
> > prev mem accesses           prev mem accesses
> > smp_mb()                    smp_mb()
> > follow mem accesses         follow mem accesses
> > 
> > After the change, these pairs become:
> > 
> > Thread A                    Thread B
> > prev mem accesses           prev mem accesses
> > sys_membarrier()            barrier()
> > follow mem accesses         follow mem accesses
> > 
> > As we can see, there are two possible scenarios: either Thread B memory
> > accesses do not happen concurrently with Thread A accesses (1), or they
> > do (2).
> > 
> > 1) Non-concurrent Thread A vs Thread B accesses:
> > 
> > Thread A                    Thread B
> > prev mem accesses
> > sys_membarrier()
> > follow mem accesses
> >                             prev mem accesses
> >                             barrier()
> >                             follow mem accesses
> > 
> > In this case, thread B accesses will be weakly ordered. This is OK,
> > because at that point, thread A is not particularly interested in
> > ordering them with respect to its own accesses.
> > 
> > 2) Concurrent Thread A vs Thread B accesses
> > 
> > Thread A                    Thread B
> > prev mem accesses           prev mem accesses
> > sys_membarrier()            barrier()
> > follow mem accesses         follow mem accesses
> > 
> > In this case, thread B accesses, which are ensured to be in program
> > order thanks to the compiler barrier, will be "upgraded" to full
> > smp_mb() thanks to the IPIs executing memory barriers on each active
> > system threads. Each non-running process threads are intrinsically
> > serialized by the scheduler.
> > 
> > The current implementation simply executes a memory barrier in an IPI
> > handler on each active cpu. Going through the hassle of taking run queue
> > locks and checking if the thread running on each online CPU belongs to
> > the current thread seems more heavyweight than the cost of the IPI
> > itself (not measured though).
> > 
> 
> 
> I don't think you need to grab any locks. Doing an rcu_read_lock()
> should prevent tasks from disappearing (since destruction of tasks use
> RCU). You may still need to grab the tasklist_lock under read_lock().
> 
> So what you could do, is find each task that is a thread of the calling
> task, and then just check task_rq(task)->curr != task. Just send the
> IPI's to those tasks that pass the test.

I guess you mean

"then just check task_rq(task)->curr == task" ... ?

> 
> If the task->rq changes, or the task->rq->curr changes, and makes the
> condition fail (or even pass), the events that cause those changes are
> probably good enough than needing to call smp_mb();

I see your point.

This would probably be good for machines with a very large number of
CPUs and without IPI broadcast support, running processes with only a
few threads. I am really starting to think that we should have some way
to compare the number of threads belonging to a process with the number
of CPUs, and choose between the broadcast IPI and per-CPU IPIs depending
on whether we are over or under an arbitrary threshold.

Thanks,

Mathieu


> 
> -- Steve
> 
> 
> 
> > The system call number is only assigned for x86_64 in this RFC patch.
> 
> 

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


* Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier
  2010-01-07  6:04   ` Mathieu Desnoyers
@ 2010-01-07  6:32     ` Josh Triplett
  2010-01-07 17:45       ` Mathieu Desnoyers
  2010-01-07 16:46     ` Paul E. McKenney
  1 sibling, 1 reply; 107+ messages in thread
From: Josh Triplett @ 2010-01-07  6:32 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: linux-kernel, Paul E. McKenney, Ingo Molnar, akpm, tglx, peterz,
	rostedt, Valdis.Kletnieks, dhowells, laijs, dipankar

On Thu, Jan 07, 2010 at 01:04:39AM -0500, Mathieu Desnoyers wrote:
> * Josh Triplett (josh@joshtriplett.org) wrote:
> > On Wed, Jan 06, 2010 at 11:40:07PM -0500, Mathieu Desnoyers wrote:
> > > Here is an implementation of a new system call, sys_membarrier(), which
> > > executes a memory barrier on all threads of the current process.
> > > 
> > > It aims at greatly simplifying and enhancing the current signal-based
> > > liburcu userspace RCU synchronize_rcu() implementation.
> > > (found at http://lttng.org/urcu)
> > > 
> > > Both the signal-based and the sys_membarrier userspace RCU schemes
> > > permit us to remove the memory barrier from the userspace RCU
> > > rcu_read_lock() and rcu_read_unlock() primitives, thus significantly
> > > accelerating them. These memory barriers are replaced by compiler
> > > barriers on the read-side, and all matching memory barriers on the
> > > write-side are turned into an invokation of a memory barrier on all
> > > active threads in the process. By letting the kernel perform this
> > > synchronization rather than dumbly sending a signal to every process
> > > threads (as we currently do), we diminish the number of unnecessary wake
> > > ups and only issue the memory barriers on active threads. Non-running
> > > threads do not need to execute such barrier anyway, because these are
> > > implied by the scheduler context switches.
> > [...]
> > > The current implementation simply executes a memory barrier in an IPI
> > > handler on each active cpu. Going through the hassle of taking run queue
> > > locks and checking if the thread running on each online CPU belongs to
> > > the current thread seems more heavyweight than the cost of the IPI
> > > itself (not measured though).
> > 
> > > --- linux-2.6-lttng.orig/kernel/sched.c	2010-01-06 22:11:32.000000000 -0500
> > > +++ linux-2.6-lttng/kernel/sched.c	2010-01-06 23:20:42.000000000 -0500
> > > @@ -10822,6 +10822,36 @@ struct cgroup_subsys cpuacct_subsys = {
> > >  };
> > >  #endif	/* CONFIG_CGROUP_CPUACCT */
> > >  
> > > +/*
> > > + * Execute a memory barrier on all CPUs on SMP systems.
> > > + * Do not rely on implicit barriers in smp_call_function(), just in case they
> > > + * are ever relaxed in the future.
> > > + */
> > > +static void membarrier_ipi(void *unused)
> > > +{
> > > +	smp_mb();
> > > +}
> > > +
> > > +/*
> > > + * sys_membarrier - issue memory barrier on current process running threads
> > > + *
> > > + * Execute a memory barrier on all running threads of the current process.
> > > + * Upon completion, the caller thread is ensured that all process threads
> > > + * have passed through a state where memory accesses match program order.
> > > + * (non-running threads are de facto in such a state)
> > > + *
> > > + * The current implementation simply executes a memory barrier in an IPI handler
> > > + * on each active cpu. Going through the hassle of taking run queue locks and
> > > + * checking if the thread running on each online CPU belongs to the current
> > > + * thread seems more heavyweight than the cost of the IPI itself.
> > > + */
> > > +SYSCALL_DEFINE0(membarrier)
> > > +{
> > > +	on_each_cpu(membarrier_ipi, NULL, 1);
> > > +
> > > +	return 0;
> > > +}
> > > +
> > 
> > Nice idea.  A few things come immediately to mind:
> > 
> > - If !CONFIG_SMP, this syscall should become (more of) a no-op.  Ideally
> >   even if CONFIG_SMP but running with one CPU.  (If you really wanted to
> >   go nuts, you could make it a vsyscall that did nothing with 1 CPU, to
> >   avoid the syscall overhead, but that seems like entirely too much
> >   trouble.)
> > 
> 
> Sure, will do.
> 
> > - Have you tested what happens if a process does "while(1)
> >   membarrier();"?  By running on every CPU, including those not owned by
> >   the current process, this has the potential to make DoS easier,
> >   particularly on systems with many CPUs.  That gets even worse if a
> >   process forks multiple threads running that same loop.  Also consider
> >   that executing an IPI will do work even on a CPU currently running a
> >   real-time task.
> 
> Just tried it with a 10,000,000 iterations loop.
> 
> The thread doing the system call loop takes 2.0% of user time, 98% of
> system time. All other cpus are nearly 100.0% idle. Just to give a bit
> more info about my test setup, I also have a thread sitting on a CPU
> busy-waiting for the loop to complete. This thread takes 97.7% user
> time (but it really is just there to make sure we are indeed doing the
> IPIs, not skipping it through the thread_group_empty(current) test). If
> I remove this thread, the execution time of the test program shrinks
> from 32 seconds down to 1.9 seconds. So yes, the IPI is actually
> executed in the first place, because removing the extra thread
> accelerates the loop tremendously. I used a 8-core Xeon to test.

Do you know if the kernel properly measures the overhead of IPIs?  The
CPUs might have only looked idle.  What about running some kind of
CPU-bound benchmark on the other CPUs and testing the completion time
with and without the process running the membarrier loop?
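
Something like the sketch below could measure that (illustration only;
run it once without arguments for the baseline and once with any
argument to hammer the syscall, then compare the spinner iteration
counts; the x86_64 syscall number is the one from the RFC patch):

#define _GNU_SOURCE
#include <pthread.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <sys/syscall.h>

#define __NR_membarrier	299	/* x86_64 number from the RFC patch */

static volatile int stop;

/* CPU-bound work standing in for a real workload on the other CPUs. */
static void *spin(void *arg)
{
	unsigned long n = 0;

	while (!stop)
		n++;
	return (void *)n;
}

int main(int argc, char **argv)
{
	int hammer = argc > 1;	/* any argument: run the syscall loop */
	long nspin = sysconf(_SC_NPROCESSORS_ONLN) - 1;
	time_t end = time(NULL) + 10;
	unsigned long total = 0;
	void *ret;
	long i;

	if (nspin < 1)
		nspin = 1;
	pthread_t tid[nspin];

	for (i = 0; i < nspin; i++)
		pthread_create(&tid[i], NULL, spin, NULL);

	/* The main thread busy-waits in both cases, so the baseline and
	 * the membarrier run occupy the same number of CPUs. */
	while (time(NULL) < end)
		if (hammer)
			syscall(__NR_membarrier);

	stop = 1;
	for (i = 0; i < nspin; i++) {
		pthread_join(tid[i], &ret);
		total += (unsigned long)ret;
	}
	printf("%s: %lu spinner iterations\n",
	       hammer ? "with membarrier loop" : "baseline", total);
	return 0;
}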

> > - Rather than groveling through runqueues, could you somehow remotely
> >   check the value of current?  In theory, a race in doing so wouldn't
> >   matter; finding something other than the current process should mean
> >   you don't need to do a barrier, and finding the current process means
> >   you might need to do a barrier.
> 
> Well, the thing is that sending an IPI to all processors can be done
> very efficiently on a lot of architectures because it uses an IPI
> broadcast. If we have to select a few processors to which we send the
> signal individually, I fear that the solution will scale poorly on
> systems where cpus are densely used by threads belonging to the current
> process.

Assuming the system doesn't have some kind of "broadcast with mask" IPI,
yeah.

But it seems OK to make writers not scale quite as well, as long as
readers continue to scale OK and unrelated processes don't get impacted.

> So if we go down the route of sending an IPI broadcast as I did, then
> the performance improvement of skipping the smp_mb() for some CPU seems
> insignificant compared to the IPI. In addition, it would require to add
> some preparation code and exchange cache-lines (containing the process
> ID), which would actually slow down the non-parallel portion of the
> system call (to accelerate the parallelizable portion on only some of
> the CPUs).


> > - Part of me thinks this ought to become slightly more general, and just
> >   deliver a signal that the receiving thread could handle as it likes.
> >   However, that would certainly prove more expensive than this, and I
> >   don't know that the generality would buy anything.
> 
> A general scheme would have to call every threads, even those which are
> not running. In the case of this system call, this is a particular case
> where we can forget about non-running threads, because the memory
> barrier is implied by the scheduler activity that brought them offline.
> So I really don't see how we can use this IPI scheme for other things
> that this kind of synchronization.

No, I don't mean non-running threads.  If you wanted that, you could do
what urcu currently does, and send a signal to all threads.  I meant
something like "signal all *running* threads from my process".

> > - Could you somehow register reader threads with the kernel, in a way
> >   that makes them easy to detect remotely?
> 
> There are two ways I figure out we could do this. One would imply adding
> extra shared data between kernel and userspace (which I'd like to avoid,
> to keep coupling low). The other alternative would be to add per
> task_struct information about this, and new system calls. The added per
> task_struct information would use up cache lines (which are very
> important, especially in the task_struct) and the added system call at
> rcu_read_lock/unlock() would simply kill performance.

No, I didn't mean that you would do a syscall in rcu_read_{lock,unlock}.
I meant that you would do a system call when the reader threads start,
saying "hey, reader thread here".

- Josh Triplett


* Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier
  2010-01-07  6:19   ` Mathieu Desnoyers
@ 2010-01-07  6:35     ` Josh Triplett
  2010-01-07  8:44       ` Peter Zijlstra
  2010-01-07 14:27     ` Steven Rostedt
  1 sibling, 1 reply; 107+ messages in thread
From: Josh Triplett @ 2010-01-07  6:35 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Steven Rostedt, linux-kernel, Paul E. McKenney, Ingo Molnar,
	akpm, tglx, peterz, Valdis.Kletnieks, dhowells, laijs, dipankar

On Thu, Jan 07, 2010 at 01:19:55AM -0500, Mathieu Desnoyers wrote:
> * Steven Rostedt (rostedt@goodmis.org) wrote:
> > On Wed, 2010-01-06 at 23:40 -0500, Mathieu Desnoyers wrote:
> > > Here is an implementation of a new system call, sys_membarrier(), which
> > > executes a memory barrier on all threads of the current process.
> > > 
> > > It aims at greatly simplifying and enhancing the current signal-based
> > > liburcu userspace RCU synchronize_rcu() implementation.
> > > (found at http://lttng.org/urcu)
> > > 
> > 
> > Nice.
> > 
> > > Both the signal-based and the sys_membarrier userspace RCU schemes
> > > permit us to remove the memory barrier from the userspace RCU
> > > rcu_read_lock() and rcu_read_unlock() primitives, thus significantly
> > > accelerating them. These memory barriers are replaced by compiler
> > > barriers on the read-side, and all matching memory barriers on the
> > > write-side are turned into an invokation of a memory barrier on all
> > > active threads in the process. By letting the kernel perform this
> > > synchronization rather than dumbly sending a signal to every process
> > > threads (as we currently do), we diminish the number of unnecessary wake
> > > ups and only issue the memory barriers on active threads. Non-running
> > > threads do not need to execute such barrier anyway, because these are
> > > implied by the scheduler context switches.
> > > 
> > > To explain the benefit of this scheme, let's introduce two example threads:
> > > 
> > > Thread A (non-frequent, e.g. executing liburcu synchronize_rcu())
> > > Thread B (frequent, e.g. executing liburcu rcu_read_lock()/rcu_read_unlock())
> > > 
> > > In a scheme where all smp_mb() in thread A synchronize_rcu() are
> > > ordering memory accesses with respect to smp_mb() present in
> > > rcu_read_lock/unlock(), we can change all smp_mb() from
> > > synchronize_rcu() into calls to sys_membarrier() and all smp_mb() from
> > > rcu_read_lock/unlock() into compiler barriers "barrier()".
> > > 
> > > Before the change, we had, for each smp_mb() pairs:
> > > 
> > > Thread A                    Thread B
> > > prev mem accesses           prev mem accesses
> > > smp_mb()                    smp_mb()
> > > follow mem accesses         follow mem accesses
> > > 
> > > After the change, these pairs become:
> > > 
> > > Thread A                    Thread B
> > > prev mem accesses           prev mem accesses
> > > sys_membarrier()            barrier()
> > > follow mem accesses         follow mem accesses
> > > 
> > > As we can see, there are two possible scenarios: either Thread B memory
> > > accesses do not happen concurrently with Thread A accesses (1), or they
> > > do (2).
> > > 
> > > 1) Non-concurrent Thread A vs Thread B accesses:
> > > 
> > > Thread A                    Thread B
> > > prev mem accesses
> > > sys_membarrier()
> > > follow mem accesses
> > >                             prev mem accesses
> > >                             barrier()
> > >                             follow mem accesses
> > > 
> > > In this case, thread B accesses will be weakly ordered. This is OK,
> > > because at that point, thread A is not particularly interested in
> > > ordering them with respect to its own accesses.
> > > 
> > > 2) Concurrent Thread A vs Thread B accesses
> > > 
> > > Thread A                    Thread B
> > > prev mem accesses           prev mem accesses
> > > sys_membarrier()            barrier()
> > > follow mem accesses         follow mem accesses
> > > 
> > > In this case, thread B accesses, which are ensured to be in program
> > > order thanks to the compiler barrier, will be "upgraded" to full
> > > smp_mb() thanks to the IPIs executing memory barriers on each active
> > > system threads. Each non-running process threads are intrinsically
> > > serialized by the scheduler.
> > > 
> > > The current implementation simply executes a memory barrier in an IPI
> > > handler on each active cpu. Going through the hassle of taking run queue
> > > locks and checking if the thread running on each online CPU belongs to
> > > the current thread seems more heavyweight than the cost of the IPI
> > > itself (not measured though).
> > > 
> > 
> > 
> > I don't think you need to grab any locks. Doing an rcu_read_lock()
> > should prevent tasks from disappearing (since destruction of tasks use
> > RCU). You may still need to grab the tasklist_lock under read_lock().
> > 
> > So what you could do, is find each task that is a thread of the calling
> > task, and then just check task_rq(task)->curr != task. Just send the
> > IPI's to those tasks that pass the test.
> 
> I guess you mean
> 
> "then just check task_rq(task)->curr == task" ... ?
> 
> > 
> > If the task->rq changes, or the task->rq->curr changes, and makes the
> > condition fail (or even pass), the events that cause those changes are
> > probably good enough than needing to call smp_mb();
> 
> I see your point.
> 
> This would probably be good for machines with very large number of cpus
> and without IPI broadcast support, running processes with only few
> threads.

Or with expensive IPIs and/or expensive user-kernel switches.

> I really start to think that we should have some way to compare
> the number of threads belonging to a process and choose between the
> broadcast IPI and the per-cpu IPI depending if we are over or under an
> arbitrary threshold.

The number of threads doesn't matter nearly as much as the number of
threads typically running at a time compared to the number of
processors.  Of course, we can't measure that as easily, but I don't
know that your proposed heuristic would approximate it well.

- Josh Triplett

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier
  2010-01-07  4:40 [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier Mathieu Desnoyers
                   ` (2 preceding siblings ...)
  2010-01-07  5:40 ` Steven Rostedt
@ 2010-01-07  8:27 ` Peter Zijlstra
  2010-01-07 18:30   ` Oleg Nesterov
  2010-01-07  9:50 ` Andi Kleen
  2010-01-07 11:04 ` David Howells
  5 siblings, 1 reply; 107+ messages in thread
From: Peter Zijlstra @ 2010-01-07  8:27 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: linux-kernel, Paul E. McKenney, Ingo Molnar, akpm, josh, tglx,
	rostedt, Valdis.Kletnieks, dhowells, laijs, dipankar,
	Oleg Nesterov

On Wed, 2010-01-06 at 23:40 -0500, Mathieu Desnoyers wrote:

> Index: linux-2.6-lttng/kernel/sched.c
> ===================================================================
> --- linux-2.6-lttng.orig/kernel/sched.c	2010-01-06 22:11:32.000000000 -0500
> +++ linux-2.6-lttng/kernel/sched.c	2010-01-06 23:20:42.000000000 -0500
> @@ -10822,6 +10822,36 @@ struct cgroup_subsys cpuacct_subsys = {
>  };
>  #endif	/* CONFIG_CGROUP_CPUACCT */
>  
> +/*
> + * Execute a memory barrier on all CPUs on SMP systems.
> + * Do not rely on implicit barriers in smp_call_function(), just in case they
> + * are ever relaxed in the future.
> + */
> +static void membarrier_ipi(void *unused)
> +{
> +	smp_mb();
> +}
> +
> +/*
> + * sys_membarrier - issue memory barrier on current process running threads
> + *
> + * Execute a memory barrier on all running threads of the current process.
> + * Upon completion, the caller thread is ensured that all process threads
> + * have passed through a state where memory accesses match program order.
> + * (non-running threads are de facto in such a state)
> + *
> + * The current implementation simply executes a memory barrier in an IPI handler
> + * on each active cpu. Going through the hassle of taking run queue locks and
> + * checking if the thread running on each online CPU belongs to the current
> + * thread seems more heavyweight than the cost of the IPI itself.
> + */
> +SYSCALL_DEFINE0(membarrier)
> +{
> +	on_each_cpu(membarrier_ipi, NULL, 1);
> +
> +	return 0;
> +}
> +
>  #ifndef CONFIG_SMP
>  
>  int rcu_expedited_torture_stats(char *page)

OK, so my worry here is that it's a DoS on large machines.

Something like:
  smp_call_function_any(current->mm->cpu_vm_mask, membarrier, NULL, 1);

might be slightly better, but would still hurt. The alternative is
iterating over all CPUs, checking whether cpu_curr(cpu)->mm ==
current->mm, and sending an IPI only to those that match.

Also, there was some talk a while ago about IPIs implying memory
barriers, the details of which I have of course forgotten... at least
sending one implies a wmb and receiving one an rmb, but it could be
stronger. Oleg?




^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier
  2010-01-07  5:02 ` Paul E. McKenney
  2010-01-07  5:39   ` Mathieu Desnoyers
@ 2010-01-07  8:32   ` Peter Zijlstra
  2010-01-07 16:39     ` Paul E. McKenney
  1 sibling, 1 reply; 107+ messages in thread
From: Peter Zijlstra @ 2010-01-07  8:32 UTC (permalink / raw)
  To: paulmck
  Cc: Mathieu Desnoyers, linux-kernel, Ingo Molnar, akpm, josh, tglx,
	rostedt, Valdis.Kletnieks, dhowells, laijs, dipankar

On Wed, 2010-01-06 at 21:02 -0800, Paul E. McKenney wrote:
> 
> Beats the heck out of user-mode signal handlers!!!  And it is hard
> to imagine groveling through runqueues ever being a win, even on very
> large systems.  The only reasonable optimization I can imagine is to
> turn this into a no-op for a single-threaded process, but there are
> other ways to do that optimization.
> 
> Reviewed-by: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>

Then imagine someone doing:

 while (1)
  sys_membarrier();

on your multi-node machine, and see how happy you are then.


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier
  2010-01-07  6:35     ` Josh Triplett
@ 2010-01-07  8:44       ` Peter Zijlstra
  2010-01-07 13:15         ` Steven Rostedt
                           ` (2 more replies)
  0 siblings, 3 replies; 107+ messages in thread
From: Peter Zijlstra @ 2010-01-07  8:44 UTC (permalink / raw)
  To: Josh Triplett
  Cc: Mathieu Desnoyers, Steven Rostedt, linux-kernel,
	Paul E. McKenney, Ingo Molnar, akpm, tglx, Valdis.Kletnieks,
	dhowells, laijs, dipankar

On Wed, 2010-01-06 at 22:35 -0800, Josh Triplett wrote:
> 
> The number of threads doesn't matter nearly as much as the number of
> threads typically running at a time compared to the number of
> processors.  Of course, we can't measure that as easily, but I don't
> know that your proposed heuristic would approximate it well.

Quite agreed, and not disturbing RT tasks is even more important.

A simple:

  for_each_cpu(cpu, current->mm->cpu_vm_mask) {
     if (cpu_curr(cpu)->mm == current->mm)
        smp_call_function_single(cpu, func, NULL, 1);
  }

seems far preferable over anything else. If you really want, you can
copy cpu_vm_mask into a cpumask, unset bits, and use the mask with
smp_call_function_any(), but that includes having to allocate the
cpumask, which might or might not be too expensive for Mathieu.
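
For illustration, a minimal sketch of that cpumask variant, assuming it
lives in kernel/sched.c next to the patch (so cpu_curr() and
membarrier_ipi() are visible) and using smp_call_function_many(), which
IPIs every CPU left in the mask except the caller. This is a sketch,
not part of the patch:

/*
 * Sketch only: copy the process's cpu_vm_mask, drop CPUs whose current
 * task is not in our mm, and IPI the remainder in one go.
 */
static void membarrier_ipi_selected(void)
{
	cpumask_var_t tmpmask;
	int cpu;

	if (!alloc_cpumask_var(&tmpmask, GFP_KERNEL))
		return;		/* a real implementation needs a fallback */

	cpumask_copy(tmpmask, mm_cpumask(current->mm));
	for_each_cpu(cpu, tmpmask)
		if (cpu_curr(cpu)->mm != current->mm)	/* racy read, see below */
			cpumask_clear_cpu(cpu, tmpmask);

	smp_mb();	/* order the caller's prior accesses before the IPIs */
	preempt_disable();
	smp_call_function_many(tmpmask, membarrier_ipi, NULL, 1);
	preempt_enable();
	free_cpumask_var(tmpmask);
}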




^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier
  2010-01-07  4:40 [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier Mathieu Desnoyers
                   ` (3 preceding siblings ...)
  2010-01-07  8:27 ` Peter Zijlstra
@ 2010-01-07  9:50 ` Andi Kleen
  2010-01-07 15:12   ` Mathieu Desnoyers
  2010-01-07 16:56   ` Paul E. McKenney
  2010-01-07 11:04 ` David Howells
  5 siblings, 2 replies; 107+ messages in thread
From: Andi Kleen @ 2010-01-07  9:50 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: linux-kernel, Paul E. McKenney, Ingo Molnar, akpm, josh, tglx,
	peterz, rostedt, Valdis.Kletnieks, dhowells, laijs, dipankar

Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> writes:

> Both the signal-based and the sys_membarrier userspace RCU schemes
> permit us to remove the memory barrier from the userspace RCU
> rcu_read_lock() and rcu_read_unlock() primitives, thus significantly
> accelerating them. These memory barriers are replaced by compiler
> barriers on the read-side, and all matching memory barriers on the
> write-side are turned into an invokation of a memory barrier on all
> active threads in the process. By letting the kernel perform this
> synchronization rather than dumbly sending a signal to every process
> threads (as we currently do), we diminish the number of unnecessary wake
> ups and only issue the memory barriers on active threads. Non-running
> threads do not need to execute such barrier anyway, because these are
> implied by the scheduler context switches.

I'm not sure all this effort is really needed on architectures
with strong memory ordering.

> + * The current implementation simply executes a memory barrier in an IPI handler
> + * on each active cpu. Going through the hassle of taking run queue locks and
> + * checking if the thread running on each online CPU belongs to the current
> + * thread seems more heavyweight than the cost of the IPI itself.
> + */
> +SYSCALL_DEFINE0(membarrier)
> +{
> +	on_each_cpu(membarrier_ipi, NULL, 1);

Can't you use mm->cpu_vm_mask?

-Andi

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier
  2010-01-07  4:40 [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier Mathieu Desnoyers
                   ` (4 preceding siblings ...)
  2010-01-07  9:50 ` Andi Kleen
@ 2010-01-07 11:04 ` David Howells
  2010-01-07 15:15   ` Mathieu Desnoyers
  2010-01-07 15:47   ` David Howells
  5 siblings, 2 replies; 107+ messages in thread
From: David Howells @ 2010-01-07 11:04 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: dhowells, linux-kernel, Paul E. McKenney, Ingo Molnar, akpm,
	josh, tglx, peterz, rostedt, Valdis.Kletnieks, laijs, dipankar

Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote:

> The current implementation simply executes a memory barrier in an IPI
> handler on each active cpu. Going through the hassle of taking run queue
> locks and checking if the thread running on each online CPU belongs to
> the current thread seems more heavyweight than the cost of the IPI
> itself (not measured though).

There's another way to do this:

 (1) For each thread that you want to execute a memory barrier, mark in its
     task_struct that you want it to do a memory barrier and set
     TIF_NOTIFY_RESUME.

 (2) Interrupt all CPUs.  The interrupt handler doesn't have to do anything.

 (3) When any of the threads marked in (1) gains CPU time, do_notify_resume()
     will be executed; the do-memory-barrier flag can be tested and, if it
     was set, the flag can be cleared and a memory barrier can be
     interpolated.

The current thread will also pass through stage (3) on its way out, if it's
marked in stage (1).
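
A rough sketch of how this could look, for illustration only: the
membarrier_pending field, the noop IPI handler and the hook in
do_notify_resume() are hypothetical, and locking details are glossed
over.

static void membarrier_noop_ipi(void *unused)
{
	/* nothing: the interrupt itself just forces a kernel entry/exit */
}

static void membarrier_mark_threads(void)
{
	struct task_struct *t = current;

	/* (1) mark every thread of the current process */
	read_lock(&tasklist_lock);
	do {
		t->membarrier_pending = 1;		/* hypothetical field */
		set_tsk_thread_flag(t, TIF_NOTIFY_RESUME);
	} while_each_thread(current, t);
	read_unlock(&tasklist_lock);

	/* (2) interrupt every other CPU; the handler does nothing */
	smp_call_function(membarrier_noop_ipi, NULL, 1);
}

/*
 * (3) conceptually, in do_notify_resume():
 *
 *	if (current->membarrier_pending) {
 *		current->membarrier_pending = 0;
 *		smp_mb();
 *	}
 */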

David

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier
  2010-01-07  8:44       ` Peter Zijlstra
@ 2010-01-07 13:15         ` Steven Rostedt
  2010-01-07 15:07         ` Mathieu Desnoyers
  2010-01-07 16:52         ` Paul E. McKenney
  2 siblings, 0 replies; 107+ messages in thread
From: Steven Rostedt @ 2010-01-07 13:15 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Josh Triplett, Mathieu Desnoyers, linux-kernel, Paul E. McKenney,
	Ingo Molnar, akpm, tglx, Valdis.Kletnieks, dhowells, laijs,
	dipankar

On Thu, 2010-01-07 at 09:44 +0100, Peter Zijlstra wrote:

> A simple:
> 
>   for_each_cpu(cpu, current->mm->cpu_vm_mask) {
>      if (cpu_curr(cpu)->mm == current->mm)
>         smp_call_function_single(cpu, func, NULL, 1);
>   }

I like this algorithm the best ;-)

-- Steve



^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier
  2010-01-07  6:19   ` Mathieu Desnoyers
  2010-01-07  6:35     ` Josh Triplett
@ 2010-01-07 14:27     ` Steven Rostedt
  2010-01-07 15:10       ` Mathieu Desnoyers
  1 sibling, 1 reply; 107+ messages in thread
From: Steven Rostedt @ 2010-01-07 14:27 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: linux-kernel, Paul E. McKenney, Ingo Molnar, akpm, josh, tglx,
	peterz, Valdis.Kletnieks, dhowells, laijs, dipankar

On Thu, 2010-01-07 at 01:19 -0500, Mathieu Desnoyers wrote:

> I see your point.

Actually you are missing the point ;-)

> 
> This would probably be good for machines with very large number of cpus
> and without IPI broadcast support, running processes with only few
> threads. I really start to think that we should have some way to compare
> the number of threads belonging to a process and choose between the
> broadcast IPI and the per-cpu IPI depending if we are over or under an
> arbitrary threshold.


This has nothing to do with performance. It has to do with the fact
that a thread should not interfere with threads belonging to another
process. We really don't care how long sys_membarrier() takes (it's the
slow path anyway). We do care about a critical RT task being
interrupted by some Java thread sending thousands of IPIs.

-- Steve



^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier
  2010-01-07  8:44       ` Peter Zijlstra
  2010-01-07 13:15         ` Steven Rostedt
@ 2010-01-07 15:07         ` Mathieu Desnoyers
  2010-01-07 16:52         ` Paul E. McKenney
  2 siblings, 0 replies; 107+ messages in thread
From: Mathieu Desnoyers @ 2010-01-07 15:07 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Josh Triplett, Steven Rostedt, linux-kernel, Paul E. McKenney,
	Ingo Molnar, akpm, tglx, Valdis.Kletnieks, dhowells, laijs,
	dipankar

* Peter Zijlstra (peterz@infradead.org) wrote:
> On Wed, 2010-01-06 at 22:35 -0800, Josh Triplett wrote:
> > 
> > The number of threads doesn't matter nearly as much as the number of
> > threads typically running at a time compared to the number of
> > processors.  Of course, we can't measure that as easily, but I don't
> > know that your proposed heuristic would approximate it well.
> 
> Quite agreed, and not disturbing RT tasks is even more important.
> 
> A simple:
> 
>   for_each_cpu(cpu, current->mm->cpu_vm_mask) {
>      if (cpu_curr(cpu)->mm == current->mm)
>         smp_call_function_single(cpu, func, NULL, 1);
>   }
> 
> seems far preferable over anything else, if you really want you can use
> a cpumask to copy cpu_vm_mask in and unset bits and use the mask with
> smp_call_function_any(), but that includes having to allocate the
> cpumask, which might or might not be too expensive for Mathieu.
> 

I like this! :)

Following some testing, I think I'll go with your scheme, with two
smp_call_function_single() calls (one local function call for the
current thread, one IPI). If we need more than that, then we allocate a
cpumask and call smp_call_function_many() for the other CPUs. I provide
benchmarks justifying this choice in my reply to Josh.

Thanks,

Mathieu


-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier
  2010-01-07 14:27     ` Steven Rostedt
@ 2010-01-07 15:10       ` Mathieu Desnoyers
  0 siblings, 0 replies; 107+ messages in thread
From: Mathieu Desnoyers @ 2010-01-07 15:10 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, Paul E. McKenney, Ingo Molnar, akpm, josh, tglx,
	peterz, Valdis.Kletnieks, dhowells, laijs, dipankar

* Steven Rostedt (rostedt@goodmis.org) wrote:
> On Thu, 2010-01-07 at 01:19 -0500, Mathieu Desnoyers wrote:
> 
> > I see your point.
> 
> Actually you are missing the point ;-)
> 
> > 
> > This would probably be good for machines with very large number of cpus
> > and without IPI broadcast support, running processes with only few
> > threads. I really start to think that we should have some way to compare
> > the number of threads belonging to a process and choose between the
> > broadcast IPI and the per-cpu IPI depending if we are over or under an
> > arbitrary threshold.
> 
> 
> This has nothing to do with performance. It has to do with a thread
> should not interfere with a thread belonging to another process. We
> really don't care how long the sys_membarrier() takes (it's the slow
> path anyway). We do care that a critical RT task is being interrupted by
> some java thread sending thousands of IPIs.

Yes, PeterZ's scheme (and yours) seems to address this problem by
only impacting the CPUs running threads belonging to the current
process. I'll go with this instead of the broadcast IPI, which, as you,
Josh and Peter clearly pointed out, is a no-go in terms of real-time.

Thanks,

Mathieu

> 
> -- Steve
> 
> 

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier
  2010-01-07  9:50 ` Andi Kleen
@ 2010-01-07 15:12   ` Mathieu Desnoyers
  2010-01-07 16:56   ` Paul E. McKenney
  1 sibling, 0 replies; 107+ messages in thread
From: Mathieu Desnoyers @ 2010-01-07 15:12 UTC (permalink / raw)
  To: Andi Kleen
  Cc: linux-kernel, Paul E. McKenney, Ingo Molnar, akpm, josh, tglx,
	peterz, rostedt, Valdis.Kletnieks, dhowells, laijs, dipankar

* Andi Kleen (andi@firstfloor.org) wrote:
> Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> writes:
> 
> > Both the signal-based and the sys_membarrier userspace RCU schemes
> > permit us to remove the memory barrier from the userspace RCU
> > rcu_read_lock() and rcu_read_unlock() primitives, thus significantly
> > accelerating them. These memory barriers are replaced by compiler
> > barriers on the read-side, and all matching memory barriers on the
> > write-side are turned into an invokation of a memory barrier on all
> > active threads in the process. By letting the kernel perform this
> > synchronization rather than dumbly sending a signal to every process
> > threads (as we currently do), we diminish the number of unnecessary wake
> > ups and only issue the memory barriers on active threads. Non-running
> > threads do not need to execute such barrier anyway, because these are
> > implied by the scheduler context switches.
> 
> I'm not sure all this effort is really needed on architectures
> with strong memory ordering.

Do we still have many of those out there that support SMP? Even newer
ARM CPUs now need memory barriers.

> 
> > + * The current implementation simply executes a memory barrier in an IPI handler
> > + * on each active cpu. Going through the hassle of taking run queue locks and
> > + * checking if the thread running on each online CPU belongs to the current
> > + * thread seems more heavyweight than the cost of the IPI itself.
> > + */
> > +SYSCALL_DEFINE0(membarrier)
> > +{
> > +	on_each_cpu(membarrier_ipi, NULL, 1);
> 
> Can't you use mm->cpu_vm_mask?

I'll go for PeterZ scheme, which is based on cpu_vm_mask.

Thanks,

Mathieu


> 
> -Andi

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier
  2010-01-07 11:04 ` David Howells
@ 2010-01-07 15:15   ` Mathieu Desnoyers
  2010-01-07 15:47   ` David Howells
  1 sibling, 0 replies; 107+ messages in thread
From: Mathieu Desnoyers @ 2010-01-07 15:15 UTC (permalink / raw)
  To: David Howells
  Cc: linux-kernel, Paul E. McKenney, Ingo Molnar, akpm, josh, tglx,
	peterz, rostedt, Valdis.Kletnieks, laijs, dipankar

* David Howells (dhowells@redhat.com) wrote:
> Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote:
> 
> > The current implementation simply executes a memory barrier in an IPI
> > handler on each active cpu. Going through the hassle of taking run queue
> > locks and checking if the thread running on each online CPU belongs to
> > the current thread seems more heavyweight than the cost of the IPI
> > itself (not measured though).
> 
> There's another way to do this:
> 
>  (1) For each threads you want to execute a memory barrier, mark in its
>      task_struct that you want it to do a memory barrier and set
>      TIF_NOTIFY_RESUME.
> 
>  (2) Interrupt all CPUs.  The interrupt handler doesn't have to do anything.

AFAIK, the smp_mb() is not very costly compared to the IPI. Since your
proposal implies sending an IPI to the remote threads anyway, I don't
see how adding thread flags and extra tests in the return-to-userland
paths will help us... it would just add extra tests and branches that
accomplish exactly nothing.

Or am I missing your point entirely?

Thanks,

Mathieu

> 
>  (3) When any of the threads marked in (1) gain CPU time, do_notify_resume()
>      will be executed, and the do-memory-barrier flag can be tested and if it
>      was set, the flag can be cleared and a memory barrier can be
>      interpolated.
> 
> The current thread will also pass through stage (3) on its way out, if it's
> marked in stage (1).
> 
> David

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier
  2010-01-07 11:04 ` David Howells
  2010-01-07 15:15   ` Mathieu Desnoyers
@ 2010-01-07 15:47   ` David Howells
  1 sibling, 0 replies; 107+ messages in thread
From: David Howells @ 2010-01-07 15:47 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: dhowells, linux-kernel, Paul E. McKenney, Ingo Molnar, akpm,
	josh, tglx, peterz, rostedt, Valdis.Kletnieks, laijs, dipankar

Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote:

> Or am I missing your point entirely ?

No, just a suggestion.

David

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier
  2010-01-07  8:32   ` Peter Zijlstra
@ 2010-01-07 16:39     ` Paul E. McKenney
  0 siblings, 0 replies; 107+ messages in thread
From: Paul E. McKenney @ 2010-01-07 16:39 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mathieu Desnoyers, linux-kernel, Ingo Molnar, akpm, josh, tglx,
	rostedt, Valdis.Kletnieks, dhowells, laijs, dipankar

On Thu, Jan 07, 2010 at 09:32:16AM +0100, Peter Zijlstra wrote:
> On Wed, 2010-01-06 at 21:02 -0800, Paul E. McKenney wrote:
> > 
> > Beats the heck out of user-mode signal handlers!!!  And it is hard
> > to imagine groveling through runqueues ever being a win, even on very
> > large systems.  The only reasonable optimization I can imagine is to
> > turn this into a no-op for a single-threaded process, but there are
> > other ways to do that optimization.
> > 
> > Reviewed-by: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> 
> Then imagine someone doing:
> 
>  while (1)
>   sys_membarrier();
> 
> on your multi node machine, see how happy you are then.

I guess in that situation, I would be feeling no pain.  Or anything else
for that matter.  :-/

So, good point!!!  I stand un-Reviewed-By.

I could imagine throttling the requests, as well as batching them.  If
any CPU does a sys_membarrier() after this CPU's sys_membarrier has
entered the kernel, then this CPU can simply return.  A token-bucket
approach could throttle things nicely, but at some point it becomes
better to just do POSIX signals.
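
For the throttling half of that idea only (the batching part is
subtler), a sketch using the kernel's ratelimit helper; the
100-calls-per-second budget and the -EAGAIN return are arbitrary
illustrations, not a proposal for the actual semantics:

#include <linux/ratelimit.h>

/* Illustrative throttle: cap the IPI-broadcast rate per system. */
static DEFINE_RATELIMIT_STATE(membarrier_rs, HZ, 100);	/* 100 calls/sec */

SYSCALL_DEFINE0(membarrier)
{
	if (!__ratelimit(&membarrier_rs))
		return -EAGAIN;	/* throttled; caller retries or falls back */

	on_each_cpu(membarrier_ipi, NULL, 1);
	return 0;
}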

							Thanx, Paul

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier
  2010-01-07  6:04   ` Mathieu Desnoyers
  2010-01-07  6:32     ` Josh Triplett
@ 2010-01-07 16:46     ` Paul E. McKenney
  1 sibling, 0 replies; 107+ messages in thread
From: Paul E. McKenney @ 2010-01-07 16:46 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Josh Triplett, linux-kernel, Ingo Molnar, akpm, tglx, peterz,
	rostedt, Valdis.Kletnieks, dhowells, laijs, dipankar

On Thu, Jan 07, 2010 at 01:04:39AM -0500, Mathieu Desnoyers wrote:
> * Josh Triplett (josh@joshtriplett.org) wrote:
> > On Wed, Jan 06, 2010 at 11:40:07PM -0500, Mathieu Desnoyers wrote:

[ . . . ]

> > - Have you tested what happens if a process does "while(1)
> >   membarrier();"?  By running on every CPU, including those not owned by
> >   the current process, this has the potential to make DoS easier,
> >   particularly on systems with many CPUs.  That gets even worse if a
> >   process forks multiple threads running that same loop.  Also consider
> >   that executing an IPI will do work even on a CPU currently running a
> >   real-time task.
> 
> Just tried it with a 10,000,000 iterations loop.
> 
> The thread doing the system call loop takes 2.0% of user time, 98% of
> system time. All other cpus are nearly 100.0% idle. Just to give a bit
> more info about my test setup, I also have a thread sitting on a CPU
> busy-waiting for the loop to complete. This thread takes 97.7% user
> time (but it really is just there to make sure we are indeed doing the
> IPIs, not skipping it through the thread_group_empty(current) test). If
> I remove this thread, the execution time of the test program shrinks
> from 32 seconds down to 1.9 seconds. So yes, the IPI is actually
> executed in the first place, because removing the extra thread
> accelerates the loop tremendously. I used a 8-core Xeon to test.

So a single-threaded DoS attack can give you a 17-to-1 slowdown on
other processors.

Does this get worse if more than one CPU is in a tight loop doing
sys_membarrier()?  Or is there some other limit on IPI rate?

> > - Rather than groveling through runqueues, could you somehow remotely
> >   check the value of current?  In theory, a race in doing so wouldn't
> >   matter; finding something other than the current process should mean
> >   you don't need to do a barrier, and finding the current process means
> >   you might need to do a barrier.
> 
> Well, the thing is that sending an IPI to all processors can be done
> very efficiently on a lot of architectures because it uses an IPI
> broadcast. If we have to select a few processors to which we send the
> signal individually, I fear that the solution will scale poorly on
> systems where cpus are densely used by threads belonging to the current
> process.
> 
> So if we go down the route of sending an IPI broadcast as I did, then
> the performance improvement of skipping the smp_mb() for some CPU seems
> insignificant compared to the IPI. In addition, it would require to add
> some preparation code and exchange cache-lines (containing the process
> ID), which would actually slow down the non-parallel portion of the
> system call (to accelerate the parallelizable portion on only some of
> the CPUs).
> 
> So I don't think this would buy us anything. However, if we would have a
> per-process count of the number of threads in the thread group, then
> we could switch to a per-cpu IPI rather than broadcast if we detect that
> we have much fewer threads than CPUs.

My concern would be that we see an old value of the remote CPU's current,
and incorrectly fail to send an IPI.  Then that CPU might have picked
up a reference to the thing that we are trying to free up, which is just
not going to be good!

> > - Part of me thinks this ought to become slightly more general, and just
> >   deliver a signal that the receiving thread could handle as it likes.
> >   However, that would certainly prove more expensive than this, and I
> >   don't know that the generality would buy anything.
> 
> A general scheme would have to call every threads, even those which are
> not running. In the case of this system call, this is a particular case
> where we can forget about non-running threads, because the memory
> barrier is implied by the scheduler activity that brought them offline.
> So I really don't see how we can use this IPI scheme for other things
> that this kind of synchronization.

						Thanx, Paul

> > - Could you somehow register reader threads with the kernel, in a way
> >   that makes them easy to detect remotely?
> 
> There are two ways I figure out we could do this. One would imply adding
> extra shared data between kernel and userspace (which I'd like to avoid,
> to keep coupling low). The other alternative would be to add per
> task_struct information about this, and new system calls. The added per
> task_struct information would use up cache lines (which are very
> important, especially in the task_struct) and the added system call at
> rcu_read_lock/unlock() would simply kill performance.
> 
> Thanks,
> 
> Mathieu
> 
> > 
> > 
> > - Josh Triplett
> 
> -- 
> Mathieu Desnoyers
> OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier
  2010-01-07  5:40 ` Steven Rostedt
  2010-01-07  6:19   ` Mathieu Desnoyers
@ 2010-01-07 16:49   ` Paul E. McKenney
  2010-01-07 17:00     ` Steven Rostedt
  1 sibling, 1 reply; 107+ messages in thread
From: Paul E. McKenney @ 2010-01-07 16:49 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Mathieu Desnoyers, linux-kernel, Ingo Molnar, akpm, josh, tglx,
	peterz, Valdis.Kletnieks, dhowells, laijs, dipankar

On Thu, Jan 07, 2010 at 12:40:54AM -0500, Steven Rostedt wrote:
> On Wed, 2010-01-06 at 23:40 -0500, Mathieu Desnoyers wrote:
> > Here is an implementation of a new system call, sys_membarrier(), which
> > executes a memory barrier on all threads of the current process.
> > 
> > It aims at greatly simplifying and enhancing the current signal-based
> > liburcu userspace RCU synchronize_rcu() implementation.
> > (found at http://lttng.org/urcu)
> > 
> 
> Nice.
> 
> > Both the signal-based and the sys_membarrier userspace RCU schemes
> > permit us to remove the memory barrier from the userspace RCU
> > rcu_read_lock() and rcu_read_unlock() primitives, thus significantly
> > accelerating them. These memory barriers are replaced by compiler
> > barriers on the read-side, and all matching memory barriers on the
> > write-side are turned into an invokation of a memory barrier on all
> > active threads in the process. By letting the kernel perform this
> > synchronization rather than dumbly sending a signal to every process
> > threads (as we currently do), we diminish the number of unnecessary wake
> > ups and only issue the memory barriers on active threads. Non-running
> > threads do not need to execute such barrier anyway, because these are
> > implied by the scheduler context switches.
> > 
> > To explain the benefit of this scheme, let's introduce two example threads:
> > 
> > Thread A (non-frequent, e.g. executing liburcu synchronize_rcu())
> > Thread B (frequent, e.g. executing liburcu rcu_read_lock()/rcu_read_unlock())
> > 
> > In a scheme where all smp_mb() in thread A synchronize_rcu() are
> > ordering memory accesses with respect to smp_mb() present in
> > rcu_read_lock/unlock(), we can change all smp_mb() from
> > synchronize_rcu() into calls to sys_membarrier() and all smp_mb() from
> > rcu_read_lock/unlock() into compiler barriers "barrier()".
> > 
> > Before the change, we had, for each smp_mb() pairs:
> > 
> > Thread A                    Thread B
> > prev mem accesses           prev mem accesses
> > smp_mb()                    smp_mb()
> > follow mem accesses         follow mem accesses
> > 
> > After the change, these pairs become:
> > 
> > Thread A                    Thread B
> > prev mem accesses           prev mem accesses
> > sys_membarrier()            barrier()
> > follow mem accesses         follow mem accesses
> > 
> > As we can see, there are two possible scenarios: either Thread B memory
> > accesses do not happen concurrently with Thread A accesses (1), or they
> > do (2).
> > 
> > 1) Non-concurrent Thread A vs Thread B accesses:
> > 
> > Thread A                    Thread B
> > prev mem accesses
> > sys_membarrier()
> > follow mem accesses
> >                             prev mem accesses
> >                             barrier()
> >                             follow mem accesses
> > 
> > In this case, thread B accesses will be weakly ordered. This is OK,
> > because at that point, thread A is not particularly interested in
> > ordering them with respect to its own accesses.
> > 
> > 2) Concurrent Thread A vs Thread B accesses
> > 
> > Thread A                    Thread B
> > prev mem accesses           prev mem accesses
> > sys_membarrier()            barrier()
> > follow mem accesses         follow mem accesses
> > 
> > In this case, thread B accesses, which are ensured to be in program
> > order thanks to the compiler barrier, will be "upgraded" to full
> > smp_mb() thanks to the IPIs executing memory barriers on each active
> > system threads. Each non-running process threads are intrinsically
> > serialized by the scheduler.
> > 
> > The current implementation simply executes a memory barrier in an IPI
> > handler on each active cpu. Going through the hassle of taking run queue
> > locks and checking if the thread running on each online CPU belongs to
> > the current thread seems more heavyweight than the cost of the IPI
> > itself (not measured though).
> > 
> 
> 
> I don't think you need to grab any locks. Doing an rcu_read_lock()
> should prevent tasks from disappearing (since destruction of tasks use
> RCU). You may still need to grab the tasklist_lock under read_lock().
> 
> So what you could do, is find each task that is a thread of the calling
> task, and then just check task_rq(task)->curr != task. Just send the
> IPI's to those tasks that pass the test.
> 
> If the task->rq changes, or the task->rq->curr changes, and makes the
> condition fail (or even pass), the events that cause those changes are
> probably good enough than needing to call smp_mb();

This narrows the fatal window, but does not eliminate it.  :-(

The CPU doing the sys_membarrier() might see an old value of ->curr,
and the other CPU might see an old value of whatever pointer we are
trying to recycle.  This combination is fatal.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier
  2010-01-07  8:44       ` Peter Zijlstra
  2010-01-07 13:15         ` Steven Rostedt
  2010-01-07 15:07         ` Mathieu Desnoyers
@ 2010-01-07 16:52         ` Paul E. McKenney
  2010-01-07 17:18           ` Peter Zijlstra
  2 siblings, 1 reply; 107+ messages in thread
From: Paul E. McKenney @ 2010-01-07 16:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Josh Triplett, Mathieu Desnoyers, Steven Rostedt, linux-kernel,
	Ingo Molnar, akpm, tglx, Valdis.Kletnieks, dhowells, laijs,
	dipankar

On Thu, Jan 07, 2010 at 09:44:15AM +0100, Peter Zijlstra wrote:
> On Wed, 2010-01-06 at 22:35 -0800, Josh Triplett wrote:
> > 
> > The number of threads doesn't matter nearly as much as the number of
> > threads typically running at a time compared to the number of
> > processors.  Of course, we can't measure that as easily, but I don't
> > know that your proposed heuristic would approximate it well.
> 
> Quite agreed, and not disturbing RT tasks is even more important.

OK, so I stand un-Reviewed-by twice in one morning.  ;-)

> A simple:
> 
>   for_each_cpu(cpu, current->mm->cpu_vm_mask) {
>      if (cpu_curr(cpu)->mm == current->mm)
>         smp_call_function_single(cpu, func, NULL, 1);
>   }
> 
> seems far preferable over anything else, if you really want you can use
> a cpumask to copy cpu_vm_mask in and unset bits and use the mask with
> smp_call_function_any(), but that includes having to allocate the
> cpumask, which might or might not be too expensive for Mathieu.

This would be vulnerable to the sys_membarrier() CPU seeing an old value
of cpu_curr(cpu)->mm, and that other task seeing the old value of the
pointer we are trying to RCU-destroy, right?

							Thanx, Paul

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier
  2010-01-07  9:50 ` Andi Kleen
  2010-01-07 15:12   ` Mathieu Desnoyers
@ 2010-01-07 16:56   ` Paul E. McKenney
  1 sibling, 0 replies; 107+ messages in thread
From: Paul E. McKenney @ 2010-01-07 16:56 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Mathieu Desnoyers, linux-kernel, Ingo Molnar, akpm, josh, tglx,
	peterz, rostedt, Valdis.Kletnieks, dhowells, laijs, dipankar

On Thu, Jan 07, 2010 at 10:50:26AM +0100, Andi Kleen wrote:
> Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> writes:
> 
> > Both the signal-based and the sys_membarrier userspace RCU schemes
> > permit us to remove the memory barrier from the userspace RCU
> > rcu_read_lock() and rcu_read_unlock() primitives, thus significantly
> > accelerating them. These memory barriers are replaced by compiler
> > barriers on the read-side, and all matching memory barriers on the
> > write-side are turned into an invokation of a memory barrier on all
> > active threads in the process. By letting the kernel perform this
> > synchronization rather than dumbly sending a signal to every process
> > threads (as we currently do), we diminish the number of unnecessary wake
> > ups and only issue the memory barriers on active threads. Non-running
> > threads do not need to execute such barrier anyway, because these are
> > implied by the scheduler context switches.
> 
> I'm not sure all this effort is really needed on architectures
> with strong memory ordering.

Even CPUs with strong memory ordering allow later reads to complete
prior to earlier writes, which is enough to cause problems.

That said, some of the lighter-weight schemes sampling ->mm might be
safe on TSO machines.

> > + * The current implementation simply executes a memory barrier in an IPI handler
> > + * on each active cpu. Going through the hassle of taking run queue locks and
> > + * checking if the thread running on each online CPU belongs to the current
> > + * thread seems more heavyweight than the cost of the IPI itself.
> > + */
> > +SYSCALL_DEFINE0(membarrier)
> > +{
> > +	on_each_cpu(membarrier_ipi, NULL, 1);
> 
> Can't you use mm->cpu_vm_mask?

Hmmm...  Acquiring the corresponding lock would certainly make this
safe.  Not sure about lock-less access to it, though.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier
  2010-01-07 16:49   ` Paul E. McKenney
@ 2010-01-07 17:00     ` Steven Rostedt
  0 siblings, 0 replies; 107+ messages in thread
From: Steven Rostedt @ 2010-01-07 17:00 UTC (permalink / raw)
  To: paulmck
  Cc: Mathieu Desnoyers, linux-kernel, Ingo Molnar, akpm, josh, tglx,
	peterz, Valdis.Kletnieks, dhowells, laijs, dipankar

On Thu, 2010-01-07 at 08:49 -0800, Paul E. McKenney wrote:

> > If the task->rq changes, or the task->rq->curr changes, and makes the
> > condition fail (or even pass), the events that cause those changes are
> > probably good enough than needing to call smp_mb();
> 
> This narrows the fatal window, but does not eliminate it.  :-(
> 
> The CPU doing the sys_membarrier() might see an old value of ->curr,
> and the other CPU might see an old value of whatever pointer we are
> trying to recycle.  This combination is fatal.

But for curr to change, the rq spin lock must have been held, which
implies a smp_wmb(). I would think that doing a smp_rmb() would then
guarantee that you see the new value of curr?

-- Steve



^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier
  2010-01-07 16:52         ` Paul E. McKenney
@ 2010-01-07 17:18           ` Peter Zijlstra
  2010-01-07 17:31             ` Paul E. McKenney
  2010-01-07 17:36             ` Mathieu Desnoyers
  0 siblings, 2 replies; 107+ messages in thread
From: Peter Zijlstra @ 2010-01-07 17:18 UTC (permalink / raw)
  To: paulmck
  Cc: Josh Triplett, Mathieu Desnoyers, Steven Rostedt, linux-kernel,
	Ingo Molnar, akpm, tglx, Valdis.Kletnieks, dhowells, laijs,
	dipankar

On Thu, 2010-01-07 at 08:52 -0800, Paul E. McKenney wrote:
> On Thu, Jan 07, 2010 at 09:44:15AM +0100, Peter Zijlstra wrote:
> > On Wed, 2010-01-06 at 22:35 -0800, Josh Triplett wrote:
> > > 
> > > The number of threads doesn't matter nearly as much as the number of
> > > threads typically running at a time compared to the number of
> > > processors.  Of course, we can't measure that as easily, but I don't
> > > know that your proposed heuristic would approximate it well.
> > 
> > Quite agreed, and not disturbing RT tasks is even more important.
> 
> OK, so I stand un-Reviewed-by twice in one morning.  ;-)
> 
> > A simple:
> > 
> >   for_each_cpu(cpu, current->mm->cpu_vm_mask) {
> >      if (cpu_curr(cpu)->mm == current->mm)
> >         smp_call_function_single(cpu, func, NULL, 1);
> >   }
> > 
> > seems far preferable over anything else, if you really want you can use
> > a cpumask to copy cpu_vm_mask in and unset bits and use the mask with
> > smp_call_function_any(), but that includes having to allocate the
> > cpumask, which might or might not be too expensive for Mathieu.
> 
> This would be vulnerable to the sys_membarrier() CPU seeing an old value
> of cpu_curr(cpu)->mm, and that other task seeing the old value of the
> pointer we are trying to RCU-destroy, right?

Right, so I was thinking along these lines, since you want an mb to be
executed when calling sys_membarrier(): if you observe a matching ->mm
but the cpu has since scheduled, we're good since it scheduled (but
we'll still send the IPI anyway); if we do not observe it because the
task gets scheduled in after we do the iteration, we're still good
because it scheduled.

As to needing to keep rcu_read_lock() around the iteration, for sure we
need that to ensure the remote task_struct reference we take is valid.


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier
  2010-01-07 17:18           ` Peter Zijlstra
@ 2010-01-07 17:31             ` Paul E. McKenney
  2010-01-07 17:44               ` Mathieu Desnoyers
  2010-01-07 17:44               ` Steven Rostedt
  2010-01-07 17:36             ` Mathieu Desnoyers
  1 sibling, 2 replies; 107+ messages in thread
From: Paul E. McKenney @ 2010-01-07 17:31 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Josh Triplett, Mathieu Desnoyers, Steven Rostedt, linux-kernel,
	Ingo Molnar, akpm, tglx, Valdis.Kletnieks, dhowells, laijs,
	dipankar

On Thu, Jan 07, 2010 at 06:18:36PM +0100, Peter Zijlstra wrote:
> On Thu, 2010-01-07 at 08:52 -0800, Paul E. McKenney wrote:
> > On Thu, Jan 07, 2010 at 09:44:15AM +0100, Peter Zijlstra wrote:
> > > On Wed, 2010-01-06 at 22:35 -0800, Josh Triplett wrote:
> > > > 
> > > > The number of threads doesn't matter nearly as much as the number of
> > > > threads typically running at a time compared to the number of
> > > > processors.  Of course, we can't measure that as easily, but I don't
> > > > know that your proposed heuristic would approximate it well.
> > > 
> > > Quite agreed, and not disturbing RT tasks is even more important.
> > 
> > OK, so I stand un-Reviewed-by twice in one morning.  ;-)
> > 
> > > A simple:
> > > 
> > >   for_each_cpu(cpu, current->mm->cpu_vm_mask) {
> > >      if (cpu_curr(cpu)->mm == current->mm)
> > >         smp_call_function_single(cpu, func, NULL, 1);
> > >   }
> > > 
> > > seems far preferable over anything else, if you really want you can use
> > > a cpumask to copy cpu_vm_mask in and unset bits and use the mask with
> > > smp_call_function_any(), but that includes having to allocate the
> > > cpumask, which might or might not be too expensive for Mathieu.
> > 
> > This would be vulnerable to the sys_membarrier() CPU seeing an old value
> > of cpu_curr(cpu)->mm, and that other task seeing the old value of the
> > pointer we are trying to RCU-destroy, right?
> 
> Right, so I was thinking that since you want a mb to be executed when
> calling sys_membarrier(). If you observe a matching ->mm but the cpu has
> since scheduled, we're good since it scheduled (but we'll still send the
> IPI anyway), if we do not observe it because the task gets scheduled in
> after we do the iteration we're still good because it scheduled.

Something like the following for sys_membarrier(), then?

  smp_mb();
  for_each_cpu(cpu, current->mm->cpu_vm_mask) {
     if (cpu_curr(cpu)->mm == current->mm)
        smp_call_function_single(cpu, func, NULL, 1);
  }

Then the code changing ->mm on the other CPU also needs to have a
full smp_mb() somewhere after the change to ->mm, but before starting
user-space execution.  It might well have one anyway just due to
overhead, but we need to make sure that someone doesn't optimize us out
of existence.
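
Putting the pieces of this exchange together, a sketch of how the
pairing might look (illustrative only; mm_cpumask() is the accessor for
->cpu_vm_mask, and whether the ->mm-switch path really provides the
matching smp_mb() is exactly the open question here):

SYSCALL_DEFINE0(membarrier)
{
	int cpu;

	/*
	 * Pairs either with the smp_mb() in membarrier_ipi() on the CPUs
	 * we do IPI, or with the barriers the scheduler is assumed to
	 * execute around the ->mm switch on the CPUs we skip.
	 */
	smp_mb();

	rcu_read_lock();	/* keeps the remote task_structs alive */
	for_each_cpu(cpu, mm_cpumask(current->mm)) {
		if (cpu_curr(cpu)->mm == current->mm)
			smp_call_function_single(cpu, membarrier_ipi,
						 NULL, 1);
	}
	rcu_read_unlock();

	return 0;
}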

							Thanx, Paul

> As to needing to keep rcu_read_lock() around the iteration, for sure we
> need that to ensure the remote task_struct reference we take is valid.
> 

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier
  2010-01-07 17:18           ` Peter Zijlstra
  2010-01-07 17:31             ` Paul E. McKenney
@ 2010-01-07 17:36             ` Mathieu Desnoyers
  1 sibling, 0 replies; 107+ messages in thread
From: Mathieu Desnoyers @ 2010-01-07 17:36 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: paulmck, Josh Triplett, Steven Rostedt, linux-kernel,
	Ingo Molnar, akpm, tglx, Valdis.Kletnieks, dhowells, laijs,
	dipankar

* Peter Zijlstra (peterz@infradead.org) wrote:
> On Thu, 2010-01-07 at 08:52 -0800, Paul E. McKenney wrote:
> > On Thu, Jan 07, 2010 at 09:44:15AM +0100, Peter Zijlstra wrote:
> > > On Wed, 2010-01-06 at 22:35 -0800, Josh Triplett wrote:
> > > > 
> > > > The number of threads doesn't matter nearly as much as the number of
> > > > threads typically running at a time compared to the number of
> > > > processors.  Of course, we can't measure that as easily, but I don't
> > > > know that your proposed heuristic would approximate it well.
> > > 
> > > Quite agreed, and not disturbing RT tasks is even more important.
> > 
> > OK, so I stand un-Reviewed-by twice in one morning.  ;-)
> > 
> > > A simple:
> > > 
> > >   for_each_cpu(cpu, current->mm->cpu_vm_mask) {
> > >      if (cpu_curr(cpu)->mm == current->mm)
> > >         smp_call_function_single(cpu, func, NULL, 1);
> > >   }
> > > 
> > > seems far preferable over anything else, if you really want you can use
> > > a cpumask to copy cpu_vm_mask in and unset bits and use the mask with
> > > smp_call_function_any(), but that includes having to allocate the
> > > cpumask, which might or might not be too expensive for Mathieu.
> > 
> > This would be vulnerable to the sys_membarrier() CPU seeing an old value
> > of cpu_curr(cpu)->mm, and that other task seeing the old value of the
> > pointer we are trying to RCU-destroy, right?
> 
> Right, so I was thinking that since you want a mb to be executed when
> calling sys_membarrier(). If you observe a matching ->mm but the cpu has
> since scheduled, we're good since it scheduled (but we'll still send the
> IPI anyway), if we do not observe it because the task gets scheduled in
> after we do the iteration we're still good because it scheduled.

This deals with the case where the remote thread is being scheduled out.

As I understand it, if the thread is being scheduled in exactly while
we read cpu_curr(cpu)->mm, this means that we are executing
concurrently with the scheduler code. I expect that the scheduler will
issue a smp_mb() before handing the CPU over to the incoming thread, am
I correct? If so, then we can assume that reading any value of
cpu_curr(cpu)->mm that does not match that of our own process means
that the value is either:

- corresponding to a thread belonging to another process (no IPI
  needed), or
- corresponding to a thread being scheduled in/out -> no IPI needed,
  but it does not hurt to send one. In this case, it does not matter if
  the racy read even returns pure garbage, as it's really a "don't
  care".

Even if the read returned garbage for some weird reason (a piecewise
read, maybe?), that would not hurt, because the IPI is just "not
needed" at that point, since we rely on the concurrently executing
scheduler to issue the smp_mb().

> 
> As to needing to keep rcu_read_lock() around the iteration, for sure we
> need that to ensure the remote task_struct reference we take is valid.
> 

Indeed,

Thanks,

Mathieu


-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier
  2010-01-07 17:31             ` Paul E. McKenney
@ 2010-01-07 17:44               ` Mathieu Desnoyers
  2010-01-07 17:55                 ` Paul E. McKenney
  2010-01-07 17:44               ` Steven Rostedt
  1 sibling, 1 reply; 107+ messages in thread
From: Mathieu Desnoyers @ 2010-01-07 17:44 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Peter Zijlstra, Josh Triplett, Steven Rostedt, linux-kernel,
	Ingo Molnar, akpm, tglx, Valdis.Kletnieks, dhowells, laijs,
	dipankar

* Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> On Thu, Jan 07, 2010 at 06:18:36PM +0100, Peter Zijlstra wrote:
> > On Thu, 2010-01-07 at 08:52 -0800, Paul E. McKenney wrote:
> > > On Thu, Jan 07, 2010 at 09:44:15AM +0100, Peter Zijlstra wrote:
> > > > On Wed, 2010-01-06 at 22:35 -0800, Josh Triplett wrote:
> > > > > 
> > > > > The number of threads doesn't matter nearly as much as the number of
> > > > > threads typically running at a time compared to the number of
> > > > > processors.  Of course, we can't measure that as easily, but I don't
> > > > > know that your proposed heuristic would approximate it well.
> > > > 
> > > > Quite agreed, and not disturbing RT tasks is even more important.
> > > 
> > > OK, so I stand un-Reviewed-by twice in one morning.  ;-)
> > > 
> > > > A simple:
> > > > 
> > > >   for_each_cpu(cpu, current->mm->cpu_vm_mask) {
> > > >      if (cpu_curr(cpu)->mm == current->mm)
> > > >         smp_call_function_single(cpu, func, NULL, 1);
> > > >   }
> > > > 
> > > > seems far preferable over anything else, if you really want you can use
> > > > a cpumask to copy cpu_vm_mask in and unset bits and use the mask with
> > > > smp_call_function_any(), but that includes having to allocate the
> > > > cpumask, which might or might not be too expensive for Mathieu.
> > > 
> > > This would be vulnerable to the sys_membarrier() CPU seeing an old value
> > > of cpu_curr(cpu)->mm, and that other task seeing the old value of the
> > > pointer we are trying to RCU-destroy, right?
> > 
> > Right, so I was thinking that since you want a mb to be executed when
> > calling sys_membarrier(). If you observe a matching ->mm but the cpu has
> > since scheduled, we're good since it scheduled (but we'll still send the
> > IPI anyway), if we do not observe it because the task gets scheduled in
> > after we do the iteration we're still good because it scheduled.
> 
> Something like the following for sys_membarrier(), then?
> 
>   smp_mb();

This smp_mb() is redundant, as the for_each_cpu loop already issues one
on the local CPU (the local call executes the barrier directly).

>   for_each_cpu(cpu, current->mm->cpu_vm_mask) {
>      if (cpu_curr(cpu)->mm == current->mm)
>         smp_call_function_single(cpu, func, NULL, 1);
>   }
> 
> Then the code changing ->mm on the other CPU also needs to have a
> full smp_mb() somewhere after the change to ->mm, but before starting
> user-space execution.  Which it might well just due to overhead, but
> we need to make sure that someone doesn't optimize us out of existence.

I believe we also need one between the userspace task's execution and
the change to ->mm. If we have these guarantees, I think we are fine.

Mathieu

> 
> 							Thanx, Paul
> 
> > As to needing to keep rcu_read_lock() around the iteration, for sure we
> > need that to ensure the remote task_struct reference we take is valid.
> > 

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier
  2010-01-07 17:31             ` Paul E. McKenney
  2010-01-07 17:44               ` Mathieu Desnoyers
@ 2010-01-07 17:44               ` Steven Rostedt
  2010-01-07 17:56                 ` Paul E. McKenney
  1 sibling, 1 reply; 107+ messages in thread
From: Steven Rostedt @ 2010-01-07 17:44 UTC (permalink / raw)
  To: paulmck
  Cc: Peter Zijlstra, Josh Triplett, Mathieu Desnoyers, linux-kernel,
	Ingo Molnar, akpm, tglx, Valdis.Kletnieks, dhowells, laijs,
	dipankar

On Thu, 2010-01-07 at 09:31 -0800, Paul E. McKenney wrote:

> Something like the following for sys_membarrier(), then?
> 
>   smp_mb();
>   for_each_cpu(cpu, current->mm->cpu_vm_mask) {
>      if (cpu_curr(cpu)->mm == current->mm)
>         smp_call_function_single(cpu, func, NULL, 1);
>   }
> 
> Then the code changing ->mm on the other CPU also needs to have a
> full smp_mb() somewhere after the change to ->mm, but before starting
> user-space execution.  Which it might well just due to overhead, but
> we need to make sure that someone doesn't optimize us out of existence.

Changing the mm requires things like flushing the TLB. I'd be surprised
if the mm switch does not already do a smp_mb() somewhere.

-- Steve



^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier
  2010-01-07  6:32     ` Josh Triplett
@ 2010-01-07 17:45       ` Mathieu Desnoyers
  0 siblings, 0 replies; 107+ messages in thread
From: Mathieu Desnoyers @ 2010-01-07 17:45 UTC (permalink / raw)
  To: Josh Triplett
  Cc: linux-kernel, Paul E. McKenney, Ingo Molnar, akpm, tglx, peterz,
	rostedt, Valdis.Kletnieks, dhowells, laijs, dipankar

* Josh Triplett (josh@joshtriplett.org) wrote:
> On Thu, Jan 07, 2010 at 01:04:39AM -0500, Mathieu Desnoyers wrote:
[...]
> > Just tried it with a 10,000,000 iterations loop.
> > 
> > The thread doing the system call loop takes 2.0% of user time, 98% of
> > system time. All other cpus are nearly 100.0% idle. Just to give a bit
> > more info about my test setup, I also have a thread sitting on a CPU
> > busy-waiting for the loop to complete. This thread takes 97.7% user
> > time (but it really is just there to make sure we are indeed doing the
> > IPIs, not skipping it through the thread_group_empty(current) test). If
> > I remove this thread, the execution time of the test program shrinks
> > from 32 seconds down to 1.9 seconds. So yes, the IPI is actually
> > executed in the first place, because removing the extra thread
> > accelerates the loop tremendously. I used a 8-core Xeon to test.
> 
> Do you know if the kernel properly measures the overhead of IPIs?  The
> CPUs might have only looked idle.  What about running some kind of
> CPU-bound benchmark on the other CPUs and testing the completion time
> with and without the process running the membarrier loop?

Good point. Just tried with a cache-hot kernel compilation using 6/8 CPUs.

Normally:                                              real 2m41.852s
With the sys_membarrier+1 busy-looping thread running: real 5m41.830s

So... the unrelated processes become 2x slower. That hurts.

So let's try allocating a cpu mask for PeterZ's scheme. I prefer to have a
small allocation overhead and benefit from cpumask broadcast if
possible so we scale better. But that all depends on how big the
allocation overhead is.

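Roughly, the "alloc cpumask+local mb()+IPI-many" variant timed below looks
like this (a sketch only; the syscall wiring, error handling and exact
fallback are simplified assumptions, not the posted patch):

  static void membarrier_ipi(void *unused)
  {
          smp_mb();       /* execute the memory barrier on the remote CPU */
  }

  SYSCALL_DEFINE0(membarrier)
  {
          cpumask_var_t tmpmask;

          if (!alloc_cpumask_var(&tmpmask, GFP_KERNEL))
                  return -ENOMEM;
          preempt_disable();
          cpumask_copy(tmpmask, mm_cpumask(current->mm)); /* ->cpu_vm_mask */
          cpumask_clear_cpu(smp_processor_id(), tmpmask);
          smp_mb();       /* the "local mb()" in the measurements below */
          smp_call_function_many(tmpmask, membarrier_ipi, NULL, 1);
          preempt_enable();
          free_cpumask_var(tmpmask);
          return 0;
  }
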
Impact of allocating a cpumask (time for 10,000,000 sys_membarrier
calls, one thread is doing the sys_membarrier, the others are busy
looping):

IPI to all:                                            real 0m44.708s
alloc cpumask+local mb()+IPI-many to 1 thread:         real 1m2.034s

So, roughly, the cpumask allocation overhead is 17s here, not exactly
cheap. So let's see when it becomes better than single IPIs:

local mb()+single IPI to 1 thread:                     real 0m29.502s
local mb()+single IPI to 7 threads:                    real 2m30.971s

So, roughly, the single IPI overhead is 120s here for 6 more threads,
for 20s per thread.

Here is what we can do: given that the cpumask allocation costs almost
half as much as sending a single IPI, as we iterate on the CPUs, for,
say, the first N CPUs (ourself and the CPUs that need to have an IPI
sent), we send single IPIs. This amounts to N-1 IPIs and a local
function call. If we need more than that, then we switch to the cpumask
allocation and send a broadcast IPI to the cpumask we construct for the
rest of the CPUs. Let's call it the "adaptive IPI scheme".

For my Intel Xeon E5405:

Just doing local mb()+single IPI to T other threads:

T=1: 0m29.219s
T=2: 0m46.310s
T=3: 1m10.172s
T=4: 1m24.822s
T=5: 1m43.205s
T=6: 2m15.405s
T=7: 2m31.207s

Just doing cpumask alloc+IPI-many to T other threads:

T=1: 0m39.605s
T=2: 0m48.566s
T=3: 0m50.167s
T=4: 0m57.896s
T=5: 0m56.411s
T=6: 1m0.536s
T=7: 1m12.532s

So I think the right threshold should be around 2 threads (assuming
other architectures behave like mine). So starting at 3 threads, we
allocate the cpumask and send IPIs.

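A rough sketch of this adaptive scheme, building on the cpumask variant
above (hypothetical; the threshold constant and the broadcast helper are
placeholders for illustration):

  #define ADAPT_IPI_THRESHOLD 2   /* cutoff suggested by the numbers above */

  /* inside the (hypothetical) sys_membarrier() body, preemption disabled */
  int cpu, nr_targets = 0, targets[ADAPT_IPI_THRESHOLD];

  for_each_cpu(cpu, mm_cpumask(current->mm)) {
          if (cpu == smp_processor_id())
                  continue;
          if (nr_targets == ADAPT_IPI_THRESHOLD) {
                  nr_targets++;           /* too many: switch to broadcast */
                  break;
          }
          targets[nr_targets++] = cpu;
  }
  smp_mb();                               /* local barrier in all cases */
  if (nr_targets <= ADAPT_IPI_THRESHOLD) {
          int i;

          for (i = 0; i < nr_targets; i++)
                  smp_call_function_single(targets[i], membarrier_ipi,
                                           NULL, 1);
  } else {
          /* allocate a cpumask and IPI-broadcast, as in the sketch above */
          membarrier_ipi_broadcast();     /* hypothetical helper */
  }
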
How does that sound ?

[...]

> 
> > > - Part of me thinks this ought to become slightly more general, and just
> > >   deliver a signal that the receiving thread could handle as it likes.
> > >   However, that would certainly prove more expensive than this, and I
> > >   don't know that the generality would buy anything.
> > 
> > A general scheme would have to call every thread, even those which are
> > not running. In the case of this system call, this is a particular case
> > where we can forget about non-running threads, because the memory
> > barrier is implied by the scheduler activity that brought them offline.
> > So I really don't see how we can use this IPI scheme for other things
> > than this kind of synchronization.
> 
> No, I don't mean non-running threads.  If you wanted that, you could do
> what urcu currently does, and send a signal to all threads.  I meant
> something like "signal all *running* threads from my process".

Well, if you find me a real-life use-case, then we can surely look into
that ;)

> 
> > > - Could you somehow register reader threads with the kernel, in a way
> > >   that makes them easy to detect remotely?
> > 
> > There are two ways I figure out we could do this. One would imply adding
> > extra shared data between kernel and userspace (which I'd like to avoid,
> > to keep coupling low). The other alternative would be to add per
> > task_struct information about this, and new system calls. The added per
> > task_struct information would use up cache lines (which are very
> > important, especially in the task_struct) and the added system call at
> > rcu_read_lock/unlock() would simply kill performance.
> 
> No, I didn't mean that you would do a syscall in rcu_read_{lock,unlock}.
> I meant that you would do a system call when the reader threads start,
> saying "hey, reader thread here".

Hrm, we need to inform the userspace RCU library that this thread is
present too. So I don't see how going through the kernel helps us there.

Thanks,

Mathieu

> 
> - Josh Triplett

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier
  2010-01-07 17:44               ` Mathieu Desnoyers
@ 2010-01-07 17:55                 ` Paul E. McKenney
  0 siblings, 0 replies; 107+ messages in thread
From: Paul E. McKenney @ 2010-01-07 17:55 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Peter Zijlstra, Josh Triplett, Steven Rostedt, linux-kernel,
	Ingo Molnar, akpm, tglx, Valdis.Kletnieks, dhowells, laijs,
	dipankar

On Thu, Jan 07, 2010 at 12:44:35PM -0500, Mathieu Desnoyers wrote:
> * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > On Thu, Jan 07, 2010 at 06:18:36PM +0100, Peter Zijlstra wrote:
> > > On Thu, 2010-01-07 at 08:52 -0800, Paul E. McKenney wrote:
> > > > On Thu, Jan 07, 2010 at 09:44:15AM +0100, Peter Zijlstra wrote:
> > > > > On Wed, 2010-01-06 at 22:35 -0800, Josh Triplett wrote:
> > > > > > 
> > > > > > The number of threads doesn't matter nearly as much as the number of
> > > > > > threads typically running at a time compared to the number of
> > > > > > processors.  Of course, we can't measure that as easily, but I don't
> > > > > > know that your proposed heuristic would approximate it well.
> > > > > 
> > > > > Quite agreed, and not disturbing RT tasks is even more important.
> > > > 
> > > > OK, so I stand un-Reviewed-by twice in one morning.  ;-)
> > > > 
> > > > > A simple:
> > > > > 
> > > > >   for_each_cpu(cpu, current->mm->cpu_vm_mask) {
> > > > >      if (cpu_curr(cpu)->mm == current->mm)
> > > > >         smp_call_function_single(cpu, func, NULL, 1);
> > > > >   }
> > > > > 
> > > > > seems far preferable over anything else, if you really want you can use
> > > > > a cpumask to copy cpu_vm_mask in and unset bits and use the mask with
> > > > > smp_call_function_any(), but that includes having to allocate the
> > > > > cpumask, which might or might not be too expensive for Mathieu.
> > > > 
> > > > This would be vulnerable to the sys_membarrier() CPU seeing an old value
> > > > of cpu_curr(cpu)->mm, and that other task seeing the old value of the
> > > > pointer we are trying to RCU-destroy, right?
> > > 
> > > Right, so I was thinking that since you want a mb to be executed when
> > > calling sys_membarrier(). If you observe a matching ->mm but the cpu has
> > > since scheduled, we're good since it scheduled (but we'll still send the
> > > IPI anyway), if we do not observe it because the task gets scheduled in
> > > after we do the iteration we're still good because it scheduled.
> > 
> > Something like the following for sys_membarrier(), then?
> > 
> >   smp_mb();
> 
> This smp_mb() is redundant, as we issue it through the for_each_cpu loop
> on the local CPU already.

But we need to do the smp_mb() -before- checking the first cpu_curr(cpu)->mm.

> >   for_each_cpu(cpu, current->mm->cpu_vm_mask) {
> >      if (cpu_curr(cpu)->mm == current->mm)
> >         smp_call_function_single(cpu, func, NULL, 1);
> >   }
> > 
> > Then the code changing ->mm on the other CPU also needs to have a
> > full smp_mb() somewhere after the change to ->mm, but before starting
> > user-space execution.  Which it might well do just due to overhead, but
> > we need to make sure that someone doesn't optimize us out of existence.
> 
> I believe we also need one between execution of the userspace task and
> change to ->mm. If we have these guarantees I think we are fine.

Agreed, in case an outgoing RCU read-side critical section does a store
into an RCU-protected data structure.  Unconventional, but definitely
permitted.
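
Putting these requirements together, the pairing under discussion looks
roughly like this (a sketch of the constraint only; the context-switch
part is a requirement on the scheduler, not code to add verbatim):

  /* sys_membarrier() side: */
  smp_mb();       /* order caller's prior accesses before reading any ->mm */
  for_each_cpu(cpu, current->mm->cpu_vm_mask) {
          if (cpu_curr(cpu)->mm == current->mm)
                  smp_call_function_single(cpu, membarrier_ipi, NULL, 1);
  }

  /*
   * Context-switch side (requirement):
   *  - a full smp_mb() between the last user-space instruction of the
   *    outgoing task and the update of ->mm / rq->curr;
   *  - a full smp_mb() between that update and the first user-space
   *    instruction of the incoming task.
   */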

							Thanx, Paul

> Mathieu
> 
> > 
> > 							Thanx, Paul
> > 
> > > As to needing to keep rcu_read_lock() around the iteration, for sure we
> > > need that to ensure the remote task_struct reference we take is valid.
> > > 
> 
> -- 
> Mathieu Desnoyers
> OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier
  2010-01-07 17:44               ` Steven Rostedt
@ 2010-01-07 17:56                 ` Paul E. McKenney
  2010-01-07 18:04                   ` Steven Rostedt
  0 siblings, 1 reply; 107+ messages in thread
From: Paul E. McKenney @ 2010-01-07 17:56 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Peter Zijlstra, Josh Triplett, Mathieu Desnoyers, linux-kernel,
	Ingo Molnar, akpm, tglx, Valdis.Kletnieks, dhowells, laijs,
	dipankar

On Thu, Jan 07, 2010 at 12:44:37PM -0500, Steven Rostedt wrote:
> On Thu, 2010-01-07 at 09:31 -0800, Paul E. McKenney wrote:
> 
> > Something like the following for sys_membarrier(), then?
> > 
> >   smp_mb();
> >   for_each_cpu(cpu, current->mm->cpu_vm_mask) {
> >      if (cpu_curr(cpu)->mm == current->mm)
> >         smp_call_function_single(cpu, func, NULL, 1);
> >   }
> > 
> > Then the code changing ->mm on the other CPU also needs to have a
> > full smp_mb() somewhere after the change to ->mm, but before starting
> > user-space execution.  Which it might well do just due to overhead, but
> > we need to make sure that someone doesn't optimize us out of existence.
> 
> To change the mm requires things like flushing the TLB. I'd be surprised
> if the change of the mm does not already do a smp_mb() somewhere.

Agreed, but "somewhere" does not fill me with warm fuzzies.  ;-)

							Thanx, Paul

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier
  2010-01-07 17:56                 ` Paul E. McKenney
@ 2010-01-07 18:04                   ` Steven Rostedt
  2010-01-07 18:40                     ` Paul E. McKenney
  0 siblings, 1 reply; 107+ messages in thread
From: Steven Rostedt @ 2010-01-07 18:04 UTC (permalink / raw)
  To: paulmck
  Cc: Peter Zijlstra, Josh Triplett, Mathieu Desnoyers, linux-kernel,
	Ingo Molnar, akpm, tglx, Valdis.Kletnieks, dhowells, laijs,
	dipankar

On Thu, 2010-01-07 at 09:56 -0800, Paul E. McKenney wrote:
> On Thu, Jan 07, 2010 at 12:44:37PM -0500, Steven Rostedt wrote:
> > On Thu, 2010-01-07 at 09:31 -0800, Paul E. McKenney wrote:
> > 
> > > Something like the following for sys_membarrier(), then?
> > > 
> > >   smp_mb();
> > >   for_each_cpu(cpu, current->mm->cpu_vm_mask) {
> > >      if (cpu_curr(cpu)->mm == current->mm)
> > >         smp_call_function_single(cpu, func, NULL, 1);
> > >   }
> > > 
> > > Then the code changing ->mm on the other CPU also needs to have a
> > > full smp_mb() somewhere after the change to ->mm, but before starting
> > > user-space execution.  Which it might well do just due to overhead, but
> > > we need to make sure that someone doesn't optimize us out of existence.
> > 
> > To change the mm requires things like flushing the TLB. I'd be surprised
> > if the change of the mm does not already do a smp_mb() somewhere.
> 
> Agreed, but "somewhere" does not fill me with warm fuzzies.  ;-)

Another question would be, does flushing the TLB imply a mb()?

-- Steve



^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier
  2010-01-07  8:27 ` Peter Zijlstra
@ 2010-01-07 18:30   ` Oleg Nesterov
  2010-01-07 18:39     ` Paul E. McKenney
  0 siblings, 1 reply; 107+ messages in thread
From: Oleg Nesterov @ 2010-01-07 18:30 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mathieu Desnoyers, linux-kernel, Paul E. McKenney, Ingo Molnar,
	akpm, josh, tglx, rostedt, Valdis.Kletnieks, dhowells, laijs,
	dipankar

On 01/07, Peter Zijlstra wrote:
>
> On Wed, 2010-01-06 at 23:40 -0500, Mathieu Desnoyers wrote:
>
> http://marc.info/?t=126283939400002
>
> > Index: linux-2.6-lttng/kernel/sched.c
> > ===================================================================
> > --- linux-2.6-lttng.orig/kernel/sched.c	2010-01-06 22:11:32.000000000 -0500
> > +++ linux-2.6-lttng/kernel/sched.c	2010-01-06 23:20:42.000000000 -0500
> > @@ -10822,6 +10822,36 @@ struct cgroup_subsys cpuacct_subsys = {
> >  };
> >  #endif	/* CONFIG_CGROUP_CPUACCT */
> >
> > +/*
> > + * Execute a memory barrier on all CPUs on SMP systems.
> > + * Do not rely on implicit barriers in smp_call_function(), just in case they
> > + * are ever relaxed in the future.
> > + */
> > +static void membarrier_ipi(void *unused)
> > +{
> > +	smp_mb();
> > +}
> > +
>
> Also, there was some talk a while ago about IPIs implying memory
> barriers. Which I of course forgot all details about,.. at least sending
> one implies a wmb and receiving one an rmb, but it could be stronger,
> Oleg?

IIRC, it was decided that IPIs must imply mb(), but I am not sure
this is true on every arch.



However, even if IPI didn't imply mb(), I don't understand why it
is needed... After the quick reading of the original changelog in
http://marc.info/?l=linux-kernel&m=126283923115068

	Thread A                    Thread B
	prev mem accesses           prev mem accesses
	sys_membarrier()            barrier()
	follow mem accesses         follow mem accesses

sys_membarrier() should "insert" mb() on behalf of B "instead"
of barrier(), right? But, if we send IPI, B enters kernel mode
and returns to user-mode. Should this imply mb() in any case?

Oleg.


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier
  2010-01-07 18:30   ` Oleg Nesterov
@ 2010-01-07 18:39     ` Paul E. McKenney
  2010-01-07 18:59       ` Steven Rostedt
  0 siblings, 1 reply; 107+ messages in thread
From: Paul E. McKenney @ 2010-01-07 18:39 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Peter Zijlstra, Mathieu Desnoyers, linux-kernel, Ingo Molnar,
	akpm, josh, tglx, rostedt, Valdis.Kletnieks, dhowells, laijs,
	dipankar

On Thu, Jan 07, 2010 at 07:30:10PM +0100, Oleg Nesterov wrote:
> On 01/07, Peter Zijlstra wrote:
> >
> > On Wed, 2010-01-06 at 23:40 -0500, Mathieu Desnoyers wrote:
> >
> > http://marc.info/?t=126283939400002
> >
> > > Index: linux-2.6-lttng/kernel/sched.c
> > > ===================================================================
> > > --- linux-2.6-lttng.orig/kernel/sched.c	2010-01-06 22:11:32.000000000 -0500
> > > +++ linux-2.6-lttng/kernel/sched.c	2010-01-06 23:20:42.000000000 -0500
> > > @@ -10822,6 +10822,36 @@ struct cgroup_subsys cpuacct_subsys = {
> > >  };
> > >  #endif	/* CONFIG_CGROUP_CPUACCT */
> > >
> > > +/*
> > > + * Execute a memory barrier on all CPUs on SMP systems.
> > > + * Do not rely on implicit barriers in smp_call_function(), just in case they
> > > + * are ever relaxed in the future.
> > > + */
> > > +static void membarrier_ipi(void *unused)
> > > +{
> > > +	smp_mb();
> > > +}
> > > +
> >
> > Also, there was some talk a while ago about IPIs implying memory
> > barriers. Which I of course forgot all details about,.. at least sending
> > one implies a wmb and receiving one an rmb, but it could be stronger,
> > Oleg?
> 
> IIRC, it was decided that IPIs must imply mb(), but I am not sure
> this is true on every arch.
> 
> 
> 
> However, even if IPI didn't imply mb(), I don't understand why it
> is needed... After the quick reading of the original changelog in
> http://marc.info/?l=linux-kernel&m=126283923115068
> 
> 	Thread A                    Thread B
> 	prev mem accesses           prev mem accesses
> 	sys_membarrier()            barrier()
> 	follow mem accesses         follow mem accesses
> 
> sys_membarrier() should "insert" mb() on behalf of B "instead"
> of barrier(), right? But, if we send IPI, B enters kernel mode
> and returns to user-mode. Should this imply mb() in any case?

Hello, Oleg,

The issue is with some suggested optimizations that would avoid sending
the IPI to CPUs that are not running threads in the same process as the
thread executing the sys_membarrier().  Some forms of these optimizations
sample ->mm without locking, and the question is whether this is safe.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier
  2010-01-07 18:04                   ` Steven Rostedt
@ 2010-01-07 18:40                     ` Paul E. McKenney
  0 siblings, 0 replies; 107+ messages in thread
From: Paul E. McKenney @ 2010-01-07 18:40 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Peter Zijlstra, Josh Triplett, Mathieu Desnoyers, linux-kernel,
	Ingo Molnar, akpm, tglx, Valdis.Kletnieks, dhowells, laijs,
	dipankar

On Thu, Jan 07, 2010 at 01:04:07PM -0500, Steven Rostedt wrote:
> On Thu, 2010-01-07 at 09:56 -0800, Paul E. McKenney wrote:
> > On Thu, Jan 07, 2010 at 12:44:37PM -0500, Steven Rostedt wrote:
> > > On Thu, 2010-01-07 at 09:31 -0800, Paul E. McKenney wrote:
> > > 
> > > > Something like the following for sys_membarrier(), then?
> > > > 
> > > >   smp_mb();
> > > >   for_each_cpu(cpu, current->mm->cpu_vm_mask) {
> > > >      if (cpu_curr(cpu)->mm == current->mm)
> > > >         smp_call_function_single(cpu, func, NULL, 1);
> > > >   }
> > > > 
> > > > Then the code changing ->mm on the other CPU also needs to have a
> > > > full smp_mb() somewhere after the change to ->mm, but before starting
> > > > user-space execution.  Which it might well do just due to overhead, but
> > > > we need to make sure that someone doesn't optimize us out of existence.
> > > 
> > > To change the mm requires things like flushing the TLB. I'd be surprised
> > > if the change of the mm does not already do a smp_mb() somewhere.
> > 
> > Agreed, but "somewhere" does not fill me with warm fuzzies.  ;-)
> 
> Another question would be, does flushing the TLB imply a mb()?

I do not believe that it is guaranteed to on all architectures.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier
  2010-01-07 18:39     ` Paul E. McKenney
@ 2010-01-07 18:59       ` Steven Rostedt
  2010-01-07 19:16         ` Paul E. McKenney
  0 siblings, 1 reply; 107+ messages in thread
From: Steven Rostedt @ 2010-01-07 18:59 UTC (permalink / raw)
  To: paulmck
  Cc: Oleg Nesterov, Peter Zijlstra, Mathieu Desnoyers, linux-kernel,
	Ingo Molnar, akpm, josh, tglx, Valdis.Kletnieks, dhowells, laijs,
	dipankar

On Thu, 2010-01-07 at 10:39 -0800, Paul E. McKenney wrote:

> > sys_membarrier() should "insert" mb() on behalf of B "instead"
> > of barrier(), right? But, if we send IPI, B enters kernel mode
> > and returns to user-mode. Should this imply mb() in any case?
> 
> Hello, Oleg,
> 
> The issue is with some suggested optimizations that would avoid sending
> the IPI to CPUs that are not running threads in the same process as the
> thread executing the sys_membarrier().  Some forms of these optimizations
> sample ->mm without locking, and the question is whether this is safe.

Note, we are not suggesting optimizations. It has nothing to do with
performance of the syscall. We just can't allow one process to DoS
another process on another cpu by sending out millions of IPIs.
Mathieu already showed that you could cause a 2x slowdown to the
unrelated tasks.

-- Steve



^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier
  2010-01-07 18:59       ` Steven Rostedt
@ 2010-01-07 19:16         ` Paul E. McKenney
  2010-01-07 19:40           ` Steven Rostedt
  0 siblings, 1 reply; 107+ messages in thread
From: Paul E. McKenney @ 2010-01-07 19:16 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Oleg Nesterov, Peter Zijlstra, Mathieu Desnoyers, linux-kernel,
	Ingo Molnar, akpm, josh, tglx, Valdis.Kletnieks, dhowells, laijs,
	dipankar

On Thu, Jan 07, 2010 at 01:59:42PM -0500, Steven Rostedt wrote:
> On Thu, 2010-01-07 at 10:39 -0800, Paul E. McKenney wrote:
> 
> > > sys_membarrier() should "insert" mb() on behalf of B "instead"
> > > of barrier(), right? But, if we send IPI, B enters kernel mode
> > > and returns to user-mode. Should this imply mb() in any case?
> > 
> > Hello, Oleg,
> > 
> > The issue is with some suggested optimizations that would avoid sending
> > the IPI to CPUs that are not running threads in the same process as the
> > thread executing the sys_membarrier().  Some forms of these optimizations
> > sample ->mm without locking, and the question is whether this is safe.
> 
> Note, we are not suggesting optimizations. It has nothing to do with
> performance of the syscall. We just can't allow one process to be DoSing
> another process on another cpu by it sending out millions of IPIs.
> Mathieu already showed that you could cause a 2x slowdown to the
> unrelated tasks.

I would have said that we are trying to optimize our way out of a DoS
situation, but point taken.  Whatever we choose to call it, the discussion
is on the suggested modifications, not strictly on the original patch.  ;-)

						Thanx, Paul

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier
  2010-01-07 19:16         ` Paul E. McKenney
@ 2010-01-07 19:40           ` Steven Rostedt
  2010-01-07 20:58             ` Paul E. McKenney
  0 siblings, 1 reply; 107+ messages in thread
From: Steven Rostedt @ 2010-01-07 19:40 UTC (permalink / raw)
  To: paulmck
  Cc: Oleg Nesterov, Peter Zijlstra, Mathieu Desnoyers, linux-kernel,
	Ingo Molnar, akpm, josh, tglx, Valdis.Kletnieks, dhowells, laijs,
	dipankar

On Thu, 2010-01-07 at 11:16 -0800, Paul E. McKenney wrote:

> > Note, we are not suggesting optimizations. It has nothing to do with
> > performance of the syscall. We just can't allow one process to be DoSing
> > another process on another cpu by it sending out millions of IPIs.
> > Mathieu already showed that you could cause a 2x slowdown to the
> > unrelated tasks.
> 
> I would have said that we are trying to optimize our way out of a DoS
> situation, but point taken.  Whatever we choose to call it, the discussion
> is on the suggested modifications, not strictly on the original patch.  ;-)

OK, I just want to get a better understanding of what can go wrong. A
sys_membarrier() is used as follows, correct? (using a list example)

	<user space>

	list_del(obj);

	synchronize_rcu();  -> calls sys_membarrier();

	free(obj);


And we need to protect against:

	<user space>

	read_rcu_lock();

	obj = list->next;

	use_object(obj);

	read_rcu_unlock();

where we want to make sure that the synchronize_rcu() makes sure that we
have passed the grace period of all takers of read_rcu_lock(). Now I
have not looked at the code that implements userspace rcu, so I'm making
a lot of assumptions here. But the problem that we need to avoid is:


	CPU 1				CPU 2
     -----------                    -------------

	<user space>			<user space>

					rcu_read_lock();

					obj = list->next

	list_del(obj)

					< Interrupt >
					< kernel space>
					<schedule>

					<kernel_thread>

					<schedule>

					< back to original task >

	sys_membarrier();
	< kernel space >

	if (task_rq(task)->curr != task)
	< but still sees kernel thread >

	< user space >

	< misses that we are still in rcu section >

	free(obj);

					< user space >

					use_object(obj); <=== crash!



I guess what I'm trying to do here is to understand what can go wrong,
and then when we understand the issues, we can find a solution.

-- Steve




^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier
  2010-01-07 19:40           ` Steven Rostedt
@ 2010-01-07 20:58             ` Paul E. McKenney
  2010-01-07 21:35               ` Steven Rostedt
  0 siblings, 1 reply; 107+ messages in thread
From: Paul E. McKenney @ 2010-01-07 20:58 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Oleg Nesterov, Peter Zijlstra, Mathieu Desnoyers, linux-kernel,
	Ingo Molnar, akpm, josh, tglx, Valdis.Kletnieks, dhowells, laijs,
	dipankar

On Thu, Jan 07, 2010 at 02:40:43PM -0500, Steven Rostedt wrote:
> On Thu, 2010-01-07 at 11:16 -0800, Paul E. McKenney wrote:
> 
> > > Note, we are not suggesting optimizations. It has nothing to do with
> > > performance of the syscall. We just can't allow one process to be DoSing
> > > another process on another cpu by it sending out millions of IPIs.
> > > Mathieu already showed that you could cause a 2x slowdown to the
> > > unrelated tasks.
> > 
> > I would have said that we are trying to optimize our way out of a DoS
> > situation, but point taken.  Whatever we choose to call it, the discussion
> > is on the suggested modifications, not strictly on the original patch.  ;-)
> 
> OK, I just want to get a better understanding of what can go wrong. A
> sys_membarrier() is used as follows, correct? (using a list example)
> 
> 	<user space>
> 
> 	list_del(obj);
> 
> 	synchronize_rcu();  -> calls sys_membarrier();
> 
> 	free(obj);
> 
> 
> And we need to protect against:
> 
> 	<user space>
> 
> 	read_rcu_lock();
> 
> 	obj = list->next;
> 
> 	use_object(obj);
> 
> 	read_rcu_unlock();

Yep!

> where we want to make sure that the synchronize_rcu() makes sure that we
> have passed the grace period of all takers of read_rcu_lock(). Now I
> have not looked at the code that implements userspace rcu, so I'm making
> a lot of assumptions here. But the problem that we need to avoid is:
> 
> 
> 	CPU 1				CPU 2
>      -----------                    -------------
> 
> 	<user space>			<user space>
> 
> 					rcu_read_lock();
> 
> 					obj = list->next
> 
> 	list_del(obj)
> 
> 					< Interrupt >
> 					< kernel space>
> 					<schedule>
> 
> 					<kernel_thread>
> 
> 					<schedule>
> 
> 					< back to original task >
> 
> 	sys_membarrier();
> 	< kernel space >
> 
> 	if (task_rq(task)->curr != task)
> 	< but still sees kernel thread >
> 
> 	< user space >
> 
> 	< misses that we are still in rcu section >
> 
> 	free(obj);
> 
> 					< user space >
> 
> 					use_object(obj); <=== crash!
> 
> 
> 
> I guess what I'm trying to do here is to understand what can go wrong,
> and then when we understand the issues, we can find a solution.

I believe that I am worried about a different scenario.  I do not believe
that the scenario you lay out above can actually happen.  The pair of
schedules on CPU 2 have to act as a full memory barrier, otherwise,
it would not be safe to resume a task on some other CPU.  If the pair
of schedules act as a full memory barrier, then the code in
synchronize_rcu() that looks at the RCU read-side state would see that
CPU 2 is in an RCU read-side critical section.

The scenario that I am (perhaps wrongly) concerned about is enabled by
the fact that URCU's rcu_read_lock() has a load, some checks, and a store.
It has compiler constraints, but no hardware memory barriers.  This
means that CPUs (even x86) can execute an rcu_dereference() before the
rcu_read_lock()'s store has executed.

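For reference, the rough shape of such an rcu_read_lock() (illustrative
only, not the actual liburcu source; the names, counter layout and
constants here are invented for the sketch):

  #define NEST_MASK   0xffffUL    /* low bits: nesting count (assumed layout) */
  #define NEST_COUNT  1UL

  extern unsigned long global_gp_ctr;          /* always has NEST_COUNT set */
  extern __thread unsigned long reader_gp_ctr; /* per-thread reader state */

  static inline void rcu_read_lock(void)
  {
          unsigned long tmp = reader_gp_ctr;           /* load */

          if (!(tmp & NEST_MASK))                      /* checks */
                  reader_gp_ctr = global_gp_ctr;       /* store: outermost */
          else
                  reader_gp_ctr = tmp + NEST_COUNT;    /* store: nested */
          barrier();   /* compiler barrier only, e.g. asm volatile("" ::: "memory") */
  }

  /* No smp_mb(): the CPU may still hoist a later rcu_dereference() load
   * above the store, which is the reordering discussed below. */
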
Hacking your example above, keeping in mind that x86 can reorder subsequent
loads to precede prior stores:


	CPU 1				CPU 2
     -----------                    -------------

	<user space>			<kernel space, switching to task>

					->curr updated

					<long code path, maybe mb?>

					<user space>

					rcu_read_lock(); [load only]

					obj = list->next

	list_del(obj)

	sys_membarrier();
	< kernel space >

	if (task_rq(task)->curr != task)
	< but load to obj reordered before store to ->curr >

	< user space >

	< misses that CPU 2 is in rcu section >

	[CPU 2's ->curr update now visible]

	[CPU 2's rcu_read_lock() store now visible]

	free(obj);

					use_object(obj); <=== crash!



If the "long code path" happens to include a full memory barrier, or if it
happens to be long enough to overflow CPU 2's store buffer, then the
above scenario cannot happen.  Until such time as someone applies some
unforeseen optimization to the context-switch path.

And, yes, the context-switch path has to have a full memory barrier
somewhere, but that somewhere could just as easily come before the
update of ->curr.

The same scenario applies when using ->cpu_vm_mask instead of ->curr.

Now, I could easily believe that the current context-switch code has
sufficient atomic operations, memory barriers, and instructions to
prevent this scenario from occurring, but it is feeling a lot like an
accident waiting to happen.  Hence my strident complaints.  ;-)

							Thanx, Paul

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier
  2010-01-07 20:58             ` Paul E. McKenney
@ 2010-01-07 21:35               ` Steven Rostedt
  2010-01-07 22:34                 ` Paul E. McKenney
                                   ` (2 more replies)
  0 siblings, 3 replies; 107+ messages in thread
From: Steven Rostedt @ 2010-01-07 21:35 UTC (permalink / raw)
  To: paulmck
  Cc: Oleg Nesterov, Peter Zijlstra, Mathieu Desnoyers, linux-kernel,
	Ingo Molnar, akpm, josh, tglx, Valdis.Kletnieks, dhowells, laijs,
	dipankar

On Thu, 2010-01-07 at 12:58 -0800, Paul E. McKenney wrote:

> I believe that I am worried about a different scenario.  I do not believe
> that the scenario you lay out above can actually happen.  The pair of
> schedules on CPU 2 have to act as a full memory barrier, otherwise,
> it would not be safe to resume a task on some other CPU.

I'm not so sure about that. The update of ->curr happens inside a
spinlock, which is a rmb() ... wmb() pair. Must be, because a spin_lock
must be an rmb otherwise the loads could move outside the lock, and the
spin_unlock must be a wmb() otherwise what was written could move
outside the lock.


>   If the pair
> of schedules act as a full memory barrier, then the code in
> synchronize_rcu() that looks at the RCU read-side state would see that
> CPU 2 is in an RCU read-side critical section.
> 
> The scenario that I am (perhaps wrongly) concerned about is enabled by
> the fact that URCU's rcu_read_lock() has a load, some checks, and a store.
> It has compiler constraints, but no hardware memory barriers.  This
> means that CPUs (even x86) can execute an rcu_dereference() before the
> rcu_read_lock()'s store has executed.
> 
> Hacking your example above, keeping in mind that x86 can reorder subsequent
> loads to precede prior stores:
> 
> 
> 	CPU 1				CPU 2
>      -----------                    -------------
> 
> 	<user space>			<kernel space, switching to task>
> 
> 					->curr updated
> 
> 					<long code path, maybe mb?>
> 
> 					<user space>
> 
> 					rcu_read_lock(); [load only]
> 
> 					obj = list->next
> 
> 	list_del(obj)
> 
> 	sys_membarrier();
> 	< kernel space >

Well, if we just grab the task_rq(task)->lock here, then we should be
OK? We would guarantee that curr is either the task we want or not.

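A sketch of what that could look like in the sys_membarrier() iteration
(hypothetical; struct rq, rq->lock and cpu_rq() are private to
kernel/sched.c, so this assumes the code lives there, as in the posted
patch):

  int cpu;

  for_each_cpu(cpu, mm_cpumask(current->mm)) {
          struct rq *rq = cpu_rq(cpu);
          unsigned long flags;
          int send_ipi;

          spin_lock_irqsave(&rq->lock, flags);
          send_ipi = (rq->curr->mm == current->mm); /* ->curr stable under rq->lock */
          spin_unlock_irqrestore(&rq->lock, flags);
          if (send_ipi)
                  smp_call_function_single(cpu, membarrier_ipi, NULL, 1);
  }
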
> 
> 	if (task_rq(task)->curr != task)
> 	< but load to obj reordered before store to ->curr >
> 
> 	< user space >
> 
> 	< misses that CPU 2 is in rcu section >
> 
> 	[CPU 2's ->curr update now visible]
> 
> 	[CPU 2's rcu_read_lock() store now visible]
> 
> 	free(obj);
> 
> 					use_object(obj); <=== crash!
> 
> 
> 
> If the "long code path" happens to include a full memory barrier, or if it
> happens to be long enough to overflow CPU 2's store buffer, then the
> above scenario cannot happen.  Until such time as someone applies some
> unforeseen optimization to the context-switch path.
> 
> And, yes, the context-switch path has to have a full memory barrier
> somewhere, but that somewhere could just as easily come before the
> update of ->curr.

Hmm, since ->curr is updated before sched_mm() I'm thinking it would
have to be after the update of curr.

> 
> The same scenario applies when using ->cpu_vm_mask instead of ->curr.
> 
> Now, I could easily believe that the current context-switch code has
> sufficient atomic operations, memory barriers, and instructions to
> prevent this scenario from occurring, but it is feeling a lot like an
> accident waiting to happen.  Hence my strident complaints.  ;-)

I'm totally with you on this. I really want a good understanding of what
can go wrong, and show that we have the necessary infrastructure to
prevent it.

-- Steve



^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier
  2010-01-07 21:35               ` Steven Rostedt
@ 2010-01-07 22:34                 ` Paul E. McKenney
  2010-01-08 22:28                 ` Mathieu Desnoyers
  2010-01-08 23:53                 ` Mathieu Desnoyers
  2 siblings, 0 replies; 107+ messages in thread
From: Paul E. McKenney @ 2010-01-07 22:34 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Oleg Nesterov, Peter Zijlstra, Mathieu Desnoyers, linux-kernel,
	Ingo Molnar, akpm, josh, tglx, Valdis.Kletnieks, dhowells, laijs,
	dipankar

On Thu, Jan 07, 2010 at 04:35:40PM -0500, Steven Rostedt wrote:
> On Thu, 2010-01-07 at 12:58 -0800, Paul E. McKenney wrote:
> 
> > I believe that I am worried about a different scenario.  I do not believe
> > that the scenario you lay out above can actually happen.  The pair of
> > schedules on CPU 2 have to act as a full memory barrier, otherwise,
> > it would not be safe to resume a task on some other CPU.
> 
> I'm not so sure about that. The update of ->curr happens inside a
> spinlock, which is a rmb() ... wmb() pair. Must be, because a spin_lock
> must be an rmb otherwise the loads could move outside the lock, and the
> spin_unlock must be a wmb() otherwise what was written could move
> outside the lock.

If a given task is running on CPU 0, then switches to CPU 1, all of the
CPU-0 activity from that task had -better- be visible when it runs on
CPU 1.  But if you were saying that there are other ways to accomplish
this than a full memory barrier, I do agree.

> >   If the pair
> > of schedules act as a full memory barrier, then the code in
> > synchronize_rcu() that looks at the RCU read-side state would see that
> > CPU 2 is in an RCU read-side critical section.
> > 
> > The scenario that I am (perhaps wrongly) concerned about is enabled by
> > the fact that URCU's rcu_read_lock() has a load, some checks, and a store.
> > It has compiler constraints, but no hardware memory barriers.  This
> > means that CPUs (even x86) can execute an rcu_dereference() before the
> > rcu_read_lock()'s store has executed.
> > 
> > Hacking your example above, keeping in mind that x86 can reorder subsequent
> > loads to precede prior stores:
> > 
> > 
> > 	CPU 1				CPU 2
> >      -----------                    -------------
> > 
> > 	<user space>			<kernel space, switching to task>
> > 
> > 					->curr updated
> > 
> > 					<long code path, maybe mb?>
> > 
> > 					<user space>
> > 
> > 					rcu_read_lock(); [load only]
> > 
> > 					obj = list->next
> > 
> > 	list_del(obj)
> > 
> > 	sys_membarrier();
> > 	< kernel space >
> 
> Well, if we just grab the task_rq(task)->lock here, then we should be
> OK? We would guarantee that curr is either the task we want or not.

The lock that CPU 2 just grabbed to protect its ->curr update?  If so,
then I believe that this would work, because the CPU would not be
permitted to re-order the "obj = list->next" to precede CPU 2's
acquisition of this lock.

> > 	if (task_rq(task)->curr != task)
> > 	< but load to obj reordered before store to ->curr >
> > 
> > 	< user space >
> > 
> > 	< misses that CPU 2 is in rcu section >
> > 
> > 	[CPU 2's ->curr update now visible]
> > 
> > 	[CPU 2's rcu_read_lock() store now visible]
> > 
> > 	free(obj);
> > 
> > 					use_object(obj); <=== crash!
> > 
> > 
> > 
> > If the "long code path" happens to include a full memory barrier, or if it
> > happens to be long enough to overflow CPU 2's store buffer, then the
> > above scenario cannot happen.  Until such time as someone applies some
> > unforeseen optimization to the context-switch path.
> > 
> > And, yes, the context-switch path has to have a full memory barrier
> > somewhere, but that somewhere could just as easily come before the
> > update of ->curr.
> 
> Hmm, since ->curr is updated before sched_mm() I'm thinking it would
> have to be after the update of curr.

If I understand what you are getting at, from a coherence viewpoint,
the only requirement is that the memory barrier (or equivalent) come
between the last user-mode instruction and the runqueue update on the
outgoing CPU, and between the runqueue read and the first user-mode
instruction on the incoming CPU.

> > The same scenario applies when using ->cpu_vm_mask instead of ->curr.
> > 
> > Now, I could easily believe that the current context-switch code has
> > sufficient atomic operations, memory barriers, and instructions to
> > prevent this scenario from occurring, but it is feeling a lot like an
> > accident waiting to happen.  Hence my strident complaints.  ;-)
> 
> I'm totally with you on this. I really want a good understanding of what
> can go wrong, and show that we have the necessary infrastructure to
> prevent it.

Sounds good to me!  ;-)

							Thanx, Paul

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier
  2010-01-07 21:35               ` Steven Rostedt
  2010-01-07 22:34                 ` Paul E. McKenney
@ 2010-01-08 22:28                 ` Mathieu Desnoyers
  2010-01-08 23:53                 ` Mathieu Desnoyers
  2 siblings, 0 replies; 107+ messages in thread
From: Mathieu Desnoyers @ 2010-01-08 22:28 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: paulmck, Oleg Nesterov, Peter Zijlstra, linux-kernel,
	Ingo Molnar, akpm, josh, tglx, Valdis.Kletnieks, dhowells, laijs,
	dipankar

* Steven Rostedt (rostedt@goodmis.org) wrote:
> On Thu, 2010-01-07 at 12:58 -0800, Paul E. McKenney wrote:
> 
> > I believe that I am worried about a different scenario.  I do not believe
> > that the scenario you lay out above can actually happen.  The pair of
> > schedules on CPU 2 have to act as a full memory barrier, otherwise,
> > it would not be safe to resume a task on some other CPU.
> 
> I'm not so sure about that. The update of ->curr happens inside a
> spinlock, which is a rmb() ... wmb() pair. Must be, because a spin_lock
> must be an rmb otherwise the loads could move outside the lock, and the
> spin_unlock must be a wmb() otherwise what was written could move
> outside the lock.

Hrm, an rmb + wmb pair is different from a full mb(), because rmb and wmb
can be reordered one wrt the other. The equivalence is more:

mb() = rmb() + sync_core() + wmb()

> 
> 
> >   If the pair
> > of schedules act as a full memory barrier, then the code in
> > synchronize_rcu() that looks at the RCU read-side state would see that
> > CPU 2 is in an RCU read-side critical section.
> > 
> > The scenario that I am (perhaps wrongly) concerned about is enabled by
> > the fact that URCU's rcu_read_lock() has a load, some checks, and a store.
> > It has compiler constraints, but no hardware memory barriers.  This
> > means that CPUs (even x86) can execute an rcu_dereference() before the
> > rcu_read_lock()'s store has executed.
> > 
> > Hacking your example above, keeping in mind that x86 can reorder subsequent
> > loads to precede prior stores:
> > 
> > 
> > 	CPU 1				CPU 2
> >      -----------                    -------------
> > 
> > 	<user space>			<kernel space, switching to task>
> > 
> > 					->curr updated
> > 
> > 					<long code path, maybe mb?>
> > 
> > 					<user space>
> > 
> > 					rcu_read_lock(); [load only]
> > 
> > 					obj = list->next
> > 
> > 	list_del(obj)
> > 
> > 	sys_membarrier();
> > 	< kernel space >
> 
> Well, if we just grab the task_rq(task)->lock here, then we should be
> OK? We would guarantee that curr is either the task we want or not.
> 
[...]

Yes, I'll do some testing to figure out how much overhead this has.
Probably not much, but it's an iteration on all CPUs, so it will be a
bit larger on big iron. Clearly taking the run queue lock would be the
safest way to proceed.

Thanks,

Mathieu


-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier
  2010-01-07 21:35               ` Steven Rostedt
  2010-01-07 22:34                 ` Paul E. McKenney
  2010-01-08 22:28                 ` Mathieu Desnoyers
@ 2010-01-08 23:53                 ` Mathieu Desnoyers
  2010-01-09  0:20                   ` Paul E. McKenney
  2 siblings, 1 reply; 107+ messages in thread
From: Mathieu Desnoyers @ 2010-01-08 23:53 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: paulmck, Oleg Nesterov, Peter Zijlstra, linux-kernel,
	Ingo Molnar, akpm, josh, tglx, Valdis.Kletnieks, dhowells, laijs,
	dipankar

* Steven Rostedt (rostedt@goodmis.org) wrote:
> Well, if we just grab the task_rq(task)->lock here, then we should be
> OK? We would guarantee that curr is either the task we want or not.
> 

Hrm, I just tested it, and there seems to be a significant performance
penalty involved with taking these locks for each CPU, even with just 8
cores. So if we can do without the locks, that would be preferred.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier
  2010-01-08 23:53                 ` Mathieu Desnoyers
@ 2010-01-09  0:20                   ` Paul E. McKenney
  2010-01-09  1:02                     ` Mathieu Desnoyers
  0 siblings, 1 reply; 107+ messages in thread
From: Paul E. McKenney @ 2010-01-09  0:20 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Steven Rostedt, Oleg Nesterov, Peter Zijlstra, linux-kernel,
	Ingo Molnar, akpm, josh, tglx, Valdis.Kletnieks, dhowells, laijs,
	dipankar

On Fri, Jan 08, 2010 at 06:53:38PM -0500, Mathieu Desnoyers wrote:
> * Steven Rostedt (rostedt@goodmis.org) wrote:
> > Well, if we just grab the task_rq(task)->lock here, then we should be
> > OK? We would guarantee that curr is either the task we want or not.
> 
> Hrm, I just tested it, and there seems to be a significant performance
> penalty involved with taking these locks for each CPU, even with just 8
> cores. So if we can do without the locks, that would be preferred.

How significant?  Factor of two?  Two orders of magnitude?

							Thanx, Paul

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier
  2010-01-09  0:20                   ` Paul E. McKenney
@ 2010-01-09  1:02                     ` Mathieu Desnoyers
  2010-01-09  1:21                       ` Paul E. McKenney
  0 siblings, 1 reply; 107+ messages in thread
From: Mathieu Desnoyers @ 2010-01-09  1:02 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Steven Rostedt, Oleg Nesterov, Peter Zijlstra, linux-kernel,
	Ingo Molnar, akpm, josh, tglx, Valdis.Kletnieks, dhowells, laijs,
	dipankar

* Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> On Fri, Jan 08, 2010 at 06:53:38PM -0500, Mathieu Desnoyers wrote:
> > * Steven Rostedt (rostedt@goodmis.org) wrote:
> > > Well, if we just grab the task_rq(task)->lock here, then we should be
> > > OK? We would guarantee that curr is either the task we want or not.
> > 
> > Hrm, I just tested it, and there seems to be a significant performance
> > penalty involved with taking these locks for each CPU, even with just 8
> > cores. So if we can do without the locks, that would be preferred.
> 
> How significant?  Factor of two?  Two orders of magnitude?
> 

On a 8-core Intel Xeon (T is the number of threads receiving the IPIs):

Without runqueue locks:

T=1: 0m13.911s
T=2: 0m20.730s
T=3: 0m21.474s
T=4: 0m27.952s
T=5: 0m26.286s
T=6: 0m27.855s
T=7: 0m29.695s

With runqueue locks:

T=1: 0m15.802s
T=2: 0m22.484s
T=3: 0m24.751s
T=4: 0m29.134s
T=5: 0m30.094s
T=6: 0m33.090s
T=7: 0m33.897s

So on 8 cores, taking spinlocks for each of the 8 runqueues adds about
15% overhead when doing an IPI to 1 thread. Therefore, that won't be
pretty on 128+-core machines.

Thanks,

Mathieu



-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier
  2010-01-09  1:02                     ` Mathieu Desnoyers
@ 2010-01-09  1:21                       ` Paul E. McKenney
  2010-01-09  1:22                         ` Paul E. McKenney
  2010-01-09  2:38                         ` Mathieu Desnoyers
  0 siblings, 2 replies; 107+ messages in thread
From: Paul E. McKenney @ 2010-01-09  1:21 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Steven Rostedt, Oleg Nesterov, Peter Zijlstra, linux-kernel,
	Ingo Molnar, akpm, josh, tglx, Valdis.Kletnieks, dhowells, laijs,
	dipankar

On Fri, Jan 08, 2010 at 08:02:31PM -0500, Mathieu Desnoyers wrote:
> * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > On Fri, Jan 08, 2010 at 06:53:38PM -0500, Mathieu Desnoyers wrote:
> > > * Steven Rostedt (rostedt@goodmis.org) wrote:
> > > > Well, if we just grab the task_rq(task)->lock here, then we should be
> > > > OK? We would guarantee that curr is either the task we want or not.
> > > 
> > > Hrm, I just tested it, and there seems to be a significant performance
> > > penalty involved with taking these locks for each CPU, even with just 8
> > > cores. So if we can do without the locks, that would be preferred.
> > 
> > How significant?  Factor of two?  Two orders of magnitude?
> > 
> 
> On a 8-core Intel Xeon (T is the number of threads receiving the IPIs):
> 
> Without runqueue locks:
> 
> T=1: 0m13.911s
> T=2: 0m20.730s
> T=3: 0m21.474s
> T=4: 0m27.952s
> T=5: 0m26.286s
> T=6: 0m27.855s
> T=7: 0m29.695s
> 
> With runqueue locks:
> 
> T=1: 0m15.802s
> T=2: 0m22.484s
> T=3: 0m24.751s
> T=4: 0m29.134s
> T=5: 0m30.094s
> T=6: 0m33.090s
> T=7: 0m33.897s
> 
> So on 8 cores, taking spinlocks for each of the 8 runqueues adds about
> 15% overhead when doing an IPI to 1 thread. Therefore, that won't be
> pretty on 128+-core machines.

But isn't the bulk of the overhead the IPIs rather than the runqueue
locks?

     W/out RQ       W/RQ   % degradation
T=1:    13.91      15.8    1.14
T=2:    20.73      22.48   1.08
T=3:    21.47      24.75   1.15
T=4:    27.95      29.13   1.04
T=5:    26.29      30.09   1.14
T=6:    27.86      33.09   1.19
T=7:    29.7       33.9    1.14

So if we had lots of CPUs, we might want to fan the IPIs out through
intermediate CPUs in a tree fashion, but the runqueue locks are not
causing excessive pain.

How does this compare to use of POSIX signals?  Never mind, POSIX
signals are arbitrarily bad if you have way more threads than are
actually running at the time...

							Thanx, Paul

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier
  2010-01-09  1:21                       ` Paul E. McKenney
@ 2010-01-09  1:22                         ` Paul E. McKenney
  2010-01-09  2:38                         ` Mathieu Desnoyers
  1 sibling, 0 replies; 107+ messages in thread
From: Paul E. McKenney @ 2010-01-09  1:22 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Steven Rostedt, Oleg Nesterov, Peter Zijlstra, linux-kernel,
	Ingo Molnar, akpm, josh, tglx, Valdis.Kletnieks, dhowells, laijs,
	dipankar

On Fri, Jan 08, 2010 at 05:21:28PM -0800, Paul E. McKenney wrote:
> On Fri, Jan 08, 2010 at 08:02:31PM -0500, Mathieu Desnoyers wrote:
> > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > > On Fri, Jan 08, 2010 at 06:53:38PM -0500, Mathieu Desnoyers wrote:
> > > > * Steven Rostedt (rostedt@goodmis.org) wrote:
> > > > > Well, if we just grab the task_rq(task)->lock here, then we should be
> > > > > OK? We would guarantee that curr is either the task we want or not.
> > > > 
> > > > Hrm, I just tested it, and there seems to be a significant performance
> > > > penalty involved with taking these locks for each CPU, even with just 8
> > > > cores. So if we can do without the locks, that would be preferred.
> > > 
> > > How significant?  Factor of two?  Two orders of magnitude?
> > > 
> > 
> > On a 8-core Intel Xeon (T is the number of threads receiving the IPIs):
> > 
> > Without runqueue locks:
> > 
> > T=1: 0m13.911s
> > T=2: 0m20.730s
> > T=3: 0m21.474s
> > T=4: 0m27.952s
> > T=5: 0m26.286s
> > T=6: 0m27.855s
> > T=7: 0m29.695s
> > 
> > With runqueue locks:
> > 
> > T=1: 0m15.802s
> > T=2: 0m22.484s
> > T=3: 0m24.751s
> > T=4: 0m29.134s
> > T=5: 0m30.094s
> > T=6: 0m33.090s
> > T=7: 0m33.897s
> > 
> > So on 8 cores, taking spinlocks for each of the 8 runqueues adds about
> > 15% overhead when doing an IPI to 1 thread. Therefore, that won't be
> > pretty on 128+-core machines.
> 
> But isn't the bulk of the overhead the IPIs rather than the runqueue
> locks?
> 
>      W/out RQ       W/RQ   % degradation
> T=1:    13.91      15.8    1.14
> T=2:    20.73      22.48   1.08
> T=3:    21.47      24.75   1.15
> T=4:    27.95      29.13   1.04
> T=5:    26.29      30.09   1.14
> T=6:    27.86      33.09   1.19
> T=7:    29.7       33.9    1.14

Right...  s/% degradation/Ratio/  :-/

							Thanx, Paul

> So if we had lots of CPUs, we might want to fan the IPIs out through
> intermediate CPUs in a tree fashion, but the runqueue locks are not
> causing excessive pain.
> 
> How does this compare to use of POSIX signals?  Never mind, POSIX
> signals are arbitrarily bad if you have way more threads than are
> actually running at the time...
> 
> 							Thanx, Paul

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier
  2010-01-09  1:21                       ` Paul E. McKenney
  2010-01-09  1:22                         ` Paul E. McKenney
@ 2010-01-09  2:38                         ` Mathieu Desnoyers
  2010-01-09  5:42                           ` Paul E. McKenney
  1 sibling, 1 reply; 107+ messages in thread
From: Mathieu Desnoyers @ 2010-01-09  2:38 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Steven Rostedt, Oleg Nesterov, Peter Zijlstra, linux-kernel,
	Ingo Molnar, akpm, josh, tglx, Valdis.Kletnieks, dhowells, laijs,
	dipankar

* Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> On Fri, Jan 08, 2010 at 08:02:31PM -0500, Mathieu Desnoyers wrote:
> > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > > On Fri, Jan 08, 2010 at 06:53:38PM -0500, Mathieu Desnoyers wrote:
> > > > * Steven Rostedt (rostedt@goodmis.org) wrote:
> > > > > Well, if we just grab the task_rq(task)->lock here, then we should be
> > > > > OK? We would guarantee that curr is either the task we want or not.
> > > > 
> > > > Hrm, I just tested it, and there seems to be a significant performance
> > > > > penalty involved with taking these locks for each CPU, even with just 8
> > > > cores. So if we can do without the locks, that would be preferred.
> > > 
> > > How significant?  Factor of two?  Two orders of magnitude?
> > > 
> > 
> > On a 8-core Intel Xeon (T is the number of threads receiving the IPIs):
> > 
> > Without runqueue locks:
> > 
> > T=1: 0m13.911s
> > T=2: 0m20.730s
> > T=3: 0m21.474s
> > T=4: 0m27.952s
> > T=5: 0m26.286s
> > T=6: 0m27.855s
> > T=7: 0m29.695s
> > 
> > With runqueue locks:
> > 
> > T=1: 0m15.802s
> > T=2: 0m22.484s
> > T=3: 0m24.751s
> > T=4: 0m29.134s
> > T=5: 0m30.094s
> > T=6: 0m33.090s
> > T=7: 0m33.897s
> > 
> > So on 8 cores, taking spinlocks for each of the 8 runqueues adds about
> > 15% overhead when doing an IPI to 1 thread. Therefore, that won't be
> > pretty on 128+-core machines.
> 
> But isn't the bulk of the overhead the IPIs rather than the runqueue
> locks?
> 
>      W/out RQ       W/RQ   % degradation
fix:
       W/out RQ       W/RQ   ratio
> T=1:    13.91      15.8    1.14
> T=2:    20.73      22.48   1.08
> T=3:    21.47      24.75   1.15
> T=4:    27.95      29.13   1.04
> T=5:    26.29      30.09   1.14
> T=6:    27.86      33.09   1.19
> T=7:    29.7       33.9    1.14

These numbers tell you that the degradation is roughly constant as we
add more threads (let's consider 1 thread per core, 1 IPI per thread,
with active threads). It is all run on an 8-core system with all cpus
active. As we increase the number of IPIs (e.g. T=2 -> T=7) we add 9s,
for 1.8s/IPI (always for 10,000,000 sys_membarrier() calls), for an
added 180 ns/core per call. (note: T=1 is a special-case, as I do not
allocate any cpumask.)

Using the spinlocks adds about 3s for 10,000,000 sys_membarrier() calls
on an 8-core system, for an added 300 ns/core per call.

So the overhead of taking the task lock is about twice as high, per core,
as the overhead of the IPIs. This is understandable if the
architecture does an IPI broadcast: the scalability problem then boils
down to exchanging cache lines to inform the IPI sender that the other
cpus have completed. An atomic operation exchanging a cache-line would
be expected to be within the irqoff+spinlock+spinunlock+irqon overhead.

> 
> So if we had lots of CPUs, we might want to fan the IPIs out through
> intermediate CPUs in a tree fashion, but the runqueue locks are not
> causing excessive pain.

A tree hierarchy may not be useful for sending the IPIs (as, hopefully,
they can be broadcast pretty efficiently), but could be
useful when waiting for the IPIs to complete efficiently.

> 
> How does this compare to use of POSIX signals?  Never mind, POSIX
> signals are arbitrarily bad if you have way more threads than are
> actually running at the time...

POSIX signals to all threads are terrible in that they require waking
up all those threads. I have not even thought it useful to compare
these two approaches with benchmarks yet (I'll do that when the
sys_membarrier() support is implemented in liburcu).

Thanks,

Mathieu

> 
> 							Thanx, Paul

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier
  2010-01-09  2:38                         ` Mathieu Desnoyers
@ 2010-01-09  5:42                           ` Paul E. McKenney
  2010-01-09 19:20                             ` Mathieu Desnoyers
  0 siblings, 1 reply; 107+ messages in thread
From: Paul E. McKenney @ 2010-01-09  5:42 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Steven Rostedt, Oleg Nesterov, Peter Zijlstra, linux-kernel,
	Ingo Molnar, akpm, josh, tglx, Valdis.Kletnieks, dhowells, laijs,
	dipankar

On Fri, Jan 08, 2010 at 09:38:42PM -0500, Mathieu Desnoyers wrote:
> * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > On Fri, Jan 08, 2010 at 08:02:31PM -0500, Mathieu Desnoyers wrote:
> > > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > > > On Fri, Jan 08, 2010 at 06:53:38PM -0500, Mathieu Desnoyers wrote:
> > > > > * Steven Rostedt (rostedt@goodmis.org) wrote:
> > > > > > Well, if we just grab the task_rq(task)->lock here, then we should be
> > > > > > OK? We would guarantee that curr is either the task we want or not.
> > > > > 
> > > > > Hrm, I just tested it, and there seems to be a significant performance
> > > > > penality involved with taking these locks for each CPU, even with just 8
> > > > > cores. So if we can do without the locks, that would be preferred.
> > > > 
> > > > How significant?  Factor of two?  Two orders of magnitude?
> > > > 
> > > 
> > > On a 8-core Intel Xeon (T is the number of threads receiving the IPIs):
> > > 
> > > Without runqueue locks:
> > > 
> > > T=1: 0m13.911s
> > > T=2: 0m20.730s
> > > T=3: 0m21.474s
> > > T=4: 0m27.952s
> > > T=5: 0m26.286s
> > > T=6: 0m27.855s
> > > T=7: 0m29.695s
> > > 
> > > With runqueue locks:
> > > 
> > > T=1: 0m15.802s
> > > T=2: 0m22.484s
> > > T=3: 0m24.751s
> > > T=4: 0m29.134s
> > > T=5: 0m30.094s
> > > T=6: 0m33.090s
> > > T=7: 0m33.897s
> > > 
> > > So on 8 cores, taking spinlocks for each of the 8 runqueues adds about
> > > 15% overhead when doing an IPI to 1 thread. Therefore, that won't be
> > > pretty on 128+-core machines.
> > 
> > But isn't the bulk of the overhead the IPIs rather than the runqueue
> > locks?
> > 
> >      W/out RQ       W/RQ   % degradation
> fix:
>        W/out RQ       W/RQ   ratio
> > T=1:    13.91      15.8    1.14
> > T=2:    20.73      22.48   1.08
> > T=3:    21.47      24.75   1.15
> > T=4:    27.95      29.13   1.04
> > T=5:    26.29      30.09   1.14
> > T=6:    27.86      33.09   1.19
> > T=7:    29.7       33.9    1.14
> 
> These numbers tell you that the degradation is roughly constant as we
> add more threads (let's consider 1 thread per core, 1 IPI per thread,
> with active threads). It is all run on a 8-core system will all cpus
> active. As we increase the number of IPIs (e.g. T=2 -> T=7) we add 9s,
> for 1.8s/IPI (always for 10,000,000 sys_membarrier() calls), for an
> added 180 ns/core per call. (note: T=1 is a special-case, as I do not
> allocate any cpumask.)
> 
> Using the spinlocks adds about 3s for 10,000,000 sys_membarrier() calls
> or a 8-core system, for an added 300 ns/core per call.
> 
> So the overhead of taking the task lock is about twice higher, per core,
> than the overhead of the IPIs. This is understandable if the
> architecture does an IPI broadcast: the scalability problem then boils
> down to exchange cache-lines to inform the ipi sender that the other
> cpus have completed. An atomic operation exchanging a cache-line would
> be expected to be within the irqoff+spinlock+spinunlock+irqon overhead.

Let me rephrase the question...  Isn't the vast bulk of the overhead
something other than the runqueue spinlocks?

> > So if we had lots of CPUs, we might want to fan the IPIs out through
> > intermediate CPUs in a tree fashion, but the runqueue locks are not
> > causing excessive pain.
> 
> A tree hierarchy may not be useful for sending the IPIs (as, hopefully,
> they can be broadcasted pretty efficiciently), but however could be
> useful when waiting for the IPIs to complete efficiently.

OK, given that you precompute the CPU mask, you might be able to take
advantage of hardware broadcast, on architectures having it.

> > How does this compare to use of POSIX signals?  Never mind, POSIX
> > signals are arbitrarily bad if you have way more threads than are
> > actually running at the time...
> 
> POSIX signals to all threads are terrible in that they require to wake
> up all those threads. I have not even thought it useful to compare
> these two approaches with benchmarks yet (I'll do that when the
> sys_membarrier() support is implemented in liburcu).

It would be of some interest.  I bet that the runqueue spinlock overhead
is -way- down in the noise by comparison to POSIX signals, even when all
the threads are running.  ;-)

							Thanx, Paul

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier
  2010-01-09  5:42                           ` Paul E. McKenney
@ 2010-01-09 19:20                             ` Mathieu Desnoyers
  2010-01-09 23:05                               ` Steven Rostedt
  2010-01-09 23:59                               ` Paul E. McKenney
  0 siblings, 2 replies; 107+ messages in thread
From: Mathieu Desnoyers @ 2010-01-09 19:20 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Steven Rostedt, Oleg Nesterov, Peter Zijlstra, linux-kernel,
	Ingo Molnar, akpm, josh, tglx, Valdis.Kletnieks, dhowells, laijs,
	dipankar

* Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> On Fri, Jan 08, 2010 at 09:38:42PM -0500, Mathieu Desnoyers wrote:
> > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > > On Fri, Jan 08, 2010 at 08:02:31PM -0500, Mathieu Desnoyers wrote:
> > > > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > > > > On Fri, Jan 08, 2010 at 06:53:38PM -0500, Mathieu Desnoyers wrote:
> > > > > > * Steven Rostedt (rostedt@goodmis.org) wrote:
> > > > > > > Well, if we just grab the task_rq(task)->lock here, then we should be
> > > > > > > OK? We would guarantee that curr is either the task we want or not.
> > > > > > 
> > > > > > Hrm, I just tested it, and there seems to be a significant performance
> > > > > > penality involved with taking these locks for each CPU, even with just 8
> > > > > > cores. So if we can do without the locks, that would be preferred.
> > > > > 
> > > > > How significant?  Factor of two?  Two orders of magnitude?
> > > > > 
> > > > 
> > > > On a 8-core Intel Xeon (T is the number of threads receiving the IPIs):
> > > > 
> > > > Without runqueue locks:
> > > > 
> > > > T=1: 0m13.911s
> > > > T=2: 0m20.730s
> > > > T=3: 0m21.474s
> > > > T=4: 0m27.952s
> > > > T=5: 0m26.286s
> > > > T=6: 0m27.855s
> > > > T=7: 0m29.695s
> > > > 
> > > > With runqueue locks:
> > > > 
> > > > T=1: 0m15.802s
> > > > T=2: 0m22.484s
> > > > T=3: 0m24.751s
> > > > T=4: 0m29.134s
> > > > T=5: 0m30.094s
> > > > T=6: 0m33.090s
> > > > T=7: 0m33.897s
> > > > 
> > > > So on 8 cores, taking spinlocks for each of the 8 runqueues adds about
> > > > 15% overhead when doing an IPI to 1 thread. Therefore, that won't be
> > > > pretty on 128+-core machines.
> > > 
> > > But isn't the bulk of the overhead the IPIs rather than the runqueue
> > > locks?
> > > 
> > >      W/out RQ       W/RQ   % degradation
> > fix:
> >        W/out RQ       W/RQ   ratio
> > > T=1:    13.91      15.8    1.14
> > > T=2:    20.73      22.48   1.08
> > > T=3:    21.47      24.75   1.15
> > > T=4:    27.95      29.13   1.04
> > > T=5:    26.29      30.09   1.14
> > > T=6:    27.86      33.09   1.19
> > > T=7:    29.7       33.9    1.14
> > 
> > These numbers tell you that the degradation is roughly constant as we
> > add more threads (let's consider 1 thread per core, 1 IPI per thread,
> > with active threads). It is all run on a 8-core system will all cpus
> > active. As we increase the number of IPIs (e.g. T=2 -> T=7) we add 9s,
> > for 1.8s/IPI (always for 10,000,000 sys_membarrier() calls), for an
> > added 180 ns/core per call. (note: T=1 is a special-case, as I do not
> > allocate any cpumask.)
> > 
> > Using the spinlocks adds about 3s for 10,000,000 sys_membarrier() calls
> > or a 8-core system, for an added 300 ns/core per call.
> > 
> > So the overhead of taking the task lock is about twice higher, per core,
> > than the overhead of the IPIs. This is understandable if the
> > architecture does an IPI broadcast: the scalability problem then boils
> > down to exchange cache-lines to inform the ipi sender that the other
> > cpus have completed. An atomic operation exchanging a cache-line would
> > be expected to be within the irqoff+spinlock+spinunlock+irqon overhead.
> 
> Let me rephrase the question...  Isn't the vast bulk of the overhead
> something other than the runqueue spinlocks?

I don't think so. What we have here is:

O(1)
- a system call
- cpumask allocation
- IPI broadcast

O(nr cpus)
- wait for IPI handlers to complete
- runqueue spinlocks

The O(1) operations seem to be about 5x slower than the combined
O(nr cpus) wait and spinlock operations, but this only means that as
soon as we have 8 cores, the bulk of the overhead sits in the runqueue
spinlocks (if we have to take them).

If we don't take spinlocks, then we can go up to 16 cores before the
bulk of the overhead starts to be the "wait for IPI handlers to
complete" phase. As you pointed out, we could turn this wait phase into
a tree hierarchy. However, we cannot do this with the spinlocks, as they
have to be taken for the initial cpumask iteration.

Therefore, if we don't have to take those spinlocks, we can get a very
significant reduction of this system call's overhead, especially on
large systems. Not taking the spinlocks here also allows us to use a
tree hierarchy to turn the bulk of the scalability overhead (waiting
for IPI handlers to complete) into an O(log(nb cpus)) complexity order,
which is quite interesting.
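
To make the breakdown above concrete, here is roughly the shape of the
runqueue-lock variant being measured (sketch only, not the actual
patch: it assumes it lives in kernel/sched.c where struct rq and
cpu_rq() are visible, and the membarrier_ipi() name and error handling
are simplified):

static void membarrier_ipi(void *unused)
{
	smp_mb();	/* the barrier we want on every active thread */
}

SYSCALL_DEFINE0(membarrier)
{
	cpumask_var_t tmpmask;
	int cpu;

	/* O(1): system call entry + cpumask allocation. */
	if (!zalloc_cpumask_var(&tmpmask, GFP_KERNEL))
		return -ENOMEM;

	/* O(nr active threads): sample ->curr under each runqueue lock. */
	for_each_cpu(cpu, mm_cpumask(current->mm)) {
		struct rq *rq = cpu_rq(cpu);
		unsigned long flags;

		spin_lock_irqsave(&rq->lock, flags);
		if (rq->curr->mm == current->mm)
			cpumask_set_cpu(cpu, tmpmask);
		spin_unlock_irqrestore(&rq->lock, flags);
	}

	/* O(1) broadcast, then O(nr active threads) completion wait. */
	preempt_disable();
	smp_call_function_many(tmpmask, membarrier_ipi, NULL, 1);
	preempt_enable();

	smp_mb();	/* order the calling thread's own accesses */
	free_cpumask_var(tmpmask);
	return 0;
}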

> 
> > > So if we had lots of CPUs, we might want to fan the IPIs out through
> > > intermediate CPUs in a tree fashion, but the runqueue locks are not
> > > causing excessive pain.
> > 
> > A tree hierarchy may not be useful for sending the IPIs (as, hopefully,
> > they can be broadcasted pretty efficiciently), but however could be
> > useful when waiting for the IPIs to complete efficiently.
> 
> OK, given that you precompute the CPU mask, you might be able to take
> advantage of hardware broadcast, on architectures having it.
> 
> > > How does this compare to use of POSIX signals?  Never mind, POSIX
> > > signals are arbitrarily bad if you have way more threads than are
> > > actually running at the time...
> > 
> > POSIX signals to all threads are terrible in that they require to wake
> > up all those threads. I have not even thought it useful to compare
> > these two approaches with benchmarks yet (I'll do that when the
> > sys_membarrier() support is implemented in liburcu).
> 
> It would be of some interest.  I bet that the runqueue spinlock overhead
> is -way- down in the noise by comparison to POSIX signals, even when all
> the threads are running.  ;-)
> 

For 1,000,000 iterations, sending signals to execute a remote mb and
waiting for it to complete:

T=1: 0m3.107s
T=2: 0m5.772s
T=3: 0m8.662s
T=4: 0m12.239s
T=5: 0m16.213s
T=6: 0m19.482s
T=7: 0m23.227s

So, per iteration:

T=1: 3107 ns
T=2: 5772 ns
T=3: 8662 ns
T=4: 12239 ns
T=5: 16213 ns
T=6: 19482 ns
T=7: 23227 ns

That is an added 3000 ns per extra thread. So, yes, the added
300 ns/core for spinlocks is almost lost in the noise compared to the
signal-based solution. But the old solution behaving so poorly does not
make it a good yardstick for deciding what is noise and what is not in
the current implementation. Looking at where the scalability
bottlenecks are, and at what is noise within the current implementation
itself, seems like a more appropriate way to design an efficient system
call.
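
(For reference, a minimal userspace sketch of the kind of signal-based
test measured above; the real liburcu code differs, and the signal
number, the readers[] registry and the busy-wait loop are assumptions
of the sketch.)

#include <pthread.h>
#include <signal.h>
#include <stdatomic.h>

#define MB_SIGNAL	SIGUSR1	/* assumed; liburcu picks its own */
#define NR_READERS	7	/* "T" in the tables above */

static pthread_t readers[NR_READERS];	/* filled in at thread creation */
static atomic_int acks_pending;

static void mb_sighandler(int sig)
{
	(void)sig;
	atomic_thread_fence(memory_order_seq_cst);	/* the remote mb */
	atomic_fetch_sub(&acks_pending, 1);		/* acknowledge */
}

static void mb_setup(void)
{
	struct sigaction sa = { .sa_handler = mb_sighandler };

	sigemptyset(&sa.sa_mask);
	sigaction(MB_SIGNAL, &sa, NULL);
}

/* One benchmark iteration: force a barrier on every reader thread. */
static void force_mb_all_threads(void)
{
	int i;

	atomic_store(&acks_pending, NR_READERS);
	for (i = 0; i < NR_READERS; i++)
		pthread_kill(readers[i], MB_SIGNAL);	/* wakes the thread */
	while (atomic_load(&acks_pending) > 0)
		;	/* spin until every handler has acknowledged */
}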

So, all in all, we can expect around a 6.25-fold improvement in
per-core overhead if we use the spinlocks (480 ns/core vs
3000 ns/core), but if we don't take the runqueue spinlocks
(180 ns/core), then we are contemplating a 16.7-fold improvement. And
this is without considering a tree hierarchy for waiting for the IPIs
to complete, which would additionally change the order of the
scalability overhead from O(n) to O(log(n)).

Thanks,

Mathieu

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier
  2010-01-09 19:20                             ` Mathieu Desnoyers
@ 2010-01-09 23:05                               ` Steven Rostedt
  2010-01-09 23:16                                 ` Steven Rostedt
  2010-01-10  1:01                                 ` Mathieu Desnoyers
  2010-01-09 23:59                               ` Paul E. McKenney
  1 sibling, 2 replies; 107+ messages in thread
From: Steven Rostedt @ 2010-01-09 23:05 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Paul E. McKenney, Oleg Nesterov, Peter Zijlstra, linux-kernel,
	Ingo Molnar, akpm, josh, tglx, Valdis.Kletnieks, dhowells, laijs,
	dipankar

On Sat, 2010-01-09 at 14:20 -0500, Mathieu Desnoyers wrote:

> > > Using the spinlocks adds about 3s for 10,000,000 sys_membarrier() calls
> > > or a 8-core system, for an added 300 ns/core per call.
> > > 
> > > So the overhead of taking the task lock is about twice higher, per core,
> > > than the overhead of the IPIs. This is understandable if the
> > > architecture does an IPI broadcast: the scalability problem then boils
> > > down to exchange cache-lines to inform the ipi sender that the other
> > > cpus have completed. An atomic operation exchanging a cache-line would
> > > be expected to be within the irqoff+spinlock+spinunlock+irqon overhead.
> > 
> > Let me rephrase the question...  Isn't the vast bulk of the overhead
> > something other than the runqueue spinlocks?
> 
> I don't think so. What we have here is:
> 
> O(1)
> - a system call
> - cpumask allocation
> - IPI broadcast

> O(nr cpus)

Isn't this really O(tasks)?

Don't you take spin_lock(&task_rq(task)->lock) for each task?

So the scaling is not with the size of the box, but with the number of
tasks that must be checked. Still, if you have 1000 threads, an rcu
writer is bound to take a bit of overhead. But the advantage is that
the readers are still fast.

RCU is known to be slow for writing. A user must be aware of this.

Then we should have O(tasks) for spinlocks taken, and 
O(min(tasks, CPUS)) for IPIs.

cpumask = 0;
foreach task {
	/* runqueue lock keeps ->curr stable while we sample it */
	spin_lock(&task_rq(task)->lock);
	if (task_rq(task)->curr == task)
		cpu_set(task_cpu(task), cpumask);
	spin_unlock(&task_rq(task)->lock);
}
send_ipi(cpumask);

-- Steve


> - wait for IPI handlers to complete
> - runqueue spinlocks



^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier
  2010-01-09 23:05                               ` Steven Rostedt
@ 2010-01-09 23:16                                 ` Steven Rostedt
  2010-01-10  0:03                                   ` Paul E. McKenney
  2010-01-10  1:04                                   ` Mathieu Desnoyers
  2010-01-10  1:01                                 ` Mathieu Desnoyers
  1 sibling, 2 replies; 107+ messages in thread
From: Steven Rostedt @ 2010-01-09 23:16 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Paul E. McKenney, Oleg Nesterov, Peter Zijlstra, linux-kernel,
	Ingo Molnar, akpm, josh, tglx, Valdis.Kletnieks, dhowells, laijs,
	dipankar

On Sat, 2010-01-09 at 18:05 -0500, Steven Rostedt wrote:

> Then we should have O(tasks) for spinlocks taken, and 
> O(min(tasks, CPUS)) for IPIs.
> 

And for nr tasks >> CPUS, this may help too:

> cpumask = 0;
> foreach task {

	if (cpumask == online_cpus)
		break;

> 	spin_lock(task_rq(task)->rq->lock);
> 	if (task_rq(task)->curr == task)
> 		cpu_set(task_cpu(task), cpumask);
> 	spin_unlock(task_rq(task)->rq->lock);
> }
> send_ipi(cpumask);
> 

-- Steve



^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier
  2010-01-09 19:20                             ` Mathieu Desnoyers
  2010-01-09 23:05                               ` Steven Rostedt
@ 2010-01-09 23:59                               ` Paul E. McKenney
  2010-01-10  1:11                                 ` Mathieu Desnoyers
  1 sibling, 1 reply; 107+ messages in thread
From: Paul E. McKenney @ 2010-01-09 23:59 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Steven Rostedt, Oleg Nesterov, Peter Zijlstra, linux-kernel,
	Ingo Molnar, akpm, josh, tglx, Valdis.Kletnieks, dhowells, laijs,
	dipankar

On Sat, Jan 09, 2010 at 02:20:06PM -0500, Mathieu Desnoyers wrote:
> * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > On Fri, Jan 08, 2010 at 09:38:42PM -0500, Mathieu Desnoyers wrote:
> > > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > > > On Fri, Jan 08, 2010 at 08:02:31PM -0500, Mathieu Desnoyers wrote:
> > > > > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > > > > > On Fri, Jan 08, 2010 at 06:53:38PM -0500, Mathieu Desnoyers wrote:
> > > > > > > * Steven Rostedt (rostedt@goodmis.org) wrote:
> > > > > > > > Well, if we just grab the task_rq(task)->lock here, then we should be
> > > > > > > > OK? We would guarantee that curr is either the task we want or not.
> > > > > > > 
> > > > > > > Hrm, I just tested it, and there seems to be a significant performance
> > > > > > > penality involved with taking these locks for each CPU, even with just 8
> > > > > > > cores. So if we can do without the locks, that would be preferred.
> > > > > > 
> > > > > > How significant?  Factor of two?  Two orders of magnitude?
> > > > > > 
> > > > > 
> > > > > On a 8-core Intel Xeon (T is the number of threads receiving the IPIs):
> > > > > 
> > > > > Without runqueue locks:
> > > > > 
> > > > > T=1: 0m13.911s
> > > > > T=2: 0m20.730s
> > > > > T=3: 0m21.474s
> > > > > T=4: 0m27.952s
> > > > > T=5: 0m26.286s
> > > > > T=6: 0m27.855s
> > > > > T=7: 0m29.695s
> > > > > 
> > > > > With runqueue locks:
> > > > > 
> > > > > T=1: 0m15.802s
> > > > > T=2: 0m22.484s
> > > > > T=3: 0m24.751s
> > > > > T=4: 0m29.134s
> > > > > T=5: 0m30.094s
> > > > > T=6: 0m33.090s
> > > > > T=7: 0m33.897s
> > > > > 
> > > > > So on 8 cores, taking spinlocks for each of the 8 runqueues adds about
> > > > > 15% overhead when doing an IPI to 1 thread. Therefore, that won't be
> > > > > pretty on 128+-core machines.
> > > > 
> > > > But isn't the bulk of the overhead the IPIs rather than the runqueue
> > > > locks?
> > > > 
> > > >      W/out RQ       W/RQ   % degradation
> > > fix:
> > >        W/out RQ       W/RQ   ratio
> > > > T=1:    13.91      15.8    1.14
> > > > T=2:    20.73      22.48   1.08
> > > > T=3:    21.47      24.75   1.15
> > > > T=4:    27.95      29.13   1.04
> > > > T=5:    26.29      30.09   1.14
> > > > T=6:    27.86      33.09   1.19
> > > > T=7:    29.7       33.9    1.14
> > > 
> > > These numbers tell you that the degradation is roughly constant as we
> > > add more threads (let's consider 1 thread per core, 1 IPI per thread,
> > > with active threads). It is all run on a 8-core system will all cpus
> > > active. As we increase the number of IPIs (e.g. T=2 -> T=7) we add 9s,
> > > for 1.8s/IPI (always for 10,000,000 sys_membarrier() calls), for an
> > > added 180 ns/core per call. (note: T=1 is a special-case, as I do not
> > > allocate any cpumask.)
> > > 
> > > Using the spinlocks adds about 3s for 10,000,000 sys_membarrier() calls
> > > or a 8-core system, for an added 300 ns/core per call.
> > > 
> > > So the overhead of taking the task lock is about twice higher, per core,
> > > than the overhead of the IPIs. This is understandable if the
> > > architecture does an IPI broadcast: the scalability problem then boils
> > > down to exchange cache-lines to inform the ipi sender that the other
> > > cpus have completed. An atomic operation exchanging a cache-line would
> > > be expected to be within the irqoff+spinlock+spinunlock+irqon overhead.
> > 
> > Let me rephrase the question...  Isn't the vast bulk of the overhead
> > something other than the runqueue spinlocks?
> 
> I don't think so. What we have here is:
> 
> O(1)
> - a system call
> - cpumask allocation
> - IPI broadcast
> 
> O(nr cpus)
> - wait for IPI handlers to complete
> - runqueue spinlocks
> 
> The O(1) operations seems to be about 5x slower than the combined
> O(nr cpus) wait and spinlock operations, but this only means that as
> soon as we have 8 cores, then the bulk of the overhead sits in the
> runqueue spinlock (if we have to take them).
> 
> If we don't take spinlocks, then we can go up to 16 cores before the
> bulk of the overhead starts to be the "wait for IPI handlers to
> complete" phase. As you pointed out, we could turn this wait phase into
> a tree hierarchy. However, we cannot do this with the spinlocks, as they
> have to be taken for the initial cpumask iteration.
> 
> Therefore, if we don't have to take those spinlocks, we can have a very
> significant gain over this system call overhead, especially on large
> systems. Not taking spinlocks here allows us to use a tree hierarchy to
> turn the bulk of the scalability overhead (waiting for IPI handlers to
> complete) into a O(log(nb cpus)) complexity order, which is quite
> interesting.

All this would sound plausible, except that the ratio of overheads
between the runqueue-lock and non-runqueue-lock versions does not vary
much from one to seven CPUs.  ;-)

And please keep in mind that this operation happens on the URCU slowpath.
Further, there may be opportunities for much larger savings by batching
grace-period requests, given that a single grace period can serve an
arbitrarily large number of synchronize_rcu() requests.

> > > > So if we had lots of CPUs, we might want to fan the IPIs out through
> > > > intermediate CPUs in a tree fashion, but the runqueue locks are not
> > > > causing excessive pain.
> > > 
> > > A tree hierarchy may not be useful for sending the IPIs (as, hopefully,
> > > they can be broadcasted pretty efficiciently), but however could be
> > > useful when waiting for the IPIs to complete efficiently.
> > 
> > OK, given that you precompute the CPU mask, you might be able to take
> > advantage of hardware broadcast, on architectures having it.
> > 
> > > > How does this compare to use of POSIX signals?  Never mind, POSIX
> > > > signals are arbitrarily bad if you have way more threads than are
> > > > actually running at the time...
> > > 
> > > POSIX signals to all threads are terrible in that they require to wake
> > > up all those threads. I have not even thought it useful to compare
> > > these two approaches with benchmarks yet (I'll do that when the
> > > sys_membarrier() support is implemented in liburcu).
> > 
> > It would be of some interest.  I bet that the runqueue spinlock overhead
> > is -way- down in the noise by comparison to POSIX signals, even when all
> > the threads are running.  ;-)
> 
> For 1,000,000 iterations, sending signals to execute a remote mb and
> waiting for it to complete:

Adding the previous results for comparison, and please keep in mind the
need to multiply the left-hand column by ten before comparing to the
right-hand column:

>                              W/out RQ       W/RQ   ratio
> T=1: 0m3.107s           T=1:    13.91      15.8    1.14
> T=2: 0m5.772s           T=2:    20.73      22.48   1.08
> T=3: 0m8.662s           T=3:    21.47      24.75   1.15
> T=4: 0m12.239s          T=4:    27.95      29.13   1.04
> T=5: 0m16.213s          T=5:    26.29      30.09   1.14
> T=6: 0m19.482s          T=6:    27.86      33.09   1.19
> T=7: 0m23.227s          T=7:    29.7       33.9    1.14

So sys_membarrier() is roughly a factor of two better than POSIX signals
for a single thread, rising to not quite a factor of eight for seven
threads.  And this data -does- support the notion that POSIX signals
get increasingly worse with increasing numbers of threads, as one
would expect.

> So, per iteration:
> 
> T=1: 3107 ns
> T=2: 5772 ns
> T=3: 8662 ns
> T=4: 12239 ns
> T=5: 16213 ns
> T=6: 19482 ns
> T=7: 23227 ns
> 
> For an added 3000 ns per extra thread. So, yes, the added 300 ns/core
> for spinlocks is almost lost in the noise compared to the signal-based
> solution, but it's not because the old solution was behaving so poorly
> that we can rely on it to say what is noise vs not in the current
> implementation. Looking at what the scalability bottlenecks are, and
> looking at what is noise within the current implementation seems like
> a more appropriate way to design an efficient system call.

I agree that your measurements show a marked improvement compared to
POSIX signals that increases with increasing numbers of threads, again,
as expected.

> So all in all, we can expect around 6.25-fold improvement because we
> diminish the per-core overhead if we use the spinlocks (480 ns/core vs
> 3000 ns/core), but if we don't take the runqueue spinlocks (180
> ns/core), then we are contemplating a 16.7-fold improvement. And this is
> without considering a tree-hierarchy for waiting for IPIs to complete,
> which would additionally change the order of the scalability overhead
> from O(n) to O(log(n)).

Eh?  You should be able to use the locks to accumulate a cpumask, then
use whatever you want, including a tree hierarchy, to both send and wait
for the IPIs, right?

Keep in mind that the runqueue locks just force memory ordering on the
sampling.  It should not be necessary to hold them while sending the
IPIs.  Or am I missing something?

							Thanx, Paul

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier
  2010-01-09 23:16                                 ` Steven Rostedt
@ 2010-01-10  0:03                                   ` Paul E. McKenney
  2010-01-10  0:41                                     ` Steven Rostedt
  2010-01-10  1:12                                     ` Mathieu Desnoyers
  2010-01-10  1:04                                   ` Mathieu Desnoyers
  1 sibling, 2 replies; 107+ messages in thread
From: Paul E. McKenney @ 2010-01-10  0:03 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Mathieu Desnoyers, Oleg Nesterov, Peter Zijlstra, linux-kernel,
	Ingo Molnar, akpm, josh, tglx, Valdis.Kletnieks, dhowells, laijs,
	dipankar

On Sat, Jan 09, 2010 at 06:16:40PM -0500, Steven Rostedt wrote:
> On Sat, 2010-01-09 at 18:05 -0500, Steven Rostedt wrote:
> 
> > Then we should have O(tasks) for spinlocks taken, and 
> > O(min(tasks, CPUS)) for IPIs.
> 
> And for nr tasks >> CPUS, this may help too:
> 
> > cpumask = 0;
> > foreach task {
> 
> 	if (cpumask == online_cpus)
> 		break;
> 
> > 	spin_lock(task_rq(task)->rq->lock);
> > 	if (task_rq(task)->curr == task)
> > 		cpu_set(task_cpu(task), cpumask);
> > 	spin_unlock(task_rq(task)->rq->lock);
> > }
> > send_ipi(cpumask);

Good point, erring on the side of sending too many IPIs is safe.  One
might even be able to just send the full set if enough of the CPUs were
running the current process and none of the remainder were running
real-time threads.  And yes, it would then be necessary to throttle
calls to sys_membarrier().

Quickly hiding behind a suitable boulder...  ;-)

							Thanx, Paul

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier
  2010-01-10  0:03                                   ` Paul E. McKenney
@ 2010-01-10  0:41                                     ` Steven Rostedt
  2010-01-10  1:14                                       ` Mathieu Desnoyers
  2010-01-10  1:44                                       ` Mathieu Desnoyers
  2010-01-10  1:12                                     ` Mathieu Desnoyers
  1 sibling, 2 replies; 107+ messages in thread
From: Steven Rostedt @ 2010-01-10  0:41 UTC (permalink / raw)
  To: paulmck
  Cc: Mathieu Desnoyers, Oleg Nesterov, Peter Zijlstra, linux-kernel,
	Ingo Molnar, akpm, josh, tglx, Valdis.Kletnieks, dhowells, laijs,
	dipankar

On Sat, 2010-01-09 at 16:03 -0800, Paul E. McKenney wrote:
> On Sat, Jan 09, 2010 at 06:16:40PM -0500, Steven Rostedt wrote:
> > On Sat, 2010-01-09 at 18:05 -0500, Steven Rostedt wrote:
> > 
> > > Then we should have O(tasks) for spinlocks taken, and 
> > > O(min(tasks, CPUS)) for IPIs.
> > 
> > And for nr tasks >> CPUS, this may help too:
> > 
> > > cpumask = 0;
> > > foreach task {
> > 
> > 	if (cpumask == online_cpus)
> > 		break;
> > 
> > > 	spin_lock(task_rq(task)->rq->lock);
> > > 	if (task_rq(task)->curr == task)
> > > 		cpu_set(task_cpu(task), cpumask);
> > > 	spin_unlock(task_rq(task)->rq->lock);
> > > }
> > > send_ipi(cpumask);
> 
> Good point, erring on the side of sending too many IPIs is safe.  One
> might even be able to just send the full set if enough of the CPUs were
> running the current process and none of the remainder were running
> real-time threads.  And yes, it would then be necessary to throttle
> calls to sys_membarrier().
> 

If you need to throttle calls to sys_membarrier(), then why bother
optimizing it? Again, this is like calling synchronize_sched() in the
kernel, which is a very heavy operation and should only be called by
code that is not performance critical.

Why are we struggling so much with optimizing the slow path?

Here's how I take it. This method is much better than sending signals to
all threads. The advantage sys_membarrier() gives us is a way to keep
the userspace rcu_read_lock()s barrier free, which means that
rcu_read_lock()s are quick and scale well.

So what if we have a linear decrease in performance with the number of
threads on the write side?

-- Steve



^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier
  2010-01-09 23:05                               ` Steven Rostedt
  2010-01-09 23:16                                 ` Steven Rostedt
@ 2010-01-10  1:01                                 ` Mathieu Desnoyers
  1 sibling, 0 replies; 107+ messages in thread
From: Mathieu Desnoyers @ 2010-01-10  1:01 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Paul E. McKenney, Oleg Nesterov, Peter Zijlstra, linux-kernel,
	Ingo Molnar, akpm, josh, tglx, Valdis.Kletnieks, dhowells, laijs,
	dipankar

* Steven Rostedt (rostedt@goodmis.org) wrote:
> On Sat, 2010-01-09 at 14:20 -0500, Mathieu Desnoyers wrote:
> 
> > > > Using the spinlocks adds about 3s for 10,000,000 sys_membarrier() calls
> > > > or a 8-core system, for an added 300 ns/core per call.
> > > > 
> > > > So the overhead of taking the task lock is about twice higher, per core,
> > > > than the overhead of the IPIs. This is understandable if the
> > > > architecture does an IPI broadcast: the scalability problem then boils
> > > > down to exchange cache-lines to inform the ipi sender that the other
> > > > cpus have completed. An atomic operation exchanging a cache-line would
> > > > be expected to be within the irqoff+spinlock+spinunlock+irqon overhead.
> > > 
> > > Let me rephrase the question...  Isn't the vast bulk of the overhead
> > > something other than the runqueue spinlocks?
> > 
> > I don't think so. What we have here is:
> > 
> > O(1)
> > - a system call
> > - cpumask allocation
> > - IPI broadcast
> 
> > O(nr cpus)
> 
> Isn't this really O(tasks) ?

Yes, you are right. The iteration is done with:

for_each_cpu(cpu, mm_cpumask(current->mm))

which is bounded by the number of threads in the process.

> 
> Don't you do the spinlock(task_rq(task)->rq->lock)?

Within this loop, I check with cpu_curr(cpu)->mm

So, really, it's O(min(nr threads, nr cpus)), which could be translated
into O(nr active threads).

> 
> So the scale is not with large boxes, but the number of tasks that must
> be checked. Still, if you have 1000 threads, a rcu writer is bound to
> take a bit of overhead. But the advantage is the readers are still fast.

Yep.

> 
> RCU is known to be slow for writing. A user must be aware of this.

True. Although the goal of this modification is to ensure that
synchronize_rcu() is not painfully slow and does not involve waking up
all threads, which would have many side-effects on the system (killing
sleep states and so on).

> 
> Then we should have O(tasks) for spinlocks taken, and 
> O(min(tasks, CPUS)) for IPIs.

We actually have O(nr active threads) for both spinlocks taken and IPI
wait, which is not that bad.

You're starting to convince me to start with something rock-solid, and
wait until there is a need for something faster before we do tighter
coupling with the scheduler memory barriers.

Thanks,

Mathieu

> 
> cpumask = 0;
> foreach task {
> 	spin_lock(task_rq(task)->rq->lock);
> 	if (task_rq(task)->curr == task)
> 		cpu_set(task_cpu(task), cpumask);
> 	spin_unlock(task_rq(task)->rq->lock);
> }
> send_ipi(cpumask);
> 
> -- Steve
> 
> 
> > - wait for IPI handlers to complete
> > - runqueue spinlocks
> 
> 

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier
  2010-01-09 23:16                                 ` Steven Rostedt
  2010-01-10  0:03                                   ` Paul E. McKenney
@ 2010-01-10  1:04                                   ` Mathieu Desnoyers
  1 sibling, 0 replies; 107+ messages in thread
From: Mathieu Desnoyers @ 2010-01-10  1:04 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Paul E. McKenney, Oleg Nesterov, Peter Zijlstra, linux-kernel,
	Ingo Molnar, akpm, josh, tglx, Valdis.Kletnieks, dhowells, laijs,
	dipankar

* Steven Rostedt (rostedt@goodmis.org) wrote:
> On Sat, 2010-01-09 at 18:05 -0500, Steven Rostedt wrote:
> 
> > Then we should have O(tasks) for spinlocks taken, and 
> > O(min(tasks, CPUS)) for IPIs.
> > 
> 
> And for nr tasks >> CPUS, this may help too:
> 
> > cpumask = 0;
> > foreach task {
> 
> 	if (cpumask == online_cpus)
> 		break;

This is not required with for_each_cpu(cpu, mm_cpumask(current->mm)),
because it only iterates over the cpus on which the current process's
threads are currently scheduled.

Thanks,

Mathieu

> 
> > 	spin_lock(task_rq(task)->rq->lock);
> > 	if (task_rq(task)->curr == task)
> > 		cpu_set(task_cpu(task), cpumask);
> > 	spin_unlock(task_rq(task)->rq->lock);
> > }
> > send_ipi(cpumask);
> > 
> 
> -- Steve
> 
> 

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier
  2010-01-09 23:59                               ` Paul E. McKenney
@ 2010-01-10  1:11                                 ` Mathieu Desnoyers
  0 siblings, 0 replies; 107+ messages in thread
From: Mathieu Desnoyers @ 2010-01-10  1:11 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Steven Rostedt, Oleg Nesterov, Peter Zijlstra, linux-kernel,
	Ingo Molnar, akpm, josh, tglx, Valdis.Kletnieks, dhowells, laijs,
	dipankar

* Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> On Sat, Jan 09, 2010 at 02:20:06PM -0500, Mathieu Desnoyers wrote:
> > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > > On Fri, Jan 08, 2010 at 09:38:42PM -0500, Mathieu Desnoyers wrote:
> > > > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > > > > On Fri, Jan 08, 2010 at 08:02:31PM -0500, Mathieu Desnoyers wrote:
> > > > > > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > > > > > > On Fri, Jan 08, 2010 at 06:53:38PM -0500, Mathieu Desnoyers wrote:
> > > > > > > > * Steven Rostedt (rostedt@goodmis.org) wrote:
> > > > > > > > > Well, if we just grab the task_rq(task)->lock here, then we should be
> > > > > > > > > OK? We would guarantee that curr is either the task we want or not.
> > > > > > > > 
> > > > > > > > Hrm, I just tested it, and there seems to be a significant performance
> > > > > > > > penality involved with taking these locks for each CPU, even with just 8
> > > > > > > > cores. So if we can do without the locks, that would be preferred.
> > > > > > > 
> > > > > > > How significant?  Factor of two?  Two orders of magnitude?
> > > > > > > 
> > > > > > 
> > > > > > On a 8-core Intel Xeon (T is the number of threads receiving the IPIs):
> > > > > > 
> > > > > > Without runqueue locks:
> > > > > > 
> > > > > > T=1: 0m13.911s
> > > > > > T=2: 0m20.730s
> > > > > > T=3: 0m21.474s
> > > > > > T=4: 0m27.952s
> > > > > > T=5: 0m26.286s
> > > > > > T=6: 0m27.855s
> > > > > > T=7: 0m29.695s
> > > > > > 
> > > > > > With runqueue locks:
> > > > > > 
> > > > > > T=1: 0m15.802s
> > > > > > T=2: 0m22.484s
> > > > > > T=3: 0m24.751s
> > > > > > T=4: 0m29.134s
> > > > > > T=5: 0m30.094s
> > > > > > T=6: 0m33.090s
> > > > > > T=7: 0m33.897s
> > > > > > 
> > > > > > So on 8 cores, taking spinlocks for each of the 8 runqueues adds about
> > > > > > 15% overhead when doing an IPI to 1 thread. Therefore, that won't be
> > > > > > pretty on 128+-core machines.
> > > > > 
> > > > > But isn't the bulk of the overhead the IPIs rather than the runqueue
> > > > > locks?
> > > > > 
> > > > >      W/out RQ       W/RQ   % degradation
> > > > fix:
> > > >        W/out RQ       W/RQ   ratio
> > > > > T=1:    13.91      15.8    1.14
> > > > > T=2:    20.73      22.48   1.08
> > > > > T=3:    21.47      24.75   1.15
> > > > > T=4:    27.95      29.13   1.04
> > > > > T=5:    26.29      30.09   1.14
> > > > > T=6:    27.86      33.09   1.19
> > > > > T=7:    29.7       33.9    1.14
> > > > 
> > > > These numbers tell you that the degradation is roughly constant as we
> > > > add more threads (let's consider 1 thread per core, 1 IPI per thread,
> > > > with active threads). It is all run on a 8-core system will all cpus
> > > > active. As we increase the number of IPIs (e.g. T=2 -> T=7) we add 9s,
> > > > for 1.8s/IPI (always for 10,000,000 sys_membarrier() calls), for an
> > > > added 180 ns/core per call. (note: T=1 is a special-case, as I do not
> > > > allocate any cpumask.)
> > > > 
> > > > Using the spinlocks adds about 3s for 10,000,000 sys_membarrier() calls
> > > > or a 8-core system, for an added 300 ns/core per call.
> > > > 
> > > > So the overhead of taking the task lock is about twice higher, per core,
> > > > than the overhead of the IPIs. This is understandable if the
> > > > architecture does an IPI broadcast: the scalability problem then boils
> > > > down to exchange cache-lines to inform the ipi sender that the other
> > > > cpus have completed. An atomic operation exchanging a cache-line would
> > > > be expected to be within the irqoff+spinlock+spinunlock+irqon overhead.
> > > 
> > > Let me rephrase the question...  Isn't the vast bulk of the overhead
> > > something other than the runqueue spinlocks?
> > 
> > I don't think so. What we have here is:
> > 
> > O(1)
> > - a system call
> > - cpumask allocation
> > - IPI broadcast
> > 
> > O(nr cpus)
> > - wait for IPI handlers to complete
> > - runqueue spinlocks
> > 
> > The O(1) operations seems to be about 5x slower than the combined
> > O(nr cpus) wait and spinlock operations, but this only means that as
> > soon as we have 8 cores, then the bulk of the overhead sits in the
> > runqueue spinlock (if we have to take them).
> > 
> > If we don't take spinlocks, then we can go up to 16 cores before the
> > bulk of the overhead starts to be the "wait for IPI handlers to
> > complete" phase. As you pointed out, we could turn this wait phase into
> > a tree hierarchy. However, we cannot do this with the spinlocks, as they
> > have to be taken for the initial cpumask iteration.
> > 
> > Therefore, if we don't have to take those spinlocks, we can have a very
> > significant gain over this system call overhead, especially on large
> > systems. Not taking spinlocks here allows us to use a tree hierarchy to
> > turn the bulk of the scalability overhead (waiting for IPI handlers to
> > complete) into a O(log(nb cpus)) complexity order, which is quite
> > interesting.
> 
> All this would sound plausible if it weren't for the ratio of overheads
> for runqueue-lock and non-runqueue-lock versions not varying much from
> one to seven CPUs.  ;-)

Hrm, right. for_each_cpu(cpu, mm_cpumask(current->mm)) only iterates
over (and thus only takes locks for) the cpus running the process's
active threads. So this overhead being constant is a bit unexpected,
unless for some weird reason the mm_cpumask always contains all cpus,
which I doubt.

> 
> And please keep in mind that this operations happens on the URCU slowpath.
> Further, there may be opportunities for much larger savings by batching
> grace-period requests, given that a single grace period can serve an
> arbitrarily large number of synchronize_rcu() requests.

That's right.

> 
> > > > > So if we had lots of CPUs, we might want to fan the IPIs out through
> > > > > intermediate CPUs in a tree fashion, but the runqueue locks are not
> > > > > causing excessive pain.
> > > > 
> > > > A tree hierarchy may not be useful for sending the IPIs (as, hopefully,
> > > > they can be broadcasted pretty efficiciently), but however could be
> > > > useful when waiting for the IPIs to complete efficiently.
> > > 
> > > OK, given that you precompute the CPU mask, you might be able to take
> > > advantage of hardware broadcast, on architectures having it.
> > > 
> > > > > How does this compare to use of POSIX signals?  Never mind, POSIX
> > > > > signals are arbitrarily bad if you have way more threads than are
> > > > > actually running at the time...
> > > > 
> > > > POSIX signals to all threads are terrible in that they require to wake
> > > > up all those threads. I have not even thought it useful to compare
> > > > these two approaches with benchmarks yet (I'll do that when the
> > > > sys_membarrier() support is implemented in liburcu).
> > > 
> > > It would be of some interest.  I bet that the runqueue spinlock overhead
> > > is -way- down in the noise by comparison to POSIX signals, even when all
> > > the threads are running.  ;-)
> > 
> > For 1,000,000 iterations, sending signals to execute a remote mb and
> > waiting for it to complete:
> 
> Adding the previous results for comparison, and please keep in mind the
> need to multiply the left-hand column by ten before comparing to the
> right-hand column:
> 
> >                              W/out RQ       W/RQ   ratio
> > T=1: 0m3.107s           T=1:    13.91      15.8    1.14
> > T=2: 0m5.772s           T=2:    20.73      22.48   1.08
> > T=3: 0m8.662s           T=3:    21.47      24.75   1.15
> > T=4: 0m12.239s          T=4:    27.95      29.13   1.04
> > T=5: 0m16.213s          T=5:    26.29      30.09   1.14
> > T=6: 0m19.482s          T=6:    27.86      33.09   1.19
> > T=7: 0m23.227s          T=7:    29.7       33.9    1.14
> 
> So sys_membarrier() is roughly a factor of two better than POSIX signals
> for a single thread, rising to not quite a factor of eight for seven
> threads.  And this data -does- support the notion that POSIX signals
> get increasingly worse with increasing numbers of threads, as one
> would expect.

Yep.

> 
> > So, per iteration:
> > 
> > T=1: 3107 ns
> > T=2: 5772 ns
> > T=3: 8662 ns
> > T=4: 12239 ns
> > T=5: 16213 ns
> > T=6: 19482 ns
> > T=7: 23227 ns
> > 
> > For an added 3000 ns per extra thread. So, yes, the added 300 ns/core
> > for spinlocks is almost lost in the noise compared to the signal-based
> > solution, but it's not because the old solution was behaving so poorly
> > that we can rely on it to say what is noise vs not in the current
> > implementation. Looking at what the scalability bottlenecks are, and
> > looking at what is noise within the current implementation seems like
> > a more appropriate way to design an efficient system call.
> 
> I agree that your measurements show a marked improvement compared to
> POSIX signals that increases with increasing numbers of threads, again,
> as expected.

I knew it! Why did we need a proof again ? (just kidding) ;)

> 
> > So all in all, we can expect around 6.25-fold improvement because we
> > diminish the per-core overhead if we use the spinlocks (480 ns/core vs
> > 3000 ns/core), but if we don't take the runqueue spinlocks (180
> > ns/core), then we are contemplating a 16.7-fold improvement. And this is
> > without considering a tree-hierarchy for waiting for IPIs to complete,
> > which would additionally change the order of the scalability overhead
> > from O(n) to O(log(n)).
> 
> Eh?  You should be able to use the locks to accumulate a cpumask, then
> use whatever you want, including a tree hierarchy, to both send and wait
> for the IPIs, right?

Yes, although I don't see that a tree hierarchy is useful at all when
the architecture supports IPI broadcast efficiently. It's only useful
when waiting for these IPIs to complete.

> 
> Keep in mind that the runqueue locks just force memory ordering on the
> sampling.  It should not be necessary to hold them while sending the
> IPIs.  Or am I missing something?

Yes, that's true. We're just talking about a constant cost per thread
here (taking/releasing the runqueue spinlock).

Mathieu

> 
> 							Thanx, Paul

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier
  2010-01-10  0:03                                   ` Paul E. McKenney
  2010-01-10  0:41                                     ` Steven Rostedt
@ 2010-01-10  1:12                                     ` Mathieu Desnoyers
  2010-01-10  5:19                                       ` Paul E. McKenney
  1 sibling, 1 reply; 107+ messages in thread
From: Mathieu Desnoyers @ 2010-01-10  1:12 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Steven Rostedt, Oleg Nesterov, Peter Zijlstra, linux-kernel,
	Ingo Molnar, akpm, josh, tglx, Valdis.Kletnieks, dhowells, laijs,
	dipankar

* Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> On Sat, Jan 09, 2010 at 06:16:40PM -0500, Steven Rostedt wrote:
> > On Sat, 2010-01-09 at 18:05 -0500, Steven Rostedt wrote:
> > 
> > > Then we should have O(tasks) for spinlocks taken, and 
> > > O(min(tasks, CPUS)) for IPIs.
> > 
> > And for nr tasks >> CPUS, this may help too:
> > 
> > > cpumask = 0;
> > > foreach task {
> > 
> > 	if (cpumask == online_cpus)
> > 		break;
> > 
> > > 	spin_lock(task_rq(task)->rq->lock);
> > > 	if (task_rq(task)->curr == task)
> > > 		cpu_set(task_cpu(task), cpumask);
> > > 	spin_unlock(task_rq(task)->rq->lock);
> > > }
> > > send_ipi(cpumask);
> 
> Good point, erring on the side of sending too many IPIs is safe.  One
> might even be able to just send the full set if enough of the CPUs were
> running the current process and none of the remainder were running
> real-time threads.  And yes, it would then be necessary to throttle
> calls to sys_membarrier().
> 
> Quickly hiding behind a suitable boulder...  ;-)

:)

One quick counter-argument against IPI-to-all: that will wake up all
CPUs, including those which are asleep. Not really good for
energy-saving.

Mathieu

> 
> 							Thanx, Paul

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier
  2010-01-10  0:41                                     ` Steven Rostedt
@ 2010-01-10  1:14                                       ` Mathieu Desnoyers
  2010-01-10  1:44                                       ` Mathieu Desnoyers
  1 sibling, 0 replies; 107+ messages in thread
From: Mathieu Desnoyers @ 2010-01-10  1:14 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: paulmck, Oleg Nesterov, Peter Zijlstra, linux-kernel,
	Ingo Molnar, akpm, josh, tglx, Valdis.Kletnieks, dhowells, laijs,
	dipankar

* Steven Rostedt (rostedt@goodmis.org) wrote:
> On Sat, 2010-01-09 at 16:03 -0800, Paul E. McKenney wrote:
> > On Sat, Jan 09, 2010 at 06:16:40PM -0500, Steven Rostedt wrote:
> > > On Sat, 2010-01-09 at 18:05 -0500, Steven Rostedt wrote:
> > > 
> > > > Then we should have O(tasks) for spinlocks taken, and 
> > > > O(min(tasks, CPUS)) for IPIs.
> > > 
> > > And for nr tasks >> CPUS, this may help too:
> > > 
> > > > cpumask = 0;
> > > > foreach task {
> > > 
> > > 	if (cpumask == online_cpus)
> > > 		break;
> > > 
> > > > 	spin_lock(task_rq(task)->rq->lock);
> > > > 	if (task_rq(task)->curr == task)
> > > > 		cpu_set(task_cpu(task), cpumask);
> > > > 	spin_unlock(task_rq(task)->rq->lock);
> > > > }
> > > > send_ipi(cpumask);
> > 
> > Good point, erring on the side of sending too many IPIs is safe.  One
> > might even be able to just send the full set if enough of the CPUs were
> > running the current process and none of the remainder were running
> > real-time threads.  And yes, it would then be necessary to throttle
> > calls to sys_membarrier().
> > 
> 
> If you need to throttle calls to sys_membarrier(), than why bother
> optimizing it? Again, this is like calling synchronize_sched() in the
> kernel, which is a very heavy operation, and should only be called by
> those that are not performance critical.
> 
> Why are we struggling so much with optimizing the slow path?
> 
> Here's how I take it. This method is much better that sending signals to
> all threads. The advantage the sys_membarrier gives us, is also a way to
> keep user rcu_read_locks barrier free, which means that rcu_read_locks
> are quick and scale well.
> 
> So what if we have a linear decrease in performance with the number of
> threads on the write side?

100% agree. Will use spinlocks.

And I will keep the "tree-based" IPI wait as "future work", if anyone
ever needs it.

Thanks,

Mathieu

> 
> -- Steve
> 
> 

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier
  2010-01-10  0:41                                     ` Steven Rostedt
  2010-01-10  1:14                                       ` Mathieu Desnoyers
@ 2010-01-10  1:44                                       ` Mathieu Desnoyers
  2010-01-10  2:12                                         ` Steven Rostedt
  2010-01-10  5:18                                         ` Paul E. McKenney
  1 sibling, 2 replies; 107+ messages in thread
From: Mathieu Desnoyers @ 2010-01-10  1:44 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: paulmck, Oleg Nesterov, Peter Zijlstra, linux-kernel,
	Ingo Molnar, akpm, josh, tglx, Valdis.Kletnieks, dhowells, laijs,
	dipankar

* Steven Rostedt (rostedt@goodmis.org) wrote:
> On Sat, 2010-01-09 at 16:03 -0800, Paul E. McKenney wrote:
> > On Sat, Jan 09, 2010 at 06:16:40PM -0500, Steven Rostedt wrote:
> > > On Sat, 2010-01-09 at 18:05 -0500, Steven Rostedt wrote:
> > > 
> > > > Then we should have O(tasks) for spinlocks taken, and 
> > > > O(min(tasks, CPUS)) for IPIs.
> > > 
> > > And for nr tasks >> CPUS, this may help too:
> > > 
> > > > cpumask = 0;
> > > > foreach task {
> > > 
> > > 	if (cpumask == online_cpus)
> > > 		break;
> > > 
> > > > 	spin_lock(task_rq(task)->rq->lock);
> > > > 	if (task_rq(task)->curr == task)
> > > > 		cpu_set(task_cpu(task), cpumask);
> > > > 	spin_unlock(task_rq(task)->rq->lock);
> > > > }
> > > > send_ipi(cpumask);
> > 
> > Good point, erring on the side of sending too many IPIs is safe.  One
> > might even be able to just send the full set if enough of the CPUs were
> > running the current process and none of the remainder were running
> > real-time threads.  And yes, it would then be necessary to throttle
> > calls to sys_membarrier().
> > 
> 
> If you need to throttle calls to sys_membarrier(), than why bother
> optimizing it? Again, this is like calling synchronize_sched() in the
> kernel, which is a very heavy operation, and should only be called by
> those that are not performance critical.
> 
> Why are we struggling so much with optimizing the slow path?
> 
> Here's how I take it. This method is much better that sending signals to
> all threads. The advantage the sys_membarrier gives us, is also a way to
> keep user rcu_read_locks barrier free, which means that rcu_read_locks
> are quick and scale well.
> 
> So what if we have a linear decrease in performance with the number of
> threads on the write side?

Hrm, looking at arch/x86/include/asm/mmu_context.h

switch_mm(), which is basically called each time the scheduler needs to
change the current task, does a

cpumask_clear_cpu(cpu, mm_cpumask(prev));

and

cpumask_set_cpu(cpu, mm_cpumask(next));

whose precise goal is to stop the TLB flush IPIs for the previous mm.
The $100 question is: why do we have to confirm that the thread is
indeed on the runqueue (taking locks and everything) when we could
simply, bluntly use the mm_cpumask for our own IPIs?

cpumask_clear_cpu and cpumask_set_cpu translate into clear_bit/set_bit.
cpumask_next does a find_next_bit on the cpumask.

clear_bit/set_bit are atomic and not reordered on x86. PowerPC also uses
ll/sc loops in bitops.h, so I think it should be pretty safe to assume
that mm_cpumask is, by design, meant to be used as a cpumask for sending
a broadcast IPI to all CPUs which run threads belonging to a given
process.

So, how about just using mm_cpumask(current->mm) for the broadcast? Then
we don't even need to allocate our own cpumask.

Or am I missing something? It just sounds too simple.
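
In code, the simplification I am thinking of would be roughly this
(sketch only; whether skipping the runqueue locks is safe with respect
to the scheduler's memory ordering is exactly the open question):

static void membarrier_ipi(void *unused)
{
	smp_mb();
}

SYSCALL_DEFINE0(membarrier)
{
	/*
	 * No runqueue locks and no private cpumask allocation: trust
	 * the mask that switch_mm() already maintains for the TLB
	 * flush IPIs.
	 */
	preempt_disable();
	smp_call_function_many(mm_cpumask(current->mm), membarrier_ipi,
			       NULL, 1);
	preempt_enable();

	smp_mb();	/* order the calling thread's own accesses */
	return 0;
}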

Thanks,

Mathieu

> 
> -- Steve
> 
> 

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier
  2010-01-10  1:44                                       ` Mathieu Desnoyers
@ 2010-01-10  2:12                                         ` Steven Rostedt
  2010-01-10  5:25                                           ` Paul E. McKenney
  2010-01-10  5:18                                         ` Paul E. McKenney
  1 sibling, 1 reply; 107+ messages in thread
From: Steven Rostedt @ 2010-01-10  2:12 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: paulmck, Oleg Nesterov, Peter Zijlstra, linux-kernel,
	Ingo Molnar, akpm, josh, tglx, Valdis.Kletnieks, dhowells, laijs,
	dipankar

On Sat, 2010-01-09 at 20:44 -0500, Mathieu Desnoyers wrote:

> > So what if we have a linear decrease in performance with the number of
> > threads on the write side?
> 
> Hrm, looking at arch/x86/include/asm/mmu_context.h
> 
> switch_mm(), which is basically called each time the scheduler needs to
> change the current task, does a
> 
> cpumask_clear_cpu(cpu, mm_cpumask(prev));
> 
> and
> 
> cpumask_set_cpu(cpu, mm_cpumask(next));
> 
> which precise goal is to stop the flush ipis for the previous mm. The
> 100$ question is : why do we have to confirm that the thread is indeed
> on the runqueue (taking locks and everything) when we could simply just
> bluntly use the mm_cpumask for our own IPIs ?


I was just looking at that code, and was thinking the same thing ;-)

> cpumask_clear_cpu and cpumask_set_cpu translate into clear_bit/set_bit.
> cpumask_next does a find_next_bit on the cpumask.
> 
> clear_bit/set_bit are atomic and not reordered on x86. PowerPC also uses
> ll/sc loops in bitops.h, so I think it should be pretty safe to assume
> that mm_cpumask is, by design, made to be used as cpumask to send a
> broadcast IPI to all CPUs which run threads belonging to a given
> process.
> 
> So, how about just using mm_cpumask(current) for the broadcast ? Then we
> don't even need to allocate our own cpumask neither.
> 
> Or am I missing something ? I just sounds too simple.

I think we can use it. If for some reason it does not satisfy what you
need, then I think the TLB flushing is broken as well.

IIRC (Paul, help me out on this), what Paul said earlier is that we are
trying to protect against this scenario:

(from Paul's email:)


> 
>         CPU 1                           CPU 2
>      -----------                    -------------
> 
>         <user space>                    <kernel space, switching to task>
> 
>                                         ->curr updated
> 
>                                         <long code path, maybe mb?>
> 
>                                         <user space>
> 
>                                         rcu_read_lock(); [load only]
> 
>                                         obj = list->next
> 
>         list_del(obj)
> 
>         sys_membarrier();
>         < kernel space >
> 
>         if (task_rq(task)->curr != task)
>         < but load to obj reordered before store to ->curr >
> 
>         < user space >
> 
>         < misses that CPU 2 is in rcu section >


If the TLB flush misses the fact that CPU 2 is running one of the
process's threads, and does not flush CPU 2's TLB, it risks the same
type of crash.

> 
>         [CPU 2's ->curr update now visible]
> 
>         [CPU 2's rcu_read_lock() store now visible]
> 
>         free(obj);
> 
>                                         use_object(obj); <=== crash!
> 

Think about it. Suppose you change a process's mmap, say you update a
file mapping by flushing out one page and replacing it with another. If
the above missed sending the IPI to CPU 2, then CPU 2 may still be
accessing the old page of the file, and not the new one.

I think this may be the safe bet.

-- Steve



^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier
  2010-01-10  1:44                                       ` Mathieu Desnoyers
  2010-01-10  2:12                                         ` Steven Rostedt
@ 2010-01-10  5:18                                         ` Paul E. McKenney
  1 sibling, 0 replies; 107+ messages in thread
From: Paul E. McKenney @ 2010-01-10  5:18 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Steven Rostedt, Oleg Nesterov, Peter Zijlstra, linux-kernel,
	Ingo Molnar, akpm, josh, tglx, Valdis.Kletnieks, dhowells, laijs,
	dipankar

On Sat, Jan 09, 2010 at 08:44:56PM -0500, Mathieu Desnoyers wrote:
> * Steven Rostedt (rostedt@goodmis.org) wrote:
> > On Sat, 2010-01-09 at 16:03 -0800, Paul E. McKenney wrote:
> > > On Sat, Jan 09, 2010 at 06:16:40PM -0500, Steven Rostedt wrote:
> > > > On Sat, 2010-01-09 at 18:05 -0500, Steven Rostedt wrote:
> > > > 
> > > > > Then we should have O(tasks) for spinlocks taken, and 
> > > > > O(min(tasks, CPUS)) for IPIs.
> > > > 
> > > > And for nr tasks >> CPUS, this may help too:
> > > > 
> > > > > cpumask = 0;
> > > > > foreach task {
> > > > 
> > > > 	if (cpumask == online_cpus)
> > > > 		break;
> > > > 
> > > > > 	spin_lock(task_rq(task)->rq->lock);
> > > > > 	if (task_rq(task)->curr == task)
> > > > > 		cpu_set(task_cpu(task), cpumask);
> > > > > 	spin_unlock(task_rq(task)->rq->lock);
> > > > > }
> > > > > send_ipi(cpumask);
> > > 
> > > Good point, erring on the side of sending too many IPIs is safe.  One
> > > might even be able to just send the full set if enough of the CPUs were
> > > running the current process and none of the remainder were running
> > > real-time threads.  And yes, it would then be necessary to throttle
> > > calls to sys_membarrier().
> > > 
> > 
> > If you need to throttle calls to sys_membarrier(), then why bother
> > optimizing it? Again, this is like calling synchronize_sched() in the
> > kernel, which is a very heavy operation, and should only be called by
> > those that are not performance critical.
> > 
> > Why are we struggling so much with optimizing the slow path?
> > 
> > Here's how I take it. This method is much better than sending signals to
> > all threads. The advantage sys_membarrier gives us is also a way to
> > keep user rcu_read_locks barrier-free, which means that rcu_read_locks
> > are quick and scale well.
> > 
> > So what if we have a linear decrease in performance with the number of
> > threads on the write side?
> 
> Hrm, looking at arch/x86/include/asm/mmu_context.h
> 
> switch_mm(), which is basically called each time the scheduler needs to
> change the current task, does a
> 
> cpumask_clear_cpu(cpu, mm_cpumask(prev));
> 
> and
> 
> cpumask_set_cpu(cpu, mm_cpumask(next));
> 
> whose precise goal is to stop the flush IPIs for the previous mm. The
> $100 question is: why do we have to confirm that the thread is indeed
> on the runqueue (taking locks and everything) when we could simply
> use the mm_cpumask for our own IPIs?
> 
> cpumask_clear_cpu and cpumask_set_cpu translate into clear_bit/set_bit.
> cpumask_next does a find_next_bit on the cpumask.
> 
> clear_bit/set_bit are atomic and not reordered on x86. PowerPC also uses
> ll/sc loops in bitops.h, so I think it should be pretty safe to assume
> that mm_cpumask is, by design, made to be used as cpumask to send a
> broadcast IPI to all CPUs which run threads belonging to a given
> process.

According to Documentation/atomic_ops.txt, clear_bit/set_bit are atomic,
but are not required to provide memory-barrier semantics.

> So, how about just using mm_cpumask(current) for the broadcast? Then we
> don't even need to allocate our own cpumask either.
> 
> Or am I missing something? It just sounds too simple.

In this case, we would need a pair of memory barriers around the
clear_bit/set_bit on the mm_cpumask, and a memory barrier before sampling
the mask.  Yes, x86 gives you memory barriers on atomics whether you need
them or not, but in general they are not guaranteed.
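
For concreteness, here is a minimal sketch of what such an mm_cpumask-based
sys_membarrier() with the extra barriers might look like.  The helper name,
the use of smp_call_function_many(), and the syscall wiring are illustrative
assumptions on my part, not the posted patch:

#include <linux/sched.h>
#include <linux/smp.h>
#include <linux/syscalls.h>

/* Executed on each remote CPU running a thread of the calling process. */
static void membarrier_ipi(void *unused)
{
        smp_mb();       /* order memory accesses on behalf of the caller */
}

SYSCALL_DEFINE0(membarrier)
{
#ifdef CONFIG_SMP
        if (unlikely(thread_group_empty(current)))
                return 0;
        /*
         * Order the caller's prior accesses before mm_cpumask is sampled
         * and before the remote barriers execute.
         */
        smp_mb();
        preempt_disable();
        smp_call_function_many(mm_cpumask(current->mm), membarrier_ipi,
                               NULL, 1);
        preempt_enable();
        /* Order the remote barriers before the caller's later accesses. */
        smp_mb();
#endif
        return 0;
}

Note that smp_call_function_many() skips the calling CPU, so the local
smp_mb() calls are still needed; they also have to pair with whatever
barriers end up around the mm_cpumask update in switch_mm().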

							Thanx, Paul

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier
  2010-01-10  1:12                                     ` Mathieu Desnoyers
@ 2010-01-10  5:19                                       ` Paul E. McKenney
  0 siblings, 0 replies; 107+ messages in thread
From: Paul E. McKenney @ 2010-01-10  5:19 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Steven Rostedt, Oleg Nesterov, Peter Zijlstra, linux-kernel,
	Ingo Molnar, akpm, josh, tglx, Valdis.Kletnieks, dhowells, laijs,
	dipankar

On Sat, Jan 09, 2010 at 08:12:55PM -0500, Mathieu Desnoyers wrote:
> * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > On Sat, Jan 09, 2010 at 06:16:40PM -0500, Steven Rostedt wrote:
> > > On Sat, 2010-01-09 at 18:05 -0500, Steven Rostedt wrote:
> > > 
> > > > Then we should have O(tasks) for spinlocks taken, and 
> > > > O(min(tasks, CPUS)) for IPIs.
> > > 
> > > And for nr tasks >> CPUS, this may help too:
> > > 
> > > > cpumask = 0;
> > > > foreach task {
> > > 
> > > 	if (cpumask == online_cpus)
> > > 		break;
> > > 
> > > > 	spin_lock(task_rq(task)->rq->lock);
> > > > 	if (task_rq(task)->curr == task)
> > > > 		cpu_set(task_cpu(task), cpumask);
> > > > 	spin_unlock(task_rq(task)->rq->lock);
> > > > }
> > > > send_ipi(cpumask);
> > 
> > Good point, erring on the side of sending too many IPIs is safe.  One
> > might even be able to just send the full set if enough of the CPUs were
> > running the current process and none of the remainder were running
> > real-time threads.  And yes, it would then be necessary to throttle
> > calls to sys_membarrier().
> > 
> > Quickly hiding behind a suitable boulder...  ;-)
> 
> :)
> 
> One quick counter-argument against IPI-to-all: that will wake up all
> CPUs, including those which are asleep. Not really good for
> energy-saving.

Good point.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier
  2010-01-10  2:12                                         ` Steven Rostedt
@ 2010-01-10  5:25                                           ` Paul E. McKenney
  2010-01-10 11:50                                             ` Steven Rostedt
  0 siblings, 1 reply; 107+ messages in thread
From: Paul E. McKenney @ 2010-01-10  5:25 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Mathieu Desnoyers, Oleg Nesterov, Peter Zijlstra, linux-kernel,
	Ingo Molnar, akpm, josh, tglx, Valdis.Kletnieks, dhowells, laijs,
	dipankar

On Sat, Jan 09, 2010 at 09:12:58PM -0500, Steven Rostedt wrote:
> On Sat, 2010-01-09 at 20:44 -0500, Mathieu Desnoyers wrote:
> 
> > > So what if we have a linear decrease in performance with the number of
> > > threads on the write side?
> > 
> > Hrm, looking at arch/x86/include/asm/mmu_context.h
> > 
> > switch_mm(), which is basically called each time the scheduler needs to
> > change the current task, does a
> > 
> > cpumask_clear_cpu(cpu, mm_cpumask(prev));
> > 
> > and
> > 
> > cpumask_set_cpu(cpu, mm_cpumask(next));
> > 
> > whose precise goal is to stop the flush IPIs for the previous mm. The
> > $100 question is: why do we have to confirm that the thread is indeed
> > on the runqueue (taking locks and everything) when we could simply
> > use the mm_cpumask for our own IPIs?
> 
> I was just looking at that code, and was thinking the same thing ;-)
> 
> > cpumask_clear_cpu and cpumask_set_cpu translate into clear_bit/set_bit.
> > cpumask_next does a find_next_bit on the cpumask.
> > 
> > clear_bit/set_bit are atomic and not reordered on x86. PowerPC also uses
> > ll/sc loops in bitops.h, so I think it should be pretty safe to assume
> > that mm_cpumask is, by design, made to be used as cpumask to send a
> > broadcast IPI to all CPUs which run threads belonging to a given
> > process.
> > 
> > So, how about just using mm_cpumask(current) for the broadcast? Then we
> > don't even need to allocate our own cpumask either.
> > 
> > Or am I missing something? It just sounds too simple.
> 
> I think we can use it. If for some reason it does not satisfy what you
> need, then I think the TLB flushing is also broken.
> 
> IIRC (Paul, help me out on this), what Paul said earlier is that we are
> trying to protect against this scenario:
> 
> (from Paul's email:)
> 
> 
> > 
> >         CPU 1                           CPU 2
> >      -----------                    -------------
> > 
> >         <user space>                    <kernel space, switching to task>
> > 
> >                                         ->curr updated
> > 
> >                                         <long code path, maybe mb?>
> > 
> >                                         <user space>
> > 
> >                                         rcu_read_lock(); [load only]
> > 
> >                                         obj = list->next
> > 
> >         list_del(obj)
> > 
> >         sys_membarrier();
> >         < kernel space >
> > 
> >         if (task_rq(task)->curr != task)
> >         < but load to obj reordered before store to ->curr >
> > 
> >         < user space >
> > 
> >         < misses that CPU 2 is in rcu section >
> 
> 
> If the TLB flush misses that CPU 2 has a threaded task, and does not
> flush CPU 2s TLB, it can also risk the same type of crash.

But isn't the VM's locking helping us out in that case?

> >         [CPU 2's ->curr update now visible]
> > 
> >         [CPU 2's rcu_read_lock() store now visible]
> > 
> >         free(obj);
> > 
> >                                         use_object(obj); <=== crash!
> > 
> 
> Think about it. If you change a process mmap, say you updated a mmap of
> a file by flushing out one page and replacing it with another. If the
> above missed sending to CPU 2, then CPU 2 may still be accessing the old
> page of the file, and not the new one.
> 
> I think this may be the safe bet.

You might well be correct that we can access that bitmap locklessly,
but there are additional things (like the loading of the arch-specific
page-table register) that are likely to be helping in the VM case, but
not necessarily helping in this case.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier
  2010-01-10  5:25                                           ` Paul E. McKenney
@ 2010-01-10 11:50                                             ` Steven Rostedt
  2010-01-10 16:03                                               ` Mathieu Desnoyers
  2010-01-10 17:45                                               ` Paul E. McKenney
  0 siblings, 2 replies; 107+ messages in thread
From: Steven Rostedt @ 2010-01-10 11:50 UTC (permalink / raw)
  To: paulmck
  Cc: Mathieu Desnoyers, Oleg Nesterov, Peter Zijlstra, linux-kernel,
	Ingo Molnar, akpm, josh, tglx, Valdis.Kletnieks, dhowells, laijs,
	dipankar

On Sat, 2010-01-09 at 21:25 -0800, Paul E. McKenney wrote:
> On Sat, Jan 09, 2010 at 09:12:58PM -0500, Steven Rostedt wrote:

> > >         < user space >
> > > 
> > >         < misses that CPU 2 is in rcu section >
> > 
> > 
> > If the TLB flush misses that CPU 2 has a threaded task, and does not
> > flush CPU 2s TLB, it can also risk the same type of crash.
> 
> But isn't the VM's locking helping us out in that case?
> 
> > >         [CPU 2's ->curr update now visible]
> > > 
> > >         [CPU 2's rcu_read_lock() store now visible]
> > > 
> > >         free(obj);
> > > 
> > >                                         use_object(obj); <=== crash!
> > > 
> > 
> > Think about it. If you change a process mmap, say you updated a mmap of
> > a file by flushing out one page and replacing it with another. If the
> > above missed sending to CPU 2, then CPU 2 may still be accessing the old
> > page of the file, and not the new one.
> > 
> > I think this may be the safe bet.
> 
> You might well be correct that we can access that bitmap locklessly,
> but there are additional things (like the loading of the arch-specific
> page-table register) that are likely to be helping in the VM case, but
> not necessarily helping in this case.


Then perhaps the sys_membarrier() should just do a flush_tlb()? That
should guarantee the synchronization, right?

-- Steve



^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier
  2010-01-10 11:50                                             ` Steven Rostedt
@ 2010-01-10 16:03                                               ` Mathieu Desnoyers
  2010-01-10 16:21                                                 ` Steven Rostedt
  2010-01-10 17:45                                               ` Paul E. McKenney
  1 sibling, 1 reply; 107+ messages in thread
From: Mathieu Desnoyers @ 2010-01-10 16:03 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: paulmck, Oleg Nesterov, Peter Zijlstra, linux-kernel,
	Ingo Molnar, akpm, josh, tglx, Valdis.Kletnieks, dhowells, laijs,
	dipankar

* Steven Rostedt (rostedt@goodmis.org) wrote:
> On Sat, 2010-01-09 at 21:25 -0800, Paul E. McKenney wrote:
> > On Sat, Jan 09, 2010 at 09:12:58PM -0500, Steven Rostedt wrote:
> 
> > > >         < user space >
> > > > 
> > > >         < misses that CPU 2 is in rcu section >
> > > 
> > > 
> > > If the TLB flush misses that CPU 2 has a threaded task, and does not
> > > flush CPU 2s TLB, it can also risk the same type of crash.
> > 
> > But isn't the VM's locking helping us out in that case?
> > 
> > > >         [CPU 2's ->curr update now visible]
> > > > 
> > > >         [CPU 2's rcu_read_lock() store now visible]
> > > > 
> > > >         free(obj);
> > > > 
> > > >                                         use_object(obj); <=== crash!
> > > > 
> > > 
> > > Think about it. If you change a process mmap, say you updated a mmap of
> > > a file by flushing out one page and replacing it with another. If the
> > > above missed sending to CPU 2, then CPU 2 may still be accessing the old
> > > page of the file, and not the new one.
> > > 
> > > I think this may be the safe bet.
> > 
> > You might well be correct that we can access that bitmap locklessly,
> > but there are additional things (like the loading of the arch-specific
> > page-table register) that are likely to be helping in the VM case, but
> > not necessarily helping in this case.
> 
> 
> Then perhaps the sys_membarrier() should just do a flush_tlb()? That
> should guarantee the synchronization, right?
> 

The way I see it, TLB can be seen as read-only elements (a local
read-only cache) on the processors. Therefore, we don't care if they are
in a stale state while performing the cpumask update, because the fact
that we are executing switch_mm() means that these TLB entries are not
being used locally anyway and will be dropped shortly. So we have the
equivalent of a full memory barrier (load_cr3()) _after_ the cpumask
updates.

However, in sys_membarrier(), we also need to flush the write buffers
present on each processor running threads which belong to our current
process. Therefore, we would need, in addition, a smp_mb() before the
mm cpumask modification. For x86, cpumask_clear_cpu/cpumask_set_cpu
implies a LOCK-prefixed operation, and hence does not need any added
barrier, but this could be different for other architectures.

So, AFAIK, doing a flush_tlb() would not guarantee the kind of
synchronization we are looking for because an uncommitted write buffer
could still sit on the remote CPU when we return from sys_membarrier().
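
To make that ordering requirement concrete, here is a hypothetical,
simplified rendering of the mm cpumask update with the barriers made
explicit.  It illustrates the requirement only (using the generic
smp_mb__before_clear_bit()/smp_mb__after_clear_bit() helpers); it is not
actual switch_mm() code:

#include <linux/sched.h>

/*
 * Hypothetical illustration only: "prev" is the outgoing mm, "next" the
 * incoming one, "cpu" the current processor.
 */
static inline void illustrate_mm_cpumask_update(struct mm_struct *prev,
                                                struct mm_struct *next,
                                                unsigned int cpu)
{
        /*
         * Commit the outgoing thread's user-space stores before this CPU
         * disappears from prev's mm_cpumask, so that a concurrent
         * sys_membarrier() sampling the mask cannot miss them.  On x86 the
         * LOCK-prefixed clear_bit already implies this; other
         * architectures would need the explicit barrier.
         */
        smp_mb__before_clear_bit();
        cpumask_clear_cpu(cpu, mm_cpumask(prev));
        cpumask_set_cpu(cpu, mm_cpumask(next));
        /*
         * Make this CPU visible in next's mm_cpumask before the incoming
         * thread's user-space accesses can be reordered past this point.
         */
        smp_mb__after_clear_bit();
}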

Thanks,

Mathieu


-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier
  2010-01-10 16:03                                               ` Mathieu Desnoyers
@ 2010-01-10 16:21                                                 ` Steven Rostedt
  2010-01-10 17:10                                                   ` Mathieu Desnoyers
  0 siblings, 1 reply; 107+ messages in thread
From: Steven Rostedt @ 2010-01-10 16:21 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: paulmck, Oleg Nesterov, Peter Zijlstra, linux-kernel,
	Ingo Molnar, akpm, josh, tglx, Valdis.Kletnieks, dhowells, laijs,
	dipankar

On Sun, 2010-01-10 at 11:03 -0500, Mathieu Desnoyers wrote:
> * Steven Rostedt (rostedt@goodmis.org) wrote:

> The way I see it, TLB can be seen as read-only elements (a local
> read-only cache) on the processors. Therefore, we don't care if they are
> in a stale state while performing the cpumask update, because the fact
> that we are executing switch_mm() means that these TLB entries are not
> being used locally anyway and will be dropped shortly. So we have the
> equivalent of a full memory barrier (load_cr3()) _after_ the cpumask
> updates.
> 
> However, in sys_membarrier(), we also need to flush the write buffers
> present on each processor running threads which belong to our current
> process. Therefore, we would need, in addition, a smp_mb() before the
> mm cpumask modification. For x86, cpumask_clear_cpu/cpumask_set_cpu
> implies a LOCK-prefixed operation, and hence does not need any added
> barrier, but this could be different for other architectures.
> 
> So, AFAIK, doing a flush_tlb() would not guarantee the kind of
> synchronization we are looking for because an uncommitted write buffer
> could still sit on the remote CPU when we return from sys_membarrier().

Ah, so you are saying we can have this:


	CPU 0			CPU 1
     ----------		    --------------
	obj = list->obj;
				<user space>
				rcu_read_lock();
				obj = rcu_dereference(list->obj);
				obj->foo = bar;

				<preempt>
				<kernel space>

				schedule();
				cpumask_clear(mm_cpumask, cpu);

	sys_membarrier();
	free(obj);

				<store to obj->foo goes to memory>  <- corruption

		

So, if there's no smp_wmb() between the <preempt> and cpumask_clear()
then we have an issue?

-- Steve



^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier
  2010-01-10 16:21                                                 ` Steven Rostedt
@ 2010-01-10 17:10                                                   ` Mathieu Desnoyers
  2010-01-10 21:02                                                     ` Steven Rostedt
  0 siblings, 1 reply; 107+ messages in thread
From: Mathieu Desnoyers @ 2010-01-10 17:10 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: paulmck, Oleg Nesterov, Peter Zijlstra, linux-kernel,
	Ingo Molnar, akpm, josh, tglx, Valdis.Kletnieks, dhowells, laijs,
	dipankar

* Steven Rostedt (rostedt@goodmis.org) wrote:
> On Sun, 2010-01-10 at 11:03 -0500, Mathieu Desnoyers wrote:
> > * Steven Rostedt (rostedt@goodmis.org) wrote:
> 
> > The way I see it, TLB can be seen as read-only elements (a local
> > read-only cache) on the processors. Therefore, we don't care if they are
> > in a stale state while performing the cpumask update, because the fact
> > that we are executing switch_mm() means that these TLB entries are not
> > being used locally anyway and will be dropped shortly. So we have the
> > equivalent of a full memory barrier (load_cr3()) _after_ the cpumask
> > updates.
> > 
> > However, in sys_membarrier(), we also need to flush the write buffers
> > present on each processor running threads which belong to our current
> > process. Therefore, we would need, in addition, a smp_mb() before the
> > mm cpumask modification. For x86, cpumask_clear_cpu/cpumask_set_cpu
> > implies a LOCK-prefixed operation, and hence does not need any added
> > barrier, but this could be different for other architectures.
> > 
> > So, AFAIK, doing a flush_tlb() would not guarantee the kind of
> > synchronization we are looking for because an uncommitted write buffer
> > could still sit on the remote CPU when we return from sys_membarrier().
> 
> Ah, so you are saying we can have this:
> 
> 
> 	CPU 0			CPU 1
>      ----------		    --------------
> 	obj = list->obj;
> 				<user space>
> 				rcu_read_lock();
> 				obj = rcu_dereference(list->obj);
> 				obj->foo = bar;
> 
> 				<preempt>
> 				<kernel space>
> 
> 				schedule();
> 				cpumask_clear(mm_cpumask, cpu);
> 
> 	sys_membarrier();
> 	free(obj);
> 
> 				<store to obj->foo goes to memory>  <- corruption
> 

Hrm, having a writer like this in an RCU read-side would be a bit weird.
We have to look at the actual rcu_read_lock() implementation in urcu to
see why loads/stores are important on the RCU read-side.

(note: _STORE_SHARED is simply a volatile store)

(Thread-local variable, shared with the thread doing synchronize_rcu())
struct urcu_reader __thread urcu_reader;

static inline void _rcu_read_lock(void)
{
        long tmp;

        tmp = urcu_reader.ctr;
        if (likely(!(tmp & RCU_GP_CTR_NEST_MASK))) {
                _STORE_SHARED(urcu_reader.ctr, _LOAD_SHARED(urcu_gp_ctr));
                /*
                 * Set active readers count for outermost nesting level before
                 * accessing the pointer. See force_mb_all_threads().
                 */
                barrier();
        } else {
                _STORE_SHARED(urcu_reader.ctr, tmp + RCU_GP_COUNT);
        }
}

So as you see here, we have to ensure that the store to urcu_reader.ctr
is globally visible before entering the critical section (previous
stores must complete before following loads). For rcu_read_unlock, it's
the opposite:

static inline void _rcu_read_unlock(void)
{
        long tmp;

        tmp = urcu_reader.ctr;
        /*
         * Finish using rcu before decrementing the pointer.
         * See force_mb_all_threads().
         */
        if (likely((tmp & RCU_GP_CTR_NEST_MASK) == RCU_GP_COUNT)) {
                barrier();
                _STORE_SHARED(urcu_reader.ctr, urcu_reader.ctr - RCU_GP_COUNT);
        } else {
                _STORE_SHARED(urcu_reader.ctr, urcu_reader.ctr - RCU_GP_COUNT);
        }
}

We need to ensure that previous loads complete before following stores.

Therefore, the race with unlock showing that we need to order loads
before stores:

	CPU 0			CPU 1
        --------------          --------------
                                <user space> (already in read-side C.S.)
                                obj = rcu_dereference(list->next);
                                  -> load list->next
                                copy = obj->foo;
                                rcu_read_unlock();
                                  -> store to urcu_reader.ctr
                                <urcu_reader.ctr store is globally visible>
        list_del(obj);
                                <preempt>
                                <kernel space>

                                schedule();
                                cpumask_clear(mm_cpumask, cpu);

        sys_membarrier();
        set global g.p. (urcu_gp_ctr) phase to 1
        wait for all urcu_reader.ctr in phase 0
        set global g.p. (urcu_gp_ctr) phase to 0
        wait for all urcu_reader.ctr in phase 1
        sys_membarrier();
        free(obj);
                                <list->next load hits memory>
                                <obj->foo load hits memory> <- corruption

> 		
> So, if there's no smp_wmb() between the <preempt> and cpumask_clear()
> then we have an issue?

Considering the scenario above, we would need a full smp_mb() (or
equivalent) rather than just smp_wmb() to be strictly correct.

Thanks,

Mathieu

> 
> -- Steve
> 
> 

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier
  2010-01-10 11:50                                             ` Steven Rostedt
  2010-01-10 16:03                                               ` Mathieu Desnoyers
@ 2010-01-10 17:45                                               ` Paul E. McKenney
  2010-01-10 18:24                                                 ` Mathieu Desnoyers
  1 sibling, 1 reply; 107+ messages in thread
From: Paul E. McKenney @ 2010-01-10 17:45 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Mathieu Desnoyers, Oleg Nesterov, Peter Zijlstra, linux-kernel,
	Ingo Molnar, akpm, josh, tglx, Valdis.Kletnieks, dhowells, laijs,
	dipankar

On Sun, Jan 10, 2010 at 06:50:09AM -0500, Steven Rostedt wrote:
> On Sat, 2010-01-09 at 21:25 -0800, Paul E. McKenney wrote:
> > On Sat, Jan 09, 2010 at 09:12:58PM -0500, Steven Rostedt wrote:
> 
> > > >         < user space >
> > > > 
> > > >         < misses that CPU 2 is in rcu section >
> > > 
> > > 
> > > If the TLB flush misses that CPU 2 has a threaded task, and does not
> > > flush CPU 2s TLB, it can also risk the same type of crash.
> > 
> > But isn't the VM's locking helping us out in that case?
> > 
> > > >         [CPU 2's ->curr update now visible]
> > > > 
> > > >         [CPU 2's rcu_read_lock() store now visible]
> > > > 
> > > >         free(obj);
> > > > 
> > > >                                         use_object(obj); <=== crash!
> > > > 
> > > 
> > > Think about it. If you change a process mmap, say you updated a mmap of
> > > a file by flushing out one page and replacing it with another. If the
> > > above missed sending to CPU 2, then CPU 2 may still be accessing the old
> > > page of the file, and not the new one.
> > > 
> > > I think this may be the safe bet.
> > 
> > You might well be correct that we can access that bitmap locklessly,
> > but there are additional things (like the loading of the arch-specific
> > page-table register) that are likely to be helping in the VM case, but
> > not necessarily helping in this case.
> 
> Then perhaps the sys_membarrier() should just do a flush_tlb()? That
> should guarantee the synchronization, right?

Isn't just grabbing the runqueue locks a bit more straightforward and
more clearly correct?  Again, this is the URCU slowpath, so it is hard
to get too excited about a 15% performance penalty.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier
  2010-01-10 17:45                                               ` Paul E. McKenney
@ 2010-01-10 18:24                                                 ` Mathieu Desnoyers
  2010-01-11  1:17                                                   ` Paul E. McKenney
  0 siblings, 1 reply; 107+ messages in thread
From: Mathieu Desnoyers @ 2010-01-10 18:24 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Steven Rostedt, Oleg Nesterov, Peter Zijlstra, linux-kernel,
	Ingo Molnar, akpm, josh, tglx, Valdis.Kletnieks, dhowells, laijs,
	dipankar

* Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> On Sun, Jan 10, 2010 at 06:50:09AM -0500, Steven Rostedt wrote:
> > On Sat, 2010-01-09 at 21:25 -0800, Paul E. McKenney wrote:
> > > On Sat, Jan 09, 2010 at 09:12:58PM -0500, Steven Rostedt wrote:
> > 
> > > > >         < user space >
> > > > > 
> > > > >         < misses that CPU 2 is in rcu section >
> > > > 
> > > > 
> > > > If the TLB flush misses that CPU 2 has a threaded task, and does not
> > > > flush CPU 2s TLB, it can also risk the same type of crash.
> > > 
> > > But isn't the VM's locking helping us out in that case?
> > > 
> > > > >         [CPU 2's ->curr update now visible]
> > > > > 
> > > > >         [CPU 2's rcu_read_lock() store now visible]
> > > > > 
> > > > >         free(obj);
> > > > > 
> > > > >                                         use_object(obj); <=== crash!
> > > > > 
> > > > 
> > > > Think about it. If you change a process mmap, say you updated a mmap of
> > > > a file by flushing out one page and replacing it with another. If the
> > > > above missed sending to CPU 2, then CPU 2 may still be accessing the old
> > > > page of the file, and not the new one.
> > > > 
> > > > I think this may be the safe bet.
> > > 
> > > You might well be correct that we can access that bitmap locklessly,
> > > but there are additional things (like the loading of the arch-specific
> > > page-table register) that are likely to be helping in the VM case, but
> > > not necessarily helping in this case.
> > 
> > Then perhaps the sys_membarrier() should just do a flush_tlb()? That
> > should guarantee the synchronization, right?
> 
> Isn't just grabbing the runqueue locks a bit more straightforward and
> more clearly correct?  Again, this is the URCU slowpath, so it is hard
> to get too excited about a 15% performance penalty.
> 

Even when taking the spinlocks, efficient iteration on active threads is
done with for_each_cpu(cpu, mm_cpumask(current->mm)), which depends on
the same cpumask, and thus requires the same memory barriers around the
updates.

We could switch to an inefficient iteration on all online CPUs instead,
and read each runqueue's ->mm with the spinlock held. Is that what you
propose? This will cause reading of large amounts of runqueue
information, especially on large systems running few threads. The other
way around is to iterate on all the process threads: in this case, small
systems running many threads will have to read information about many
inactive threads, which is not much better.
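
For illustration, the "iterate on all online CPUs" variant could look
roughly like the sketch below.  It assumes the code is built into
kernel/sched.c (where struct rq, cpu_rq() and rq->lock are visible); the
names, the locking details and the omitted CPU-hotplug handling are
simplifications on my part, not the eventual patch:

/* Executed on each CPU found to be running a thread of our process. */
static void membarrier_ipi(void *unused)
{
        smp_mb();       /* execute the barrier on behalf of the caller */
}

SYSCALL_DEFINE0(membarrier)
{
        int cpu;

        smp_mb();       /* order the caller's prior accesses before the scan */
        for_each_online_cpu(cpu) {
                struct rq *rq = cpu_rq(cpu);
                int match;

                /*
                 * Reading rq->curr->mm under the runqueue lock guarantees
                 * we see a coherent current task for that CPU.
                 */
                spin_lock_irq(&rq->lock);
                match = (rq->curr->mm == current->mm);
                spin_unlock_irq(&rq->lock);

                if (match)
                        smp_call_function_single(cpu, membarrier_ipi,
                                                 NULL, 1);
        }
        smp_mb();       /* order the remote barriers before later accesses */
        return 0;
}

Erring on the side of an extra IPI here is harmless; missing a CPU that
runs one of our threads is what must be avoided.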

Thanks,

Mathieu

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier
  2010-01-10 17:10                                                   ` Mathieu Desnoyers
@ 2010-01-10 21:02                                                     ` Steven Rostedt
  2010-01-10 21:41                                                       ` Mathieu Desnoyers
  2010-01-11  1:21                                                       ` Paul E. McKenney
  0 siblings, 2 replies; 107+ messages in thread
From: Steven Rostedt @ 2010-01-10 21:02 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: paulmck, Oleg Nesterov, Peter Zijlstra, linux-kernel,
	Ingo Molnar, akpm, josh, tglx, Valdis.Kletnieks, dhowells, laijs,
	dipankar

On Sun, 2010-01-10 at 12:10 -0500, Mathieu Desnoyers wrote:

> > 
> > 	CPU 0			CPU 1
> >      ----------		    --------------
> > 	obj = list->obj;
> > 				<user space>
> > 				rcu_read_lock();
> > 				obj = rcu_dereference(list->obj);
> > 				obj->foo = bar;
> > 
> > 				<preempt>
> > 				<kernel space>
> > 
> > 				schedule();
> > 				cpumask_clear(mm_cpumask, cpu);
> > 
> > 	sys_membarrier();
> > 	free(obj);
> > 
> > 				<store to obj->foo goes to memory>  <- corruption
> > 
> 
> Hrm, having a writer like this in a rcu read-side would be a bit weird.
> We have to look at the actual rcu_read_lock() implementation in urcu to
> see why load/stores are important on the rcu read-side.
> 

No, it is not weird, it is common. The read is on the linked list that we
can access. Yes, a write should be protected by other locks, so maybe
that is the weird part.

> (note: _STORE_SHARED is simply a volatile store)
> 
> (Thread-local variable, shared with the thread doing synchronize_rcu())
> struct urcu_reader __thread urcu_reader;
> 
> static inline void _rcu_read_lock(void)
> {
>         long tmp;
> 
>         tmp = urcu_reader.ctr;
>         if (likely(!(tmp & RCU_GP_CTR_NEST_MASK))) {
>                 _STORE_SHARED(urcu_reader.ctr, _LOAD_SHARED(urcu_gp_ctr));
>                 /*
>                  * Set active readers count for outermost nesting level before
>                  * accessing the pointer. See force_mb_all_threads().
>                  */
>                 barrier();
>         } else {
>                 _STORE_SHARED(urcu_reader.ctr, tmp + RCU_GP_COUNT);
>         }
> }
> 
> So as you see here, we have to ensure that the store to urcu_reader.ctr
> is globally visible before entering the critical section (previous
> stores must complete before following loads). For rcu_read_unlock, it's
> the opposite:
> 
> static inline void _rcu_read_unlock(void)
> {
>         long tmp;
> 
>         tmp = urcu_reader.ctr;
>         /*
>          * Finish using rcu before decrementing the pointer.
>          * See force_mb_all_threads().
>          */
>         if (likely((tmp & RCU_GP_CTR_NEST_MASK) == RCU_GP_COUNT)) {
>                 barrier();
>                 _STORE_SHARED(urcu_reader.ctr, urcu_reader.ctr - RCU_GP_COUNT);
>         } else {
>                 _STORE_SHARED(urcu_reader.ctr, urcu_reader.ctr - RCU_GP_COUNT);
>         }
> }

Thanks for the insight into the code. I need to get around to looking at
your userspace implementation ;-)

> 
> We need to ensure that previous loads complete before following stores.
> 
> Therefore, the race with unlock showing that we need to order loads
> before stores:
> 
> 	CPU 0			CPU 1
>         --------------          --------------
>                                 <user space> (already in read-side C.S.)
>                                 obj = rcu_dereference(list->next);
>                                   -> load list->next
>                                 copy = obj->foo;
>                                 rcu_read_unlock();
>                                   -> store to urcu_reader.ctr
>                                 <urcu_reader.ctr store is globally visible>
>         list_del(obj);
>                                 <preempt>
>                                 <kernel space>
> 
>                                 schedule();
>                                 cpumask_clear(mm_cpumask, cpu);

				but here we are switching to a new task.

> 
>         sys_membarrier();
>         set global g.p. (urcu_gp_ctr) phase to 1
>         wait for all urcu_reader.ctr in phase 0
>         set global g.p. (urcu_gp_ctr) phase to 0
>         wait for all urcu_reader.ctr in phase 1
>         sys_membarrier();
>         free(obj);
>                                 <list->next load hits memory>
>                                 <obj->foo load hits memory> <- corruption

The load of obj->foo is really a load of obj->foo into some register. And
for the above to fail, that means this load happened even after we
switched to kernel space, and that the loaded value is still pending to
get into the thread stack that saved that register.

But I'm sure Paul will point me to some arch that does this ;-)

> 
> > 		
> > So, if there's no smp_wmb() between the <preempt> and cpumask_clear()
> > then we have an issue?
> 
> Considering the scenario above, we would need a full smp_mb() (or
> equivalent) rather than just smp_wmb() to be strictly correct.

I agree with Paul, we should just punt and grab the rq locks. That seems
to be the safest way without resorting to funny tricks to save 15% on a
slow path.

-- Steve



^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier
  2010-01-10 21:02                                                     ` Steven Rostedt
@ 2010-01-10 21:41                                                       ` Mathieu Desnoyers
  2010-01-11  1:21                                                       ` Paul E. McKenney
  1 sibling, 0 replies; 107+ messages in thread
From: Mathieu Desnoyers @ 2010-01-10 21:41 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: paulmck, Oleg Nesterov, Peter Zijlstra, linux-kernel,
	Ingo Molnar, akpm, josh, tglx, Valdis.Kletnieks, dhowells, laijs,
	dipankar

* Steven Rostedt (rostedt@goodmis.org) wrote:
> On Sun, 2010-01-10 at 12:10 -0500, Mathieu Desnoyers wrote:
> 
> > > 
> > > 	CPU 0			CPU 1
> > >      ----------		    --------------
> > > 	obj = list->obj;
> > > 				<user space>
> > > 				rcu_read_lock();
> > > 				obj = rcu_dereference(list->obj);
> > > 				obj->foo = bar;
> > > 
> > > 				<preempt>
> > > 				<kernel space>
> > > 
> > > 				schedule();
> > > 				cpumask_clear(mm_cpumask, cpu);
> > > 
> > > 	sys_membarrier();
> > > 	free(obj);
> > > 
> > > 				<store to obj->foo goes to memory>  <- corruption
> > > 
> > 
> > Hrm, having a writer like this in a rcu read-side would be a bit weird.
> > We have to look at the actual rcu_read_lock() implementation in urcu to
> > see why load/stores are important on the rcu read-side.
> > 
> 
> No it is not weird, it is common. The read is on the link list that we
> can access. Yes a write should be protected by other locks, so maybe
> that is the weird part.

Yes, this is what I thought was a bit weird.

> 
> > (note: _STORE_SHARED is simply a volatile store)
> > 
> > (Thread-local variable, shared with the thread doing synchronize_rcu())
> > struct urcu_reader __thread urcu_reader;
> > 
> > static inline void _rcu_read_lock(void)
> > {
> >         long tmp;
> > 
> >         tmp = urcu_reader.ctr;
> >         if (likely(!(tmp & RCU_GP_CTR_NEST_MASK))) {
> >                 _STORE_SHARED(urcu_reader.ctr, _LOAD_SHARED(urcu_gp_ctr));
> >                 /*
> >                  * Set active readers count for outermost nesting level before
> >                  * accessing the pointer. See force_mb_all_threads().
> >                  */
> >                 barrier();
> >         } else {
> >                 _STORE_SHARED(urcu_reader.ctr, tmp + RCU_GP_COUNT);
> >         }
> > }
> > 
> > So as you see here, we have to ensure that the store to urcu_reader.ctr
> > is globally visible before entering the critical section (previous
> > stores must complete before following loads). For rcu_read_unlock, it's
> > the opposite:
> > 
> > static inline void _rcu_read_unlock(void)
> > {
> >         long tmp;
> > 
> >         tmp = urcu_reader.ctr;
> >         /*
> >          * Finish using rcu before decrementing the pointer.
> >          * See force_mb_all_threads().
> >          */
> >         if (likely((tmp & RCU_GP_CTR_NEST_MASK) == RCU_GP_COUNT)) {
> >                 barrier();
> >                 _STORE_SHARED(urcu_reader.ctr, urcu_reader.ctr - RCU_GP_COUNT);
> >         } else {
> >                 _STORE_SHARED(urcu_reader.ctr, urcu_reader.ctr - RCU_GP_COUNT);
> >         }
> > }
> 
> Thanks for the insight of the code. I need to get around and look at
> your userspace implementation ;-)
> 
> > 
> > We need to ensure that previous loads complete before following stores.
> > 
> > Therefore, the race with unlock showing that we need to order loads
> > before stores:
> > 
> > 	CPU 0			CPU 1
> >         --------------          --------------
> >                                 <user space> (already in read-side C.S.)
> >                                 obj = rcu_dereference(list->next);
> >                                   -> load list->next
> >                                 copy = obj->foo;
> >                                 rcu_read_unlock();
> >                                   -> store to urcu_reader.ctr
> >                                 <urcu_reader.ctr store is globally visible>
> >         list_del(obj);
> >                                 <preempt>
> >                                 <kernel space>
> > 
> >                                 schedule();
> >                                 cpumask_clear(mm_cpumask, cpu);
> 
> 				but here we are switching to a new task.

Yes

> 
> > 
> >         sys_membarrier();
> >         set global g.p. (urcu_gp_ctr) phase to 1
> >         wait for all urcu_reader.ctr in phase 0
> >         set global g.p. (urcu_gp_ctr) phase to 0
> >         wait for all urcu_reader.ctr in phase 1
> >         sys_membarrier();
> >         free(obj);
> >                                 <list->next load hits memory>
> >                                 <obj->foo load hits memory> <- corruption
> 
> load of obj->foo is really load foo(obj) into some register. And for the
> above to fail, that means that this load happened even after we switched
> to kernel space, and that load of foo(obj) is still pending to get into
> the thread stack that saved that register.
> 

Yes, even though this event is very unlikely, I don't want to rely on a
memory barrier that might happen to be missing.


> But I'm sure Paul will point me to some arch that does this ;-)
> 
> > 
> > > 		
> > > So, if there's no smp_wmb() between the <preempt> and cpumask_clear()
> > > then we have an issue?
> > 
> > Considering the scenario above, we would need a full smp_mb() (or
> > equivalent) rather than just smp_wmb() to be strictly correct.
> 
> I agree with Paul, we should just punt and grab the rq locks. That seems
> to be the safest way without resorting to funny tricks to save 15% on a
> slow path.

Alright. I must warn you though: the resulting code is _much_ bigger
than a simple IPI sent to the mm cpumask, because we have to allocate a
cpumask and iterate on CPUs/threads (whichever set is smaller). It grows
from a tiny 41 lines to 224 lines.

Thanks,

Mathieu

> 
> -- Steve
> 
> 

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier
  2010-01-10 18:24                                                 ` Mathieu Desnoyers
@ 2010-01-11  1:17                                                   ` Paul E. McKenney
  2010-01-11  4:25                                                     ` Mathieu Desnoyers
  0 siblings, 1 reply; 107+ messages in thread
From: Paul E. McKenney @ 2010-01-11  1:17 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Steven Rostedt, Oleg Nesterov, Peter Zijlstra, linux-kernel,
	Ingo Molnar, akpm, josh, tglx, Valdis.Kletnieks, dhowells, laijs,
	dipankar

On Sun, Jan 10, 2010 at 01:24:23PM -0500, Mathieu Desnoyers wrote:
> * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > On Sun, Jan 10, 2010 at 06:50:09AM -0500, Steven Rostedt wrote:
> > > On Sat, 2010-01-09 at 21:25 -0800, Paul E. McKenney wrote:
> > > > On Sat, Jan 09, 2010 at 09:12:58PM -0500, Steven Rostedt wrote:
> > > 
> > > > > >         < user space >
> > > > > > 
> > > > > >         < misses that CPU 2 is in rcu section >
> > > > > 
> > > > > 
> > > > > If the TLB flush misses that CPU 2 has a threaded task, and does not
> > > > > flush CPU 2s TLB, it can also risk the same type of crash.
> > > > 
> > > > But isn't the VM's locking helping us out in that case?
> > > > 
> > > > > >         [CPU 2's ->curr update now visible]
> > > > > > 
> > > > > >         [CPU 2's rcu_read_lock() store now visible]
> > > > > > 
> > > > > >         free(obj);
> > > > > > 
> > > > > >                                         use_object(obj); <=== crash!
> > > > > > 
> > > > > 
> > > > > Think about it. If you change a process mmap, say you updated a mmap of
> > > > > a file by flushing out one page and replacing it with another. If the
> > > > > above missed sending to CPU 2, then CPU 2 may still be accessing the old
> > > > > page of the file, and not the new one.
> > > > > 
> > > > > I think this may be the safe bet.
> > > > 
> > > > You might well be correct that we can access that bitmap locklessly,
> > > > but there are additional things (like the loading of the arch-specific
> > > > page-table register) that are likely to be helping in the VM case, but
> > > > not necessarily helping in this case.
> > > 
> > > Then perhaps the sys_membarrier() should just do a flush_tlb()? That
> > > should guarantee the synchronization, right?
> > 
> > Isn't just grabbing the runqueue locks a bit more straightforward and
> > more clearly correct?  Again, this is the URCU slowpath, so it is hard
> > to get too excited about a 15% performance penalty.
> 
> Even when taking the spinlocks, efficient iteration on active threads is
> done with for_each_cpu(cpu, mm_cpumask(current->mm)), which depends on
> the same cpumask, and thus requires the same memory barriers around the
> updates.

Ouch!!!  Good point and good catch!!!

> We could switch to an inefficient iteration on all online CPUs instead,
> and read each runqueue's ->mm with the spinlock held. Is that what you
> propose? This will cause reading of large amounts of runqueue
> information, especially on large systems running few threads. The other
> way around is to iterate on all the process threads: in this case, small
> systems running many threads will have to read information about many
> inactive threads, which is not much better.

I am not all that worried about exactly what we do as long as it is
pretty obviously correct.  We can then improve performance when and as
the need arises.  We might need to use any of the strategies you
propose, or perhaps even choose among them depending on the number of
threads in the process, the number of CPUs, and so forth.  (I hope not,
but...)

My guess is that an obviously correct approach would work well for a
slowpath.  If someone later runs into performance problems, we can fix
them with the added knowledge of what they are trying to do.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier
  2010-01-10 21:02                                                     ` Steven Rostedt
  2010-01-10 21:41                                                       ` Mathieu Desnoyers
@ 2010-01-11  1:21                                                       ` Paul E. McKenney
  1 sibling, 0 replies; 107+ messages in thread
From: Paul E. McKenney @ 2010-01-11  1:21 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Mathieu Desnoyers, Oleg Nesterov, Peter Zijlstra, linux-kernel,
	Ingo Molnar, akpm, josh, tglx, Valdis.Kletnieks, dhowells, laijs,
	dipankar

On Sun, Jan 10, 2010 at 04:02:01PM -0500, Steven Rostedt wrote:
> On Sun, 2010-01-10 at 12:10 -0500, Mathieu Desnoyers wrote:
> 
> > > 
> > > 	CPU 0			CPU 1
> > >      ----------		    --------------
> > > 	obj = list->obj;
> > > 				<user space>
> > > 				rcu_read_lock();
> > > 				obj = rcu_dereference(list->obj);
> > > 				obj->foo = bar;
> > > 
> > > 				<preempt>
> > > 				<kernel space>
> > > 
> > > 				schedule();
> > > 				cpumask_clear(mm_cpumask, cpu);
> > > 
> > > 	sys_membarrier();
> > > 	free(obj);
> > > 
> > > 				<store to obj->foo goes to memory>  <- corruption
> > > 
> > 
> > Hrm, having a writer like this in a rcu read-side would be a bit weird.
> > We have to look at the actual rcu_read_lock() implementation in urcu to
> > see why load/stores are important on the rcu read-side.
> 
> No it is not weird, it is common. The read is on the link list that we
> can access. Yes a write should be protected by other locks, so maybe
> that is the weird part.

Not all that weird -- statistical counters are a common example, as
would be setting of flags, for example, those used to indicate that
later pruning operations might be needed.

> > (note: _STORE_SHARED is simply a volatile store)
> > 
> > (Thread-local variable, shared with the thread doing synchronize_rcu())
> > struct urcu_reader __thread urcu_reader;
> > 
> > static inline void _rcu_read_lock(void)
> > {
> >         long tmp;
> > 
> >         tmp = urcu_reader.ctr;
> >         if (likely(!(tmp & RCU_GP_CTR_NEST_MASK))) {
> >                 _STORE_SHARED(urcu_reader.ctr, _LOAD_SHARED(urcu_gp_ctr));
> >                 /*
> >                  * Set active readers count for outermost nesting level before
> >                  * accessing the pointer. See force_mb_all_threads().
> >                  */
> >                 barrier();
> >         } else {
> >                 _STORE_SHARED(urcu_reader.ctr, tmp + RCU_GP_COUNT);
> >         }
> > }
> > 
> > So as you see here, we have to ensure that the store to urcu_reader.ctr
> > is globally visible before entering the critical section (previous
> > stores must complete before following loads). For rcu_read_unlock, it's
> > the opposite:
> > 
> > static inline void _rcu_read_unlock(void)
> > {
> >         long tmp;
> > 
> >         tmp = urcu_reader.ctr;
> >         /*
> >          * Finish using rcu before decrementing the pointer.
> >          * See force_mb_all_threads().
> >          */
> >         if (likely((tmp & RCU_GP_CTR_NEST_MASK) == RCU_GP_COUNT)) {
> >                 barrier();
> >                 _STORE_SHARED(urcu_reader.ctr, urcu_reader.ctr - RCU_GP_COUNT);
> >         } else {
> >                 _STORE_SHARED(urcu_reader.ctr, urcu_reader.ctr - RCU_GP_COUNT);
> >         }
> > }
> 
> Thanks for the insight of the code. I need to get around and look at
> your userspace implementation ;-)
> 
> > 
> > We need to ensure that previous loads complete before following stores.
> > 
> > Therefore, the race with unlock showing that we need to order loads
> > before stores:
> > 
> > 	CPU 0			CPU 1
> >         --------------          --------------
> >                                 <user space> (already in read-side C.S.)
> >                                 obj = rcu_dereference(list->next);
> >                                   -> load list->next
> >                                 copy = obj->foo;
> >                                 rcu_read_unlock();
> >                                   -> store to urcu_reader.ctr
> >                                 <urcu_reader.ctr store is globally visible>
> >         list_del(obj);
> >                                 <preempt>
> >                                 <kernel space>
> > 
> >                                 schedule();
> >                                 cpumask_clear(mm_cpumask, cpu);
> 
> 				but here we are switching to a new task.
> 
> > 
> >         sys_membarrier();
> >         set global g.p. (urcu_gp_ctr) phase to 1
> >         wait for all urcu_reader.ctr in phase 0
> >         set global g.p. (urcu_gp_ctr) phase to 0
> >         wait for all urcu_reader.ctr in phase 1
> >         sys_membarrier();
> >         free(obj);
> >                                 <list->next load hits memory>
> >                                 <obj->foo load hits memory> <- corruption
> 
> load of obj->foo is really load foo(obj) into some register. And for the
> above to fail, that means that this load happened even after we switched
> to kernel space, and that load of foo(obj) is still pending to get into
> the thread stack that saved that register.
> 
> But I'm sure Paul will point me to some arch that does this ;-)

Given the recent dual-core non-cache-coherent Blackfin, I would not want
to make guarantees that this will never happen.  Yes, it was experimental,
but there are lots of strange CPUs showing up.  And we really should put
the ordering dependency in core non-arch-specific code -- less
vulnerable that way.

							Thanx, Paul

> > > So, if there's no smp_wmb() between the <preempt> and cpumask_clear()
> > > then we have an issue?
> > 
> > Considering the scenario above, we would need a full smp_mb() (or
> > equivalent) rather than just smp_wmb() to be strictly correct.
> 
> I agree with Paul, we should just punt and grab the rq locks. That seems
> to be the safest way without resorting to funny tricks to save 15% on a
> slow path.
> 
> -- Steve
> 
> 

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier
  2010-01-11  1:17                                                   ` Paul E. McKenney
@ 2010-01-11  4:25                                                     ` Mathieu Desnoyers
  2010-01-11  4:29                                                       ` [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier (v3a) Mathieu Desnoyers
                                                                         ` (2 more replies)
  0 siblings, 3 replies; 107+ messages in thread
From: Mathieu Desnoyers @ 2010-01-11  4:25 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Steven Rostedt, Oleg Nesterov, Peter Zijlstra, linux-kernel,
	Ingo Molnar, akpm, josh, tglx, Valdis.Kletnieks, dhowells, laijs,
	dipankar

* Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
[...]
> > Even when taking the spinlocks, efficient iteration on active threads is
> > done with for_each_cpu(cpu, mm_cpumask(current->mm)), which depends on
> > the same cpumask, and thus requires the same memory barriers around the
> > updates.
> 
> Ouch!!!  Good point and good catch!!!
> 
> > We could switch to an inefficient iteration on all online CPUs instead,
> > and read each runqueue's ->mm with the spinlock held. Is that what you
> > propose? This will cause reading of large amounts of runqueue
> > information, especially on large systems running few threads. The other
> > way around is to iterate on all the process threads: in this case, small
> > systems running many threads will have to read information about many
> > inactive threads, which is not much better.
> 
> I am not all that worried about exactly what we do as long as it is
> pretty obviously correct.  We can then improve performance when and as
> the need arises.  We might need to use any of the strategies you
> propose, or perhaps even choose among them depending on the number of
> threads in the process, the number of CPUs, and so forth.  (I hope not,
> but...)
> 
> My guess is that an obviously correct approach would work well for a
> slowpath.  If someone later runs into performance problems, we can fix
> them with the added knowledge of what they are trying to do.
> 

OK, here is what I propose. Let's choose between two implementations
(v3a and v3b), which implement two "obviously correct" approaches. In
summary:

* baseline (based on 2.6.32.2)
   text	   data	    bss	    dec	    hex	filename
  76887	   8782	   2044	  87713	  156a1	kernel/sched.o

* v3a: ipi to many using mm_cpumask

- adds smp_mb__before_clear_bit()/smp_mb__after_clear_bit() before and
  after mm_cpumask stores in context_switch(). They are only executed
  when oldmm and mm are different. (it's my turn to hide behind an
  appropriately-sized boulder for touching the scheduler). ;) Note that
  it's not that bad, as these barriers turn into simple compiler barrier()
  on:
    avr32, blackfin, cris, frv, h8300, m32r, m68k, mn10300, score, sh,
    sparc, x86 and xtensa.
  The less lucky architectures gaining two smp_mb() are:
    alpha, arm, ia64, mips, parisc, powerpc and s390.
  ia64 gains only one smp_mb() thanks to its acquire semantics.
- size
   text	   data	    bss	    dec	    hex	filename
  77239	   8782	   2044	  88065	  15801	kernel/sched.o
  -> adds 352 bytes of text
- Number of lines (system call source code, w/o comments) : 18

* v3b: iteration on min(num_online_cpus(), nr threads in the process),
  taking runqueue spinlocks, allocating a cpumask, ipi to many to the
  cpumask. Does not allocate the cpumask if only a single IPI is needed.

- only adds sys_membarrier() and related functions.
- size
   text	   data	    bss	    dec	    hex	filename
  78047	   8782	   2044	  88873	  15b29	kernel/sched.o
  -> adds 1160 bytes of text
- Number of lines (system call source code, w/o comments) : 163

I'll reply to this email with the two implementations. Comments are
welcome.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 107+ messages in thread

* [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier (v3a)
  2010-01-11  4:25                                                     ` Mathieu Desnoyers
@ 2010-01-11  4:29                                                       ` Mathieu Desnoyers
  2010-01-11 17:27                                                         ` Paul E. McKenney
  2010-01-11 17:50                                                         ` Peter Zijlstra
  2010-01-11  4:30                                                       ` [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier (v3b) Mathieu Desnoyers
  2010-01-11 16:25                                                       ` [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier Paul E. McKenney
  2 siblings, 2 replies; 107+ messages in thread
From: Mathieu Desnoyers @ 2010-01-11  4:29 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Steven Rostedt, Oleg Nesterov, Peter Zijlstra, linux-kernel,
	Ingo Molnar, akpm, josh, tglx, Valdis.Kletnieks, dhowells, laijs,
	dipankar

Here is an implementation of a new system call, sys_membarrier(), which
executes a memory barrier on all threads of the current process.
 
It aims at greatly simplifying and enhancing the current signal-based
liburcu userspace RCU synchronize_rcu() implementation.
(found at http://lttng.org/urcu)

Changelog since v1:

- Only perform the IPI in CONFIG_SMP.
- Only perform the IPI if the process has more than one thread.
- Only send IPIs to CPUs involved with threads belonging to our process.
- Adaptive IPI scheme (single vs many IPI with threshold).
- Issue smp_mb() at the beginning and end of the system call.

Changelog since v2:
- Simply send an IPI to all CPUs in mm_cpumask. It contains the list of
  processors we have to IPI (those using the mm), and this mask is updated
  atomically.

Both the signal-based and the sys_membarrier userspace RCU schemes
permit us to remove the memory barrier from the userspace RCU
rcu_read_lock() and rcu_read_unlock() primitives, thus significantly
accelerating them. These memory barriers are replaced by compiler
barriers on the read-side, and all matching memory barriers on the 
write-side are turned into an invocation of a memory barrier on all
active threads in the process. By letting the kernel perform this
synchronization rather than dumbly sending a signal to every process
thread (as we currently do), we diminish the number of unnecessary wake
ups and only issue the memory barriers on active threads. Non-running
threads do not need to execute such a barrier anyway, because these are
implied by the scheduler context switches.

To explain the benefit of this scheme, let's introduce two example threads:
 
Thread A (non-frequent, e.g. executing liburcu synchronize_rcu())
Thread B (frequent, e.g. executing liburcu rcu_read_lock()/rcu_read_unlock())

In a scheme where all smp_mb() in thread A synchronize_rcu() are
ordering memory accesses with respect to smp_mb() present in 
rcu_read_lock/unlock(), we can change all smp_mb() from
synchronize_rcu() into calls to sys_membarrier() and all smp_mb() from
rcu_read_lock/unlock() into compiler barriers "barrier()".

Before the change, we had, for each smp_mb() pairs:

Thread A                    Thread B
prev mem accesses           prev mem accesses
smp_mb()                    smp_mb()
follow mem accesses         follow mem accesses

After the change, these pairs become:

Thread A                    Thread B
prev mem accesses           prev mem accesses
sys_membarrier()            barrier()
follow mem accesses         follow mem accesses

As we can see, there are two possible scenarios: either Thread B memory
accesses do not happen concurrently with Thread A accesses (1), or they
do (2).

1) Non-concurrent Thread A vs Thread B accesses:

Thread A                    Thread B
prev mem accesses
sys_membarrier()
follow mem accesses
                            prev mem accesses
                            barrier()
                            follow mem accesses

In this case, thread B accesses will be weakly ordered. This is OK,
because at that point, thread A is not particularly interested in
ordering them with respect to its own accesses.

2) Concurrent Thread A vs Thread B accesses

Thread A                    Thread B
prev mem accesses           prev mem accesses
sys_membarrier()            barrier()
follow mem accesses         follow mem accesses

In this case, thread B accesses, which are ensured to be in program
order thanks to the compiler barrier, will be "upgraded" to full
smp_mb() thanks to the IPIs executing memory barriers on each active
system thread. Non-running process threads are intrinsically
serialized by the scheduler.
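
To make the substitution concrete, here is a minimal user-space sketch
(illustration only, not the actual liburcu code; the flag/data variables
and the membarrier() wrapper around syscall(__NR_membarrier) are
hypothetical):

#define barrier()	__asm__ __volatile__("" : : : "memory")

extern int membarrier(void);	/* hypothetical syscall(__NR_membarrier) wrapper */

volatile int flag;		/* hypothetical shared state */
volatile int data;

void thread_b_fast_path(void)	/* read-side, e.g. rcu_read_lock()-like */
{
	int seen;

	flag = 1;		/* prev mem accesses */
	barrier();		/* was smp_mb(), now a compiler barrier */
	seen = data;		/* follow mem accesses */
	(void)seen;
}

void thread_a_slow_path(void)	/* write-side, e.g. synchronize_rcu()-like */
{
	int seen;

	data = 1;		/* prev mem accesses */
	membarrier();		/* was smp_mb(), now a process-wide barrier */
	seen = flag;		/* follow mem accesses */
	(void)seen;
}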

For my Intel Xeon E5405 (new set of results, disabled kernel debugging)

T=1: 0m18.921s
T=2: 0m19.457s
T=3: 0m21.619s
T=4: 0m21.641s
T=5: 0m23.426s
T=6: 0m26.450s
T=7: 0m27.731s

The expected top pattern, when using 1 CPU for a thread doing sys_membarrier()
in a loop and other threads busy-waiting in user-space on a variable, shows that
the thread doing sys_membarrier() is mostly doing system calls, while the other
threads are mostly running in user-space. Side-note: in this test, it's important
to check that individual threads are not always fully at 100% user-space time
(they range between ~95% and 100%), because a thread that stays at 100% on the
same CPU is not getting the IPI at all. (I actually found a bug in my own code
while developing it with this test.)

Cpu0  :100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu1  : 99.7%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.3%hi,  0.0%si,  0.0%st
Cpu2  : 99.3%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.7%hi,  0.0%si,  0.0%st
Cpu3  :100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu4  :100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu5  : 96.0%us,  1.3%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  2.6%si,  0.0%st
Cpu6  :  1.3%us, 98.7%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu7  : 96.1%us,  3.3%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.3%hi,  0.3%si,  0.0%st
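
For reference, the benchmark loop looks roughly like this (a sketch, not
the exact test program; the thread count, iteration count and the raw
x86_64 syscall number are assumptions taken from this RFC):

#define _GNU_SOURCE
#include <pthread.h>
#include <unistd.h>
#include <sys/syscall.h>

#ifndef __NR_membarrier
#define __NR_membarrier	299		/* x86_64-only number from this RFC */
#endif

#define NR_SPINNERS	7		/* "T" in the timings above */
#define NR_CALLS	10000000UL

static volatile int go;

static void *spinner(void *arg)
{
	(void)arg;
	while (!go)
		;			/* busy-wait in user-space on a variable */
	for (;;)
		;			/* stay ~100% user-space, except when IPI'd */
}

int main(void)
{
	pthread_t tid[NR_SPINNERS];
	unsigned long i;

	for (i = 0; i < NR_SPINNERS; i++)
		pthread_create(&tid[i], NULL, spinner, NULL);
	go = 1;
	for (i = 0; i < NR_CALLS; i++)
		syscall(__NR_membarrier);	/* sys_membarrier() in a loop */
	return 0;
}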

Results in liburcu:

Operations in 10s, 6 readers, 2 writers:

(what we previously had)
memory barriers in reader: 973494744 reads, 892368 writes
signal-based scheme:      6289946025 reads,   1251 writes

(what we have now, with dynamic sys_membarrier check)
memory barriers in reader: 907693804 reads, 817793 writes
sys_membarrier scheme:    4061976535 reads, 526807 writes

So the dynamic sys_membarrier availability check adds some overhead to the
read-side, but besides that, we can see that we are close to the read-side
performance of the signal-based scheme and also close (5/8) to the performance
of the memory-barrier write-side. We have a write-side speedup of 421:1 over the
signal-based scheme by using the sys_membarrier system call. This allows a 4.5:1
read-side speedup over the memory barrier scheme.

The system call number is only assigned for x86_64 in this RFC patch.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
CC: mingo@elte.hu
CC: laijs@cn.fujitsu.com
CC: dipankar@in.ibm.com
CC: akpm@linux-foundation.org
CC: josh@joshtriplett.org
CC: dvhltc@us.ibm.com
CC: niv@us.ibm.com
CC: tglx@linutronix.de
CC: peterz@infradead.org
CC: rostedt@goodmis.org
CC: Valdis.Kletnieks@vt.edu
CC: dhowells@redhat.com
---
 arch/x86/include/asm/unistd_64.h |    2 +
 kernel/sched.c                   |   59 ++++++++++++++++++++++++++++++++++++++-
 2 files changed, 60 insertions(+), 1 deletion(-)

Index: linux-2.6-lttng/arch/x86/include/asm/unistd_64.h
===================================================================
--- linux-2.6-lttng.orig/arch/x86/include/asm/unistd_64.h	2010-01-10 19:21:31.000000000 -0500
+++ linux-2.6-lttng/arch/x86/include/asm/unistd_64.h	2010-01-10 19:21:37.000000000 -0500
@@ -661,6 +661,8 @@ __SYSCALL(__NR_pwritev, sys_pwritev)
 __SYSCALL(__NR_rt_tgsigqueueinfo, sys_rt_tgsigqueueinfo)
 #define __NR_perf_event_open			298
 __SYSCALL(__NR_perf_event_open, sys_perf_event_open)
+#define __NR_membarrier				299
+__SYSCALL(__NR_membarrier, sys_membarrier)
 
 #ifndef __NO_STUBS
 #define __ARCH_WANT_OLD_READDIR
Index: linux-2.6-lttng/kernel/sched.c
===================================================================
--- linux-2.6-lttng.orig/kernel/sched.c	2010-01-10 19:21:31.000000000 -0500
+++ linux-2.6-lttng/kernel/sched.c	2010-01-10 22:22:40.000000000 -0500
@@ -2861,12 +2861,26 @@ context_switch(struct rq *rq, struct tas
 	 */
 	arch_start_context_switch(prev);
 
+	/*
+	 * sys_membarrier IPI-mb scheme requires a memory barrier between
+	 * user-space thread execution and update to mm_cpumask.
+	 */
+	if (likely(oldmm) && likely(oldmm != mm))
+		smp_mb__before_clear_bit();
+
 	if (unlikely(!mm)) {
 		next->active_mm = oldmm;
 		atomic_inc(&oldmm->mm_count);
 		enter_lazy_tlb(oldmm, next);
-	} else
+	} else {
 		switch_mm(oldmm, mm, next);
+		/*
+		 * sys_membarrier IPI-mb scheme requires a memory barrier
+		 * between update to mm_cpumask and user-space thread execution.
+		 */
+		if (likely(oldmm != mm))
+			smp_mb__after_clear_bit();
+	}
 
 	if (unlikely(!prev->mm)) {
 		prev->active_mm = NULL;
@@ -10822,6 +10836,49 @@ struct cgroup_subsys cpuacct_subsys = {
 };
 #endif	/* CONFIG_CGROUP_CPUACCT */
 
+/*
+ * Execute a memory barrier on all active threads from the current process
+ * on SMP systems. Do not rely on implicit barriers in
+ * smp_call_function_many(), just in case they are ever relaxed in the future.
+ */
+static void membarrier_ipi(void *unused)
+{
+	smp_mb();
+}
+
+/*
+ * sys_membarrier - issue memory barrier on current process running threads
+ *
+ * Execute a memory barrier on all running threads of the current process.
+ * Upon completion, the caller thread is ensured that all process threads
+ * have passed through a state where memory accesses match program order.
+ * (non-running threads are de facto in such a state)
+ */
+SYSCALL_DEFINE0(membarrier)
+{
+#ifdef CONFIG_SMP
+	if (unlikely(thread_group_empty(current)))
+		return 0;
+	/*
+	 * Memory barrier on the caller thread _before_ sending first
+	 * IPI. Matches memory barriers around mm_cpumask modification in
+	 * context_switch().
+	 */
+	smp_mb();
+	preempt_disable();
+	smp_call_function_many(mm_cpumask(current->mm), membarrier_ipi,
+			       NULL, 1);
+	preempt_enable();
+	/*
+	 * Memory barrier on the caller thread _after_ we finished
+	 * waiting for the last IPI. Matches memory barriers around mm_cpumask
+	 * modification in context_switch().
+	 */
+	smp_mb();
+#endif	/* #ifdef CONFIG_SMP */
+	return 0;
+}
+
 #ifndef CONFIG_SMP
 
 int rcu_expedited_torture_stats(char *page)
-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 107+ messages in thread

* [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier (v3b)
  2010-01-11  4:25                                                     ` Mathieu Desnoyers
  2010-01-11  4:29                                                       ` [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier (v3a) Mathieu Desnoyers
@ 2010-01-11  4:30                                                       ` Mathieu Desnoyers
  2010-01-11 22:43                                                         ` Paul E. McKenney
  2010-01-11 16:25                                                       ` [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier Paul E. McKenney
  2 siblings, 1 reply; 107+ messages in thread
From: Mathieu Desnoyers @ 2010-01-11  4:30 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Steven Rostedt, Oleg Nesterov, Peter Zijlstra, linux-kernel,
	Ingo Molnar, akpm, josh, tglx, Valdis.Kletnieks, dhowells, laijs,
	dipankar

Here is an implementation of a new system call, sys_membarrier(), which
executes a memory barrier on all threads of the current process.
 
It aims at greatly simplifying and enhancing the current signal-based
liburcu userspace RCU synchronize_rcu() implementation.
(found at http://lttng.org/urcu)

Changelog since v1:

- Only perform the IPI in CONFIG_SMP.
- Only perform the IPI if the process has more than one thread.
- Only send IPIs to CPUs involved with threads belonging to our process.
- Adaptive IPI scheme (single vs many IPIs, with threshold).
- Issue smp_mb() at the beginning and end of the system call.

Changelog since v2:

- Iteration on min(num_online_cpus(), nr threads in the process),
  taking runqueue spinlocks, allocating a cpumask, ipi to many to the
  cpumask. Does not allocate the cpumask if only a single IPI is needed.


Both the signal-based and the sys_membarrier userspace RCU schemes
permit us to remove the memory barrier from the userspace RCU
rcu_read_lock() and rcu_read_unlock() primitives, thus significantly
accelerating them. These memory barriers are replaced by compiler
barriers on the read-side, and all matching memory barriers on the 
write-side are turned into an invocation of a memory barrier on all
active threads in the process. By letting the kernel perform this
synchronization rather than dumbly sending a signal to every process
thread (as we currently do), we diminish the number of unnecessary wake
ups and only issue the memory barriers on active threads. Non-running
threads do not need to execute such a barrier anyway, because these are
implied by the scheduler context switches.

To explain the benefit of this scheme, let's introduce two example threads:
 
Thread A (non-frequent, e.g. executing liburcu synchronize_rcu())
Thread B (frequent, e.g. executing liburcu rcu_read_lock()/rcu_read_unlock())

In a scheme where all smp_mb() in thread A synchronize_rcu() are
ordering memory accesses with respect to smp_mb() present in 
rcu_read_lock/unlock(), we can change all smp_mb() from
synchronize_rcu() into calls to sys_membarrier() and all smp_mb() from
rcu_read_lock/unlock() into compiler barriers "barrier()".

Before the change, we had, for each smp_mb() pairs:

Thread A                    Thread B
prev mem accesses           prev mem accesses
smp_mb()                    smp_mb()
follow mem accesses         follow mem accesses

After the change, these pairs become:

Thread A                    Thread B
prev mem accesses           prev mem accesses
sys_membarrier()            barrier()
follow mem accesses         follow mem accesses

As we can see, there are two possible scenarios: either Thread B memory
accesses do not happen concurrently with Thread A accesses (1), or they
do (2).

1) Non-concurrent Thread A vs Thread B accesses:

Thread A                    Thread B
prev mem accesses
sys_membarrier()
follow mem accesses
                            prev mem accesses
                            barrier()
                            follow mem accesses

In this case, thread B accesses will be weakly ordered. This is OK,
because at that point, thread A is not particularly interested in
ordering them with respect to its own accesses.

2) Concurrent Thread A vs Thread B accesses

Thread A                    Thread B
prev mem accesses           prev mem accesses
sys_membarrier()            barrier()
follow mem accesses         follow mem accesses

In this case, thread B accesses, which are ensured to be in program
order thanks to the compiler barrier, will be "upgraded" to full
smp_mb() thanks to the IPIs executing memory barriers on each active
system thread. Non-running process threads are intrinsically
serialized by the scheduler.

Just tried with a cache-hot kernel compilation using 6/8 CPUs.

Normally:                                              real 2m41.852s
With the sys_membarrier+1 busy-looping thread running: real 5m41.830s

So... 2x slower. That hurts.

So let's try allocating a cpumask for PeterZ's scheme. I prefer to pay a
small allocation overhead and benefit from the cpumask broadcast if
possible, so we scale better. But that all depends on how big the
allocation overhead is.

Impact of allocating a cpumask (time for 10,000,000 sys_membarrier()
calls; one thread is doing the sys_membarrier(), the others are busy
looping). Given that the cpumask allocation costs almost half as much as
sending a single IPI, we iterate on the CPUs until we either find more
than N matches or have iterated over all CPUs. If we have N matches or
fewer, we send single IPIs. If we need more than that, we switch to the
cpumask allocation and send a broadcast IPI to the cpumask we construct
for the matching CPUs. Let's call it the "adaptive IPI scheme".

For my Intel Xeon E5405

*This is calibration only, not taking the runqueue locks*

Just doing local mb()+single IPI to T other threads:

T=1: 0m18.801s
T=2: 0m29.086s
T=3: 0m46.841s
T=4: 0m53.758s
T=5: 1m10.856s
T=6: 1m21.142s
T=7: 1m38.362s

Just doing cpumask alloc+IPI-many to T other threads:

T=1: 0m21.778s
T=2: 0m22.741s
T=3: 0m22.185s
T=4: 0m24.660s
T=5: 0m26.855s
T=6: 0m30.841s
T=7: 0m29.551s

Comparing the two calibrations, single IPIs win only for T=1 (18.8s vs
21.8s); from T=2 on, the cpumask broadcast is already cheaper (22.7s vs
29.1s). So I think the right threshold should be 1 thread (assuming other
architectures behave like mine): starting with 2 threads, we allocate the
cpumask before sending IPIs.

*end of calibration*

Resulting adaptative scheme, with runqueue locks:

T=1: 0m20.990s
T=2: 0m22.588s
T=3: 0m27.028s
T=4: 0m29.027s
T=5: 0m32.592s
T=6: 0m36.556s
T=7: 0m33.093s

The expected top pattern, when using 1 CPU for a thread doing sys_membarrier()
in a loop and other threads busy-waiting in user-space on a variable, shows that
the thread doing sys_membarrier() is mostly doing system calls, while the other
threads are mostly running in user-space. Side-note: in this test, it's important
to check that individual threads are not always fully at 100% user-space time
(they range between ~95% and 100%), because a thread that stays at 100% on the
same CPU is not getting the IPI at all. (I actually found a bug in my own code
while developing it with this test.)

Cpu0  :100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu1  : 99.7%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.3%hi,  0.0%si,  0.0%st
Cpu2  : 99.3%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.7%hi,  0.0%si,  0.0%st
Cpu3  :100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu4  :100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu5  : 96.0%us,  1.3%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  2.6%si,  0.0%st
Cpu6  :  1.3%us, 98.7%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu7  : 96.1%us,  3.3%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.3%hi,  0.3%si,  0.0%st

The system call number is only assigned for x86_64 in this RFC patch.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
CC: mingo@elte.hu
CC: laijs@cn.fujitsu.com
CC: dipankar@in.ibm.com
CC: akpm@linux-foundation.org
CC: josh@joshtriplett.org
CC: dvhltc@us.ibm.com
CC: niv@us.ibm.com
CC: tglx@linutronix.de
CC: peterz@infradead.org
CC: rostedt@goodmis.org
CC: Valdis.Kletnieks@vt.edu
CC: dhowells@redhat.com
---
 arch/x86/include/asm/unistd_64.h |    2 
 kernel/sched.c                   |  219 +++++++++++++++++++++++++++++++++++++++
 2 files changed, 221 insertions(+)

Index: linux-2.6-lttng/arch/x86/include/asm/unistd_64.h
===================================================================
--- linux-2.6-lttng.orig/arch/x86/include/asm/unistd_64.h	2010-01-10 22:23:59.000000000 -0500
+++ linux-2.6-lttng/arch/x86/include/asm/unistd_64.h	2010-01-10 22:29:30.000000000 -0500
@@ -661,6 +661,8 @@ __SYSCALL(__NR_pwritev, sys_pwritev)
 __SYSCALL(__NR_rt_tgsigqueueinfo, sys_rt_tgsigqueueinfo)
 #define __NR_perf_event_open			298
 __SYSCALL(__NR_perf_event_open, sys_perf_event_open)
+#define __NR_membarrier				299
+__SYSCALL(__NR_membarrier, sys_membarrier)
 
 #ifndef __NO_STUBS
 #define __ARCH_WANT_OLD_READDIR
Index: linux-2.6-lttng/kernel/sched.c
===================================================================
--- linux-2.6-lttng.orig/kernel/sched.c	2010-01-10 22:23:59.000000000 -0500
+++ linux-2.6-lttng/kernel/sched.c	2010-01-10 23:12:35.000000000 -0500
@@ -119,6 +119,11 @@
  */
 #define RUNTIME_INF	((u64)~0ULL)
 
+/*
+ * IPI vs cpumask broadcast threshold. Threshold of 1 IPI.
+ */
+#define ADAPT_IPI_THRESHOLD	1
+
 static inline int rt_policy(int policy)
 {
 	if (unlikely(policy == SCHED_FIFO || policy == SCHED_RR))
@@ -10822,6 +10827,220 @@ struct cgroup_subsys cpuacct_subsys = {
 };
 #endif	/* CONFIG_CGROUP_CPUACCT */
 
+/*
+ * Execute a memory barrier on all CPUs on SMP systems.
+ * Do not rely on implicit barriers in smp_call_function(), just in case they
+ * are ever relaxed in the future.
+ */
+static void membarrier_ipi(void *unused)
+{
+	smp_mb();
+}
+
+/*
+ * Handle out-of-mem by sending per-cpu IPIs instead.
+ */
+static void membarrier_cpus_retry(int this_cpu)
+{
+	struct mm_struct *mm;
+	int cpu;
+
+	for_each_online_cpu(cpu) {
+		if (unlikely(cpu == this_cpu))
+			continue;
+		spin_lock_irq(&cpu_rq(cpu)->lock);
+		mm = cpu_curr(cpu)->mm;
+		spin_unlock_irq(&cpu_rq(cpu)->lock);
+		if (current->mm == mm)
+			smp_call_function_single(cpu, membarrier_ipi, NULL, 1);
+	}
+}
+
+static void membarrier_threads_retry(int this_cpu)
+{
+	struct mm_struct *mm;
+	struct task_struct *t;
+	struct rq *rq;
+	int cpu;
+
+	list_for_each_entry_rcu(t, &current->thread_group, thread_group) {
+		local_irq_disable();
+		rq = __task_rq_lock(t);
+		mm = rq->curr->mm;
+		cpu = rq->cpu;
+		__task_rq_unlock(rq);
+		local_irq_enable();
+		if (cpu == this_cpu)
+			continue;
+		if (current->mm == mm)
+			smp_call_function_single(cpu, membarrier_ipi, NULL, 1);
+	}
+}
+
+static void membarrier_cpus(int this_cpu)
+{
+	int cpu, i, cpu_ipi[ADAPT_IPI_THRESHOLD], nr_cpus = 0;
+	cpumask_var_t tmpmask;
+	struct mm_struct *mm;
+
+	/* Get CPU IDs up to threshold */
+	for_each_online_cpu(cpu) {
+		if (unlikely(cpu == this_cpu))
+			continue;
+		spin_lock_irq(&cpu_rq(cpu)->lock);
+		mm = cpu_curr(cpu)->mm;
+		spin_unlock_irq(&cpu_rq(cpu)->lock);
+		if (current->mm == mm) {
+			if (nr_cpus == ADAPT_IPI_THRESHOLD) {
+				nr_cpus++;
+				break;
+			}
+			cpu_ipi[nr_cpus++] = cpu;
+		}
+	}
+	if (likely(nr_cpus <= ADAPT_IPI_THRESHOLD)) {
+		for (i = 0; i < nr_cpus; i++) {
+			smp_call_function_single(cpu_ipi[i],
+						 membarrier_ipi,
+						 NULL, 1);
+		}
+	} else {
+		if (!alloc_cpumask_var(&tmpmask, GFP_KERNEL)) {
+			membarrier_cpus_retry(this_cpu);
+			return;
+		}
+		for (i = 0; i < ADAPT_IPI_THRESHOLD; i++)
+			cpumask_set_cpu(cpu_ipi[i], tmpmask);
+		/* Continue previous online cpu iteration */
+		cpumask_set_cpu(cpu, tmpmask);
+		for (;;) {
+			cpu = cpumask_next(cpu, cpu_online_mask);
+			if (unlikely(cpu == this_cpu))
+				continue;
+			if (unlikely(cpu >= nr_cpu_ids))
+				break;
+			spin_lock_irq(&cpu_rq(cpu)->lock);
+			mm = cpu_curr(cpu)->mm;
+			spin_unlock_irq(&cpu_rq(cpu)->lock);
+			if (current->mm == mm)
+				cpumask_set_cpu(cpu, tmpmask);
+		}
+		smp_call_function_many(tmpmask, membarrier_ipi, NULL, 1);
+		free_cpumask_var(tmpmask);
+	}
+}
+
+static void membarrier_threads(int this_cpu)
+{
+	int cpu, i, cpu_ipi[ADAPT_IPI_THRESHOLD], nr_cpus = 0;
+	cpumask_var_t tmpmask;
+	struct mm_struct *mm;
+	struct task_struct *t;
+	struct rq *rq;
+
+	/* Get CPU IDs up to threshold */
+	list_for_each_entry_rcu(t, &current->thread_group,
+				thread_group) {
+		local_irq_disable();
+		rq = __task_rq_lock(t);
+		mm = rq->curr->mm;
+		cpu = rq->cpu;
+		__task_rq_unlock(rq);
+		local_irq_enable();
+		if (cpu == this_cpu)
+			continue;
+		if (current->mm == mm) {
+			if (nr_cpus == ADAPT_IPI_THRESHOLD) {
+				nr_cpus++;
+				break;
+			}
+			cpu_ipi[nr_cpus++] = cpu;
+		}
+	}
+	if (likely(nr_cpus <= ADAPT_IPI_THRESHOLD)) {
+		for (i = 0; i < nr_cpus; i++) {
+			smp_call_function_single(cpu_ipi[i],
+						 membarrier_ipi,
+						 NULL, 1);
+		}
+	} else {
+		if (!alloc_cpumask_var(&tmpmask, GFP_KERNEL)) {
+			membarrier_threads_retry(this_cpu);
+			return;
+		}
+		for (i = 0; i < ADAPT_IPI_THRESHOLD; i++)
+			cpumask_set_cpu(cpu_ipi[i], tmpmask);
+		/* Continue previous thread iteration */
+		cpumask_set_cpu(cpu, tmpmask);
+		list_for_each_entry_continue_rcu(t,
+						 &current->thread_group,
+						 thread_group) {
+			local_irq_disable();
+			rq = __task_rq_lock(t);
+			mm = rq->curr->mm;
+			cpu = rq->cpu;
+			__task_rq_unlock(rq);
+			local_irq_enable();
+			if (cpu == this_cpu)
+				continue;
+			if (current->mm == mm)
+				cpumask_set_cpu(cpu, tmpmask);
+		}
+		smp_call_function_many(tmpmask, membarrier_ipi, NULL, 1);
+		free_cpumask_var(tmpmask);
+	}
+}
+
+/*
+ * sys_membarrier - issue memory barrier on current process running threads
+ *
+ * Execute a memory barrier on all running threads of the current process.
+ * Upon completion, the caller thread is ensured that all process threads
+ * have passed through a state where memory accesses match program order.
+ * (non-running threads are de facto in such a state)
+ *
+ * We do not use mm_cpumask because there is no guarantee that each architecture
+ * switch_mm issues a smp_mb() before and after mm_cpumask modification upon
+ * scheduling change. Furthermore, leave_mm is also modifying the mm_cpumask (at
+ * least on x86) from the TLB flush IPI handler. So rather than playing tricky
+ * games with lazy TLB flush, let's simply iterate on online cpus/thread group,
+ * whichever is the smallest.
+ */
+SYSCALL_DEFINE0(membarrier)
+{
+#ifdef CONFIG_SMP
+	int this_cpu;
+
+	if (unlikely(thread_group_empty(current)))
+		return 0;
+
+	rcu_read_lock();	/* protect cpu_curr(cpu)-> and rcu list */
+	preempt_disable();
+	/*
+	 * Memory barrier on the caller thread _before_ sending first IPI.
+	 */
+	smp_mb();
+	/*
+	 * We don't need to include ourself in IPI, as we already
+	 * surround our execution with memory barriers.
+	 */
+	this_cpu = smp_processor_id();
+	/* Approximate which is fastest: CPU or thread group iteration ? */
+	if (num_online_cpus() <= atomic_read(&current->mm->mm_users))
+		membarrier_cpus(this_cpu);
+	else
+		membarrier_threads(this_cpu);
+	/*
+	 * Memory barrier on the caller thread _after_ we finished
+	 * waiting for the last IPI.
+	 */
+	smp_mb();
+	preempt_enable();
+	rcu_read_unlock();
+#endif	/* #ifdef CONFIG_SMP */
+	return 0;
+}
+
 #ifndef CONFIG_SMP
 
 int rcu_expedited_torture_stats(char *page)
-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier
  2010-01-11  4:25                                                     ` Mathieu Desnoyers
  2010-01-11  4:29                                                       ` [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier (v3a) Mathieu Desnoyers
  2010-01-11  4:30                                                       ` [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier (v3b) Mathieu Desnoyers
@ 2010-01-11 16:25                                                       ` Paul E. McKenney
  2010-01-11 20:21                                                         ` Mathieu Desnoyers
  2 siblings, 1 reply; 107+ messages in thread
From: Paul E. McKenney @ 2010-01-11 16:25 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Steven Rostedt, Oleg Nesterov, Peter Zijlstra, linux-kernel,
	Ingo Molnar, akpm, josh, tglx, Valdis.Kletnieks, dhowells, laijs,
	dipankar

On Sun, Jan 10, 2010 at 11:25:21PM -0500, Mathieu Desnoyers wrote:
> * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> [...]
> > > Even when taking the spinlocks, efficient iteration on active threads is
> > > done with for_each_cpu(cpu, mm_cpumask(current->mm)), which depends on
> > > the same cpumask, and thus requires the same memory barriers around the
> > > updates.
> > 
> > Ouch!!!  Good point and good catch!!!
> > 
> > > We could switch to an inefficient iteration on all online CPUs instead,
> > > and check read runqueue ->mm with the spinlock held. Is that what you
> > > propose ? This will cause reading of large amounts of runqueue
> > > information, especially on large systems running few threads. The other
> > > way around is to iterate on all the process threads: in this case, small
> > > systems running many threads will have to read information about many
> > > inactive threads, which is not much better.
> > 
> > I am not all that worried about exactly what we do as long as it is
> > pretty obviously correct.  We can then improve performance when and as
> > the need arises.  We might need to use any of the strategies you
> > propose, or perhaps even choose among them depending on the number of
> > threads in the process, the number of CPUs, and so forth.  (I hope not,
> > but...)
> > 
> > My guess is that an obviously correct approach would work well for a
> > slowpath.  If someone later runs into performance problems, we can fix
> > them with the added knowledge of what they are trying to do.
> > 
> 
> OK, here is what I propose. Let's choose between two implementations
> (v3a and v3b), which implement two "obviously correct" approaches. In
> summary:
> 
> * baseline (based on 2.6.32.2)
>    text	   data	    bss	    dec	    hex	filename
>   76887	   8782	   2044	  87713	  156a1	kernel/sched.o
> 
> * v3a: ipi to many using mm_cpumask
> 
> - adds smp_mb__before_clear_bit()/smp_mb__after_clear_bit() before and
>   after mm_cpumask stores in context_switch(). They are only executed
>   when oldmm and mm are different. (it's my turn to hide behind an
>   appropriately-sized boulder for touching the scheduler). ;) Note that
>   it's not that bad, as these barriers turn into simple compiler barrier()
>   on:
>     avr32, blackfin, cris, frb, h8300, m32r, m68k, mn10300, score, sh,
>     sparc, x86 and xtensa.
>   The less lucky architectures gaining two smp_mb() are:
>     alpha, arm, ia64, mips, parisc, powerpc and s390.
>   ia64 is gaining only one smp_mb() thanks to its acquire semantic.
> - size
>    text	   data	    bss	    dec	    hex	filename
>   77239	   8782	   2044	  88065	  15801	kernel/sched.o
>   -> adds 352 bytes of text
> - Number of lines (system call source code, w/o comments) : 18
> 
> * v3b: iteration on min(num_online_cpus(), nr threads in the process),
>   taking runqueue spinlocks, allocating a cpumask, ipi to many to the
>   cpumask. Does not allocate the cpumask if only a single IPI is needed.
> 
> - only adds sys_membarrier() and related functions.
> - size
>    text	   data	    bss	    dec	    hex	filename
>   78047	   8782	   2044	  88873	  15b29	kernel/sched.o
>   -> adds 1160 bytes of text
> - Number of lines (system call source code, w/o comments) : 163
> 
> I'll reply to this email with the two implementations. Comments are
> welcome.

Cool!!!  Just for completeness, I point out the following trivial
implementation:

/*
 * sys_membarrier - issue memory barrier on current process running threads
 *
 * Execute a memory barrier on all running threads of the current process.
 * Upon completion, the caller thread is ensured that all process threads
 * have passed through a state where memory accesses match program order.
 * (non-running threads are de facto in such a state)
 *
 * Note that synchronize_sched() has the side-effect of doing a memory
 * barrier on each CPU.
 */
SYSCALL_DEFINE0(membarrier)
{
	synchronize_sched();
	return 0;
}

This does unnecessarily hit all CPUs in the system, but has the same
minimal impact that in-kernel RCU already has.  It has long latency
(milliseconds), which might well disqualify it from consideration for
some applications.  On the other hand, it automatically batches multiple
concurrent calls to sys_membarrier().

							Thanx, Paul

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier (v3a)
  2010-01-11  4:29                                                       ` [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier (v3a) Mathieu Desnoyers
@ 2010-01-11 17:27                                                         ` Paul E. McKenney
  2010-01-11 17:35                                                           ` Mathieu Desnoyers
  2010-01-11 17:50                                                         ` Peter Zijlstra
  1 sibling, 1 reply; 107+ messages in thread
From: Paul E. McKenney @ 2010-01-11 17:27 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Steven Rostedt, Oleg Nesterov, Peter Zijlstra, linux-kernel,
	Ingo Molnar, akpm, josh, tglx, Valdis.Kletnieks, dhowells, laijs,
	dipankar

On Sun, Jan 10, 2010 at 11:29:03PM -0500, Mathieu Desnoyers wrote:
> Here is an implementation of a new system call, sys_membarrier(), which
> executes a memory barrier on all threads of the current process.
> 
> It aims at greatly simplifying and enhancing the current signal-based
> liburcu userspace RCU synchronize_rcu() implementation.
> (found at http://lttng.org/urcu)

Given that this has the memory barrier both before and after the
assignment to ->mm, looks good to me from a memory-ordering viewpoint.
I must defer to others on the effect on context-switch overhead.

						Thanx, Paul

> Changelog since v1:
> 
> - Only perform the IPI in CONFIG_SMP.
> - Only perform the IPI if the process has more than one thread.
> - Only send IPIs to CPUs involved with threads belonging to our process.
> - Adaptative IPI scheme (single vs many IPI with threshold).
> - Issue smp_mb() at the beginning and end of the system call.
> 
> Changelog since v2:
> - simply send-to-many to the mm_cpumask. It contains the list of processors we
>   have to IPI to (which use the mm), and this mask is updated atomically.
> 
> Both the signal-based and the sys_membarrier userspace RCU schemes
> permit us to remove the memory barrier from the userspace RCU
> rcu_read_lock() and rcu_read_unlock() primitives, thus significantly
> accelerating them. These memory barriers are replaced by compiler
> barriers on the read-side, and all matching memory barriers on the 
> write-side are turned into an invokation of a memory barrier on all
> active threads in the process. By letting the kernel perform this
> synchronization rather than dumbly sending a signal to every process
> threads (as we currently do), we diminish the number of unnecessary wake
> ups and only issue the memory barriers on active threads. Non-running
> threads do not need to execute such barrier anyway, because these are
> implied by the scheduler context switches.
> 
> To explain the benefit of this scheme, let's introduce two example threads:
> 
> Thread A (non-frequent, e.g. executing liburcu synchronize_rcu())
> Thread B (frequent, e.g. executing liburcu rcu_read_lock()/rcu_read_unlock())
> 
> In a scheme where all smp_mb() in thread A synchronize_rcu() are
> ordering memory accesses with respect to smp_mb() present in 
> rcu_read_lock/unlock(), we can change all smp_mb() from
> synchronize_rcu() into calls to sys_membarrier() and all smp_mb() from
> rcu_read_lock/unlock() into compiler barriers "barrier()".
> 
> Before the change, we had, for each smp_mb() pairs:
> 
> Thread A                    Thread B
> prev mem accesses           prev mem accesses
> smp_mb()                    smp_mb()
> follow mem accesses         follow mem accesses
> 
> After the change, these pairs become:
> 
> Thread A                    Thread B
> prev mem accesses           prev mem accesses
> sys_membarrier()            barrier()
> follow mem accesses         follow mem accesses
> 
> As we can see, there are two possible scenarios: either Thread B memory
> accesses do not happen concurrently with Thread A accesses (1), or they
> do (2).
> 
> 1) Non-concurrent Thread A vs Thread B accesses:
> 
> Thread A                    Thread B
> prev mem accesses
> sys_membarrier()
> follow mem accesses
>                             prev mem accesses
>                             barrier()
>                             follow mem accesses
> 
> In this case, thread B accesses will be weakly ordered. This is OK,
> because at that point, thread A is not particularly interested in
> ordering them with respect to its own accesses.
> 
> 2) Concurrent Thread A vs Thread B accesses
> 
> Thread A                    Thread B
> prev mem accesses           prev mem accesses
> sys_membarrier()            barrier()
> follow mem accesses         follow mem accesses
> 
> In this case, thread B accesses, which are ensured to be in program
> order thanks to the compiler barrier, will be "upgraded" to full
> smp_mb() thanks to the IPIs executing memory barriers on each active
> system threads. Each non-running process threads are intrinsically
> serialized by the scheduler.
> 
> For my Intel Xeon E5405 (new set of results, disabled kernel debugging)
> 
> T=1: 0m18.921s
> T=2: 0m19.457s
> T=3: 0m21.619s
> T=4: 0m21.641s
> T=5: 0m23.426s
> T=6: 0m26.450s
> T=7: 0m27.731s
> 
> The expected top pattern, when using 1 CPU for a thread doing sys_membarrier()
> in a loop and other threads busy-waiting in user-space on a variable shows that
> the thread doing sys_membarrier is doing mostly system calls, and other threads
> are mostly running in user-space. Side-note, in this test, it's important to
> check that individual threads are not always fully at 100% user-space time (they
> range between ~95% and 100%), because when some thread in the test is always at
> 100% on the same CPU, this means it does not get the IPI at all. (I actually
> found out about a bug in my own code while developing it with this test.)
> 
> Cpu0  :100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
> Cpu1  : 99.7%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.3%hi,  0.0%si,  0.0%st
> Cpu2  : 99.3%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.7%hi,  0.0%si,  0.0%st
> Cpu3  :100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
> Cpu4  :100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
> Cpu5  : 96.0%us,  1.3%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  2.6%si,  0.0%st
> Cpu6  :  1.3%us, 98.7%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
> Cpu7  : 96.1%us,  3.3%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.3%hi,  0.3%si,  0.0%st
> 
> Results in liburcu:
> 
> Operations in 10s, 6 readers, 2 writers:
> 
> (what we previously had)
> memory barriers in reader: 973494744 reads, 892368 writes
> signal-based scheme:      6289946025 reads,   1251 writes
> 
> (what we have now, with dynamic sys_membarrier check)
> memory barriers in reader: 907693804 reads, 817793 writes
> sys_membarrier scheme:    4061976535 reads, 526807 writes
> 
> So the dynamic sys_membarrier availability check adds some overhead to the
> read-side, but besides that, we can see that we are close to the read-side
> performance of the signal-based scheme and also close (5/8) to the performance
> of the memory-barrier write-side. We have a write-side speedup of 421:1 over the
> signal-based scheme by using the sys_membarrier system call. This allows a 4.5:1
> read-side speedup over the memory barrier scheme.
> 
> The system call number is only assigned for x86_64 in this RFC patch.
> 
> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
> CC: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> CC: mingo@elte.hu
> CC: laijs@cn.fujitsu.com
> CC: dipankar@in.ibm.com
> CC: akpm@linux-foundation.org
> CC: josh@joshtriplett.org
> CC: dvhltc@us.ibm.com
> CC: niv@us.ibm.com
> CC: tglx@linutronix.de
> CC: peterz@infradead.org
> CC: rostedt@goodmis.org
> CC: Valdis.Kletnieks@vt.edu
> CC: dhowells@redhat.com
> ---
>  arch/x86/include/asm/unistd_64.h |    2 +
>  kernel/sched.c                   |   59 ++++++++++++++++++++++++++++++++++++++-
>  2 files changed, 60 insertions(+), 1 deletion(-)
> 
> Index: linux-2.6-lttng/arch/x86/include/asm/unistd_64.h
> ===================================================================
> --- linux-2.6-lttng.orig/arch/x86/include/asm/unistd_64.h	2010-01-10 19:21:31.000000000 -0500
> +++ linux-2.6-lttng/arch/x86/include/asm/unistd_64.h	2010-01-10 19:21:37.000000000 -0500
> @@ -661,6 +661,8 @@ __SYSCALL(__NR_pwritev, sys_pwritev)
>  __SYSCALL(__NR_rt_tgsigqueueinfo, sys_rt_tgsigqueueinfo)
>  #define __NR_perf_event_open			298
>  __SYSCALL(__NR_perf_event_open, sys_perf_event_open)
> +#define __NR_membarrier				299
> +__SYSCALL(__NR_membarrier, sys_membarrier)
> 
>  #ifndef __NO_STUBS
>  #define __ARCH_WANT_OLD_READDIR
> Index: linux-2.6-lttng/kernel/sched.c
> ===================================================================
> --- linux-2.6-lttng.orig/kernel/sched.c	2010-01-10 19:21:31.000000000 -0500
> +++ linux-2.6-lttng/kernel/sched.c	2010-01-10 22:22:40.000000000 -0500
> @@ -2861,12 +2861,26 @@ context_switch(struct rq *rq, struct tas
>  	 */
>  	arch_start_context_switch(prev);
> 
> +	/*
> +	 * sys_membarrier IPI-mb scheme requires a memory barrier between
> +	 * user-space thread execution and update to mm_cpumask.
> +	 */
> +	if (likely(oldmm) && likely(oldmm != mm))
> +		smp_mb__before_clear_bit();
> +
>  	if (unlikely(!mm)) {
>  		next->active_mm = oldmm;
>  		atomic_inc(&oldmm->mm_count);
>  		enter_lazy_tlb(oldmm, next);
> -	} else
> +	} else {
>  		switch_mm(oldmm, mm, next);
> +		/*
> +		 * sys_membarrier IPI-mb scheme requires a memory barrier
> +		 * between update to mm_cpumask and user-space thread execution.
> +		 */
> +		if (likely(oldmm != mm))
> +			smp_mb__after_clear_bit();
> +	}
> 
>  	if (unlikely(!prev->mm)) {
>  		prev->active_mm = NULL;
> @@ -10822,6 +10836,49 @@ struct cgroup_subsys cpuacct_subsys = {
>  };
>  #endif	/* CONFIG_CGROUP_CPUACCT */
> 
> +/*
> + * Execute a memory barrier on all active threads from the current process
> + * on SMP systems. Do not rely on implicit barriers in
> + * smp_call_function_many(), just in case they are ever relaxed in the future.
> + */
> +static void membarrier_ipi(void *unused)
> +{
> +	smp_mb();
> +}
> +
> +/*
> + * sys_membarrier - issue memory barrier on current process running threads
> + *
> + * Execute a memory barrier on all running threads of the current process.
> + * Upon completion, the caller thread is ensured that all process threads
> + * have passed through a state where memory accesses match program order.
> + * (non-running threads are de facto in such a state)
> + */
> +SYSCALL_DEFINE0(membarrier)
> +{
> +#ifdef CONFIG_SMP
> +	if (unlikely(thread_group_empty(current)))
> +		return 0;
> +	/*
> +	 * Memory barrier on the caller thread _before_ sending first
> +	 * IPI. Matches memory barriers around mm_cpumask modification in
> +	 * context_switch().
> +	 */
> +	smp_mb();
> +	preempt_disable();
> +	smp_call_function_many(mm_cpumask(current->mm), membarrier_ipi,
> +			       NULL, 1);
> +	preempt_enable();
> +	/*
> +	 * Memory barrier on the caller thread _after_ we finished
> +	 * waiting for the last IPI. Matches memory barriers around mm_cpumask
> +	 * modification in context_switch().
> +	 */
> +	smp_mb();
> +#endif	/* #ifdef CONFIG_SMP */
> +	return 0;
> +}
> +
>  #ifndef CONFIG_SMP
> 
>  int rcu_expedited_torture_stats(char *page)
> -- 
> Mathieu Desnoyers
> OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier (v3a)
  2010-01-11 17:27                                                         ` Paul E. McKenney
@ 2010-01-11 17:35                                                           ` Mathieu Desnoyers
  0 siblings, 0 replies; 107+ messages in thread
From: Mathieu Desnoyers @ 2010-01-11 17:35 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Steven Rostedt, Oleg Nesterov, Peter Zijlstra, linux-kernel,
	Ingo Molnar, akpm, josh, tglx, Valdis.Kletnieks, dhowells, laijs,
	dipankar

* Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> On Sun, Jan 10, 2010 at 11:29:03PM -0500, Mathieu Desnoyers wrote:
> > Here is an implementation of a new system call, sys_membarrier(), which
> > executes a memory barrier on all threads of the current process.
> > 
> > It aims at greatly simplifying and enhancing the current signal-based
> > liburcu userspace RCU synchronize_rcu() implementation.
> > (found at http://lttng.org/urcu)
> 
> Given that this has the memory barrier both before and after the
> assignment to ->mm, looks good to me from a memory-ordering viewpoint.
> I must defer to others on the effect on context-switch overhead.

More precisely, it's the assignment to cpu_vm_mask (clear bit/set bit)
that needs to be surrounded by memory barriers here. This is what we use
as the cpu mask to which IPIs are sent. Only the current ->mm is accessed,
so ->mm ordering is not the issue here.

Thanks,

Mathieu

> 
> 						Thanx, Paul
> 
> > Changelog since v1:
> > 
> > - Only perform the IPI in CONFIG_SMP.
> > - Only perform the IPI if the process has more than one thread.
> > - Only send IPIs to CPUs involved with threads belonging to our process.
> > - Adaptative IPI scheme (single vs many IPI with threshold).
> > - Issue smp_mb() at the beginning and end of the system call.
> > 
> > Changelog since v2:
> > - simply send-to-many to the mm_cpumask. It contains the list of processors we
> >   have to IPI to (which use the mm), and this mask is updated atomically.
> > 
> > Both the signal-based and the sys_membarrier userspace RCU schemes
> > permit us to remove the memory barrier from the userspace RCU
> > rcu_read_lock() and rcu_read_unlock() primitives, thus significantly
> > accelerating them. These memory barriers are replaced by compiler
> > barriers on the read-side, and all matching memory barriers on the 
> > write-side are turned into an invokation of a memory barrier on all
> > active threads in the process. By letting the kernel perform this
> > synchronization rather than dumbly sending a signal to every process
> > threads (as we currently do), we diminish the number of unnecessary wake
> > ups and only issue the memory barriers on active threads. Non-running
> > threads do not need to execute such barrier anyway, because these are
> > implied by the scheduler context switches.
> > 
> > To explain the benefit of this scheme, let's introduce two example threads:
> > 
> > Thread A (non-frequent, e.g. executing liburcu synchronize_rcu())
> > Thread B (frequent, e.g. executing liburcu rcu_read_lock()/rcu_read_unlock())
> > 
> > In a scheme where all smp_mb() in thread A synchronize_rcu() are
> > ordering memory accesses with respect to smp_mb() present in 
> > rcu_read_lock/unlock(), we can change all smp_mb() from
> > synchronize_rcu() into calls to sys_membarrier() and all smp_mb() from
> > rcu_read_lock/unlock() into compiler barriers "barrier()".
> > 
> > Before the change, we had, for each smp_mb() pairs:
> > 
> > Thread A                    Thread B
> > prev mem accesses           prev mem accesses
> > smp_mb()                    smp_mb()
> > follow mem accesses         follow mem accesses
> > 
> > After the change, these pairs become:
> > 
> > Thread A                    Thread B
> > prev mem accesses           prev mem accesses
> > sys_membarrier()            barrier()
> > follow mem accesses         follow mem accesses
> > 
> > As we can see, there are two possible scenarios: either Thread B memory
> > accesses do not happen concurrently with Thread A accesses (1), or they
> > do (2).
> > 
> > 1) Non-concurrent Thread A vs Thread B accesses:
> > 
> > Thread A                    Thread B
> > prev mem accesses
> > sys_membarrier()
> > follow mem accesses
> >                             prev mem accesses
> >                             barrier()
> >                             follow mem accesses
> > 
> > In this case, thread B accesses will be weakly ordered. This is OK,
> > because at that point, thread A is not particularly interested in
> > ordering them with respect to its own accesses.
> > 
> > 2) Concurrent Thread A vs Thread B accesses
> > 
> > Thread A                    Thread B
> > prev mem accesses           prev mem accesses
> > sys_membarrier()            barrier()
> > follow mem accesses         follow mem accesses
> > 
> > In this case, thread B accesses, which are ensured to be in program
> > order thanks to the compiler barrier, will be "upgraded" to full
> > smp_mb() thanks to the IPIs executing memory barriers on each active
> > system threads. Each non-running process threads are intrinsically
> > serialized by the scheduler.
> > 
> > For my Intel Xeon E5405 (new set of results, disabled kernel debugging)
> > 
> > T=1: 0m18.921s
> > T=2: 0m19.457s
> > T=3: 0m21.619s
> > T=4: 0m21.641s
> > T=5: 0m23.426s
> > T=6: 0m26.450s
> > T=7: 0m27.731s
> > 
> > The expected top pattern, when using 1 CPU for a thread doing sys_membarrier()
> > in a loop and other threads busy-waiting in user-space on a variable shows that
> > the thread doing sys_membarrier is doing mostly system calls, and other threads
> > are mostly running in user-space. Side-note, in this test, it's important to
> > check that individual threads are not always fully at 100% user-space time (they
> > range between ~95% and 100%), because when some thread in the test is always at
> > 100% on the same CPU, this means it does not get the IPI at all. (I actually
> > found out about a bug in my own code while developing it with this test.)
> > 
> > Cpu0  :100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
> > Cpu1  : 99.7%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.3%hi,  0.0%si,  0.0%st
> > Cpu2  : 99.3%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.7%hi,  0.0%si,  0.0%st
> > Cpu3  :100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
> > Cpu4  :100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
> > Cpu5  : 96.0%us,  1.3%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  2.6%si,  0.0%st
> > Cpu6  :  1.3%us, 98.7%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
> > Cpu7  : 96.1%us,  3.3%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.3%hi,  0.3%si,  0.0%st
> > 
> > Results in liburcu:
> > 
> > Operations in 10s, 6 readers, 2 writers:
> > 
> > (what we previously had)
> > memory barriers in reader: 973494744 reads, 892368 writes
> > signal-based scheme:      6289946025 reads,   1251 writes
> > 
> > (what we have now, with dynamic sys_membarrier check)
> > memory barriers in reader: 907693804 reads, 817793 writes
> > sys_membarrier scheme:    4061976535 reads, 526807 writes
> > 
> > So the dynamic sys_membarrier availability check adds some overhead to the
> > read-side, but besides that, we can see that we are close to the read-side
> > performance of the signal-based scheme and also close (5/8) to the performance
> > of the memory-barrier write-side. We have a write-side speedup of 421:1 over the
> > signal-based scheme by using the sys_membarrier system call. This allows a 4.5:1
> > read-side speedup over the memory barrier scheme.
> > 
> > The system call number is only assigned for x86_64 in this RFC patch.
> > 
> > Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
> > CC: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> > CC: mingo@elte.hu
> > CC: laijs@cn.fujitsu.com
> > CC: dipankar@in.ibm.com
> > CC: akpm@linux-foundation.org
> > CC: josh@joshtriplett.org
> > CC: dvhltc@us.ibm.com
> > CC: niv@us.ibm.com
> > CC: tglx@linutronix.de
> > CC: peterz@infradead.org
> > CC: rostedt@goodmis.org
> > CC: Valdis.Kletnieks@vt.edu
> > CC: dhowells@redhat.com
> > ---
> >  arch/x86/include/asm/unistd_64.h |    2 +
> >  kernel/sched.c                   |   59 ++++++++++++++++++++++++++++++++++++++-
> >  2 files changed, 60 insertions(+), 1 deletion(-)
> > 
> > Index: linux-2.6-lttng/arch/x86/include/asm/unistd_64.h
> > ===================================================================
> > --- linux-2.6-lttng.orig/arch/x86/include/asm/unistd_64.h	2010-01-10 19:21:31.000000000 -0500
> > +++ linux-2.6-lttng/arch/x86/include/asm/unistd_64.h	2010-01-10 19:21:37.000000000 -0500
> > @@ -661,6 +661,8 @@ __SYSCALL(__NR_pwritev, sys_pwritev)
> >  __SYSCALL(__NR_rt_tgsigqueueinfo, sys_rt_tgsigqueueinfo)
> >  #define __NR_perf_event_open			298
> >  __SYSCALL(__NR_perf_event_open, sys_perf_event_open)
> > +#define __NR_membarrier				299
> > +__SYSCALL(__NR_membarrier, sys_membarrier)
> > 
> >  #ifndef __NO_STUBS
> >  #define __ARCH_WANT_OLD_READDIR
> > Index: linux-2.6-lttng/kernel/sched.c
> > ===================================================================
> > --- linux-2.6-lttng.orig/kernel/sched.c	2010-01-10 19:21:31.000000000 -0500
> > +++ linux-2.6-lttng/kernel/sched.c	2010-01-10 22:22:40.000000000 -0500
> > @@ -2861,12 +2861,26 @@ context_switch(struct rq *rq, struct tas
> >  	 */
> >  	arch_start_context_switch(prev);
> > 
> > +	/*
> > +	 * sys_membarrier IPI-mb scheme requires a memory barrier between
> > +	 * user-space thread execution and update to mm_cpumask.
> > +	 */
> > +	if (likely(oldmm) && likely(oldmm != mm))
> > +		smp_mb__before_clear_bit();
> > +
> >  	if (unlikely(!mm)) {
> >  		next->active_mm = oldmm;
> >  		atomic_inc(&oldmm->mm_count);
> >  		enter_lazy_tlb(oldmm, next);
> > -	} else
> > +	} else {
> >  		switch_mm(oldmm, mm, next);
> > +		/*
> > +		 * sys_membarrier IPI-mb scheme requires a memory barrier
> > +		 * between update to mm_cpumask and user-space thread execution.
> > +		 */
> > +		if (likely(oldmm != mm))
> > +			smp_mb__after_clear_bit();
> > +	}
> > 
> >  	if (unlikely(!prev->mm)) {
> >  		prev->active_mm = NULL;
> > @@ -10822,6 +10836,49 @@ struct cgroup_subsys cpuacct_subsys = {
> >  };
> >  #endif	/* CONFIG_CGROUP_CPUACCT */
> > 
> > +/*
> > + * Execute a memory barrier on all active threads from the current process
> > + * on SMP systems. Do not rely on implicit barriers in
> > + * smp_call_function_many(), just in case they are ever relaxed in the future.
> > + */
> > +static void membarrier_ipi(void *unused)
> > +{
> > +	smp_mb();
> > +}
> > +
> > +/*
> > + * sys_membarrier - issue memory barrier on current process running threads
> > + *
> > + * Execute a memory barrier on all running threads of the current process.
> > + * Upon completion, the caller thread is ensured that all process threads
> > + * have passed through a state where memory accesses match program order.
> > + * (non-running threads are de facto in such a state)
> > + */
> > +SYSCALL_DEFINE0(membarrier)
> > +{
> > +#ifdef CONFIG_SMP
> > +	if (unlikely(thread_group_empty(current)))
> > +		return 0;
> > +	/*
> > +	 * Memory barrier on the caller thread _before_ sending first
> > +	 * IPI. Matches memory barriers around mm_cpumask modification in
> > +	 * context_switch().
> > +	 */
> > +	smp_mb();
> > +	preempt_disable();
> > +	smp_call_function_many(mm_cpumask(current->mm), membarrier_ipi,
> > +			       NULL, 1);
> > +	preempt_enable();
> > +	/*
> > +	 * Memory barrier on the caller thread _after_ we finished
> > +	 * waiting for the last IPI. Matches memory barriers around mm_cpumask
> > +	 * modification in context_switch().
> > +	 */
> > +	smp_mb();
> > +#endif	/* #ifdef CONFIG_SMP */
> > +	return 0;
> > +}
> > +
> >  #ifndef CONFIG_SMP
> > 
> >  int rcu_expedited_torture_stats(char *page)
> > -- 
> > Mathieu Desnoyers
> > OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier (v3a)
  2010-01-11  4:29                                                       ` [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier (v3a) Mathieu Desnoyers
  2010-01-11 17:27                                                         ` Paul E. McKenney
@ 2010-01-11 17:50                                                         ` Peter Zijlstra
  2010-01-11 20:52                                                           ` Mathieu Desnoyers
  1 sibling, 1 reply; 107+ messages in thread
From: Peter Zijlstra @ 2010-01-11 17:50 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Paul E. McKenney, Steven Rostedt, Oleg Nesterov, linux-kernel,
	Ingo Molnar, akpm, josh, tglx, Valdis.Kletnieks, dhowells, laijs,
	dipankar, David S. Miller

On Sun, 2010-01-10 at 23:29 -0500, Mathieu Desnoyers wrote:
> Here is an implementation of a new system call, sys_membarrier(), which
> executes a memory barrier on all threads of the current process.

Please start a new thread for new versions; I really didn't find this
until I started reading in date order instead of thread order.


> Index: linux-2.6-lttng/arch/x86/include/asm/unistd_64.h
> ===================================================================
> --- linux-2.6-lttng.orig/arch/x86/include/asm/unistd_64.h	2010-01-10 19:21:31.000000000 -0500
> +++ linux-2.6-lttng/arch/x86/include/asm/unistd_64.h	2010-01-10 19:21:37.000000000 -0500
> @@ -661,6 +661,8 @@ __SYSCALL(__NR_pwritev, sys_pwritev)
>  __SYSCALL(__NR_rt_tgsigqueueinfo, sys_rt_tgsigqueueinfo)
>  #define __NR_perf_event_open			298
>  __SYSCALL(__NR_perf_event_open, sys_perf_event_open)
> +#define __NR_membarrier				299
> +__SYSCALL(__NR_membarrier, sys_membarrier)
>  
>  #ifndef __NO_STUBS
>  #define __ARCH_WANT_OLD_READDIR
> Index: linux-2.6-lttng/kernel/sched.c
> ===================================================================
> --- linux-2.6-lttng.orig/kernel/sched.c	2010-01-10 19:21:31.000000000 -0500
> +++ linux-2.6-lttng/kernel/sched.c	2010-01-10 22:22:40.000000000 -0500
> @@ -2861,12 +2861,26 @@ context_switch(struct rq *rq, struct tas
>  	 */
>  	arch_start_context_switch(prev);
>  
> +	/*
> +	 * sys_membarrier IPI-mb scheme requires a memory barrier between
> +	 * user-space thread execution and update to mm_cpumask.
> +	 */
> +	if (likely(oldmm) && likely(oldmm != mm))
> +		smp_mb__before_clear_bit();
> +
>  	if (unlikely(!mm)) {
>  		next->active_mm = oldmm;
>  		atomic_inc(&oldmm->mm_count);
>  		enter_lazy_tlb(oldmm, next);
> -	} else
> +	} else {
>  		switch_mm(oldmm, mm, next);
> +		/*
> +		 * sys_membarrier IPI-mb scheme requires a memory barrier
> +		 * between update to mm_cpumask and user-space thread execution.
> +		 */
> +		if (likely(oldmm != mm))
> +			smp_mb__after_clear_bit();
> +	}
>  
>  	if (unlikely(!prev->mm)) {
>  		prev->active_mm = NULL;
> @@ -10822,6 +10836,49 @@ struct cgroup_subsys cpuacct_subsys = {
>  };
>  #endif	/* CONFIG_CGROUP_CPUACCT */
>  
> +/*
> + * Execute a memory barrier on all active threads from the current process
> + * on SMP systems. Do not rely on implicit barriers in
> + * smp_call_function_many(), just in case they are ever relaxed in the future.
> + */
> +static void membarrier_ipi(void *unused)
> +{
> +	smp_mb();
> +}
> +
> +/*
> + * sys_membarrier - issue memory barrier on current process running threads
> + *
> + * Execute a memory barrier on all running threads of the current process.
> + * Upon completion, the caller thread is ensured that all process threads
> + * have passed through a state where memory accesses match program order.
> + * (non-running threads are de facto in such a state)
> + */
> +SYSCALL_DEFINE0(membarrier)
> +{
> +#ifdef CONFIG_SMP
> +	if (unlikely(thread_group_empty(current)))
> +		return 0;
> +	/*
> +	 * Memory barrier on the caller thread _before_ sending first
> +	 * IPI. Matches memory barriers around mm_cpumask modification in
> +	 * context_switch().
> +	 */
> +	smp_mb();
> +	preempt_disable();
> +	smp_call_function_many(mm_cpumask(current->mm), membarrier_ipi,
> +			       NULL, 1);
> +	preempt_enable();
> +	/*
> +	 * Memory barrier on the caller thread _after_ we finished
> +	 * waiting for the last IPI. Matches memory barriers around mm_cpumask
> +	 * modification in context_switch().
> +	 */
> +	smp_mb();
> +#endif	/* #ifdef CONFIG_SMP */
> +	return 0;
> +}
> +
>  #ifndef CONFIG_SMP
>  
>  int rcu_expedited_torture_stats(char *page)

Right, so here you rely on the arch switch_mm() implementation to keep
mm_cpumask() current, but then stick a memory barrier in the generic
code... seems odd.

x86 switch_mm() does indeed keep it current, but writing cr3 is also a
rather serializing instruction.

Furthermore, do we really need that smp_mb() in the membarrier_ipi()
function? Shouldn't we investigate if either:
 - receiving an IPI implies an mb, or
 - enter/leave kernelspace implies an mb
?

So while I much like the simplified version (the previous one was
heavily over-engineered), I
 1) don't like that memory barrier in the generic code,
 2) don't think that architectures necessarily keep that mask as tight.

[ even if for x86 smp_mb__{before,after}_clear_bit are a nop, tying that
  to switch_mm() semantics just reeks ]

See for example the sparc64 implementation of switch_mm(), which only
sets cpus in mm_cpumask(); only TLB flushes ever clear them. Also, I
wouldn't know if switch_mm() implies an mb on sparc64.



^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier
  2010-01-11 16:25                                                       ` [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier Paul E. McKenney
@ 2010-01-11 20:21                                                         ` Mathieu Desnoyers
  2010-01-11 21:48                                                           ` Paul E. McKenney
  0 siblings, 1 reply; 107+ messages in thread
From: Mathieu Desnoyers @ 2010-01-11 20:21 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Steven Rostedt, Oleg Nesterov, Peter Zijlstra, linux-kernel,
	Ingo Molnar, akpm, josh, tglx, Valdis.Kletnieks, dhowells, laijs,
	dipankar

* Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> On Sun, Jan 10, 2010 at 11:25:21PM -0500, Mathieu Desnoyers wrote:
> > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > [...]
> > > > Even when taking the spinlocks, efficient iteration on active threads is
> > > > done with for_each_cpu(cpu, mm_cpumask(current->mm)), which depends on
> > > > the same cpumask, and thus requires the same memory barriers around the
> > > > updates.
> > > 
> > > Ouch!!!  Good point and good catch!!!
> > > 
> > > > We could switch to an inefficient iteration on all online CPUs instead,
> > > > and check read runqueue ->mm with the spinlock held. Is that what you
> > > > propose ? This will cause reading of large amounts of runqueue
> > > > information, especially on large systems running few threads. The other
> > > > way around is to iterate on all the process threads: in this case, small
> > > > systems running many threads will have to read information about many
> > > > inactive threads, which is not much better.
> > > 
> > > I am not all that worried about exactly what we do as long as it is
> > > pretty obviously correct.  We can then improve performance when and as
> > > the need arises.  We might need to use any of the strategies you
> > > propose, or perhaps even choose among them depending on the number of
> > > threads in the process, the number of CPUs, and so forth.  (I hope not,
> > > but...)
> > > 
> > > My guess is that an obviously correct approach would work well for a
> > > slowpath.  If someone later runs into performance problems, we can fix
> > > them with the added knowledge of what they are trying to do.
> > > 
> > 
> > OK, here is what I propose. Let's choose between two implementations
> > (v3a and v3b), which implement two "obviously correct" approaches. In
> > summary:
> > 
> > * baseline (based on 2.6.32.2)
> >    text	   data	    bss	    dec	    hex	filename
> >   76887	   8782	   2044	  87713	  156a1	kernel/sched.o
> > 
> > * v3a: ipi to many using mm_cpumask
> > 
> > - adds smp_mb__before_clear_bit()/smp_mb__after_clear_bit() before and
> >   after mm_cpumask stores in context_switch(). They are only executed
> >   when oldmm and mm are different. (it's my turn to hide behind an
> >   appropriately-sized boulder for touching the scheduler). ;) Note that
> >   it's not that bad, as these barriers turn into simple compiler barrier()
> >   on:
> >     avr32, blackfin, cris, frb, h8300, m32r, m68k, mn10300, score, sh,
> >     sparc, x86 and xtensa.
> >   The less lucky architectures gaining two smp_mb() are:
> >     alpha, arm, ia64, mips, parisc, powerpc and s390.
> >   ia64 is gaining only one smp_mb() thanks to its acquire semantic.
> > - size
> >    text	   data	    bss	    dec	    hex	filename
> >   77239	   8782	   2044	  88065	  15801	kernel/sched.o
> >   -> adds 352 bytes of text
> > - Number of lines (system call source code, w/o comments) : 18
> > 
> > * v3b: iteration on min(num_online_cpus(), nr threads in the process),
> >   taking runqueue spinlocks, allocating a cpumask, ipi to many to the
> >   cpumask. Does not allocate the cpumask if only a single IPI is needed.
> > 
> > - only adds sys_membarrier() and related functions.
> > - size
> >    text	   data	    bss	    dec	    hex	filename
> >   78047	   8782	   2044	  88873	  15b29	kernel/sched.o
> >   -> adds 1160 bytes of text
> > - Number of lines (system call source code, w/o comments) : 163
> > 
> > I'll reply to this email with the two implementations. Comments are
> > welcome.
> 
> Cool!!!  Just for completeness, I point out the following trivial
> implementation:
> 
> /*
>  * sys_membarrier - issue memory barrier on current process running threads
>  *
>  * Execute a memory barrier on all running threads of the current process.
>  * Upon completion, the caller thread is ensured that all process threads
>  * have passed through a state where memory accesses match program order.
>  * (non-running threads are de facto in such a state)
>  *
>  * Note that synchronize_sched() has the side-effect of doing a memory
>  * barrier on each CPU.
>  */
> SYSCALL_DEFINE0(membarrier)
> {
> 	synchronize_sched();
> }
> 
> This does unnecessarily hit all CPUs in the system, but has the same
> minimal impact that in-kernel RCU already has.  It has long latency,
> (milliseconds) which might well disqualify it from consideration for
> some applications.  On the other hand, it automatically batches multiple
> concurrent calls to sys_membarrier().

Benchmarking this implementation:

1000 calls to sys_membarrier() take:

T=1: 0m16.007s
T=2: 0m16.006s
T=3: 0m16.010s
T=4: 0m16.008s
T=5: 0m16.005s
T=6: 0m16.005s
T=7: 0m16.005s

That's 16 ms per call (my HZ is 250), as you expected. So this solution
is about 10,000 times slower than the IPI-based one. We'd be better off
using signals instead.
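
(For reference, here is a minimal sketch of the benchmark loop; only the
thread calling sys_membarrier() is shown, the T busy-looping threads are
left out, and the syscall number is the x86_64 one proposed in this RFC.
Run it under time(1).)

#include <sys/syscall.h>
#include <unistd.h>

#ifndef __NR_membarrier
#define __NR_membarrier 299	/* x86_64 number proposed in this RFC */
#endif

int main(void)
{
	int i;

	for (i = 0; i < 1000; i++)
		syscall(__NR_membarrier);
	return 0;
}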

Thanks,

Mathieu


> 
> 							Thanx, Paul

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier (v3a)
  2010-01-11 17:50                                                         ` Peter Zijlstra
@ 2010-01-11 20:52                                                           ` Mathieu Desnoyers
  2010-01-11 21:19                                                             ` Peter Zijlstra
                                                                               ` (2 more replies)
  0 siblings, 3 replies; 107+ messages in thread
From: Mathieu Desnoyers @ 2010-01-11 20:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Paul E. McKenney, Steven Rostedt, Oleg Nesterov, linux-kernel,
	Ingo Molnar, akpm, josh, tglx, Valdis.Kletnieks, dhowells, laijs,
	dipankar, David S. Miller

* Peter Zijlstra (peterz@infradead.org) wrote:
> On Sun, 2010-01-10 at 23:29 -0500, Mathieu Desnoyers wrote:
> > Here is an implementation of a new system call, sys_membarrier(), which
> > executes a memory barrier on all threads of the current process.
> 
> Please start a new thread for new versions, I really didn't find this
> until I started reading in date order instead of thread order.

OK

> 
> 
> > Index: linux-2.6-lttng/arch/x86/include/asm/unistd_64.h
> > ===================================================================
> > --- linux-2.6-lttng.orig/arch/x86/include/asm/unistd_64.h	2010-01-10 19:21:31.000000000 -0500
> > +++ linux-2.6-lttng/arch/x86/include/asm/unistd_64.h	2010-01-10 19:21:37.000000000 -0500
> > @@ -661,6 +661,8 @@ __SYSCALL(__NR_pwritev, sys_pwritev)
> >  __SYSCALL(__NR_rt_tgsigqueueinfo, sys_rt_tgsigqueueinfo)
> >  #define __NR_perf_event_open			298
> >  __SYSCALL(__NR_perf_event_open, sys_perf_event_open)
> > +#define __NR_membarrier				299
> > +__SYSCALL(__NR_membarrier, sys_membarrier)
> >  
> >  #ifndef __NO_STUBS
> >  #define __ARCH_WANT_OLD_READDIR
> > Index: linux-2.6-lttng/kernel/sched.c
> > ===================================================================
> > --- linux-2.6-lttng.orig/kernel/sched.c	2010-01-10 19:21:31.000000000 -0500
> > +++ linux-2.6-lttng/kernel/sched.c	2010-01-10 22:22:40.000000000 -0500
> > @@ -2861,12 +2861,26 @@ context_switch(struct rq *rq, struct tas
> >  	 */
> >  	arch_start_context_switch(prev);
> >  
> > +	/*
> > +	 * sys_membarrier IPI-mb scheme requires a memory barrier between
> > +	 * user-space thread execution and update to mm_cpumask.
> > +	 */
> > +	if (likely(oldmm) && likely(oldmm != mm))
> > +		smp_mb__before_clear_bit();
> > +
> >  	if (unlikely(!mm)) {
> >  		next->active_mm = oldmm;
> >  		atomic_inc(&oldmm->mm_count);
> >  		enter_lazy_tlb(oldmm, next);
> > -	} else
> > +	} else {
> >  		switch_mm(oldmm, mm, next);
> > +		/*
> > +		 * sys_membarrier IPI-mb scheme requires a memory barrier
> > +		 * between update to mm_cpumask and user-space thread execution.
> > +		 */
> > +		if (likely(oldmm != mm))
> > +			smp_mb__after_clear_bit();
> > +	}
> >  
> >  	if (unlikely(!prev->mm)) {
> >  		prev->active_mm = NULL;
> > @@ -10822,6 +10836,49 @@ struct cgroup_subsys cpuacct_subsys = {
> >  };
> >  #endif	/* CONFIG_CGROUP_CPUACCT */
> >  
> > +/*
> > + * Execute a memory barrier on all active threads from the current process
> > + * on SMP systems. Do not rely on implicit barriers in
> > + * smp_call_function_many(), just in case they are ever relaxed in the future.
> > + */
> > +static void membarrier_ipi(void *unused)
> > +{
> > +	smp_mb();
> > +}
> > +
> > +/*
> > + * sys_membarrier - issue memory barrier on current process running threads
> > + *
> > + * Execute a memory barrier on all running threads of the current process.
> > + * Upon completion, the caller thread is ensured that all process threads
> > + * have passed through a state where memory accesses match program order.
> > + * (non-running threads are de facto in such a state)
> > + */
> > +SYSCALL_DEFINE0(membarrier)
> > +{
> > +#ifdef CONFIG_SMP
> > +	if (unlikely(thread_group_empty(current)))
> > +		return 0;
> > +	/*
> > +	 * Memory barrier on the caller thread _before_ sending first
> > +	 * IPI. Matches memory barriers around mm_cpumask modification in
> > +	 * context_switch().
> > +	 */
> > +	smp_mb();
> > +	preempt_disable();
> > +	smp_call_function_many(mm_cpumask(current->mm), membarrier_ipi,
> > +			       NULL, 1);
> > +	preempt_enable();
> > +	/*
> > +	 * Memory barrier on the caller thread _after_ we finished
> > +	 * waiting for the last IPI. Matches memory barriers around mm_cpumask
> > +	 * modification in context_switch().
> > +	 */
> > +	smp_mb();
> > +#endif	/* #ifdef CONFIG_SMP */
> > +	return 0;
> > +}
> > +
> >  #ifndef CONFIG_SMP
> >  
> >  int rcu_expedited_torture_stats(char *page)
> 
> Right, so here you rely on the arch switch_mm() implementation to keep
> mm_cpumask() current, but then stick a memory barrier in the generic
> code... seems odd.

Agreed. I wanted to give an idea of the required memory barriers
without changing every arch's switch_mm(). If it is generally agreed
that this kind of overhead is OK, then I think the path to follow is to
audit each architecture's switch_mm() and add comments and/or barriers as
needed. We can document that switch_mm() requirement at the top of the
system call, so that as we gradually assign the system call number to each
architecture, the matching switch_mm() modifications get done as well.
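
To make this concrete, here is a rough sketch (not a patch against any
particular architecture) of what folding the two barriers into an arch
switch_mm() could look like; the actual MMU context switch is elided:

static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
			     struct task_struct *tsk)
{
	unsigned int cpu = smp_processor_id();

	if (likely(prev != next)) {
		/* Order user-space accesses before the mm_cpumask update. */
		smp_mb__before_clear_bit();
		cpumask_clear_cpu(cpu, mm_cpumask(prev));
		cpumask_set_cpu(cpu, mm_cpumask(next));
		/* ... arch-specific MMU context switch goes here ... */
		/* Order the mm_cpumask update before user-space accesses. */
		smp_mb__after_clear_bit();
	}
}

On the architectures where these primitives are just compiler barriers,
this costs nothing extra.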

> 
> x86 switch_mm() does indeed keep it current, but writing cr3 is also a
> rather serializing instruction.

Agreed. A write to CR3 is listed as a serializing instruction, whose
effects include the equivalent of an "mfence" instruction.

> 
> Furthermore, do we really need that smp_mb() in the membarrier_ipi()
> function? Shouldn't we investigate if either:
>  - receiving an IPI implies an mb, or

kernel/smp.c:

void generic_smp_call_function_interrupt(void)
...
        /*
         * Ensure entry is visible on call_function_queue after we have
         * entered the IPI. See comment in smp_call_function_many.
         * If we don't have this, then we may miss an entry on the list
         * and never get another IPI to process it.
         */
        smp_mb();

So I think that if other concurrent IPI requests are sent, we have no
guarantee that the IPI handler will indeed issue the memory barrier
_after_ we added our entry to the list. The list itself is synchronized
with an irq-disabling spinlock, but not with an explicit memory barrier.
Or am I missing a subtlety here?

>  - enter/leave kernelspace implies an mb
> ?

On x86, iret is a serializing instruction. However, I haven't found any
information saying that int 0x80 or sysenter/sysexit are serializing
instructions. And there seems to be no explicit mfence in x86 entry_*.S
(except a 64-bit gs swap workaround). So I'm tempted to answer: no,
entering/returning from kernelspace does not seem to imply an mb on x86.

> 
> So while I much like the simplified version, that previous one was
> heavily over engineer, I 
>  1) don't like that memory barrier in the generic code, 

Would it help if we proceeded to arch-per-arch modification of
switch_mm() instead?

>  2) don't think that arch's necessarily keep that mask as tight.
> 
> [ even if for x86 smp_mb__{before,after}_clear_bit are a nop, tying that
>   to switch_mm() semantics just reeks ]
> 
> See for example the sparc64 implementation of switch_mm() that only sets
> cpus in mm_cpumask(), but only tlb flushes clear them. Also, I wouldn't
> know if switch_mm() implies an mb on sparc64.

Yes, I've seen that some architectures delay clearing the mask until much
later (lazy TLB shootdown, I think, is the correct term). As far as these
barriers are concerned, as long as we have one barrier before the clear
bit and one barrier after the set bit, we're fine.

So, depending on the order, we might need:

smp_mb()
clear bit
set bit
smp_mb()

or, simply (and this applies to lazy TLB shootdown on sparc64):

set bit
mb()
clear bit (possibly performed by tlb flush)

So the clear bit can occur far, far away in the future, we don't care.
We'll just send extra IPIs when unneeded in this time-frame.

Thanks,

Mathieu


-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier (v3a)
  2010-01-11 20:52                                                           ` Mathieu Desnoyers
@ 2010-01-11 21:19                                                             ` Peter Zijlstra
  2010-01-11 22:04                                                               ` Mathieu Desnoyers
  2010-01-11 21:19                                                             ` Peter Zijlstra
  2010-01-11 21:31                                                             ` Peter Zijlstra
  2 siblings, 1 reply; 107+ messages in thread
From: Peter Zijlstra @ 2010-01-11 21:19 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Paul E. McKenney, Steven Rostedt, Oleg Nesterov, linux-kernel,
	Ingo Molnar, akpm, josh, tglx, Valdis.Kletnieks, dhowells, laijs,
	dipankar, David S. Miller

On Mon, 2010-01-11 at 15:52 -0500, Mathieu Desnoyers wrote:
> 
> So the clear bit can occur far, far away in the future, we don't care.
> We'll just send extra IPIs when unneeded in this time-frame.

I think we should try harder not to disturb CPUs, particularly in the
face of RT tasks and DoS scenarios. Therefore I don't think we should
just wildly send to mm_cpumask(), but verify (although speculatively)
that the remote tasks' mm matches ours.




^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier (v3a)
  2010-01-11 20:52                                                           ` Mathieu Desnoyers
  2010-01-11 21:19                                                             ` Peter Zijlstra
@ 2010-01-11 21:19                                                             ` Peter Zijlstra
  2010-01-11 21:31                                                             ` Peter Zijlstra
  2 siblings, 0 replies; 107+ messages in thread
From: Peter Zijlstra @ 2010-01-11 21:19 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Paul E. McKenney, Steven Rostedt, Oleg Nesterov, linux-kernel,
	Ingo Molnar, akpm, josh, tglx, Valdis.Kletnieks, dhowells, laijs,
	dipankar, David S. Miller

On Mon, 2010-01-11 at 15:52 -0500, Mathieu Desnoyers wrote:
> >  1) don't like that memory barrier in the generic code, 
> 
> Would that help if we proceed to arch-per-arch modification of
> switch_mm() instead ? 

Much preferred indeed.


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier (v3a)
  2010-01-11 20:52                                                           ` Mathieu Desnoyers
  2010-01-11 21:19                                                             ` Peter Zijlstra
  2010-01-11 21:19                                                             ` Peter Zijlstra
@ 2010-01-11 21:31                                                             ` Peter Zijlstra
  2 siblings, 0 replies; 107+ messages in thread
From: Peter Zijlstra @ 2010-01-11 21:31 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Paul E. McKenney, Steven Rostedt, Oleg Nesterov, linux-kernel,
	Ingo Molnar, akpm, josh, tglx, Valdis.Kletnieks, dhowells, laijs,
	dipankar, David S. Miller, Linus Torvalds

On Mon, 2010-01-11 at 15:52 -0500, Mathieu Desnoyers wrote:

> >  - receiving an IPI implies an mb, or
> 
> So, I think that if we have other concurrent IPI requests sent, we have
> no guarantee that the IPI handler will indeed issue the memory barrier 
> _after_ we added the entry to the list. The list itself is synchronized
> with a spin lock-irqoff, but not with an explicit memory barrier. Or am
> I missing a subtlety here ?

Linus once said ( http://lkml.org/lkml/2009/2/18/145 ) :

"... at least on x86, taking an interrupt should be a serializing 
event, so there should be no reason for anything on the receiving side."

But he also notes that in general this does not need to be so.

My question was more general than whether this is currently the case for
x86, though; I was wondering whether we want this to be true and should
therefore make it so.

However, re-reading the referenced discussion, I think we might not want
to force it in general because it might cause unneeded slow-down.

> >  - enter/leave kernelspace implies an mb
> > ?
> 
> On x86, iret is a serializing instruction. However, I haven't found any
> information saying that int 0x80 nor sysenter/sysexit are serializing
> instructions. And there seem to be no explicit mfence in x86 entry_*.S
> (except a 64-bit gs swap workaround). So I'm tempted to answer: no,
> entry/return kernelspace does not seem to imply a mb on x86. 

Right, there are some TIF flags we could play with, but simply executing
the mb is a much simpler (and clearer) alternative.




^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier
  2010-01-11 20:21                                                         ` Mathieu Desnoyers
@ 2010-01-11 21:48                                                           ` Paul E. McKenney
  2010-01-14  2:56                                                             ` Lai Jiangshan
  0 siblings, 1 reply; 107+ messages in thread
From: Paul E. McKenney @ 2010-01-11 21:48 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Steven Rostedt, Oleg Nesterov, Peter Zijlstra, linux-kernel,
	Ingo Molnar, akpm, josh, tglx, Valdis.Kletnieks, dhowells, laijs,
	dipankar

On Mon, Jan 11, 2010 at 03:21:04PM -0500, Mathieu Desnoyers wrote:
> * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > On Sun, Jan 10, 2010 at 11:25:21PM -0500, Mathieu Desnoyers wrote:
> > > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > > [...]
> > > > > Even when taking the spinlocks, efficient iteration on active threads is
> > > > > done with for_each_cpu(cpu, mm_cpumask(current->mm)), which depends on
> > > > > the same cpumask, and thus requires the same memory barriers around the
> > > > > updates.
> > > > 
> > > > Ouch!!!  Good point and good catch!!!
> > > > 
> > > > > We could switch to an inefficient iteration on all online CPUs instead,
> > > > > and check read runqueue ->mm with the spinlock held. Is that what you
> > > > > propose ? This will cause reading of large amounts of runqueue
> > > > > information, especially on large systems running few threads. The other
> > > > > way around is to iterate on all the process threads: in this case, small
> > > > > systems running many threads will have to read information about many
> > > > > inactive threads, which is not much better.
> > > > 
> > > > I am not all that worried about exactly what we do as long as it is
> > > > pretty obviously correct.  We can then improve performance when and as
> > > > the need arises.  We might need to use any of the strategies you
> > > > propose, or perhaps even choose among them depending on the number of
> > > > threads in the process, the number of CPUs, and so forth.  (I hope not,
> > > > but...)
> > > > 
> > > > My guess is that an obviously correct approach would work well for a
> > > > slowpath.  If someone later runs into performance problems, we can fix
> > > > them with the added knowledge of what they are trying to do.
> > > > 
> > > 
> > > OK, here is what I propose. Let's choose between two implementations
> > > (v3a and v3b), which implement two "obviously correct" approaches. In
> > > summary:
> > > 
> > > * baseline (based on 2.6.32.2)
> > >    text	   data	    bss	    dec	    hex	filename
> > >   76887	   8782	   2044	  87713	  156a1	kernel/sched.o
> > > 
> > > * v3a: ipi to many using mm_cpumask
> > > 
> > > - adds smp_mb__before_clear_bit()/smp_mb__after_clear_bit() before and
> > >   after mm_cpumask stores in context_switch(). They are only executed
> > >   when oldmm and mm are different. (it's my turn to hide behind an
> > >   appropriately-sized boulder for touching the scheduler). ;) Note that
> > >   it's not that bad, as these barriers turn into simple compiler barrier()
> > >   on:
> > >     avr32, blackfin, cris, frb, h8300, m32r, m68k, mn10300, score, sh,
> > >     sparc, x86 and xtensa.
> > >   The less lucky architectures gaining two smp_mb() are:
> > >     alpha, arm, ia64, mips, parisc, powerpc and s390.
> > >   ia64 is gaining only one smp_mb() thanks to its acquire semantic.
> > > - size
> > >    text	   data	    bss	    dec	    hex	filename
> > >   77239	   8782	   2044	  88065	  15801	kernel/sched.o
> > >   -> adds 352 bytes of text
> > > - Number of lines (system call source code, w/o comments) : 18
> > > 
> > > * v3b: iteration on min(num_online_cpus(), nr threads in the process),
> > >   taking runqueue spinlocks, allocating a cpumask, ipi to many to the
> > >   cpumask. Does not allocate the cpumask if only a single IPI is needed.
> > > 
> > > - only adds sys_membarrier() and related functions.
> > > - size
> > >    text	   data	    bss	    dec	    hex	filename
> > >   78047	   8782	   2044	  88873	  15b29	kernel/sched.o
> > >   -> adds 1160 bytes of text
> > > - Number of lines (system call source code, w/o comments) : 163
> > > 
> > > I'll reply to this email with the two implementations. Comments are
> > > welcome.
> > 
> > Cool!!!  Just for completeness, I point out the following trivial
> > implementation:
> > 
> > /*
> >  * sys_membarrier - issue memory barrier on current process running threads
> >  *
> >  * Execute a memory barrier on all running threads of the current process.
> >  * Upon completion, the caller thread is ensured that all process threads
> >  * have passed through a state where memory accesses match program order.
> >  * (non-running threads are de facto in such a state)
> >  *
> >  * Note that synchronize_sched() has the side-effect of doing a memory
> >  * barrier on each CPU.
> >  */
> > SYSCALL_DEFINE0(membarrier)
> > {
> > 	synchronize_sched();
> > }
> > 
> > This does unnecessarily hit all CPUs in the system, but has the same
> > minimal impact that in-kernel RCU already has.  It has long latency,
> > (milliseconds) which might well disqualify it from consideration for
> > some applications.  On the other hand, it automatically batches multiple
> > concurrent calls to sys_membarrier().
> 
> Benchmarking this implementation:
> 
> 1000 calls to sys_membarrier() take:
> 
> T=1: 0m16.007s
> T=2: 0m16.006s
> T=3: 0m16.010s
> T=4: 0m16.008s
> T=5: 0m16.005s
> T=6: 0m16.005s
> T=7: 0m16.005s
> 
> For a 16 ms per call (my HZ is 250), as you expected. So this solution
> brings a slowdown of 10,000 times compared to the IPI-based solution.
> We'd be better off using signals instead.

From a latency viewpoint, yes.  But synchronize_sched() consumes far
less CPU time than do signals, avoids waking up sleeping CPUs, batches
concurrent requests, and seems to be of some use in the kernel.  ;-)

But, as I said, just for completeness.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier (v3a)
  2010-01-11 21:19                                                             ` Peter Zijlstra
@ 2010-01-11 22:04                                                               ` Mathieu Desnoyers
  2010-01-11 22:20                                                                 ` Peter Zijlstra
  0 siblings, 1 reply; 107+ messages in thread
From: Mathieu Desnoyers @ 2010-01-11 22:04 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Paul E. McKenney, Steven Rostedt, Oleg Nesterov, linux-kernel,
	Ingo Molnar, akpm, josh, tglx, Valdis.Kletnieks, dhowells, laijs,
	dipankar, David S. Miller

* Peter Zijlstra (peterz@infradead.org) wrote:
> On Mon, 2010-01-11 at 15:52 -0500, Mathieu Desnoyers wrote:
> > 
> > So the clear bit can occur far, far away in the future, we don't care.
> > We'll just send extra IPIs when unneeded in this time-frame.
> 
> I think we should try harder not to disturb CPUs, particularly in the
> face of RT tasks and DoS scenarios. Therefore I don't think we should
> just wildly send to mm_cpumask(), but verify (although speculatively)
> that the remote tasks' mm matches ours.
> 

Well, my point of view is that if IPI TLB shootdown does not care about
disturbing CPUs running other processes in the time window of the lazy
removal, why should we ? We're adding an overhead very close to that of
an unrequired IPI shootdown which returns immediately without doing
anything.

The tradeoff here seems to be:
- more overhead within switch_mm() for more precise mm_cpumask.
vs
- lazy removal of the cpumask, which implies that some processors
  running a different process can receive the IPI for nothing.

I really doubt we could create an IPI DoS based on such a small
time window.

Thanks,

Mathieu


-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier (v3a)
  2010-01-11 22:04                                                               ` Mathieu Desnoyers
@ 2010-01-11 22:20                                                                 ` Peter Zijlstra
  2010-01-11 22:48                                                                   ` Paul E. McKenney
  2010-01-11 22:48                                                                   ` Mathieu Desnoyers
  0 siblings, 2 replies; 107+ messages in thread
From: Peter Zijlstra @ 2010-01-11 22:20 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Paul E. McKenney, Steven Rostedt, Oleg Nesterov, linux-kernel,
	Ingo Molnar, akpm, josh, tglx, Valdis.Kletnieks, dhowells, laijs,
	dipankar, David S. Miller

On Mon, 2010-01-11 at 17:04 -0500, Mathieu Desnoyers wrote:
> * Peter Zijlstra (peterz@infradead.org) wrote:
> > On Mon, 2010-01-11 at 15:52 -0500, Mathieu Desnoyers wrote:
> > > 
> > > So the clear bit can occur far, far away in the future, we don't care.
> > > We'll just send extra IPIs when unneeded in this time-frame.
> > 
> > I think we should try harder not to disturb CPUs, particularly in the
> > face of RT tasks and DoS scenarios. Therefore I don't think we should
> > just wildly send to mm_cpumask(), but verify (although speculatively)
> > that the remote tasks' mm matches ours.
> > 
> 
> Well, my point of view is that if IPI TLB shootdown does not care about
> disturbing CPUs running other processes in the time window of the lazy
> removal, why should we ?

while (1)
 sys_membarrier();

is a very good reason, TLB shootdown doesn't have that problem.

>  We're adding an overhead very close to that of
> an unrequired IPI shootdown which returns immediately without doing
> anything.

Except we don't clear the mask.

> The tradeoff here seems to be:
> - more overhead within switch_mm() for more precise mm_cpumask.
> vs
> - lazy removal of the cpumask, which implies that some processors
>   running a different process can receive the IPI for nothing.
> 
> I really doubt we could create an IPI DoS based on such a small
> time window.

What small window? When there are fewer runnable tasks than available mm
contexts, some architectures can go quite a long while without
invalidating TLBs.

So what again is wrong with:

 int cpu, this_cpu = get_cpu();

 smp_mb(); 

 for_each_cpu(cpu, mm_cpumask(current->mm)) {
   if (cpu == this_cpu)
     continue;
   if (cpu_curr(cpu)->mm != current->mm)
     continue;
   smp_send_call_function_single(cpu, do_mb, NULL, 1);
 }

 put_cpu();

?


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier (v3b)
  2010-01-11  4:30                                                       ` [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier (v3b) Mathieu Desnoyers
@ 2010-01-11 22:43                                                         ` Paul E. McKenney
  2010-01-12 15:38                                                           ` Mathieu Desnoyers
  0 siblings, 1 reply; 107+ messages in thread
From: Paul E. McKenney @ 2010-01-11 22:43 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Steven Rostedt, Oleg Nesterov, Peter Zijlstra, linux-kernel,
	Ingo Molnar, akpm, josh, tglx, Valdis.Kletnieks, dhowells, laijs,
	dipankar

On Sun, Jan 10, 2010 at 11:30:16PM -0500, Mathieu Desnoyers wrote:
> Here is an implementation of a new system call, sys_membarrier(), which
> executes a memory barrier on all threads of the current process.
> 
> It aims at greatly simplifying and enhancing the current signal-based
> liburcu userspace RCU synchronize_rcu() implementation.
> (found at http://lttng.org/urcu)

I didn't expect quite this comprehensive of an implementation from the
outset, but I guess I cannot complain.  ;-)

Overall, good stuff.

Interestingly enough, what you have implemented is analogous to
synchronize_rcu_expedited() and friends that have recently been added
to the in-kernel RCU API.  By this analogy, my earlier semi-suggestion
of synchronize_rcu() would be a candidate non-expedited implementation.
Long latency, but extremely low CPU consumption, full batching of
concurrent requests (even unrelated ones), and so on.

A few questions interspersed below.

> Changelog since v1:
> 
> - Only perform the IPI in CONFIG_SMP.
> - Only perform the IPI if the process has more than one thread.
> - Only send IPIs to CPUs involved with threads belonging to our process.
> - Adaptative IPI scheme (single vs many IPI with threshold).
> - Issue smp_mb() at the beginning and end of the system call.
> 
> Changelog since v2:
> 
> - Iteration on min(num_online_cpus(), nr threads in the process),
>   taking runqueue spinlocks, allocating a cpumask, ipi to many to the
>   cpumask. Does not allocate the cpumask if only a single IPI is needed.
> 
> 
> Both the signal-based and the sys_membarrier userspace RCU schemes
> permit us to remove the memory barrier from the userspace RCU
> rcu_read_lock() and rcu_read_unlock() primitives, thus significantly
> accelerating them. These memory barriers are replaced by compiler
> barriers on the read-side, and all matching memory barriers on the 
> write-side are turned into an invokation of a memory barrier on all
> active threads in the process. By letting the kernel perform this
> synchronization rather than dumbly sending a signal to every process
> threads (as we currently do), we diminish the number of unnecessary wake
> ups and only issue the memory barriers on active threads. Non-running
> threads do not need to execute such barrier anyway, because these are
> implied by the scheduler context switches.
> 
> To explain the benefit of this scheme, let's introduce two example threads:
> 
> Thread A (non-frequent, e.g. executing liburcu synchronize_rcu())
> Thread B (frequent, e.g. executing liburcu rcu_read_lock()/rcu_read_unlock())
> 
> In a scheme where all smp_mb() in thread A synchronize_rcu() are
> ordering memory accesses with respect to smp_mb() present in 
> rcu_read_lock/unlock(), we can change all smp_mb() from
> synchronize_rcu() into calls to sys_membarrier() and all smp_mb() from
> rcu_read_lock/unlock() into compiler barriers "barrier()".
> 
> Before the change, we had, for each smp_mb() pairs:
> 
> Thread A                    Thread B
> prev mem accesses           prev mem accesses
> smp_mb()                    smp_mb()
> follow mem accesses         follow mem accesses
> 
> After the change, these pairs become:
> 
> Thread A                    Thread B
> prev mem accesses           prev mem accesses
> sys_membarrier()            barrier()
> follow mem accesses         follow mem accesses
> 
> As we can see, there are two possible scenarios: either Thread B memory
> accesses do not happen concurrently with Thread A accesses (1), or they
> do (2).
> 
> 1) Non-concurrent Thread A vs Thread B accesses:
> 
> Thread A                    Thread B
> prev mem accesses
> sys_membarrier()
> follow mem accesses
>                             prev mem accesses
>                             barrier()
>                             follow mem accesses
> 
> In this case, thread B accesses will be weakly ordered. This is OK,
> because at that point, thread A is not particularly interested in
> ordering them with respect to its own accesses.
> 
> 2) Concurrent Thread A vs Thread B accesses
> 
> Thread A                    Thread B
> prev mem accesses           prev mem accesses
> sys_membarrier()            barrier()
> follow mem accesses         follow mem accesses
> 
> In this case, thread B accesses, which are ensured to be in program
> order thanks to the compiler barrier, will be "upgraded" to full
> smp_mb() thanks to the IPIs executing memory barriers on each active
> system threads. Each non-running process threads are intrinsically
> serialized by the scheduler.
> 
> Just tried with a cache-hot kernel compilation using 6/8 CPUs.
> 
> Normally:                                              real 2m41.852s
> With the sys_membarrier+1 busy-looping thread running: real 5m41.830s
> 
> So... 2x slower. That hurts.
> 
> So let's try allocating a cpu mask for PeterZ scheme. I prefer to have a
> small allocation overhead and benefit from cpumask broadcast if
> possible so we scale better. But that all depends on how big the
> allocation overhead is.
> 
> Impact of allocating a cpumask (time for 10,000,000 sys_membarrier
> calls, one thread is doing the sys_membarrier, the others are busy
> looping)).  Given that it costs almost half as much to perform the
> cpumask allocation than to send a single IPI, as we iterate on the CPUs
> until we find more than N match or iterated on all cpus.  If we only have
> N match or less, we send single IPIs. If we need more than that, then we
> switch to the cpumask allocation and send a broadcast IPI to the cpumask
> we construct for the matching CPUs. Let's call it the "adaptative IPI
> scheme".
> 
> For my Intel Xeon E5405
> 
> *This is calibration only, not taking the runqueue locks*
> 
> Just doing local mb()+single IPI to T other threads:
> 
> T=1: 0m18.801s
> T=2: 0m29.086s
> T=3: 0m46.841s
> T=4: 0m53.758s
> T=5: 1m10.856s
> T=6: 1m21.142s
> T=7: 1m38.362s
> 
> Just doing cpumask alloc+IPI-many to T other threads:
> 
> T=1: 0m21.778s
> T=2: 0m22.741s
> T=3: 0m22.185s
> T=4: 0m24.660s
> T=5: 0m26.855s
> T=6: 0m30.841s
> T=7: 0m29.551s
> 
> So I think the right threshold should be 1 thread (assuming other
> architecture will behave like mine). So starting with 2 threads, we
> allocate the cpumask before sending IPIs.
> 
> *end of calibration*
> 
> Resulting adaptative scheme, with runqueue locks:
> 
> T=1: 0m20.990s
> T=2: 0m22.588s
> T=3: 0m27.028s
> T=4: 0m29.027s
> T=5: 0m32.592s
> T=6: 0m36.556s
> T=7: 0m33.093s
> 
> The expected top pattern, when using 1 CPU for a thread doing sys_membarrier()
> in a loop and other threads busy-waiting in user-space on a variable shows that
> the thread doing sys_membarrier is doing mostly system calls, and other threads
> are mostly running in user-space. Side-note, in this test, it's important to
> check that individual threads are not always fully at 100% user-space time (they
> range between ~95% and 100%), because when some thread in the test is always at
> 100% on the same CPU, this means it does not get the IPI at all. (I actually
> found out about a bug in my own code while developing it with this test.)

The below data is for how many threads in the process?  Also, is "top"
accurate given that the IPI handler will have interrupts disabled?

> Cpu0  :100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
> Cpu1  : 99.7%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.3%hi,  0.0%si,  0.0%st
> Cpu2  : 99.3%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.7%hi,  0.0%si,  0.0%st
> Cpu3  :100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
> Cpu4  :100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
> Cpu5  : 96.0%us,  1.3%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  2.6%si,  0.0%st
> Cpu6  :  1.3%us, 98.7%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
> Cpu7  : 96.1%us,  3.3%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.3%hi,  0.3%si,  0.0%st
> 
> The system call number is only assigned for x86_64 in this RFC patch.
> 
> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
> CC: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> CC: mingo@elte.hu
> CC: laijs@cn.fujitsu.com
> CC: dipankar@in.ibm.com
> CC: akpm@linux-foundation.org
> CC: josh@joshtriplett.org
> CC: dvhltc@us.ibm.com
> CC: niv@us.ibm.com
> CC: tglx@linutronix.de
> CC: peterz@infradead.org
> CC: rostedt@goodmis.org
> CC: Valdis.Kletnieks@vt.edu
> CC: dhowells@redhat.com
> ---
>  arch/x86/include/asm/unistd_64.h |    2 
>  kernel/sched.c                   |  219 +++++++++++++++++++++++++++++++++++++++
>  2 files changed, 221 insertions(+)
> 
> Index: linux-2.6-lttng/arch/x86/include/asm/unistd_64.h
> ===================================================================
> --- linux-2.6-lttng.orig/arch/x86/include/asm/unistd_64.h	2010-01-10 22:23:59.000000000 -0500
> +++ linux-2.6-lttng/arch/x86/include/asm/unistd_64.h	2010-01-10 22:29:30.000000000 -0500
> @@ -661,6 +661,8 @@ __SYSCALL(__NR_pwritev, sys_pwritev)
>  __SYSCALL(__NR_rt_tgsigqueueinfo, sys_rt_tgsigqueueinfo)
>  #define __NR_perf_event_open			298
>  __SYSCALL(__NR_perf_event_open, sys_perf_event_open)
> +#define __NR_membarrier				299
> +__SYSCALL(__NR_membarrier, sys_membarrier)
> 
>  #ifndef __NO_STUBS
>  #define __ARCH_WANT_OLD_READDIR
> Index: linux-2.6-lttng/kernel/sched.c
> ===================================================================
> --- linux-2.6-lttng.orig/kernel/sched.c	2010-01-10 22:23:59.000000000 -0500
> +++ linux-2.6-lttng/kernel/sched.c	2010-01-10 23:12:35.000000000 -0500
> @@ -119,6 +119,11 @@
>   */
>  #define RUNTIME_INF	((u64)~0ULL)
> 
> +/*
> + * IPI vs cpumask broadcast threshold. Threshold of 1 IPI.
> + */
> +#define ADAPT_IPI_THRESHOLD	1
> +
>  static inline int rt_policy(int policy)
>  {
>  	if (unlikely(policy == SCHED_FIFO || policy == SCHED_RR))
> @@ -10822,6 +10827,220 @@ struct cgroup_subsys cpuacct_subsys = {
>  };
>  #endif	/* CONFIG_CGROUP_CPUACCT */
> 
> +/*
> + * Execute a memory barrier on all CPUs on SMP systems.
> + * Do not rely on implicit barriers in smp_call_function(), just in case they
> + * are ever relaxed in the future.
> + */
> +static void membarrier_ipi(void *unused)
> +{
> +	smp_mb();
> +}
> +
> +/*
> + * Handle out-of-mem by sending per-cpu IPIs instead.
> + */

Good handling for out-of-memory errors!

> +static void membarrier_cpus_retry(int this_cpu)
> +{
> +	struct mm_struct *mm;
> +	int cpu;
> +
> +	for_each_online_cpu(cpu) {
> +		if (unlikely(cpu == this_cpu))
> +			continue;
> +		spin_lock_irq(&cpu_rq(cpu)->lock);
> +		mm = cpu_curr(cpu)->mm;
> +		spin_unlock_irq(&cpu_rq(cpu)->lock);
> +		if (current->mm == mm)
> +			smp_call_function_single(cpu, membarrier_ipi, NULL, 1);

There is of course some possibility of interrupting a real-time task,
as the destination CPU could context-switch once we drop the ->lock.
Not a criticism, just something to keep in mind.  After all, the only ways
I can think of to avoid this possibility do so by keeping the CPU from
switching to the real-time task, which sort of defeats the purpose.  ;-)

> +	}
> +}
> +
> +static void membarrier_threads_retry(int this_cpu)
> +{
> +	struct mm_struct *mm;
> +	struct task_struct *t;
> +	struct rq *rq;
> +	int cpu;
> +
> +	list_for_each_entry_rcu(t, &current->thread_group, thread_group) {
> +		local_irq_disable();
> +		rq = __task_rq_lock(t);
> +		mm = rq->curr->mm;
> +		cpu = rq->cpu;
> +		__task_rq_unlock(rq);
> +		local_irq_enable();
> +		if (cpu == this_cpu)
> +			continue;
> +		if (current->mm == mm)
> +			smp_call_function_single(cpu, membarrier_ipi, NULL, 1);

Ditto.

> +	}
> +}
> +
> +static void membarrier_cpus(int this_cpu)
> +{
> +	int cpu, i, cpu_ipi[ADAPT_IPI_THRESHOLD], nr_cpus = 0;
> +	cpumask_var_t tmpmask;
> +	struct mm_struct *mm;
> +
> +	/* Get CPU IDs up to threshold */
> +	for_each_online_cpu(cpu) {
> +		if (unlikely(cpu == this_cpu))
> +			continue;

OK, the above "if" handles the single-threaded-process case.

The UP-kernel case is handled by the #ifdef in sys_membarrier(), though
with a bit larger code footprint than the embedded guys would probably
prefer.  (Or is the compiler smart enough to omit these function given no
calls to them?  If not, recommend putting them under CONFIG_SMP #ifdef.)

> +		spin_lock_irq(&cpu_rq(cpu)->lock);
> +		mm = cpu_curr(cpu)->mm;
> +		spin_unlock_irq(&cpu_rq(cpu)->lock);
> +		if (current->mm == mm) {
> +			if (nr_cpus == ADAPT_IPI_THRESHOLD) {
> +				nr_cpus++;
> +				break;
> +			}
> +			cpu_ipi[nr_cpus++] = cpu;
> +		}
> +	}
> +	if (likely(nr_cpus <= ADAPT_IPI_THRESHOLD)) {
> +		for (i = 0; i < nr_cpus; i++) {
> +			smp_call_function_single(cpu_ipi[i],
> +						 membarrier_ipi,
> +						 NULL, 1);
> +		}
> +	} else {
> +		if (!alloc_cpumask_var(&tmpmask, GFP_KERNEL)) {
> +			membarrier_cpus_retry(this_cpu);
> +			return;
> +		}
> +		for (i = 0; i < ADAPT_IPI_THRESHOLD; i++)
> +			cpumask_set_cpu(cpu_ipi[i], tmpmask);
> +		/* Continue previous online cpu iteration */
> +		cpumask_set_cpu(cpu, tmpmask);
> +		for (;;) {
> +			cpu = cpumask_next(cpu, cpu_online_mask);
> +			if (unlikely(cpu == this_cpu))
> +				continue;
> +			if (unlikely(cpu >= nr_cpu_ids))
> +				break;
> +			spin_lock_irq(&cpu_rq(cpu)->lock);
> +			mm = cpu_curr(cpu)->mm;
> +			spin_unlock_irq(&cpu_rq(cpu)->lock);
> +			if (current->mm == mm)
> +				cpumask_set_cpu(cpu, tmpmask);
> +		}
> +		smp_call_function_many(tmpmask, membarrier_ipi, NULL, 1);
> +		free_cpumask_var(tmpmask);
> +	}
> +}
> +
> +static void membarrier_threads(int this_cpu)
> +{
> +	int cpu, i, cpu_ipi[ADAPT_IPI_THRESHOLD], nr_cpus = 0;
> +	cpumask_var_t tmpmask;
> +	struct mm_struct *mm;
> +	struct task_struct *t;
> +	struct rq *rq;
> +
> +	/* Get CPU IDs up to threshold */
> +	list_for_each_entry_rcu(t, &current->thread_group,
> +				thread_group) {
> +		local_irq_disable();
> +		rq = __task_rq_lock(t);
> +		mm = rq->curr->mm;
> +		cpu = rq->cpu;
> +		__task_rq_unlock(rq);
> +		local_irq_enable();
> +		if (cpu == this_cpu)
> +			continue;
> +		if (current->mm == mm) {

I do not believe that the above test is gaining you anything.  It would
fail only if the task switched since the __task_rq_unlock(), but then
again, it could switch immediately after the above test just as well.

> +			if (nr_cpus == ADAPT_IPI_THRESHOLD) {
> +				nr_cpus++;
> +				break;
> +			}
> +			cpu_ipi[nr_cpus++] = cpu;
> +		}
> +	}
> +	if (likely(nr_cpus <= ADAPT_IPI_THRESHOLD)) {
> +		for (i = 0; i < nr_cpus; i++) {
> +			smp_call_function_single(cpu_ipi[i],
> +						 membarrier_ipi,
> +						 NULL, 1);
> +		}
> +	} else {
> +		if (!alloc_cpumask_var(&tmpmask, GFP_KERNEL)) {
> +			membarrier_threads_retry(this_cpu);
> +			return;
> +		}
> +		for (i = 0; i < ADAPT_IPI_THRESHOLD; i++)
> +			cpumask_set_cpu(cpu_ipi[i], tmpmask);
> +		/* Continue previous thread iteration */
> +		cpumask_set_cpu(cpu, tmpmask);
> +		list_for_each_entry_continue_rcu(t,
> +						 &current->thread_group,
> +						 thread_group) {
> +			local_irq_disable();
> +			rq = __task_rq_lock(t);
> +			mm = rq->curr->mm;
> +			cpu = rq->cpu;
> +			__task_rq_unlock(rq);
> +			local_irq_enable();
> +			if (cpu == this_cpu)
> +				continue;
> +			if (current->mm == mm)

Ditto.

> +				cpumask_set_cpu(cpu, tmpmask);
> > +		}
> +		smp_call_function_many(tmpmask, membarrier_ipi, NULL, 1);
> +		free_cpumask_var(tmpmask);
> +	}
> +}
> +
> +/*
> + * sys_membarrier - issue memory barrier on current process running threads
> + *
> + * Execute a memory barrier on all running threads of the current process.
> + * Upon completion, the caller thread is ensured that all process threads
> + * have passed through a state where memory accesses match program order.
> + * (non-running threads are de facto in such a state)
> + *
> + * We do not use mm_cpumask because there is no guarantee that each architecture
> + * switch_mm issues a smp_mb() before and after mm_cpumask modification upon
> + * scheduling change. Furthermore, leave_mm is also modifying the mm_cpumask (at
> + * least on x86) from the TLB flush IPI handler. So rather than playing tricky
> + * games with lazy TLB flush, let's simply iterate on online cpus/thread group,
> + * whichever is the smallest.
> + */
> +SYSCALL_DEFINE0(membarrier)
> +{
> +#ifdef CONFIG_SMP
> +	int this_cpu;
> +
> +	if (unlikely(thread_group_empty(current)))
> +		return 0;
> +
> +	rcu_read_lock();	/* protect cpu_curr(cpu)-> and rcu list */
> +	preempt_disable();

Hmmm...  You are going to hate me for pointing this out, Mathieu, but
holding preempt_disable() across the whole sys_membarrier() processing
might be hurting real-time latency more than would unconditionally
IPIing all the CPUs.  :-/

That said, we have no shortage of situations where we scan the CPUs with
preemption disabled, and with interrupts disabled, for that matter.

> +	/*
> +	 * Memory barrier on the caller thread _before_ sending first IPI.
> +	 */
> +	smp_mb();
> +	/*
> +	 * We don't need to include ourself in IPI, as we already
> +	 * surround our execution with memory barriers.
> +	 */
> +	this_cpu = smp_processor_id();
> +	/* Approximate which is fastest: CPU or thread group iteration ? */
> +	if (num_online_cpus() <= atomic_read(&current->mm->mm_users))
> +		membarrier_cpus(this_cpu);
> +	else
> +		membarrier_threads(this_cpu);
> +	/*
> +	 * Memory barrier on the caller thread _after_ we finished
> +	 * waiting for the last IPI.
> +	 */
> +	smp_mb();
> +	preempt_enable();
> +	rcu_read_unlock();
> +#endif	/* #ifdef CONFIG_SMP */
> +	return 0;
> +}
> +
>  #ifndef CONFIG_SMP
> 
>  int rcu_expedited_torture_stats(char *page)
> -- 
> Mathieu Desnoyers
> OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier (v3a)
  2010-01-11 22:20                                                                 ` Peter Zijlstra
@ 2010-01-11 22:48                                                                   ` Paul E. McKenney
  2010-01-11 22:48                                                                   ` Mathieu Desnoyers
  1 sibling, 0 replies; 107+ messages in thread
From: Paul E. McKenney @ 2010-01-11 22:48 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mathieu Desnoyers, Steven Rostedt, Oleg Nesterov, linux-kernel,
	Ingo Molnar, akpm, josh, tglx, Valdis.Kletnieks, dhowells, laijs,
	dipankar, David S. Miller

On Mon, Jan 11, 2010 at 11:20:16PM +0100, Peter Zijlstra wrote:
> On Mon, 2010-01-11 at 17:04 -0500, Mathieu Desnoyers wrote:
> > * Peter Zijlstra (peterz@infradead.org) wrote:
> > > On Mon, 2010-01-11 at 15:52 -0500, Mathieu Desnoyers wrote:
> > > > 
> > > > So the clear bit can occur far, far away in the future, we don't care.
> > > > We'll just send extra IPIs when unneeded in this time-frame.
> > > 
> > > I think we should try harder not to disturb CPUs, particularly in the
> > > face of RT tasks and DoS scenarios. Therefore I don't think we should
> > > just wildly send to mm_cpumask(), but verify (although speculatively)
> > > that the remote tasks' mm matches ours.
> > 
> > Well, my point of view is that if IPI TLB shootdown does not care about
> > disturbing CPUs running other processes in the time window of the lazy
> > removal, why should we ?
> 
> while (1)
>  sys_membarrier();
> 
> is a very good reason, TLB shootdown doesn't have that problem.

You can get a similar effect by doing mmap() to a fixed virtual address
in a tight loop, right?  Of course, mmap() has quite a bit more overhead
than sys_membarrier(), so the resulting IPIs probably won't hit the
other CPUs quite as hard, but it will hit them repeatedly.
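
For illustration only, a userland sketch of that effect (the fixed address
and 4 KiB size are arbitrary, and it assumes other threads of the same
process are busy on other CPUs, since those are the only CPUs the TLB
shootdown will target):

#include <sys/mman.h>

int main(void)
{
	void *fixed = (void *)0x700000000000UL;	/* arbitrary fixed address */

	for (;;) {
		void *p = mmap(fixed, 4096, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
		if (p == MAP_FAILED)
			return 1;
		/* Populate the PTE so the next MAP_FIXED mmap() must flush TLBs. */
		*(volatile char *)p = 1;
	}
}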

> >  We're adding an overhead very close to that of
> > an unrequired IPI shootdown which returns immediately without doing
> > anything.
> 
> Except we don't clear the mask.
> 
> > The tradeoff here seems to be:
> > - more overhead within switch_mm() for more precise mm_cpumask.
> > vs
> > - lazy removal of the cpumask, which implies that some processors
> >   running a different process can receive the IPI for nothing.
> > 
> > I really doubt we could create an IPI DoS based on such a small
> > time window.
> 
> What small window? When there's less runnable tasks than available mm
> contexts some architectures can go quite a long while without
> invalidating TLBs.
> 
> So what again is wrong with:
> 
>  int cpu, this_cpu = get_cpu();
> 
>  smp_mb(); 
> 
>  for_each_cpu(cpu, mm_cpumask(current->mm)) {
>    if (cpu == this_cpu)
>      continue;
>    if (cpu_curr(cpu)->mm != current->mm)
>      continue;
>    smp_send_call_function_single(cpu, do_mb, NULL, 1);
>  }
> 
>  put_cpu();
> 
> ?

Well, if you have lots of CPUs, you will have disabled preemption for
quite some time.  Not that there aren't already numerous similar
problems throughout the Linux kernel...

							Thanx, Paul

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier (v3a)
  2010-01-11 22:20                                                                 ` Peter Zijlstra
  2010-01-11 22:48                                                                   ` Paul E. McKenney
@ 2010-01-11 22:48                                                                   ` Mathieu Desnoyers
  1 sibling, 0 replies; 107+ messages in thread
From: Mathieu Desnoyers @ 2010-01-11 22:48 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Paul E. McKenney, Steven Rostedt, Oleg Nesterov, linux-kernel,
	Ingo Molnar, akpm, josh, tglx, Valdis.Kletnieks, dhowells, laijs,
	dipankar, David S. Miller

* Peter Zijlstra (peterz@infradead.org) wrote:
> On Mon, 2010-01-11 at 17:04 -0500, Mathieu Desnoyers wrote:
> > * Peter Zijlstra (peterz@infradead.org) wrote:
> > > On Mon, 2010-01-11 at 15:52 -0500, Mathieu Desnoyers wrote:
> > > > 
> > > > So the clear bit can occur far, far away in the future, we don't care.
> > > > We'll just send extra IPIs when unneeded in this time-frame.
> > > 
> > > I think we should try harder not to disturb CPUs, particularly in the
> > > face of RT tasks and DoS scenarios. Therefore I don't think we should
> > > just wildly send to mm_cpumask(), but verify (although speculatively)
> > > that the remote tasks' mm matches ours.
> > > 
> > 
> > Well, my point of view is that if IPI TLB shootdown does not care about
> > disturbing CPUs running other processes in the time window of the lazy
> > removal, why should we ?
> 
> while (1)
>  sys_membarrier();
> 
> is a very good reason, TLB shootdown doesn't have that problem.
> 
> >  We're adding an overhead very close to that of
> > an unrequired IPI shootdown which returns immediately without doing
> > anything.
> 
> Except we don't clear the mask.
> 

Good point. And I'm not so confident that clearing it ourselves would be
safe at all.

> > The tradeoff here seems to be:
> > - more overhead within switch_mm() for more precise mm_cpumask.
> > vs
> > - lazy removal of the cpumask, which implies that some processors
> >   running a different process can receive the IPI for nothing.
> > 
> > I really doubt we could create an IPI DoS based on such a small
> > time window.
> 
> What small window? When there's less runnable tasks than available mm
> contexts some architectures can go quite a long while without
> invalidating TLBs.

OK.

> 
> So what again is wrong with:
> 
>  int cpu, this_cpu = get_cpu();
> 
>  smp_mb(); 
> 
>  for_each_cpu(cpu, mm_cpumask(current->mm)) {
>    if (cpu == this_cpu)
>      continue;
>    if (cpu_curr(cpu)->mm != current->mm)
>      continue;
>    smp_send_call_function_single(cpu, do_mb, NULL, 1);
>  }
> 
>  put_cpu();
> 
> ?
> 

Almost. Missing smp_mb() at the end. We also have to specify that the
smp_mb() we plan to require in switch_mm() should now surround:

- clear mask
- set mask
- ->mm update

Or, for a simpler way to protect ->mm read, we can go with the runqueue
spinlock.

Also, I'd like to use a send-to-many IPI rather than sending single IPIs to
CPUs one by one, because the former scales much better on architectures
supporting IPI broadcast. This, however, implies allocating a temporary
cpumask; see the sketch below.
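
To make that concrete, here is a minimal, hypothetical sketch (not the
actual patch) of what the combined suggestions could look like: your loop,
switched to building a cpumask for a single IPI-many, with the trailing
smp_mb() added. The speculative cpu_curr(cpu)->mm read relies on the
switch_mm() barriers discussed above, and the out-of-memory fallback to
single IPIs is elided for brevity.

static void membarrier_ipi(void *unused)
{
	smp_mb();	/* execute the barrier on the remote CPU */
}

static void membarrier_sketch(void)
{
	cpumask_var_t tmpmask;
	int cpu, this_cpu;

	/* Allocate before disabling preemption: GFP_KERNEL may sleep. */
	if (!zalloc_cpumask_var(&tmpmask, GFP_KERNEL))
		return;	/* real code would fall back to single IPIs */

	this_cpu = get_cpu();

	smp_mb();	/* order the caller's prior accesses before the IPIs */

	for_each_cpu(cpu, mm_cpumask(current->mm)) {
		if (cpu == this_cpu)
			continue;
		/* Speculative check; relies on switch_mm() barriers. */
		if (cpu_curr(cpu)->mm != current->mm)
			continue;
		cpumask_set_cpu(cpu, tmpmask);
	}
	smp_call_function_many(tmpmask, membarrier_ipi, NULL, 1);

	smp_mb();	/* the missing barrier, after the last IPI completed */

	put_cpu();
	free_cpumask_var(tmpmask);
}

The real implementation would of course still need the single-IPI fallback
when the cpumask allocation fails, as in the v3 patches.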

Thanks,

Mathieu

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier (v3b)
  2010-01-11 22:43                                                         ` Paul E. McKenney
@ 2010-01-12 15:38                                                           ` Mathieu Desnoyers
  2010-01-12 16:27                                                             ` Steven Rostedt
  2010-01-12 18:12                                                             ` Paul E. McKenney
  0 siblings, 2 replies; 107+ messages in thread
From: Mathieu Desnoyers @ 2010-01-12 15:38 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Steven Rostedt, Oleg Nesterov, Peter Zijlstra, linux-kernel,
	Ingo Molnar, akpm, josh, tglx, Valdis.Kletnieks, dhowells, laijs,
	dipankar

* Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> On Sun, Jan 10, 2010 at 11:30:16PM -0500, Mathieu Desnoyers wrote:
> > Here is an implementation of a new system call, sys_membarrier(), which
> > executes a memory barrier on all threads of the current process.
> > 
> > It aims at greatly simplifying and enhancing the current signal-based
> > liburcu userspace RCU synchronize_rcu() implementation.
> > (found at http://lttng.org/urcu)
> 
> I didn't expect quite this comprehensive of an implementation from the
> outset, but I guess I cannot complain.  ;-)
> 
> Overall, good stuff.
> 
> Interestingly enough, what you have implemented is analogous to
> synchronize_rcu_expedited() and friends that have recently been added
> to the in-kernel RCU API.  By this analogy, my earlier semi-suggestion
> of synchronize_rcu() would be a candidate non-expedited implementation.
> Long latency, but extremely low CPU consumption, full batching of
> concurrent requests (even unrelated ones), and so on.

Yes, the main difference, I think, is that the sys_membarrier
infrastructure focuses on IPI-ing only the running threads of the current
process.

> 
> A few questions interspersed below.
> 
> > Changelog since v1:
> > 
> > - Only perform the IPI in CONFIG_SMP.
> > - Only perform the IPI if the process has more than one thread.
> > - Only send IPIs to CPUs involved with threads belonging to our process.
> > - Adaptative IPI scheme (single vs many IPI with threshold).
> > - Issue smp_mb() at the beginning and end of the system call.
> > 
> > Changelog since v2:
> > 
> > - Iteration on min(num_online_cpus(), nr threads in the process),
> >   taking runqueue spinlocks, allocating a cpumask, ipi to many to the
> >   cpumask. Does not allocate the cpumask if only a single IPI is needed.
> > 
> > 
> > Both the signal-based and the sys_membarrier userspace RCU schemes
> > permit us to remove the memory barrier from the userspace RCU
> > rcu_read_lock() and rcu_read_unlock() primitives, thus significantly
> > accelerating them. These memory barriers are replaced by compiler
> > barriers on the read-side, and all matching memory barriers on the 
> > write-side are turned into an invokation of a memory barrier on all
> > active threads in the process. By letting the kernel perform this
> > synchronization rather than dumbly sending a signal to every process
> > threads (as we currently do), we diminish the number of unnecessary wake
> > ups and only issue the memory barriers on active threads. Non-running
> > threads do not need to execute such barrier anyway, because these are
> > implied by the scheduler context switches.
> > 
> > To explain the benefit of this scheme, let's introduce two example threads:
> > 
> > Thread A (non-frequent, e.g. executing liburcu synchronize_rcu())
> > Thread B (frequent, e.g. executing liburcu rcu_read_lock()/rcu_read_unlock())
> > 
> > In a scheme where all smp_mb() in thread A synchronize_rcu() are
> > ordering memory accesses with respect to smp_mb() present in 
> > rcu_read_lock/unlock(), we can change all smp_mb() from
> > synchronize_rcu() into calls to sys_membarrier() and all smp_mb() from
> > rcu_read_lock/unlock() into compiler barriers "barrier()".
> > 
> > Before the change, we had, for each smp_mb() pairs:
> > 
> > Thread A                    Thread B
> > prev mem accesses           prev mem accesses
> > smp_mb()                    smp_mb()
> > follow mem accesses         follow mem accesses
> > 
> > After the change, these pairs become:
> > 
> > Thread A                    Thread B
> > prev mem accesses           prev mem accesses
> > sys_membarrier()            barrier()
> > follow mem accesses         follow mem accesses
> > 
> > As we can see, there are two possible scenarios: either Thread B memory
> > accesses do not happen concurrently with Thread A accesses (1), or they
> > do (2).
> > 
> > 1) Non-concurrent Thread A vs Thread B accesses:
> > 
> > Thread A                    Thread B
> > prev mem accesses
> > sys_membarrier()
> > follow mem accesses
> >                             prev mem accesses
> >                             barrier()
> >                             follow mem accesses
> > 
> > In this case, thread B accesses will be weakly ordered. This is OK,
> > because at that point, thread A is not particularly interested in
> > ordering them with respect to its own accesses.
> > 
> > 2) Concurrent Thread A vs Thread B accesses
> > 
> > Thread A                    Thread B
> > prev mem accesses           prev mem accesses
> > sys_membarrier()            barrier()
> > follow mem accesses         follow mem accesses
> > 
> > In this case, thread B accesses, which are ensured to be in program
> > order thanks to the compiler barrier, will be "upgraded" to full
> > smp_mb() thanks to the IPIs executing memory barriers on each active
> > system threads. Each non-running process threads are intrinsically
> > serialized by the scheduler.
> > 
> > Just tried with a cache-hot kernel compilation using 6/8 CPUs.
> > 
> > Normally:                                              real 2m41.852s
> > With the sys_membarrier+1 busy-looping thread running: real 5m41.830s
> > 
> > So... 2x slower. That hurts.
> > 
> > So let's try allocating a cpu mask for PeterZ scheme. I prefer to have a
> > small allocation overhead and benefit from cpumask broadcast if
> > possible so we scale better. But that all depends on how big the
> > allocation overhead is.
> > 
> > Impact of allocating a cpumask (time for 10,000,000 sys_membarrier
> > calls, one thread is doing the sys_membarrier, the others are busy
> > looping)).  Given that it costs almost half as much to perform the
> > cpumask allocation than to send a single IPI, as we iterate on the CPUs
> > until we find more than N match or iterated on all cpus.  If we only have
> > N match or less, we send single IPIs. If we need more than that, then we
> > switch to the cpumask allocation and send a broadcast IPI to the cpumask
> > we construct for the matching CPUs. Let's call it the "adaptative IPI
> > scheme".
> > 
> > For my Intel Xeon E5405
> > 
> > *This is calibration only, not taking the runqueue locks*
> > 
> > Just doing local mb()+single IPI to T other threads:
> > 
> > T=1: 0m18.801s
> > T=2: 0m29.086s
> > T=3: 0m46.841s
> > T=4: 0m53.758s
> > T=5: 1m10.856s
> > T=6: 1m21.142s
> > T=7: 1m38.362s
> > 
> > Just doing cpumask alloc+IPI-many to T other threads:
> > 
> > T=1: 0m21.778s
> > T=2: 0m22.741s
> > T=3: 0m22.185s
> > T=4: 0m24.660s
> > T=5: 0m26.855s
> > T=6: 0m30.841s
> > T=7: 0m29.551s
> > 
> > So I think the right threshold should be 1 thread (assuming other
> > architecture will behave like mine). So starting with 2 threads, we
> > allocate the cpumask before sending IPIs.
> > 
> > *end of calibration*
> > 
> > Resulting adaptative scheme, with runqueue locks:
> > 
> > T=1: 0m20.990s
> > T=2: 0m22.588s
> > T=3: 0m27.028s
> > T=4: 0m29.027s
> > T=5: 0m32.592s
> > T=6: 0m36.556s
> > T=7: 0m33.093s
> > 
> > The expected top pattern, when using 1 CPU for a thread doing sys_membarrier()
> > in a loop and other threads busy-waiting in user-space on a variable shows that
> > the thread doing sys_membarrier is doing mostly system calls, and other threads
> > are mostly running in user-space. Side-note, in this test, it's important to
> > check that individual threads are not always fully at 100% user-space time (they
> > range between ~95% and 100%), because when some thread in the test is always at
> > 100% on the same CPU, this means it does not get the IPI at all. (I actually
> > found out about a bug in my own code while developing it with this test.)
> 
> The below data is for how many threads in the process?

8 threads: one doing sys_membarrier() in a loop, 7 others waiting on a
variable.

> Also, is "top"
> accurate given that the IPI handler will have interrupts disabled?

Probably not, AFAIK: "top" does not really take interrupts into account in
its accounting. So, better take this top output with a grain of salt or two.

> 
> > Cpu0  :100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
> > Cpu1  : 99.7%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.3%hi,  0.0%si,  0.0%st
> > Cpu2  : 99.3%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.7%hi,  0.0%si,  0.0%st
> > Cpu3  :100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
> > Cpu4  :100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
> > Cpu5  : 96.0%us,  1.3%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  2.6%si,  0.0%st
> > Cpu6  :  1.3%us, 98.7%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
> > Cpu7  : 96.1%us,  3.3%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.3%hi,  0.3%si,  0.0%st
> > 
> > The system call number is only assigned for x86_64 in this RFC patch.
> > 
> > Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
> > CC: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> > CC: mingo@elte.hu
> > CC: laijs@cn.fujitsu.com
> > CC: dipankar@in.ibm.com
> > CC: akpm@linux-foundation.org
> > CC: josh@joshtriplett.org
> > CC: dvhltc@us.ibm.com
> > CC: niv@us.ibm.com
> > CC: tglx@linutronix.de
> > CC: peterz@infradead.org
> > CC: rostedt@goodmis.org
> > CC: Valdis.Kletnieks@vt.edu
> > CC: dhowells@redhat.com
> > ---
> >  arch/x86/include/asm/unistd_64.h |    2 
> >  kernel/sched.c                   |  219 +++++++++++++++++++++++++++++++++++++++
> >  2 files changed, 221 insertions(+)
> > 
> > Index: linux-2.6-lttng/arch/x86/include/asm/unistd_64.h
> > ===================================================================
> > --- linux-2.6-lttng.orig/arch/x86/include/asm/unistd_64.h	2010-01-10 22:23:59.000000000 -0500
> > +++ linux-2.6-lttng/arch/x86/include/asm/unistd_64.h	2010-01-10 22:29:30.000000000 -0500
> > @@ -661,6 +661,8 @@ __SYSCALL(__NR_pwritev, sys_pwritev)
> >  __SYSCALL(__NR_rt_tgsigqueueinfo, sys_rt_tgsigqueueinfo)
> >  #define __NR_perf_event_open			298
> >  __SYSCALL(__NR_perf_event_open, sys_perf_event_open)
> > +#define __NR_membarrier				299
> > +__SYSCALL(__NR_membarrier, sys_membarrier)
> > 
> >  #ifndef __NO_STUBS
> >  #define __ARCH_WANT_OLD_READDIR
> > Index: linux-2.6-lttng/kernel/sched.c
> > ===================================================================
> > --- linux-2.6-lttng.orig/kernel/sched.c	2010-01-10 22:23:59.000000000 -0500
> > +++ linux-2.6-lttng/kernel/sched.c	2010-01-10 23:12:35.000000000 -0500
> > @@ -119,6 +119,11 @@
> >   */
> >  #define RUNTIME_INF	((u64)~0ULL)
> > 
> > +/*
> > + * IPI vs cpumask broadcast threshold. Threshold of 1 IPI.
> > + */
> > +#define ADAPT_IPI_THRESHOLD	1
> > +
> >  static inline int rt_policy(int policy)
> >  {
> >  	if (unlikely(policy == SCHED_FIFO || policy == SCHED_RR))
> > @@ -10822,6 +10827,220 @@ struct cgroup_subsys cpuacct_subsys = {
> >  };
> >  #endif	/* CONFIG_CGROUP_CPUACCT */
> > 
> > +/*
> > + * Execute a memory barrier on all CPUs on SMP systems.
> > + * Do not rely on implicit barriers in smp_call_function(), just in case they
> > + * are ever relaxed in the future.
> > + */
> > +static void membarrier_ipi(void *unused)
> > +{
> > +	smp_mb();
> > +}
> > +
> > +/*
> > + * Handle out-of-mem by sending per-cpu IPIs instead.
> > + */
> 
> Good handling for out-of-memory errors!
> 
> > +static void membarrier_cpus_retry(int this_cpu)
> > +{
> > +	struct mm_struct *mm;
> > +	int cpu;
> > +
> > +	for_each_online_cpu(cpu) {
> > +		if (unlikely(cpu == this_cpu))
> > +			continue;
> > +		spin_lock_irq(&cpu_rq(cpu)->lock);
> > +		mm = cpu_curr(cpu)->mm;
> > +		spin_unlock_irq(&cpu_rq(cpu)->lock);
> > +		if (current->mm == mm)
> > +			smp_call_function_single(cpu, membarrier_ipi, NULL, 1);
> 
> There is of course some possibility of interrupting a real-time task,
> as the destination CPU could context-switch once we drop the ->lock.
> Not a criticism, just something to keep in mind.  After all, the only ways
> I can think of to avoid this possibility do so by keeping the CPU from
> switching to the real-time task, which sort of defeats the purpose.  ;-)

Absolutely. And it's of no use to add a check within the IPI handler to
verify if it was indeed needed, because all we would skip is a simple
smp_mb(), which is relatively minor in terms of overhead compared to the
IPI itself.

> 
> > +	}
> > +}
> > +
> > +static void membarrier_threads_retry(int this_cpu)
> > +{
> > +	struct mm_struct *mm;
> > +	struct task_struct *t;
> > +	struct rq *rq;
> > +	int cpu;
> > +
> > +	list_for_each_entry_rcu(t, &current->thread_group, thread_group) {
> > +		local_irq_disable();
> > +		rq = __task_rq_lock(t);
> > +		mm = rq->curr->mm;
> > +		cpu = rq->cpu;
> > +		__task_rq_unlock(rq);
> > +		local_irq_enable();
> > +		if (cpu == this_cpu)
> > +			continue;
> > +		if (current->mm == mm)
> > +			smp_call_function_single(cpu, membarrier_ipi, NULL, 1);
> 
> Ditto.
> 
> > +	}
> > +}
> > +
> > +static void membarrier_cpus(int this_cpu)
> > +{
> > +	int cpu, i, cpu_ipi[ADAPT_IPI_THRESHOLD], nr_cpus = 0;
> > +	cpumask_var_t tmpmask;
> > +	struct mm_struct *mm;
> > +
> > +	/* Get CPU IDs up to threshold */
> > +	for_each_online_cpu(cpu) {
> > +		if (unlikely(cpu == this_cpu))
> > +			continue;
> 
> OK, the above "if" handles the single-threaded-process case.
> 

No. See

 +   if (unlikely(thread_group_empty(current)))
 +           return 0;

in the caller below. The "if" you point to here simply ensures that we
don't make a superfluous function call for the current CPU. It's
probably not really worth it for a slow path though.

> The UP-kernel case is handled by the #ifdef in sys_membarrier(), though
> with a bit larger code footprint than the embedded guys would probably
> prefer.  (Or is the compiler smart enough to omit these function given no
> calls to them?  If not, recommend putting them under CONFIG_SMP #ifdef.)

Hrm, that's a bit odd. I agree that UP systems could simply return
-ENOSYS for sys_membarrier, but then I wonder how userland could
distinguish between:

- an old kernel not supporting sys_membarrier()
  -> in this case we need to use the smp_mb() fallback on the read-side
     and in synchronize_rcu().
- a recent kernel supporting sys_membarrier(), CONFIG_SMP
  -> can use the barrier() on read-side, call sys_membarrier upon
     update.
- a recent kernel supporting sys_membarrier, !CONFIG_SMP
  -> calls to sys_membarrier() are not required, nor is barrier().

Or maybe we should just postpone the userland smp_mb() question to another
thread; it will eventually need to be addressed anyway, maybe with a
vgetmaxcpu() vsyscall.
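
For what it's worth, here is a rough sketch of how userland could make the
distinction at library init time, assuming the !CONFIG_SMP kernel keeps the
syscall wired up and returns 0. The helper names below are made up for
illustration; this is not actual liburcu code.

#include <unistd.h>
#include <sys/syscall.h>

#ifndef __NR_membarrier
#define __NR_membarrier 299	/* x86_64 number proposed in this RFC */
#endif

static int has_sys_membarrier;	/* 1: barrier() read-side + syscall on update */

static void membarrier_detect(void)
{
	if (syscall(__NR_membarrier) == 0) {
		has_sys_membarrier = 1;	/* kernel provides it (SMP or UP) */
	} else {
		/* typically errno == ENOSYS on an older kernel */
		has_sys_membarrier = 0;	/* fall back to full barriers */
	}
}

static void urcu_mb_master(void)
{
	if (has_sys_membarrier)
		(void) syscall(__NR_membarrier);
	else
		__sync_synchronize();	/* fallback full memory barrier */
}

With the syscall returning 0 on UP kernels, the SMP and UP cases collapse:
the extra syscall on UP is useless but harmless, which is why returning 0
rather than -ENOSYS there looks preferable.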

> 
> > +		spin_lock_irq(&cpu_rq(cpu)->lock);
> > +		mm = cpu_curr(cpu)->mm;
> > +		spin_unlock_irq(&cpu_rq(cpu)->lock);
> > +		if (current->mm == mm) {
> > +			if (nr_cpus == ADAPT_IPI_THRESHOLD) {
> > +				nr_cpus++;
> > +				break;
> > +			}
> > +			cpu_ipi[nr_cpus++] = cpu;
> > +		}
> > +	}
> > +	if (likely(nr_cpus <= ADAPT_IPI_THRESHOLD)) {
> > +		for (i = 0; i < nr_cpus; i++) {
> > +			smp_call_function_single(cpu_ipi[i],
> > +						 membarrier_ipi,
> > +						 NULL, 1);
> > +		}
> > +	} else {
> > +		if (!alloc_cpumask_var(&tmpmask, GFP_KERNEL)) {
> > +			membarrier_cpus_retry(this_cpu);
> > +			return;
> > +		}
> > +		for (i = 0; i < ADAPT_IPI_THRESHOLD; i++)
> > +			cpumask_set_cpu(cpu_ipi[i], tmpmask);
> > +		/* Continue previous online cpu iteration */
> > +		cpumask_set_cpu(cpu, tmpmask);
> > +		for (;;) {
> > +			cpu = cpumask_next(cpu, cpu_online_mask);
> > +			if (unlikely(cpu == this_cpu))
> > +				continue;
> > +			if (unlikely(cpu >= nr_cpu_ids))
> > +				break;
> > +			spin_lock_irq(&cpu_rq(cpu)->lock);
> > +			mm = cpu_curr(cpu)->mm;
> > +			spin_unlock_irq(&cpu_rq(cpu)->lock);
> > +			if (current->mm == mm)
> > +				cpumask_set_cpu(cpu, tmpmask);
> > +		}
> > +		smp_call_function_many(tmpmask, membarrier_ipi, NULL, 1);
> > +		free_cpumask_var(tmpmask);
> > +	}
> > +}
> > +
> > +static void membarrier_threads(int this_cpu)
> > +{
> > +	int cpu, i, cpu_ipi[ADAPT_IPI_THRESHOLD], nr_cpus = 0;
> > +	cpumask_var_t tmpmask;
> > +	struct mm_struct *mm;
> > +	struct task_struct *t;
> > +	struct rq *rq;
> > +
> > +	/* Get CPU IDs up to threshold */
> > +	list_for_each_entry_rcu(t, &current->thread_group,
> > +				thread_group) {
> > +		local_irq_disable();
> > +		rq = __task_rq_lock(t);
> > +		mm = rq->curr->mm;
> > +		cpu = rq->cpu;
> > +		__task_rq_unlock(rq);
> > +		local_irq_enable();
> > +		if (cpu == this_cpu)
> > +			continue;
> > +		if (current->mm == mm) {
> 
> I do not believe that the above test is gaining you anything.  It would
> fail only if the task switched since the __task_rq_unlock(), but then
> again, it could switch immediately after the above test just as well.

OK. Anyway, I think I'll go with the shorter implementation using
mm_cpumask, and add an additional ->mm check under the runqueue spinlocks
(sketched below).
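
For reference, a hypothetical helper illustrating that direction (the name
is illustrative only): the remote CPU's ->mm is read under its runqueue
lock, so the check no longer depends on switch_mm() barrier placement.

static bool cpu_runs_our_mm(int cpu)
{
	struct mm_struct *mm;

	/* Read the remote CPU's current ->mm under its runqueue lock. */
	spin_lock_irq(&cpu_rq(cpu)->lock);
	mm = cpu_curr(cpu)->mm;
	spin_unlock_irq(&cpu_rq(cpu)->lock);

	return mm == current->mm;
}

The mm_cpumask(current->mm) iteration would then only set a CPU in the IPI
mask when this check passes.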

> 
> > +			if (nr_cpus == ADAPT_IPI_THRESHOLD) {
> > +				nr_cpus++;
> > +				break;
> > +			}
> > +			cpu_ipi[nr_cpus++] = cpu;
> > +		}
> > +	}
> > +	if (likely(nr_cpus <= ADAPT_IPI_THRESHOLD)) {
> > +		for (i = 0; i < nr_cpus; i++) {
> > +			smp_call_function_single(cpu_ipi[i],
> > +						 membarrier_ipi,
> > +						 NULL, 1);
> > +		}
> > +	} else {
> > +		if (!alloc_cpumask_var(&tmpmask, GFP_KERNEL)) {
> > +			membarrier_threads_retry(this_cpu);
> > +			return;
> > +		}
> > +		for (i = 0; i < ADAPT_IPI_THRESHOLD; i++)
> > +			cpumask_set_cpu(cpu_ipi[i], tmpmask);
> > +		/* Continue previous thread iteration */
> > +		cpumask_set_cpu(cpu, tmpmask);
> > +		list_for_each_entry_continue_rcu(t,
> > +						 &current->thread_group,
> > +						 thread_group) {
> > +			local_irq_disable();
> > +			rq = __task_rq_lock(t);
> > +			mm = rq->curr->mm;
> > +			cpu = rq->cpu;
> > +			__task_rq_unlock(rq);
> > +			local_irq_enable();
> > +			if (cpu == this_cpu)
> > +				continue;
> > +			if (current->mm == mm)
> 
> Ditto.
> 
> > +				cpumask_set_cpu(cpu, tmpmask);
> > > +		}
> > +		smp_call_function_many(tmpmask, membarrier_ipi, NULL, 1);
> > +		free_cpumask_var(tmpmask);
> > +	}
> > +}
> > +
> > +/*
> > + * sys_membarrier - issue memory barrier on current process running threads
> > + *
> > + * Execute a memory barrier on all running threads of the current process.
> > + * Upon completion, the caller thread is ensured that all process threads
> > + * have passed through a state where memory accesses match program order.
> > + * (non-running threads are de facto in such a state)
> > + *
> > + * We do not use mm_cpumask because there is no guarantee that each architecture
> > + * switch_mm issues a smp_mb() before and after mm_cpumask modification upon
> > + * scheduling change. Furthermore, leave_mm is also modifying the mm_cpumask (at
> > + * least on x86) from the TLB flush IPI handler. So rather than playing tricky
> > + * games with lazy TLB flush, let's simply iterate on online cpus/thread group,
> > + * whichever is the smallest.
> > + */
> > +SYSCALL_DEFINE0(membarrier)
> > +{
> > +#ifdef CONFIG_SMP
> > +	int this_cpu;
> > +
> > +	if (unlikely(thread_group_empty(current)))
> > +		return 0;
> > +
> > +	rcu_read_lock();	/* protect cpu_curr(cpu)-> and rcu list */
> > +	preempt_disable();
> 
> Hmmm...  You are going to hate me for pointing this out, Mathieu, but
> holding preempt_disable() across the whole sys_membarrier() processing
> might be hurting real-time latency more than would unconditionally
> IPIing all the CPUs.  :-/

Hehe, I pointed this out myself a few emails ago :) This is why I
started by using raw_smp_processor_id(). Well, let's make it simple
first, and then we can improve if needed.

> 
> That said, we have no shortage of situations where we scan the CPUs with
> preemption disabled, and with interrupts disabled, for that matter.

Yep.

Thanks,

Mathieu

> 
> > +	/*
> > +	 * Memory barrier on the caller thread _before_ sending first IPI.
> > +	 */
> > +	smp_mb();
> > +	/*
> > +	 * We don't need to include ourself in IPI, as we already
> > +	 * surround our execution with memory barriers.
> > +	 */
> > +	this_cpu = smp_processor_id();
> > +	/* Approximate which is fastest: CPU or thread group iteration ? */
> > +	if (num_online_cpus() <= atomic_read(&current->mm->mm_users))
> > +		membarrier_cpus(this_cpu);
> > +	else
> > +		membarrier_threads(this_cpu);
> > +	/*
> > +	 * Memory barrier on the caller thread _after_ we finished
> > +	 * waiting for the last IPI.
> > +	 */
> > +	smp_mb();
> > +	preempt_enable();
> > +	rcu_read_unlock();
> > +#endif	/* #ifdef CONFIG_SMP */
> > +	return 0;
> > +}
> > +
> >  #ifndef CONFIG_SMP
> > 
> >  int rcu_expedited_torture_stats(char *page)
> > -- 
> > Mathieu Desnoyers
> > OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier (v3b)
  2010-01-12 15:38                                                           ` Mathieu Desnoyers
@ 2010-01-12 16:27                                                             ` Steven Rostedt
  2010-01-12 16:38                                                               ` Mathieu Desnoyers
  2010-01-12 16:54                                                               ` Paul E. McKenney
  2010-01-12 18:12                                                             ` Paul E. McKenney
  1 sibling, 2 replies; 107+ messages in thread
From: Steven Rostedt @ 2010-01-12 16:27 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Paul E. McKenney, Oleg Nesterov, Peter Zijlstra, linux-kernel,
	Ingo Molnar, akpm, josh, tglx, Valdis.Kletnieks, dhowells, laijs,
	dipankar

On Tue, 2010-01-12 at 10:38 -0500, Mathieu Desnoyers wrote:

> > The UP-kernel case is handled by the #ifdef in sys_membarrier(), though
> > with a bit larger code footprint than the embedded guys would probably
> > prefer.  (Or is the compiler smart enough to omit these function given no
> > calls to them?  If not, recommend putting them under CONFIG_SMP #ifdef.)
> 
> Hrm, that's a bit odd. I agree that UP systems could simply return
> -ENOSYS for sys_membarrier, but then I wonder how userland could
> distinguish between:
> 
> - an old kernel not supporting sys_membarrier()
>   -> in this case we need to use the smp_mb() fallback on the read-side
>      and in synchronize_rcu().
> - a recent kernel supporting sys_membarrier(), CONFIG_SMP
>   -> can use the barrier() on read-side, call sys_membarrier upon
>      update.
> - a recent kernel supporting sys_membarrier, !CONFIG_SMP
>   -> calls to sys_membarrier() are not required, nor is barrier().
> 
> Or maybe we just postpone the userland smp_mb() question to another
> thread. This will eventually need to be addressed anyway. Maybe with a
> vgetmaxcpu() vsyscall.

I think Paul means to wrap all your other functions under the #ifdef.
What you have for sys_membarrier() is fine (just return 0 on UP), but you
also need to wrap the helper functions above it under #ifdef CONFIG_SMP.
Don't rely on the compiler to optimize them out; if anything, you'll
probably get a bunch of warnings about unused static functions.
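
Schematically, something like this (helper names as in the v3b patch,
bodies elided):

#ifdef CONFIG_SMP
static void membarrier_ipi(void *unused)
{
	smp_mb();
}

static void membarrier_cpus(int this_cpu)
{
	/* ... IPI logic as in the patch ... */
}
/* likewise for the other helpers */
#endif /* CONFIG_SMP */

SYSCALL_DEFINE0(membarrier)
{
#ifdef CONFIG_SMP
	/* ... as in the patch: iterate and IPI ... */
#endif
	return 0;	/* UP: nothing to order across CPUs */
}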

-- Steve



^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier (v3b)
  2010-01-12 16:27                                                             ` Steven Rostedt
@ 2010-01-12 16:38                                                               ` Mathieu Desnoyers
  2010-01-12 16:54                                                               ` Paul E. McKenney
  1 sibling, 0 replies; 107+ messages in thread
From: Mathieu Desnoyers @ 2010-01-12 16:38 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Paul E. McKenney, Oleg Nesterov, Peter Zijlstra, linux-kernel,
	Ingo Molnar, akpm, josh, tglx, Valdis.Kletnieks, dhowells, laijs,
	dipankar

* Steven Rostedt (rostedt@goodmis.org) wrote:
> On Tue, 2010-01-12 at 10:38 -0500, Mathieu Desnoyers wrote:
> 
> > > The UP-kernel case is handled by the #ifdef in sys_membarrier(), though
> > > with a bit larger code footprint than the embedded guys would probably
> > > prefer.  (Or is the compiler smart enough to omit these function given no
> > > calls to them?  If not, recommend putting them under CONFIG_SMP #ifdef.)
> > 
> > Hrm, that's a bit odd. I agree that UP systems could simply return
> > -ENOSYS for sys_membarrier, but then I wonder how userland could
> > distinguish between:
> > 
> > - an old kernel not supporting sys_membarrier()
> >   -> in this case we need to use the smp_mb() fallback on the read-side
> >      and in synchronize_rcu().
> > - a recent kernel supporting sys_membarrier(), CONFIG_SMP
> >   -> can use the barrier() on read-side, call sys_membarrier upon
> >      update.
> > - a recent kernel supporting sys_membarrier, !CONFIG_SMP
> >   -> calls to sys_membarrier() are not required, nor is barrier().
> > 
> > Or maybe we just postpone the userland smp_mb() question to another
> > thread. This will eventually need to be addressed anyway. Maybe with a
> > vgetmaxcpu() vsyscall.
> 
> I think Paul means to wrap all your other functions under the #ifdef.
> What you have for sys_membarrier() is fine (just return 0 on UP) but you
> also need to wrap the helper function above it under #ifdef CONFIG_SMP.
> Don't rely on the compiler to optimize them out. If anything, you'll
> probably get a bunch of warnings about static functions unused.

Ah! Indeed! Thanks for helping me see the light. ;)

Mathieu

> 
> -- Steve
> 
> 

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier (v3b)
  2010-01-12 16:27                                                             ` Steven Rostedt
  2010-01-12 16:38                                                               ` Mathieu Desnoyers
@ 2010-01-12 16:54                                                               ` Paul E. McKenney
  1 sibling, 0 replies; 107+ messages in thread
From: Paul E. McKenney @ 2010-01-12 16:54 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Mathieu Desnoyers, Oleg Nesterov, Peter Zijlstra, linux-kernel,
	Ingo Molnar, akpm, josh, tglx, Valdis.Kletnieks, dhowells, laijs,
	dipankar

On Tue, Jan 12, 2010 at 11:27:39AM -0500, Steven Rostedt wrote:
> On Tue, 2010-01-12 at 10:38 -0500, Mathieu Desnoyers wrote:
> 
> > > The UP-kernel case is handled by the #ifdef in sys_membarrier(), though
> > > with a bit larger code footprint than the embedded guys would probably
> > > prefer.  (Or is the compiler smart enough to omit these function given no
> > > calls to them?  If not, recommend putting them under CONFIG_SMP #ifdef.)
> > 
> > Hrm, that's a bit odd. I agree that UP systems could simply return
> > -ENOSYS for sys_membarrier, but then I wonder how userland could
> > distinguish between:
> > 
> > - an old kernel not supporting sys_membarrier()
> >   -> in this case we need to use the smp_mb() fallback on the read-side
> >      and in synchronize_rcu().
> > - a recent kernel supporting sys_membarrier(), CONFIG_SMP
> >   -> can use the barrier() on read-side, call sys_membarrier upon
> >      update.
> > - a recent kernel supporting sys_membarrier, !CONFIG_SMP
> >   -> calls to sys_membarrier() are not required, nor is barrier().
> > 
> > Or maybe we just postpone the userland smp_mb() question to another
> > thread. This will eventually need to be addressed anyway. Maybe with a
> > vgetmaxcpu() vsyscall.
> 
> I think Paul means to wrap all your other functions under the #ifdef.
> What you have for sys_membarrier() is fine (just return 0 on UP) but you
> also need to wrap the helper function above it under #ifdef CONFIG_SMP.
> Don't rely on the compiler to optimize them out. If anything, you'll
> probably get a bunch of warnings about static functions unused.

Yes -- much clearer statement of what I was getting at.  ;-)

							Thanx, Paul

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier (v3b)
  2010-01-12 15:38                                                           ` Mathieu Desnoyers
  2010-01-12 16:27                                                             ` Steven Rostedt
@ 2010-01-12 18:12                                                             ` Paul E. McKenney
  2010-01-12 18:56                                                               ` Mathieu Desnoyers
  1 sibling, 1 reply; 107+ messages in thread
From: Paul E. McKenney @ 2010-01-12 18:12 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Steven Rostedt, Oleg Nesterov, Peter Zijlstra, linux-kernel,
	Ingo Molnar, akpm, josh, tglx, Valdis.Kletnieks, dhowells, laijs,
	dipankar

On Tue, Jan 12, 2010 at 10:38:54AM -0500, Mathieu Desnoyers wrote:
> * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > On Sun, Jan 10, 2010 at 11:30:16PM -0500, Mathieu Desnoyers wrote:
> > > Here is an implementation of a new system call, sys_membarrier(), which
> > > executes a memory barrier on all threads of the current process.
> > > 
> > > It aims at greatly simplifying and enhancing the current signal-based
> > > liburcu userspace RCU synchronize_rcu() implementation.
> > > (found at http://lttng.org/urcu)
> > 
> > I didn't expect quite this comprehensive of an implementation from the
> > outset, but I guess I cannot complain.  ;-)
> > 
> > Overall, good stuff.
> > 
> > Interestingly enough, what you have implemented is analogous to
> > synchronize_rcu_expedited() and friends that have recently been added
> > to the in-kernel RCU API.  By this analogy, my earlier semi-suggestion
> > of synchronize_rcu(0 would be a candidate non-expedited implementation.
> > Long latency, but extremely low CPU consumption, full batching of
> > concurrent requests (even unrelated ones), and so on.
> 
> Yes, the main different I think is that the sys_membarrier
> infrastructure focuses on IPI-ing only the current process running
> threads.

Which does indeed make sense for the expedited interface.  On the other
hand, if you have a bunch of concurrent non-expedited requests from
different processes, covering all CPUs efficiently satisfies all of
the requests in one go.  And, if you use synchronize_sched() for the
non-expedited case, there will be no IPIs in the common case.

> > A few questions interspersed below.
> > 
> > > Changelog since v1:
> > > 
> > > - Only perform the IPI in CONFIG_SMP.
> > > - Only perform the IPI if the process has more than one thread.
> > > - Only send IPIs to CPUs involved with threads belonging to our process.
> > > - Adaptative IPI scheme (single vs many IPI with threshold).
> > > - Issue smp_mb() at the beginning and end of the system call.
> > > 
> > > Changelog since v2:
> > > 
> > > - Iteration on min(num_online_cpus(), nr threads in the process),
> > >   taking runqueue spinlocks, allocating a cpumask, ipi to many to the
> > >   cpumask. Does not allocate the cpumask if only a single IPI is needed.
> > > 
> > > 
> > > Both the signal-based and the sys_membarrier userspace RCU schemes
> > > permit us to remove the memory barrier from the userspace RCU
> > > rcu_read_lock() and rcu_read_unlock() primitives, thus significantly
> > > accelerating them. These memory barriers are replaced by compiler
> > > barriers on the read-side, and all matching memory barriers on the 
> > > write-side are turned into an invokation of a memory barrier on all
> > > active threads in the process. By letting the kernel perform this
> > > synchronization rather than dumbly sending a signal to every process
> > > threads (as we currently do), we diminish the number of unnecessary wake
> > > ups and only issue the memory barriers on active threads. Non-running
> > > threads do not need to execute such barrier anyway, because these are
> > > implied by the scheduler context switches.
> > > 
> > > To explain the benefit of this scheme, let's introduce two example threads:
> > > 
> > > Thread A (non-frequent, e.g. executing liburcu synchronize_rcu())
> > > Thread B (frequent, e.g. executing liburcu rcu_read_lock()/rcu_read_unlock())
> > > 
> > > In a scheme where all smp_mb() in thread A synchronize_rcu() are
> > > ordering memory accesses with respect to smp_mb() present in 
> > > rcu_read_lock/unlock(), we can change all smp_mb() from
> > > synchronize_rcu() into calls to sys_membarrier() and all smp_mb() from
> > > rcu_read_lock/unlock() into compiler barriers "barrier()".
> > > 
> > > Before the change, we had, for each smp_mb() pairs:
> > > 
> > > Thread A                    Thread B
> > > prev mem accesses           prev mem accesses
> > > smp_mb()                    smp_mb()
> > > follow mem accesses         follow mem accesses
> > > 
> > > After the change, these pairs become:
> > > 
> > > Thread A                    Thread B
> > > prev mem accesses           prev mem accesses
> > > sys_membarrier()            barrier()
> > > follow mem accesses         follow mem accesses
> > > 
> > > As we can see, there are two possible scenarios: either Thread B memory
> > > accesses do not happen concurrently with Thread A accesses (1), or they
> > > do (2).
> > > 
> > > 1) Non-concurrent Thread A vs Thread B accesses:
> > > 
> > > Thread A                    Thread B
> > > prev mem accesses
> > > sys_membarrier()
> > > follow mem accesses
> > >                             prev mem accesses
> > >                             barrier()
> > >                             follow mem accesses
> > > 
> > > In this case, thread B accesses will be weakly ordered. This is OK,
> > > because at that point, thread A is not particularly interested in
> > > ordering them with respect to its own accesses.
> > > 
> > > 2) Concurrent Thread A vs Thread B accesses
> > > 
> > > Thread A                    Thread B
> > > prev mem accesses           prev mem accesses
> > > sys_membarrier()            barrier()
> > > follow mem accesses         follow mem accesses
> > > 
> > > In this case, thread B accesses, which are ensured to be in program
> > > order thanks to the compiler barrier, will be "upgraded" to full
> > > smp_mb() thanks to the IPIs executing memory barriers on each active
> > > system threads. Each non-running process threads are intrinsically
> > > serialized by the scheduler.
> > > 
> > > Just tried with a cache-hot kernel compilation using 6/8 CPUs.
> > > 
> > > Normally:                                              real 2m41.852s
> > > With the sys_membarrier+1 busy-looping thread running: real 5m41.830s
> > > 
> > > So... 2x slower. That hurts.
> > > 
> > > So let's try allocating a cpu mask for PeterZ scheme. I prefer to have a
> > > small allocation overhead and benefit from cpumask broadcast if
> > > possible so we scale better. But that all depends on how big the
> > > allocation overhead is.
> > > 
> > > Impact of allocating a cpumask (time for 10,000,000 sys_membarrier
> > > calls, one thread is doing the sys_membarrier, the others are busy
> > > looping)).  Given that it costs almost half as much to perform the
> > > cpumask allocation than to send a single IPI, as we iterate on the CPUs
> > > until we find more than N match or iterated on all cpus.  If we only have
> > > N match or less, we send single IPIs. If we need more than that, then we
> > > switch to the cpumask allocation and send a broadcast IPI to the cpumask
> > > we construct for the matching CPUs. Let's call it the "adaptative IPI
> > > scheme".
> > > 
> > > For my Intel Xeon E5405
> > > 
> > > *This is calibration only, not taking the runqueue locks*
> > > 
> > > Just doing local mb()+single IPI to T other threads:
> > > 
> > > T=1: 0m18.801s
> > > T=2: 0m29.086s
> > > T=3: 0m46.841s
> > > T=4: 0m53.758s
> > > T=5: 1m10.856s
> > > T=6: 1m21.142s
> > > T=7: 1m38.362s
> > > 
> > > Just doing cpumask alloc+IPI-many to T other threads:
> > > 
> > > T=1: 0m21.778s
> > > T=2: 0m22.741s
> > > T=3: 0m22.185s
> > > T=4: 0m24.660s
> > > T=5: 0m26.855s
> > > T=6: 0m30.841s
> > > T=7: 0m29.551s
> > > 
> > > So I think the right threshold should be 1 thread (assuming other
> > > architecture will behave like mine). So starting with 2 threads, we
> > > allocate the cpumask before sending IPIs.
> > > 
> > > *end of calibration*
> > > 
> > > Resulting adaptative scheme, with runqueue locks:
> > > 
> > > T=1: 0m20.990s
> > > T=2: 0m22.588s
> > > T=3: 0m27.028s
> > > T=4: 0m29.027s
> > > T=5: 0m32.592s
> > > T=6: 0m36.556s
> > > T=7: 0m33.093s
> > > 
> > > The expected top pattern, when using 1 CPU for a thread doing sys_membarrier()
> > > in a loop and other threads busy-waiting in user-space on a variable shows that
> > > the thread doing sys_membarrier is doing mostly system calls, and other threads
> > > are mostly running in user-space. Side-note, in this test, it's important to
> > > check that individual threads are not always fully at 100% user-space time (they
> > > range between ~95% and 100%), because when some thread in the test is always at
> > > 100% on the same CPU, this means it does not get the IPI at all. (I actually
> > > found out about a bug in my own code while developing it with this test.)
> > 
> > The below data is for how many threads in the process?
> 
> 8 threads: one doing sys_membarrier() in a loop, 7 others waiting on a
> variable.

OK, thanks for the info!

> > Also, is "top"
> > accurate given that the IPI handler will have interrupts disabled?
> 
> Probably not. AFAIK. "top" does not really consider interrupts into its
> accounting. So, better take this top output with a grain of salt or two.

Might need something like oprofile to get good info?

> > > Cpu0  :100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
> > > Cpu1  : 99.7%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.3%hi,  0.0%si,  0.0%st
> > > Cpu2  : 99.3%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.7%hi,  0.0%si,  0.0%st
> > > Cpu3  :100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
> > > Cpu4  :100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
> > > Cpu5  : 96.0%us,  1.3%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  2.6%si,  0.0%st
> > > Cpu6  :  1.3%us, 98.7%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
> > > Cpu7  : 96.1%us,  3.3%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.3%hi,  0.3%si,  0.0%st
> > > 
> > > The system call number is only assigned for x86_64 in this RFC patch.
> > > 
> > > Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
> > > CC: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> > > CC: mingo@elte.hu
> > > CC: laijs@cn.fujitsu.com
> > > CC: dipankar@in.ibm.com
> > > CC: akpm@linux-foundation.org
> > > CC: josh@joshtriplett.org
> > > CC: dvhltc@us.ibm.com
> > > CC: niv@us.ibm.com
> > > CC: tglx@linutronix.de
> > > CC: peterz@infradead.org
> > > CC: rostedt@goodmis.org
> > > CC: Valdis.Kletnieks@vt.edu
> > > CC: dhowells@redhat.com
> > > ---
> > >  arch/x86/include/asm/unistd_64.h |    2 
> > >  kernel/sched.c                   |  219 +++++++++++++++++++++++++++++++++++++++
> > >  2 files changed, 221 insertions(+)
> > > 
> > > Index: linux-2.6-lttng/arch/x86/include/asm/unistd_64.h
> > > ===================================================================
> > > --- linux-2.6-lttng.orig/arch/x86/include/asm/unistd_64.h	2010-01-10 22:23:59.000000000 -0500
> > > +++ linux-2.6-lttng/arch/x86/include/asm/unistd_64.h	2010-01-10 22:29:30.000000000 -0500
> > > @@ -661,6 +661,8 @@ __SYSCALL(__NR_pwritev, sys_pwritev)
> > >  __SYSCALL(__NR_rt_tgsigqueueinfo, sys_rt_tgsigqueueinfo)
> > >  #define __NR_perf_event_open			298
> > >  __SYSCALL(__NR_perf_event_open, sys_perf_event_open)
> > > +#define __NR_membarrier				299
> > > +__SYSCALL(__NR_membarrier, sys_membarrier)
> > > 
> > >  #ifndef __NO_STUBS
> > >  #define __ARCH_WANT_OLD_READDIR
> > > Index: linux-2.6-lttng/kernel/sched.c
> > > ===================================================================
> > > --- linux-2.6-lttng.orig/kernel/sched.c	2010-01-10 22:23:59.000000000 -0500
> > > +++ linux-2.6-lttng/kernel/sched.c	2010-01-10 23:12:35.000000000 -0500
> > > @@ -119,6 +119,11 @@
> > >   */
> > >  #define RUNTIME_INF	((u64)~0ULL)
> > > 
> > > +/*
> > > + * IPI vs cpumask broadcast threshold. Threshold of 1 IPI.
> > > + */
> > > +#define ADAPT_IPI_THRESHOLD	1
> > > +
> > >  static inline int rt_policy(int policy)
> > >  {
> > >  	if (unlikely(policy == SCHED_FIFO || policy == SCHED_RR))
> > > @@ -10822,6 +10827,220 @@ struct cgroup_subsys cpuacct_subsys = {
> > >  };
> > >  #endif	/* CONFIG_CGROUP_CPUACCT */
> > > 
> > > +/*
> > > + * Execute a memory barrier on all CPUs on SMP systems.
> > > + * Do not rely on implicit barriers in smp_call_function(), just in case they
> > > + * are ever relaxed in the future.
> > > + */
> > > +static void membarrier_ipi(void *unused)
> > > +{
> > > +	smp_mb();
> > > +}
> > > +
> > > +/*
> > > + * Handle out-of-mem by sending per-cpu IPIs instead.
> > > + */
> > 
> > Good handling for out-of-memory errors!
> > 
> > > +static void membarrier_cpus_retry(int this_cpu)
> > > +{
> > > +	struct mm_struct *mm;
> > > +	int cpu;
> > > +
> > > +	for_each_online_cpu(cpu) {
> > > +		if (unlikely(cpu == this_cpu))
> > > +			continue;
> > > +		spin_lock_irq(&cpu_rq(cpu)->lock);
> > > +		mm = cpu_curr(cpu)->mm;
> > > +		spin_unlock_irq(&cpu_rq(cpu)->lock);
> > > +		if (current->mm == mm)
> > > +			smp_call_function_single(cpu, membarrier_ipi, NULL, 1);
> > 
> > There is of course some possibility of interrupting a real-time task,
> > as the destination CPU could context-switch once we drop the ->lock.
> > Not a criticism, just something to keep in mind.  After all, the only ways
> > I can think of to avoid this possibility do so by keeping the CPU from
> > switching to the real-time task, which sort of defeats the purpose.  ;-)
> 
> Absolutely. And it's of no use to add a check within the IPI handler to
> verify if it was indeed needed, because all we would skip is a simple
> smp_mb(), which is relatively minor in terms of overhead compared to the
> IPI itself.

Agreed!

> > > +	}
> > > +}
> > > +
> > > +static void membarrier_threads_retry(int this_cpu)
> > > +{
> > > +	struct mm_struct *mm;
> > > +	struct task_struct *t;
> > > +	struct rq *rq;
> > > +	int cpu;
> > > +
> > > +	list_for_each_entry_rcu(t, &current->thread_group, thread_group) {
> > > +		local_irq_disable();
> > > +		rq = __task_rq_lock(t);
> > > +		mm = rq->curr->mm;
> > > +		cpu = rq->cpu;
> > > +		__task_rq_unlock(rq);
> > > +		local_irq_enable();
> > > +		if (cpu == this_cpu)
> > > +			continue;
> > > +		if (current->mm == mm)
> > > +			smp_call_function_single(cpu, membarrier_ipi, NULL, 1);
> > 
> > Ditto.
> > 
> > > +	}
> > > +}
> > > +
> > > +static void membarrier_cpus(int this_cpu)
> > > +{
> > > +	int cpu, i, cpu_ipi[ADAPT_IPI_THRESHOLD], nr_cpus = 0;
> > > +	cpumask_var_t tmpmask;
> > > +	struct mm_struct *mm;
> > > +
> > > +	/* Get CPU IDs up to threshold */
> > > +	for_each_online_cpu(cpu) {
> > > +		if (unlikely(cpu == this_cpu))
> > > +			continue;
> > 
> > OK, the above "if" handles the single-threaded-process case.
> > 
> 
> No. See
> 
>  +   if (unlikely(thread_group_empty(current)))
>  +           return 0;
> 
> in the caller below. The if you present here simply ensures that we
> don't do a superfluous function call on the current thread. It's
> probably not really worth it for a slow path though.

OK, got it.

> > The UP-kernel case is handled by the #ifdef in sys_membarrier(), though
> > with a bit larger code footprint than the embedded guys would probably
> > prefer.  (Or is the compiler smart enough to omit these function given no
> > calls to them?  If not, recommend putting them under CONFIG_SMP #ifdef.)
> 
> Hrm, that's a bit odd. I agree that UP systems could simply return
> -ENOSYS for sys_membarrier, but then I wonder how userland could
> distinguish between:
> 
> - an old kernel not supporting sys_membarrier()
>   -> in this case we need to use the smp_mb() fallback on the read-side
>      and in synchronize_rcu().
> - a recent kernel supporting sys_membarrier(), CONFIG_SMP
>   -> can use the barrier() on read-side, call sys_membarrier upon
>      update.
> - a recent kernel supporting sys_membarrier, !CONFIG_SMP
>   -> calls to sys_membarrier() are not required, nor is barrier().
> 
> Or maybe we just postpone the userland smp_mb() question to another
> thread. This will eventually need to be addressed anyway. Maybe with a
> vgetmaxcpu() vsyscall.

[covered in Steve's email]

> > > +		spin_lock_irq(&cpu_rq(cpu)->lock);
> > > +		mm = cpu_curr(cpu)->mm;
> > > +		spin_unlock_irq(&cpu_rq(cpu)->lock);
> > > +		if (current->mm == mm) {
> > > +			if (nr_cpus == ADAPT_IPI_THRESHOLD) {
> > > +				nr_cpus++;
> > > +				break;
> > > +			}
> > > +			cpu_ipi[nr_cpus++] = cpu;
> > > +		}
> > > +	}
> > > +	if (likely(nr_cpus <= ADAPT_IPI_THRESHOLD)) {
> > > +		for (i = 0; i < nr_cpus; i++) {
> > > +			smp_call_function_single(cpu_ipi[i],
> > > +						 membarrier_ipi,
> > > +						 NULL, 1);
> > > +		}
> > > +	} else {
> > > +		if (!alloc_cpumask_var(&tmpmask, GFP_KERNEL)) {
> > > +			membarrier_cpus_retry(this_cpu);
> > > +			return;
> > > +		}
> > > +		for (i = 0; i < ADAPT_IPI_THRESHOLD; i++)
> > > +			cpumask_set_cpu(cpu_ipi[i], tmpmask);
> > > +		/* Continue previous online cpu iteration */
> > > +		cpumask_set_cpu(cpu, tmpmask);
> > > +		for (;;) {
> > > +			cpu = cpumask_next(cpu, cpu_online_mask);
> > > +			if (unlikely(cpu == this_cpu))
> > > +				continue;
> > > +			if (unlikely(cpu >= nr_cpu_ids))
> > > +				break;
> > > +			spin_lock_irq(&cpu_rq(cpu)->lock);
> > > +			mm = cpu_curr(cpu)->mm;
> > > +			spin_unlock_irq(&cpu_rq(cpu)->lock);
> > > +			if (current->mm == mm)
> > > +				cpumask_set_cpu(cpu, tmpmask);
> > > +		}
> > > +		smp_call_function_many(tmpmask, membarrier_ipi, NULL, 1);
> > > +		free_cpumask_var(tmpmask);
> > > +	}
> > > +}
> > > +
> > > +static void membarrier_threads(int this_cpu)
> > > +{
> > > +	int cpu, i, cpu_ipi[ADAPT_IPI_THRESHOLD], nr_cpus = 0;
> > > +	cpumask_var_t tmpmask;
> > > +	struct mm_struct *mm;
> > > +	struct task_struct *t;
> > > +	struct rq *rq;
> > > +
> > > +	/* Get CPU IDs up to threshold */
> > > +	list_for_each_entry_rcu(t, &current->thread_group,
> > > +				thread_group) {
> > > +		local_irq_disable();
> > > +		rq = __task_rq_lock(t);
> > > +		mm = rq->curr->mm;
> > > +		cpu = rq->cpu;
> > > +		__task_rq_unlock(rq);
> > > +		local_irq_enable();
> > > +		if (cpu == this_cpu)
> > > +			continue;
> > > +		if (current->mm == mm) {
> > 
> > I do not believe that the above test is gaining you anything.  It would
> > fail only if the task switched since the __task_rq_unlock(), but then
> > again, it could switch immediately after the above test just as well.
> 
> OK. Anyway I think I'll go the the shorter implementation using the
> mm_cpumask, and add an additionnal ->mm check with spinlocks.

Checking the ones not in mm_cpumask?  I guess I will find out when I
see the new patch.

							Thanx, Paul

> > > +			if (nr_cpus == ADAPT_IPI_THRESHOLD) {
> > > +				nr_cpus++;
> > > +				break;
> > > +			}
> > > +			cpu_ipi[nr_cpus++] = cpu;
> > > +		}
> > > +	}
> > > +	if (likely(nr_cpus <= ADAPT_IPI_THRESHOLD)) {
> > > +		for (i = 0; i < nr_cpus; i++) {
> > > +			smp_call_function_single(cpu_ipi[i],
> > > +						 membarrier_ipi,
> > > +						 NULL, 1);
> > > +		}
> > > +	} else {
> > > +		if (!alloc_cpumask_var(&tmpmask, GFP_KERNEL)) {
> > > +			membarrier_threads_retry(this_cpu);
> > > +			return;
> > > +		}
> > > +		for (i = 0; i < ADAPT_IPI_THRESHOLD; i++)
> > > +			cpumask_set_cpu(cpu_ipi[i], tmpmask);
> > > +		/* Continue previous thread iteration */
> > > +		cpumask_set_cpu(cpu, tmpmask);
> > > +		list_for_each_entry_continue_rcu(t,
> > > +						 &current->thread_group,
> > > +						 thread_group) {
> > > +			local_irq_disable();
> > > +			rq = __task_rq_lock(t);
> > > +			mm = rq->curr->mm;
> > > +			cpu = rq->cpu;
> > > +			__task_rq_unlock(rq);
> > > +			local_irq_enable();
> > > +			if (cpu == this_cpu)
> > > +				continue;
> > > +			if (current->mm == mm)
> > 
> > Ditto.
> > 
> > > +				cpumask_set_cpu(cpu, tmpmask);
> > > +		}
> > > +		smp_call_function_many(tmpmask, membarrier_ipi, NULL, 1);
> > > +		free_cpumask_var(tmpmask);
> > > +	}
> > > +}
> > > +
> > > +/*
> > > + * sys_membarrier - issue memory barrier on current process running threads
> > > + *
> > > + * Execute a memory barrier on all running threads of the current process.
> > > + * Upon completion, the caller thread is ensured that all process threads
> > > + * have passed through a state where memory accesses match program order.
> > > + * (non-running threads are de facto in such a state)
> > > + *
> > > + * We do not use mm_cpumask because there is no guarantee that each architecture
> > > + * switch_mm issues a smp_mb() before and after mm_cpumask modification upon
> > > + * scheduling change. Furthermore, leave_mm is also modifying the mm_cpumask (at
> > > + * least on x86) from the TLB flush IPI handler. So rather than playing tricky
> > > + * games with lazy TLB flush, let's simply iterate on online cpus/thread group,
> > > + * whichever is the smallest.
> > > + */
> > > +SYSCALL_DEFINE0(membarrier)
> > > +{
> > > +#ifdef CONFIG_SMP
> > > +	int this_cpu;
> > > +
> > > +	if (unlikely(thread_group_empty(current)))
> > > +		return 0;
> > > +
> > > +	rcu_read_lock();	/* protect cpu_curr(cpu)-> and rcu list */
> > > +	preempt_disable();
> > 
> > Hmmm...  You are going to hate me for pointing this out, Mathieu, but
> > holding preempt_disable() across the whole sys_membarrier() processing
> > might be hurting real-time latency more than would unconditionally
> > IPIing all the CPUs.  :-/
> 
> Hehe, I pointed this out myself a few emails ago :) This is why I
> started by using raw_smp_processor_id(). Well, let's make it simple
> first, and then we can improve if needed.
> 
> > 
> > That said, we have no shortage of situations where we scan the CPUs with
> > preemption disabled, and with interrupts disabled, for that matter.
> 
> Yep.
> 
> Thanks,
> 
> Mathieu
> 
> > 
> > > +	/*
> > > +	 * Memory barrier on the caller thread _before_ sending first IPI.
> > > +	 */
> > > +	smp_mb();
> > > +	/*
> > > +	 * We don't need to include ourself in IPI, as we already
> > > +	 * surround our execution with memory barriers.
> > > +	 */
> > > +	this_cpu = smp_processor_id();
> > > +	/* Approximate which is fastest: CPU or thread group iteration ? */
> > > +	if (num_online_cpus() <= atomic_read(&current->mm->mm_users))
> > > +		membarrier_cpus(this_cpu);
> > > +	else
> > > +		membarrier_threads(this_cpu);
> > > +	/*
> > > +	 * Memory barrier on the caller thread _after_ we finished
> > > +	 * waiting for the last IPI.
> > > +	 */
> > > +	smp_mb();
> > > +	preempt_enable();
> > > +	rcu_read_unlock();
> > > +#endif	/* #ifdef CONFIG_SMP */
> > > +	return 0;
> > > +}
> > > +
> > >  #ifndef CONFIG_SMP
> > > 
> > >  int rcu_expedited_torture_stats(char *page)
> > > -- 
> > > Mathieu Desnoyers
> > > OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
> 
> -- 
> Mathieu Desnoyers
> OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier (v3b)
  2010-01-12 18:12                                                             ` Paul E. McKenney
@ 2010-01-12 18:56                                                               ` Mathieu Desnoyers
  2010-01-13  0:23                                                                 ` Paul E. McKenney
  0 siblings, 1 reply; 107+ messages in thread
From: Mathieu Desnoyers @ 2010-01-12 18:56 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Steven Rostedt, Oleg Nesterov, Peter Zijlstra, linux-kernel,
	Ingo Molnar, akpm, josh, tglx, Valdis.Kletnieks, dhowells, laijs,
	dipankar

* Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> On Tue, Jan 12, 2010 at 10:38:54AM -0500, Mathieu Desnoyers wrote:
> > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > > On Sun, Jan 10, 2010 at 11:30:16PM -0500, Mathieu Desnoyers wrote:
> > > > Here is an implementation of a new system call, sys_membarrier(), which
> > > > executes a memory barrier on all threads of the current process.
> > > > 
> > > > It aims at greatly simplifying and enhancing the current signal-based
> > > > liburcu userspace RCU synchronize_rcu() implementation.
> > > > (found at http://lttng.org/urcu)
> > > 
> > > I didn't expect quite this comprehensive of an implementation from the
> > > outset, but I guess I cannot complain.  ;-)
> > > 
> > > Overall, good stuff.
> > > 
> > > Interestingly enough, what you have implemented is analogous to
> > > synchronize_rcu_expedited() and friends that have recently been added
> > > to the in-kernel RCU API.  By this analogy, my earlier semi-suggestion
> > > of synchronize_rcu(0 would be a candidate non-expedited implementation.
> > > Long latency, but extremely low CPU consumption, full batching of
> > > concurrent requests (even unrelated ones), and so on.
> > 
> > Yes, the main different I think is that the sys_membarrier
> > infrastructure focuses on IPI-ing only the current process running
> > threads.
> 
> Which does indeed make sense for the expedited interface.  On the other
> hand, if you have a bunch of concurrent non-expedited requests from
> different processes, covering all CPUs efficiently satisfies all of
> the requests in one go.  And, if you use synchronize_sched() for the
> non-expedited case, there will be no IPIs in the common case.

So are you proposing we add an "int expedited" parameter to the
system call, and let the caller choose between the IPI and
synchronize_sched() schemes?

[...]
> > > Also, is "top"
> > > accurate given that the IPI handler will have interrupts disabled?
> > 
> > Probably not. AFAIK. "top" does not really consider interrupts into its
> > accounting. So, better take this top output with a grain of salt or two.
> 
> Might need something like oprofile to get good info?

Could be, although I just wanted to point out the kind of pattern we
should expect. I'm not convinced it's all that useful to give the detailed
oprofile info; I'm rephrasing the above paragraph to state that top is
not super-accurate here.

[...]

Thanks,

Mathieu

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier (v3b)
  2010-01-12 18:56                                                               ` Mathieu Desnoyers
@ 2010-01-13  0:23                                                                 ` Paul E. McKenney
  0 siblings, 0 replies; 107+ messages in thread
From: Paul E. McKenney @ 2010-01-13  0:23 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Steven Rostedt, Oleg Nesterov, Peter Zijlstra, linux-kernel,
	Ingo Molnar, akpm, josh, tglx, Valdis.Kletnieks, dhowells, laijs,
	dipankar

On Tue, Jan 12, 2010 at 01:56:41PM -0500, Mathieu Desnoyers wrote:
> * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > On Tue, Jan 12, 2010 at 10:38:54AM -0500, Mathieu Desnoyers wrote:
> > > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > > > On Sun, Jan 10, 2010 at 11:30:16PM -0500, Mathieu Desnoyers wrote:
> > > > > Here is an implementation of a new system call, sys_membarrier(), which
> > > > > executes a memory barrier on all threads of the current process.
> > > > > 
> > > > > It aims at greatly simplifying and enhancing the current signal-based
> > > > > liburcu userspace RCU synchronize_rcu() implementation.
> > > > > (found at http://lttng.org/urcu)
> > > > 
> > > > I didn't expect quite this comprehensive of an implementation from the
> > > > outset, but I guess I cannot complain.  ;-)
> > > > 
> > > > Overall, good stuff.
> > > > 
> > > > Interestingly enough, what you have implemented is analogous to
> > > > synchronize_rcu_expedited() and friends that have recently been added
> > > > to the in-kernel RCU API.  By this analogy, my earlier semi-suggestion
> > > > of synchronize_rcu() would be a candidate non-expedited implementation.
> > > > Long latency, but extremely low CPU consumption, full batching of
> > > > concurrent requests (even unrelated ones), and so on.
> > > 
> > > Yes, the main difference I think is that the sys_membarrier
> > > infrastructure focuses on IPI-ing only the running threads of the
> > > current process.
> > 
> > Which does indeed make sense for the expedited interface.  On the other
> > hand, if you have a bunch of concurrent non-expedited requests from
> > different processes, covering all CPUs efficiently satisfies all of
> > the requests in one go.  And, if you use synchronize_sched() for the
> > non-expedited case, there will be no IPIs in the common case.
> 
> So are you proposing we add an "int expedited" parameter to the
> system call, and let the caller choose between the IPI and
> synchronize_sched() schemes?

Sure, why not?
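
For illustration, such an interface could take roughly the following
shape (a sketch only: the parameter name, the membarrier_expedited()
helper, and the use of synchronize_sched() for the slow path are
assumptions here, not the posted patch):

/*
 * Hypothetical sketch: the flag selects the expedited (IPI-based) path
 * or the slow, batched synchronize_sched() path.
 */
SYSCALL_DEFINE1(membarrier, int, expedited)
{
	if (expedited)
		membarrier_expedited();	/* IPI only the running threads of current->mm */
	else
		synchronize_sched();	/* long latency, but batches concurrent callers */
	return 0;
}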

> [...]
> > > > Also, is "top"
> > > > accurate given that the IPI handler will have interrupts disabled?
> > > 
> > > Probably not. AFAIK, "top" does not really include interrupts in its
> > > accounting. So, better take this top output with a grain of salt or two.
> > 
> > Might need something like oprofile to get good info?
> 
> Could be, although I just wanted to point out the kind of pattern we
> should expect. I'm not convinced it's so useful to give the detailed
> oprofile info. I'm rephrasing the above paragraph to state that top is
> not super-accurate here.

K.

							Thanx, Paul

> [...]
> 
> Thanks,
> 
> Mathieu
> 
> -- 
> Mathieu Desnoyers
> OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier
  2010-01-11 21:48                                                           ` Paul E. McKenney
@ 2010-01-14  2:56                                                             ` Lai Jiangshan
  2010-01-14  5:13                                                               ` Paul E. McKenney
  0 siblings, 1 reply; 107+ messages in thread
From: Lai Jiangshan @ 2010-01-14  2:56 UTC (permalink / raw)
  To: paulmck, Mathieu Desnoyers
  Cc: Steven Rostedt, Oleg Nesterov, Peter Zijlstra, linux-kernel,
	Ingo Molnar, akpm, josh, tglx, Valdis.Kletnieks, dhowells,
	dipankar

Paul E. McKenney wrote:
> On Mon, Jan 11, 2010 at 03:21:04PM -0500, Mathieu Desnoyers wrote:
>> * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
>>> On Sun, Jan 10, 2010 at 11:25:21PM -0500, Mathieu Desnoyers wrote:
>>>> * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
>>>> [...]
>>>>>> Even when taking the spinlocks, efficient iteration on active threads is
>>>>>> done with for_each_cpu(cpu, mm_cpumask(current->mm)), which depends on
>>>>>> the same cpumask, and thus requires the same memory barriers around the
>>>>>> updates.
>>>>> Ouch!!!  Good point and good catch!!!
>>>>>
>>>>>> We could switch to an inefficient iteration on all online CPUs instead,
>>>>>> and check read runqueue ->mm with the spinlock held. Is that what you
>>>>>> propose ? This will cause reading of large amounts of runqueue
>>>>>> information, especially on large systems running few threads. The other
>>>>>> way around is to iterate on all the process threads: in this case, small
>>>>>> systems running many threads will have to read information about many
>>>>>> inactive threads, which is not much better.
>>>>> I am not all that worried about exactly what we do as long as it is
>>>>> pretty obviously correct.  We can then improve performance when and as
>>>>> the need arises.  We might need to use any of the strategies you
>>>>> propose, or perhaps even choose among them depending on the number of
>>>>> threads in the process, the number of CPUs, and so forth.  (I hope not,
>>>>> but...)
>>>>>
>>>>> My guess is that an obviously correct approach would work well for a
>>>>> slowpath.  If someone later runs into performance problems, we can fix
>>>>> them with the added knowledge of what they are trying to do.
>>>>>
>>>> OK, here is what I propose. Let's choose between two implementations
>>>> (v3a and v3b), which implement two "obviously correct" approaches. In
>>>> summary:
>>>>
>>>> * baseline (based on 2.6.32.2)
>>>>    text	   data	    bss	    dec	    hex	filename
>>>>   76887	   8782	   2044	  87713	  156a1	kernel/sched.o
>>>>
>>>> * v3a: ipi to many using mm_cpumask
>>>>
>>>> - adds smp_mb__before_clear_bit()/smp_mb__after_clear_bit() before and
>>>>   after mm_cpumask stores in context_switch(). They are only executed
>>>>   when oldmm and mm are different. (it's my turn to hide behind an
>>>>   appropriately-sized boulder for touching the scheduler). ;) Note that
>>>>   it's not that bad, as these barriers turn into simple compiler barrier()
>>>>   on:
>>>>     avr32, blackfin, cris, frb, h8300, m32r, m68k, mn10300, score, sh,
>>>>     sparc, x86 and xtensa.
>>>>   The less lucky architectures gaining two smp_mb() are:
>>>>     alpha, arm, ia64, mips, parisc, powerpc and s390.
>>>>   ia64 is gaining only one smp_mb() thanks to its acquire semantic.
>>>> - size
>>>>    text	   data	    bss	    dec	    hex	filename
>>>>   77239	   8782	   2044	  88065	  15801	kernel/sched.o
>>>>   -> adds 352 bytes of text
>>>> - Number of lines (system call source code, w/o comments) : 18
>>>>
>>>> * v3b: iteration on min(num_online_cpus(), nr threads in the process),
>>>>   taking runqueue spinlocks, allocating a cpumask, ipi to many to the
>>>>   cpumask. Does not allocate the cpumask if only a single IPI is needed.
>>>>
>>>> - only adds sys_membarrier() and related functions.
>>>> - size
>>>>    text	   data	    bss	    dec	    hex	filename
>>>>   78047	   8782	   2044	  88873	  15b29	kernel/sched.o
>>>>   -> adds 1160 bytes of text
>>>> - Number of lines (system call source code, w/o comments) : 163
>>>>
>>>> I'll reply to this email with the two implementations. Comments are
>>>> welcome.
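
Concretely, the v3a barrier placement described above amounts to
bracketing the mm_cpumask update performed at context-switch time, along
these lines (a rough rendering of the idea only, not the actual patch;
the exact spot inside context_switch() is simplified):

if (oldmm != mm) {
	smp_mb__before_clear_bit();	/* order prior memory accesses before the cpumask update */
	cpumask_clear_cpu(cpu, mm_cpumask(oldmm));
	smp_mb__after_clear_bit();	/* order the cpumask update before subsequent accesses */
}
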
>>> Cool!!!  Just for completeness, I point out the following trivial
>>> implementation:
>>>
>>> /*
>>>  * sys_membarrier - issue memory barrier on current process running threads
>>>  *
>>>  * Execute a memory barrier on all running threads of the current process.
>>>  * Upon completion, the caller thread is ensured that all process threads
>>>  * have passed through a state where memory accesses match program order.
>>>  * (non-running threads are de facto in such a state)
>>>  *
>>>  * Note that synchronize_sched() has the side-effect of doing a memory
>>>  * barrier on each CPU.
>>>  */
>>> SYSCALL_DEFINE0(membarrier)
>>> {
>>> 	synchronize_sched();
>>> }
>>>
>>> This does unnecessarily hit all CPUs in the system, but has the same
>>> minimal impact that in-kernel RCU already has.  It has long latency,
>>> (milliseconds) which might well disqualify it from consideration for
>>> some applications.  On the other hand, it automatically batches multiple
>>> concurrent calls to sys_membarrier().
>> Benchmarking this implementation:
>>
>> 1000 calls to sys_membarrier() take:
>>
>> T=1: 0m16.007s
>> T=2: 0m16.006s
>> T=3: 0m16.010s
>> T=4: 0m16.008s
>> T=5: 0m16.005s
>> T=6: 0m16.005s
>> T=7: 0m16.005s
>>
>> For a 16 ms per call (my HZ is 250), as you expected. So this solution
>> brings a slowdown of 10,000 times compared to the IPI-based solution.
>> We'd be better off using signals instead.
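
For reference, a minimal loop of the following shape exercises the
system call the same way (a sketch: __NR_membarrier stands for whatever
syscall number the test kernel assigned to the new call):

#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
	int i;

	/* Each call waits for one synchronize_sched() grace period. */
	for (i = 0; i < 1000; i++)
		syscall(__NR_membarrier);
	return 0;
}
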
> 
> From a latency viewpoint, yes.  But synchronize_sched() consumes far
> less CPU time than do signals, avoids waking up sleeping CPUs, batches
> concurrent requests, and seems to be of some use in the kernel.  ;-)
> 
> But, as I said, just for completeness.
> 
> 							Thanx, Paul
> 


Actually, I like this implementation.
(synchronize_sched() would need to be changed to
synchronize_kernel_and_user_sched() or something similar)

The IPI implementation and the signal implementation cost too much,
whereas this implementation just waits until things are done, at very
low cost.

A kernel RCU grace period typically takes 3/HZ seconds
(for all implementations except preemptible RCU). That is a large
latency, but I don't think it matters much:
1) users should call synchronize_sched() only rarely.
2) If users care about this latency, they can implement a userland
call_rcu along these lines (see also the C sketch after this list):
userland_call_rcu() {
	insert rcu_head to rcu_callback_list.
}

rcu_callback_thread()
{
	for (;;) {
		handl_list = rcu_callback_list;
		rcu_callback_list = NULL;

		userland_synchronize_sched();

		handle the callback in handl_list
	}
}
3) Kernel RCU vs. the userland IPI-based RCU implementation:
should userland_synchronize_sched() have lower latency than kernel RCU?
Should userland be privileged to send a lot of IPIs?
It sounds crazy to me.
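
A fleshed-out version of the call_rcu sketch above could look as
follows (a rough illustration only: synchronize_rcu() stands for
whatever grace-period primitive the library provides, e.g. the
sys_membarrier-based one, and a real implementation would block on a
condition variable instead of re-running with an empty list):

#include <pthread.h>

struct rcu_head {
	struct rcu_head *next;
	void (*func)(struct rcu_head *head);
};

extern void synchronize_rcu(void);	/* the library's grace-period primitive (assumed) */

static struct rcu_head *rcu_callback_list;
static pthread_mutex_t rcu_cb_lock = PTHREAD_MUTEX_INITIALIZER;

void userland_call_rcu(struct rcu_head *head,
		       void (*func)(struct rcu_head *head))
{
	head->func = func;
	pthread_mutex_lock(&rcu_cb_lock);
	head->next = rcu_callback_list;		/* push onto the pending list */
	rcu_callback_list = head;
	pthread_mutex_unlock(&rcu_cb_lock);
}

void *rcu_callback_thread(void *arg)
{
	struct rcu_head *list, *next;

	for (;;) {
		pthread_mutex_lock(&rcu_cb_lock);
		list = rcu_callback_list;	/* grab the whole pending batch */
		rcu_callback_list = NULL;
		pthread_mutex_unlock(&rcu_cb_lock);

		synchronize_rcu();		/* one grace period covers the batch */

		for (; list; list = next) {	/* run the batch's callbacks */
			next = list->next;
			list->func(list);
		}
	}
	return NULL;
}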

See also this email (2010-01-11) I sent to you offlist:
> /* Lai jiangshan define it for fun */
> #define synchronize_kernel_sched() synchronize_sched()
> 
> /* We can use the current RCU code to implement one of the following */
> extern void synchronize_kernel_and_user_sched(void);
> extern void synchronize_user_sched(void);
> 
> /*
>  * wait until all cpu(which in userspace) enter kernel and call mb()
>  * (recommend)
>  */
> extern void synchronize_user_mb(void);
> 
> void sys_membarrier(void)
> {
> 	/*
> 	 * 1) We add very little overhead to kernel, we just wait at kernel space.
> 	 * 2) Several processes which call sys_membarrier() wait the same *batch*.
> 	 */
> 
> 	synchronize_kernel_and_user_sched();
> 	/* OR synchronize_user_sched()/synchronize_user_mb() */
> }
> 


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier
  2010-01-14  2:56                                                             ` Lai Jiangshan
@ 2010-01-14  5:13                                                               ` Paul E. McKenney
  2010-01-14  5:39                                                                 ` Mathieu Desnoyers
  0 siblings, 1 reply; 107+ messages in thread
From: Paul E. McKenney @ 2010-01-14  5:13 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: Mathieu Desnoyers, Steven Rostedt, Oleg Nesterov, Peter Zijlstra,
	linux-kernel, Ingo Molnar, akpm, josh, tglx, Valdis.Kletnieks,
	dhowells, dipankar

On Thu, Jan 14, 2010 at 10:56:08AM +0800, Lai Jiangshan wrote:
> Paul E. McKenney wrote:
> > On Mon, Jan 11, 2010 at 03:21:04PM -0500, Mathieu Desnoyers wrote:
> >> * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> >>> On Sun, Jan 10, 2010 at 11:25:21PM -0500, Mathieu Desnoyers wrote:
> >>>> * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> >>>> [...]
> >>>>>> Even when taking the spinlocks, efficient iteration on active threads is
> >>>>>> done with for_each_cpu(cpu, mm_cpumask(current->mm)), which depends on
> >>>>>> the same cpumask, and thus requires the same memory barriers around the
> >>>>>> updates.
> >>>>> Ouch!!!  Good point and good catch!!!
> >>>>>
> >>>>>> We could switch to an inefficient iteration on all online CPUs instead,
> >>>>>> and check read runqueue ->mm with the spinlock held. Is that what you
> >>>>>> propose ? This will cause reading of large amounts of runqueue
> >>>>>> information, especially on large systems running few threads. The other
> >>>>>> way around is to iterate on all the process threads: in this case, small
> >>>>>> systems running many threads will have to read information about many
> >>>>>> inactive threads, which is not much better.
> >>>>> I am not all that worried about exactly what we do as long as it is
> >>>>> pretty obviously correct.  We can then improve performance when and as
> >>>>> the need arises.  We might need to use any of the strategies you
> >>>>> propose, or perhaps even choose among them depending on the number of
> >>>>> threads in the process, the number of CPUs, and so forth.  (I hope not,
> >>>>> but...)
> >>>>>
> >>>>> My guess is that an obviously correct approach would work well for a
> >>>>> slowpath.  If someone later runs into performance problems, we can fix
> >>>>> them with the added knowledge of what they are trying to do.
> >>>>>
> >>>> OK, here is what I propose. Let's choose between two implementations
> >>>> (v3a and v3b), which implement two "obviously correct" approaches. In
> >>>> summary:
> >>>>
> >>>> * baseline (based on 2.6.32.2)
> >>>>    text	   data	    bss	    dec	    hex	filename
> >>>>   76887	   8782	   2044	  87713	  156a1	kernel/sched.o
> >>>>
> >>>> * v3a: ipi to many using mm_cpumask
> >>>>
> >>>> - adds smp_mb__before_clear_bit()/smp_mb__after_clear_bit() before and
> >>>>   after mm_cpumask stores in context_switch(). They are only executed
> >>>>   when oldmm and mm are different. (it's my turn to hide behind an
> >>>>   appropriately-sized boulder for touching the scheduler). ;) Note that
> >>>>   it's not that bad, as these barriers turn into simple compiler barrier()
> >>>>   on:
> >>>>     avr32, blackfin, cris, frb, h8300, m32r, m68k, mn10300, score, sh,
> >>>>     sparc, x86 and xtensa.
> >>>>   The less lucky architectures gaining two smp_mb() are:
> >>>>     alpha, arm, ia64, mips, parisc, powerpc and s390.
> >>>>   ia64 is gaining only one smp_mb() thanks to its acquire semantic.
> >>>> - size
> >>>>    text	   data	    bss	    dec	    hex	filename
> >>>>   77239	   8782	   2044	  88065	  15801	kernel/sched.o
> >>>>   -> adds 352 bytes of text
> >>>> - Number of lines (system call source code, w/o comments) : 18
> >>>>
> >>>> * v3b: iteration on min(num_online_cpus(), nr threads in the process),
> >>>>   taking runqueue spinlocks, allocating a cpumask, ipi to many to the
> >>>>   cpumask. Does not allocate the cpumask if only a single IPI is needed.
> >>>>
> >>>> - only adds sys_membarrier() and related functions.
> >>>> - size
> >>>>    text	   data	    bss	    dec	    hex	filename
> >>>>   78047	   8782	   2044	  88873	  15b29	kernel/sched.o
> >>>>   -> adds 1160 bytes of text
> >>>> - Number of lines (system call source code, w/o comments) : 163
> >>>>
> >>>> I'll reply to this email with the two implementations. Comments are
> >>>> welcome.
> >>> Cool!!!  Just for completeness, I point out the following trivial
> >>> implementation:
> >>>
> >>> /*
> >>>  * sys_membarrier - issue memory barrier on current process running threads
> >>>  *
> >>>  * Execute a memory barrier on all running threads of the current process.
> >>>  * Upon completion, the caller thread is ensured that all process threads
> >>>  * have passed through a state where memory accesses match program order.
> >>>  * (non-running threads are de facto in such a state)
> >>>  *
> >>>  * Note that synchronize_sched() has the side-effect of doing a memory
> >>>  * barrier on each CPU.
> >>>  */
> >>> SYSCALL_DEFINE0(membarrier)
> >>> {
> >>> 	synchronize_sched();
> >>> }
> >>>
> >>> This does unnecessarily hit all CPUs in the system, but has the same
> >>> minimal impact that in-kernel RCU already has.  It has long latency,
> >>> (milliseconds) which might well disqualify it from consideration for
> >>> some applications.  On the other hand, it automatically batches multiple
> >>> concurrent calls to sys_membarrier().
> >> Benchmarking this implementation:
> >>
> >> 1000 calls to sys_membarrier() take:
> >>
> >> T=1: 0m16.007s
> >> T=2: 0m16.006s
> >> T=3: 0m16.010s
> >> T=4: 0m16.008s
> >> T=5: 0m16.005s
> >> T=6: 0m16.005s
> >> T=7: 0m16.005s
> >>
> >> For a 16 ms per call (my HZ is 250), as you expected. So this solution
> >> brings a slowdown of 10,000 times compared to the IPI-based solution.
> >> We'd be better off using signals instead.
> > 
> > From a latency viewpoint, yes.  But synchronize_sched() consumes far
> > less CPU time than do signals, avoids waking up sleeping CPUs, batches
> > concurrent requests, and seems to be of some use in the kernel.  ;-)
> > 
> > But, as I said, just for completeness.
> > 
> > 							Thanx, Paul
> 
> 
> Actually, I like this implementation.
> (synchronize_sched() would need to be changed to
> synchronize_kernel_and_user_sched() or something similar)

The global memory barriers are indeed very much a side-effect of
synchronize_sched(), not its main purpose; you are right that its name
is a bit strange for this purpose.  ;-)

> The IPI implementation and the signal implementation cost too much,
> whereas this implementation just waits until things are done, at very
> low cost.
> 
> A kernel RCU grace period typically takes 3/HZ seconds
> (for all implementations except preemptible RCU). That is a large
> latency, but I don't think it matters much:
> 1) users should call synchronize_sched() only rarely.
> 2) If users care about this latency, they can implement a userland
> call_rcu along these lines:

In the common case, you are correct.  On the other hand, we did need to
do synchronize_rcu_expedited() and friends in the kernel, so it is
reasonable to expect that user-level RCU uses will also need expedited
interfaces.

> userland_call_rcu() {
> 	insert rcu_head to rcu_callback_list.
> }
> 
> rcu_callback_thread()
> {
> 	for (;;) {
> 		handl_list = rcu_callback_list;
> 		rcu_callback_list = NULL;
> 
> 		userland_synchronize_sched();
> 
> 		handle the callback in handl_list
> 	}
> }
> 3) Kernel RCU vs. the userland IPI-based RCU implementation:
> should userland_synchronize_sched() have lower latency than kernel RCU?
> Should userland be privileged to send a lot of IPIs?
> It sounds crazy to me.

You say "crazy" as if it was a bad thing.  ;-)

(Sorry, couldn't resist...)

But it is important to keep in mind that sys_membarrier() is just one
part of the user-level RCU implementation.  When you add in the necessary
waiting on per-thread counters, the user-level RCU is probably not that
much cheaper than the expedited in-kernel RCU primitives.
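
Roughly, the user-level write side combines the two pieces like this (a
simplified sketch, not liburcu's actual algorithm: membarrier() stands
for a wrapper around the proposed system call, the reader registry and
per-thread counters are assumptions, and real liburcu additionally flips
a grace-period phase so it never waits on readers that started after the
call):

#include <sched.h>

extern void membarrier(void);		/* wrapper for the proposed syscall (assumed) */

struct reader {
	volatile unsigned long ctr;	/* per-thread read-side nesting count */
	struct reader *next;
};

extern struct reader *registered_readers;	/* maintained at thread registration (assumed) */

void synchronize_rcu(void)
{
	struct reader *r;

	membarrier();	/* upgrade the readers' barrier() to a full memory barrier */
	for (r = registered_readers; r; r = r->next)
		while (r->ctr)		/* wait for pre-existing read-side critical sections */
			sched_yield();
	membarrier();	/* order the waiting before the caller's subsequent accesses */
}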

> See also this email(2010-1-11) I sent to you offlist:
> > /* Lai jiangshan define it for fun */
> > #define synchronize_kernel_sched() synchronize_sched()
> > 
> > /* We can use the current RCU code to implement one of the following */
> > extern void synchronize_kernel_and_user_sched(void);
> > extern void synchronize_user_sched(void);
> > 
> > /*
> >  * wait until all cpu(which in userspace) enter kernel and call mb()
> >  * (recommend)
> >  */
> > extern void synchronize_user_mb(void);
> > 
> > void sys_membarrier(void)
> > {
> > 	/*
> > 	 * 1) We add very little overhead to kernel, we just wait at kernel space.
> > 	 * 2) Several processes which call sys_membarrier() wait the same *batch*.
> > 	 */
> > 
> > 	synchronize_kernel_and_user_sched();
> > 	/* OR synchronize_user_sched()/synchronize_user_mb() */
> > }

If I am not getting too confused, Mathieu's latest patch does do
synchronize_sched() for the non-expedited case.  Mathieu pointed it
out in his email of January 9th, though not as a serious suggestion,
from what I can tell.  Your (private) email was indeed next, so as far
as I am concerned you do indeed share the credit/blame for suggesting
use of synchronize_sched() as a long-latency/low-overhead implementation
of sys_membarrier().

Mathieu, given that Lai has now posted publicly, could you please include
at least a note crediting him for the first serious suggestion of using
synchronize_sched()?

							Thanx, Paul

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier
  2010-01-14  5:13                                                               ` Paul E. McKenney
@ 2010-01-14  5:39                                                                 ` Mathieu Desnoyers
  0 siblings, 0 replies; 107+ messages in thread
From: Mathieu Desnoyers @ 2010-01-14  5:39 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Lai Jiangshan, Steven Rostedt, Oleg Nesterov, Peter Zijlstra,
	linux-kernel, Ingo Molnar, akpm, josh, tglx, Valdis.Kletnieks,
	dhowells, dipankar

* Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> On Thu, Jan 14, 2010 at 10:56:08AM +0800, Lai Jiangshan wrote:
> > Paul E. McKenney wrote:
> > > On Mon, Jan 11, 2010 at 03:21:04PM -0500, Mathieu Desnoyers wrote:
> > >> * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > >>> On Sun, Jan 10, 2010 at 11:25:21PM -0500, Mathieu Desnoyers wrote:
> > >>>> * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > >>>> [...]
> > >>>>>> Even when taking the spinlocks, efficient iteration on active threads is
> > >>>>>> done with for_each_cpu(cpu, mm_cpumask(current->mm)), which depends on
> > >>>>>> the same cpumask, and thus requires the same memory barriers around the
> > >>>>>> updates.
> > >>>>> Ouch!!!  Good point and good catch!!!
> > >>>>>
> > >>>>>> We could switch to an inefficient iteration on all online CPUs instead,
> > >>>>>> and check read runqueue ->mm with the spinlock held. Is that what you
> > >>>>>> propose ? This will cause reading of large amounts of runqueue
> > >>>>>> information, especially on large systems running few threads. The other
> > >>>>>> way around is to iterate on all the process threads: in this case, small
> > >>>>>> systems running many threads will have to read information about many
> > >>>>>> inactive threads, which is not much better.
> > >>>>> I am not all that worried about exactly what we do as long as it is
> > >>>>> pretty obviously correct.  We can then improve performance when and as
> > >>>>> the need arises.  We might need to use any of the strategies you
> > >>>>> propose, or perhaps even choose among them depending on the number of
> > >>>>> threads in the process, the number of CPUs, and so forth.  (I hope not,
> > >>>>> but...)
> > >>>>>
> > >>>>> My guess is that an obviously correct approach would work well for a
> > >>>>> slowpath.  If someone later runs into performance problems, we can fix
> > >>>>> them with the added knowledge of what they are trying to do.
> > >>>>>
> > >>>> OK, here is what I propose. Let's choose between two implementations
> > >>>> (v3a and v3b), which implement two "obviously correct" approaches. In
> > >>>> summary:
> > >>>>
> > >>>> * baseline (based on 2.6.32.2)
> > >>>>    text	   data	    bss	    dec	    hex	filename
> > >>>>   76887	   8782	   2044	  87713	  156a1	kernel/sched.o
> > >>>>
> > >>>> * v3a: ipi to many using mm_cpumask
> > >>>>
> > >>>> - adds smp_mb__before_clear_bit()/smp_mb__after_clear_bit() before and
> > >>>>   after mm_cpumask stores in context_switch(). They are only executed
> > >>>>   when oldmm and mm are different. (it's my turn to hide behind an
> > >>>>   appropriately-sized boulder for touching the scheduler). ;) Note that
> > >>>>   it's not that bad, as these barriers turn into simple compiler barrier()
> > >>>>   on:
> > >>>>     avr32, blackfin, cris, frb, h8300, m32r, m68k, mn10300, score, sh,
> > >>>>     sparc, x86 and xtensa.
> > >>>>   The less lucky architectures gaining two smp_mb() are:
> > >>>>     alpha, arm, ia64, mips, parisc, powerpc and s390.
> > >>>>   ia64 is gaining only one smp_mb() thanks to its acquire semantic.
> > >>>> - size
> > >>>>    text	   data	    bss	    dec	    hex	filename
> > >>>>   77239	   8782	   2044	  88065	  15801	kernel/sched.o
> > >>>>   -> adds 352 bytes of text
> > >>>> - Number of lines (system call source code, w/o comments) : 18
> > >>>>
> > >>>> * v3b: iteration on min(num_online_cpus(), nr threads in the process),
> > >>>>   taking runqueue spinlocks, allocating a cpumask, ipi to many to the
> > >>>>   cpumask. Does not allocate the cpumask if only a single IPI is needed.
> > >>>>
> > >>>> - only adds sys_membarrier() and related functions.
> > >>>> - size
> > >>>>    text	   data	    bss	    dec	    hex	filename
> > >>>>   78047	   8782	   2044	  88873	  15b29	kernel/sched.o
> > >>>>   -> adds 1160 bytes of text
> > >>>> - Number of lines (system call source code, w/o comments) : 163
> > >>>>
> > >>>> I'll reply to this email with the two implementations. Comments are
> > >>>> welcome.
> > >>> Cool!!!  Just for completeness, I point out the following trivial
> > >>> implementation:
> > >>>
> > >>> /*
> > >>>  * sys_membarrier - issue memory barrier on current process running threads
> > >>>  *
> > >>>  * Execute a memory barrier on all running threads of the current process.
> > >>>  * Upon completion, the caller thread is ensured that all process threads
> > >>>  * have passed through a state where memory accesses match program order.
> > >>>  * (non-running threads are de facto in such a state)
> > >>>  *
> > >>>  * Note that synchronize_sched() has the side-effect of doing a memory
> > >>>  * barrier on each CPU.
> > >>>  */
> > >>> SYSCALL_DEFINE0(membarrier)
> > >>> {
> > >>> 	synchronize_sched();
> > >>> }
> > >>>
> > >>> This does unnecessarily hit all CPUs in the system, but has the same
> > >>> minimal impact that in-kernel RCU already has.  It has long latency,
> > >>> (milliseconds) which might well disqualify it from consideration for
> > >>> some applications.  On the other hand, it automatically batches multiple
> > >>> concurrent calls to sys_membarrier().
> > >> Benchmarking this implementation:
> > >>
> > >> 1000 calls to sys_membarrier() take:
> > >>
> > >> T=1: 0m16.007s
> > >> T=2: 0m16.006s
> > >> T=3: 0m16.010s
> > >> T=4: 0m16.008s
> > >> T=5: 0m16.005s
> > >> T=6: 0m16.005s
> > >> T=7: 0m16.005s
> > >>
> > >> For a 16 ms per call (my HZ is 250), as you expected. So this solution
> > >> brings a slowdown of 10,000 times compared to the IPI-based solution.
> > >> We'd be better off using signals instead.
> > > 
> > > From a latency viewpoint, yes.  But synchronize_sched() consumes far
> > > less CPU time than do signals, avoids waking up sleeping CPUs, batches
> > > concurrent requests, and seems to be of some use in the kernel.  ;-)
> > > 
> > > But, as I said, just for completeness.
> > > 
> > > 							Thanx, Paul
> > 
> > 
> > Actually, I like this implementation.
> > (synchronize_sched() would need to be changed to
> > synchronize_kernel_and_user_sched() or something similar)
> 
> The global memory barriers are indeed very much a side-effect of
> synchronize_sched(), not its main purpose; you are right that its name
> is a bit strange for this purpose.  ;-)

It's not a "synchronize_user_sched()" at all, though, because, as you
say below, it's only part of the solution. The rest of the
synchronization needed for RCU is performed by liburcu. The kernel
system call, in this proposal, is just one piece of the puzzle.

> 
> > The IPI implementation and the signal implementation cost too much,
> > whereas this implementation just waits until things are done, at very
> > low cost.
> > 
> > A kernel RCU grace period typically takes 3/HZ seconds
> > (for all implementations except preemptible RCU). That is a large
> > latency, but I don't think it matters much:
> > 1) users should call synchronize_sched() only rarely.
> > 2) If users care about this latency, they can implement a userland
> > call_rcu along these lines:
> 
> In the common case, you are correct.  On the other hand, we did need to
> do synchronize_rcu_expedited() and friends in the kernel, so it is
> reasonable to expect that user-level RCU uses will also need expedited
> interfaces.

Yes, I can foresee that some library users will require relatively fast
synchronize_rcu() execution. Even though there might be better designs
based on call_rcu() implementations (I currently have a defer_rcu()
which is quite close), we cannot force all library users to use such an
ideal design.

> 
> > userland_call_rcu() {
> > 	insert rcu_head to rcu_callback_list.
> > }
> > 
> > rcu_callback_thread()
> > {
> > 	for (;;) {
> > 		handl_list = rcu_callback_list;
> > 		rcu_callback_list = NULL;
> > 
> > 		userland_synchronize_sched();
> > 
> > 		handle the callback in handl_list
> > 	}
> > }
> > 3) Kernel RCU vs. the userland IPI-based RCU implementation:
> > should userland_synchronize_sched() have lower latency than kernel RCU?
> > Should userland be privileged to send a lot of IPIs?
> > It sounds crazy to me.
> 
> You say "crazy" as if it was a bad thing.  ;-)
> 
> (Sorry, couldn't resist...)
> 
> But it is important to keep in mind that sys_membarrier() is just one
> part of the user-level RCU implementation.  When you add in the necessary
> waiting on per-thread counters, the user-level RCU is probably not that
> much cheaper than the expedited in-kernel RCU primitives.

Indeed, these overheads are probably quite close.

> 
> > See also this email(2010-1-11) I sent to you offlist:
> > > /* Lai jiangshan define it for fun */
> > > #define synchronize_kernel_sched() synchronize_sched()
> > > 
> > > /* We can use the current RCU code to implement one of the following */
> > > extern void synchronize_kernel_and_user_sched(void);
> > > extern void synchronize_user_sched(void);
> > > 
> > > /*
> > >  * wait until all cpu(which in userspace) enter kernel and call mb()
> > >  * (recommend)
> > >  */
> > > extern void synchronize_user_mb(void);
> > > 
> > > void sys_membarrier(void)
> > > {
> > > 	/*
> > > 	 * 1) We add very little overhead to kernel, we just wait at kernel space.
> > > 	 * 2) Several processes which call sys_membarrier() wait the same *batch*.
> > > 	 */
> > > 
> > > 	synchronize_kernel_and_user_sched();
> > > 	/* OR synchronize_user_sched()/synchronize_user_mb() */
> > > }
> 
> If I am not getting too confused, Mathieu's latest patch does do
> synchronize_sched() for the non-expedited case.  Mathieu pointed it
> out in his email of January 9th, though not as a serious suggestion,
> from what I can tell.  Your (private) email was indeed next, so as far
> as I am concerned you do indeed share the credit/blame for suggesting
> use of synchronize_sched() as a long-latency/low-overhead implementation
> of sys_membarrier().
> 
> Mathieu, given that Lai has now posted publicly, could you please include
> at least a note crediting him for the first serious suggestion of using
> synchronize_sched()?

Yep, will do in v6.

Thanks!

Mathieu

> 
> 							Thanx, Paul

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 107+ messages in thread

end of thread, other threads:[~2010-01-14  5:44 UTC | newest]

Thread overview: 107+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-01-07  4:40 [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier Mathieu Desnoyers
2010-01-07  5:02 ` Paul E. McKenney
2010-01-07  5:39   ` Mathieu Desnoyers
2010-01-07  8:32   ` Peter Zijlstra
2010-01-07 16:39     ` Paul E. McKenney
2010-01-07  5:28 ` Josh Triplett
2010-01-07  6:04   ` Mathieu Desnoyers
2010-01-07  6:32     ` Josh Triplett
2010-01-07 17:45       ` Mathieu Desnoyers
2010-01-07 16:46     ` Paul E. McKenney
2010-01-07  5:40 ` Steven Rostedt
2010-01-07  6:19   ` Mathieu Desnoyers
2010-01-07  6:35     ` Josh Triplett
2010-01-07  8:44       ` Peter Zijlstra
2010-01-07 13:15         ` Steven Rostedt
2010-01-07 15:07         ` Mathieu Desnoyers
2010-01-07 16:52         ` Paul E. McKenney
2010-01-07 17:18           ` Peter Zijlstra
2010-01-07 17:31             ` Paul E. McKenney
2010-01-07 17:44               ` Mathieu Desnoyers
2010-01-07 17:55                 ` Paul E. McKenney
2010-01-07 17:44               ` Steven Rostedt
2010-01-07 17:56                 ` Paul E. McKenney
2010-01-07 18:04                   ` Steven Rostedt
2010-01-07 18:40                     ` Paul E. McKenney
2010-01-07 17:36             ` Mathieu Desnoyers
2010-01-07 14:27     ` Steven Rostedt
2010-01-07 15:10       ` Mathieu Desnoyers
2010-01-07 16:49   ` Paul E. McKenney
2010-01-07 17:00     ` Steven Rostedt
2010-01-07  8:27 ` Peter Zijlstra
2010-01-07 18:30   ` Oleg Nesterov
2010-01-07 18:39     ` Paul E. McKenney
2010-01-07 18:59       ` Steven Rostedt
2010-01-07 19:16         ` Paul E. McKenney
2010-01-07 19:40           ` Steven Rostedt
2010-01-07 20:58             ` Paul E. McKenney
2010-01-07 21:35               ` Steven Rostedt
2010-01-07 22:34                 ` Paul E. McKenney
2010-01-08 22:28                 ` Mathieu Desnoyers
2010-01-08 23:53                 ` Mathieu Desnoyers
2010-01-09  0:20                   ` Paul E. McKenney
2010-01-09  1:02                     ` Mathieu Desnoyers
2010-01-09  1:21                       ` Paul E. McKenney
2010-01-09  1:22                         ` Paul E. McKenney
2010-01-09  2:38                         ` Mathieu Desnoyers
2010-01-09  5:42                           ` Paul E. McKenney
2010-01-09 19:20                             ` Mathieu Desnoyers
2010-01-09 23:05                               ` Steven Rostedt
2010-01-09 23:16                                 ` Steven Rostedt
2010-01-10  0:03                                   ` Paul E. McKenney
2010-01-10  0:41                                     ` Steven Rostedt
2010-01-10  1:14                                       ` Mathieu Desnoyers
2010-01-10  1:44                                       ` Mathieu Desnoyers
2010-01-10  2:12                                         ` Steven Rostedt
2010-01-10  5:25                                           ` Paul E. McKenney
2010-01-10 11:50                                             ` Steven Rostedt
2010-01-10 16:03                                               ` Mathieu Desnoyers
2010-01-10 16:21                                                 ` Steven Rostedt
2010-01-10 17:10                                                   ` Mathieu Desnoyers
2010-01-10 21:02                                                     ` Steven Rostedt
2010-01-10 21:41                                                       ` Mathieu Desnoyers
2010-01-11  1:21                                                       ` Paul E. McKenney
2010-01-10 17:45                                               ` Paul E. McKenney
2010-01-10 18:24                                                 ` Mathieu Desnoyers
2010-01-11  1:17                                                   ` Paul E. McKenney
2010-01-11  4:25                                                     ` Mathieu Desnoyers
2010-01-11  4:29                                                       ` [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier (v3a) Mathieu Desnoyers
2010-01-11 17:27                                                         ` Paul E. McKenney
2010-01-11 17:35                                                           ` Mathieu Desnoyers
2010-01-11 17:50                                                         ` Peter Zijlstra
2010-01-11 20:52                                                           ` Mathieu Desnoyers
2010-01-11 21:19                                                             ` Peter Zijlstra
2010-01-11 22:04                                                               ` Mathieu Desnoyers
2010-01-11 22:20                                                                 ` Peter Zijlstra
2010-01-11 22:48                                                                   ` Paul E. McKenney
2010-01-11 22:48                                                                   ` Mathieu Desnoyers
2010-01-11 21:19                                                             ` Peter Zijlstra
2010-01-11 21:31                                                             ` Peter Zijlstra
2010-01-11  4:30                                                       ` [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier (v3b) Mathieu Desnoyers
2010-01-11 22:43                                                         ` Paul E. McKenney
2010-01-12 15:38                                                           ` Mathieu Desnoyers
2010-01-12 16:27                                                             ` Steven Rostedt
2010-01-12 16:38                                                               ` Mathieu Desnoyers
2010-01-12 16:54                                                               ` Paul E. McKenney
2010-01-12 18:12                                                             ` Paul E. McKenney
2010-01-12 18:56                                                               ` Mathieu Desnoyers
2010-01-13  0:23                                                                 ` Paul E. McKenney
2010-01-11 16:25                                                       ` [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier Paul E. McKenney
2010-01-11 20:21                                                         ` Mathieu Desnoyers
2010-01-11 21:48                                                           ` Paul E. McKenney
2010-01-14  2:56                                                             ` Lai Jiangshan
2010-01-14  5:13                                                               ` Paul E. McKenney
2010-01-14  5:39                                                                 ` Mathieu Desnoyers
2010-01-10  5:18                                         ` Paul E. McKenney
2010-01-10  1:12                                     ` Mathieu Desnoyers
2010-01-10  5:19                                       ` Paul E. McKenney
2010-01-10  1:04                                   ` Mathieu Desnoyers
2010-01-10  1:01                                 ` Mathieu Desnoyers
2010-01-09 23:59                               ` Paul E. McKenney
2010-01-10  1:11                                 ` Mathieu Desnoyers
2010-01-07  9:50 ` Andi Kleen
2010-01-07 15:12   ` Mathieu Desnoyers
2010-01-07 16:56   ` Paul E. McKenney
2010-01-07 11:04 ` David Howells
2010-01-07 15:15   ` Mathieu Desnoyers
2010-01-07 15:47   ` David Howells

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).