* [RFC PATCH] membarrier: handle nohz_full with expedited thread registration
@ 2017-01-16 19:51 Mathieu Desnoyers
  2017-01-16 20:15 ` Linus Torvalds
  0 siblings, 1 reply; 11+ messages in thread
From: Mathieu Desnoyers @ 2017-01-16 19:51 UTC (permalink / raw)
  To: Paul E . McKenney
  Cc: linux-kernel, Mathieu Desnoyers, Josh Triplett, KOSAKI Motohiro,
	Steven Rostedt, Nicholas Miell, Ingo Molnar, Alan Cox,
	Lai Jiangshan, Stephen Hemminger, Thomas Gleixner,
	Peter Zijlstra, David Howells, Pranith Kumar, Michael Kerrisk,
	Shuah Khan, Andrew Morton, Linus Torvalds

Threads running on nohz_full CPUs are not considered by
synchronize_sched(), but they should be covered by a membarrier system
call with the MEMBARRIER_CMD_SHARED command.

Introduce two new commands to membarrier:
MEMBARRIER_CMD_REGISTER_EXPEDITED and
MEMBARRIER_CMD_UNREGISTER_EXPEDITED.

Nohz_full threads that need to receive interrupts, so that compiler
barriers can be paired with the membarrier system call to ensure
correct memory ordering, should register as "expedited" threads.
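
For example, a thread running on a nohz_full CPU could use the new
commands as follows (illustrative sketch only: the wrapper and
fast_path_section() are invented for the example, and error handling
is omitted):

	#include <linux/membarrier.h>
	#include <sys/syscall.h>
	#include <unistd.h>

	static int membarrier(int cmd, int flags)
	{
		return syscall(__NR_membarrier, cmd, flags);
	}

	static void fast_path_section(void)
	{
		/* Ask to be targeted by expedited IPIs from now on. */
		membarrier(MEMBARRIER_CMD_REGISTER_EXPEDITED, 0);
		/*
		 * From here on, membarrier(MEMBARRIER_CMD_SHARED) issued
		 * by any other thread also orders this thread's accesses,
		 * so compiler barriers suffice on this fast path.
		 */
		membarrier(MEMBARRIER_CMD_UNREGISTER_EXPEDITED, 0);
	}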

[ This RFC patch lacks documentation. I mainly want feedback to see if
  everyone is OK with the general approach. ]

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Nicholas Miell <nmiell@comcast.net>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Alan Cox <gnomes@lxorguk.ukuu.org.uk>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Pranith Kumar <bobby.prani@gmail.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Shuah Khan <shuahkh@osg.samsung.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
---
 fs/exec.c                       |  1 +
 include/linux/sched.h           | 27 +++++++++++++++
 include/uapi/linux/membarrier.h |  6 ++++
 kernel/fork.c                   |  2 ++
 kernel/membarrier.c             | 77 +++++++++++++++++++++++++++++++++++++++--
 5 files changed, 111 insertions(+), 2 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index e579466..2cf1f87 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1771,6 +1771,7 @@ static int do_execveat_common(int fd, struct filename *filename,
 	/* execve succeeded */
 	current->fs->in_exec = 0;
 	current->in_execve = 0;
+	membarrier_execve(current);
 	acct_update_integrals(current);
 	task_numa_free(current);
 	free_bprm(bprm);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index ad3ec9e..1242eb9 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1998,6 +1998,9 @@ struct task_struct {
 	/* A live task holds one reference. */
 	atomic_t stack_refcount;
 #endif
+#ifdef CONFIG_MEMBARRIER
+	unsigned int membarrier_expedited;
+#endif
 /* CPU-specific state of this task */
 	struct thread_struct thread;
 /*
@@ -3671,4 +3674,28 @@ void cpufreq_add_update_util_hook(int cpu, struct update_util_data *data,
 void cpufreq_remove_update_util_hook(int cpu);
 #endif /* CONFIG_CPU_FREQ */
 
+#ifdef CONFIG_MEMBARRIER
+static inline void membarrier_fork(struct task_struct *t,
+		unsigned long clone_flags)
+{
+	if (clone_flags & CLONE_THREAD)
+		t->membarrier_expedited = 0;
+	else
+		t->membarrier_expedited = current->membarrier_expedited;
+}
+
+static inline void membarrier_execve(struct task_struct *t)
+{
+	t->membarrier_expedited = 0;
+}
+#else
+static inline void membarrier_fork(struct task_struct *t,
+		unsigned long clone_flags)
+{
+}
+static inline void membarrier_execve(struct task_struct *t)
+{
+}
+#endif
+
 #endif
diff --git a/include/uapi/linux/membarrier.h b/include/uapi/linux/membarrier.h
index e0b108b..4b78f07 100644
--- a/include/uapi/linux/membarrier.h
+++ b/include/uapi/linux/membarrier.h
@@ -40,6 +40,10 @@
  *                          (non-running threads are de facto in such a
  *                          state). This covers threads from all processes
  *                          running on the system. This command returns 0.
+ * @MEMBARRIER_CMD_REGISTER_EXPEDITED:
+ *                          TODO
+ * @MEMBARRIER_CMD_UNREGISTER_EXPEDITED:
+ *                          TODO
  *
  * Command to be passed to the membarrier system call. The commands need to
  * be a single bit each, except for MEMBARRIER_CMD_QUERY which is assigned to
@@ -48,6 +52,8 @@
 enum membarrier_cmd {
 	MEMBARRIER_CMD_QUERY = 0,
 	MEMBARRIER_CMD_SHARED = (1 << 0),
+	MEMBARRIER_CMD_REGISTER_EXPEDITED = (1 << 1),
+	MEMBARRIER_CMD_UNREGISTER_EXPEDITED = (1 << 2),
 };
 
 #endif /* _UAPI_LINUX_MEMBARRIER_H */
diff --git a/kernel/fork.c b/kernel/fork.c
index 11c5c8a..cec23e0 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1769,6 +1769,8 @@ static __latent_entropy struct task_struct *copy_process(
 	 */
 	copy_seccomp(p);
 
+	membarrier_fork(p, clone_flags);
+
 	/*
 	 * Process group and session signals need to be delivered to just the
 	 * parent before the fork or both the parent and the child after the
diff --git a/kernel/membarrier.c b/kernel/membarrier.c
index 536c727..65a6fbf 100644
--- a/kernel/membarrier.c
+++ b/kernel/membarrier.c
@@ -16,12 +16,79 @@
 
 #include <linux/syscalls.h>
 #include <linux/membarrier.h>
+#include <linux/tick.h>
+#include <linux/smp.h>
+#include <linux/sched.h>
+
+/*
+ * TODO: private sched.h is needed for runqueue. Should we move the
+ * sched code under kernel/sched/ ?
+ */
+#include "sched/sched.h"
 
 /*
  * Bitmask made from a "or" of all commands within enum membarrier_cmd,
  * except MEMBARRIER_CMD_QUERY.
  */
-#define MEMBARRIER_CMD_BITMASK	(MEMBARRIER_CMD_SHARED)
+#define MEMBARRIER_CMD_BITMASK	\
+	(MEMBARRIER_CMD_SHARED \
+	| MEMBARRIER_CMD_REGISTER_EXPEDITED \
+	| MEMBARRIER_CMD_UNREGISTER_EXPEDITED)
+
+static int membarrier_register_expedited(struct task_struct *t)
+{
+	struct rq *rq;
+
+	if (t->membarrier_expedited == UINT_MAX)
+		return -EOVERFLOW;
+	rq = this_rq();
+	raw_spin_lock(&rq->lock);
+	t->membarrier_expedited++;
+	raw_spin_unlock(&rq->lock);
+	return 0;
+}
+
+static int membarrier_unregister_expedited(struct task_struct *t)
+{
+	struct rq *rq;
+
+	if (!t->membarrier_expedited)
+		return -ENOENT;
+	rq = this_rq();
+	raw_spin_lock(&rq->lock);
+	t->membarrier_expedited--;
+	raw_spin_unlock(&rq->lock);
+	return 0;
+}
+
+static void memory_barrier(void *info)
+{
+	smp_mb();
+}
+
+static void membarrier_nohz_full_expedited(void)
+{
+	int cpu;
+
+	if (!tick_nohz_full_enabled())
+		return;
+	for_each_cpu(cpu, tick_nohz_full_mask) {
+		struct rq *rq;
+		struct task_struct *t;
+
+		rq = cpu_rq(cpu);
+		raw_spin_lock(&rq->lock);
+		t = rq->curr;
+		if (t->membarrier_expedited) {
+			int ret;
+
+			ret = smp_call_function_single(cpu, memory_barrier,
+					NULL, 1);
+			WARN_ON_ONCE(ret);
+		}
+		raw_spin_unlock(&rq->lock);
+	}
+}
 
 /**
  * sys_membarrier - issue memory barriers on a set of threads
@@ -57,9 +124,15 @@ SYSCALL_DEFINE2(membarrier, int, cmd, int, flags)
 	case MEMBARRIER_CMD_QUERY:
 		return MEMBARRIER_CMD_BITMASK;
 	case MEMBARRIER_CMD_SHARED:
-		if (num_online_cpus() > 1)
+		if (num_online_cpus() > 1) {
 			synchronize_sched();
+			membarrier_nohz_full_expedited();
+		}
 		return 0;
+	case MEMBARRIER_CMD_REGISTER_EXPEDITED:
+		return membarrier_register_expedited(current);
+	case MEMBARRIER_CMD_UNREGISTER_EXPEDITED:
+		return membarrier_unregister_expedited(current);
 	default:
 		return -EINVAL;
 	}
-- 
2.1.4

* Re: [RFC PATCH] membarrier: handle nohz_full with expedited thread registration
  2017-01-16 19:51 [RFC PATCH] membarrier: handle nohz_full with expedited thread registration Mathieu Desnoyers
@ 2017-01-16 20:15 ` Linus Torvalds
  2017-01-16 22:56   ` Mathieu Desnoyers
  0 siblings, 1 reply; 11+ messages in thread
From: Linus Torvalds @ 2017-01-16 20:15 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Paul E . McKenney, Linux Kernel Mailing List, Josh Triplett,
	KOSAKI Motohiro, Steven Rostedt, Nicholas Miell, Ingo Molnar,
	Alan Cox, Lai Jiangshan, Stephen Hemminger, Thomas Gleixner,
	Peter Zijlstra, David Howells, Pranith Kumar, Michael Kerrisk,
	Shuah Khan, Andrew Morton

Excuse my french, but this looks like incredible shit to me.

On Mon, Jan 16, 2017 at 11:51 AM, Mathieu Desnoyers
<mathieu.desnoyers@efficios.com> wrote:
> +
> +static int membarrier_register_expedited(struct task_struct *t)
> +{
> +       struct rq *rq;
> +
> +       if (t->membarrier_expedited == UINT_MAX)
> +               return -EOVERFLOW;
> +       rq = this_rq();
> +       raw_spin_lock(&rq->lock);
> +       t->membarrier_expedited++;
> +       raw_spin_unlock(&rq->lock);
> +       return 0;
> +}

Yeah, taking the rq lock with

 (a) a random "rq" that isn't stable

 (b) without disabling interrupts

 (c) using an internal scheduler helper that isn't supposed to be used
externally

 (d) when it doesn't even make any sense in the first place for a
per-thread value that is never modified by any other threads!

 (e) .. and you expose this ABSOLUTELY SHIT as a random system call.

Oh, and the clone semantics make no sense either.

In fact, it just makes me doubt everything about the whole membarrier
concept, because it appears *so* terminally broken.

So unless I'm seriously missing something, this is just about the
worst piece of code I have seen this year.

No.

NO NO NO.

It really smells so broken that I'm wondering if I'm missing anything.
But I don't think I am. I think the code is just pure garbage.

                  Linus

* Re: [RFC PATCH] membarrier: handle nohz_full with expedited thread registration
  2017-01-16 20:15 ` Linus Torvalds
@ 2017-01-16 22:56   ` Mathieu Desnoyers
  2017-01-16 23:50     ` Linus Torvalds
  2017-01-17  3:55     ` Frederic Weisbecker
  0 siblings, 2 replies; 11+ messages in thread
From: Mathieu Desnoyers @ 2017-01-16 22:56 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Paul E. McKenney, linux-kernel, Josh Triplett, KOSAKI Motohiro,
	rostedt, Nicholas Miell, Ingo Molnar, One Thousand Gnomes,
	Lai Jiangshan, Stephen Hemminger, Thomas Gleixner,
	Peter Zijlstra, David Howells, bobby prani, Michael Kerrisk,
	Shuah Khan, Andrew Morton

----- On Jan 16, 2017, at 3:15 PM, Linus Torvalds torvalds@linux-foundation.org wrote:

> Excuse my french, but this looks like incredible shit to me.

I'm currently trying to figure out how we can get membarrier
to play nicely with recent no-hz kernel features. Indeed, my
initial prototype is a mess. The good news is that, based on the number
of flaws you found in this RFC, there is plenty of room for
improvement. :)

> 
> On Mon, Jan 16, 2017 at 11:51 AM, Mathieu Desnoyers
> <mathieu.desnoyers@efficios.com> wrote:
>> +
>> +static int membarrier_register_expedited(struct task_struct *t)
>> +{
>> +       struct rq *rq;
>> +
>> +       if (t->membarrier_expedited == UINT_MAX)
>> +               return -EOVERFLOW;
>> +       rq = this_rq();
>> +       raw_spin_lock(&rq->lock);
>> +       t->membarrier_expedited++;
>> +       raw_spin_unlock(&rq->lock);
>> +       return 0;
>> +}
> 
> Yeah, taking the rq lock with
> 
> (a) a random "rq" that isn't stable
> 
> (b) without disabling interrupts

So for both register and unregister functions, as well as the use in
membarrier_nohz_full_expedited(), disabling interrupts around the rq
lock should fix this. But perhaps it would be wiser trying not to use the
rq lock at all.

> 
> (c) using an internal scheduler helper that isn't supposed to be used
> externally

Yeah, I hate doing that. Hence the TODO comment I've placed near the include:

 * TODO: private sched.h is needed for runqueue. Should we move the
 * sched code under kernel/sched/ ?

I'm open to better ideas.

> 
> (d) when it doesn't even make any sense in the first place for a
> per-thread value that is never modified by any other threads!

The variable "membarrier_expedited" is indeed only modified by the
current thread, but it is read by other threads calling
membarrier_nohz_full_expedited().

For each nohz_full CPU, membarrier_nohz_full_expedited() needs to access
the membarrier_expedited field of that CPU's current thread
(rq->curr->membarrier_expedited). Other than taking the rq lock, is there
a way to guarantee that this load does not read memory that has been
freed and reused after the thread exited?

We also need to be careful about a membarrier registration/unregistration
done concurrently with a membarrier_nohz_full_expedited() on another CPU.

Here is a basic membarrier CMD_SHARED ordering scenario. The CMD_SHARED
membarrier ensures that all memory accesses performed in program order
from each targeted thread are ordered with respect to membarrier().

Initial values:
A = B = 0

   CPU 0                       |   CPU 1
                               |
   x = load A                  |
                               |   store B = 1
   membarrier(CMD_SHARED)      |
                               |   barrier() (compiler-level barrier)
   y = load B                  |
                               |   store A = 1

Expect: if x == 1, then y == 1


Now if we add nohz membarrier_expedited registration and unregistration
to the picture for a nohz full cpu, here is how I envision we could do
it without the rq lock:

Initial values:
A = B = 0

   CPU 0                                  |   CPU 1 (no-hz full)
                                          |
                                          |   membarrier(REGISTER_EXPEDITED)
                                          |     set t->membarrier_exped = 1
                                          |     smp_mb() [3]
                                          |   store B = 1
                                          |   barrier() (compiler-level barrier)
                                          |   store A = 1
   x = load A                             |   
   membarrier(CMD_SHARED)                 |
     smp_mb() [1]                         |
     iter. on nohz cpus                   |
       if iter_t->membarrier_exped == 0   |
          (skip)                          |
       else                               |
          IPI smp_mb() to target cpu      |
                                          |   membarrier(UNREGISTER_EXPEDITED)
                                          |     smp_mb() [4]
                                          |     set t->membarrier_exped = 0
     smp_mb() [2]                         |
   y = load B                             |   

Expect: if x == 1, then y == 1

[1,2,3,4] If we don't have those barriers, CPU 0 could observe the store to A,
but not see the store to B.

[1] Orders user-space loads/stores before membarrier CMD_SHARED before load of
    membarrier_exped fields,
[2] Orders user-space loads/stores after membarrier CMD_SHARED after load of
    membarrier_exped fields,
[3] Orders user-space loads/stores after membarrier REGISTER_EXPEDITED after
    store to t->membarrier_exped,
[4] Orders user-space loads/stores before membarrier UNREGISTER_EXPEDITED before
    store to t->membarrier_exped,

[1] pairs with [3]: [1] orders load A before load membarrier_exped, matching [3]
                    which orders store membarrier_exped = 1 before store A = 1.
[2] pairs with [4]: [2] orders load B after load membarrier_exped, matching [4]
                    which orders store membarrier_exped = 0 after store B = 1.

Using the rq lock was one way to get those barriers, but if we can
use explicit memory barriers instead, it would allow us to do this
without interacting with the scheduler internals.
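
Concretely, the CMD_SHARED side could then look like the following
(sketch only; note that dereferencing rq->curr without holding the rq
lock is exactly the thread-exit lifetime problem raised above):

	smp_mb();	/* [1] */
	for_each_cpu(cpu, tick_nohz_full_mask) {
		struct task_struct *t;

		/*
		 * No lifetime guarantee for t here without the rq lock;
		 * this is the open question above.
		 */
		t = READ_ONCE(cpu_rq(cpu)->curr);
		if (READ_ONCE(t->membarrier_expedited))
			smp_call_function_single(cpu, memory_barrier,
					NULL, 1);
	}
	smp_mb();	/* [2] */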

> 
> (e) .. and you expose this ABSOLUTELY SHIT as a random system call.
> 
> Oh, and the clone semantics make no sense either.

Currently, this patch clears the state on exec and when forking a new thread,
but keeps the thread state when forking a new process, which AFAIU is
in line with current practices. But perhaps not; what am I missing?

> 
> In fact, it just makes me doubt everything about the whole membarrier
> concept, because it appears *so* terminally broken.
> 
> So unless I'm seriously missing something, this is just about the
> worst piece of code I have seen this year.

The good news is we are still in January. ;-)

> 
> No.
> 
> NO NO NO.
> 
> It really smells so broken that I'm wondering if I'm missing anything.
> But I don't think I am. I think the code is just pure garbage.

If there is some other way to synchronize remote CPUs' loads of a thread's
membarrier_expedited field against that thread exiting while we iterate on the
nohz mask, then a refcount-based approach could be used along with explicit
memory barriers, and we could do all this without grabbing the rq lock.

Thanks,

Mathieu


> 
>                   Linus

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

* Re: [RFC PATCH] membarrier: handle nohz_full with expedited thread registration
  2017-01-16 22:56   ` Mathieu Desnoyers
@ 2017-01-16 23:50     ` Linus Torvalds
  2017-01-17  2:09       ` Mathieu Desnoyers
  2017-01-17  3:55     ` Frederic Weisbecker
  1 sibling, 1 reply; 11+ messages in thread
From: Linus Torvalds @ 2017-01-16 23:50 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Paul E. McKenney, linux-kernel, Josh Triplett, KOSAKI Motohiro,
	rostedt, Nicholas Miell, Ingo Molnar, One Thousand Gnomes,
	Lai Jiangshan, Stephen Hemminger, Thomas Gleixner,
	Peter Zijlstra, David Howells, bobby prani, Michael Kerrisk,
	Shuah Khan, Andrew Morton

On Mon, Jan 16, 2017 at 2:56 PM, Mathieu Desnoyers
<mathieu.desnoyers@efficios.com> wrote:
>
> So for both register and unregister functions, as well as the use in
> membarrier_nohz_full_expedited(), disabling interrupts around the rq
> lock should fix this. But perhaps it would be wiser trying not to use the
> rq lock at all.

Definitely.


>> (d) when it doesn't even make any sense in the first place for a
>> per-thread value that is never modified by any other threads!
>
> The variable "membarrier_expedited" is indeed only modified by the
> current thread, but it is read by other threads calling
> membarrier_nohz_full_expedited().

Why not just make the write be a "smp_store_release()", and the read
be a "smp_load_acquire()". That guarantees a certain amount of
ordering. The only amount that I suspect makes sense, in fact.
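
IOW, something like this (totally untested):

	/* registration side */
	smp_store_release(&t->membarrier_expedited, 1);

	/* reader side, in membarrier_nohz_full_expedited() */
	if (smp_load_acquire(&t->membarrier_expedited))
		/* ... send the IPI ... */;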

But it's not clear what the problem is, so..

>> Oh, and the clone semantics make no sense either.
>
> Currently, this patch clears the state on exec and when forking a new thread,
> but keeps the thread state when forking a new process, which AFAIU is
> in line with current practices. But perhaps not; what am I missing?

I'm not seeing how it could possibly ever make sense for a regular
fork() to keep the membarrier state in the newly forked process. Not that
"fork()" is really well-defined from within a single thread anyway (it
actually is as far as Linux is concerned, but not in POSIX, afaik).

So if there is no major reason for it, I would strongly suggest that
_if_ all this makes sense in the first place, the membarrier thing
should just be cleared unconditionally both for exec and for
clone/fork.

              Linus

* Re: [RFC PATCH] membarrier: handle nohz_full with expedited thread registration
  2017-01-16 23:50     ` Linus Torvalds
@ 2017-01-17  2:09       ` Mathieu Desnoyers
  0 siblings, 0 replies; 11+ messages in thread
From: Mathieu Desnoyers @ 2017-01-17  2:09 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Paul E. McKenney, linux-kernel, Josh Triplett, KOSAKI Motohiro,
	rostedt, Nicholas Miell, Ingo Molnar, One Thousand Gnomes,
	Lai Jiangshan, Stephen Hemminger, Thomas Gleixner,
	Peter Zijlstra, David Howells, bobby prani, Michael Kerrisk,
	Shuah Khan, Andrew Morton

----- On Jan 16, 2017, at 6:50 PM, Linus Torvalds torvalds@linux-foundation.org wrote:

> Why not just make the write be a "smp_store_release()", and the read
> be a "smp_load_acquire()". That guarantees a certain amount of
> ordering. The only amount that I suspect makes sense, in fact.
> 
> But it's not clear what the problem is, so..

If we only use a smp_store_release() for the store to membarrier_exped,
the "unregister" (setting back to 0) would be OK, but not the "register",
as the following scenario shows:

Initial values:
A = B = 0

   CPU 0                                  |   CPU 1 (no-hz full)
                                          |
                                          |   membarrier(REGISTER_EXPEDITED)
                                          |     (write barrier implied by store-release)
                                          |     set t->membarrier_exped = 1 (store-release implies a memory barrier before the store)
                                          |   store B = 1
                                          |   barrier() (compiler-level barrier)
                                          |   store A = 1
   x = load A                             |  
   membarrier(CMD_SHARED)                 |
     smp_mb() [1]                         |
     iter. on nohz cpus                   |
       if iter_t->membarrier_exped == 0   |
          (skip)                          |
     smp_mb() [2]                         |
   y = load B                             |  

Expect: if x == 1, then y == 1

CPU 0 can observe A == 1, membarrier_exped == 0, and B == 0,
because there is no memory barrier between the store to
membarrier_exped and the store to A on CPU 1.

What we seem to need on the registration/unregistration side
is a store-acquire for registration and a store-release for
unregistration. This pairs with a load of membarrier_exped
that has both acquire and release barriers ([1] and [2] above).
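
Concretely, the register/unregister side would become something like
this (sketch, not even compile-tested; a real patch would want
READ_ONCE()/WRITE_ONCE() on the field as well):

	static int membarrier_register_expedited(struct task_struct *t)
	{
		if (t->membarrier_expedited == UINT_MAX)
			return -EOVERFLOW;
		t->membarrier_expedited++;
		smp_mb();	/* [3]: order the store above before
				 * following user-space accesses. */
		return 0;
	}

	static int membarrier_unregister_expedited(struct task_struct *t)
	{
		if (!t->membarrier_expedited)
			return -ENOENT;
		smp_mb();	/* [4]: order prior user-space accesses
				 * before the store below. */
		t->membarrier_expedited--;
		return 0;
	}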

> I'm not seeing how it could possibly ever make sense for a regular
> fork() to keep the membarrier state in the newly forked process. Not that
> "fork()" is really well-defined from within a single thread anyway (it
> actually is as far as Linux is concerned, but not in POSIX, afaik).
> 
> So if there is no major reason for it, I would strongly suggest that
> _if_ all this makes sense in the first place, the membarrier thing
> should just be cleared unconditionally both for exec and for
> clone/fork.

That's fine with me!

Thanks,

Mathieu

> 
>               Linus

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

* Re: [RFC PATCH] membarrier: handle nohz_full with expedited thread registration
  2017-01-16 22:56   ` Mathieu Desnoyers
  2017-01-16 23:50     ` Linus Torvalds
@ 2017-01-17  3:55     ` Frederic Weisbecker
  2017-01-17 20:53       ` Paul E. McKenney
  2017-01-17 21:56       ` Mathieu Desnoyers
  1 sibling, 2 replies; 11+ messages in thread
From: Frederic Weisbecker @ 2017-01-17  3:55 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Linus Torvalds, Paul E. McKenney, linux-kernel, Josh Triplett,
	KOSAKI Motohiro, rostedt, Nicholas Miell, Ingo Molnar,
	One Thousand Gnomes, Lai Jiangshan, Stephen Hemminger,
	Thomas Gleixner, Peter Zijlstra, David Howells, bobby prani,
	Michael Kerrisk, Shuah Khan, Andrew Morton

On Mon, Jan 16, 2017 at 10:56:07PM +0000, Mathieu Desnoyers wrote:
> ----- On Jan 16, 2017, at 3:15 PM, Linus Torvalds torvalds@linux-foundation.org wrote:
> 
> > Excuse my french, but this looks like incredible shit to me.
> 
> I'm currently trying to figure out how we can get membarrier
> to play nicely with recent no-hz kernel features. Indeed, my
> initial prototype is a mess. The good news is that, based on the number
> of flaws you found in this RFC, there is plenty of room for
> improvement. :)
> 
> > 
> > On Mon, Jan 16, 2017 at 11:51 AM, Mathieu Desnoyers
> > <mathieu.desnoyers@efficios.com> wrote:
> >> +
> >> +static int membarrier_register_expedited(struct task_struct *t)
> >> +{
> >> +       struct rq *rq;
> >> +
> >> +       if (t->membarrier_expedited == UINT_MAX)
> >> +               return -EOVERFLOW;
> >> +       rq = this_rq();
> >> +       raw_spin_lock(&rq->lock);
> >> +       t->membarrier_expedited++;
> >> +       raw_spin_unlock(&rq->lock);
> >> +       return 0;
> >> +}
> > 
> > Yeah, taking the rq lock with
> > 
> > (a) a random "rq" that isn't stable
> > 
> > (b) without disabling interrupts
> 
> So for both register and unregister functions, as well as the use in
> membarrier_nohz_full_expedited(), disabling interrupts around the rq
> lock should fix this. But perhaps it would be wiser trying not to use the
> rq lock at all.
> 
> > 
> > (c) using an internal scheduler helper that isn't supposed to be used
> > externally
> 
> Yeah, I hate doing that. Hence the TODO comment I've placed near the include:
> 
>  * TODO: private sched.h is needed for runqueue. Should we move the
>  * sched code under kernel/sched/ ?
> 
> I'm open to better ideas.

I haven't thought about the problem very deeply yet, but at a quick glance, it
seems to me that if we want to make sure we spare nohz_full CPUs the IRQ unless
they run a task that carries the membarrier_expedited flag, then I fear we have
to take the target rq lock.

So either we do that, and the rq-related code should move to kernel/sched/, or
we think of some alternative.

Here is one that is not very elegant but avoids the rq lock from the membarrier
request side.

We could do things the other way around: when the nohz_full task does
membarrier_register_expedited(), it increments a counter on all CPUs it is
affine to and sends a global IPI (or rather, an IPI to the CPUs outside the
nohz_full range plus those with a positive counter) so that all CPUs see the
update immediately. Then when a membarrier request happens somewhere, we know
which CPUs we are allowed to send the IPI to. We can maintain a cpumask on top
of those per-CPU counters.
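
Roughly something like this (untested, all names invented for the
sketch):

	static DEFINE_PER_CPU(atomic_t, membarrier_exped_count);

	static int membarrier_register_expedited(struct task_struct *t)
	{
		int cpu;

		/* Poke every CPU the task may ever run on. */
		for_each_cpu(cpu, &t->cpus_allowed)
			atomic_inc(&per_cpu(membarrier_exped_count, cpu));
		/* Global IPI so everyone sees the counters right away. */
		smp_call_function(memory_barrier, NULL, 1);
		return 0;
	}

	/* membarrier request side, no rq lock needed: */
	for_each_cpu(cpu, tick_nohz_full_mask)
		if (atomic_read(&per_cpu(membarrier_exped_count, cpu)))
			smp_call_function_single(cpu, memory_barrier,
					NULL, 1);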

Now that's a bit of a brute-force method, because we blindly poke a range of
CPUs where a given task might run, but it avoids the rq lock on the membarrier
request side.

That idea could be greatly simplified if membarrier_register_expedited() were
called for a CPU instead of a task. Also, we wouldn't have to bother with
synchronization against concurrent task affinity changes (which might otherwise
require the rq lock from the expedited register path).

But then we may as well exit the CPU's nohz_full state for the timeframe where
we need to care about membarrier.

In fact, due to the complexity involved, I have to ask first if we really need this feature.
Typically nohz_full workloads don't want to be disturbed at all, so do we have real
and significant use cases of CPU isolation workloads that care enough about membarrier
to tolerate some random IRQ?

Thanks.

* Re: [RFC PATCH] membarrier: handle nohz_full with expedited thread registration
  2017-01-17  3:55     ` Frederic Weisbecker
@ 2017-01-17 20:53       ` Paul E. McKenney
  2017-01-18 11:00         ` Peter Zijlstra
  2017-01-17 21:56       ` Mathieu Desnoyers
  1 sibling, 1 reply; 11+ messages in thread
From: Paul E. McKenney @ 2017-01-17 20:53 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Mathieu Desnoyers, Linus Torvalds, linux-kernel, Josh Triplett,
	KOSAKI Motohiro, rostedt, Nicholas Miell, Ingo Molnar,
	One Thousand Gnomes, Lai Jiangshan, Stephen Hemminger,
	Thomas Gleixner, Peter Zijlstra, David Howells, bobby prani,
	Michael Kerrisk, Shuah Khan, Andrew Morton

On Tue, Jan 17, 2017 at 04:55:22AM +0100, Frederic Weisbecker wrote:

[ . . . ]

> In fact, due to the complexity involved, I have to ask first if we
> really need this feature.  Typically nohz_full workloads don't want to
> be disturbed at all, so do we have real and significant use cases of
> CPU isolation workloads that care enough about membarrier to tolerate
> some random IRQ?

I believe that we need to explore the options for implementing it and
to -at- -least- have a patch ready, even if that patch doesn't go
upstream immediately.

							Thanx, Paul

* Re: [RFC PATCH] membarrier: handle nohz_full with expedited thread registration
  2017-01-17  3:55     ` Frederic Weisbecker
  2017-01-17 20:53       ` Paul E. McKenney
@ 2017-01-17 21:56       ` Mathieu Desnoyers
  2017-01-17 23:29         ` Steven Rostedt
  1 sibling, 1 reply; 11+ messages in thread
From: Mathieu Desnoyers @ 2017-01-17 21:56 UTC (permalink / raw)
  To: fweisbec
  Cc: Linus Torvalds, Paul E. McKenney, linux-kernel, Josh Triplett,
	KOSAKI Motohiro, rostedt, Nicholas Miell, Ingo Molnar,
	One Thousand Gnomes, Lai Jiangshan, Stephen Hemminger,
	Thomas Gleixner, Peter Zijlstra, David Howells, bobby prani,
	Michael Kerrisk, Shuah Khan, Andrew Morton

----- On Jan 16, 2017, at 10:55 PM, fweisbec fweisbec@gmail.com wrote:

> On Mon, Jan 16, 2017 at 10:56:07PM +0000, Mathieu Desnoyers wrote:
>> ----- On Jan 16, 2017, at 3:15 PM, Linus Torvalds torvalds@linux-foundation.org
>> wrote:
>> 
>> > Excuse my french, but this looks like incredible shit to me.
>> 
>> I'm currently trying to figure out how we can get membarrier
>> to play nicely with recent no-hz kernel features. Indeed, my
>> initial prototype is a mess. The good news is that, based on the number
>> of flaws you found in this RFC, there is plenty of room for
>> improvement. :)
>> 
>> > 
>> > On Mon, Jan 16, 2017 at 11:51 AM, Mathieu Desnoyers
>> > <mathieu.desnoyers@efficios.com> wrote:
>> >> +
>> >> +static int membarrier_register_expedited(struct task_struct *t)
>> >> +{
>> >> +       struct rq *rq;
>> >> +
>> >> +       if (t->membarrier_expedited == UINT_MAX)
>> >> +               return -EOVERFLOW;
>> >> +       rq = this_rq();
>> >> +       raw_spin_lock(&rq->lock);
>> >> +       t->membarrier_expedited++;
>> >> +       raw_spin_unlock(&rq->lock);
>> >> +       return 0;
>> >> +}
>> > 
>> > Yeah, taking the rq lock with
>> > 
>> > (a) a random "rq" that isn't stable
>> > 
>> > (b) without disabling interrupts
>> 
>> So for both register and unregister functions, as well as the use in
>> membarrier_nohz_full_expedited(), disabling interrupts around the rq
>> lock should fix this. But perhaps it would be wiser trying not to use the
>> rq lock at all.
>> 
>> > 
>> > (c) using an internal scheduler helper that isn't supposed to be used
>> > externally
>> 
>> Yeah, I hate doing that. Hence the TODO comment I've placed near the include:
>> 
>>  * TODO: private sched.h is needed for runqueue. Should we move the
>>  * sched code under kernel/sched/ ?
>> 
>> I'm open to better ideas.
> 
> I haven't thought about the problem very deeply yet, but at a quick glance, it
> seems to me that if we want to make sure we spare nohz_full CPUs the IRQ unless
> they run a task that carries the membarrier_expedited flag, then I fear we have
> to take the target rq lock.
> 
> So either we do that, and the rq-related code should move to kernel/sched/, or
> we think of some alternative.

One possibility is to use the memory barrier (acquire-release) scheme discussed
with Linus to synchronize updates to the t->membarrier_expedited flag, reads of
that flag, and the memory accesses performed in user-space. We can then remove
the rq lock from membarrier register/unregister, but, as you point out, this
would not be enough, because it does not take into account the scheduler
activity involved in iterating on the nohz cpumask, getting the current task,
checking the flag and sending IPIs.

One goal here is to ensure we don't disturb threads that do not wish to receive
those IPIs, so holding the rq lock while we send the IPI still seems to be safer.

So we may still need to take the rq lock in membarrier_nohz_full_expedited(). Arguably,
this function could be moved to kernel/sched/.

> 
> Here is one that is not very elegant but avoids the rq lock from the membarrier
> request side.
> 
> We could do things the other way around: when the nohz_full task does
> membarrier_register_expedited(), it increments a counter on all CPUs it is
> affine to and sends a global IPI (or rather, an IPI to the CPUs outside the
> nohz_full range plus those with a positive counter) so that all CPUs see the
> update immediately. Then when a membarrier request happens somewhere, we know
> which CPUs we are allowed to send the IPI to. We can maintain a cpumask on top
> of those per-CPU counters.

I am not sure how this would ensure, without holding the rq lock from the
membarrier request side, that we don't disrupt threads that do not wish
to receive the IPI.

> 
> Now that's a bit of a brute-force method, because we blindly poke a range of
> CPUs where a given task might run, but it avoids the rq lock on the membarrier
> request side.
> 
> That idea could be greatly simplified if membarrier_register_expedited() were
> called for a CPU instead of a task. Also, we wouldn't have to bother with
> synchronization against concurrent task affinity changes (which might otherwise
> require the rq lock from the expedited register path).

I think it's important to limit side-effects as much as we can, and not interfere with
other threads that do not wish to receive those IPIs.

> 
> But then we may as well exit the CPU's nohz_full state for the timeframe where
> we need to care about membarrier.
> 
> In fact, due to the complexity involved, I have to ask first if we really need this feature.
> Typically nohz_full workloads don't want to be disturbed at all, so do we have real
> and significant use cases of CPU isolation workloads that care enough about membarrier
> to tolerate some random IRQ?

One use-case I care about is user-space tracing, which gets a good speedup from
the membarrier scheme. If we want to trace a thread that runs on a nohz_full CPU,
we either need to have the entire process use memory barriers on the fast path,
or specialize the code run by the nohz_full threads, or send an IPI from
membarrier() in some way. More generally, I suspect other liburcu users that
would benefit from combined use with nohz_full may be ready to accept a targeted
IPI in order to turn memory barriers into compiler barriers on their fast path
(this would have to be confirmed). I don't know if there are garbage collector
users out there who would benefit from combining membarrier and nohz_full.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

* Re: [RFC PATCH] membarrier: handle nohz_full with expedited thread registration
  2017-01-17 21:56       ` Mathieu Desnoyers
@ 2017-01-17 23:29         ` Steven Rostedt
  0 siblings, 0 replies; 11+ messages in thread
From: Steven Rostedt @ 2017-01-17 23:29 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: fweisbec, Linus Torvalds, Paul E. McKenney, linux-kernel,
	Josh Triplett, KOSAKI Motohiro, Nicholas Miell, Ingo Molnar,
	One Thousand Gnomes, Lai Jiangshan, Stephen Hemminger,
	Thomas Gleixner, Peter Zijlstra, David Howells, bobby prani,
	Michael Kerrisk, Shuah Khan, Andrew Morton

On Tue, 17 Jan 2017 21:56:38 +0000 (UTC)
Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:

> One goal here is to ensure we don't disturb threads that do not wish to receive
> those IPIs, so holding the rq lock while we send the IPI still seems to be safer.

What type of IPI is being sent? Be careful because IPIs can take the rq
lock, and if you are expecting the IPI to finish while holding the rq
lock you may introduce a deadlock.

-- Steve

* Re: [RFC PATCH] membarrier: handle nohz_full with expedited thread registration
  2017-01-17 20:53       ` Paul E. McKenney
@ 2017-01-18 11:00         ` Peter Zijlstra
  2017-01-19 22:01           ` Paul McKenney
  0 siblings, 1 reply; 11+ messages in thread
From: Peter Zijlstra @ 2017-01-18 11:00 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Frederic Weisbecker, Mathieu Desnoyers, Linus Torvalds,
	linux-kernel, Josh Triplett, KOSAKI Motohiro, rostedt,
	Nicholas Miell, Ingo Molnar, One Thousand Gnomes, Lai Jiangshan,
	Stephen Hemminger, Thomas Gleixner, David Howells, bobby prani,
	Michael Kerrisk, Shuah Khan, Andrew Morton

On Tue, Jan 17, 2017 at 12:53:21PM -0800, Paul E. McKenney wrote:
> On Tue, Jan 17, 2017 at 04:55:22AM +0100, Frederic Weisbecker wrote:
> 
> [ . . . ]
> 
> > In fact, due to the complexity involved, I have to ask first if we
> > really need this feature.  Typically nohz_full workloads don't want to
> > be disturbed at all, so do we have real and significant use cases of
> > CPU isolation workloads that care enough about membarrier to tolerate
> > some random IRQ?
> 
> I believe that we need to explore the options for implementing it and
> to -at- -least- have a patch ready, even if that patch doesn't go
> upstream immediately.

I tend to agree with Frederic here in that the design requirements seem
mutually exclusive.

NOHZ_FULL users do _not_ want interruptions of any sort, in fact some
want to make that a hard fail of the task.

OTOH sys_membarrier(CMD_SHARED) promises to serialize against anything
observable.

The only logical solution is to error the sys_membarrier(CMD_SHARED)
call when a NOHZ_FULL task shares memory with the caller. Now
determining this is somewhat tricky of course :/


I really don't see how there is another possible solution that makes
sense here. If there is shared memory between a NOHZ_FULL task and
others, a urcu implementation used by those must not rely on
sys_membarrier() but instead use a more expensive one, for instance one
where rcu_read_{,un}lock() do explicit counting and have memory barriers
in.
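
That is, a read side along these lines (sketch only; roughly what
liburcu's "mb" flavour already does):

	static __thread volatile unsigned long urcu_reader_nesting;

	static inline void rcu_read_lock_mb(void)
	{
		urcu_reader_nesting++;	/* explicit counting */
		__atomic_thread_fence(__ATOMIC_SEQ_CST); /* smp_mb() */
	}

	static inline void rcu_read_unlock_mb(void)
	{
		__atomic_thread_fence(__ATOMIC_SEQ_CST); /* smp_mb() */
		urcu_reader_nesting--;
	}

The grace-period side then polls every thread's nesting count between
full barriers, instead of relying on sys_membarrier().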

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC PATCH] membarrier: handle nohz_full with expedited thread registration
  2017-01-18 11:00         ` Peter Zijlstra
@ 2017-01-19 22:01           ` Paul McKenney
  0 siblings, 0 replies; 11+ messages in thread
From: Paul McKenney @ 2017-01-19 22:01 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Paul E. McKenney, Frederic Weisbecker, Mathieu Desnoyers,
	Linus Torvalds, linux-kernel, Josh Triplett, KOSAKI Motohiro,
	rostedt, Nicholas Miell, Ingo Molnar, One Thousand Gnomes,
	Lai Jiangshan, Stephen Hemminger, Thomas Gleixner, David Howells,
	bobby prani, Michael Kerrisk, Shuah Khan, Andrew Morton

On Wed, Jan 18, 2017 at 3:00 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, Jan 17, 2017 at 12:53:21PM -0800, Paul E. McKenney wrote:
>> On Tue, Jan 17, 2017 at 04:55:22AM +0100, Frederic Weisbecker wrote:
>>
>> [ . . . ]
>>
>> > In fact, due to the complexity involved, I have to ask first if we
>> > really need this feature.  Typically nohz_full workloads don't want to
>> > be disturbed at all, so do we have real and significant use cases of
>> > CPU isolation workloads that care enough about membarrier to tolerate
>> > some random IRQ?
>>
>> I believe that we need to explore the options for implementing it and
>> to -at- -least- have a patch ready, even if that patch doesn't go
>> upstream immediately.
>
> I tend to agree with Frederic here in that the design requirements seem
> mutually exclusive.
>
> NOHZ_FULL users do _not_ want interruptions of any sort, in fact some
> want to make that a hard fail of the task.
>
> OTOH sys_membarrier(CMD_SHARED) promises to serialize against anything
> observable.
>
> The only logical solution is to error the sys_membarrier(CMD_SHARED)
> call when a NOHZ_FULL task shares memory with the caller. Now
> determining this is somewhat tricky of course :/
>
>
> I really don't see how there is another possible solution that makes
> sense here. If there is shared memory between a NOHZ_FULL task and
> others, a urcu implementation used by those must not rely on
> sys_membarrier() but instead use a more expensive one, for instance one
> where rcu_read_{,un}lock() do explicit counting and have memory barriers
> in.

Actually, agreed.  After all, if we knew of a solution that made
sense, we would have already implemented it, so some out-of-tree
experimentation makes sense.  One possibility is a flag that user
processes could set which aborts (or splats, or whatever) on any
interruption; tasks using signal-based URCU would avoid setting that
flag, and, as you say, tasks that set the flag and use URCU would need
to use one of the memory-barrier variants.

                                                       Thanx, Paul
