[RFC PATCH] sched/wait: Introduce new, more compact wait_event*() primitives

From: Ingo Molnar <mingo@kernel.org>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	Peter Zijlstra <a.p.zijlstra@chello.nl>,
	Thomas Gleixner <tglx@linutronix.de>,
	Andrew Morton <akpm@linux-foundation.org>
Subject: [RFC PATCH] sched/wait: Introduce new, more compact wait_event*() primitives
Date: Wed, 8 Mar 2017 09:37:20 +0100	[thread overview]
Message-ID: <20170308083719.GA3251@gmail.com> (raw)
In-Reply-To: <CA+55aFz4CKT-yCUfQT6bDC9OyMUw4SLAoatxiVAizD0JzKtqxA@mail.gmail.com>


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Fri, Mar 3, 2017 at 12:13 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > The fact that you can include <linux/wait.h>, and then cannot use the
> > wait event functions because you're missing "signal_pending()" is
> > complete garbage. This needs to be fixed.
> 
> Here's a (totally untested) patch that tries to do exactly that.

Sorry, I didn't forget about this problem, I worked on it over the weekend, it 
just turned out to be a lot more complex than I expected.

I took a more radical approach than you, and here's the patches I have so far:

0c32c5717beb sched/wait: Introduce new, more compact wait_event*() primitives
67b01ca3fda2 sched/wait: Disambiguate wq_entry->task_list and wq_head->task_list naming
48af925509bc sched/wait: Move bit_wait_table[] and related functionality from sched/core.c to sched/wait_bit.c
bc151d1fd3e9 sched/wait: Split out the wait_bit*() APIs from <linux/wait.h> into <linux/wait_bit.h>
b2d294b5824c sched/wait: Re-adjust macro line continuation backslashes in <linux/wait.h>
e9bbf53778c2 sched/wait: Improve the bit-wait API parameter names in the API function prototypes
6a6899db8e5a sched/wait: Standardize wait_bit_queue naming
8e060b6e033c sched/wait: Standardize 'struct wait_bit_queue' wait-queue entry field name
139793f6ac6f sched/wait: Standardize internal naming of wait-queue heads
bb393b1a7e11 sched/wait: Standardize internal naming of wait-queue entries
28376289373c sched/wait: Rename wait_queue_t => wait_queue_entry_t

You can find the latest (WIP, frequently rebased) version in:

  git git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git WIP.sched/core

The first 10 patches reshape the waitqueue code to be more hackable (to me!), 
because I kept bumping into confusing names all the time, but the real change 
you'd be most interested in is, in an RFC (to be rebased!) form, the following 
one:

  0c32c5717beb sched/wait: Introduce new, more compact wait_event*() primitives

I've attached that patch below as well.

The idea is to allow call sites to supply the 'condition' function as free-form C 
code, while pushing everything else into non-macro form: there's a 'struct 
wait_event_state' on stack, and a state machine. The waiting logic is converted 
from procedural form to a state machine, because we have to call out into the 
'condition' code in different circumstances.

It's a pretty weird construct, but it works I think, and has quite a few 
advantages (and a few disadvantages ...).

There's a lot of caveats with this RFC patch:

 - Not signed off, I'm not 100% sure it's fully correct yet, but superficial 
   testing suggests that it appears to boot - but PeterZ saw a boot hang on a
   server machine ...

 - NOTE: the v1/v2 structure is just a hack to make it easier to test - the final
   version wouldn't have such a construct.

 - Peter doesn't like the bool's in the struct - we could use char instead.

 - It needs a proper state machine diagram and more documentation - this is just a
   very quick prototype to see whether the concept has any chance of working.

The main advantage is the much reduced call site overhead, plus the fact that we 
move much of the logic into the .c space.

Right now event_wait() [which is in fact the smallest wait-event primitive!] 
generates this humunguous amount of inlined code on x86-64 defconfig:

	long test_flag;

	DECLARE_WAIT_QUEUE_HEAD(test_waitqueue_head);

	void test_function(void)
	{
	        wait_event(test_waitqueue_head, test_flag != 0);
	}

00000000000084e0 <test_function>:
    84e0:	55                   	push   %rbp
    84e1:	48 89 e5             	mov    %rsp,%rbp
    84e4:	48 83 ec 28          	sub    $0x28,%rsp

    84e8:	65 8b 05 00 00 00 00 	mov    %gs:0x0(%rip),%eax        # 84ef <test_function+0xf>
    84ef:	85 c0                	test   %eax,%eax
    84f1:	74 4f                	je     8542 <test_function+0x62>
    84f3:	48 83 3d 00 00 00 00 	cmpq   $0x0,0x0(%rip)        # 84fb <test_function+0x1b>
    84fa:	00 
    84fb:	74 02                	je     84ff <test_function+0x1f>

    84fd:	c9                   	leaveq 
    84fe:	c3                   	retq   

    84ff:	48 8d 7d d8          	lea    -0x28(%rbp),%rdi
    8503:	31 f6                	xor    %esi,%esi
    8505:	e8 00 00 00 00       	callq  850a <test_function+0x2a>
    850a:	eb 05                	jmp    8511 <test_function+0x31>
    850c:	e8 00 00 00 00       	callq  8511 <test_function+0x31>
    8511:	48 8d 75 d8          	lea    -0x28(%rbp),%rsi
    8515:	ba 02 00 00 00       	mov    $0x2,%edx
    851a:	48 c7 c7 00 00 00 00 	mov    $0x0,%rdi
    8521:	e8 00 00 00 00       	callq  8526 <test_function+0x46>
    8526:	48 83 3d 00 00 00 00 	cmpq   $0x0,0x0(%rip)        # 852e <test_function+0x4e>
    852d:	00 
    852e:	74 dc                	je     850c <test_function+0x2c>
    8530:	48 8d 75 d8          	lea    -0x28(%rbp),%rsi
    8534:	48 c7 c7 00 00 00 00 	mov    $0x0,%rdi
    853b:	e8 00 00 00 00       	callq  8540 <test_function+0x60>
    8540:	c9                   	leaveq 
    8541:	c3                   	retq   
    8542:	e8 00 00 00 00       	callq  8547 <test_function+0x67>
    8547:	eb aa                	jmp    84f3 <test_function+0x13>

That's 5 inlined function calls (!).

Note that it's even worse with some of the more complex wait_event*() constructs, 
for example the popular wait_event_interruptible_timeout() API generates this much 
code per call site:

00000000000084d0 <test_function_timeout>:
    84d0:	55                   	push   %rbp
    84d1:	48 89 e5             	mov    %rsp,%rbp
    84d4:	53                   	push   %rbx
    84d5:	48 83 ec 28          	sub    $0x28,%rsp
    84d9:	65 8b 05 00 00 00 00 	mov    %gs:0x0(%rip),%eax        # 84e0 <test_function_timeout+0x10>
    84e0:	85 c0                	test   %eax,%eax
    84e2:	0f 84 a0 00 00 00    	je     8588 <test_function_timeout+0xb8>
    84e8:	48 83 3d 00 00 00 00 	cmpq   $0x0,0x0(%rip)        # 84f0 <test_function_timeout+0x20>
    84ef:	00 
    84f0:	74 07                	je     84f9 <test_function_timeout+0x29>
    84f2:	48 83 c4 28          	add    $0x28,%rsp
    84f6:	5b                   	pop    %rbx
    84f7:	5d                   	pop    %rbp
    84f8:	c3                   	retq   
    84f9:	48 8d 7d d0          	lea    -0x30(%rbp),%rdi
    84fd:	31 f6                	xor    %esi,%esi
    84ff:	bb 10 27 00 00       	mov    $0x2710,%ebx
    8504:	e8 00 00 00 00       	callq  8509 <test_function_timeout+0x39>
    8509:	48 8d 75 d0          	lea    -0x30(%rbp),%rsi
    850d:	ba 01 00 00 00       	mov    $0x1,%edx
    8512:	48 c7 c7 00 00 00 00 	mov    $0x0,%rdi
    8519:	e8 00 00 00 00       	callq  851e <test_function_timeout+0x4e>
    851e:	48 83 3d 00 00 00 00 	cmpq   $0x0,0x0(%rip)        # 8526 <test_function_timeout+0x56>
    8525:	00 
    8526:	0f 95 c2             	setne  %dl
    8529:	31 c9                	xor    %ecx,%ecx
    852b:	84 d2                	test   %dl,%dl
    852d:	75 42                	jne    8571 <test_function_timeout+0xa1>
    852f:	84 c9                	test   %cl,%cl
    8531:	75 3e                	jne    8571 <test_function_timeout+0xa1>
    8533:	48 85 c0             	test   %rax,%rax
    8536:	75 ba                	jne    84f2 <test_function_timeout+0x22>
    8538:	48 89 df             	mov    %rbx,%rdi
    853b:	e8 00 00 00 00       	callq  8540 <test_function_timeout+0x70>
    8540:	48 8d 75 d0          	lea    -0x30(%rbp),%rsi
    8544:	ba 01 00 00 00       	mov    $0x1,%edx
    8549:	48 c7 c7 00 00 00 00 	mov    $0x0,%rdi
    8550:	48 89 c3             	mov    %rax,%rbx
    8553:	e8 00 00 00 00       	callq  8558 <test_function_timeout+0x88>
    8558:	48 83 3d 00 00 00 00 	cmpq   $0x0,0x0(%rip)        # 8560 <test_function_timeout+0x90>
    855f:	00 
    8560:	0f 95 c2             	setne  %dl
    8563:	48 85 db             	test   %rbx,%rbx
    8566:	0f 94 c1             	sete   %cl
    8569:	84 d2                	test   %dl,%dl
    856b:	74 c2                	je     852f <test_function_timeout+0x5f>
    856d:	84 c9                	test   %cl,%cl
    856f:	74 ba                	je     852b <test_function_timeout+0x5b>
    8571:	48 8d 75 d0          	lea    -0x30(%rbp),%rsi
    8575:	48 c7 c7 00 00 00 00 	mov    $0x0,%rdi
    857c:	e8 00 00 00 00       	callq  8581 <test_function_timeout+0xb1>
    8581:	48 83 c4 28          	add    $0x28,%rsp
    8585:	5b                   	pop    %rbx
    8586:	5d                   	pop    %rbp
    8587:	c3                   	retq   
    8588:	e8 00 00 00 00       	callq  858d <test_function_timeout+0xbd>
    858d:	e9 56 ff ff ff       	jmpq   84e8 <test_function_timeout+0x18>

Which is about 50 instructions (!!!), and about 180 bytes of text overheaed.

We have over 1,900 wait_event() call sites:

  triton:~/tip> git grep -E 'wait_event.*\(' | wc -l
  1972

which means that with an average per call site overhead of 100 bytes code, this 
means a total text bloat of about ~190 KB...

This was pretty much hidden from common forms of bloat and overhead analysis so 
far, because it's all inlined at the CPP level and hidden 'inside' various normal 
functions.

With my patch it's just a loop with a single function call, and much lower call 
site impact:

0000000000008490 <test_function>:
    8490:	55                   	push   %rbp
    8491:	48 89 e5             	mov    %rsp,%rbp
    8494:	48 83 ec 38          	sub    $0x38,%rsp

    8498:	c6 45 c8 00          	movb   $0x0,-0x38(%rbp)
    849c:	31 d2                	xor    %edx,%edx
    849e:	48 83 3d 00 00 00 00 	cmpq   $0x0,0x0(%rip)        # 84a6 <test_function+0x16>
    84a5:	00 
    84a6:	48 8d 75 c8          	lea    -0x38(%rbp),%rsi
    84aa:	48 c7 c7 00 00 00 00 	mov    $0x0,%rdi
    84b1:	0f 95 c2             	setne  %dl
    84b4:	e8 00 00 00 00       	callq  84b9 <test_function+0x29>
    84b9:	80 7d ca 00          	cmpb   $0x0,-0x36(%rbp)
    84bd:	74 dd                	je     849c <test_function+0xc>

    84bf:	c9                   	leaveq 
    84c0:	c3                   	retq   

Which is about 9 instructions.

So with the wait-event state machine it's a very sweet, compact construct with 
only a single function call - 9 instructions total, even with more complex API 
variants. The original version generates over 20-50 instructions depending on 
complexity.

Also note that if we uninline much of the iterator with my approach, we could more 
aggressively inline _that_ central state machine function and get rid of the many 
function calls it does internally.

I.e. it's not just debloating but an all around speedup for the actual-waiting 
case, which does happen frequently. For the doesn't-wait case there's a bit more 
overhead due to the extra function call.

But, the never ending quest of an uncaring universe makes our life more difficult, 
because there are disadvantages of a state machine as well:

 - State machines are so hellishly difficult to read, to get right and to 
   maintain.

 - While most wait_event() constructs actually do end up schedule()ing for real, 
   but still the 'wait condition is already true' case gets burdened with an extra 
   function call, because we have to call into wait.c. We could inline that first 
   check, at the cost of more inlining overhead.

OTOH with this method we'd only have to get the state machine right roughly once, 
and have all the API variants in the wait.c space. We could use proper function 
pointers with inlining and other meta coding constructs within wait.c to reduce 
the number of variants in the source code space without adding runtime overhead. 
In fact I think we could optimize and form these functions a lot better within 
wait.c than in the macro space.

In any case, it's clear that this stuff is in no way v4.11 material, so as a 
bridging fix I propose we add a sched/signal.h include to wait.h (or just move 
signal_pending() temporarily), until it's all resolved for real for v4.12.

Thoughts?

Thanks,

	Ingo

=================>
>From 0c32c5717beb4a5895cd80b3941246867afe1004 Mon Sep 17 00:00:00 2001
From: Ingo Molnar <mingo@kernel.org>
Date: Sun, 5 Mar 2017 14:28:09 +0100
Subject: [PATCH] sched/wait: Introduce new, more compact wait_event*() primitives

Turn the wait_event() interface into a state machine.

Only very lightly tested, but should demonstrate the principle.

Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
NOT-Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/wait.h | 29 +++++++++++++++++++++++++-
 kernel/sched/wait.c  | 58 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 86 insertions(+), 1 deletion(-)

diff --git a/include/linux/wait.h b/include/linux/wait.h
index ead731ef5632..285f282c928e 100644
--- a/include/linux/wait.h
+++ b/include/linux/wait.h
@@ -225,6 +225,31 @@ void __wake_up_sync(struct wait_queue_head *wq_head, unsigned int mode, int nr);
 
 extern void init_wait_entry(struct wait_queue_entry *wq_entry, int flags);
 
+struct wait_event_state {
+	bool				queued;
+	bool				prepared;
+	bool				done;
+
+	long				ret;
+	struct wait_queue_entry		wq_entry;
+};
+
+extern long wait_event_loop(struct wait_queue_head *wq_head, struct wait_event_state *wes, int condition);
+
+#define wait_event_v2(wq_head, condition)					\
+({										\
+	struct wait_event_state __wes;						\
+	long __ret;								\
+										\
+	__wes.queued = 0;							\
+										\
+	do {									\
+		__ret = wait_event_loop(&(wq_head), &__wes, (condition) != 0);	\
+	} while (!__wes.done);							\
+										\
+	__ret;									\
+})
+
 /*
  * The below macro ___wait_event() has an explicit shadow of the __ret
  * variable when used from the wait_event_*() macros.
@@ -277,7 +302,7 @@ __out:	__ret;									\
  * wake_up() has to be called after changing any variable that could
  * change the result of the wait condition.
  */
-#define wait_event(wq_head, condition)						\
+#define wait_event_v1(wq_head, condition)					\
 do {										\
 	might_sleep();								\
 	if (condition)								\
@@ -285,6 +310,8 @@ do {										\
 	__wait_event(wq_head, condition);					\
 } while (0)
 
+#define wait_event wait_event_v2
+
 #define __io_wait_event(wq_head, condition)					\
 	(void)___wait_event(wq_head, condition, TASK_UNINTERRUPTIBLE, 0, 0,	\
 			    io_schedule())
diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
index 48794482d9ac..4542d9f6a5a4 100644
--- a/kernel/sched/wait.c
+++ b/kernel/sched/wait.c
@@ -293,6 +293,64 @@ static inline bool is_kthread_should_stop(void)
 }
 
 /*
+ * The main wait_event*() event loop iteration state machine.
+ *
+ * Note that this function itself does not loop, it returns to
+ * the caller to evaluate the call site dependent condition in
+ * every iteration.
+ */
+long wait_event_loop(struct wait_queue_head *wq_head, struct wait_event_state *wes, int condition)
+{
+	if (!wes->queued) {
+		might_sleep();
+
+		/*
+		 * If we are not initialized yet and the condition is already
+		 * met, we can return immediately:
+		 */
+		if (condition) {
+			wes->done = 1;
+			return 0;
+		}
+
+		/* Set up the wait-queue entry: */
+		init_wait_entry(&wes->wq_entry, 0);
+
+		wes->done = 0;
+		wes->queued = 1;
+		wes->prepared = 0;
+		wes->ret = 0;
+	} else {
+		/* Here is where we notice an updated wait condition: */
+		if (condition) {
+			finish_wait(wq_head, &wes->wq_entry);
+			wes->done = 1;
+			return 0;
+		}
+	}
+
+	if (!wes->prepared) {
+prepare_again:
+		wes->ret = prepare_to_wait_event(wq_head, &wes->wq_entry, 0);
+		wes->prepared = 1;
+
+		return 0;
+	}
+
+	if (___wait_is_interruptible(0) && wes->ret) {
+		/* We already got dequeued, so mark it done: */
+		wes->done = 1;
+
+		/* But return any eventual interruption code: */
+		return wes->ret;
+	}
+
+	schedule();
+	goto prepare_again;
+}
+EXPORT_SYMBOL_GPL(wait_event_loop);
+
+/*
  * DEFINE_WAIT_FUNC(wait, woken_wake_func);
  *
  * add_wait_queue(&wq_head, &wait);