* [PATCH v2] sched/completion: convert completions to use simple wait queues
@ 2016-04-28 12:57 Daniel Wagner
2016-04-29 6:14 ` Daniel Wagner
` (2 more replies)
0 siblings, 3 replies; 6+ messages in thread
From: Daniel Wagner @ 2016-04-28 12:57 UTC (permalink / raw)
To: linux-kernel, linux-rt-users
Cc: Peter Zijlstra (Intel),
Thomas Gleixner, Sebastian Andrzej Siewior, Daniel Wagner
From: Daniel Wagner <daniel.wagner@bmw-carit.de>
Completions have no long lasting callbacks and therefore do not need
the complex waitqueue variant. Use simple waitqueues which reduces
the contention on the waitqueue lock.
This is a carry-forward from v3.10-rt, with some RT-specific chunks
dropped and names updated to match the simple waitqueue support.
While the conversion of complete() is trivial, complete_all() is
more difficult. complete_all() can be called from IRQ context, where
we do not want to wake up a potentially large number of waiters.
Therefore, only the first waiter is woken from there and the remaining
waiters are woken by that first waiter. To avoid growing struct
completion, the done integer is split into one unsigned short for the
flags and one unsigned short for done.
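The deferred wake-up described above can be modelled in userspace. What follows is a minimal single-threaded sketch, not the kernel code: `model_completion`, `model_complete_all()` and `model_waiter_resumes()` are hypothetical names, the waiter queue is reduced to a counter, and the `in_irq` argument stands in for the irqs_disabled_flags() check in the patch.

```c
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_COMPLETION_DEFER (1u << 0)

/* Hypothetical userspace model; the real struct holds a swait queue head. */
struct model_completion {
	unsigned short flags;    /* MODEL_COMPLETION_DEFER or 0 */
	unsigned short done;     /* completion count */
	size_t nr_waiters;       /* tasks still queued */
	size_t woken_by_waker;   /* wake-ups issued by the complete_all() caller */
	size_t woken_by_waiter;  /* wake-ups issued by the first waiter */
};

/* Models complete_all(): with interrupts disabled only one waiter is woken
 * and the rest is deferred; otherwise all waiters are woken directly. */
void model_complete_all(struct model_completion *x, bool in_irq)
{
	x->done += USHRT_MAX / 2;
	if (in_irq) {
		x->flags = MODEL_COMPLETION_DEFER;
		if (x->nr_waiters) {
			x->nr_waiters--;
			x->woken_by_waker++;
		}
	} else {
		x->woken_by_waker += x->nr_waiters;
		x->nr_waiters = 0;
	}
}

/* Models the tail of __wait_for_common() as run by the woken waiter:
 * if the wake-up was deferred, this waiter wakes everybody else. */
void model_waiter_resumes(struct model_completion *x)
{
	if (x->flags & MODEL_COMPLETION_DEFER) {
		x->flags = 0;
		x->woken_by_waiter += x->nr_waiters;
		x->nr_waiters = 0;
	}
}
```

With five queued waiters and `in_irq` set, the caller of `model_complete_all()` issues a single wake-up and the remaining four are issued by `model_waiter_resumes()`. The point of the design is that the work done with interrupts disabled stays constant regardless of the queue length.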
The size of vmlinuz doesn't change too much:
add/remove: 3/0 grow/shrink: 3/10 up/down: 242/-236 (6)
function old new delta
swake_up_all_locked - 181 +181
__kstrtab_swake_up_all_locked - 20 +20
__ksymtab_swake_up_all_locked - 16 +16
complete_all 73 87 +14
try_wait_for_completion 99 107 +8
completion_done 40 43 +3
complete 73 65 -8
wait_for_completion_timeout 283 265 -18
wait_for_completion_killable_timeout 319 301 -18
wait_for_completion_io_timeout 283 265 -18
wait_for_completion_io 275 257 -18
wait_for_completion 275 257 -18
wait_for_completion_interruptible_timeout 304 285 -19
kexec_purgatory 26473 26449 -24
wait_for_completion_killable 544 499 -45
wait_for_completion_interruptible 522 472 -50
The downside of this approach is that complete_all() can only account
for about 32k waiters (USHRT_MAX/2) instead of about 2 billion
(UINT_MAX/2). This doesn't seem to be a real issue, though.
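The limits follow directly from the field widths. The sketch below is a userspace stand-in for the counter part of the patched struct completion (the swait queue head is left out); `split_counter` and the helper names are hypothetical, and the size claim assumes a 16-bit short and a 32-bit int.

```c
#include <limits.h>

/* Userspace stand-in for the counter in struct completion after the patch.
 * Anonymous struct/union members require C11. */
struct split_counter {
	union {
		struct {
			unsigned short flags;
			unsigned short done;
		};
		unsigned int val;    /* overlays flags and done */
	};
};

/* Amount added to done by complete_all() before and after the patch. */
unsigned int max_waiters_before(void)
{
	return UINT_MAX / 2;
}

unsigned int max_waiters_after(void)
{
	return USHRT_MAX / 2;
}

/* Mirrors what init_completion()/reinit_completion() do in the patch:
 * a single store to val clears both flags and done. */
void split_counter_reset(struct split_counter *c)
{
	c->val = 0;
}
```

Under those assumptions the union is no larger than the original unsigned int, which is why the struct does not grow, while the amount complete_all() adds to done drops from UINT_MAX/2 to USHRT_MAX/2.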
With a lockdep inspired waiter tracker I verified how many waiters
are queued up on a complete() or complete_all() call.
The first line of each entry starts with the class name of the swait
object, followed by 4 columns which count the number of waiters. After
that there is a left ip/symbol column for the waiter and a right
ip/symbol column for the waker.
I ran mmtests with config/config-global-dhp__scheduler-unbound plus an
additional kernbench:
swait_stat version 0.1
---------------------------------------------------------------------------------------------
class name 1 waiter 2 waiters 3 waiters 4+ waiters
---------------------------------------------------------------------------------------------
&rsp->gp_wq 129572 0 0 0
[<ffffffff810c5b81>] kthread+0x101/0x120
20154 [<ffffffff8110cf1f>] rcu_gp_kthread_wake+0x3f/0x50
535 [<ffffffff8110f603>] rcu_nocb_kthread+0x423/0x4b0
43867 [<ffffffff8110cfc1>] rcu_report_qs_rsp+0x51/0x80
44010 [<ffffffff8110d105>] rcu_report_qs_rnp+0x115/0x130
15882 [<ffffffff81111778>] rcu_process_callbacks+0x268/0x4a0
4437 [<ffffffff8111043c>] note_gp_changes+0xbc/0xc0
687 [<ffffffff8110f83e>] rcu_eqs_enter_common+0x1ae/0x1e0
&x->wait#11 39002 0 0 0
[<ffffffff810a4c43>] _do_fork+0x253/0x3c0
39002 [<ffffffff810a2e9b>] mm_release+0xbb/0x140
&rnp->nocb_gp_wq[1] 10277 0 0 0
[<ffffffff810c5b81>] kthread+0x101/0x120
10277 [<ffffffff810c5b81>] kthread+0x101/0x120
&rdp->nocb_wq 9862 0 0 0
[<ffffffff810c5b81>] kthread+0x101/0x120
4931 [<ffffffff8110ce05>] wake_nocb_leader+0x45/0x50
4290 [<ffffffff8110ced7>] __call_rcu_nocb_enqueue+0xc7/0xd0
629 [<ffffffff8110f728>] rcu_eqs_enter_common+0x98/0x1e0
12 [<ffffffff811115e5>] rcu_process_callbacks+0xd5/0x4a0
&rnp->nocb_gp_wq[0] 9769 0 0 0
[<ffffffff810c5b81>] kthread+0x101/0x120
9769 [<ffffffff810c5b81>] kthread+0x101/0x120
&x->wait#8 4123 0 0 0
[<ffffffffa011f03f>] xfs_buf_submit_wait+0x7f/0x280 [xfs]
4123 [<ffffffffa011e855>] xfs_buf_ioend+0xf5/0x230 [xfs]
(wait).wait#98 1594 0 0 0
[<ffffffff81471e94>] blk_execute_rq+0xb4/0x130
1594 [<ffffffff81471f33>] blk_end_sync_rq+0x23/0x30
&x->wait 827 0 0 0
[<ffffffff810c571d>] kthread_park+0x4d/0x60
[<ffffffff810c602f>] kthread_stop+0x4f/0x140
320 [<ffffffff810c566c>] __kthread_parkme+0x3c/0x70
507 [<ffffffff810a2e9b>] mm_release+0xbb/0x140
(done).wait#119 512 0 0 0
[<ffffffff810c5836>] kthread_create_on_node+0x106/0x1d0
512 [<ffffffff810c5b51>] kthread+0xd1/0x120
&x->wait#5 347 0 0 0
[<ffffffff810beb97>] flush_work+0x127/0x1d0
[<ffffffff810bc976>] flush_workqueue+0x176/0x5b0
273 [<ffffffff810bc742>] wq_barrier_func+0x12/0x20
74 [<ffffffff810bf308>] pwq_dec_nr_in_flight+0x98/0xa0
(done).wait#10 315 0 0 0
[<ffffffff810c5836>] kthread_create_on_node+0x106/0x1d0
315 [<ffffffff810c5b51>] kthread+0xd1/0x120
&x->wait#4 298 0 0 0
[<ffffffff815d7f2b>] devtmpfs_create_node+0x10b/0x150
298 [<ffffffff815d7dce>] devtmpfsd+0x10e/0x160
&x->wait#3 171 0 0 0
[<ffffffff8110bd26>] __wait_rcu_gp+0xc6/0xf0
171 [<ffffffff8110bc52>] wakeme_after_rcu+0x12/0x20
[...]
The stats show that at least for this workload there was never more
than 1 waiter when complete() or complete_all() was called. That
also matches the code review of all complete_all() calls.
One common pattern is
- prepare packet to transmit
- init_completion(&done)
- trigger hardware to transmit packet
- wait_for_completion(&done)
- irq handler calls complete_all(&done)
e.g. see drivers/i2c/busses/i2c-bcm-iproc.c
The filesystem code uses completions in a more complex pattern which
I couldn't really decipher, but some simple fs benchmarks didn't show
multiple waiters.
Only one complete_all() user with several waiters could be identified
so far, which happens to be drivers/base/power/main.c. Several waiters
appear when suspend to disk or mem is executed.
As one can see above in the swait_stat output, the fork() path is
using completions. A histogram of a fork bomb (1000 forks) benchmark
shows a slight performance drop of about 4%.
[wagi@handman completion-test-5 (master)]$ cat forky-4.6.0-rc4.txt | perl histo -min 0.12 -max 0.20 -int 0.01 -stars -scale 10
# NumSamples = 1000; Max = 0.208; Min = 0.123
# Mean = 0.146406; Variance = 0.000275351163999956; SD = 0.0165937085668019
# Each * represents a count of 10
0.1200 - 0.1300 [ 113]: ************
0.1300 - 0.1400 [ 324]: *********************************
0.1400 - 0.1500 [ 219]: **********************
0.1500 - 0.1600 [ 139]: **************
0.1600 - 0.1700 [ 94]: **********
0.1700 - 0.1800 [ 54]: ******
0.1800 - 0.1900 [ 37]: ****
0.1900 - 0.2000 [ 18]: **
[wagi@handman completion-test-5 (master)]$ cat forky-4.6.0-rc4-00001-g0a16067.txt | perl histo -min 0.12 -max 0.20 -int 0.01 -stars -scale 10
# NumSamples = 1000; Max = 0.207; Min = 0.121
# Mean = 0.152056; Variance = 0.000295474863999994; SD = 0.0171893823042014
# Each * represents a count of 10
0.1200 - 0.1300 [ 17]: **
0.1300 - 0.1400 [ 282]: *****************************
0.1400 - 0.1500 [ 240]: ************************
0.1500 - 0.1600 [ 158]: ****************
0.1600 - 0.1700 [ 114]: ************
0.1700 - 0.1800 [ 94]: **********
0.1800 - 0.1900 [ 66]: *******
0.1900 - 0.2000 [ 25]: ***
0.2000 - 0.2100 [ 1]: *
Compiling a kernel 100 times results in the following statistics, gathered
by 'time make -j200':
user
                                mean               std                var                max                min
kernbech-4.6.0-rc4              9.126              0.2919             0.08523            9.92               8.55
kernbech-4.6.0-rc4-00001-g0...  9.24      -1.25%   0.2768    5.17%    0.07664   10.07%   10.11     -1.92%   8.44      1.29%
system
                                mean               std                var                max                min
kernbech-4.6.0-rc4              1.676e+03          2.409              5.804              1.681e+03          1.666e+03
kernbech-4.6.0-rc4-00001-g0...  1.675e+03 0.07%    2.433     -1.01%   5.922     -2.03%   1.682e+03 -0.03%   1.67e+03  -0.20%
elapsed
                                mean               std                var                max                min
kernbech-4.6.0-rc4              2.303e+03          26.67              711.1              2.357e+03          2.232e+03
kernbech-4.6.0-rc4-00001-g0...  2.298e+03 0.23%    28.75     -7.83%   826.8     -16.26%  2.348e+03 0.38%    2.221e+03 0.49%
CPU
                                mean               std                var                max                min
kernbech-4.6.0-rc4              4.418e+03          48.9               2.391e+03          4.565e+03          4.347e+03
kernbech-4.6.0-rc4-00001-g0...  4.424e+03 -0.15%   55.73     -13.98%  3.106e+03 -29.90%  4.572e+03 -0.15%   4.356e+03 -0.21%
While the mean is slightly lower, the var and std increase quite
noticeably.
Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
---
I have also created a picture with the histograms for the above
tests. Since most of us are not able to process PostScript data
directly, I did not attach it. You can find it
here:
http://monom.org/data/completion/kernbench-completion-swait.png
changes since v1: none, just more tests and bigger commit message.
include/linux/completion.h | 23 ++++++++++++++++-------
include/linux/swait.h | 1 +
kernel/sched/completion.c | 43 ++++++++++++++++++++++++++-----------------
kernel/sched/swait.c | 24 ++++++++++++++++++++++++
4 files changed, 67 insertions(+), 24 deletions(-)
diff --git a/include/linux/completion.h b/include/linux/completion.h
index 5d5aaae..45fd91a 100644
--- a/include/linux/completion.h
+++ b/include/linux/completion.h
@@ -8,7 +8,7 @@
* See kernel/sched/completion.c for details.
*/
-#include <linux/wait.h>
+#include <linux/swait.h>
/*
* struct completion - structure used to maintain state for a "completion"
@@ -22,13 +22,22 @@
* reinit_completion(), and macros DECLARE_COMPLETION(),
* DECLARE_COMPLETION_ONSTACK().
*/
+
+#define COMPLETION_DEFER (1 << 0)
+
struct completion {
- unsigned int done;
- wait_queue_head_t wait;
+ union {
+ struct {
+ unsigned short flags;
+ unsigned short done;
+ };
+ unsigned int val;
+ };
+ struct swait_queue_head wait;
};
#define COMPLETION_INITIALIZER(work) \
- { 0, __WAIT_QUEUE_HEAD_INITIALIZER((work).wait) }
+ { 0, 0, __SWAIT_QUEUE_HEAD_INITIALIZER((work).wait) }
#define COMPLETION_INITIALIZER_ONSTACK(work) \
({ init_completion(&work); work; })
@@ -72,8 +81,8 @@ struct completion {
*/
static inline void init_completion(struct completion *x)
{
- x->done = 0;
- init_waitqueue_head(&x->wait);
+ x->val = 0;
+ init_swait_queue_head(&x->wait);
}
/**
@@ -85,7 +94,7 @@ static inline void init_completion(struct completion *x)
*/
static inline void reinit_completion(struct completion *x)
{
- x->done = 0;
+ x->val = 0;
}
extern void wait_for_completion(struct completion *);
diff --git a/include/linux/swait.h b/include/linux/swait.h
index c1f9c62..83f004a 100644
--- a/include/linux/swait.h
+++ b/include/linux/swait.h
@@ -87,6 +87,7 @@ static inline int swait_active(struct swait_queue_head *q)
extern void swake_up(struct swait_queue_head *q);
extern void swake_up_all(struct swait_queue_head *q);
extern void swake_up_locked(struct swait_queue_head *q);
+extern void swake_up_all_locked(struct swait_queue_head *q);
extern void __prepare_to_swait(struct swait_queue_head *q, struct swait_queue *wait);
extern void prepare_to_swait(struct swait_queue_head *q, struct swait_queue *wait, int state);
diff --git a/kernel/sched/completion.c b/kernel/sched/completion.c
index 8d0f35d..d4dccd3 100644
--- a/kernel/sched/completion.c
+++ b/kernel/sched/completion.c
@@ -30,10 +30,10 @@ void complete(struct completion *x)
{
unsigned long flags;
- spin_lock_irqsave(&x->wait.lock, flags);
+ raw_spin_lock_irqsave(&x->wait.lock, flags);
x->done++;
- __wake_up_locked(&x->wait, TASK_NORMAL, 1);
- spin_unlock_irqrestore(&x->wait.lock, flags);
+ swake_up_locked(&x->wait);
+ raw_spin_unlock_irqrestore(&x->wait.lock, flags);
}
EXPORT_SYMBOL(complete);
@@ -50,10 +50,15 @@ void complete_all(struct completion *x)
{
unsigned long flags;
- spin_lock_irqsave(&x->wait.lock, flags);
- x->done += UINT_MAX/2;
- __wake_up_locked(&x->wait, TASK_NORMAL, 0);
- spin_unlock_irqrestore(&x->wait.lock, flags);
+ raw_spin_lock_irqsave(&x->wait.lock, flags);
+ x->done += USHRT_MAX/2;
+ if (irqs_disabled_flags(flags)) {
+ x->flags = COMPLETION_DEFER;
+ swake_up_locked(&x->wait);
+ } else {
+ swake_up_all_locked(&x->wait);
+ }
+ raw_spin_unlock_irqrestore(&x->wait.lock, flags);
}
EXPORT_SYMBOL(complete_all);
@@ -62,20 +67,20 @@ do_wait_for_common(struct completion *x,
long (*action)(long), long timeout, int state)
{
if (!x->done) {
- DECLARE_WAITQUEUE(wait, current);
+ DECLARE_SWAITQUEUE(wait);
- __add_wait_queue_tail_exclusive(&x->wait, &wait);
+ __prepare_to_swait(&x->wait, &wait);
do {
if (signal_pending_state(state, current)) {
timeout = -ERESTARTSYS;
break;
}
__set_current_state(state);
- spin_unlock_irq(&x->wait.lock);
+ raw_spin_unlock_irq(&x->wait.lock);
timeout = action(timeout);
- spin_lock_irq(&x->wait.lock);
+ raw_spin_lock_irq(&x->wait.lock);
} while (!x->done && timeout);
- __remove_wait_queue(&x->wait, &wait);
+ __finish_swait(&x->wait, &wait);
if (!x->done)
return timeout;
}
@@ -89,9 +94,13 @@ __wait_for_common(struct completion *x,
{
might_sleep();
- spin_lock_irq(&x->wait.lock);
+ raw_spin_lock_irq(&x->wait.lock);
timeout = do_wait_for_common(x, action, timeout, state);
- spin_unlock_irq(&x->wait.lock);
+ raw_spin_unlock_irq(&x->wait.lock);
+ if (x->flags & COMPLETION_DEFER) {
+ x->flags = 0;
+ swake_up_all(&x->wait);
+ }
return timeout;
}
@@ -277,12 +286,12 @@ bool try_wait_for_completion(struct completion *x)
if (!READ_ONCE(x->done))
return 0;
- spin_lock_irqsave(&x->wait.lock, flags);
+ raw_spin_lock_irqsave(&x->wait.lock, flags);
if (!x->done)
ret = 0;
else
x->done--;
- spin_unlock_irqrestore(&x->wait.lock, flags);
+ raw_spin_unlock_irqrestore(&x->wait.lock, flags);
return ret;
}
EXPORT_SYMBOL(try_wait_for_completion);
@@ -311,7 +320,7 @@ bool completion_done(struct completion *x)
* after it's acquired the lock.
*/
smp_rmb();
- spin_unlock_wait(&x->wait.lock);
+ raw_spin_unlock_wait(&x->wait.lock);
return true;
}
EXPORT_SYMBOL(completion_done);
diff --git a/kernel/sched/swait.c b/kernel/sched/swait.c
index 82f0dff..efe366b 100644
--- a/kernel/sched/swait.c
+++ b/kernel/sched/swait.c
@@ -72,6 +72,30 @@ void swake_up_all(struct swait_queue_head *q)
}
EXPORT_SYMBOL(swake_up_all);
+void swake_up_all_locked(struct swait_queue_head *q)
+{
+ struct swait_queue *curr;
+ LIST_HEAD(tmp);
+
+ if (!swait_active(q))
+ return;
+
+ list_splice_init(&q->task_list, &tmp);
+ while (!list_empty(&tmp)) {
+ curr = list_first_entry(&tmp, typeof(*curr), task_list);
+
+ wake_up_state(curr->task, TASK_NORMAL);
+ list_del_init(&curr->task_list);
+
+ if (list_empty(&tmp))
+ break;
+
+ raw_spin_unlock_irq(&q->lock);
+ raw_spin_lock_irq(&q->lock);
+ }
+}
+EXPORT_SYMBOL(swake_up_all_locked);
+
void __prepare_to_swait(struct swait_queue_head *q, struct swait_queue *wait)
{
wait->task = current;
--
2.5.5
^ permalink raw reply related [flat|nested] 6+ messages in thread
* Re: [PATCH v2] sched/completion: convert completions to use simple wait queues
2016-04-28 12:57 [PATCH v2] sched/completion: convert completions to use simple wait queues Daniel Wagner
@ 2016-04-29 6:14 ` Daniel Wagner
2016-05-12 14:08 ` Daniel Wagner
2016-05-16 20:33 ` Luiz Capitulino
2 siblings, 0 replies; 6+ messages in thread
From: Daniel Wagner @ 2016-04-29 6:14 UTC (permalink / raw)
To: linux-kernel, linux-rt-users
Cc: Peter Zijlstra (Intel), Thomas Gleixner, Sebastian Andrzej Siewior
On 04/28/2016 02:57 PM, Daniel Wagner wrote:
> Only one complete_all() user with several waiters could be identified
> so far, which happens to be drivers/base/power/main.c. Several waiters
> appear when suspend to disk or mem is executed.
BTW, this is what I get when doing a 'echo "disk" > /sys/power/state' on
a 4 socket E5-4610 (Ivy Bridge EP) system.
swait_stat version 0.1
---------------------------------------------------------------------------------------------
class name 1 waiter 2 waiters 3 waiters 4+ waiters
---------------------------------------------------------------------------------------------
[...]
&x->wait#12 90 11 5 1
[<ffffffff815dd462>] dpm_wait+0x32/0x40
20 [<ffffffff815de5d4>] __device_suspend+0x1b4/0x370
4 [<ffffffff815de1e4>] __device_suspend_late+0x74/0x210
22 [<ffffffff815ddf21>] __device_suspend_noirq+0x51/0x200
2 [<ffffffff815ddaf9>] device_resume_early+0x69/0x1b0
59 [<ffffffff815ddce0>] device_resume+0x50/0x1f0
[...]
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH v2] sched/completion: convert completions to use simple wait queues
2016-04-28 12:57 [PATCH v2] sched/completion: convert completions to use simple wait queues Daniel Wagner
2016-04-29 6:14 ` Daniel Wagner
@ 2016-05-12 14:08 ` Daniel Wagner
2016-05-16 20:38 ` Luiz Capitulino
2016-05-16 20:33 ` Luiz Capitulino
2 siblings, 1 reply; 6+ messages in thread
From: Daniel Wagner @ 2016-05-12 14:08 UTC (permalink / raw)
To: linux-kernel, linux-rt-users
Cc: Peter Zijlstra (Intel),
Thomas Gleixner, Sebastian Andrzej Siewior, Daniel Wagner
On 04/28/2016 02:57 PM, Daniel Wagner wrote:
> As one can see above in the swait_stat output, the fork() path is
> using completions. A histogram of a fork bomb (1000 forks) benchmark
> shows a slight performance drop of about 4%.
>
> [wagi@handman completion-test-5 (master)]$ cat forky-4.6.0-rc4.txt | perl histo -min 0.12 -max 0.20 -int 0.01 -stars -scale 10
> # NumSamples = 1000; Max = 0.208; Min = 0.123
> # Mean = 0.146406; Variance = 0.000275351163999956; SD = 0.0165937085668019
> # Each * represents a count of 10
> 0.1200 - 0.1300 [ 113]: ************
> 0.1300 - 0.1400 [ 324]: *********************************
> 0.1400 - 0.1500 [ 219]: **********************
> 0.1500 - 0.1600 [ 139]: **************
> 0.1600 - 0.1700 [ 94]: **********
> 0.1700 - 0.1800 [ 54]: ******
> 0.1800 - 0.1900 [ 37]: ****
> 0.1900 - 0.2000 [ 18]: **
>
> [wagi@handman completion-test-5 (master)]$ cat forky-4.6.0-rc4-00001-g0a16067.txt | perl histo -min 0.12 -max 0.20 -int 0.01 -stars -scale 10
> # NumSamples = 1000; Max = 0.207; Min = 0.121
> # Mean = 0.152056; Variance = 0.000295474863999994; SD = 0.0171893823042014
> # Each * represents a count of 10
> 0.1200 - 0.1300 [ 17]: **
> 0.1300 - 0.1400 [ 282]: *****************************
> 0.1400 - 0.1500 [ 240]: ************************
> 0.1500 - 0.1600 [ 158]: ****************
> 0.1600 - 0.1700 [ 114]: ************
> 0.1700 - 0.1800 [ 94]: **********
> 0.1800 - 0.1900 [ 66]: *******
> 0.1900 - 0.2000 [ 25]: ***
> 0.2000 - 0.2100 [ 1]: *
I redid the above test and changed my fork bomb to this:
	for (i = 0; i < MAX_CHILDREN; i++) {
		switch (fork()) {
		case -1:
			exit(1);
		case 0:
			_exit(0);
		}
	}
	for (i = 0; i < MAX_CHILDREN; i++) {
		do {
			pid = waitpid(-1, &status, WUNTRACED);
			if (pid < 0 && errno != ECHILD)
				exit(1);
		} while (!WIFEXITED(status) && !WIFSIGNALED(status));
	}
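For readers who want to reproduce the run, the fragment above could be completed roughly as follows. This is a sketch under assumptions, not the program used for the measurements: the function name `fork_and_reap()` and its parameter are made up here, the number of children is passed in because the value of MAX_CHILDREN is not given, and waitpid() is called without WUNTRACED so that only terminated children are reported.

```c
#include <errno.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork nr_children children that exit immediately, then reap them all.
 * Returns the number of children reaped, or -1 on error. */
int fork_and_reap(int nr_children)
{
	int reaped = 0;
	int status;
	pid_t pid;
	int i;

	for (i = 0; i < nr_children; i++) {
		switch (fork()) {
		case -1:
			return -1;      /* fork failed */
		case 0:
			_exit(0);       /* child: exit right away */
		default:
			break;          /* parent: keep forking */
		}
	}

	while (reaped < nr_children) {
		pid = waitpid(-1, &status, 0);
		if (pid < 0) {
			if (errno == EINTR)
				continue;   /* interrupted by a signal, retry */
			return -1;      /* e.g. ECHILD: nothing left to wait for */
		}
		if (WIFEXITED(status) || WIFSIGNALED(status))
			reaped++;
	}
	return reaped;
}
```

A caller would wrap `fork_and_reap()` in whatever timing harness it prefers; the time measurement itself is deliberately left out because the thread does not say how the samples were taken.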
Obviously, fork is not a very good benchmark since we might end up
in memory allocation etc. The distributions I get from the baseline and
this patch look very similar:
[wagi@handman completion (master)]$ cat results/forky-4.6.0-rc4.txt | perl histo -min 0.09 -max 0.11 -int 0.001 -stars -scale 100
0.0910 - 0.0920 [ 3]: *
0.0920 - 0.0930 [ 8]: *
0.0930 - 0.0940 [ 52]: *
0.0940 - 0.0950 [ 404]: *****
0.0950 - 0.0960 [ 1741]: ******************
0.0960 - 0.0970 [ 2221]: ***********************
0.0970 - 0.0980 [ 1612]: *****************
0.0980 - 0.0990 [ 1346]: **************
0.0990 - 0.1000 [ 1223]: *************
0.1000 - 0.1010 [ 724]: ********
0.1010 - 0.1020 [ 362]: ****
0.1020 - 0.1030 [ 186]: **
0.1030 - 0.1040 [ 71]: *
0.1040 - 0.1050 [ 29]: *
0.1050 - 0.1060 [ 12]: *
0.1060 - 0.1070 [ 4]: *
0.1080 - 0.1090 [ 2]: *
[wagi@handman completion (master)]$ cat results/forky-4.6.0-rc4-00001-gc4c770c.txt | perl histo -min 0.09 -max 0.11 -int 0.001 -stars -scale 100
0.0930 - 0.0940 [ 3]: *
0.0940 - 0.0950 [ 9]: *
0.0950 - 0.0960 [ 25]: *
0.0960 - 0.0970 [ 77]: *
0.0970 - 0.0980 [ 324]: ****
0.0980 - 0.0990 [ 1503]: ****************
0.0990 - 0.1000 [ 2247]: ***********************
0.1000 - 0.1010 [ 1708]: ******************
0.1010 - 0.1020 [ 1486]: ***************
0.1020 - 0.1030 [ 1215]: *************
0.1030 - 0.1040 [ 729]: ********
0.1040 - 0.1050 [ 368]: ****
0.1050 - 0.1060 [ 197]: **
0.1060 - 0.1070 [ 65]: *
0.1070 - 0.1080 [ 32]: *
0.1080 - 0.1090 [ 7]: *
0.1090 - 0.1100 [ 2]: *
A t-test (which determines whether two sets of data are significantly
different) returns a p value of 0 (< 1%). That means we reject the null
hypothesis of equal averages, i.e. we have a 0.3% decrease in
performance compared with the baseline.
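The raw samples of this run are not part of the thread, so the test itself cannot be reproduced here. As an illustration of the arithmetic only, the sketch below applies Welch's t statistic to the summary values posted for the first fork run (mean, variance and 1000 samples per kernel). The struct and function names are hypothetical, and the squared statistic is used so that no sqrt() from libm is needed.

```c
/* Summary of one benchmark run, as printed by the histogram tool. */
struct sample_summary {
	double mean;
	double variance;
	unsigned long n;
};

/* Relative change of b versus a, in percent. */
double relative_change_pct(const struct sample_summary *a,
			   const struct sample_summary *b)
{
	return (b->mean - a->mean) / a->mean * 100.0;
}

/* Squared Welch t statistic: (mean_b - mean_a)^2 / (var_a/n_a + var_b/n_b). */
double welch_t_squared(const struct sample_summary *a,
		       const struct sample_summary *b)
{
	double diff = b->mean - a->mean;
	double se2 = a->variance / (double)a->n + b->variance / (double)b->n;

	return diff * diff / se2;
}
```

Fed with 0.146406 / 0.000275351163999956 for the baseline and 0.152056 / 0.000295474863999994 for the patched kernel, the relative change comes out at a little under 4% and the squared statistic at roughly 56. For samples this large, the two-sided 5% threshold is about 1.96 squared, or 3.84, so a difference of that size would also be called significant.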
> Compiling a kernel 100 times results in the following statistics, gathered
> by 'time make -j200':
>
> user
>                                 mean               std                var                max                min
> kernbech-4.6.0-rc4              9.126              0.2919             0.08523            9.92               8.55
> kernbech-4.6.0-rc4-00001-g0...  9.24      -1.25%   0.2768    5.17%    0.07664   10.07%   10.11     -1.92%   8.44      1.29%
>
> system
>                                 mean               std                var                max                min
> kernbech-4.6.0-rc4              1.676e+03          2.409              5.804              1.681e+03          1.666e+03
> kernbech-4.6.0-rc4-00001-g0...  1.675e+03 0.07%    2.433     -1.01%   5.922     -2.03%   1.682e+03 -0.03%   1.67e+03  -0.20%
>
> elapsed
>                                 mean               std                var                max                min
> kernbech-4.6.0-rc4              2.303e+03          26.67              711.1              2.357e+03          2.232e+03
> kernbech-4.6.0-rc4-00001-g0...  2.298e+03 0.23%    28.75     -7.83%   826.8     -16.26%  2.348e+03 0.38%    2.221e+03 0.49%
>
> CPU
>                                 mean               std                var                max                min
> kernbech-4.6.0-rc4              4.418e+03          48.9               2.391e+03          4.565e+03          4.347e+03
> kernbech-4.6.0-rc4-00001-g0...  4.424e+03 -0.15%   55.73     -13.98%  3.106e+03 -29.90%  4.572e+03 -0.15%   4.356e+03 -0.21%
>
> While the mean is slightly lower, the var and std increase quite
> noticeably.
The idea behind doing the kernel builds is that I wanted to see if
there is an observable impact from a real workload. The above numbers are hard
to interpret, though if you only look at the elapsed time you see it takes
slightly longer. I repeated this test with 500 runs and the numbers I get
are the same as above. So at least it is a consistent and repeatable experiment.
Obviously, I tried to micro-benchmark what's going on, but so far I
haven't had any luck. I used a kernel module which has two threads doing
a ping-pong completion test. A typical trace looks like this:
trigger-2376 [000] 218.982609: sched_waking: comm=waiter/0 pid=2375 prio=120 target_cpu=000
trigger-2376 [000] 218.982609: sched_stat_runtime: comm=trigger pid=2376 runtime=1355 [ns] vruntime=40692621118 [ns]
trigger-2376 [000] 218.982609: sched_wakeup: waiter/0:2375 [120] success=1 CPU:000
trigger-2376 [000] 218.982610: rcu_utilization: Start context switch
trigger-2376 [000] 218.982610: rcu_utilization: End context switch
trigger-2376 [000] 218.982610: sched_stat_runtime: comm=trigger pid=2376 runtime=1072 [ns] vruntime=40692622190 [ns]
trigger-2376 [000] 218.982611: sched_switch: trigger:2376 [120] S ==> waiter/0:2375 [120]
waiter/0-2375 [000] 218.982611: latency_complete: latency=2285
waiter/0-2375 [000] 218.982611: sched_waking: comm=trigger pid=2376 prio=120 target_cpu=000
waiter/0-2375 [000] 218.982611: sched_stat_runtime: comm=waiter/0 pid=2375 runtime=1217 [ns] vruntime=40692622747 [ns]
waiter/0-2375 [000] 218.982612: sched_wakeup: trigger:2376 [120] success=1 CPU:000
waiter/0-2375 [000] 218.982612: rcu_utilization: Start context switch
waiter/0-2375 [000] 218.982612: rcu_utilization: End context switch
waiter/0-2375 [000] 218.982612: sched_stat_runtime: comm=waiter/0 pid=2375 runtime=1099 [ns] vruntime=40692623846 [ns]
waiter/0-2375 [000] 218.982613: sched_switch: waiter/0:2375 [120] S ==> trigger:2376 [120]
I have plotted the latency_complete (the time it takes from complete()
till the waiter is running)
https://www.monom.org/data/completion/completion-latency.png
The stats for the above plot are:
[wagi@handman results (master)]$ csvstat-3 completion-latency-4.6.0-rc4.txt
1. 805
<class 'int'>
Nulls: False
Min: 643
Max: 351709
Sum: 3396063015
Mean: 715.6573082933786
Median: 706.0
Standard Deviation: 385.24467795803787
Unique values: 4662
5 most frequent values:
697: 121547
703: 120730
693: 112609
699: 112543
701: 112370
Row count: 4745376
[wagi@handman results (master)]$ csvstat-3 completion-latency-4.6.0-rc4-00001-gc4c770c.txt
1. 4949
<class 'int'>
Nulls: False
Min: 660
Max: 376049
Sum: 3417112614
Mean: 710.0990997187752
Median: 696
Standard Deviation: 500.7461712849926
Unique values: 4930
5 most frequent values:
693: 188698
689: 165564
692: 158333
688: 156896
684: 155032
Row count: 4812163
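The two csvstat summaries can be cross-checked and compared with a few lines of arithmetic. The sketch below is only an illustration; `latency_stats`, `latency_mean()` and `pct_change()` are hypothetical names, and the inputs are the Sum, Row count and Standard Deviation values printed above.

```c
/* Values taken from one csvstat summary. */
struct latency_stats {
	unsigned long long sum;    /* "Sum" line */
	unsigned long long rows;   /* "Row count" line */
	double stddev;             /* "Standard Deviation" line */
};

/* Mean recomputed from sum and row count. */
double latency_mean(const struct latency_stats *s)
{
	return (double)s->sum / (double)s->rows;
}

/* Relative change from before to after, in percent. */
double pct_change(double before, double after)
{
	return (after - before) / before * 100.0;
}
```

With the numbers above, the recomputed means agree with the reported ones (about 715.66 for the baseline and 710.10 with the patch). The mean latency is lower by a bit under 1% with the patch, while the standard deviation is higher by about 30%.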
In short, I haven't figured out yet why the kernel builds get slightly slower.
The first idea, that the fork path is the problem, cannot be proven with
the fork bomb, at least not when it is executed in a tight loop.
cheers,
daniel
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH v2] sched/completion: convert completions to use simple wait queues
2016-04-28 12:57 [PATCH v2] sched/completion: convert completions to use simple wait queues Daniel Wagner
2016-04-29 6:14 ` Daniel Wagner
2016-05-12 14:08 ` Daniel Wagner
@ 2016-05-16 20:33 ` Luiz Capitulino
2 siblings, 0 replies; 6+ messages in thread
From: Luiz Capitulino @ 2016-05-16 20:33 UTC (permalink / raw)
To: Daniel Wagner
Cc: linux-kernel, linux-rt-users, Peter Zijlstra (Intel),
Thomas Gleixner, Sebastian Andrzej Siewior, Daniel Wagner
On Thu, 28 Apr 2016 14:57:24 +0200
Daniel Wagner <wagi@monom.org> wrote:
> From: Daniel Wagner <daniel.wagner@bmw-carit.de>
>
> Completions have no long lasting callbacks and therefore do not need
> the complex waitqueue variant. Use simple waitqueues which reduces
> the contention on the waitqueue lock.
>
> This is a carry-forward from v3.10-rt, with some RT-specific chunks
> dropped and names updated to match the simple waitqueue support.
>
> While the conversion of complete() is trivial, complete_all() is
> more difficult. complete_all() can be called from IRQ context, where
> we do not want to wake up a potentially large number of waiters.
> Therefore, only the first waiter is woken from there and the remaining
> waiters are woken by that first waiter. To avoid growing struct
> completion, the done integer is split into one unsigned short for the
> flags and one unsigned short for done.
>
> The size of vmlinuz doesn't change too much:
>
>
> add/remove: 3/0 grow/shrink: 3/10 up/down: 242/-236 (6)
> function old new delta
> swake_up_all_locked - 181 +181
> __kstrtab_swake_up_all_locked - 20 +20
> __ksymtab_swake_up_all_locked - 16 +16
> complete_all 73 87 +14
> try_wait_for_completion 99 107 +8
> completion_done 40 43 +3
> complete 73 65 -8
> wait_for_completion_timeout 283 265 -18
> wait_for_completion_killable_timeout 319 301 -18
> wait_for_completion_io_timeout 283 265 -18
> wait_for_completion_io 275 257 -18
> wait_for_completion 275 257 -18
> wait_for_completion_interruptible_timeout 304 285 -19
> kexec_purgatory 26473 26449 -24
> wait_for_completion_killable 544 499 -45
> wait_for_completion_interruptible 522 472 -50
>
>
> The downside of this approach is that complete_all() can only account
> for about 32k waiters (USHRT_MAX/2) instead of about 2 billion
> (UINT_MAX/2). This doesn't seem to be a real issue, though.
>
> With a lockdep inspired waiter tracker I verified how many waiters
> are queued up on a complete() or complete_all() call.
>
> The first line of each entry starts with the class name of the swait
> object, followed by 4 columns which count the number of waiters. After
> that there is a left ip/symbol column for the waiter and a right
> ip/symbol column for the waker.
>
> I ran mmtests with config/config-global-dhp__scheduler-unbound plus an
> additional kernbench:
>
>
> swait_stat version 0.1
> ---------------------------------------------------------------------------------------------
> class name 1 waiter 2 waiters 3 waiters 4+ waiters
> ---------------------------------------------------------------------------------------------
> &rsp->gp_wq 129572 0 0 0
> [<ffffffff810c5b81>] kthread+0x101/0x120
> 20154 [<ffffffff8110cf1f>] rcu_gp_kthread_wake+0x3f/0x50
> 535 [<ffffffff8110f603>] rcu_nocb_kthread+0x423/0x4b0
> 43867 [<ffffffff8110cfc1>] rcu_report_qs_rsp+0x51/0x80
> 44010 [<ffffffff8110d105>] rcu_report_qs_rnp+0x115/0x130
> 15882 [<ffffffff81111778>] rcu_process_callbacks+0x268/0x4a0
> 4437 [<ffffffff8111043c>] note_gp_changes+0xbc/0xc0
> 687 [<ffffffff8110f83e>] rcu_eqs_enter_common+0x1ae/0x1e0
> &x->wait#11 39002 0 0 0
> [<ffffffff810a4c43>] _do_fork+0x253/0x3c0
> 39002 [<ffffffff810a2e9b>] mm_release+0xbb/0x140
> &rnp->nocb_gp_wq[1] 10277 0 0 0
> [<ffffffff810c5b81>] kthread+0x101/0x120
> 10277 [<ffffffff810c5b81>] kthread+0x101/0x120
> &rdp->nocb_wq 9862 0 0 0
> [<ffffffff810c5b81>] kthread+0x101/0x120
> 4931 [<ffffffff8110ce05>] wake_nocb_leader+0x45/0x50
> 4290 [<ffffffff8110ced7>] __call_rcu_nocb_enqueue+0xc7/0xd0
> 629 [<ffffffff8110f728>] rcu_eqs_enter_common+0x98/0x1e0
> 12 [<ffffffff811115e5>] rcu_process_callbacks+0xd5/0x4a0
> &rnp->nocb_gp_wq[0] 9769 0 0 0
> [<ffffffff810c5b81>] kthread+0x101/0x120
> 9769 [<ffffffff810c5b81>] kthread+0x101/0x120
> &x->wait#8 4123 0 0 0
> [<ffffffffa011f03f>] xfs_buf_submit_wait+0x7f/0x280 [xfs]
> 4123 [<ffffffffa011e855>] xfs_buf_ioend+0xf5/0x230 [xfs]
> (wait).wait#98 1594 0 0 0
> [<ffffffff81471e94>] blk_execute_rq+0xb4/0x130
> 1594 [<ffffffff81471f33>] blk_end_sync_rq+0x23/0x30
> &x->wait 827 0 0 0
> [<ffffffff810c571d>] kthread_park+0x4d/0x60
> [<ffffffff810c602f>] kthread_stop+0x4f/0x140
> 320 [<ffffffff810c566c>] __kthread_parkme+0x3c/0x70
> 507 [<ffffffff810a2e9b>] mm_release+0xbb/0x140
> (done).wait#119 512 0 0 0
> [<ffffffff810c5836>] kthread_create_on_node+0x106/0x1d0
> 512 [<ffffffff810c5b51>] kthread+0xd1/0x120
> &x->wait#5 347 0 0 0
> [<ffffffff810beb97>] flush_work+0x127/0x1d0
> [<ffffffff810bc976>] flush_workqueue+0x176/0x5b0
> 273 [<ffffffff810bc742>] wq_barrier_func+0x12/0x20
> 74 [<ffffffff810bf308>] pwq_dec_nr_in_flight+0x98/0xa0
> (done).wait#10 315 0 0 0
> [<ffffffff810c5836>] kthread_create_on_node+0x106/0x1d0
> 315 [<ffffffff810c5b51>] kthread+0xd1/0x120
> &x->wait#4 298 0 0 0
> [<ffffffff815d7f2b>] devtmpfs_create_node+0x10b/0x150
> 298 [<ffffffff815d7dce>] devtmpfsd+0x10e/0x160
> &x->wait#3 171 0 0 0
> [<ffffffff8110bd26>] __wait_rcu_gp+0xc6/0xf0
> 171 [<ffffffff8110bc52>] wakeme_after_rcu+0x12/0x20
> [...]
>
> The stats show that at least for this workload there was never more
> than 1 waiter when complete() or complete_all() was called. That
> also matches the code review of all complete_all() calls.
>
> One common pattern is
>
> - prepare packet to transmit
> - init_completion(&done)
> - trigger hardware to transmit packet
> - wait_for_completion(&done)
> - irq handler calls complete_all(&done)
>
> e.g. see drivers/i2c/busses/i2c-bcm-iproc.c
>
> The filesystem code uses completions in a more complex pattern which
> I couldn't really decipher, but some simple fs benchmarks didn't show
> multiple waiters.
>
> Only one complete_all() user with several waiters could be identified
> so far, which happens to be drivers/base/power/main.c. Several waiters
> appear when suspend to disk or mem is executed.
>
> As one can see above in the swait_stat output, the fork() path is
> using completions. A histogram of a fork bomb (1000 forks) benchmark
> shows a slight performance drop of about 4%.
>
> [wagi@handman completion-test-5 (master)]$ cat forky-4.6.0-rc4.txt | perl histo -min 0.12 -max 0.20 -int 0.01 -stars -scale 10
> # NumSamples = 1000; Max = 0.208; Min = 0.123
> # Mean = 0.146406; Variance = 0.000275351163999956; SD = 0.0165937085668019
> # Each * represents a count of 10
> 0.1200 - 0.1300 [ 113]: ************
> 0.1300 - 0.1400 [ 324]: *********************************
> 0.1400 - 0.1500 [ 219]: **********************
> 0.1500 - 0.1600 [ 139]: **************
> 0.1600 - 0.1700 [ 94]: **********
> 0.1700 - 0.1800 [ 54]: ******
> 0.1800 - 0.1900 [ 37]: ****
> 0.1900 - 0.2000 [ 18]: **
>
> [wagi@handman completion-test-5 (master)]$ cat forky-4.6.0-rc4-00001-g0a16067.txt | perl histo -min 0.12 -max 0.20 -int 0.01 -stars -scale 10
> # NumSamples = 1000; Max = 0.207; Min = 0.121
> # Mean = 0.152056; Variance = 0.000295474863999994; SD = 0.0171893823042014
> # Each * represents a count of 10
> 0.1200 - 0.1300 [ 17]: **
> 0.1300 - 0.1400 [ 282]: *****************************
> 0.1400 - 0.1500 [ 240]: ************************
> 0.1500 - 0.1600 [ 158]: ****************
> 0.1600 - 0.1700 [ 114]: ************
> 0.1700 - 0.1800 [ 94]: **********
> 0.1800 - 0.1900 [ 66]: *******
> 0.1900 - 0.2000 [ 25]: ***
> 0.2000 - 0.2100 [ 1]: *
>
> Compiling a kernel 100 times results in following statistics gather
> by 'time make -j200'
>
> user
>                                 mean                std                 var                 max                 min
> kernbech-4.6.0-rc4              9.126               0.2919              0.08523             9.92                8.55
> kernbech-4.6.0-rc4-00001-g0...  9.24       -1.25%   0.2768      5.17%   0.07664    10.07%   10.11      -1.92%   8.44        1.29%
>
>
> system
>                                 mean                std                 var                 max                 min
> kernbech-4.6.0-rc4              1.676e+03           2.409               5.804               1.681e+03           1.666e+03
> kernbech-4.6.0-rc4-00001-g0...  1.675e+03   0.07%   2.433      -1.01%   5.922      -2.03%   1.682e+03  -0.03%   1.67e+03   -0.20%
>
>
> elapsed
>                                 mean                std                 var                 max                 min
> kernbech-4.6.0-rc4              2.303e+03           26.67               711.1               2.357e+03           2.232e+03
> kernbech-4.6.0-rc4-00001-g0...  2.298e+03   0.23%   28.75      -7.83%   826.8     -16.26%   2.348e+03   0.38%   2.221e+03   0.49%
>
>
> CPU
>                                 mean                std                 var                 max                 min
> kernbech-4.6.0-rc4              4.418e+03           48.9                2.391e+03           4.565e+03           4.347e+03
> kernbech-4.6.0-rc4-00001-g0...  4.424e+03  -0.15%   55.73     -13.98%   3.106e+03 -29.90%   4.572e+03  -0.15%   4.356e+03  -0.21%
>
>
> While the mean is slightly lower, the var and std increase quite
> noticeably.
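The message does not say how the percentage columns in the kernbench tables are derived. A minimal sketch, assuming they are (baseline - patched) / baseline, so that a positive value means the patched kernel produced the lower number; delta_percent() is a hypothetical helper, not part of the patch or of any benchmark tool:

```c
/*
 * Relative change in percent between a baseline and a patched measurement.
 * Assumed sign convention: positive when the patched value is lower.
 */
double delta_percent(double baseline, double patched)
{
	return (baseline - patched) / baseline * 100.0;
}
```

Under that assumption the user-time means (9.126 vs. 9.24) give roughly -1.25% and the std values (0.2919 vs. 0.2768) roughly 5.17%, matching the printed columns. Applied to the fork histogram means (0.146406 vs. 0.152056) it gives roughly -3.9%, in line with the "4%" drop quoted earlier.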
>
> Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
> ---
>
> I have also created a picture with the histograms for the above
> tests. Since most of us are not able to process the PostScript data
> directly, I did not attach it. You can find it
> here:
>
> http://monom.org/data/completion/kernbench-completion-swait.png
>
> changes since v1: none, just more tests and bigger commit message.
>
>
> include/linux/completion.h | 23 ++++++++++++++++-------
> include/linux/swait.h | 1 +
> kernel/sched/completion.c | 43 ++++++++++++++++++++++++++-----------------
> kernel/sched/swait.c | 24 ++++++++++++++++++++++++
> 4 files changed, 67 insertions(+), 24 deletions(-)
>
> diff --git a/include/linux/completion.h b/include/linux/completion.h
> index 5d5aaae..45fd91a 100644
> --- a/include/linux/completion.h
> +++ b/include/linux/completion.h
> @@ -8,7 +8,7 @@
> * See kernel/sched/completion.c for details.
> */
>
> -#include <linux/wait.h>
> +#include <linux/swait.h>
>
> /*
> * struct completion - structure used to maintain state for a "completion"
> @@ -22,13 +22,22 @@
> * reinit_completion(), and macros DECLARE_COMPLETION(),
> * DECLARE_COMPLETION_ONSTACK().
> */
> +
> +#define COMPLETION_DEFER (1 << 0)
> +
> struct completion {
> - unsigned int done;
> - wait_queue_head_t wait;
> + union {
> + struct {
> + unsigned short flags;
> + unsigned short done;
> + };
> + unsigned int val;
> + };
> + struct swait_queue_head wait;
> };
>
> #define COMPLETION_INITIALIZER(work) \
> - { 0, __WAIT_QUEUE_HEAD_INITIALIZER((work).wait) }
> + { 0, 0, __SWAIT_QUEUE_HEAD_INITIALIZER((work).wait) }
>
> #define COMPLETION_INITIALIZER_ONSTACK(work) \
> ({ init_completion(&work); work; })
> @@ -72,8 +81,8 @@ struct completion {
> */
> static inline void init_completion(struct completion *x)
> {
> - x->done = 0;
> - init_waitqueue_head(&x->wait);
> + x->val = 0;
> + init_swait_queue_head(&x->wait);
> }
>
> /**
> @@ -85,7 +94,7 @@ static inline void init_completion(struct completion *x)
> */
> static inline void reinit_completion(struct completion *x)
> {
> - x->done = 0;
> + x->val = 0;
> }
>
> extern void wait_for_completion(struct completion *);
> diff --git a/include/linux/swait.h b/include/linux/swait.h
> index c1f9c62..83f004a 100644
> --- a/include/linux/swait.h
> +++ b/include/linux/swait.h
> @@ -87,6 +87,7 @@ static inline int swait_active(struct swait_queue_head *q)
> extern void swake_up(struct swait_queue_head *q);
> extern void swake_up_all(struct swait_queue_head *q);
> extern void swake_up_locked(struct swait_queue_head *q);
> +extern void swake_up_all_locked(struct swait_queue_head *q);
>
> extern void __prepare_to_swait(struct swait_queue_head *q, struct swait_queue *wait);
> extern void prepare_to_swait(struct swait_queue_head *q, struct swait_queue *wait, int state);
> diff --git a/kernel/sched/completion.c b/kernel/sched/completion.c
> index 8d0f35d..d4dccd3 100644
> --- a/kernel/sched/completion.c
> +++ b/kernel/sched/completion.c
> @@ -30,10 +30,10 @@ void complete(struct completion *x)
> {
> unsigned long flags;
>
> - spin_lock_irqsave(&x->wait.lock, flags);
> + raw_spin_lock_irqsave(&x->wait.lock, flags);
> x->done++;
> - __wake_up_locked(&x->wait, TASK_NORMAL, 1);
> - spin_unlock_irqrestore(&x->wait.lock, flags);
> + swake_up_locked(&x->wait);
> + raw_spin_unlock_irqrestore(&x->wait.lock, flags);
> }
> EXPORT_SYMBOL(complete);
>
> @@ -50,10 +50,15 @@ void complete_all(struct completion *x)
> {
> unsigned long flags;
>
> - spin_lock_irqsave(&x->wait.lock, flags);
> - x->done += UINT_MAX/2;
> - __wake_up_locked(&x->wait, TASK_NORMAL, 0);
> - spin_unlock_irqrestore(&x->wait.lock, flags);
> + raw_spin_lock_irqsave(&x->wait.lock, flags);
> + x->done += USHRT_MAX/2;
> + if (irqs_disabled_flags(flags)) {
> + x->flags = COMPLETION_DEFER;
> + swake_up_locked(&x->wait);
Would it impact performance if we always did this? That would allow
us to drop the special case and the changes to struct completion.
> + } else {
> + swake_up_all_locked(&x->wait);
> + }
> + raw_spin_unlock_irqrestore(&x->wait.lock, flags);
> }
> EXPORT_SYMBOL(complete_all);
>
> @@ -62,20 +67,20 @@ do_wait_for_common(struct completion *x,
> long (*action)(long), long timeout, int state)
> {
> if (!x->done) {
> - DECLARE_WAITQUEUE(wait, current);
> + DECLARE_SWAITQUEUE(wait);
>
> - __add_wait_queue_tail_exclusive(&x->wait, &wait);
> + __prepare_to_swait(&x->wait, &wait);
> do {
> if (signal_pending_state(state, current)) {
> timeout = -ERESTARTSYS;
> break;
> }
> __set_current_state(state);
> - spin_unlock_irq(&x->wait.lock);
> + raw_spin_unlock_irq(&x->wait.lock);
> timeout = action(timeout);
> - spin_lock_irq(&x->wait.lock);
> + raw_spin_lock_irq(&x->wait.lock);
> } while (!x->done && timeout);
> - __remove_wait_queue(&x->wait, &wait);
> + __finish_swait(&x->wait, &wait);
> if (!x->done)
> return timeout;
> }
> @@ -89,9 +94,13 @@ __wait_for_common(struct completion *x,
> {
> might_sleep();
>
> - spin_lock_irq(&x->wait.lock);
> + raw_spin_lock_irq(&x->wait.lock);
> timeout = do_wait_for_common(x, action, timeout, state);
> - spin_unlock_irq(&x->wait.lock);
> + raw_spin_unlock_irq(&x->wait.lock);
> + if (x->flags & COMPLETION_DEFER) {
> + x->flags = 0;
> + swake_up_all(&x->wait);
> + }
> return timeout;
> }
>
> @@ -277,12 +286,12 @@ bool try_wait_for_completion(struct completion *x)
> if (!READ_ONCE(x->done))
> return 0;
>
> - spin_lock_irqsave(&x->wait.lock, flags);
> + raw_spin_lock_irqsave(&x->wait.lock, flags);
> if (!x->done)
> ret = 0;
> else
> x->done--;
> - spin_unlock_irqrestore(&x->wait.lock, flags);
> + raw_spin_unlock_irqrestore(&x->wait.lock, flags);
> return ret;
> }
> EXPORT_SYMBOL(try_wait_for_completion);
> @@ -311,7 +320,7 @@ bool completion_done(struct completion *x)
> * after it's acquired the lock.
> */
> smp_rmb();
> - spin_unlock_wait(&x->wait.lock);
> + raw_spin_unlock_wait(&x->wait.lock);
> return true;
> }
> EXPORT_SYMBOL(completion_done);
> diff --git a/kernel/sched/swait.c b/kernel/sched/swait.c
> index 82f0dff..efe366b 100644
> --- a/kernel/sched/swait.c
> +++ b/kernel/sched/swait.c
> @@ -72,6 +72,30 @@ void swake_up_all(struct swait_queue_head *q)
> }
> EXPORT_SYMBOL(swake_up_all);
>
> +void swake_up_all_locked(struct swait_queue_head *q)
> +{
> + struct swait_queue *curr;
> + LIST_HEAD(tmp);
> +
> + if (!swait_active(q))
> + return;
> +
> + list_splice_init(&q->task_list, &tmp);
> + while (!list_empty(&tmp)) {
> + curr = list_first_entry(&tmp, typeof(*curr), task_list);
> +
> + wake_up_state(curr->task, TASK_NORMAL);
> + list_del_init(&curr->task_list);
> +
> + if (list_empty(&tmp))
> + break;
> +
> + raw_spin_unlock_irq(&q->lock);
> + raw_spin_lock_irq(&q->lock);
> + }
> +}
> +EXPORT_SYMBOL(swake_up_all_locked);
> +
> void __prepare_to_swait(struct swait_queue_head *q, struct swait_queue *wait)
> {
> wait->task = current;
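The struct completion hunk quoted above splits the old 32-bit counter into a 16-bit flags field and a 16-bit done field that share storage with val. A user-space sketch of that layout follows; struct completion_model and the model_*() helpers are hypothetical names, the wait queue head is left out, and a 16-bit unsigned short with a 32-bit unsigned int is assumed:

```c
#include <assert.h>
#include <limits.h>

#define COMPLETION_DEFER (1 << 0)

/*
 * Hypothetical user-space stand-in for the patched struct completion,
 * without the swait queue head.
 */
struct completion_model {
	union {
		struct {
			unsigned short flags;
			unsigned short done;
		};
		unsigned int val;
	};
};

/* Mirrors init_completion()/reinit_completion(): one store clears both halves. */
void model_reinit(struct completion_model *x)
{
	/* Assumption: two unsigned shorts exactly fill one unsigned int. */
	assert(sizeof(x->val) == 2 * sizeof(x->done));
	x->val = 0;
}

/* Mirrors only the counter and flag update of the patched complete_all(). */
void model_complete_all(struct completion_model *x, int defer)
{
	x->done += USHRT_MAX / 2;	/* 32767 with a 16-bit unsigned short */
	if (defer)
		x->flags = COMPLETION_DEFER;
}
```

With these assumptions the struct stays the size of the old unsigned int counter, and done can absorb about 32k wakeups per complete_all(), which is the limit the commit message mentions.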
* Re: [PATCH v2] sched/completion: convert completions to use simple wait queues
2016-05-12 14:08 ` Daniel Wagner
@ 2016-05-16 20:38 ` Luiz Capitulino
2016-05-23 14:09 ` Daniel Wagner
0 siblings, 1 reply; 6+ messages in thread
From: Luiz Capitulino @ 2016-05-16 20:38 UTC (permalink / raw)
To: Daniel Wagner
Cc: linux-kernel, linux-rt-users, Peter Zijlstra (Intel),
Thomas Gleixner, Sebastian Andrzej Siewior, Daniel Wagner
On Thu, 12 May 2016 16:08:34 +0200
Daniel Wagner <wagi@monom.org> wrote:
> In short, I haven't figured out yet why the kernel builds get slightly slower.
You're doing make -j 200, right? How many cores do you have? Couldn't it
be that you're saturating your CPUs?
You could try make -j<NR CPUs>, or some process creation benchmark. Although
I don't know what's the best way to measure this.
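One possible shape for such a process creation benchmark, as a minimal sketch: fork a child, let it exit immediately, reap it, and time the whole loop. time_forks() is a hypothetical helper, and whether this resembles the fork test used for the histograms in the patch description is not known.

```c
#define _POSIX_C_SOURCE 200809L

#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/*
 * Fork n children one after another, reaping each before starting the next.
 * Returns the elapsed wall-clock time in seconds, or -1.0 on failure.
 */
double time_forks(int n)
{
	struct timespec start, end;

	if (clock_gettime(CLOCK_MONOTONIC, &start) != 0)
		return -1.0;

	for (int i = 0; i < n; i++) {
		pid_t pid = fork();

		if (pid < 0)
			return -1.0;
		if (pid == 0)
			_exit(0);	/* child: do nothing, exit right away */
		if (waitpid(pid, NULL, 0) < 0)
			return -1.0;
	}

	if (clock_gettime(CLOCK_MONOTONIC, &end) != 0)
		return -1.0;

	return (double)(end.tv_sec - start.tv_sec) +
	       (double)(end.tv_nsec - start.tv_nsec) / 1e9;
}
```

Calling time_forks(1000) repeatedly on both kernels and comparing the distributions would take the compiler and the job count out of the picture, at the cost of measuring only one path.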
* Re: [PATCH v2] sched/completion: convert completions to use simple wait queues
2016-05-16 20:38 ` Luiz Capitulino
@ 2016-05-23 14:09 ` Daniel Wagner
0 siblings, 0 replies; 6+ messages in thread
From: Daniel Wagner @ 2016-05-23 14:09 UTC (permalink / raw)
To: Luiz Capitulino
Cc: linux-kernel, linux-rt-users, Peter Zijlstra (Intel),
Thomas Gleixner, Sebastian Andrzej Siewior, Daniel Wagner
[Sorry for the late response. I was on holiday for a few days.]
On 05/16/2016 10:38 PM, Luiz Capitulino wrote:
> On Thu, 12 May 2016 16:08:34 +0200
> Daniel Wagner <wagi@monom.org> wrote:
>
>> In short, I haven't figured out yet why the kernel builds get slightly slower.
>
> You're doing make -j 200, right? How many cores do you have? Couldn't it
> be that you're saturating your CPUs?
For the above numbers I used mmtest as the test framework with
2x<NR CPUs>, that is 128.
> You could try make -j<NR CPUs>, or some process creation benchmark. Although
> I don't know what's the best way to measure this.
Yeah, I don't consider the kernel build benchmark a good workload for
figuring out what's going on. It's more a way to see that something is
hiding. The micro benchmarks I have used so far couldn't highlight the
problem(s). I guess more testing is needed.
cheers,
daniel