* Warning from swake_up_all in 4.14.15-rt13 non-RT
@ 2018-03-05 15:08 Corey Minyard
  2018-03-06 17:46 ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 23+ messages in thread
From: Corey Minyard @ 2018-03-05 15:08 UTC (permalink / raw)
  To: linux-rt-users, linux-kernel, Tejun Heo

Starting with the change

8a64547a07980f9d25e962a78c2e10ee82bdb742 fs/dcache: use swait_queue instead of waitqueue

we are getting the following warning when running with PREEMPT__LL when
inserting a USB drive.  This is on x86_64, 4.14.15-rt13.  It works fine
with PREEMPT_RT.

[  155.604042] usb 1-2: new high-speed USB device number 7 using xhci_hcd
[  155.736588] usb 1-2: New USB device found, idVendor=0781, idProduct=5567
[  155.743291] usb 1-2: New USB device strings: Mfr=1, Product=2, SerialNumber=3
[  155.750423] usb 1-2: Product: Cruzer Blade
[  155.754517] usb 1-2: Manufacturer: SanDisk
[  155.758616] usb 1-2: SerialNumber: 4C530302900731101541
[  155.764207] usb-storage 1-2:1.0: USB Mass Storage device detected
[  155.770457] scsi host7: usb-storage 1-2:1.0
[  156.831919] scsi 7:0:0:0: Direct-Access     SanDisk  Cruzer Blade     1.26 PQ: 0 ANSI: 6
[  156.840160] sd 7:0:0:0: Attached scsi generic sg1 type 0
[  156.845766] ------------[ cut here ]------------
[  156.850387] WARNING: CPU: 0 PID: 36 at kernel/sched/swait.c:72 swake_up_all+0xb4/0xc0
[  156.858208] Modules linked in:
[  156.861259] CPU: 0 PID: 36 Comm: kworker/0:1 Not tainted 4.14.15-rt13 #1
[  156.867950] Hardware name: Supermicro Super Server/To be filled by O.E.M., BIOS T20170302175436 03/02/2017
[  156.877590] Workqueue: events_freezable usb_stor_scan_dwork
[  156.883159] task: ffff8c7ead6c6a00 task.stack: ffffb19dc19d0000
[  156.889072] RIP: 0010:swake_up_all+0xb4/0xc0
[  156.893334] RSP: 0000:ffffb19dc19d3be0 EFLAGS: 00010046
[  156.898550] RAX: 0000000000000046 RBX: ffff8c7eab451788 RCX: 0000000000000000
[  156.905673] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff8c7eab451770
[  156.912798] RBP: ffff8c7eab451770 R08: 0000000000023ca0 R09: ffffffff8bb7663e
[  156.919920] R10: ffffd80dd1a7dfc0 R11: 0000000000000000 R12: ffffb19dc19d3be0
[  156.927045] R13: 0000000000000003 R14: ffff8c7ea9e0e800 R15: ffff8c7ea69e5000
[  156.934171] FS:  0000000000000000(0000) GS:ffff8c7ebfc00000(0000) knlGS:0000000000000000
[  156.942246] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  156.947983] CR2: 00000000fffd5000 CR3: 000000046bb4e000 CR4: 00000000003406f0
[  156.955108] Call Trace:
[  156.957556]  percpu_ref_kill_and_confirm+0x93/0xa0
[  156.962345]  blk_freeze_queue_start+0x25/0x30
[  156.966696]  blk_set_queue_dying+0x2b/0x90
[  156.970786]  blk_cleanup_queue+0x28/0x110
[  156.974793]  __scsi_remove_device+0x66/0x130
[  156.979063]  scsi_probe_and_add_lun+0x878/0xbd0
[  156.983587]  ? scsi_probe_and_add_lun+0x9df/0xbd0
[  156.988285]  __scsi_scan_target+0x1e8/0x550
[  156.992462]  ? __wake_up_common_lock+0x79/0x90
[  156.996899]  scsi_scan_channel+0x5b/0x80
[  157.000815]  scsi_scan_host_selected+0xbe/0xf0
[  157.005252]  scsi_scan_host+0x15e/0x1a0
[  157.009083]  usb_stor_scan_dwork+0x1d/0x80
[  157.013177]  process_one_work+0x1dd/0x3e0
[  157.017189]  worker_thread+0x26/0x400
[  157.020844]  ? cancel_delayed_work+0x10/0x10
[  157.025107]  kthread+0x116/0x130
[  157.028333]  ? kthread_create_on_node+0x40/0x40
[  157.032858]  ret_from_fork+0x35/0x40
[  157.036435] Code: 49 39 c4 74 17 c6 45 00 00 fb 48 8d 7d 00 e8 c4 8a 97 00 48 8b 04 24 49 39 c4 75 b9 c6 45 00 00 fb 48 8d 64 24 10 5b 5d 41 5c c3
[  157.055292] ---[ end trace 86c20fd8d6c01794 ]---
[  157.060040] sd 7:0:0:0: [sdb] 15633408 512-byte logical blocks: (8.00 GB/7.45 GiB)
[  157.070089] sd 7:0:0:0: [sdb] Write Protect is off
[  157.075183] sd 7:0:0:0: [sdb] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
[  157.100295]  sdb: sdb1
[  157.103778] sd 7:0:0:0: [sdb] Attached SCSI disk
[  157.379921] FAT-fs (sdb1): Volume was not properly unmounted. Some data may be corrupt. Please run fsck.


The following change is the obvious reason:

--- a/kernel/sched/swait.c
+++ b/kernel/sched/swait.c
@@ -69,6 +69,7 @@ void swake_up_all(struct swait_queue_head *q)
         struct swait_queue *curr;
         LIST_HEAD(tmp);

+       WARN_ON(irqs_disabled());
         raw_spin_lock_irq(&q->lock);
         list_splice_init(&q->task_list, &tmp);
         while (!list_empty(&tmp)) {

I've done a little bit of analysis here, percpu_ref_kill_and_confirm()
does spin_lock_irqsave() and then does a percpu_ref_put().  If the
refcount reaches zero, the release function of the refcount is
called.  In this case, the block code has set this to
blk_queue_usage_counter_release(), which calls swake_up_all().
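
For illustration, the problematic pattern boils down to something like
this (a condensed sketch of my reading of lib/percpu-refcount.c, not the
exact 4.14 code):

	/* percpu_ref_kill_and_confirm(), abridged */
	spin_lock_irqsave(&percpu_ref_switch_lock, flags);
	...
	percpu_ref_put(ref);	/* may drop the last reference,        */
				/* -> ref->release()                   */
				/* -> blk_queue_usage_counter_release()*/
				/* -> swake_up_all() with IRQs off     */
	spin_unlock_irqrestore(&percpu_ref_switch_lock, flags);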

It seems like a bad idea to call percpu_ref_put() with interrupts
disabled.  This problem actually doesn't appear to be RT-related,
there's just no warning if the RT tree isn't used.

I'm not sure if it's best to just do the put outside the lock, or
have a modified put function that returns a bool to know if a release
is required, then the release function can be called outside the
lock.  I can do patches and test, but I'm hoping for a little
guidance here.
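
To make the second option concrete, the caller side could look roughly
like this (a sketch only; percpu_ref_put_norelease() is a hypothetical
helper, not an existing API):

	bool release;

	spin_lock_irqsave(&percpu_ref_switch_lock, flags);
	...
	/* hypothetical: drop the ref but never call ->release() here */
	release = percpu_ref_put_norelease(ref);
	spin_unlock_irqrestore(&percpu_ref_switch_lock, flags);

	if (release)
		ref->release(ref);	/* with interrupts enabled again */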

I'm also wondering why we don't have a warning like this in the
*_spin_lock_irq() macros, perhaps turned on with a debug
option.  That would catch things like this sooner.

Thanks,

-corey


* Re: Warning from swake_up_all in 4.14.15-rt13 non-RT
  2018-03-05 15:08 Warning from swake_up_all in 4.14.15-rt13 non-RT Corey Minyard
@ 2018-03-06 17:46 ` Sebastian Andrzej Siewior
  2018-03-06 22:51   ` Corey Minyard
  2018-03-07 15:45   ` Corey Minyard
  0 siblings, 2 replies; 23+ messages in thread
From: Sebastian Andrzej Siewior @ 2018-03-06 17:46 UTC (permalink / raw)
  To: Corey Minyard
  Cc: linux-rt-users, linux-kernel, Tejun Heo, Peter Zijlstra, Thomas Gleixner

On 2018-03-05 09:08:11 [-0600], Corey Minyard wrote:
> Starting with the change
> 
> 8a64547a07980f9d25e962a78c2e10ee82bdb742 fs/dcache: use swait_queue instead of waitqueue
…
> The following change is the obvious reason:
> 
> --- a/kernel/sched/swait.c
> +++ b/kernel/sched/swait.c
> @@ -69,6 +69,7 @@ void swake_up_all(struct swait_queue_head *q)
>         struct swait_queue *curr;
>         LIST_HEAD(tmp);
> 
> +       WARN_ON(irqs_disabled());
>         raw_spin_lock_irq(&q->lock);
>         list_splice_init(&q->task_list, &tmp);
>         while (!list_empty(&tmp)) {
> 
> I've done a little bit of analysis here, percpu_ref_kill_and_confirm()
> does spin_lock_irqsave() and then does a percpu_ref_put().  If the
> refcount reaches zero, the release function of the refcount is
> called.  In this case, the block code has set this to
> blk_queue_usage_counter_release(), which calls swake_up_all().
> 
> It seems like a bad idea to call percpu_ref_put() with interrupts
> disabled.  This problem actually doesn't appear to be RT-related,
> there's just no warning if the RT tree isn't used.

yeah but vanilla uses wake_up() which does spin_lock_irqsave() so it is
not an issue there.

The odd part here is that percpu_ref_kill_and_confirm() does _irqsave()
which suggests that it might be called from any context and then it does
wait_event_lock_irq() which enables interrupts again while it waits. So
it can't be used from any context.

> I'm not sure if it's best to just do the put outside the lock, or
> have a modified put function that returns a bool to know if a release
> is required, then the release function can be called outside the
> lock.  I can do patches and test, but I'm hoping for a little
> guidance here.

swake_up_all() does raw_spin_lock_irq() because it should be called from
non-IRQ context. And it drops the lock (+IRQ enable) between wake-ups in
case we "need_resched()" because we woke a high-priority waiter. There
is the list_splice because we wanted to drop the locks (and have IRQs
open) during the entire wake up process but finish_swait() may happen
during the wake up and so we must hold the lock while the list-item is
removed from the queue head.
I have no idea what is the wisest thing to do here. The obvious fix
would be to use the irqsave() variant here and not drop the lock between
wake ups. That is essentially what swake_up_all_locked() does which I
need for the completions (and based on some testing most users have one
waiter except during PM and some crypto code).
It is probably no comparison to wake_up_q() (which does multiple wake
ups without a context switch) but then we did this before like that.

Preferably we would have a proper list_splice() and some magic in the
"early" dequeue part that works.

> I'm also wondering why we don't have a warning like this in the
> *_spin_lock_irq() macros, perhaps turned on with a debug
> option.  That would catch things like this sooner.

Ideally you would add lockdep_assert_irqs_enabled() to
local_irq_disable() so you would have it hidden behind lockdep with a
recursion check and everything. But this needs a lot of headers like
task_struct so…
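Something along these lines, as a sketch of the idea only (the real
local_irq_disable() also has tracing hooks which are omitted here):

	#define local_irq_disable()				\
		do {						\
			lockdep_assert_irqs_enabled();		\
			raw_local_irq_disable();		\
		} while (0)
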
I had once WARN_ON_ONCE(irqs_disabled()) added to testdrive it and had a
few false-positives in the early boot or constructs like in
__run_hrtimer(). I didn't look at it further…

> Thanks,
> 
> -corey

Sebastian


* Re: Warning from swake_up_all in 4.14.15-rt13 non-RT
  2018-03-06 17:46 ` Sebastian Andrzej Siewior
@ 2018-03-06 22:51   ` Corey Minyard
  2018-03-07 15:45   ` Corey Minyard
  1 sibling, 0 replies; 23+ messages in thread
From: Corey Minyard @ 2018-03-06 22:51 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: linux-rt-users, linux-kernel, Tejun Heo, Peter Zijlstra,
	Thomas Gleixner, Kent Overstreet

On 03/06/2018 11:46 AM, Sebastian Andrzej Siewior wrote:
> On 2018-03-05 09:08:11 [-0600], Corey Minyard wrote:
>> Starting with the change
>>
>> 8a64547a07980f9d25e962a78c2e10ee82bdb742 fs/dcache: use swait_queue instead of waitqueue
> …
>> The following change is the obvious reason:
>>
>> --- a/kernel/sched/swait.c
>> +++ b/kernel/sched/swait.c
>> @@ -69,6 +69,7 @@ void swake_up_all(struct swait_queue_head *q)
>>          struct swait_queue *curr;
>>          LIST_HEAD(tmp);
>>
>> +       WARN_ON(irqs_disabled());
>>          raw_spin_lock_irq(&q->lock);
>>          list_splice_init(&q->task_list, &tmp);
>>          while (!list_empty(&tmp)) {
>>
>> I've done a little bit of analysis here, percpu_ref_kill_and_confirm()
>> does spin_lock_irqsave() and then does a percpu_ref_put().  If the
>> refcount reaches zero, the release function of the refcount is
>> called.  In this case, the block code has set this to
>> blk_queue_usage_counter_release(), which calls swake_up_all().
>>
>> It seems like a bad idea to call percpu_ref_put() with interrupts
>> disabled.  This problem actually doesn't appear to be RT-related,
>> there's just no warning if the RT tree isn't used.
> yeah but vanilla uses wake_up() which does spin_lock_irqsave() so it is
> not an issue there.
>
> The odd part here is that percpu_ref_kill_and_confirm() does _irqsave()
> which suggests that it might be called from any context and then it does
> wait_event_lock_irq() which enables interrupts again while it waits. So
> it can't be used from any context.

I'm adding the author (Kent) to this email; I should have done that
originally.

You are right, it looks like all the percpu_ref_switch.. and
percpu_ref_kill... functions are broken here.

I also don't understand the need for a global lock for non-global variables.
It looks like this could become a bottleneck in a big SMP system.

I'm going to spend some time with this and try to figure out what is going
on.  Hopefully Kent or Tejun can offer some insight.

-corey

>> I'm not sure if it's best to just do the put outside the lock, or
>> have a modified put function that returns a bool to know if a release
>> is required, then the release function can be called outside the
>> lock.  I can do patches and test, but I'm hoping for a little
>> guidance here.
> swake_up_all() does raw_spin_lock_irq() because it should be called from
> non-IRQ context. And it drops the lock (+IRQ enable) between wake-ups in
> case we "need_resched()" because we woke a high-priority waiter. There
> is the list_splice because we wanted to drop the locks (and have IRQs
> open) during the entire wake up process but finish_swait() may happen
> during the wake up and so we must hold the lock while the list-item is
> removed from the queue head.
> I have no idea what is the wisest thing to do here. The obvious fix
> would be to use the irqsave() variant here and not drop the lock between
> wake ups. That is essentially what swake_up_all_locked() does which I
> need for the completions (and based on some testing most users have one
> waiter except during PM and some crypto code).
> It is probably no comparison to wake_up_q() (which does multiple wake
> ups without a context switch) but then we did this before like that.
>
> Preferably we would have a proper list_splice() and some magic in the
> "early" dequeue part that works.
>
>> I'm also wondering why we don't have a warning like this in the
>> *_spin_lock_irq() macros, perhaps turned on with a debug
>> option.  That would catch things like this sooner.
> Ideally you would add lockdep_assert_irqs_enabled() to
> local_irq_disable() so you would have it hidden behind lockdep with a
> recursion check and everything. But this needs a lot of headers like
> task_struct so…
> I had once WARN_ON_ONCE(irqs_disabled()) added to testdrive it and had a
> few false-positives in the early boot or constructs like in
> __run_hrtimer(). I didn't look at it further…
>
>> Thanks,
>>
>> -corey
> Sebastian


* Re: Warning from swake_up_all in 4.14.15-rt13 non-RT
  2018-03-06 17:46 ` Sebastian Andrzej Siewior
  2018-03-06 22:51   ` Corey Minyard
@ 2018-03-07 15:45   ` Corey Minyard
  2018-03-08 17:41     ` Sebastian Andrzej Siewior
  1 sibling, 1 reply; 23+ messages in thread
From: Corey Minyard @ 2018-03-07 15:45 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: linux-rt-users, linux-kernel, Tejun Heo, Peter Zijlstra, Thomas Gleixner

On 03/06/2018 11:46 AM, Sebastian Andrzej Siewior wrote:
> On 2018-03-05 09:08:11 [-0600], Corey Minyard wrote:
>> Starting with the change
>>
>> 8a64547a07980f9d25e962a78c2e10ee82bdb742 fs/dcache: use swait_queue instead of waitqueue
> …
>> The following change is the obvious reason:
>>
>> --- a/kernel/sched/swait.c
>> +++ b/kernel/sched/swait.c
>> @@ -69,6 +69,7 @@ void swake_up_all(struct swait_queue_head *q)
>>          struct swait_queue *curr;
>>          LIST_HEAD(tmp);
>>
>> +       WARN_ON(irqs_disabled());
>>          raw_spin_lock_irq(&q->lock);
>>          list_splice_init(&q->task_list, &tmp);
>>          while (!list_empty(&tmp)) {
>>
>> I've done a little bit of analysis here, percpu_ref_kill_and_confirm()
>> does spin_lock_irqsave() and then does a percpu_ref_put().  If the
>> refcount reaches zero, the release function of the refcount is
>> called.  In this case, the block code has set this to
>> blk_queue_usage_counter_release(), which calls swake_up_all().
>>
>> It seems like a bad idea to call percpu_ref_put() with interrupts
>> disabled.  This problem actually doesn't appear to be RT-related,
>> there's just no warning if the RT tree isn't used.
> yeah but vanilla uses wake_up() which does spin_lock_irqsave() so it is
> not an issue there.
>
> The odd part here is that percpu_ref_kill_and_confirm() does _irqsave()
> which suggests that it might be called from any context and then it does
> wait_event_lock_irq() which enables interrupts again while it waits. So
> it can't be used from any context.
>
>> I'm not sure if it's best to just do the put outside the lock, or
>> have a modified put function that returns a bool to know if a release
>> is required, then the release function can be called outside the
>> lock.  I can do patches and test, but I'm hoping for a little
>> guidance here.
> swake_up_all() does raw_spin_lock_irq() because it should be called from
> non-IRQ context. And it drops the lock (+IRQ enable) between wake-ups in
> case we "need_resched()" because we woke a high-priority waiter. There
> is the list_splice because we wanted to drop the locks (and have IRQs
> open) during the entire wake up process but finish_swait() may happen
> during the wake up and so we must hold the lock while the list-item is
> removed from the queue head.
> I have no idea what is the wisest thing to do here. The obvious fix
> would be to use the irqsave() variant here and not drop the lock between
> wake ups. That is essentially what swake_up_all_locked() does which I
> need for the completions (and based on some testing most users have one
> waiter except during PM and some crypto code).
> It is probably no comparison to wake_up_q() (which does multiple wake
> ups without a context switch) but then we did this before like that.
>
> Preferably we would have a proper list_splice() and some magic in the
> "early" dequeue part that works.
>

Maybe just modify the block code to run the swake_up_all() call in a
workqueue or tasklet?  If you think that works, I'll create a patch,
test it, and submit it if all goes well.
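
A rough sketch of what I mean (mq_freeze_work is a hypothetical field
that would need to be added to struct request_queue and initialized at
queue-allocation time):

	static void blk_usage_release_work(struct work_struct *work)
	{
		struct request_queue *q =
			container_of(work, struct request_queue, mq_freeze_work);

		/* now in task context with interrupts enabled */
		swake_up_all(&q->mq_freeze_wq);
	}

	static void blk_queue_usage_counter_release(struct percpu_ref *ref)
	{
		struct request_queue *q =
			container_of(ref, struct request_queue, q_usage_counter);

		/* defer the wake up instead of doing it under the
		 * irqs-off percpu_ref lock */
		schedule_work(&q->mq_freeze_work);
	}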

Thanks,

-corey


* Re: Warning from swake_up_all in 4.14.15-rt13 non-RT
  2018-03-07 15:45   ` Corey Minyard
@ 2018-03-08 17:41     ` Sebastian Andrzej Siewior
  2018-03-08 19:54       ` Corey Minyard
  0 siblings, 1 reply; 23+ messages in thread
From: Sebastian Andrzej Siewior @ 2018-03-08 17:41 UTC (permalink / raw)
  To: Corey Minyard
  Cc: linux-rt-users, linux-kernel, Tejun Heo, Peter Zijlstra, Thomas Gleixner

On 2018-03-07 09:45:29 [-0600], Corey Minyard wrote:
> > I have no idea what is the wisest thing to do here. The obvious fix
> > would be to use the irqsave() variant here and not drop the lock between
> > wake ups. That is essentially what swake_up_all_locked() does which I
> > need for the completions (and based on some testing most users have one
> > waiter except during PM and some crypto code).
> > It is probably no comparison to wake_up_q() (which does multiple wake
> > ups without a context switch) but then we did this before like that.
> > 
> > Preferably we would have a proper list_splice() and some magic in the
> > "early" dequeue part that works.
> > 
> 
> Maybe just modify the block code to run the swake_up_all() call in a
> workqueue or tasklet?  If you think that works, I'll create a patch,
> test it, and submit it if all goes well.

It will work but I don't think pushing this into workqueue/tasklet is a
good idea. You want to wake up all waiters on waitqueue X (probably one
waiter) and instead there is one more wakeup + ctx-switch which does the
final wakeup.
But now I had an idea: swake_up_all() could iterate over the list and
instead of performing wakes it would just wake_q_add() the tasks. Drop the
lock and then wake_up_q(). So in case there is a wakeup pending and the
task removed itself from the list then the task may observe a spurious
wakeup.

> Thanks,
> 
> -corey

Sebastian


* Re: Warning from swake_up_all in 4.14.15-rt13 non-RT
  2018-03-08 17:41     ` Sebastian Andrzej Siewior
@ 2018-03-08 19:54       ` Corey Minyard
  2018-03-09 11:04         ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 23+ messages in thread
From: Corey Minyard @ 2018-03-08 19:54 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: linux-rt-users, linux-kernel, Tejun Heo, Peter Zijlstra, Thomas Gleixner

On 03/08/2018 11:41 AM, Sebastian Andrzej Siewior wrote:
> On 2018-03-07 09:45:29 [-0600], Corey Minyard wrote:
>>> I have no idea what is the wisest thing to do here. The obvious fix
>>> would be to use the irqsave() variant here and not drop the lock between
>>> wake ups. That is essentially what swake_up_all_locked() does which I
>>> need for the completions (and based on some testing most users have one
>>> waiter except during PM and some crypto code).
>>> It is probably no comparison to wake_up_q() (which does multiple wake
>>> ups without a context switch) but then we did this before like that.
>>>
>>> Preferably we would have a proper list_splice() and some magic in the
>>> "early" dequeue part that works.
>>>
>> Maybe just modify the block code to run the swake_up_all() call in a
>> workqueue or tasklet?  If you think that works, I'll create a patch,
>> test it, and submit it if all goes well.
> It will work but I don't think pushing this into workqueue/tasklet is a
> good idea. You want to wake up all waiters on waitqueue X (probably one
> waiter) and instead there is one more wakeup + ctx-switch which does the
> final wakeup.

True, but this is an uncommon and already fairly expensive operation being
done.  Adding a context switch doesn't seem that bad.

> But now I had an idea: swake_up_all() could iterate over the list and
> instead of performing wakes it would just wake_q_add() the tasks. Drop the
> lock and then wake_up_q(). So in case there is a wakeup pending and the
> task removed itself from the list then the task may observe a spurious
> wakeup.

That sounds promising, but where does wake_up_q() get called?  No matter
what, it's an expensive operation and I'm not sure where you would put it
in this case.

I had another idea.  This is only occurring if RT is not enabled, because
with RT all the irq disable things go away and you are generally running
in task context.  So why not have a different version of swake_up_all()
for non-RT that does work from irqs-off context?

-corey


* Re: Warning from swake_up_all in 4.14.15-rt13 non-RT
  2018-03-08 19:54       ` Corey Minyard
@ 2018-03-09 11:04         ` Sebastian Andrzej Siewior
  2018-03-09 13:29           ` Corey Minyard
  2018-03-09 17:46           ` Peter Zijlstra
  0 siblings, 2 replies; 23+ messages in thread
From: Sebastian Andrzej Siewior @ 2018-03-09 11:04 UTC (permalink / raw)
  To: Corey Minyard, Peter Zijlstra, Thomas Gleixner, Steven Rostedt
  Cc: linux-rt-users, linux-kernel, Tejun Heo

On 2018-03-08 13:54:17 [-0600], Corey Minyard wrote:
> > It will work but I don't think pushing this into workqueue/tasklet is a
> > good idea. You want to wake up all waiters on waitqueue X (probably one
> > waiter) and instead there is one more wakeup + ctx-switch which does the
> > final wakeup.
> 
> True, but this is an uncommon and already fairly expensive operation being
> done.  Adding a context switch doesn't seem that bad.

still no need to make it more expensive if it can be avoided.

> > But now I had an idea: swake_up_all() could iterate over the list and
> > instead of performing wakes it would just wake_q_add() the tasks. Drop the
> > lock and then wake_up_q(). So in case there is a wakeup pending and the
> > task removed itself from the list then the task may observe a spurious
> > wakeup.
> 
> That sounds promising, but where does wake_up_q() get called?  No matter
> what, it's an expensive operation and I'm not sure where you would put it
> in this case.

Look at this:

Subject: [RFC PATCH RT] sched/swait: use WAKE_Q for possible multiple wake ups

Corey Minyard reported a swake_up_all() invocation with interrupts
disabled, with the RT patch applied but RT disabled (low latency config).
The reason why swake_up_all() avoids the irqsave variant is that it
shouldn't be called from an IRQ-disabled section. The idea was to wake up
one task after the other and to enable interrupts (and drop the lock)
between the wake ups so we can schedule away in case a task with a
higher priority was just woken up.
In RT we have swait based completions so I kind of needed a complete()
which could wake multiple sleepers without dropping the lock and
enabling interrupts.
To work around this shortcoming I propose to use WAKE_Q. swake_up_all()
will queue all to-be-woken tasks on a wake-queue with interrupts
disabled, which should be "quick". After dropping the lock (and enabling
interrupts) it can wake the tasks one after the other.

Reported-by: Corey Minyard <cminyard@mvista.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 include/linux/swait.h     |  4 +++-
 kernel/sched/completion.c |  5 ++++-
 kernel/sched/swait.c      | 35 ++++++++++-------------------------
 3 files changed, 17 insertions(+), 27 deletions(-)

diff --git a/include/linux/swait.h b/include/linux/swait.h
index 853f3e61a9f4..929721cffdb3 100644
--- a/include/linux/swait.h
+++ b/include/linux/swait.h
@@ -148,7 +148,9 @@ static inline bool swq_has_sleeper(struct swait_queue_head *wq)
 extern void swake_up(struct swait_queue_head *q);
 extern void swake_up_all(struct swait_queue_head *q);
 extern void swake_up_locked(struct swait_queue_head *q);
-extern void swake_up_all_locked(struct swait_queue_head *q);
+
+struct wake_q_head;
+extern void swake_add_all_wq(struct swait_queue_head *q, struct wake_q_head *wq);
 
 extern void __prepare_to_swait(struct swait_queue_head *q, struct swait_queue *wait);
 extern void prepare_to_swait(struct swait_queue_head *q, struct swait_queue *wait, int state);
diff --git a/kernel/sched/completion.c b/kernel/sched/completion.c
index 0fe2982e46a0..461d992e30f9 100644
--- a/kernel/sched/completion.c
+++ b/kernel/sched/completion.c
@@ -15,6 +15,7 @@
 #include <linux/sched/signal.h>
 #include <linux/sched/debug.h>
 #include <linux/completion.h>
+#include <linux/sched/wake_q.h>
 
 /**
  * complete: - signals a single thread waiting on this completion
@@ -65,11 +66,13 @@ EXPORT_SYMBOL(complete);
 void complete_all(struct completion *x)
 {
 	unsigned long flags;
+	DEFINE_WAKE_Q(wq);
 
 	raw_spin_lock_irqsave(&x->wait.lock, flags);
 	x->done = UINT_MAX;
-	swake_up_all_locked(&x->wait);
+	swake_add_all_wq(&x->wait, &wq);
 	raw_spin_unlock_irqrestore(&x->wait.lock, flags);
+	wake_up_q(&wq);
 }
 EXPORT_SYMBOL(complete_all);
 
diff --git a/kernel/sched/swait.c b/kernel/sched/swait.c
index b14638a05ec9..1a09cc425bd8 100644
--- a/kernel/sched/swait.c
+++ b/kernel/sched/swait.c
@@ -2,6 +2,7 @@
 #include <linux/sched/signal.h>
 #include <linux/swait.h>
 #include <linux/suspend.h>
+#include <linux/sched/wake_q.h>
 
 void __init_swait_queue_head(struct swait_queue_head *q, const char *name,
 			     struct lock_class_key *key)
@@ -31,24 +32,19 @@ void swake_up_locked(struct swait_queue_head *q)
 }
 EXPORT_SYMBOL(swake_up_locked);
 
-void swake_up_all_locked(struct swait_queue_head *q)
+void swake_add_all_wq(struct swait_queue_head *q, struct wake_q_head *wq)
 {
 	struct swait_queue *curr;
-	int wakes = 0;
 
 	while (!list_empty(&q->task_list)) {
 
 		curr = list_first_entry(&q->task_list, typeof(*curr),
 					task_list);
-		wake_up_process(curr->task);
 		list_del_init(&curr->task_list);
-		wakes++;
+		wake_q_add(wq, curr->task);
 	}
-	if (pm_in_action)
-		return;
-	WARN(wakes > 2, "complete_all() with %d waiters\n", wakes);
 }
-EXPORT_SYMBOL(swake_up_all_locked);
+EXPORT_SYMBOL(swake_add_all_wq);
 
 void swake_up(struct swait_queue_head *q)
 {
@@ -66,25 +62,14 @@ EXPORT_SYMBOL(swake_up);
  */
 void swake_up_all(struct swait_queue_head *q)
 {
-	struct swait_queue *curr;
-	LIST_HEAD(tmp);
+	unsigned long flags;
+	DEFINE_WAKE_Q(wq);
 
-	WARN_ON(irqs_disabled());
-	raw_spin_lock_irq(&q->lock);
-	list_splice_init(&q->task_list, &tmp);
-	while (!list_empty(&tmp)) {
-		curr = list_first_entry(&tmp, typeof(*curr), task_list);
+	raw_spin_lock_irqsave(&q->lock, flags);
+	swake_add_all_wq(q, &wq);
+	raw_spin_unlock_irqrestore(&q->lock, flags);
 
-		wake_up_state(curr->task, TASK_NORMAL);
-		list_del_init(&curr->task_list);
-
-		if (list_empty(&tmp))
-			break;
-
-		raw_spin_unlock_irq(&q->lock);
-		raw_spin_lock_irq(&q->lock);
-	}
-	raw_spin_unlock_irq(&q->lock);
+	wake_up_q(&wq);
 }
 EXPORT_SYMBOL(swake_up_all);
 

> I had another idea.  This is only occurring if RT is not enabled, because
> with RT all the irq disable things go away and you are generally running
> in task context.  So why not have a different version of swake_up_all()
> for non-RT that does work from irqs-off context?

With the patch above I have the puzzle piece which would allow us to use
swait based completions upstream. That ifdef would probably not help.

> -corey

Sebastian


* Re: Warning from swake_up_all in 4.14.15-rt13 non-RT
  2018-03-09 11:04         ` Sebastian Andrzej Siewior
@ 2018-03-09 13:29           ` Corey Minyard
  2018-03-09 14:58             ` Sebastian Andrzej Siewior
  2018-03-09 17:46           ` Peter Zijlstra
  1 sibling, 1 reply; 23+ messages in thread
From: Corey Minyard @ 2018-03-09 13:29 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior, Peter Zijlstra, Thomas Gleixner,
	Steven Rostedt
  Cc: linux-rt-users, linux-kernel, Tejun Heo

On 03/09/2018 05:04 AM, Sebastian Andrzej Siewior wrote:
> On 2018-03-08 13:54:17 [-0600], Corey Minyard wrote:
>>> It will work but I don't think pushing this into workqueue/tasklet is a
>>> good idea. You want to wake up all waiters on waitqueue X (probably one
>>> waiter) and instead there is one more wakeup + ctx-switch which does the
>>> final wakeup.
>> True, but this is an uncommon and already fairly expensive operation being
>> done.  Adding a context switch doesn't seem that bad.
> still no need to make it more expensive if it can be avoided.
>
>>> But now I had an idea: swake_up_all() could iterate over the list and
>>> instead of performing wakes it would just wake_q_add() the tasks. Drop the
>>> lock and then wake_up_q(). So in case there is a wakeup pending and the
>>> task removed itself from the list then the task may observe a spurious
>>> wakeup.
>> That sounds promising, but where does wake_up_q() get called?  No matter
>> what, it's an expensive operation and I'm not sure where you would put it
>> in this case.
> Look at this:
>
...
>   
>   void swake_up(struct swait_queue_head *q)
>   {
> @@ -66,25 +62,14 @@ EXPORT_SYMBOL(swake_up);
>    */
>   void swake_up_all(struct swait_queue_head *q)
>   {
> -	struct swait_queue *curr;
> -	LIST_HEAD(tmp);
> +	unsigned long flags;
> +	DEFINE_WAKE_Q(wq);
>   
> -	WARN_ON(irqs_disabled());
> -	raw_spin_lock_irq(&q->lock);
> -	list_splice_init(&q->task_list, &tmp);
> -	while (!list_empty(&tmp)) {
> -		curr = list_first_entry(&tmp, typeof(*curr), task_list);
> +	raw_spin_lock_irqsave(&q->lock, flags);
> +	swake_add_all_wq(q, &wq);
> +	raw_spin_unlock_irqrestore(&q->lock, flags);
>   
> -		wake_up_state(curr->task, TASK_NORMAL);
> -		list_del_init(&curr->task_list);
> -
> -		if (list_empty(&tmp))
> -			break;
> -
> -		raw_spin_unlock_irq(&q->lock);
> -		raw_spin_lock_irq(&q->lock);
> -	}
> -	raw_spin_unlock_irq(&q->lock);
> +	wake_up_q(&wq);

From what I can tell, wake_up_q() is unbounded, and you have undone what
the previous code had tried to accomplish.  In the scenario I'm talking
about, interrupts are still disabled here.  That's why I was asking about
where to put wake_up_q(); I knew you could put it here, but it didn't seem
to me to help at all.

>   }
>   EXPORT_SYMBOL(swake_up_all);
>   
>
>> I had another idea.  This is only occurring if RT is not enabled, because
>> with RT all the irq disable things go away and you are generally running
>> in task context.  So why not have a different version of swake_up_all()
>> for non-RT that does work from irqs-off context?
> With the patch above I have the puzzle piece which would allow us to use
> swait based completions upstream. That ifdef would probably not help.

I agree that having a bounded-time way to wake up a bunch of threads while
interrupts are disabled would solve a bunch of issues.  I just don't see
how it can be done without pushing it off to a softirq or workqueue.

-corey

>> -corey
> Sebastian


* Re: Warning from swake_up_all in 4.14.15-rt13 non-RT
  2018-03-09 13:29           ` Corey Minyard
@ 2018-03-09 14:58             ` Sebastian Andrzej Siewior
  2018-03-09 16:03               ` Corey Minyard
  0 siblings, 1 reply; 23+ messages in thread
From: Sebastian Andrzej Siewior @ 2018-03-09 14:58 UTC (permalink / raw)
  To: Corey Minyard
  Cc: Peter Zijlstra, Thomas Gleixner, Steven Rostedt, linux-rt-users,
	linux-kernel, Tejun Heo

On 2018-03-09 07:29:31 [-0600], Corey Minyard wrote:
> From what I can tell, wake_up_q() is unbounded, and you have undone what
> the previous code had tried to accomplish.  In the scenario I'm talking
> about, interrupts are still disabled here.  That's why I was asking about
> where to put wake_up_q(); I knew you could put it here, but it didn't seem
> to me to help at all.

So you are worried about unbound latencies on !RT. Okay. So for !RT this
does not help, but it is not worse than before (before the RT patch was
applied and changed things).
In fact it is better now (with the RT patch and this one) because before
that patch you would not only open interrupts between the wake ups but you
would leave the function with interrupts open, which is wrong. Any
interrupt (or a context switch due to need_resched()) that would invoke
percpu_ref_put() would freeze the CPU/system.
Also, every user that invoked swake_up_all() with enabled interrupts
will still perform the wake up with enabled interrupts. So nothing
changes here.

> > > I had another idea.  This is only occurring if RT is not enabled, because
> > > with RT all the irq disable things go away and you are generally running
> > > in task context.  So why not have a different version of swake_up_all()
> > > for non-RT that does work from irqs-off context?
> > With the patch above I have the puzzle piece which would allow us to use
> > swait based completions upstream. That ifdef would probably not help.
> 
> I agree that having a bounded-time way to wake up a bunch of threads while
> interrupts are disabled would solve a bunch of issues.  I just don't see
> how it can be done without pushing it off to a softirq or workqueue.

true but this is a different story. We started with a WARN_ON() which
triggered correctly and the problem it pointed to looks solved to me.

This "unbounded runtime during the wake up of many tasks with interrupts
disabled via percpu_ref_kill() -> blk_queue_usage_counter_release()"
thing exists already in the vanilla kernel and does not exist
with the RT patch applied and RT enabled. If you are affected by this
and you don't like it - fine. Using a workqueue is one way of getting
around it (the softirq is not preemptible in !RT so it wouldn't change
much). However, I see no benefit in carrying such a patch because as I
said only !RT is affected by this.

> -corey

Sebastian


* Re: Warning from swake_up_all in 4.14.15-rt13 non-RT
  2018-03-09 14:58             ` Sebastian Andrzej Siewior
@ 2018-03-09 16:03               ` Corey Minyard
  0 siblings, 0 replies; 23+ messages in thread
From: Corey Minyard @ 2018-03-09 16:03 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Peter Zijlstra, Thomas Gleixner, Steven Rostedt, linux-rt-users,
	linux-kernel, Tejun Heo

On 03/09/2018 08:58 AM, Sebastian Andrzej Siewior wrote:
> On 2018-03-09 07:29:31 [-0600], Corey Minyard wrote:
>> From what I can tell, wake_up_q() is unbounded, and you have undone what
>> the previous code had tried to accomplish.  In the scenario I'm talking
>> about, interrupts are still disabled here.  That's why I was asking about
>> where to put wake_up_q(); I knew you could put it here, but it didn't seem
>> to me to help at all.
> So you are worried about unbound latencies on !RT. Okay. So for !RT this
> does not help, but it is not worse than before (before the RT patch was
> applied and changed things).
> In fact it is better now (with the RT patch and this one) because before
> that patch you would not only open interrupts between the wake ups but you
> would leave the function with interrupts open, which is wrong. Any
> interrupt (or a context switch due to need_resched()) that would invoke
> percpu_ref_put() would freeze the CPU/system.
> Also, every user that invoked swake_up_all() with enabled interrupts
> will still perform the wake up with enabled interrupts. So nothing
> changes here.

Ah, ok, that makes sense.  Sorry, I was mixing things up.  Yes, on RT
this should fix the unbounded time issue, and it should also solve the
interrupts-disabled issue on !RT.

I'll try this out.

-corey

>>>> I had another idea.  This is only occurring if RT is not enabled, because
>>>> with RT all the irq disable things go away and you are generally running
>>>> in task context.  So why not have a different version of swake_up_all()
>>>> for non-RT that does work from irqs-off context?
>>> With the patch above I have the puzzle piece which would allow us to use
>>> swait based completions upstream. That ifdef would probably not help.
>> I agree that having a bounded-time way to wake up a bunch of threads while
>> interrupts are disabled would solve a bunch of issues.  I just don't see
>> how it can be done without pushing it off to a softirq or workqueue.
> true but this is a different story. We started with a WARN_ON() which
> triggered correctly and the problem it pointed to looks solved to me.
>
> This "unbounded runtime during the wake up of many tasks with interrupts
> disabled via percpu_ref_kill() -> blk_queue_usage_counter_release()"
> thing exists already in the vanilla kernel and does not exist
> with the RT patch applied and RT enabled. If you are affected by this
> and you don't like it - fine. Using a workqueue is one way of getting
> around it (the softirq is not preemptible in !RT so it wouldn't change
> much). However, I see no benefit in carrying such a patch because as I
> said only !RT is affected by this.
>
>> -corey
> Sebastian


* Re: Warning from swake_up_all in 4.14.15-rt13 non-RT
  2018-03-09 11:04         ` Sebastian Andrzej Siewior
  2018-03-09 13:29           ` Corey Minyard
@ 2018-03-09 17:46           ` Peter Zijlstra
  2018-03-09 20:25             ` Sebastian Andrzej Siewior
  2018-03-09 22:02             ` Warning from swake_up_all in 4.14.15-rt13 non-RT Corey Minyard
  1 sibling, 2 replies; 23+ messages in thread
From: Peter Zijlstra @ 2018-03-09 17:46 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Corey Minyard, Thomas Gleixner, Steven Rostedt, linux-rt-users,
	linux-kernel, Tejun Heo

On Fri, Mar 09, 2018 at 12:04:18PM +0100, Sebastian Andrzej Siewior wrote:
> +void swake_add_all_wq(struct swait_queue_head *q, struct wake_q_head *wq)
>  {
>  	struct swait_queue *curr;
>  
>  	while (!list_empty(&q->task_list)) {
>  
>  		curr = list_first_entry(&q->task_list, typeof(*curr),
>  					task_list);
>  		list_del_init(&curr->task_list);
> +		wake_q_add(wq, curr->task);
>  	}
>  }
> +EXPORT_SYMBOL(swake_add_all_wq);
>  
>  void swake_up(struct swait_queue_head *q)
>  {
> @@ -66,25 +62,14 @@ EXPORT_SYMBOL(swake_up);
>   */
>  void swake_up_all(struct swait_queue_head *q)
>  {
> +	unsigned long flags;
> +	DEFINE_WAKE_Q(wq);
>  
> +	raw_spin_lock_irqsave(&q->lock, flags);
> +	swake_add_all_wq(q, &wq);
> +	raw_spin_unlock_irqrestore(&q->lock, flags);
>  
> +	wake_up_q(&wq);
>  }
>  EXPORT_SYMBOL(swake_up_all);

This is fundamentally wrong. The whole point of wake_up_all() is that
_all_ is unbounded and should not ever land in a single critical
section, be it IRQ or PREEMPT disabled. The above does both.

Yes, wake_up_all() is crap, it is also fundamentally incompatible with
in-*irq usage. Nothing to be done about that.

So NAK on this.


* Re: Warning from swake_up_all in 4.14.15-rt13 non-RT
  2018-03-09 17:46           ` Peter Zijlstra
@ 2018-03-09 20:25             ` Sebastian Andrzej Siewior
  2018-03-09 22:26               ` Peter Zijlstra
  2018-03-09 22:02             ` Warning from swake_up_all in 4.14.15-rt13 non-RT Corey Minyard
  1 sibling, 1 reply; 23+ messages in thread
From: Sebastian Andrzej Siewior @ 2018-03-09 20:25 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Corey Minyard, Thomas Gleixner, Steven Rostedt, linux-rt-users,
	linux-kernel, Tejun Heo

On 2018-03-09 18:46:05 [+0100], Peter Zijlstra wrote:
> On Fri, Mar 09, 2018 at 12:04:18PM +0100, Sebastian Andrzej Siewior wrote:
> > +void swake_add_all_wq(struct swait_queue_head *q, struct wake_q_head *wq)
> >  {
> >  	struct swait_queue *curr;
> >  
> >  	while (!list_empty(&q->task_list)) {
> >  
> >  		curr = list_first_entry(&q->task_list, typeof(*curr),
> >  					task_list);
> >  		list_del_init(&curr->task_list);
> > +		wake_q_add(wq, curr->task);
> >  	}
> >  }
> > +EXPORT_SYMBOL(swake_add_all_wq);
> >  
> >  void swake_up(struct swait_queue_head *q)
> >  {
> > @@ -66,25 +62,14 @@ EXPORT_SYMBOL(swake_up);
> >   */
> >  void swake_up_all(struct swait_queue_head *q)
> >  {
> > +	unsigned long flags;
> > +	DEFINE_WAKE_Q(wq);
> >  
> > +	raw_spin_lock_irqsave(&q->lock, flags);
> > +	swake_add_all_wq(q, &wq);
> > +	raw_spin_unlock_irqrestore(&q->lock, flags);
> >  
> > +	wake_up_q(&wq);
> >  }
> >  EXPORT_SYMBOL(swake_up_all);
> 
> This is fundamentally wrong. The whole point of wake_up_all() is that
> _all_ is unbounded and should not ever land in a single critical
> section, be it IRQ or PREEMPT disabled. The above does both.

Is it just about the irqsave() usage or something else? I doubt it is
the list walk. It is still unbound if not called from an irq-off region.
But it is now possible, I agree. The wake_q usage should be cheaper
compared to IRQ off+on in each loop. And we wanted to do the wake ups
with enabled interrupts - there is still the list_splice() from that
attempt. Now it can be.

> Yes, wake_up_all() is crap, it is also fundamentally incompatible with
> in-*irq usage. Nothing to be done about that.
I still have (or need) completions which are swait based and do
complete_all(). There are complete_all() callers which wake more than one
waiter (that is PM and crypto from the reports I got once I added the
WARN_ON()).
The in-IRQ usage is !RT only and was there before.

> So NAK on this.
So I need completions to be swait based and do complete_all() from IRQ
(on !RT, not RT). I have this one call which breaks the usage on !RT and
has wake_up_all() in it in vanilla which needs an swait equivalent since
it calls its callback from an rcu-sched section.

Sebastian


* Re: Warning from swake_up_all in 4.14.15-rt13 non-RT
  2018-03-09 17:46           ` Peter Zijlstra
  2018-03-09 20:25             ` Sebastian Andrzej Siewior
@ 2018-03-09 22:02             ` Corey Minyard
  1 sibling, 0 replies; 23+ messages in thread
From: Corey Minyard @ 2018-03-09 22:02 UTC (permalink / raw)
  To: Peter Zijlstra, Sebastian Andrzej Siewior
  Cc: Thomas Gleixner, Steven Rostedt, linux-rt-users, linux-kernel, Tejun Heo

On 03/09/2018 11:46 AM, Peter Zijlstra wrote:
> On Fri, Mar 09, 2018 at 12:04:18PM +0100, Sebastian Andrzej Siewior wrote:
>> +void swake_add_all_wq(struct swait_queue_head *q, struct wake_q_head *wq)
>>   {
>>   	struct swait_queue *curr;
>>   
>>   	while (!list_empty(&q->task_list)) {
>>   
>>   		curr = list_first_entry(&q->task_list, typeof(*curr),
>>   					task_list);
>>   		list_del_init(&curr->task_list);
>> +		wake_q_add(wq, curr->task);
>>   	}
>>   }
>> +EXPORT_SYMBOL(swake_add_all_wq);
>>   
>>   void swake_up(struct swait_queue_head *q)
>>   {
>> @@ -66,25 +62,14 @@ EXPORT_SYMBOL(swake_up);
>>    */
>>   void swake_up_all(struct swait_queue_head *q)
>>   {
>> +	unsigned long flags;
>> +	DEFINE_WAKE_Q(wq);
>>   
>> +	raw_spin_lock_irqsave(&q->lock, flags);
>> +	swake_add_all_wq(q, &wq);
>> +	raw_spin_unlock_irqrestore(&q->lock, flags);
>>   
>> +	wake_up_q(&wq);
>>   }
>>   EXPORT_SYMBOL(swake_up_all);
> This is fundamentally wrong. The whole point of wake_up_all() is that
> _all_ is unbounded and should not ever land in a single critical
> section, be it IRQ or PREEMPT disabled. The above does both.

It seems to me to be better than what was there, certainly more efficient.

And if I understand this correctly it is unbounded when !RT, but it is
bounded on RT.

And I'm biased, because it should fix my problem :).

> Yes, wake_up_all() is crap, it is also fundamentally incompatible with
> in-*irq usage. Nothing to be done about that.
>
> So NAK on this.

So what would you suggest?  At this point getting rid of all the users of
wake_up_all() from interrupt context is not really an option, though as
an eventual goal it would be good.

-corey


* Re: Warning from swake_up_all in 4.14.15-rt13 non-RT
  2018-03-09 20:25             ` Sebastian Andrzej Siewior
@ 2018-03-09 22:26               ` Peter Zijlstra
  2018-03-12 10:51                 ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 23+ messages in thread
From: Peter Zijlstra @ 2018-03-09 22:26 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Corey Minyard, Thomas Gleixner, Steven Rostedt, linux-rt-users,
	linux-kernel, Tejun Heo

On Fri, Mar 09, 2018 at 09:25:50PM +0100, Sebastian Andrzej Siewior wrote:
> Is it just about the irqsave() usage or something else? I doubt it is
> the list walk. It is still unbound if not called from an irq-off region.

The current list walk is preemptible. You put the entire iteration (of
unbound length) inside a single critical section, which destroys RT.

> But it is now possible, I agree. The wake_q usage should be cheaper
> compared to IRQ off+on in each loop. And we wanted to do the wake ups
> with enabled interrupts - there is still the list_splice() from that
> attempt. Now it can be.

Unbound is still unbound, inf/n := inf. A 'cheaper' unbound doesn't RT
make.

> > Yes, wake_up_all() is crap, it is also fundamentally incompatible with
> > in-*irq usage. Nothing to be done about that.
> I still have (or need) completions which are swait based and do
> complete_all(). 

That's fine, as long as they're done from preemptible context. Back when
we introduced swait this was an explicit design goal/limitation. And
there were no in-irq users of this.

> There are complete_all() callers which wake more than one
> waiter (that is PM and crypto from the reports I got once I added the
> WARN_ON()).
> The in-IRQ usage is !RT only and was there before.

Then that's broken and needs to be undone. Also, why did you need the
WARN, lockdep should've equally triggered on this, no?

> > So NAK on this.
> So I need completions to be swait based and do complete_all() from IRQ
> (on !RT, not RT). I have this one call which breaks the usage on !RT and
> has wake_up_all() in it in vanilla which needs an swait equivalent since
> it calls its callback from an rcu-sched section.

Why isn't this a problem on RT?


* Re: Warning from swake_up_all in 4.14.15-rt13 non-RT
  2018-03-09 22:26               ` Peter Zijlstra
@ 2018-03-12 10:51                 ` Sebastian Andrzej Siewior
  2018-03-12 13:27                   ` Peter Zijlstra
  0 siblings, 1 reply; 23+ messages in thread
From: Sebastian Andrzej Siewior @ 2018-03-12 10:51 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Corey Minyard, Thomas Gleixner, Steven Rostedt, linux-rt-users,
	linux-kernel, Tejun Heo

On 2018-03-09 23:26:43 [+0100], Peter Zijlstra wrote:
> On Fri, Mar 09, 2018 at 09:25:50PM +0100, Sebastian Andrzej Siewior wrote:
> > Is it just about the irqsave() usage or something else? I doubt it is
> > the list walk. It is still unbound if not called from an irq-off region.
> 
> The current list walk is preemptible. You put the entire iteration (of
> > unbound length) inside a single critical section, which destroys RT.

I considered that list walk as cheap. We don't do any wake ups with the
list walk - just mark the task for a later wake up. But if it is not, I
could add an upper limit of 20 iterations or so.

> > But it is now possible, I agree. The wake_q usage should be cheaper
> > compared to IRQ off+on in each loop. And we wanted to do the wake ups
> > with enabled interrupts - there is still the list_splice() from that
> > attempt. Now it can be.
> 
> Unbound is still unbound, inf/n := inf. A 'cheaper' unbound doesn't RT
> make.

What I meant is that wake_up_q() is invoked with interrupts enabled and we
don't need the IRQ on/off on each iteration. But as I said in the upper
paragraph, I can add an upper limit for the list walk. And the wake up
itself happens with interrupts enabled.

> > > Yes, wake_up_all() is crap, it is also fundamentally incompatible with
> > > in-*irq usage. Nothing to be done about that.
> > I still have (or need) completions which are swait based and do
> > complete_all(). 
> 
> That's fine, as long as they're done from preemptible context. Back when
> we introduced swait this was an explicit design goal/limitation. And
> there were no in-irq users of this.

Yes, at that time in !RT. wake_up() is using sleeping locks on RT and
swait is the only thing that can be used there. So if I don't get rid of
that !preemptible part I try to switch to swait.

> > There are complete_all() callers which wake more than one
> > waiter (that is PM and crypto from the reports I got once I added the
> > WARN_ON()).
> > The in-IRQ usage is !RT only and was there before.
> 
> Then that's broken and needs to be undone. Also, why did you need the
> WARN, lockdep should've equally triggered on this, no?

I added WARN_ON() and I didn't even think about lockdep. I wanted to
see a warning even with lockdep off.
After adding this for testing:

        {
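                 /* 1: each lock taken and released on its own */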
                 raw_spin_lock_irq(&lock1);
                 raw_spin_unlock_irq(&lock1);
                 raw_spin_lock_irq(&lock2);
                 raw_spin_unlock_irq(&lock2);
 
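                 /* 2: records the lock1 -> lock2 ordering */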
                 raw_spin_lock_irq(&lock1);
                 raw_spin_lock_irq(&lock2);
                 raw_spin_unlock_irq(&lock2);
                 raw_spin_unlock_irq(&lock1);

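                 /* 3: lock2 -> lock1 inverts the order recorded above */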
                 raw_spin_lock_irq(&lock2);
                 raw_spin_lock_irq(&lock1);
                 raw_spin_unlock_irq(&lock1);
                 raw_spin_unlock_irq(&lock2);
         }

I see only one complaint about the lock order in the last block. With
that one gone there is no complaint about the second block. So no,
lockdep does not report such things (this was just tested on RT and
TIP).

> > > So NAK on this.
> > So I need completions to be swait based and do complete_all() from IRQ
> > (on !RT, not RT). I have this one call which breaks the usage on !RT and
> > has wake_up_all() in it in vanilla which needs an swait equivalent since
> > it calls its callback from an rcu-sched section.
> 
> Why isn't this a problem on RT?
So we remain in the preempt_disable() section due to RCU-sched so we
have this, yes. But the "disabled interrupts" part is due to
spin_lock_irqsave() which is a non-issue on RT. So if we managed to get
rid of the rcu-sched then the swait can go and we can stick with the
wake_up_all() on RT, too.

Sebastian


* Re: Warning from swake_up_all in 4.14.15-rt13 non-RT
  2018-03-12 10:51                 ` Sebastian Andrzej Siewior
@ 2018-03-12 13:27                   ` Peter Zijlstra
  2018-03-12 14:11                     ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 23+ messages in thread
From: Peter Zijlstra @ 2018-03-12 13:27 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Corey Minyard, Thomas Gleixner, Steven Rostedt, linux-rt-users,
	linux-kernel, Tejun Heo

On Mon, Mar 12, 2018 at 11:51:13AM +0100, Sebastian Andrzej Siewior wrote:
> On 2018-03-09 23:26:43 [+0100], Peter Zijlstra wrote:
> > On Fri, Mar 09, 2018 at 09:25:50PM +0100, Sebastian Andrzej Siewior wrote:
> > > Is it just about the irqsave() usage or something else? I doubt it is
> > > the list walk. It is still unbound if not called from irq-off region.
> > 
> > The current list walk is preemptible. You put the entire iteration (of
> > unbound length) inside a single critical section which destroy RT.
> 
> I considered that list walk as cheap. We don't do any wake ups with the
> list walk - just mark the task for a later wake up. But if it is not I
> could add an upper limit of 20 iterations or so.

So the problem is that as soon as this is exposed to userspace you've
lost.

If a user can stack like 10000 tasks on the completion before triggering
it, you've got yourself a giant !preempt section. Yes the wake_q stuff
is cheaper, but unbound is still unbound.

wake_all must not be used from !preemptible (IRQ or otherwise) sections.
And I'm not seeing how waking just the top 20 helps.

> > Why isn't this a problem on RT?
> So we remain in the preempt_disable() section due to RCU-sched so we
> have this, yes. But the "disabled interrupts" part is due to
> spin_lock_irqsave() which is a non-issue on RT. So if we managed to get
> rid of the rcu-sched then the swait can go and we can stick with the
> wake_up_all() on RT, too.

OK, so for RT we simply lose the IRQ-disable thing, but it's still a
!preemptible section.


* Re: Warning from swake_up_all in 4.14.15-rt13 non-RT
  2018-03-12 13:27                   ` Peter Zijlstra
@ 2018-03-12 14:11                     ` Sebastian Andrzej Siewior
  2018-03-12 14:29                       ` Peter Zijlstra
  0 siblings, 1 reply; 23+ messages in thread
From: Sebastian Andrzej Siewior @ 2018-03-12 14:11 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Corey Minyard, Thomas Gleixner, Steven Rostedt, linux-rt-users,
	linux-kernel, Tejun Heo

On 2018-03-12 14:27:29 [+0100], Peter Zijlstra wrote:
> On Mon, Mar 12, 2018 at 11:51:13AM +0100, Sebastian Andrzej Siewior wrote:
> > On 2018-03-09 23:26:43 [+0100], Peter Zijlstra wrote:
> > > On Fri, Mar 09, 2018 at 09:25:50PM +0100, Sebastian Andrzej Siewior wrote:
> > > > Is it just about the irqsave() usage or something else? I doubt it is
> > > > the list walk. It is still unbound if not called from an irq-off region.
> > > 
> > > The current list walk is preemptible. You put the entire iteration (of
> > > unbound length) inside a single critical section, which destroys RT.
> > 
> > I considered that list walk as cheap. We don't do any wake ups with the
> > list walk - just mark the task for a later wake up. But if it is not, I
> > could add an upper limit of 20 iterations or so.
> 
> So the problem is that as soon as this is exposed to userspace you've
> lost.

I know. We had this very same problem with clock_nanosleep() which got
solved after timer rework.

> If a user can stack like 10000 tasks on the completion before triggering
> it, you've got yourself a giant !preempt section. Yes the wake_q stuff
> is cheaper, but unbound is still unbound.
> 
> wake_all must not be used from !preemptible (IRQ or otherwise) sections.
> And I'm not seeing how waking just the top 20 helps.

I assumed you complained about the unbounded list-walk with interrupts
disabled (which is cheap but unbound is unbound). So here I suggested I
move 20 entries off that list at a time and enable interrupts again so an
interrupt could set need_resched.
But if we get invoked !preemptible then nothing changes.
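
Concretely, the bounded walk I have in mind would look something like
this (a sketch only; the 20 is arbitrary and, as said, it only helps
when the caller runs with interrupts enabled):

	void swake_up_all(struct swait_queue_head *q)
	{
		unsigned long flags;
		DEFINE_WAKE_Q(wq);
		bool more;

		do {
			struct swait_queue *curr;
			int batch = 20;

			raw_spin_lock_irqsave(&q->lock, flags);
			while (!list_empty(&q->task_list) && batch--) {
				curr = list_first_entry(&q->task_list,
							typeof(*curr), task_list);
				list_del_init(&curr->task_list);
				wake_q_add(&wq, curr->task);
			}
			more = !list_empty(&q->task_list);
			raw_spin_unlock_irqrestore(&q->lock, flags);
			/* interrupts are back on here (if the caller had
			 * them on), so a pending IRQ can set need_resched */
		} while (more);

		wake_up_q(&wq);
	}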

> > > Why isn't this a problem on RT?
> > So we remain in the preempt_disable() section due to RCU-sched so we
> > have this, yes. But the "disabled interrupts" part is due to
> > spin_lock_irqsave() which is a non-issue on RT. So if we managed to get
> > rid of the rcu-sched then the swait can go and we can stick with the
> > wake_up_all() on RT, too.
> 
> OK, so for RT we simply lose the IRQ-disable thing, but it's still a
> !preemptible section.

exactly. The irqsave() was to guard the non-RT config which uses the same
code.
So do I understand you correctly that irqsave may remain for the !RT config
but that invocation with disabled preemption due to rcu-sched (on RT,
too) must go?

Sebastian


* Re: Warning from swake_up_all in 4.14.15-rt13 non-RT
  2018-03-12 14:11                     ` Sebastian Andrzej Siewior
@ 2018-03-12 14:29                       ` Peter Zijlstra
  2018-03-12 19:51                         ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 23+ messages in thread
From: Peter Zijlstra @ 2018-03-12 14:29 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Corey Minyard, Thomas Gleixner, Steven Rostedt, linux-rt-users,
	linux-kernel, Tejun Heo

On Mon, Mar 12, 2018 at 03:11:07PM +0100, Sebastian Andrzej Siewior wrote:
> I assumed you complained about the unbounded list walk with interrupts
> disabled (which is cheap, but unbound is unbound). So here I suggested
> moving 20 entries off that list at a time and enabling interrupts again
> so an interrupt could set need_resched.
> But if we get invoked !preemptible, then nothing changes.

Right, so any !preemptible invocation of wake_all is bad.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Warning from swake_up_all in 4.14.15-rt13 non-RT
  2018-03-12 14:29                       ` Peter Zijlstra
@ 2018-03-12 19:51                         ` Sebastian Andrzej Siewior
  2018-03-13 18:40                           ` [RT PATCH 1/2] Revert "block: blk-mq: Use swait" Sebastian Andrzej Siewior
  0 siblings, 1 reply; 23+ messages in thread
From: Sebastian Andrzej Siewior @ 2018-03-12 19:51 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Corey Minyard, Thomas Gleixner, Steven Rostedt, linux-rt-users,
	linux-kernel, Tejun Heo

On 2018-03-12 15:29:33 [+0100], Peter Zijlstra wrote:
> On Mon, Mar 12, 2018 at 03:11:07PM +0100, Sebastian Andrzej Siewior wrote:
> > I assumed you complained about the unbounded list walk with interrupts
> > disabled (which is cheap, but unbound is unbound). So here I suggested
> > moving 20 entries off that list at a time and enabling interrupts again
> > so an interrupt could set need_resched.
> > But if we get invoked !preemptible, then nothing changes.
> 
> Right, so any !preemptible invocation of wake_all is bad.

I doubt I can argue for moving wake_up_all() into a workqueue in a
sane way. Not to mention that it wouldn't do any good for me on RT,
since I can't schedule workqueues from a !preemptible context.

I've been staring at the original issue. The original commit that
properly introduced RCU-sched [0] reads like "it could have been normal
RCU, but this one is faster in my micro-bench, so let's go for it".
Probably because normal RCU invokes the callbacks only once in a while.

So we could switch it to "normal" RCU instead. Does anyone think that
could be done? percpu_ref_get()/put() is a hot path; I am not sure how
much worse it gets in a real benchmark.
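
For illustration, the switch would be roughly this in the fast path (an
untested sketch of the idea, not a proposed patch; the kill side would
then have to pair with call_rcu()/synchronize_rcu() instead of the
sched flavour):

#include <linux/percpu-refcount.h>

/* sketch: percpu_ref_get() with normal RCU instead of RCU-sched */
static inline void percpu_ref_get_normal_rcu(struct percpu_ref *ref)
{
	unsigned long __percpu *percpu_count;

	rcu_read_lock();		/* was: rcu_read_lock_sched() */

	if (__ref_is_percpu(ref, &percpu_count))
		this_cpu_inc(*percpu_count);
	else
		atomic_long_inc(&ref->count);

	rcu_read_unlock();		/* was: rcu_read_unlock_sched() */
}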

In the meantime I prepared a patch which reverts the swait conversion
and invokes swork_queue() from the release callback, which in turn moves
the wake_up_all() into a different process. Since it triggers rarely, it
should not cause a performance regression. And the wake_up_all() is
preemptible, so RT is happy.

[0] a4244454df12 ("percpu-refcount: use RCU-sched insted of normal RCU")

--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -381,7 +381,7 @@ void blk_clear_preempt_only(struct reque
 
 	spin_lock_irqsave(q->queue_lock, flags);
 	queue_flag_clear(QUEUE_FLAG_PREEMPT_ONLY, q);
-	swake_up_all(&q->mq_freeze_wq);
+	wake_up_all(&q->mq_freeze_wq);
 	spin_unlock_irqrestore(q->queue_lock, flags);
 }
 EXPORT_SYMBOL_GPL(blk_clear_preempt_only);
@@ -659,7 +659,7 @@ void blk_set_queue_dying(struct request_
 	}
 
 	/* Make blk_queue_enter() reexamine the DYING flag. */
-	swake_up_all(&q->mq_freeze_wq);
+	wake_up_all(&q->mq_freeze_wq);
 }
 EXPORT_SYMBOL_GPL(blk_set_queue_dying);
 
@@ -860,7 +860,7 @@ int blk_queue_enter(struct request_queue
 		 */
 		smp_rmb();
 
-		ret = swait_event_interruptible(q->mq_freeze_wq,
+		ret = wait_event_interruptible(q->mq_freeze_wq,
 				(atomic_read(&q->mq_freeze_depth) == 0 &&
 				 (preempt || !blk_queue_preempt_only(q))) ||
 				blk_queue_dying(q));
@@ -876,12 +876,20 @@ void blk_queue_exit(struct request_queue
 	percpu_ref_put(&q->q_usage_counter);
 }
 
+static void blk_queue_usage_counter_release_swork(struct swork_event *sev)
+{
+	struct request_queue *q =
+		container_of(sev, struct request_queue, mq_pcpu_wake);
+
+	wake_up_all(&q->mq_freeze_wq);
+}
+
 static void blk_queue_usage_counter_release(struct percpu_ref *ref)
 {
 	struct request_queue *q =
 		container_of(ref, struct request_queue, q_usage_counter);
 
-	swake_up_all(&q->mq_freeze_wq);
+	swork_queue(&q->mq_pcpu_wake);
 }
 
 static void blk_rq_timed_out_timer(struct timer_list *t)
@@ -957,7 +965,8 @@ struct request_queue *blk_alloc_queue_no
 	q->bypass_depth = 1;
 	__set_bit(QUEUE_FLAG_BYPASS, &q->queue_flags);
 
-	init_swait_queue_head(&q->mq_freeze_wq);
+	init_waitqueue_head(&q->mq_freeze_wq);
+	INIT_SWORK(&q->mq_pcpu_wake, blk_queue_usage_counter_release_swork);
 
 	/*
 	 * Init percpu_ref in atomic mode so that it's faster to shutdown.
@@ -3843,6 +3852,7 @@ int __init blk_dev_init(void)
 
 	blk_requestq_cachep = kmem_cache_create("request_queue",
 			sizeof(struct request_queue), 0, SLAB_PANIC, NULL);
+	BUG_ON(swork_get());
 
 #ifdef CONFIG_DEBUG_FS
 	blk_debugfs_root = debugfs_create_dir("block", NULL);
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -133,14 +133,14 @@ EXPORT_SYMBOL_GPL(blk_freeze_queue_start
 
 void blk_mq_freeze_queue_wait(struct request_queue *q)
 {
-	swait_event(q->mq_freeze_wq, percpu_ref_is_zero(&q->q_usage_counter));
+	wait_event(q->mq_freeze_wq, percpu_ref_is_zero(&q->q_usage_counter));
 }
 EXPORT_SYMBOL_GPL(blk_mq_freeze_queue_wait);
 
 int blk_mq_freeze_queue_wait_timeout(struct request_queue *q,
 				     unsigned long timeout)
 {
-	return swait_event_timeout(q->mq_freeze_wq,
+	return wait_event_timeout(q->mq_freeze_wq,
 					percpu_ref_is_zero(&q->q_usage_counter),
 					timeout);
 }
@@ -183,7 +183,7 @@ void blk_mq_unfreeze_queue(struct reques
 	WARN_ON_ONCE(freeze_depth < 0);
 	if (!freeze_depth) {
 		percpu_ref_reinit(&q->q_usage_counter);
-		swake_up_all(&q->mq_freeze_wq);
+		wake_up_all(&q->mq_freeze_wq);
 	}
 }
 EXPORT_SYMBOL_GPL(blk_mq_unfreeze_queue);
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -29,6 +29,7 @@
 #include <linux/blkzoned.h>
 #include <linux/seqlock.h>
 #include <linux/u64_stats_sync.h>
+#include <linux/swork.h>
 
 struct module;
 struct scsi_ioctl_command;
@@ -647,7 +648,8 @@ struct request_queue {
 	struct throtl_data *td;
 #endif
 	struct rcu_head		rcu_head;
-	struct swait_queue_head	mq_freeze_wq;
+	wait_queue_head_t	mq_freeze_wq;
+	struct swork_event	mq_pcpu_wake;
 	struct percpu_ref	q_usage_counter;
 	struct list_head	all_q_node;
 

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [RT PATCH 1/2] Revert "block: blk-mq: Use swait"
  2018-03-12 19:51                         ` Sebastian Andrzej Siewior
@ 2018-03-13 18:40                           ` Sebastian Andrzej Siewior
  2018-03-13 18:42                             ` [RT PATCH 2/2] block: blk-mq: move blk_queue_usage_counter_release() into process context Sebastian Andrzej Siewior
  0 siblings, 1 reply; 23+ messages in thread
From: Sebastian Andrzej Siewior @ 2018-03-13 18:40 UTC (permalink / raw)
  To: linux-rt-users
  Cc: Peter Zijlstra, Corey Minyard, Thomas Gleixner, Steven Rostedt,
	linux-kernel, Tejun Heo

This reverts commit "block: blk-mq: Use swait". The issue remains but
will be fixed differently.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 block/blk-core.c       | 6 +++---
 block/blk-mq.c         | 8 ++++----
 include/linux/blkdev.h | 2 +-
 3 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index ff1258ca236c..f9ac6f169c67 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -799,7 +799,7 @@ int blk_queue_enter(struct request_queue *q, bool nowait)
 		 */
 		smp_rmb();
 
-		ret = swait_event_interruptible(q->mq_freeze_wq,
+		ret = wait_event_interruptible(q->mq_freeze_wq,
 				!atomic_read(&q->mq_freeze_depth) ||
 				blk_queue_dying(q));
 		if (blk_queue_dying(q))
@@ -819,7 +819,7 @@ static void blk_queue_usage_counter_release(struct percpu_ref *ref)
 	struct request_queue *q =
 		container_of(ref, struct request_queue, q_usage_counter);
 
-	swake_up_all(&q->mq_freeze_wq);
+	wake_up_all(&q->mq_freeze_wq);
 }
 
 static void blk_rq_timed_out_timer(unsigned long data)
@@ -895,7 +895,7 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
 	q->bypass_depth = 1;
 	__set_bit(QUEUE_FLAG_BYPASS, &q->queue_flags);
 
-	init_swait_queue_head(&q->mq_freeze_wq);
+	init_waitqueue_head(&q->mq_freeze_wq);
 
 	/*
 	 * Init percpu_ref in atomic mode so that it's faster to shutdown.
diff --git a/block/blk-mq.c b/block/blk-mq.c
index bbe43d32f71a..c5bd467dd97b 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -132,14 +132,14 @@ EXPORT_SYMBOL_GPL(blk_freeze_queue_start);
 
 void blk_mq_freeze_queue_wait(struct request_queue *q)
 {
-	swait_event(q->mq_freeze_wq, percpu_ref_is_zero(&q->q_usage_counter));
+	wait_event(q->mq_freeze_wq, percpu_ref_is_zero(&q->q_usage_counter));
 }
 EXPORT_SYMBOL_GPL(blk_mq_freeze_queue_wait);
 
 int blk_mq_freeze_queue_wait_timeout(struct request_queue *q,
 				     unsigned long timeout)
 {
-	return swait_event_timeout(q->mq_freeze_wq,
+	return wait_event_timeout(q->mq_freeze_wq,
 					percpu_ref_is_zero(&q->q_usage_counter),
 					timeout);
 }
@@ -182,7 +182,7 @@ void blk_mq_unfreeze_queue(struct request_queue *q)
 	WARN_ON_ONCE(freeze_depth < 0);
 	if (!freeze_depth) {
 		percpu_ref_reinit(&q->q_usage_counter);
-		swake_up_all(&q->mq_freeze_wq);
+		wake_up_all(&q->mq_freeze_wq);
 	}
 }
 EXPORT_SYMBOL_GPL(blk_mq_unfreeze_queue);
@@ -263,7 +263,7 @@ void blk_mq_wake_waiters(struct request_queue *q)
 	 * dying, we need to ensure that processes currently waiting on
 	 * the queue are notified as well.
 	 */
-	swake_up_all(&q->mq_freeze_wq);
+	wake_up_all(&q->mq_freeze_wq);
 }
 
 bool blk_mq_can_queue(struct blk_mq_hw_ctx *hctx)
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 6f278f1fd634..b68752bfb645 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -598,7 +598,7 @@ struct request_queue {
 	struct throtl_data *td;
 #endif
 	struct rcu_head		rcu_head;
-	struct swait_queue_head	mq_freeze_wq;
+	wait_queue_head_t	mq_freeze_wq;
 	struct percpu_ref	q_usage_counter;
 	struct list_head	all_q_node;
 
-- 
2.16.2

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [RT PATCH 2/2] block: blk-mq: move blk_queue_usage_counter_release() into process context
  2018-03-13 18:40                           ` [RT PATCH 1/2] Revert "block: blk-mq: Use swait" Sebastian Andrzej Siewior
@ 2018-03-13 18:42                             ` Sebastian Andrzej Siewior
  2018-03-13 20:10                               ` Peter Zijlstra
  0 siblings, 1 reply; 23+ messages in thread
From: Sebastian Andrzej Siewior @ 2018-03-13 18:42 UTC (permalink / raw)
  To: linux-rt-users
  Cc: Peter Zijlstra, Corey Minyard, Thomas Gleixner, Steven Rostedt,
	linux-kernel, Tejun Heo

| BUG: sleeping function called from invalid context at kernel/locking/rtmutex.c:914
| in_atomic(): 1, irqs_disabled(): 0, pid: 255, name: kworker/u257:6
| 5 locks held by kworker/u257:6/255:
|  #0:  ("events_unbound"){.+.+.+}, at: [<ffffffff8108edf1>] process_one_work+0x171/0x5e0
|  #1:  ((&entry->work)){+.+.+.}, at: [<ffffffff8108edf1>] process_one_work+0x171/0x5e0
|  #2:  (&shost->scan_mutex){+.+.+.}, at: [<ffffffffa000faa3>] __scsi_add_device+0xa3/0x130 [scsi_mod]
|  #3:  (&set->tag_list_lock){+.+...}, at: [<ffffffff812f09fa>] blk_mq_init_queue+0x96a/0xa50
|  #4:  (rcu_read_lock_sched){......}, at: [<ffffffff8132887d>] percpu_ref_kill_and_confirm+0x1d/0x120
| Preemption disabled at:[<ffffffff812eff76>] blk_mq_freeze_queue_start+0x56/0x70
|
| CPU: 2 PID: 255 Comm: kworker/u257:6 Not tainted 3.18.7-rt0+ #1
| Workqueue: events_unbound async_run_entry_fn
|  0000000000000003 ffff8800bc29f998 ffffffff815b3a12 0000000000000000
|  0000000000000000 ffff8800bc29f9b8 ffffffff8109aa16 ffff8800bc29fa28
|  ffff8800bc5d1bc8 ffff8800bc29f9e8 ffffffff815b8dd4 ffff880000000000
| Call Trace:
|  [<ffffffff815b3a12>] dump_stack+0x4f/0x7c
|  [<ffffffff8109aa16>] __might_sleep+0x116/0x190
|  [<ffffffff815b8dd4>] rt_spin_lock+0x24/0x60
|  [<ffffffff810b6089>] __wake_up+0x29/0x60
|  [<ffffffff812ee06e>] blk_mq_usage_counter_release+0x1e/0x20
|  [<ffffffff81328966>] percpu_ref_kill_and_confirm+0x106/0x120
|  [<ffffffff812eff76>] blk_mq_freeze_queue_start+0x56/0x70
|  [<ffffffff812f0000>] blk_mq_update_tag_set_depth+0x40/0xd0
|  [<ffffffff812f0a1c>] blk_mq_init_queue+0x98c/0xa50
|  [<ffffffffa000dcf0>] scsi_mq_alloc_queue+0x20/0x60 [scsi_mod]
|  [<ffffffffa000ea35>] scsi_alloc_sdev+0x2f5/0x370 [scsi_mod]
|  [<ffffffffa000f494>] scsi_probe_and_add_lun+0x9e4/0xdd0 [scsi_mod]
|  [<ffffffffa000fb26>] __scsi_add_device+0x126/0x130 [scsi_mod]
|  [<ffffffffa013033f>] ata_scsi_scan_host+0xaf/0x200 [libata]
|  [<ffffffffa012b5b6>] async_port_probe+0x46/0x60 [libata]
|  [<ffffffff810978fb>] async_run_entry_fn+0x3b/0xf0
|  [<ffffffff8108ee81>] process_one_work+0x201/0x5e0

percpu_ref_kill_and_confirm() invokes blk_mq_usage_counter_release() in
an RCU-sched region. A swait-based wait queue can't be used here due to
the wake_up_all() usage and disabled interrupts in !RT configs (as
reported by Corey Minyard).
Use swork_queue() to invoke wake_up_all() in process context.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 block/blk-core.c       | 13 ++++++++++++-
 include/linux/blkdev.h |  2 ++
 2 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index f9ac6f169c67..4db4051724ea 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -814,12 +814,20 @@ void blk_queue_exit(struct request_queue *q)
 	percpu_ref_put(&q->q_usage_counter);
 }
 
+static void blk_queue_usage_counter_release_swork(struct swork_event *sev)
+{
+	struct request_queue *q =
+		container_of(sev, struct request_queue, mq_pcpu_wake);
+
+	wake_up_all(&q->mq_freeze_wq);
+}
+
 static void blk_queue_usage_counter_release(struct percpu_ref *ref)
 {
 	struct request_queue *q =
 		container_of(ref, struct request_queue, q_usage_counter);
 
-	wake_up_all(&q->mq_freeze_wq);
+	swork_queue(&q->mq_pcpu_wake);
 }
 
 static void blk_rq_timed_out_timer(unsigned long data)
@@ -896,6 +904,7 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
 	__set_bit(QUEUE_FLAG_BYPASS, &q->queue_flags);
 
 	init_waitqueue_head(&q->mq_freeze_wq);
+	INIT_SWORK(&q->mq_pcpu_wake, blk_queue_usage_counter_release_swork);
 
 	/*
 	 * Init percpu_ref in atomic mode so that it's faster to shutdown.
@@ -3623,6 +3632,8 @@ int __init blk_dev_init(void)
 	if (!kblockd_workqueue)
 		panic("Failed to create kblockd\n");
 
+	BUG_ON(swork_get());
+
 	request_cachep = kmem_cache_create("blkdev_requests",
 			sizeof(struct request), 0, SLAB_PANIC, NULL);
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index b68752bfb645..49b53ad6d2d6 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -27,6 +27,7 @@
 #include <linux/percpu-refcount.h>
 #include <linux/scatterlist.h>
 #include <linux/blkzoned.h>
+#include <linux/swork.h>
 
 struct module;
 struct scsi_ioctl_command;
@@ -599,6 +600,7 @@ struct request_queue {
 #endif
 	struct rcu_head		rcu_head;
 	wait_queue_head_t	mq_freeze_wq;
+	struct swork_event	mq_pcpu_wake;
 	struct percpu_ref	q_usage_counter;
 	struct list_head	all_q_node;
 
-- 
2.16.2

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [RT PATCH 2/2] block: blk-mq: move blk_queue_usage_counter_release() into process context
  2018-03-13 18:42                             ` [RT PATCH 2/2] block: blk-mq: move blk_queue_usage_counter_release() into process context Sebastian Andrzej Siewior
@ 2018-03-13 20:10                               ` Peter Zijlstra
  2018-03-14 15:23                                 ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 23+ messages in thread
From: Peter Zijlstra @ 2018-03-13 20:10 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: linux-rt-users, Corey Minyard, Thomas Gleixner, Steven Rostedt,
	linux-kernel, Tejun Heo

On Tue, Mar 13, 2018 at 07:42:41PM +0100, Sebastian Andrzej Siewior wrote:
> +static void blk_queue_usage_counter_release_swork(struct swork_event *sev)
> +{
> +	struct request_queue *q =
> +		container_of(sev, struct request_queue, mq_pcpu_wake);
> +
> +	wake_up_all(&q->mq_freeze_wq);
> +}
> +
>  static void blk_queue_usage_counter_release(struct percpu_ref *ref)
>  {
>  	struct request_queue *q =
>  		container_of(ref, struct request_queue, q_usage_counter);
>  
> -	wake_up_all(&q->mq_freeze_wq);
> +	swork_queue(&q->mq_pcpu_wake);
>  }

Depending on if we expect there to actually be wakeups, you could do
something like:

	if (wq_has_sleeper(&q->mq_freeze_wq))
		swork_queue(&q->mq_pcpu_wake);

avoiding the whole workqueue thing in the case there wasn't anybody
waiting for it. But since I don't know this code, I can't say if it
makes sense or not. Tejun?
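
Folded into the release callback, that would look roughly like so
(untested sketch, reusing the names from the patch above):

static void blk_queue_usage_counter_release(struct percpu_ref *ref)
{
	struct request_queue *q =
		container_of(ref, struct request_queue, q_usage_counter);

	/* only defer to the worker when somebody is actually waiting */
	if (wq_has_sleeper(&q->mq_freeze_wq))
		swork_queue(&q->mq_pcpu_wake);
}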

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RT PATCH 2/2] block: blk-mq: move blk_queue_usage_counter_release() into process context
  2018-03-13 20:10                               ` Peter Zijlstra
@ 2018-03-14 15:23                                 ` Sebastian Andrzej Siewior
  0 siblings, 0 replies; 23+ messages in thread
From: Sebastian Andrzej Siewior @ 2018-03-14 15:23 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-rt-users, Corey Minyard, Thomas Gleixner, Steven Rostedt,
	linux-kernel, Tejun Heo

On 2018-03-13 21:10:39 [+0100], Peter Zijlstra wrote:
> On Tue, Mar 13, 2018 at 07:42:41PM +0100, Sebastian Andrzej Siewior wrote:
> > +static void blk_queue_usage_counter_release_swork(struct swork_event *sev)
> > +{
> > +	struct request_queue *q =
> > +		container_of(sev, struct request_queue, mq_pcpu_wake);
> > +
> > +	wake_up_all(&q->mq_freeze_wq);
> > +}
> > +
> >  static void blk_queue_usage_counter_release(struct percpu_ref *ref)
> >  {
> >  	struct request_queue *q =
> >  		container_of(ref, struct request_queue, q_usage_counter);
> >  
> > -	wake_up_all(&q->mq_freeze_wq);
> > +	swork_queue(&q->mq_pcpu_wake);
> >  }
> 
> Depending on if we expect there to actually be wakeups, you could do
> something like:
> 
> 	if (wq_has_sleeper(&q->mq_freeze_wq))
> 		swork_queue(&q->mq_pcpu_wake);
> 
> avoiding the whole workqueue thing in the case there wasn't anybody
> waiting for it. But since I don't know this code, I can't say if it
> makes sense or not. Tejun?

This pops up on boot and shows that there are no waiters. So I'm
considering taking this.
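
Something like this bit of instrumentation in
blk_queue_usage_counter_release() shows it (a hypothetical
reconstruction, not the actual debug code):

	if (!wq_has_sleeper(&q->mq_freeze_wq))
		pr_info("%s: no waiters\n", __func__);
	swork_queue(&q->mq_pcpu_wake);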

Sebastian

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2018-03-14 15:23 UTC | newest]

Thread overview: 23+ messages
2018-03-05 15:08 Warning from swake_up_all in 4.14.15-rt13 non-RT Corey Minyard
2018-03-06 17:46 ` Sebastian Andrzej Siewior
2018-03-06 22:51   ` Corey Minyard
2018-03-07 15:45   ` Corey Minyard
2018-03-08 17:41     ` Sebastian Andrzej Siewior
2018-03-08 19:54       ` Corey Minyard
2018-03-09 11:04         ` Sebastian Andrzej Siewior
2018-03-09 13:29           ` Corey Minyard
2018-03-09 14:58             ` Sebastian Andrzej Siewior
2018-03-09 16:03               ` Corey Minyard
2018-03-09 17:46           ` Peter Zijlstra
2018-03-09 20:25             ` Sebastian Andrzej Siewior
2018-03-09 22:26               ` Peter Zijlstra
2018-03-12 10:51                 ` Sebastian Andrzej Siewior
2018-03-12 13:27                   ` Peter Zijlstra
2018-03-12 14:11                     ` Sebastian Andrzej Siewior
2018-03-12 14:29                       ` Peter Zijlstra
2018-03-12 19:51                         ` Sebastian Andrzej Siewior
2018-03-13 18:40                           ` [RT PATCH 1/2] Revert "block: blk-mq: Use swait" Sebastian Andrzej Siewior
2018-03-13 18:42                             ` [RT PATCH 2/2] block: blk-mq: move blk_queue_usage_counter_release() into process context Sebastian Andrzej Siewior
2018-03-13 20:10                               ` Peter Zijlstra
2018-03-14 15:23                                 ` Sebastian Andrzej Siewior
2018-03-09 22:02             ` Warning from swake_up_all in 4.14.15-rt13 non-RT Corey Minyard
