* Deadlock in wbt / rq-qos
@ 2021-06-13 15:49 Omar Kilani
  2021-06-13 17:03 ` Omar Kilani
  2021-06-15  9:22 ` Ming Lei
  0 siblings, 2 replies; 6+ messages in thread
From: Omar Kilani @ 2021-06-13 15:49 UTC (permalink / raw)
  To: linux-block

Hi there,

I appear to have stumbled upon a deadlock in wbt or rq-qos.

My running log, with a lot of data points, is here:

https://github.com/openzfs/zfs/issues/12204

I initially deadlocked on RHEL 8.4's 4.18.0-305.3.1.el8_4.x86_64
kernel, but the code in blk-wbt.c / blk-rq-qos.c is functionally
identical to 5.13.0-rc5, so I tried that and I'm able to deadlock that
as well. I believe the same code exists all the way back to 5.0.1.

The Something Weird (tm) about this is that it possibly only happens
on AMD EPYC CPUs. I don't have the setup to confirm that either way;
it's a hunch because I can't reproduce it on an Ice Lake VM. However,
the Ice Lake VM also has more storage bandwidth, which could be the
real difference, and I can't decrease that bandwidth to do a
like-for-like test.

I "instrumented" wbt / rq-qos with a bunch of printk's which you can
see with this patch:

https://gist.github.com/omarkilani/2ad526c3546b40537b546450c8f685dc

I then ran my repro workload to cause the deadlock. Here's the dmesg
output from just before the deadlock, followed by the backtraces, with
my printk patch applied:

https://gist.githubusercontent.com/omarkilani/ff0a96d872e09b4fb648272d104e0053/raw/d3da3974162f8aa87b7309317af80929fadf250f/dmesg.wbt.deadlock.log

Happy to apply whatever / run whatever to get more data.

Thanks!

Regards,
Omar


* Re: Deadlock in wbt / rq-qos
  2021-06-13 15:49 Deadlock in wbt / rq-qos Omar Kilani
@ 2021-06-13 17:03 ` Omar Kilani
  2021-06-14 20:26   ` Omar Kilani
  2021-06-15  9:22 ` Ming Lei
  1 sibling, 1 reply; 6+ messages in thread
From: Omar Kilani @ 2021-06-13 17:03 UTC (permalink / raw)
  To: linux-block

Just looking at blk-wbt.c...

Should...

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/block/blk-wbt.c?h=v5.13-rc5&id=482e302a61f1fc62b0e13be20bc7a11a91b5832d#n164

if (!inflight || diff >= rwb->wb_background / 2)

Be:

if (!inflight || diff >= limit / 2)

?
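
To make the difference between the two conditions concrete, here is a minimal user-space sketch. It is not the kernel code: the helper names `wake_current` and `wake_proposed` are hypothetical, and it assumes `diff` is computed as `limit - inflight` in the linked function.

```c
#include <stdbool.h>

/*
 * Hypothetical model of the existing check: wake sleepers when nothing
 * is in flight, or when the headroom below the limit is at least half
 * of the background depth (wb_background).
 * Assumption: diff = limit - inflight.
 */
static bool wake_current(int inflight, int limit, int wb_background)
{
    int diff = limit - inflight;

    return !inflight || diff >= wb_background / 2;
}

/*
 * Hypothetical model of the variant asked about above: compare the
 * headroom against half of the active limit instead.
 */
static bool wake_proposed(int inflight, int limit)
{
    int diff = limit - inflight;

    return !inflight || diff >= limit / 2;
}
```

With illustrative values limit = 8, wb_background = 4 and inflight = 5, the existing form would wake waiters (3 >= 2) while the proposed form would not (3 < 4), so under these assumptions the proposed check is the stricter one whenever limit exceeds wb_background.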


* Re: Deadlock in wbt / rq-qos
  2021-06-13 17:03 ` Omar Kilani
@ 2021-06-14 20:26   ` Omar Kilani
  0 siblings, 0 replies; 6+ messages in thread
From: Omar Kilani @ 2021-06-14 20:26 UTC (permalink / raw)
  To: linux-block

I improved the logging output:

https://gist.github.com/omarkilani/2ad526c3546b40537b546450c8f685dc

And deadlocked again:

https://gist.githubusercontent.com/omarkilani/3d870b6dc440e04357add8c66d371d86/raw/29437a909af475b92fe4d259ff059beea65cdb8e/wbt.deadlock-002.log


* Re: Deadlock in wbt / rq-qos
  2021-06-13 15:49 Deadlock in wbt / rq-qos Omar Kilani
  2021-06-13 17:03 ` Omar Kilani
@ 2021-06-15  9:22 ` Ming Lei
       [not found]   ` <CA+8F9hjFDE9b31-qsxsVJf4SV9Ctr-mwOJrsw0kVeC7DdN=5XQ@mail.gmail.com>
  1 sibling, 1 reply; 6+ messages in thread
From: Ming Lei @ 2021-06-15  9:22 UTC (permalink / raw)
  To: Omar Kilani; +Cc: linux-block

On Sun, Jun 13, 2021 at 08:49:47AM -0700, Omar Kilani wrote:
> Hi there,
> 
> I appear to have stumbled upon a deadlock in wbt or rq-qos.
> 
> My journal of a lot of data points is over here:
> 
> https://github.com/openzfs/zfs/issues/12204
> 
> I initially deadlocked on RHEL 8.4's 4.18.0-305.3.1.el8_4.x86_64
> kernel, but the code in blk-wbt.c / blk-rq-qos.c is functionally
> identical to 5.13.0-rc5, so I tried that and I'm able to deadlock that
> as well. I believe the same code exists all the way back to 5.0.1.

Jan Kara recently fixed an rq-qos deadlock issue. Can you check whether
the following patch fixes your issue?

https://lore.kernel.org/linux-block/e14aeaa7-45a3-b2f0-7738-3613189ae1d4@kernel.dk/T/#t


Thanks,
Ming

* Re: Deadlock in wbt / rq-qos
       [not found]   ` <CA+8F9hjFDE9b31-qsxsVJf4SV9Ctr-mwOJrsw0kVeC7DdN=5XQ@mail.gmail.com>
@ 2021-06-15 14:07     ` Ming Lei
  2021-06-16 15:06       ` Omar Kilani
  0 siblings, 1 reply; 6+ messages in thread
From: Ming Lei @ 2021-06-15 14:07 UTC (permalink / raw)
  To: Omar Kilani; +Cc: linux-block

On Tue, Jun 15, 2021 at 06:42:40AM -0700, Omar Kilani wrote:
> Hi Ming,
> 
> It looks to be the same issue based on the log timelines. I *think* that
> patch will fix it but it’s really subtle so I’ll test.
> 
> I can only trigger this on an AMD Milan machine for some reason that I
> don’t understand. Sometimes in 800 seconds, sometimes in 5 hours.
> 
> I have a new build with printk’s on the atomic_inc_below to check the
> acquire condition.
> 
> I’ll add that patch and re-test. But I couldn’t find that change in the
> linux-block git? Is it in a specific branch?

The patch is in the for-5.14/block branch:

https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git/commit/?h=for-5.14/block&id=11c7aa0ddea8611007768d3e6b58d45dc60a19e1

Thanks,
Ming
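
For context on the "acquire condition" mentioned in the quoted message: the pattern behind a helper like atomic_inc_below is "increment the counter only while it is still below a limit". The following is a minimal user-space sketch of that pattern using C11 atomics. The name `inc_below` is hypothetical and this is not the kernel's implementation, only an illustration of the technique.

```c
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Try to take one slot: atomically increment *v, but only if its
 * current value is below `below`. Returns true if the slot was taken.
 */
static bool inc_below(atomic_uint *v, unsigned int below)
{
    unsigned int cur = atomic_load(v);

    for (;;) {
        if (cur >= below)
            return false;   /* limit reached, the caller has to wait */

        /*
         * On failure the compare-exchange stores the value it actually
         * observed into cur, so the loop re-checks the limit against
         * fresh data before retrying.
         */
        if (atomic_compare_exchange_weak(v, &cur, cur + 1))
            return true;
    }
}
```

A caller that gets false is expected to sleep until a completion wakes it, so a wake-up that never arrives leaves the waiter stuck even though slots are free.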



* Re: Deadlock in wbt / rq-qos
  2021-06-15 14:07     ` Ming Lei
@ 2021-06-16 15:06       ` Omar Kilani
  0 siblings, 0 replies; 6+ messages in thread
From: Omar Kilani @ 2021-06-16 15:06 UTC (permalink / raw)
  To: Ming Lei; +Cc: linux-block

Hi Ming,

I can confirm after a day of running my repro tests that the patch has
fixed the issue.

Thank you Jan.

Regards,
Omar
