linux-kernel.vger.kernel.org archive mirror
* stalling IO regression in linux 5.12
@ 2022-08-10 16:35 Chris Murphy
  2022-08-10 17:48 ` Josef Bacik
  2022-08-15 11:25 ` stalling IO regression in linux 5.12 Thorsten Leemhuis
  0 siblings, 2 replies; 58+ messages in thread
From: Chris Murphy @ 2022-08-10 16:35 UTC (permalink / raw)
  To: Btrfs BTRFS, Linux-RAID, linux-block, linux-kernel; +Cc: Josef Bacik

CPU: Intel E5-2680 v3
RAM: 128 G
02:00.0 RAID bus controller [0104]: Broadcom / LSI MegaRAID SAS-3 3108 [Invader] [1000:005d] (rev 02), using megaraid_sas driver
8 Disks: TOSHIBA AL13SEB600


The problem manifests as increasing load and increasing IO pressure (PSI) while actual IO drops to zero. It never happens on the 5.11 kernel series, always happens after 5.12-rc1, and persists through 5.18.0. There's a new mix of behaviors with 5.19; I suspect the mm improvements in that series might be masking the problem.

The workload involves openqa, which spins up 30 qemu-kvm instances and runs a bunch of tests, generating quite a lot of writes for each VM: qcow2 files, video in the form of many screenshots, and various log files. Each VM is in its own cgroup. As the problem begins, I see increasing IO pressure and decreasing IO for each qemu instance's cgroup, as well as for the cgroups for httpd, journald, auditd, and postgresql. IO pressure climbs to ~99% and IO is literally 0.
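
The per-cgroup pressure numbers come from the cgroup v2 io.pressure files. A rough way to watch them, assuming cgroup v2 is mounted at /sys/fs/cgroup and the VMs live somewhere like machine.slice (that path is only an example, not necessarily this setup's layout):

  watch -n 5 'grep -H . /sys/fs/cgroup/machine.slice/*/io.pressure'
  # each file prints lines like:
  #   some avg10=99.00 avg60=98.50 avg300=95.00 total=...
  #   full avg10=...   (share of time tasks were stalled on IO)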

Left unattended, the problem eventually results in a completely unresponsive system with no kernel messages. It reproduces in the following configurations; for the first two I provide links to full dmesg with sysrq+w:

btrfs raid10 (native) on plain partitions [1]
btrfs single/dup on dmcrypt on mdadm raid 10 and parity raid [2]
XFS on dmcrypt on mdadm raid10 or parity raid

I've started a bisect, but for a reason I haven't figured out I'm now getting compiled kernels that don't boot the hardware. The failure is very early, such that the UUID for the root file system isn't found, but there's not much to go on as to why. [3] I have tested the first and last skipped commits in the bisect log below; they successfully boot a VM but not the hardware.

Anyway, I'm kinda stuck at this point trying to narrow it down further. Any suggestions? Thanks.

[1] btrfs raid10, plain partitions
https://drive.google.com/file/d/1-oT3MX-hHYtQqI0F3SpgPjCIDXXTysLU/view?usp=sharing

[2] btrfs single/dup, dmcrypt, mdadm raid10
https://drive.google.com/file/d/1m_T3YYaEjBKUROz6dHt5_h92ZVRji9FM/view?usp=sharing

[3] 
$ git bisect log
git bisect start
# status: waiting for both good and bad commits
# bad: [c03c21ba6f4e95e406a1a7b4c34ef334b977c194] Merge tag 'keys-misc-20210126' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
git bisect bad c03c21ba6f4e95e406a1a7b4c34ef334b977c194
# status: waiting for good commit(s), bad commit known
# good: [f40ddce88593482919761f74910f42f4b84c004b] Linux 5.11
git bisect good f40ddce88593482919761f74910f42f4b84c004b
# bad: [df24212a493afda0d4de42176bea10d45825e9a0] Merge tag 's390-5.12-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
git bisect bad df24212a493afda0d4de42176bea10d45825e9a0
# good: [82851fce6107d5a3e66d95aee2ae68860a732703] Merge tag 'arm-dt-v5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
git bisect good 82851fce6107d5a3e66d95aee2ae68860a732703
# good: [99f1a5872b706094ece117368170a92c66b2e242] Merge tag 'nfsd-5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux
git bisect good 99f1a5872b706094ece117368170a92c66b2e242
# bad: [9eef02334505411667a7b51a8f349f8c6c4f3b66] Merge tag 'locking-core-2021-02-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
git bisect bad 9eef02334505411667a7b51a8f349f8c6c4f3b66
# bad: [9820b4dca0f9c6b7ab8b4307286cdace171b724d] Merge tag 'for-5.12/drivers-2021-02-17' of git://git.kernel.dk/linux-block
git bisect bad 9820b4dca0f9c6b7ab8b4307286cdace171b724d
# good: [bd018bbaa58640da786d4289563e71c5ef3938c7] Merge tag 'for-5.12/libata-2021-02-17' of git://git.kernel.dk/linux-block
git bisect good bd018bbaa58640da786d4289563e71c5ef3938c7
# skip: [203c018079e13510f913fd0fd426370f4de0fd05] Merge branch 'md-next' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md into for-5.12/drivers
git bisect skip 203c018079e13510f913fd0fd426370f4de0fd05
# skip: [49d1ec8573f74ff1e23df1d5092211de46baa236] block: manage bio slab cache by xarray
git bisect skip 49d1ec8573f74ff1e23df1d5092211de46baa236
# bad: [73d90386b559d6f4c3c5db5e6bb1b68aae8fd3e7] nvme: cleanup zone information initialization
git bisect bad 73d90386b559d6f4c3c5db5e6bb1b68aae8fd3e7
# skip: [71217df39dc67a0aeed83352b0d712b7892036a2] block, bfq: make waker-queue detection more robust
git bisect skip 71217df39dc67a0aeed83352b0d712b7892036a2
# bad: [8358c28a5d44bf0223a55a2334086c3707bb4185] block: fix memory leak of bvec
git bisect bad 8358c28a5d44bf0223a55a2334086c3707bb4185
# skip: [3a905c37c3510ea6d7cfcdfd0f272ba731286560] block: skip bio_check_eod for partition-remapped bios
git bisect skip 3a905c37c3510ea6d7cfcdfd0f272ba731286560
# skip: [3c337690d2ebb7a01fa13bfa59ce4911f358df42] block, bfq: avoid spurious switches to soft_rt of interactive queues
git bisect skip 3c337690d2ebb7a01fa13bfa59ce4911f358df42
# skip: [3e1a88ec96259282b9a8b45c3f1fda7a3ff4f6ea] bio: add a helper calculating nr segments to alloc
git bisect skip 3e1a88ec96259282b9a8b45c3f1fda7a3ff4f6ea
# skip: [4eb1d689045552eb966ebf25efbc3ce648797d96] blk-crypto: use bio_kmalloc in blk_crypto_clone_bio
git bisect skip 4eb1d689045552eb966ebf25efbc3ce648797d96


--
Chris Murphy

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: stalling IO regression in linux 5.12
  2022-08-10 16:35 stalling IO regression in linux 5.12 Chris Murphy
@ 2022-08-10 17:48 ` Josef Bacik
  2022-08-10 18:33   ` Chris Murphy
  2022-08-15 11:25 ` stalling IO regression in linux 5.12 Thorsten Leemhuis
  1 sibling, 1 reply; 58+ messages in thread
From: Josef Bacik @ 2022-08-10 17:48 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Btrfs BTRFS, Linux-RAID, linux-block, linux-kernel

On Wed, Aug 10, 2022 at 12:35:34PM -0400, Chris Murphy wrote:
> CPU: Intel E5-2680 v3
> RAM: 128 G
> 02:00.0 RAID bus controller [0104]: Broadcom / LSI MegaRAID SAS-3 3108 [Invader] [1000:005d] (rev 02), using megaraid_sas driver
> 8 Disks: TOSHIBA AL13SEB600
> 
> 
> The problem exhibits as increasing load, increasing IO pressure (PSI), and actual IO goes to zero. It never happens on kernel 5.11 series, and always happens after 5.12-rc1 and persists through 5.18.0. There's a new mix of behaviors with 5.19, I suspect the mm improvements in this series might be masking the problem.
> 
> The workload involves openqa, which spins up 30 qemu-kvm instances, and does a bunch of tests, generating quite a lot of writes: qcow2 files, and video in the form of many screenshots, and various log files, for each VM. These VMs are each in their own cgroup. As the problem begins, I see increasing IO pressure, and decreasing IO, for each qemu instance's cgroup, and the cgroups for httpd, journald, auditd, and postgresql. IO pressure goes to nearly ~99% and IO is literally 0.
> 
> The problem left unattended to progress will eventually result in a completely unresponsive system, with no kernel messages. It reproduces in the following configurations, the first two I provide links to full dmesg with sysrq+w:
> 
> btrfs raid10 (native) on plain partitions [1]
> btrfs single/dup on dmcrypt on mdadm raid 10 and parity raid [2]
> XFS on dmcrypt on mdadm raid10 or parity raid
> 
> I've started a bisect, but for some reason I haven't figured out I've started getting compiled kernels that don't boot the hardware. The failure is very early on such that the UUID for the root file system isn't found, but not much to go on as to why.[3] I have tested the first and last skipped commits in the bisect log below, they successfully boot a VM but not the hardware.
> 
> Anyway, I'm kinda stuck at this point trying to narrow it down further. Any suggestions? Thanks.
> 

I looked at the traces; btrfs is stuck waiting on IO and on blk tags, which means
we've got a lot of outstanding requests and are waiting for them to finish so we
can allocate more requests.

Additionally I'm seeing a bunch of the blkg async submit things, which are used
when we have the block cgroup stuff turned on and compression enabled, so we
punt any compressed bios to a per-cgroup async thread to submit the IO's in the
appropriate block cgroup context.

This could mean we're just being overly mean and generating too many IO's, but
since the IO goes to 0 I'm more inclined to believe there's a screw up in
whatever IO cgroup controller you're using.

To help narrow this down can you disable any IO controller you've got enabled
and see if you can reproduce?  If you can sysrq+w is super helpful as it'll
point us in the next direction to look.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: stalling IO regression in linux 5.12
  2022-08-10 17:48 ` Josef Bacik
@ 2022-08-10 18:33   ` Chris Murphy
  2022-08-10 18:42     ` Chris Murphy
  0 siblings, 1 reply; 58+ messages in thread
From: Chris Murphy @ 2022-08-10 18:33 UTC (permalink / raw)
  To: Josef Bacik; +Cc: Btrfs BTRFS, Linux-RAID, linux-block, linux-kernel



On Wed, Aug 10, 2022, at 1:48 PM, Josef Bacik wrote:

> To help narrow this down can you disable any IO controller you've got enabled
> and see if you can reproduce?  If you can sysrq+w is super helpful as it'll
> point us in the next direction to look.  Thanks,

I'm not following, sorry. I can boot with systemd.unified_cgroup_hierarchy=0 to make sure it's all off, but we're not using an IO cgroup controller specifically, as far as I'm aware.

-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: stalling IO regression in linux 5.12
  2022-08-10 18:33   ` Chris Murphy
@ 2022-08-10 18:42     ` Chris Murphy
  2022-08-10 19:31       ` Josef Bacik
  2022-08-10 19:34       ` Chris Murphy
  0 siblings, 2 replies; 58+ messages in thread
From: Chris Murphy @ 2022-08-10 18:42 UTC (permalink / raw)
  To: Josef Bacik; +Cc: Btrfs BTRFS, Linux-RAID, linux-block, linux-kernel



On Wed, Aug 10, 2022, at 2:33 PM, Chris Murphy wrote:
> On Wed, Aug 10, 2022, at 1:48 PM, Josef Bacik wrote:
>
>> To help narrow this down can you disable any IO controller you've got enabled
>> and see if you can reproduce?  If you can sysrq+w is super helpful as it'll
>> point us in the next direction to look.  Thanks,
>
> I'm not following, sorry. I can boot with 
> systemd.unified_cgroup_hierarchy=0 to make sure it's all off, but we're 
> not using an IO cgroup controllers specifically as far as I'm aware.

OK yeah that won't work because the workload requires cgroup2 or it won't run.


-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: stalling IO regression in linux 5.12
  2022-08-10 18:42     ` Chris Murphy
@ 2022-08-10 19:31       ` Josef Bacik
  2022-08-10 19:34       ` Chris Murphy
  1 sibling, 0 replies; 58+ messages in thread
From: Josef Bacik @ 2022-08-10 19:31 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Btrfs BTRFS, Linux-RAID, linux-block, linux-kernel

On Wed, Aug 10, 2022 at 02:42:40PM -0400, Chris Murphy wrote:
> 
> 
> On Wed, Aug 10, 2022, at 2:33 PM, Chris Murphy wrote:
> > On Wed, Aug 10, 2022, at 1:48 PM, Josef Bacik wrote:
> >
> >> To help narrow this down can you disable any IO controller you've got enabled
> >> and see if you can reproduce?  If you can sysrq+w is super helpful as it'll
> >> point us in the next direction to look.  Thanks,
> >
> > I'm not following, sorry. I can boot with 
> > systemd.unified_cgroup_hierarchy=0 to make sure it's all off, but we're 
> > not using an IO cgroup controllers specifically as far as I'm aware.
> 
> OK yeah that won't work because the workload requires cgroup2 or it won't run.
>

Oh no, I don't want cgroups completely off, just the io controller disabled. So
figure out which cgroup your thing is being run in, and then

echo "-io" > <parent dir>/cgroup.subtree_control

If you cat /sys/fs/cgroup/whatever/cgroup.controllers and you see "io" in
there, keep doing the above in the next-highest parent directory until io is
no longer in there.
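
A rough sketch of that walk (machine.slice/foo is only a placeholder for wherever the qemu cgroups actually live):

  cat /sys/fs/cgroup/machine.slice/foo/cgroup.controllers           # does it list "io"?
  echo "-io" > /sys/fs/cgroup/machine.slice/cgroup.subtree_control  # disable io for its children
  cat /sys/fs/cgroup/machine.slice/foo/cgroup.controllers           # re-check
  # repeat one level up (the parent of machine.slice, and so on) until "io" is gone;
  # the write can fail with EBUSY if some child cgroup still has io enabled in its
  # own cgroup.subtree_control -- clear those first.

Thanks,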

Josef 

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: stalling IO regression in linux 5.12
  2022-08-10 18:42     ` Chris Murphy
  2022-08-10 19:31       ` Josef Bacik
@ 2022-08-10 19:34       ` Chris Murphy
  2022-08-12 16:05         ` stalling IO regression since linux 5.12, through 5.18 Chris Murphy
  1 sibling, 1 reply; 58+ messages in thread
From: Chris Murphy @ 2022-08-10 19:34 UTC (permalink / raw)
  To: Josef Bacik; +Cc: Btrfs BTRFS, Linux-RAID, linux-block, linux-kernel



On Wed, Aug 10, 2022, at 2:42 PM, Chris Murphy wrote:
> On Wed, Aug 10, 2022, at 2:33 PM, Chris Murphy wrote:
>> On Wed, Aug 10, 2022, at 1:48 PM, Josef Bacik wrote:
>>
>>> To help narrow this down can you disable any IO controller you've got enabled
>>> and see if you can reproduce?  If you can sysrq+w is super helpful as it'll
>>> point us in the next direction to look.  Thanks,
>>
>> I'm not following, sorry. I can boot with 
>> systemd.unified_cgroup_hierarchy=0 to make sure it's all off, but we're 
>> not using an IO cgroup controllers specifically as far as I'm aware.
>
> OK yeah that won't work because the workload requires cgroup2 or it won't run.


Booted with cgroup_disable=io, and confirmed cat /sys/fs/cgroup/cgroup.controllers does not list io.

I'll rerun the workload now. Sometimes reproduces fast, other times a couple hours.



-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: stalling IO regression since linux 5.12, through 5.18
  2022-08-10 19:34       ` Chris Murphy
@ 2022-08-12 16:05         ` Chris Murphy
  2022-08-12 17:59           ` Josef Bacik
  0 siblings, 1 reply; 58+ messages in thread
From: Chris Murphy @ 2022-08-12 16:05 UTC (permalink / raw)
  To: Josef Bacik, paolo.valente
  Cc: Btrfs BTRFS, Linux-RAID, linux-block, linux-kernel



On Wed, Aug 10, 2022, at 3:34 PM, Chris Murphy wrote:
> Booted with cgroup_disable=io, and confirmed cat 
> /sys/fs/cgroup/cgroup.controllers does not list io.

The problem still reproduces with the cgroup IO controller disabled.

On a whim, I decided to switch the IO scheduler from Fedora's default for rotating drives, bfq, to mq-deadline. The problem does not reproduce for 15+ hours, which is not 100% conclusive but probably 99% conclusive. I then switched back to bfq on all eight drives, live, while running the workload, and within 10 minutes the system cratered: all new commands just hang. Load average goes to triple digits, i/o wait keeps increasing, i/o pressure for the workload tasks hits 100%, and IO completely stalls to zero. I was able to switch only two of the drive queues back to mq-deadline before losing responsiveness in that shell, and had to issue sysrq+b...
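
(For anyone wanting to reproduce the switch: it can be done per drive through sysfs, roughly as below, with sd[a-h] standing in for the eight drives here.)

  for d in /sys/block/sd[a-h]; do
      cat $d/queue/scheduler                # current scheduler shown in brackets, e.g. "mq-deadline kyber [bfq] none"
      echo mq-deadline > $d/queue/scheduler
  done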

Before that I was able to capture sysrq+w and sysrq+t.
https://drive.google.com/file/d/16hdQjyBnuzzQIhiQT6fQdE0nkRQJj7EI/view?usp=sharing
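
(For reference, something along these lines captures those dumps; this is a sketch, not necessarily the exact commands I used.)

  echo 1 > /proc/sys/kernel/sysrq      # enable all sysrq functions
  echo w > /proc/sysrq-trigger         # dump blocked (uninterruptible) tasks
  echo t > /proc/sysrq-trigger         # dump all tasks
  dmesg > sysrq-w-t.txt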

I can't tell if this is a bfq bug, or if there's some negative interaction between bfq and scsi or megaraid_sas. Obviously it's rare, because otherwise people would have been falling over this much sooner. But at this point there's a strong correlation that it's bfq-related: a kernel regression present from 5.12.0 through 5.18.0, and I suspect also in 5.19.0, where it's being partly masked by other improvements.



-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: stalling IO regression since linux 5.12, through 5.18
  2022-08-12 16:05         ` stalling IO regression since linux 5.12, through 5.18 Chris Murphy
@ 2022-08-12 17:59           ` Josef Bacik
  2022-08-12 18:02             ` Jens Axboe
  0 siblings, 1 reply; 58+ messages in thread
From: Josef Bacik @ 2022-08-12 17:59 UTC (permalink / raw)
  To: Chris Murphy
  Cc: Paolo Valente, Btrfs BTRFS, Linux-RAID, linux-block, linux-kernel

On Fri, Aug 12, 2022 at 12:05 PM Chris Murphy <lists@colorremedies.com> wrote:
>
>
>
> On Wed, Aug 10, 2022, at 3:34 PM, Chris Murphy wrote:
> > Booted with cgroup_disable=io, and confirmed cat
> > /sys/fs/cgroup/cgroup.controllers does not list io.
>
> The problem still reproduces with the cgroup IO controller disabled.
>
> On a whim, I decided to switch the IO scheduler from Fedora's default bfq for rotating drives to mq-deadline. The problem does not reproduce for 15+ hours, which is not 100% conclusive but probably 99% conclusive. I then switched live while running the workload to bfq on all eight drives, and within 10 minutes the system cratered, all new commands just hang. Load average goes to triple digits, i/o wait increasing, i/o pressure for the workload tasks to 100%, and IO completely stalls to zero. I was able to switch only two of the drive queues back to mq-deadline and then lost responsivness in that shell and had to issue sysrq+b...
>
> Before that I was able to extra sysrq+w and sysrq+t.
> https://drive.google.com/file/d/16hdQjyBnuzzQIhiQT6fQdE0nkRQJj7EI/view?usp=sharing
>
> I can't tell if this is a bfq bug, or if there's some negative interaction between bfq and scsi or megaraid_sas. Obviously it's rare because otherwise people would have been falling over this much sooner. But at this point there's strong correlation that it's bfq related and is a kernel regression that's been around since 5.12.0 through 5.18.0, and I suspect also 5.19.0 but it's being partly masked by other improvements.

This matches observations we've had internally (inside Facebook) as
well as my continuous integration performance testing.  It should
probably be looked into by the BFQ guys as it was working previously.
Thanks,

Josef

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: stalling IO regression since linux 5.12, through 5.18
  2022-08-12 17:59           ` Josef Bacik
@ 2022-08-12 18:02             ` Jens Axboe
  2022-08-14 20:28               ` Chris Murphy
  0 siblings, 1 reply; 58+ messages in thread
From: Jens Axboe @ 2022-08-12 18:02 UTC (permalink / raw)
  To: Josef Bacik, Chris Murphy
  Cc: Paolo Valente, Btrfs BTRFS, Linux-RAID, linux-block,
	linux-kernel, Jan Kara

On 8/12/22 11:59 AM, Josef Bacik wrote:
> On Fri, Aug 12, 2022 at 12:05 PM Chris Murphy <lists@colorremedies.com> wrote:
>>
>>
>>
>> On Wed, Aug 10, 2022, at 3:34 PM, Chris Murphy wrote:
>>> Booted with cgroup_disable=io, and confirmed cat
>>> /sys/fs/cgroup/cgroup.controllers does not list io.
>>
>> The problem still reproduces with the cgroup IO controller disabled.
>>
>> On a whim, I decided to switch the IO scheduler from Fedora's default bfq for rotating drives to mq-deadline. The problem does not reproduce for 15+ hours, which is not 100% conclusive but probably 99% conclusive. I then switched live while running the workload to bfq on all eight drives, and within 10 minutes the system cratered, all new commands just hang. Load average goes to triple digits, i/o wait increasing, i/o pressure for the workload tasks to 100%, and IO completely stalls to zero. I was able to switch only two of the drive queues back to mq-deadline and then lost responsivness in that shell and had to issue sysrq+b...
>>
>> Before that I was able to extra sysrq+w and sysrq+t.
>> https://drive.google.com/file/d/16hdQjyBnuzzQIhiQT6fQdE0nkRQJj7EI/view?usp=sharing
>>
>> I can't tell if this is a bfq bug, or if there's some negative interaction between bfq and scsi or megaraid_sas. Obviously it's rare because otherwise people would have been falling over this much sooner. But at this point there's strong correlation that it's bfq related and is a kernel regression that's been around since 5.12.0 through 5.18.0, and I suspect also 5.19.0 but it's being partly masked by other improvements.
> 
> This matches observations we've had internally (inside Facebook) as
> well as my continual integration performance testing.  It should
> probably be looked into by the BFQ guys as it was working previously.
> Thanks,

5.12 has a few BFQ changes:

Jan Kara:
      bfq: Avoid false bfq queue merging
      bfq: Use 'ttime' local variable
      bfq: Use only idle IO periods for think time calculations

Jia Cheng Hu
      block, bfq: set next_rq to waker_bfqq->next_rq in waker injection

Paolo Valente
      block, bfq: use half slice_idle as a threshold to check short ttime
      block, bfq: increase time window for waker detection
      block, bfq: do not raise non-default weights
      block, bfq: avoid spurious switches to soft_rt of interactive queues
      block, bfq: do not expire a queue when it is the only busy one
      block, bfq: replace mechanism for evaluating I/O intensity
      block, bfq: re-evaluate convenience of I/O plugging on rq arrivals
      block, bfq: fix switch back from soft-rt weitgh-raising
      block, bfq: save also weight-raised service on queue merging
      block, bfq: save also injection state on queue merging
      block, bfq: make waker-queue detection more robust

Might be worth trying to revert those from 5.12 to see if they are
causing the issue? Jan, Paolo - does this ring any bells?
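
(Roughly, on top of the release tag; treat this purely as a sketch, with the actual
SHAs taken from the v5.11..v5.12 log:

  git checkout v5.12
  git log --oneline v5.11..v5.12 -- block/bfq-iosched.c   # list the BFQ commits above
  git revert --no-edit <sha> [<sha> ...]                  # revert them, newest first
  make olddefconfig && make -j$(nproc)
)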

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: stalling IO regression since linux 5.12, through 5.18
  2022-08-12 18:02             ` Jens Axboe
@ 2022-08-14 20:28               ` Chris Murphy
  2022-08-16 14:22                 ` Chris Murphy
  0 siblings, 1 reply; 58+ messages in thread
From: Chris Murphy @ 2022-08-14 20:28 UTC (permalink / raw)
  To: Jens Axboe, Josef Bacik
  Cc: Paolo Valente, Btrfs BTRFS, Linux-RAID, linux-block,
	linux-kernel, Jan Kara



On Fri, Aug 12, 2022, at 2:02 PM, Jens Axboe wrote:
> On 8/12/22 11:59 AM, Josef Bacik wrote:
>> On Fri, Aug 12, 2022 at 12:05 PM Chris Murphy <lists@colorremedies.com> wrote:
>>>
>>>
>>>
>>> On Wed, Aug 10, 2022, at 3:34 PM, Chris Murphy wrote:
>>>> Booted with cgroup_disable=io, and confirmed cat
>>>> /sys/fs/cgroup/cgroup.controllers does not list io.
>>>
>>> The problem still reproduces with the cgroup IO controller disabled.
>>>
>>> On a whim, I decided to switch the IO scheduler from Fedora's default bfq for rotating drives to mq-deadline. The problem does not reproduce for 15+ hours, which is not 100% conclusive but probably 99% conclusive. I then switched live while running the workload to bfq on all eight drives, and within 10 minutes the system cratered, all new commands just hang. Load average goes to triple digits, i/o wait increasing, i/o pressure for the workload tasks to 100%, and IO completely stalls to zero. I was able to switch only two of the drive queues back to mq-deadline and then lost responsivness in that shell and had to issue sysrq+b...
>>>
>>> Before that I was able to extra sysrq+w and sysrq+t.
>>> https://drive.google.com/file/d/16hdQjyBnuzzQIhiQT6fQdE0nkRQJj7EI/view?usp=sharing
>>>
>>> I can't tell if this is a bfq bug, or if there's some negative interaction between bfq and scsi or megaraid_sas. Obviously it's rare because otherwise people would have been falling over this much sooner. But at this point there's strong correlation that it's bfq related and is a kernel regression that's been around since 5.12.0 through 5.18.0, and I suspect also 5.19.0 but it's being partly masked by other improvements.
>> 
>> This matches observations we've had internally (inside Facebook) as
>> well as my continual integration performance testing.  It should
>> probably be looked into by the BFQ guys as it was working previously.
>> Thanks,
>
> 5.12 has a few BFQ changes:
>
> Jan Kara:
>       bfq: Avoid false bfq queue merging
>       bfq: Use 'ttime' local variable
>       bfq: Use only idle IO periods for think time calculations
>
> Jia Cheng Hu
>       block, bfq: set next_rq to waker_bfqq->next_rq in waker injection
>
> Paolo Valente
>       block, bfq: use half slice_idle as a threshold to check short ttime
>       block, bfq: increase time window for waker detection
>       block, bfq: do not raise non-default weights
>       block, bfq: avoid spurious switches to soft_rt of interactive queues
>       block, bfq: do not expire a queue when it is the only busy one
>       block, bfq: replace mechanism for evaluating I/O intensity
>       block, bfq: re-evaluate convenience of I/O plugging on rq arrivals
>       block, bfq: fix switch back from soft-rt weitgh-raising
>       block, bfq: save also weight-raised service on queue merging
>       block, bfq: save also injection state on queue merging
>       block, bfq: make waker-queue detection more robust
>
> Might be worth trying to revert those from 5.12 to see if they are
> causing the issue? Jan, Paolo - does this ring any bells?

git log --oneline --no-merges v5.11..c03c21ba6f4e > bisect.txt

I tried checking out a33df75c6328, which is right before the first bfq commit, but that kernel won't boot the hardware.

Next I checked out v5.12, then reverted these commits in the order they were found in the bisect.txt file:

7684fbde4516 bfq: Use only idle IO periods for think time calculations
28c6def00919 bfq: Use 'ttime' local variable
41e76c85660c bfq: Avoid false bfq queue merging
>>>a5bf0a92e1b8 bfq: bfq_check_waker() should be static
71217df39dc6 block, bfq: make waker-queue detection more robust
5a5436b98d5c block, bfq: save also injection state on queue merging
e673914d52f9 block, bfq: save also weight-raised service on queue merging
d1f600fa4732 block, bfq: fix switch back from soft-rt weitgh-raising
7f1995c27b19 block, bfq: re-evaluate convenience of I/O plugging on rq arrivals
eb2fd80f9d2c block, bfq: replace mechanism for evaluating I/O intensity
>>>1a23e06cdab2 bfq: don't duplicate code for different paths
2391d13ed484 block, bfq: do not expire a queue when it is the only busy one
3c337690d2eb block, bfq: avoid spurious switches to soft_rt of interactive queues
91b896f65d32 block, bfq: do not raise non-default weights
ab1fb47e33dc block, bfq: increase time window for waker detection
d4fc3640ff36 block, bfq: set next_rq to waker_bfqq->next_rq in waker injection
b5f74ecacc31 block, bfq: use half slice_idle as a threshold to check short ttime

The two commits prefixed by >>> above were not previously mentioned by Jens, but I reverted them anyway because they showed up in the git log command.

OK so, within 10 minutes the problem still happens. This is the block/bfq-iosched.c resulting from the above reverts, in case anyone wants to double-check what I did:
https://drive.google.com/file/d/1ykU7MpmylJuXVobODWiiaLJk-XOiAjSt/view?usp=sharing



-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: stalling IO regression in linux 5.12
  2022-08-10 16:35 stalling IO regression in linux 5.12 Chris Murphy
  2022-08-10 17:48 ` Josef Bacik
@ 2022-08-15 11:25 ` Thorsten Leemhuis
  1 sibling, 0 replies; 58+ messages in thread
From: Thorsten Leemhuis @ 2022-08-15 11:25 UTC (permalink / raw)
  To: Btrfs BTRFS, Linux-RAID, linux-block, linux-kernel, regressions

[TLDR: I'm adding this regression report to the list of tracked
regressions; all text from me below is based on a few template
paragraphs you might have encountered already in similar form.]

Hi, this is your Linux kernel regression tracker.

On 10.08.22 18:35, Chris Murphy wrote:
> CPU: Intel E5-2680 v3
> RAM: 128 G
> 02:00.0 RAID bus controller [0104]: Broadcom / LSI MegaRAID SAS-3 3108 [Invader] [1000:005d] (rev 02), using megaraid_sas driver
> 8 Disks: TOSHIBA AL13SEB600
> 
> 
> The problem exhibits as increasing load, increasing IO pressure (PSI), and actual IO goes to zero. It never happens on kernel 5.11 series, and always happens after 5.12-rc1 and persists through 5.18.0. There's a new mix of behaviors with 5.19, I suspect the mm improvements in this series might be masking the problem.
> 
> The workload involves openqa, which spins up 30 qemu-kvm instances, and does a bunch of tests, generating quite a lot of writes: qcow2 files, and video in the form of many screenshots, and various log files, for each VM. These VMs are each in their own cgroup. As the problem begins, I see increasing IO pressure, and decreasing IO, for each qemu instance's cgroup, and the cgroups for httpd, journald, auditd, and postgresql. IO pressure goes to nearly ~99% and IO is literally 0.
> 
> The problem left unattended to progress will eventually result in a completely unresponsive system, with no kernel messages. It reproduces in the following configurations, the first two I provide links to full dmesg with sysrq+w:
> 
> btrfs raid10 (native) on plain partitions [1]
> btrfs single/dup on dmcrypt on mdadm raid 10 and parity raid [2]
> XFS on dmcrypt on mdadm raid10 or parity raid
> 
> I've started a bisect, but for some reason I haven't figured out I've started getting compiled kernels that don't boot the hardware. The failure is very early on such that the UUID for the root file system isn't found, but not much to go on as to why.[3] I have tested the first and last skipped commits in the bisect log below, they successfully boot a VM but not the hardware.
> 
> Anyway, I'm kinda stuck at this point trying to narrow it down further. Any suggestions? Thanks.
> 
> [1] btrfs raid10, plain partitions
> https://drive.google.com/file/d/1-oT3MX-hHYtQqI0F3SpgPjCIDXXTysLU/view?usp=sharing
> 
> [2] btrfs single/dup, dmcrypt, mdadm raid10
> https://drive.google.com/file/d/1m_T3YYaEjBKUROz6dHt5_h92ZVRji9FM/view?usp=sharing
> 
> [3] 
> $ git bisect log
> git bisect start
> # status: waiting for both good and bad commits
> # bad: [c03c21ba6f4e95e406a1a7b4c34ef334b977c194] Merge tag 'keys-misc-20210126' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
> git bisect bad c03c21ba6f4e95e406a1a7b4c34ef334b977c194
> # status: waiting for good commit(s), bad commit known
> # good: [f40ddce88593482919761f74910f42f4b84c004b] Linux 5.11
> git bisect good f40ddce88593482919761f74910f42f4b84c004b
> # bad: [df24212a493afda0d4de42176bea10d45825e9a0] Merge tag 's390-5.12-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
> git bisect bad df24212a493afda0d4de42176bea10d45825e9a0
> # good: [82851fce6107d5a3e66d95aee2ae68860a732703] Merge tag 'arm-dt-v5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
> git bisect good 82851fce6107d5a3e66d95aee2ae68860a732703
> # good: [99f1a5872b706094ece117368170a92c66b2e242] Merge tag 'nfsd-5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux
> git bisect good 99f1a5872b706094ece117368170a92c66b2e242
> # bad: [9eef02334505411667a7b51a8f349f8c6c4f3b66] Merge tag 'locking-core-2021-02-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
> git bisect bad 9eef02334505411667a7b51a8f349f8c6c4f3b66
> # bad: [9820b4dca0f9c6b7ab8b4307286cdace171b724d] Merge tag 'for-5.12/drivers-2021-02-17' of git://git.kernel.dk/linux-block
> git bisect bad 9820b4dca0f9c6b7ab8b4307286cdace171b724d
> # good: [bd018bbaa58640da786d4289563e71c5ef3938c7] Merge tag 'for-5.12/libata-2021-02-17' of git://git.kernel.dk/linux-block
> git bisect good bd018bbaa58640da786d4289563e71c5ef3938c7
> # skip: [203c018079e13510f913fd0fd426370f4de0fd05] Merge branch 'md-next' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md into for-5.12/drivers
> git bisect skip 203c018079e13510f913fd0fd426370f4de0fd05
> # skip: [49d1ec8573f74ff1e23df1d5092211de46baa236] block: manage bio slab cache by xarray
> git bisect skip 49d1ec8573f74ff1e23df1d5092211de46baa236
> # bad: [73d90386b559d6f4c3c5db5e6bb1b68aae8fd3e7] nvme: cleanup zone information initialization
> git bisect bad 73d90386b559d6f4c3c5db5e6bb1b68aae8fd3e7
> # skip: [71217df39dc67a0aeed83352b0d712b7892036a2] block, bfq: make waker-queue detection more robust
> git bisect skip 71217df39dc67a0aeed83352b0d712b7892036a2
> # bad: [8358c28a5d44bf0223a55a2334086c3707bb4185] block: fix memory leak of bvec
> git bisect bad 8358c28a5d44bf0223a55a2334086c3707bb4185
> # skip: [3a905c37c3510ea6d7cfcdfd0f272ba731286560] block: skip bio_check_eod for partition-remapped bios
> git bisect skip 3a905c37c3510ea6d7cfcdfd0f272ba731286560
> # skip: [3c337690d2ebb7a01fa13bfa59ce4911f358df42] block, bfq: avoid spurious switches to soft_rt of interactive queues
> git bisect skip 3c337690d2ebb7a01fa13bfa59ce4911f358df42
> # skip: [3e1a88ec96259282b9a8b45c3f1fda7a3ff4f6ea] bio: add a helper calculating nr segments to alloc
> git bisect skip 3e1a88ec96259282b9a8b45c3f1fda7a3ff4f6ea
> # skip: [4eb1d689045552eb966ebf25efbc3ce648797d96] blk-crypto: use bio_kmalloc in blk_crypto_clone_bio
> git bisect skip 4eb1d689045552eb966ebf25efbc3ce648797d96

Thanks for the report. To be sure below issue doesn't fall through the
cracks unnoticed, I'm adding it to regzbot, my Linux kernel regression
tracking bot:

#regzbot ^introduced v5.11..v5.12-rc1
#regzbot ignore-activity

This isn't a regression? This issue or a fix for it is already
discussed somewhere else? It was fixed already? You want to clarify when
the regression started to happen? Or point out that I got the title or
something else totally wrong? Then just reply -- ideally also telling
regzbot about it, as explained here:
https://linux-regtracking.leemhuis.info/tracked-regression/

Reminder for developers: when fixing the issue, add 'Link:' tags
pointing to the report (the mail this one replies to), as explained
in the Linux kernel's documentation; the webpage above explains why
this is important for tracked regressions.

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)

P.S.: As the Linux kernel's regression tracker I deal with a lot of
reports and sometimes miss something important when writing mails like
this. If that's the case here, don't hesitate to tell me in a public
reply; it's in everyone's interest to set the public record straight.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: stalling IO regression since linux 5.12, through 5.18
  2022-08-14 20:28               ` Chris Murphy
@ 2022-08-16 14:22                 ` Chris Murphy
  2022-08-16 15:25                   ` Nikolay Borisov
  0 siblings, 1 reply; 58+ messages in thread
From: Chris Murphy @ 2022-08-16 14:22 UTC (permalink / raw)
  To: Jens Axboe, Jan Kara, Paolo Valente
  Cc: Btrfs BTRFS, Linux-RAID, linux-block, linux-kernel, Josef Bacik



On Sun, Aug 14, 2022, at 4:28 PM, Chris Murphy wrote:
> On Fri, Aug 12, 2022, at 2:02 PM, Jens Axboe wrote:
>> Might be worth trying to revert those from 5.12 to see if they are
>> causing the issue? Jan, Paolo - does this ring any bells?
>
> git log --oneline --no-merges v5.11..c03c21ba6f4e > bisect.txt
>
> I tried checking out a33df75c6328, which is right before the first bfq 
> commit, but that kernel won't boot the hardware.
>
> Next I checked out v5.12, then reverted these commits in order (that 
> they were found in the bisect.txt file):
>
> 7684fbde4516 bfq: Use only idle IO periods for think time calculations
> 28c6def00919 bfq: Use 'ttime' local variable
> 41e76c85660c bfq: Avoid false bfq queue merging
>>>>a5bf0a92e1b8 bfq: bfq_check_waker() should be static
> 71217df39dc6 block, bfq: make waker-queue detection more robust
> 5a5436b98d5c block, bfq: save also injection state on queue merging
> e673914d52f9 block, bfq: save also weight-raised service on queue merging
> d1f600fa4732 block, bfq: fix switch back from soft-rt weitgh-raising
> 7f1995c27b19 block, bfq: re-evaluate convenience of I/O plugging on rq arrivals
> eb2fd80f9d2c block, bfq: replace mechanism for evaluating I/O intensity
>>>>1a23e06cdab2 bfq: don't duplicate code for different paths
> 2391d13ed484 block, bfq: do not expire a queue when it is the only busy 
> one
> 3c337690d2eb block, bfq: avoid spurious switches to soft_rt of 
> interactive queues
> 91b896f65d32 block, bfq: do not raise non-default weights
> ab1fb47e33dc block, bfq: increase time window for waker detection
> d4fc3640ff36 block, bfq: set next_rq to waker_bfqq->next_rq in waker 
> injection
> b5f74ecacc31 block, bfq: use half slice_idle as a threshold to check 
> short ttime
>
> The two commits prefixed by >>> above were not previously mentioned by 
> Jens, but I reverted them anyway because they showed up in the git log 
> command.
>
> OK so, within 10 minutes the problem does happen still. This is 
> block/bfq-iosched.c resulting from the above reverts, in case anyone 
> wants to double check what I did:
> https://drive.google.com/file/d/1ykU7MpmylJuXVobODWiiaLJk-XOiAjSt/view?usp=sharing

Any suggestions for further testing? I could try going down farther in the bisect.txt list. The problem is that if the hardware falls over on an unbootable kernel, I have to bug someone with LOM access, and that's a limited resource.


-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: stalling IO regression since linux 5.12, through 5.18
  2022-08-16 14:22                 ` Chris Murphy
@ 2022-08-16 15:25                   ` Nikolay Borisov
  2022-08-16 15:34                     ` Chris Murphy
  0 siblings, 1 reply; 58+ messages in thread
From: Nikolay Borisov @ 2022-08-16 15:25 UTC (permalink / raw)
  To: Chris Murphy, Jens Axboe, Jan Kara, Paolo Valente
  Cc: Btrfs BTRFS, Linux-RAID, linux-block, linux-kernel, Josef Bacik



On 16.08.22 г. 17:22 ч., Chris Murphy wrote:
> 
> 
> On Sun, Aug 14, 2022, at 4:28 PM, Chris Murphy wrote:
>> On Fri, Aug 12, 2022, at 2:02 PM, Jens Axboe wrote:
>>> Might be worth trying to revert those from 5.12 to see if they are
>>> causing the issue? Jan, Paolo - does this ring any bells?
>>
>> git log --oneline --no-merges v5.11..c03c21ba6f4e > bisect.txt
>>
>> I tried checking out a33df75c6328, which is right before the first bfq
>> commit, but that kernel won't boot the hardware.
>>
>> Next I checked out v5.12, then reverted these commits in order (that
>> they were found in the bisect.txt file):
>>
>> 7684fbde4516 bfq: Use only idle IO periods for think time calculations
>> 28c6def00919 bfq: Use 'ttime' local variable
>> 41e76c85660c bfq: Avoid false bfq queue merging
>>>>> a5bf0a92e1b8 bfq: bfq_check_waker() should be static
>> 71217df39dc6 block, bfq: make waker-queue detection more robust
>> 5a5436b98d5c block, bfq: save also injection state on queue merging
>> e673914d52f9 block, bfq: save also weight-raised service on queue merging
>> d1f600fa4732 block, bfq: fix switch back from soft-rt weitgh-raising
>> 7f1995c27b19 block, bfq: re-evaluate convenience of I/O plugging on rq arrivals
>> eb2fd80f9d2c block, bfq: replace mechanism for evaluating I/O intensity
>>>>> 1a23e06cdab2 bfq: don't duplicate code for different paths
>> 2391d13ed484 block, bfq: do not expire a queue when it is the only busy
>> one
>> 3c337690d2eb block, bfq: avoid spurious switches to soft_rt of
>> interactive queues
>> 91b896f65d32 block, bfq: do not raise non-default weights
>> ab1fb47e33dc block, bfq: increase time window for waker detection
>> d4fc3640ff36 block, bfq: set next_rq to waker_bfqq->next_rq in waker
>> injection
>> b5f74ecacc31 block, bfq: use half slice_idle as a threshold to check
>> short ttime
>>
>> The two commits prefixed by >>> above were not previously mentioned by
>> Jens, but I reverted them anyway because they showed up in the git log
>> command.
>>
>> OK so, within 10 minutes the problem does happen still. This is
>> block/bfq-iosched.c resulting from the above reverts, in case anyone
>> wants to double check what I did:
>> https://drive.google.com/file/d/1ykU7MpmylJuXVobODWiiaLJk-XOiAjSt/view?usp=sharing
> 
> Any suggestions for further testing? I could try go down farther in the bisect.txt list. The problem is if the hardware falls over on an unbootable kernel, I have to bug someone with LOM access. That's a limited resource.
> 
> 

How about changing the scheduler to either mq-deadline or noop, just to see
if this is also reproducible with a different scheduler? I guess noop
would imply the blk cgroup controller is going to be disabled.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: stalling IO regression since linux 5.12, through 5.18
  2022-08-16 15:25                   ` Nikolay Borisov
@ 2022-08-16 15:34                     ` Chris Murphy
  2022-08-17  9:52                       ` Holger Hoffstätte
  2022-08-17 12:06                       ` Ming Lei
  0 siblings, 2 replies; 58+ messages in thread
From: Chris Murphy @ 2022-08-16 15:34 UTC (permalink / raw)
  To: Nikolay Borisov, Jens Axboe, Jan Kara, Paolo Valente
  Cc: Btrfs BTRFS, Linux-RAID, linux-block, linux-kernel, Josef Bacik



On Tue, Aug 16, 2022, at 11:25 AM, Nikolay Borisov wrote:
> On 16.08.22 г. 17:22 ч., Chris Murphy wrote:
>> 
>> 
>> On Sun, Aug 14, 2022, at 4:28 PM, Chris Murphy wrote:
>>> On Fri, Aug 12, 2022, at 2:02 PM, Jens Axboe wrote:
>>>> Might be worth trying to revert those from 5.12 to see if they are
>>>> causing the issue? Jan, Paolo - does this ring any bells?
>>>
>>> git log --oneline --no-merges v5.11..c03c21ba6f4e > bisect.txt
>>>
>>> I tried checking out a33df75c6328, which is right before the first bfq
>>> commit, but that kernel won't boot the hardware.
>>>
>>> Next I checked out v5.12, then reverted these commits in order (that
>>> they were found in the bisect.txt file):
>>>
>>> 7684fbde4516 bfq: Use only idle IO periods for think time calculations
>>> 28c6def00919 bfq: Use 'ttime' local variable
>>> 41e76c85660c bfq: Avoid false bfq queue merging
>>>>>> a5bf0a92e1b8 bfq: bfq_check_waker() should be static
>>> 71217df39dc6 block, bfq: make waker-queue detection more robust
>>> 5a5436b98d5c block, bfq: save also injection state on queue merging
>>> e673914d52f9 block, bfq: save also weight-raised service on queue merging
>>> d1f600fa4732 block, bfq: fix switch back from soft-rt weitgh-raising
>>> 7f1995c27b19 block, bfq: re-evaluate convenience of I/O plugging on rq arrivals
>>> eb2fd80f9d2c block, bfq: replace mechanism for evaluating I/O intensity
>>>>>> 1a23e06cdab2 bfq: don't duplicate code for different paths
>>> 2391d13ed484 block, bfq: do not expire a queue when it is the only busy
>>> one
>>> 3c337690d2eb block, bfq: avoid spurious switches to soft_rt of
>>> interactive queues
>>> 91b896f65d32 block, bfq: do not raise non-default weights
>>> ab1fb47e33dc block, bfq: increase time window for waker detection
>>> d4fc3640ff36 block, bfq: set next_rq to waker_bfqq->next_rq in waker
>>> injection
>>> b5f74ecacc31 block, bfq: use half slice_idle as a threshold to check
>>> short ttime
>>>
>>> The two commits prefixed by >>> above were not previously mentioned by
>>> Jens, but I reverted them anyway because they showed up in the git log
>>> command.
>>>
>>> OK so, within 10 minutes the problem does happen still. This is
>>> block/bfq-iosched.c resulting from the above reverts, in case anyone
>>> wants to double check what I did:
>>> https://drive.google.com/file/d/1ykU7MpmylJuXVobODWiiaLJk-XOiAjSt/view?usp=sharing
>> 
>> Any suggestions for further testing? I could try go down farther in the bisect.txt list. The problem is if the hardware falls over on an unbootable kernel, I have to bug someone with LOM access. That's a limited resource.
>> 
>> 
>
> How about changing the scheduler either mq-deadline or noop, just to see 
> if this is also reproducible with a different scheduler. I guess noop 
> would imply the blk cgroup controller is going to be disabled

I already reported on that: it always happens with bfq within an hour or less. It doesn't happen with mq-deadline for ~25+ hours. It does happen with bfq with the above patches reverted. It does happen with cgroup_disable=io set.

Sounds to me like it's something bfq depends on that is somehow becoming perturbed in a way mq-deadline isn't affected by, and that changed between 5.11 and 5.12. I have no idea what's under bfq that matches this description.

-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: stalling IO regression since linux 5.12, through 5.18
  2022-08-16 15:34                     ` Chris Murphy
@ 2022-08-17  9:52                       ` Holger Hoffstätte
  2022-08-17 11:49                         ` Jan Kara
                                           ` (2 more replies)
  2022-08-17 12:06                       ` Ming Lei
  1 sibling, 3 replies; 58+ messages in thread
From: Holger Hoffstätte @ 2022-08-17  9:52 UTC (permalink / raw)
  To: Chris Murphy, Nikolay Borisov, Jens Axboe, Jan Kara, Paolo Valente
  Cc: Linux-RAID, linux-block, linux-kernel, Josef Bacik, linux-block

On 2022-08-16 17:34, Chris Murphy wrote:
> 
> On Tue, Aug 16, 2022, at 11:25 AM, Nikolay Borisov wrote:
>> How about changing the scheduler either mq-deadline or noop, just
>> to see if this is also reproducible with a different scheduler. I
>> guess noop would imply the blk cgroup controller is going to be
>> disabled
> 
> I already reported on that: always happens with bfq within an hour or
> less. Doesn't happen with mq-deadline for ~25+ hours. Does happen
> with bfq with the above patches removed. Does happen with
> cgroup.disabled=io set.
> 
> Sounds to me like it's something bfq depends on and is somehow
> becoming perturbed in a way that mq-deadline does not, and has
> changed between 5.11 and 5.12. I have no idea what's under bfq that
> matches this description.

Chris, just a shot in the dark but can you try the patch from

https://lore.kernel.org/linux-block/20220803121504.212071-1-yukuai1@huaweicloud.com/

on top of something more recent than 5.12? Ideally 5.19 where it applies
cleanly.
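
Something like this should pull it in (just a sketch; lore's /raw endpoint gives
an mbox that git am can apply, and the output file name is arbitrary):

  git checkout v5.19
  wget -O sbitmap-fix.patch \
    'https://lore.kernel.org/linux-block/20220803121504.212071-1-yukuai1@huaweicloud.com/raw'
  git am sbitmap-fix.patch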

No guarantees, I just remembered this patch and your problem sounds like
a lost wakeup. Maybe BFQ just drives the sbitmap in a way that triggers the
symptom.

-h

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: stalling IO regression since linux 5.12, through 5.18
  2022-08-17  9:52                       ` Holger Hoffstätte
@ 2022-08-17 11:49                         ` Jan Kara
  2022-08-17 14:37                           ` Chris Murphy
  2022-08-17 15:09                           ` Chris Murphy
  2022-08-17 11:57                         ` Chris Murphy
  2022-08-17 18:16                         ` Chris Murphy
  2 siblings, 2 replies; 58+ messages in thread
From: Jan Kara @ 2022-08-17 11:49 UTC (permalink / raw)
  To: Holger Hoffstätte
  Cc: Chris Murphy, Nikolay Borisov, Jens Axboe, Jan Kara,
	Paolo Valente, Linux-RAID, linux-block, linux-kernel,
	Josef Bacik

On Wed 17-08-22 11:52:54, Holger Hoffstätte wrote:
> On 2022-08-16 17:34, Chris Murphy wrote:
> > 
> > On Tue, Aug 16, 2022, at 11:25 AM, Nikolay Borisov wrote:
> > > How about changing the scheduler either mq-deadline or noop, just
> > > to see if this is also reproducible with a different scheduler. I
> > > guess noop would imply the blk cgroup controller is going to be
> > > disabled
> > 
> > I already reported on that: always happens with bfq within an hour or
> > less. Doesn't happen with mq-deadline for ~25+ hours. Does happen
> > with bfq with the above patches removed. Does happen with
> > cgroup.disabled=io set.
> > 
> > Sounds to me like it's something bfq depends on and is somehow
> > becoming perturbed in a way that mq-deadline does not, and has
> > changed between 5.11 and 5.12. I have no idea what's under bfq that
> > matches this description.
> 
> Chris, just a shot in the dark but can you try the patch from
> 
> https://lore.kernel.org/linux-block/20220803121504.212071-1-yukuai1@huaweicloud.com/
> 
> on top of something more recent than 5.12? Ideally 5.19 where it applies
> cleanly.
> 
> No guarantees, I just remembered this patch and your problem sounds like
> a lost wakeup. Maybe BFQ just drives the sbitmap in a way that triggers the
> symptom.

Yes, the symptoms look similar, and it happens for devices with shared tagsets
(which megaraid_sas is), but that problem usually appeared when there were
lots of LUNs sharing the tagset, so that the number of tags available per LUN
was rather low. Not sure if that is the case here, but that patch is probably
worth a try.

Another thing worth trying is to compile the kernel without
CONFIG_BFQ_GROUP_IOSCHED. That will essentially disable cgroup support in
BFQ so we will see whether the problem may be cgroup related or not.
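
E.g., starting from the config you already build with (a sketch):

  scripts/config --disable BFQ_GROUP_IOSCHED
  make olddefconfig
  grep BFQ_GROUP_IOSCHED .config   # expect "# CONFIG_BFQ_GROUP_IOSCHED is not set"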

Another interesting thing might be to dump
/sys/kernel/debug/block/<device>/hctx*/{sched_tags,sched_tags_bitmap,tags,tags_bitmap}
as the system is hanging. That should tell us whether tags are in fact in
use or not when processes are blocking waiting for tags.
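
Something along these lines should grab all of them per hardware queue (a sketch;
<device> is the stalled disk, e.g. sda):

  cd /sys/kernel/debug/block/<device>
  grep -aH . hctx*/sched_tags hctx*/sched_tags_bitmap hctx*/tags hctx*/tags_bitmap > /tmp/tags-dump.txt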

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: stalling IO regression since linux 5.12, through 5.18
  2022-08-17  9:52                       ` Holger Hoffstätte
  2022-08-17 11:49                         ` Jan Kara
@ 2022-08-17 11:57                         ` Chris Murphy
  2022-08-17 12:31                           ` Holger Hoffstätte
  2022-08-17 18:16                         ` Chris Murphy
  2 siblings, 1 reply; 58+ messages in thread
From: Chris Murphy @ 2022-08-17 11:57 UTC (permalink / raw)
  To: Holger Hoffstätte, Nikolay Borisov, Jens Axboe, Jan Kara,
	Paolo Valente
  Cc: Linux-RAID, linux-block, linux-kernel, Josef Bacik



On Wed, Aug 17, 2022, at 5:52 AM, Holger Hoffstätte wrote:
> On 2022-08-16 17:34, Chris Murphy wrote:
>> 
>> On Tue, Aug 16, 2022, at 11:25 AM, Nikolay Borisov wrote:
>>> How about changing the scheduler either mq-deadline or noop, just
>>> to see if this is also reproducible with a different scheduler. I
>>> guess noop would imply the blk cgroup controller is going to be
>>> disabled
>> 
>> I already reported on that: always happens with bfq within an hour or
>> less. Doesn't happen with mq-deadline for ~25+ hours. Does happen
>> with bfq with the above patches removed. Does happen with
>> cgroup.disabled=io set.
>> 
>> Sounds to me like it's something bfq depends on and is somehow
>> becoming perturbed in a way that mq-deadline does not, and has
>> changed between 5.11 and 5.12. I have no idea what's under bfq that
>> matches this description.
>
> Chris, just a shot in the dark but can you try the patch from
>
> https://lore.kernel.org/linux-block/20220803121504.212071-1-yukuai1@huaweicloud.com/
>
> on top of something more recent than 5.12? Ideally 5.19 where it applies
> cleanly.

The problem doesn't reliably reproduce on 5.19. A patch for 5.12..5.18 would be much more testable.


-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: stalling IO regression since linux 5.12, through 5.18
  2022-08-16 15:34                     ` Chris Murphy
  2022-08-17  9:52                       ` Holger Hoffstätte
@ 2022-08-17 12:06                       ` Ming Lei
  2022-08-17 14:34                         ` Chris Murphy
  1 sibling, 1 reply; 58+ messages in thread
From: Ming Lei @ 2022-08-17 12:06 UTC (permalink / raw)
  To: Chris Murphy
  Cc: Nikolay Borisov, Jens Axboe, Jan Kara, Paolo Valente,
	Btrfs BTRFS, Linux-RAID, linux-block, linux-kernel, Josef Bacik,
	Ming Lei

Hello Chris,

On Tue, Aug 16, 2022 at 11:35 PM Chris Murphy <lists@colorremedies.com> wrote:
>
>
>
...
>
> I already reported on that: always happens with bfq within an hour or less. Doesn't happen with mq-deadline for ~25+ hours. Does happen with bfq with the above patches removed. Does happen with cgroup.disabled=io set.
>
> Sounds to me like it's something bfq depends on and is somehow becoming perturbed in a way that mq-deadline does not, and has changed between 5.11 and 5.12. I have no idea what's under bfq that matches this description.
>

The blk-mq debugfs log is usually helpful for IO stall issues; care to post
it:

(cd /sys/kernel/debug/block/$disk && find . -type f -exec grep -aH . {} \;)

Thanks,
Ming


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: stalling IO regression since linux 5.12, through 5.18
  2022-08-17 11:57                         ` Chris Murphy
@ 2022-08-17 12:31                           ` Holger Hoffstätte
  0 siblings, 0 replies; 58+ messages in thread
From: Holger Hoffstätte @ 2022-08-17 12:31 UTC (permalink / raw)
  To: Chris Murphy, Nikolay Borisov, Jens Axboe, Jan Kara, Paolo Valente
  Cc: Linux-RAID, linux-block, linux-kernel, Josef Bacik

On 2022-08-17 13:57, Chris Murphy wrote:
> 
> 
> On Wed, Aug 17, 2022, at 5:52 AM, Holger Hoffstätte wrote:
>> On 2022-08-16 17:34, Chris Murphy wrote:
>>>
>>> On Tue, Aug 16, 2022, at 11:25 AM, Nikolay Borisov wrote:
>>>> How about changing the scheduler either mq-deadline or noop, just
>>>> to see if this is also reproducible with a different scheduler. I
>>>> guess noop would imply the blk cgroup controller is going to be
>>>> disabled
>>>
>>> I already reported on that: always happens with bfq within an hour or
>>> less. Doesn't happen with mq-deadline for ~25+ hours. Does happen
>>> with bfq with the above patches removed. Does happen with
>>> cgroup.disabled=io set.
>>>
>>> Sounds to me like it's something bfq depends on and is somehow
>>> becoming perturbed in a way that mq-deadline does not, and has
>>> changed between 5.11 and 5.12. I have no idea what's under bfq that
>>> matches this description.
>>
>> Chris, just a shot in the dark but can you try the patch from
>>
>> https://lore.kernel.org/linux-block/20220803121504.212071-1-yukuai1@huaweicloud.com/
>>
>> on top of something more recent than 5.12? Ideally 5.19 where it applies
>> cleanly.
> 
> The problem doesn't reliably reproduce on 5.19. A patch for 5.12..5.18 would be much more testable.

If you look at the changes to sbitmap at:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/log/lib/sbitmap.c

you'll find that they are relatively recent, so Yu Kuai's patch will probably also apply
to 5.18 - I don't know. Also look at the most recent commit, which mentions
"Checking free bits when setting the target bits. Otherwise, it may reuse the busying bits."

Reusing the busy bits sounds "not great" either and (AFAIU) may also be a cause for
lost wakeups, but I'm sure Jan and Ming know all that better than me.

In particular, Jan's suggestion re. disabling BFQ cgroup support is probably the easiest
thing to try first. What you're observing may not have a single root cause, and even if
it does, it might not be where we suspect.

-h

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: stalling IO regression since linux 5.12, through 5.18
  2022-08-17 12:06                       ` Ming Lei
@ 2022-08-17 14:34                         ` Chris Murphy
  2022-08-17 14:53                           ` Ming Lei
  0 siblings, 1 reply; 58+ messages in thread
From: Chris Murphy @ 2022-08-17 14:34 UTC (permalink / raw)
  To: Ming Lei
  Cc: Nikolay Borisov, Jens Axboe, Jan Kara, Paolo Valente,
	Btrfs BTRFS, Linux-RAID, linux-block, linux-kernel, Josef Bacik



On Wed, Aug 17, 2022, at 8:06 AM, Ming Lei wrote:

> blk-mq debugfs log is usually helpful for io stall issue, care to post
> the blk-mq debugfs log:
>
> (cd /sys/kernel/debug/block/$disk && find . -type f -exec grep -aH . {} \;)

This is only sda
https://drive.google.com/file/d/1aAld-kXb3RUiv_ShAvD_AGAFDRS03Lr0/view?usp=sharing

This is all the block devices
https://drive.google.com/file/d/1iHqRuoz8ZzvkNcMtkV3Ep7h5Uof7sTKw/view?usp=sharing

-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: stalling IO regression since linux 5.12, through 5.18
  2022-08-17 11:49                         ` Jan Kara
@ 2022-08-17 14:37                           ` Chris Murphy
  2022-08-17 15:09                           ` Chris Murphy
  1 sibling, 0 replies; 58+ messages in thread
From: Chris Murphy @ 2022-08-17 14:37 UTC (permalink / raw)
  To: Jan Kara, Holger Hoffstätte
  Cc: Nikolay Borisov, Jens Axboe, Paolo Valente, Linux-RAID,
	linux-block, linux-kernel, Josef Bacik



On Wed, Aug 17, 2022, at 7:49 AM, Jan Kara wrote:

> Another thing worth trying is to compile the kernel without
> CONFIG_BFQ_GROUP_IOSCHED. That will essentially disable cgroup support in
> BFQ so we will see whether the problem may be cgroup related or not.

Does the boot param cgroup_disable=io affect it? Because the problem still happens with that parameter. Otherwise I can build a kernel with it disabled.

-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: stalling IO regression since linux 5.12, through 5.18
  2022-08-17 14:34                         ` Chris Murphy
@ 2022-08-17 14:53                           ` Ming Lei
  2022-08-17 15:02                             ` Chris Murphy
  0 siblings, 1 reply; 58+ messages in thread
From: Ming Lei @ 2022-08-17 14:53 UTC (permalink / raw)
  To: Chris Murphy
  Cc: Nikolay Borisov, Jens Axboe, Jan Kara, Paolo Valente,
	Btrfs BTRFS, Linux-RAID, linux-block, linux-kernel, Josef Bacik

On Wed, Aug 17, 2022 at 10:34:38AM -0400, Chris Murphy wrote:
> 
> 
> On Wed, Aug 17, 2022, at 8:06 AM, Ming Lei wrote:
> 
> > blk-mq debugfs log is usually helpful for io stall issue, care to post
> > the blk-mq debugfs log:
> >
> > (cd /sys/kernel/debug/block/$disk && find . -type f -exec grep -aH . {} \;)
> 
> This is only sda
> https://drive.google.com/file/d/1aAld-kXb3RUiv_ShAvD_AGAFDRS03Lr0/view?usp=sharing

From the log, there isn't any in-flight IO request.

So please confirm that it is collected after the IO stall is triggered.

If yes, the issue may not be related with BFQ, and should be related
with blk-cgroup code.


Thanks,
Ming


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: stalling IO regression since linux 5.12, through 5.18
  2022-08-17 14:53                           ` Ming Lei
@ 2022-08-17 15:02                             ` Chris Murphy
  2022-08-17 15:34                               ` Ming Lei
  0 siblings, 1 reply; 58+ messages in thread
From: Chris Murphy @ 2022-08-17 15:02 UTC (permalink / raw)
  To: Ming Lei
  Cc: Nikolay Borisov, Jens Axboe, Jan Kara, Paolo Valente,
	Btrfs BTRFS, Linux-RAID, linux-block, linux-kernel, Josef Bacik



On Wed, Aug 17, 2022, at 10:53 AM, Ming Lei wrote:
> On Wed, Aug 17, 2022 at 10:34:38AM -0400, Chris Murphy wrote:
>> 
>> 
>> On Wed, Aug 17, 2022, at 8:06 AM, Ming Lei wrote:
>> 
>> > blk-mq debugfs log is usually helpful for io stall issue, care to post
>> > the blk-mq debugfs log:
>> >
>> > (cd /sys/kernel/debug/block/$disk && find . -type f -exec grep -aH . {} \;)
>> 
>> This is only sda
>> https://drive.google.com/file/d/1aAld-kXb3RUiv_ShAvD_AGAFDRS03Lr0/view?usp=sharing
>
> From the log, there isn't any in-flight IO request.
>
> So please confirm that it is collected after the IO stall is triggered.

Yes, iotop reports no reads or writes at the time of collection. IO pressure 99% for auditd, systemd-journald, rsyslogd, and postgresql, with increasing pressure from all the qemu processes.

Keep in mind this is a raid10, so maybe it's enough for just one block device IO to stall and the whole thing stops? That's why I included all block devices.

> If yes, the issue may not be related with BFQ, and should be related
> with blk-cgroup code.

Problem happens with cgroup.disable=io, does this setting affect blk-cgroup?

-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: stalling IO regression since linux 5.12, through 5.18
  2022-08-17 11:49                         ` Jan Kara
  2022-08-17 14:37                           ` Chris Murphy
@ 2022-08-17 15:09                           ` Chris Murphy
  2022-08-17 16:30                             ` Jan Kara
  1 sibling, 1 reply; 58+ messages in thread
From: Chris Murphy @ 2022-08-17 15:09 UTC (permalink / raw)
  To: Jan Kara, Holger Hoffstätte
  Cc: Nikolay Borisov, Jens Axboe, Paolo Valente, Linux-RAID,
	linux-block, linux-kernel, Josef Bacik



On Wed, Aug 17, 2022, at 7:49 AM, Jan Kara wrote:

>
> Another thing worth trying is to compile the kernel without
> CONFIG_BFQ_GROUP_IOSCHED. That will essentially disable cgroup support in
> BFQ so we will see whether the problem may be cgroup related or not.

The problem happens with a 5.12.0 kernel built without CONFIG_BFQ_GROUP_IOSCHED.


-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: stalling IO regression since linux 5.12, through 5.18
  2022-08-17 15:02                             ` Chris Murphy
@ 2022-08-17 15:34                               ` Ming Lei
  2022-08-17 16:34                                 ` Chris Murphy
  0 siblings, 1 reply; 58+ messages in thread
From: Ming Lei @ 2022-08-17 15:34 UTC (permalink / raw)
  To: Chris Murphy
  Cc: Nikolay Borisov, Jens Axboe, Jan Kara, Paolo Valente,
	Btrfs BTRFS, Linux-RAID, linux-block, linux-kernel, Josef Bacik

On Wed, Aug 17, 2022 at 11:02:25AM -0400, Chris Murphy wrote:
> 
> 
> On Wed, Aug 17, 2022, at 10:53 AM, Ming Lei wrote:
> > On Wed, Aug 17, 2022 at 10:34:38AM -0400, Chris Murphy wrote:
> >> 
> >> 
> >> On Wed, Aug 17, 2022, at 8:06 AM, Ming Lei wrote:
> >> 
> >> > blk-mq debugfs log is usually helpful for io stall issue, care to post
> >> > the blk-mq debugfs log:
> >> >
> >> > (cd /sys/kernel/debug/block/$disk && find . -type f -exec grep -aH . {} \;)
> >> 
> >> This is only sda
> >> https://drive.google.com/file/d/1aAld-kXb3RUiv_ShAvD_AGAFDRS03Lr0/view?usp=sharing
> >
> > From the log, there isn't any in-flight IO request.
> >
> > So please confirm that it is collected after the IO stall is triggered.
> 
> Yes, iotop reports no reads or writes at the time of collection. IO pressure 99% for auditd, systemd-journald, rsyslogd, and postgresql, with increasing pressure from all the qemu processes.
> 
> Keep in mind this is a raid10, so maybe it's enough for just one block device IO to stall and the whole thing stops? That's why I included all block devices.
> 

From the 2nd log of blockdebugfs-all.txt, still not see any in-flight IO on
request based block devices, but sda is _not_ included in this log, and
only sdi, sdg and sdf are collected, is that expected?

BTW, all request based block devices should be observed in blk-mq debugfs.



thanks,
Ming


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: stalling IO regression since linux 5.12, through 5.18
  2022-08-17 15:09                           ` Chris Murphy
@ 2022-08-17 16:30                             ` Jan Kara
  2022-08-17 16:47                               ` Chris Murphy
  0 siblings, 1 reply; 58+ messages in thread
From: Jan Kara @ 2022-08-17 16:30 UTC (permalink / raw)
  To: Chris Murphy
  Cc: Jan Kara, Holger Hoffstätte, Nikolay Borisov, Jens Axboe,
	Paolo Valente, Linux-RAID, linux-block, linux-kernel,
	Josef Bacik

On Wed 17-08-22 11:09:26, Chris Murphy wrote:
> 
> 
> On Wed, Aug 17, 2022, at 7:49 AM, Jan Kara wrote:
> 
> >
> > Another thing worth trying is to compile the kernel without
> > CONFIG_BFQ_GROUP_IOSCHED. That will essentially disable cgroup support in
> > BFQ so we will see whether the problem may be cgroup related or not.
> 
> The problem happens with a 5.12.0 kernel built without
> CONFIG_BFQ_GROUP_IOSCHED.

Thanks for testing! Just to answer your previous question: This is
different from cgroup.disable=io because BFQ takes different code paths. So
this makes it even less likely this is some obscure BFQ bug. Why BFQ could
be different here from mq-deadline is that it artificially reduces device
queue depth (it sets shallow_depth when allocating new tags) and maybe that
triggers some bug in request tag allocation.

BTW, are you sure the first problematic kernel is 5.12? Because support for
shared tagsets was added to megaraid_sas driver in 5.11 (5.11-rc3 in
particular - commit 81e7eb5bf08f3 ("Revert "Revert "scsi: megaraid_sas:
Added support for shared host tagset for cpuhotplug"")) and that is one
candidate I'd expect to start to trigger issues. BTW that may be an
interesting thing to try: Can you boot with
"megaraid_sas.host_tagset_enable = 0" kernel option and see whether the
issue reproduces?
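
A quick way to double-check that the driver actually picked the option up
(assuming the parameter is exported under the module's sysfs directory, as
module parameters usually are):

cat /sys/module/megaraid_sas/parameters/host_tagset_enable

It should report the disabled value (0 or N, depending on the parameter type)
after booting with the option.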

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: stalling IO regression since linux 5.12, through 5.18
  2022-08-17 15:34                               ` Ming Lei
@ 2022-08-17 16:34                                 ` Chris Murphy
  2022-08-18  1:03                                   ` Ming Lei
  0 siblings, 1 reply; 58+ messages in thread
From: Chris Murphy @ 2022-08-17 16:34 UTC (permalink / raw)
  To: Ming Lei
  Cc: Nikolay Borisov, Jens Axboe, Jan Kara, Paolo Valente,
	Btrfs BTRFS, Linux-RAID, linux-block, linux-kernel, Josef Bacik



On Wed, Aug 17, 2022, at 11:34 AM, Ming Lei wrote:

> From the 2nd log of blockdebugfs-all.txt, still not see any in-flight IO on
> request based block devices, but sda is _not_ included in this log, and
> only sdi, sdg and sdf are collected, is that expected?

While the problem was happening I did

cd /sys/kernel/debug/block
find . -type f -exec grep -aH . {} \;

The file has the nodes out of order, but I don't know enough about the interface to see if there are things that are missing, or what it means.


> BTW, all request based block devices should be observed in blk-mq debugfs.

/sys/kernel/debug/block contains

drwxr-xr-x.  2 root root 0 Aug 17 15:20 md0
drwxr-xr-x. 51 root root 0 Aug 17 15:20 sda
drwxr-xr-x. 51 root root 0 Aug 17 15:20 sdb
drwxr-xr-x. 51 root root 0 Aug 17 15:20 sdc
drwxr-xr-x. 51 root root 0 Aug 17 15:20 sdd
drwxr-xr-x. 51 root root 0 Aug 17 15:20 sde
drwxr-xr-x. 51 root root 0 Aug 17 15:20 sdf
drwxr-xr-x. 51 root root 0 Aug 17 15:20 sdg
drwxr-xr-x. 51 root root 0 Aug 17 15:20 sdh
drwxr-xr-x.  4 root root 0 Aug 17 15:20 sdi
drwxr-xr-x.  2 root root 0 Aug 17 15:20 zram0


-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: stalling IO regression since linux 5.12, through 5.18
  2022-08-17 16:30                             ` Jan Kara
@ 2022-08-17 16:47                               ` Chris Murphy
  2022-08-17 17:57                                 ` Chris Murphy
  0 siblings, 1 reply; 58+ messages in thread
From: Chris Murphy @ 2022-08-17 16:47 UTC (permalink / raw)
  To: Jan Kara
  Cc: Holger Hoffstätte, Nikolay Borisov, Jens Axboe,
	Paolo Valente, Linux-RAID, linux-block, linux-kernel,
	Josef Bacik



On Wed, Aug 17, 2022, at 12:30 PM, Jan Kara wrote:

> BTW, are you sure the first problematic kernel is 5.12? 

100%

It consistently reproduces with any 5.12 series kernel, including from c03c21ba6f4e, which is before rc1. It's frustrating that git bisect produces kernels that won't boot; I was more than halfway through! :D And could have been done by now...

We've been running on 5.11 series kernels for a year because of this problem.


> BTW that may be an
> interesting thing to try: Can you boot with
> "megaraid_sas.host_tagset_enable = 0" kernel option and see whether the
> issue reproduces?

Yep.

-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: stalling IO regression since linux 5.12, through 5.18
  2022-08-17 16:47                               ` Chris Murphy
@ 2022-08-17 17:57                                 ` Chris Murphy
  2022-08-17 18:15                                   ` Jan Kara
  0 siblings, 1 reply; 58+ messages in thread
From: Chris Murphy @ 2022-08-17 17:57 UTC (permalink / raw)
  To: Jan Kara
  Cc: Holger Hoffstätte, Nikolay Borisov, Jens Axboe,
	Paolo Valente, Linux-RAID, linux-block, linux-kernel,
	Josef Bacik



On Wed, Aug 17, 2022, at 12:47 PM, Chris Murphy wrote:
Can you boot with
>> "megaraid_sas.host_tagset_enable = 0" kernel option and see whether the
>> issue reproduces?

This has been running an hour without symptoms. It's strongly suggestive, but needs to run overnight to be sure. Anecdotally, the max write IO is less than what I'm used to seeing.

[    0.583121] Kernel command line: BOOT_IMAGE=(md/0)/vmlinuz-5.12.5-300.fc34.x86_64 root=UUID=04f1fb7f-5cc4-4dfb-a7cf-b6b6925bf895 ro rootflags=subvol=root rd.md.uuid=e7782150:092e161a:68395862:31375bca biosdevname=1 net.ifnames=0 log_buf_len=8M plymouth.enable=0 megaraid_sas.host_tagset_enable=0
...
[    6.745964] megasas: 07.714.04.00-rc1
[    6.758472] megaraid_sas 0000:02:00.0: BAR:0x1  BAR's base_addr(phys):0x0000000092000000  mapped virt_addr:0x00000000c54554ff
[    6.758477] megaraid_sas 0000:02:00.0: FW now in Ready state
[    6.770658] megaraid_sas 0000:02:00.0: 63 bit DMA mask and 32 bit consistent mask
[    6.795060] megaraid_sas 0000:02:00.0: firmware supports msix	: (96)
[    6.807537] megaraid_sas 0000:02:00.0: requested/available msix 49/49
[    6.819259] megaraid_sas 0000:02:00.0: current msix/online cpus	: (49/48)
[    6.830800] megaraid_sas 0000:02:00.0: RDPQ mode	: (disabled)
[    6.842031] megaraid_sas 0000:02:00.0: Current firmware supports maximum commands: 928	 LDIO threshold: 0
[    6.871246] megaraid_sas 0000:02:00.0: Performance mode :Latency (latency index = 1)
[    6.882265] megaraid_sas 0000:02:00.0: FW supports sync cache	: No
[    6.893034] megaraid_sas 0000:02:00.0: megasas_disable_intr_fusion is called outbound_intr_mask:0x40000009
[    6.988550] megaraid_sas 0000:02:00.0: FW provided supportMaxExtLDs: 1	max_lds: 64
[    6.988554] megaraid_sas 0000:02:00.0: controller type	: MR(2048MB)
[    6.988555] megaraid_sas 0000:02:00.0: Online Controller Reset(OCR)	: Enabled
[    6.988556] megaraid_sas 0000:02:00.0: Secure JBOD support	: No
[    6.988557] megaraid_sas 0000:02:00.0: NVMe passthru support	: No
[    6.988558] megaraid_sas 0000:02:00.0: FW provided TM TaskAbort/Reset timeout	: 0 secs/0 secs
[    6.988559] megaraid_sas 0000:02:00.0: JBOD sequence map support	: No
[    6.988560] megaraid_sas 0000:02:00.0: PCI Lane Margining support	: No
[    7.025160] megaraid_sas 0000:02:00.0: megasas_enable_intr_fusion is called outbound_intr_mask:0x40000000
[    7.025162] megaraid_sas 0000:02:00.0: INIT adapter done
[    7.025164] megaraid_sas 0000:02:00.0: JBOD sequence map is disabled megasas_setup_jbod_map 5707
[    7.029878] megaraid_sas 0000:02:00.0: pci id		: (0x1000)/(0x005d)/(0x1028)/(0x1f47)
[    7.029881] megaraid_sas 0000:02:00.0: unevenspan support	: yes
[    7.029882] megaraid_sas 0000:02:00.0: firmware crash dump	: no
[    7.029883] megaraid_sas 0000:02:00.0: JBOD sequence map	: disabled
[    7.029915] megaraid_sas 0000:02:00.0: Max firmware commands: 927 shared with nr_hw_queues = 1
[    7.029918] scsi host11: Avago SAS based MegaRAID driver




-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: stalling IO regression since linux 5.12, through 5.18
  2022-08-17 17:57                                 ` Chris Murphy
@ 2022-08-17 18:15                                   ` Jan Kara
  2022-08-17 18:18                                     ` Chris Murphy
  2022-08-17 18:21                                     ` Holger Hoffstätte
  0 siblings, 2 replies; 58+ messages in thread
From: Jan Kara @ 2022-08-17 18:15 UTC (permalink / raw)
  To: Chris Murphy
  Cc: Jan Kara, Holger Hoffstätte, Nikolay Borisov, Jens Axboe,
	Paolo Valente, Linux-RAID, linux-block, linux-kernel,
	Josef Bacik

On Wed 17-08-22 13:57:00, Chris Murphy wrote:
> On Wed, Aug 17, 2022, at 12:47 PM, Chris Murphy wrote:
> Can you boot with
> >> "megaraid_sas.host_tagset_enable = 0" kernel option and see whether the
> >> issue reproduces?
> 
> This has been running an hour without symptoms. It's strongly suggestive,
> but needs to run overnight to be sure. Anecdotally, the max write IO is
> less than what I'm used to seeing.

OK, if this indeed passes then b6e68ee82585 ("blk-mq: Improve performance
of non-mq IO schedulers with multiple HW queues") might be what's causing
issues (although I don't know how yet...).

								Honza

> 
> [    0.583121] Kernel command line: BOOT_IMAGE=(md/0)/vmlinuz-5.12.5-300.fc34.x86_64 root=UUID=04f1fb7f-5cc4-4dfb-a7cf-b6b6925bf895 ro rootflags=subvol=root rd.md.uuid=e7782150:092e161a:68395862:31375bca biosdevname=1 net.ifnames=0 log_buf_len=8M plymouth.enable=0 megaraid_sas.host_tagset_enable=0
> ...
> [    6.745964] megasas: 07.714.04.00-rc1
> [    6.758472] megaraid_sas 0000:02:00.0: BAR:0x1  BAR's base_addr(phys):0x0000000092000000  mapped virt_addr:0x00000000c54554ff
> [    6.758477] megaraid_sas 0000:02:00.0: FW now in Ready state
> [    6.770658] megaraid_sas 0000:02:00.0: 63 bit DMA mask and 32 bit consistent mask
> [    6.795060] megaraid_sas 0000:02:00.0: firmware supports msix	: (96)
> [    6.807537] megaraid_sas 0000:02:00.0: requested/available msix 49/49
> [    6.819259] megaraid_sas 0000:02:00.0: current msix/online cpus	: (49/48)
> [    6.830800] megaraid_sas 0000:02:00.0: RDPQ mode	: (disabled)
> [    6.842031] megaraid_sas 0000:02:00.0: Current firmware supports maximum commands: 928	 LDIO threshold: 0
> [    6.871246] megaraid_sas 0000:02:00.0: Performance mode :Latency (latency index = 1)
> [    6.882265] megaraid_sas 0000:02:00.0: FW supports sync cache	: No
> [    6.893034] megaraid_sas 0000:02:00.0: megasas_disable_intr_fusion is called outbound_intr_mask:0x40000009
> [    6.988550] megaraid_sas 0000:02:00.0: FW provided supportMaxExtLDs: 1	max_lds: 64
> [    6.988554] megaraid_sas 0000:02:00.0: controller type	: MR(2048MB)
> [    6.988555] megaraid_sas 0000:02:00.0: Online Controller Reset(OCR)	: Enabled
> [    6.988556] megaraid_sas 0000:02:00.0: Secure JBOD support	: No
> [    6.988557] megaraid_sas 0000:02:00.0: NVMe passthru support	: No
> [    6.988558] megaraid_sas 0000:02:00.0: FW provided TM TaskAbort/Reset timeout	: 0 secs/0 secs
> [    6.988559] megaraid_sas 0000:02:00.0: JBOD sequence map support	: No
> [    6.988560] megaraid_sas 0000:02:00.0: PCI Lane Margining support	: No
> [    7.025160] megaraid_sas 0000:02:00.0: megasas_enable_intr_fusion is called outbound_intr_mask:0x40000000
> [    7.025162] megaraid_sas 0000:02:00.0: INIT adapter done
> [    7.025164] megaraid_sas 0000:02:00.0: JBOD sequence map is disabled megasas_setup_jbod_map 5707
> [    7.029878] megaraid_sas 0000:02:00.0: pci id		: (0x1000)/(0x005d)/(0x1028)/(0x1f47)
> [    7.029881] megaraid_sas 0000:02:00.0: unevenspan support	: yes
> [    7.029882] megaraid_sas 0000:02:00.0: firmware crash dump	: no
> [    7.029883] megaraid_sas 0000:02:00.0: JBOD sequence map	: disabled
> [    7.029915] megaraid_sas 0000:02:00.0: Max firmware commands: 927 shared with nr_hw_queues = 1
> [    7.029918] scsi host11: Avago SAS based MegaRAID driver
> 
> 
> 
> 
> -- 
> Chris Murphy
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: stalling IO regression since linux 5.12, through 5.18
  2022-08-17  9:52                       ` Holger Hoffstätte
  2022-08-17 11:49                         ` Jan Kara
  2022-08-17 11:57                         ` Chris Murphy
@ 2022-08-17 18:16                         ` Chris Murphy
  2022-08-17 18:38                           ` Holger Hoffstätte
  2 siblings, 1 reply; 58+ messages in thread
From: Chris Murphy @ 2022-08-17 18:16 UTC (permalink / raw)
  To: Holger Hoffstätte, Nikolay Borisov, Jens Axboe, Jan Kara,
	Paolo Valente
  Cc: Linux-RAID, linux-block, linux-kernel, Josef Bacik



On Wed, Aug 17, 2022, at 5:52 AM, Holger Hoffstätte wrote:

> Chris, just a shot in the dark but can you try the patch from
>
> https://lore.kernel.org/linux-block/20220803121504.212071-1-yukuai1@huaweicloud.com/
>
> on top of something more recent than 5.12? Ideally 5.19 where it applies
> cleanly.


This patch applies cleanly on 5.12.0. I can try newer kernels later, but since the problem reproduces so easily on 5.12, and first appeared there, that's why I'm sticking with it. (For sure we'd prefer to be on the 5.19 series.)

Let me know if I should try it still.


-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: stalling IO regression since linux 5.12, through 5.18
  2022-08-17 18:15                                   ` Jan Kara
@ 2022-08-17 18:18                                     ` Chris Murphy
  2022-08-17 18:33                                       ` Jan Kara
  2022-08-17 18:21                                     ` Holger Hoffstätte
  1 sibling, 1 reply; 58+ messages in thread
From: Chris Murphy @ 2022-08-17 18:18 UTC (permalink / raw)
  To: Jan Kara
  Cc: Holger Hoffstätte, Nikolay Borisov, Jens Axboe,
	Paolo Valente, Linux-RAID, linux-block, linux-kernel,
	Josef Bacik



On Wed, Aug 17, 2022, at 2:15 PM, Jan Kara wrote:

> OK, if this indeed passes then b6e68ee82585 ("blk-mq: Improve performance
> of non-mq IO schedulers with multiple HW queues") might be what's causing
> issues (although I don't know how yet...).

I can revert it from 5.12.0 and try. Let me know which next test is preferred :)
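
For the record, roughly what I'll run (a sketch, assuming the revert applies cleanly and my existing .config is in place):

git checkout v5.12
git revert --no-edit b6e68ee82585
make olddefconfig && make -j"$(nproc)"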


-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: stalling IO regression since linux 5.12, through 5.18
  2022-08-17 18:15                                   ` Jan Kara
  2022-08-17 18:18                                     ` Chris Murphy
@ 2022-08-17 18:21                                     ` Holger Hoffstätte
  1 sibling, 0 replies; 58+ messages in thread
From: Holger Hoffstätte @ 2022-08-17 18:21 UTC (permalink / raw)
  To: Jan Kara, Chris Murphy
  Cc: Nikolay Borisov, Jens Axboe, Paolo Valente, Linux-RAID,
	linux-block, linux-kernel, Josef Bacik

On 2022-08-17 20:15, Jan Kara wrote:
> On Wed 17-08-22 13:57:00, Chris Murphy wrote:
>> On Wed, Aug 17, 2022, at 12:47 PM, Chris Murphy wrote:
>> Can you boot with
>>>> "megaraid_sas.host_tagset_enable = 0" kernel option and see whether the
>>>> issue reproduces?
>>
>> This has been running an hour without symptoms. It's strongly suggestive,
>> but needs to run overnight to be sure. Anecdotally, the max write IO is
>> less than what I'm used to seeing.
> 
> OK, if this indeed passes then b6e68ee82585 ("blk-mq: Improve performance
> of non-mq IO schedulers with multiple HW queues") might be what's causing
> issues (although I don't know how yet...).
> 
> 								Honza

Certainly explains why BFQ turned up as a suspect, considering it's still
single-queue (fair MQ scheduling is .. complicated).

-h

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: stalling IO regression since linux 5.12, through 5.18
  2022-08-17 18:18                                     ` Chris Murphy
@ 2022-08-17 18:33                                       ` Jan Kara
  2022-08-17 18:54                                         ` Chris Murphy
  0 siblings, 1 reply; 58+ messages in thread
From: Jan Kara @ 2022-08-17 18:33 UTC (permalink / raw)
  To: Chris Murphy
  Cc: Jan Kara, Holger Hoffstätte, Nikolay Borisov, Jens Axboe,
	Paolo Valente, Linux-RAID, linux-block, linux-kernel,
	Josef Bacik

On Wed 17-08-22 14:18:01, Chris Murphy wrote:
> 
> 
> On Wed, Aug 17, 2022, at 2:15 PM, Jan Kara wrote:
> 
> > OK, if this indeed passes then b6e68ee82585 ("blk-mq: Improve performance
> > of non-mq IO schedulers with multiple HW queues") might be what's causing
> > issues (although I don't know how yet...).
> 
> I can revert it from 5.12.0 and try. Let me know which next test is preferred :)

Let's try to revert this first so that we have it narrowed down what
started causing the issues. 

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: stalling IO regression since linux 5.12, through 5.18
  2022-08-17 18:16                         ` Chris Murphy
@ 2022-08-17 18:38                           ` Holger Hoffstätte
  0 siblings, 0 replies; 58+ messages in thread
From: Holger Hoffstätte @ 2022-08-17 18:38 UTC (permalink / raw)
  To: Chris Murphy, Nikolay Borisov, Jens Axboe, Jan Kara, Paolo Valente
  Cc: Linux-RAID, linux-block, linux-kernel, Josef Bacik

On 2022-08-17 20:16, Chris Murphy wrote:
> 
> 
> On Wed, Aug 17, 2022, at 5:52 AM, Holger Hoffstätte wrote:
> 
>> Chris, just a shot in the dark but can you try the patch from
>>
>> https://lore.kernel.org/linux-block/20220803121504.212071-1-yukuai1@huaweicloud.com/
>>
>> on top of something more recent than 5.12? Ideally 5.19 where it applies
>> cleanly.
> 
> 
> This patch applies cleanly on 5.12.0. I can try newer kernels later, but since the problem reproduces so easily on 5.12, and first appeared there, that's why I'm sticking with it. (For sure we'd prefer to be on the 5.19 series.)
> 
> Let me know if I should try it still.

I just started running it in 5.19.2 to see if it breaks something;
no issues so far but then again I didn't have any problems to begin with
and only do peasant I/O load, and no MegaRAID.
However, if it applies *and builds* on 5.12 I'd just go ahead and see what
catches fire. But you need to keep the megaraid setting at the value that fails, otherwise we
won't be able to tell whether this patch is really a contributing factor,
or whether it's the other commit that Jan identified.
Unfortunately 5.12 is a bit old already and most of the other important
fixes to sbitmap.c probably won't apply due to some other blk-mq changes.

In any case the plot thickens, so keep going. :)

-h

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: stalling IO regression since linux 5.12, through 5.18
  2022-08-17 18:33                                       ` Jan Kara
@ 2022-08-17 18:54                                         ` Chris Murphy
  2022-08-17 19:23                                           ` Chris Murphy
  2022-08-18  2:31                                           ` Chris Murphy
  0 siblings, 2 replies; 58+ messages in thread
From: Chris Murphy @ 2022-08-17 18:54 UTC (permalink / raw)
  To: Jan Kara
  Cc: Holger Hoffstätte, Nikolay Borisov, Jens Axboe,
	Paolo Valente, Linux-RAID, linux-block, linux-kernel,
	Josef Bacik



On Wed, Aug 17, 2022, at 2:33 PM, Jan Kara wrote:
> On Wed 17-08-22 14:18:01, Chris Murphy wrote:
>> 
>> 
>> On Wed, Aug 17, 2022, at 2:15 PM, Jan Kara wrote:
>> 
>> > OK, if this indeed passes then b6e68ee82585 ("blk-mq: Improve performance
>> > of non-mq IO schedulers with multiple HW queues") might be what's causing
>> > issues (although I don't know how yet...).
>> 
>> I can revert it from 5.12.0 and try. Let me know which next test is preferred :)
>
> Let's try to revert this first so that we have it narrowed down what
> started causing the issues. 

OK I've reverted b6e68ee82585, and removing megaraid_sas.host_tagset_enable=0, and will restart the workload...

Usually it's within 10 minutes but the newer the kernel it seems the longer it takes, or the more things I have to throw at it. The problem doesn't reproduce at all with 5.19 series unless I also run a separate dnf install, and that only triggers maybe 1 in 3 times.

-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: stalling IO regression since linux 5.12, through 5.18
  2022-08-17 18:54                                         ` Chris Murphy
@ 2022-08-17 19:23                                           ` Chris Murphy
  2022-08-18  2:31                                           ` Chris Murphy
  1 sibling, 0 replies; 58+ messages in thread
From: Chris Murphy @ 2022-08-17 19:23 UTC (permalink / raw)
  To: Jan Kara
  Cc: Holger Hoffstätte, Nikolay Borisov, Jens Axboe,
	Paolo Valente, Linux-RAID, linux-block, linux-kernel,
	Josef Bacik



On Wed, Aug 17, 2022, at 2:54 PM, Chris Murphy wrote:
> On Wed, Aug 17, 2022, at 2:33 PM, Jan Kara wrote:
>> On Wed 17-08-22 14:18:01, Chris Murphy wrote:
>>> 
>>> 
>>> On Wed, Aug 17, 2022, at 2:15 PM, Jan Kara wrote:
>>> 
>>> > OK, if this indeed passes then b6e68ee82585 ("blk-mq: Improve performance
>>> > of non-mq IO schedulers with multiple HW queues") might be what's causing
>>> > issues (although I don't know how yet...).
>>> 
>>> I can revert it from 5.12.0 and try. Let me know which next test is preferred :)
>>
>> Let's try to revert this first so that we have it narrowed down what
>> started causing the issues. 
>
> OK I've reverted b6e68ee82585, and removing 
> megaraid_sas.host_tagset_enable=0, and will restart the workload...
>
> Usually it's within 10 minutes but the newer the kernel it seems the 
> longer it takes, or the more things I have to throw at it. The problem 
> doesn't reproduce at all with 5.19 series unless I also run a separate 
> dnf install, and that only triggers maybe 1 in 3 times.

What I'm seeing is similar to 5.18 and occasionally 5.19...

top reports high %wa, above 30% and sometimes above 60%, and increasing load (48 CPUs, so a load of 48 is OK, but this is triple digits, which never happens on 5.11 series kernels).

IO pressure is 10x higher than with mq-deadline (or bfq on a 5.11 series kernel): 40-50% right now.

iotop usually craters to 0 by now, but it's near normal.

So I think b6e68ee82585 is a contributing factor, but it isn't the only factor. I'm going to let this keep running and see if it matures into the more typical failure pattern.



-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: stalling IO regression since linux 5.12, through 5.18
  2022-08-17 16:34                                 ` Chris Murphy
@ 2022-08-18  1:03                                   ` Ming Lei
  2022-08-18  2:30                                     ` Chris Murphy
  0 siblings, 1 reply; 58+ messages in thread
From: Ming Lei @ 2022-08-18  1:03 UTC (permalink / raw)
  To: Chris Murphy
  Cc: Nikolay Borisov, Jens Axboe, Jan Kara, Paolo Valente,
	Btrfs BTRFS, Linux-RAID, linux-block, linux-kernel, Josef Bacik

On Wed, Aug 17, 2022 at 12:34:42PM -0400, Chris Murphy wrote:
> 
> 
> On Wed, Aug 17, 2022, at 11:34 AM, Ming Lei wrote:
> 
> > From the 2nd log of blockdebugfs-all.txt, still not see any in-flight IO on
> > request based block devices, but sda is _not_ included in this log, and
> > only sdi, sdg and sdf are collected, is that expected?
> 
> While the problem was happening I did
> 
> cd /sys/kernel/debug/block
> find . -type f -exec grep -aH . {} \;
> 
> The file has the nodes out of order, but I don't know enough about the interface to see if there are things that are missing, or what it means.
> 
> 
> > BTW, all request based block devices should be observed in blk-mq debugfs.
> 
> /sys/kernel/debug/block contains
> 
> drwxr-xr-x.  2 root root 0 Aug 17 15:20 md0
> drwxr-xr-x. 51 root root 0 Aug 17 15:20 sda
> drwxr-xr-x. 51 root root 0 Aug 17 15:20 sdb
> drwxr-xr-x. 51 root root 0 Aug 17 15:20 sdc
> drwxr-xr-x. 51 root root 0 Aug 17 15:20 sdd
> drwxr-xr-x. 51 root root 0 Aug 17 15:20 sde
> drwxr-xr-x. 51 root root 0 Aug 17 15:20 sdf
> drwxr-xr-x. 51 root root 0 Aug 17 15:20 sdg
> drwxr-xr-x. 51 root root 0 Aug 17 15:20 sdh
> drwxr-xr-x.  4 root root 0 Aug 17 15:20 sdi
> drwxr-xr-x.  2 root root 0 Aug 17 15:20 zram0

OK, so lots of devices are missed in your log, and the following command
is supposed to work for collecting log from all block device's debugfs:

(cd /sys/kernel/debug/block/ && find . -type f -exec grep -aH . {} \;)
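
For example, one snapshot can be captured into a file while the stall is happening (the file name is just an example; errors from unreadable nodes are ignored):

(cd /sys/kernel/debug/block/ && find . -type f -exec grep -aH . {} \; 2>/dev/null) > blockdebugfs-all.txt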


Thanks,
Ming


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: stalling IO regression since linux 5.12, through 5.18
  2022-08-18  1:03                                   ` Ming Lei
@ 2022-08-18  2:30                                     ` Chris Murphy
  2022-08-18  3:24                                       ` Ming Lei
  0 siblings, 1 reply; 58+ messages in thread
From: Chris Murphy @ 2022-08-18  2:30 UTC (permalink / raw)
  To: Ming Lei
  Cc: Nikolay Borisov, Jens Axboe, Jan Kara, Paolo Valente,
	Btrfs BTRFS, Linux-RAID, linux-block, linux-kernel, Josef Bacik



On Wed, Aug 17, 2022, at 9:03 PM, Ming Lei wrote:
> On Wed, Aug 17, 2022 at 12:34:42PM -0400, Chris Murphy wrote:
>> 
>> 
>> On Wed, Aug 17, 2022, at 11:34 AM, Ming Lei wrote:
>> 
>> > From the 2nd log of blockdebugfs-all.txt, still not see any in-flight IO on
>> > request based block devices, but sda is _not_ included in this log, and
>> > only sdi, sdg and sdf are collected, is that expected?
>> 
>> While the problem was happening I did
>> 
>> cd /sys/kernel/debug/block
>> find . -type f -exec grep -aH . {} \;
>> 
>> The file has the nodes out of order, but I don't know enough about the interface to see if there are things that are missing, or what it means.
>> 
>> 
>> > BTW, all request based block devices should be observed in blk-mq debugfs.
>> 
>> /sys/kernel/debug/block contains
>> 
>> drwxr-xr-x.  2 root root 0 Aug 17 15:20 md0
>> drwxr-xr-x. 51 root root 0 Aug 17 15:20 sda
>> drwxr-xr-x. 51 root root 0 Aug 17 15:20 sdb
>> drwxr-xr-x. 51 root root 0 Aug 17 15:20 sdc
>> drwxr-xr-x. 51 root root 0 Aug 17 15:20 sdd
>> drwxr-xr-x. 51 root root 0 Aug 17 15:20 sde
>> drwxr-xr-x. 51 root root 0 Aug 17 15:20 sdf
>> drwxr-xr-x. 51 root root 0 Aug 17 15:20 sdg
>> drwxr-xr-x. 51 root root 0 Aug 17 15:20 sdh
>> drwxr-xr-x.  4 root root 0 Aug 17 15:20 sdi
>> drwxr-xr-x.  2 root root 0 Aug 17 15:20 zram0
>
> OK, so lots of devices are missed in your log, and the following command
> is supposed to work for collecting log from all block device's debugfs:
>
> (cd /sys/kernel/debug/block/ && find . -type f -exec grep -aH . {} \;)

OK here it is:

https://drive.google.com/file/d/18nEOx2Ghsqx8uII6nzWpCFuYENHuQd-f/view?usp=sharing


-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: stalling IO regression since linux 5.12, through 5.18
  2022-08-17 18:54                                         ` Chris Murphy
  2022-08-17 19:23                                           ` Chris Murphy
@ 2022-08-18  2:31                                           ` Chris Murphy
  1 sibling, 0 replies; 58+ messages in thread
From: Chris Murphy @ 2022-08-18  2:31 UTC (permalink / raw)
  To: Jan Kara
  Cc: Holger Hoffstätte, Nikolay Borisov, Jens Axboe,
	Paolo Valente, Linux-RAID, linux-block, linux-kernel,
	Josef Bacik



On Wed, Aug 17, 2022, at 2:54 PM, Chris Murphy wrote:
> On Wed, Aug 17, 2022, at 2:33 PM, Jan Kara wrote:
>> On Wed 17-08-22 14:18:01, Chris Murphy wrote:
>>> 
>>> 
>>> On Wed, Aug 17, 2022, at 2:15 PM, Jan Kara wrote:
>>> 
>>> > OK, if this indeed passes then b6e68ee82585 ("blk-mq: Improve performance
>>> > of non-mq IO schedulers with multiple HW queues") might be what's causing
>>> > issues (although I don't know how yet...).
>>> 
>>> I can revert it from 5.12.0 and try. Let me know which next test is preferred :)
>>
>> Let's try to revert this first so that we have it narrowed down what
>> started causing the issues. 
>
> OK I've reverted b6e68ee82585, and removing 
> megaraid_sas.host_tagset_enable=0, and will restart the workload...

I ran this for 7 hours and the problem didn't happen.

-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: stalling IO regression since linux 5.12, through 5.18
  2022-08-18  2:30                                     ` Chris Murphy
@ 2022-08-18  3:24                                       ` Ming Lei
  2022-08-18  4:12                                         ` Chris Murphy
  0 siblings, 1 reply; 58+ messages in thread
From: Ming Lei @ 2022-08-18  3:24 UTC (permalink / raw)
  To: Chris Murphy
  Cc: Nikolay Borisov, Jens Axboe, Jan Kara, Paolo Valente,
	Btrfs BTRFS, Linux-RAID, linux-block, linux-kernel, Josef Bacik

On Wed, Aug 17, 2022 at 10:30:39PM -0400, Chris Murphy wrote:
> 
> 
> On Wed, Aug 17, 2022, at 9:03 PM, Ming Lei wrote:
> > On Wed, Aug 17, 2022 at 12:34:42PM -0400, Chris Murphy wrote:
> >> 
> >> 
> >> On Wed, Aug 17, 2022, at 11:34 AM, Ming Lei wrote:
> >> 
> >> > From the 2nd log of blockdebugfs-all.txt, still not see any in-flight IO on
> >> > request based block devices, but sda is _not_ included in this log, and
> >> > only sdi, sdg and sdf are collected, is that expected?
> >> 
> >> While the problem was happening I did
> >> 
> >> cd /sys/kernel/debug/block
> >> find . -type f -exec grep -aH . {} \;
> >> 
> >> The file has the nodes out of order, but I don't know enough about the interface to see if there are things that are missing, or what it means.
> >> 
> >> 
> >> > BTW, all request based block devices should be observed in blk-mq debugfs.
> >> 
> >> /sys/kernel/debug/block contains
> >> 
> >> drwxr-xr-x.  2 root root 0 Aug 17 15:20 md0
> >> drwxr-xr-x. 51 root root 0 Aug 17 15:20 sda
> >> drwxr-xr-x. 51 root root 0 Aug 17 15:20 sdb
> >> drwxr-xr-x. 51 root root 0 Aug 17 15:20 sdc
> >> drwxr-xr-x. 51 root root 0 Aug 17 15:20 sdd
> >> drwxr-xr-x. 51 root root 0 Aug 17 15:20 sde
> >> drwxr-xr-x. 51 root root 0 Aug 17 15:20 sdf
> >> drwxr-xr-x. 51 root root 0 Aug 17 15:20 sdg
> >> drwxr-xr-x. 51 root root 0 Aug 17 15:20 sdh
> >> drwxr-xr-x.  4 root root 0 Aug 17 15:20 sdi
> >> drwxr-xr-x.  2 root root 0 Aug 17 15:20 zram0
> >
> > OK, so lots of devices are missed in your log, and the following command
> > is supposed to work for collecting log from all block device's debugfs:
> >
> > (cd /sys/kernel/debug/block/ && find . -type f -exec grep -aH . {} \;)
> 
> OK here it is:
> 
> https://drive.google.com/file/d/18nEOx2Ghsqx8uII6nzWpCFuYENHuQd-f/view?usp=sharing

The above log shows that the io stall happens on sdd, where:

1) 616 requests pending from scheduler queue

grep "busy=" blockdebugfs-all2.txt | grep sdd | grep sched | awk -F "=" '{s+=$2} END {print s}'
616

2) 11 requests pending from ./sdd/hctx2/dispatch for more than 300 seconds
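
For reference, the dispatch counts come from the same snapshot; a sketch of the pipeline, relying on the "path:content" prefix that grep -aH produces:

grep -a "^\./sdd/hctx" blockdebugfs-all2.txt | grep "/dispatch:" | cut -d/ -f3 | sort | uniq -c

which prints the number of pending dispatch-list entries per hctx.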

Recently we have seldom observed io hangs from the dispatch list, except for the
following two:

https://lore.kernel.org/linux-block/20220803023355.3687360-1-yuyufen@huaweicloud.com/
https://lore.kernel.org/linux-block/20220726122224.1790882-1-yukuai1@huaweicloud.com/

BTW, what is the output of the following log?

	(cd /sys/block/sdd/device && find . -type f -exec grep -aH . {} \;)

Also, the above log shows that host_tagset_enable support is still
crippled on v5.12. I guess the issue may not be triggered (or will be much
harder to trigger) after you update to d97e594c5166 ("blk-mq: Use request
queue-wide tags for tagset-wide sbitmap"), or to v5.14.
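
You can check which release first shipped that commit from a mainline checkout, for example:

git describe --contains d97e594c5166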



thanks,
Ming


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: stalling IO regression since linux 5.12, through 5.18
  2022-08-18  3:24                                       ` Ming Lei
@ 2022-08-18  4:12                                         ` Chris Murphy
  2022-08-18  4:18                                           ` Chris Murphy
  0 siblings, 1 reply; 58+ messages in thread
From: Chris Murphy @ 2022-08-18  4:12 UTC (permalink / raw)
  To: Ming Lei
  Cc: Nikolay Borisov, Jens Axboe, Jan Kara, Paolo Valente,
	Btrfs BTRFS, Linux-RAID, linux-block, linux-kernel, Josef Bacik



On Wed, Aug 17, 2022, at 11:41 PM, Ming Lei wrote:

> OK, can you post the blk-mq debugfs log after you trigger it on v5.17?

https://drive.google.com/file/d/1n8f66pVLCwQTJ0PMd71EiUZoeTWQk3dB/view?usp=sharing

This time it happened pretty quickly. This log is soon after triple digit load and no IO, but not as fully developed as before. The system has become entirely unresponsive to new commands, so I have to issue sysrq+b - if I let it go too long even that won't work.

-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: stalling IO regression since linux 5.12, through 5.18
  2022-08-18  4:12                                         ` Chris Murphy
@ 2022-08-18  4:18                                           ` Chris Murphy
  2022-08-18  4:27                                             ` Chris Murphy
  0 siblings, 1 reply; 58+ messages in thread
From: Chris Murphy @ 2022-08-18  4:18 UTC (permalink / raw)
  To: Ming Lei
  Cc: Nikolay Borisov, Jens Axboe, Jan Kara, Paolo Valente,
	Btrfs BTRFS, Linux-RAID, linux-block, linux-kernel, Josef Bacik



On Thu, Aug 18, 2022, at 12:12 AM, Chris Murphy wrote:
> On Wed, Aug 17, 2022, at 11:41 PM, Ming Lei wrote:
>
>> OK, can you post the blk-mq debugfs log after you trigger it on v5.17?
>
> https://drive.google.com/file/d/1n8f66pVLCwQTJ0PMd71EiUZoeTWQk3dB/view?usp=sharing
>
> This time it happened pretty quickly. This log is soon after triple 
> digit load and no IO, but not as fully developed as before. The system 
> has become entirely unresponsive to new commands, so I have to issue 
> sysrq+b - if I let it go too long even that won't work.

OK by the time I clicked send, the system had recovered. That also sometimes happens but then later IO stalls again and won't recover.  So I haven't issued sysrq+b on this run yet. Here is a second blk-mq debugfs log...

https://drive.google.com/file/d/1irHcns0qe7e7DJaDfanX8vSiqE1Nj5xl/view?usp=sharing


-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: stalling IO regression since linux 5.12, through 5.18
  2022-08-18  4:18                                           ` Chris Murphy
@ 2022-08-18  4:27                                             ` Chris Murphy
  2022-08-18  4:32                                               ` Chris Murphy
                                                                 ` (2 more replies)
  0 siblings, 3 replies; 58+ messages in thread
From: Chris Murphy @ 2022-08-18  4:27 UTC (permalink / raw)
  To: Ming Lei
  Cc: Nikolay Borisov, Jens Axboe, Jan Kara, Paolo Valente,
	Btrfs BTRFS, Linux-RAID, linux-block, linux-kernel, Josef Bacik



On Thu, Aug 18, 2022, at 12:18 AM, Chris Murphy wrote:
> On Thu, Aug 18, 2022, at 12:12 AM, Chris Murphy wrote:
>> On Wed, Aug 17, 2022, at 11:41 PM, Ming Lei wrote:
>>
>>> OK, can you post the blk-mq debugfs log after you trigger it on v5.17?

Same boot, 3rd log. But the load is above 300 so I kinda need to sysrq+b soon.

https://drive.google.com/file/d/1375H558kqPTdng439rvG6LuXXWPXLToo/view?usp=sharing



-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: stalling IO regression since linux 5.12, through 5.18
  2022-08-18  4:27                                             ` Chris Murphy
@ 2022-08-18  4:32                                               ` Chris Murphy
  2022-08-18  5:15                                               ` Ming Lei
  2022-08-18  5:24                                               ` Ming Lei
  2 siblings, 0 replies; 58+ messages in thread
From: Chris Murphy @ 2022-08-18  4:32 UTC (permalink / raw)
  To: Ming Lei
  Cc: Nikolay Borisov, Jens Axboe, Jan Kara, Paolo Valente,
	Btrfs BTRFS, Linux-RAID, linux-block, linux-kernel, Josef Bacik



On Thu, Aug 18, 2022, at 12:27 AM, Chris Murphy wrote:
> On Thu, Aug 18, 2022, at 12:18 AM, Chris Murphy wrote:
>> On Thu, Aug 18, 2022, at 12:12 AM, Chris Murphy wrote:
>>> On Wed, Aug 17, 2022, at 11:41 PM, Ming Lei wrote:
>>>
>>>> OK, can you post the blk-mq debugfs log after you trigger it on v5.17?
>
> Same boot, 3rd log. But the load is above 300 so I kinda need to sysrq+b soon.
>
> https://drive.google.com/file/d/1375H558kqPTdng439rvG6LuXXWPXLToo/view?usp=sharing

sysfs sdc

https://drive.google.com/file/d/1DLZHX8Mg_d5w-XSsAYYK1NDzn1pA_QPm/view?usp=sharing

-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: stalling IO regression since linux 5.12, through 5.18
  2022-08-18  4:27                                             ` Chris Murphy
  2022-08-18  4:32                                               ` Chris Murphy
@ 2022-08-18  5:15                                               ` Ming Lei
  2022-08-18 18:52                                                 ` Chris Murphy
  2022-08-18  5:24                                               ` Ming Lei
  2 siblings, 1 reply; 58+ messages in thread
From: Ming Lei @ 2022-08-18  5:15 UTC (permalink / raw)
  To: Chris Murphy
  Cc: Nikolay Borisov, Jens Axboe, Jan Kara, Paolo Valente,
	Btrfs BTRFS, Linux-RAID, linux-block, linux-kernel, Josef Bacik

On Thu, Aug 18, 2022 at 12:27:04AM -0400, Chris Murphy wrote:
> 
> 
> On Thu, Aug 18, 2022, at 12:18 AM, Chris Murphy wrote:
> > On Thu, Aug 18, 2022, at 12:12 AM, Chris Murphy wrote:
> >> On Wed, Aug 17, 2022, at 11:41 PM, Ming Lei wrote:
> >>
> >>> OK, can you post the blk-mq debugfs log after you trigger it on v5.17?
> 
> Same boot, 3rd log. But the load is above 300 so I kinda need to sysrq+b soon.
> 
> https://drive.google.com/file/d/1375H558kqPTdng439rvG6LuXXWPXLToo/view?usp=sharing
> 

Please test the following patch and see if it makes a difference:

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index a4f7c101b53b..8e8d77e79dd6 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -44,7 +44,10 @@ void __blk_mq_sched_restart(struct blk_mq_hw_ctx *hctx)
 	 */
 	smp_mb();
 
-	blk_mq_run_hw_queue(hctx, true);
+	if (blk_mq_is_shared_tags(hctx->flags))
+		blk_mq_run_hw_queues(hctx->queue, true);
+	else
+		blk_mq_run_hw_queue(hctx, true);
 }
 
 static int sched_rq_cmp(void *priv, const struct list_head *a,


Thanks,
Ming


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* Re: stalling IO regression since linux 5.12, through 5.18
  2022-08-18  4:27                                             ` Chris Murphy
  2022-08-18  4:32                                               ` Chris Murphy
  2022-08-18  5:15                                               ` Ming Lei
@ 2022-08-18  5:24                                               ` Ming Lei
  2022-08-18 13:50                                                 ` Chris Murphy
  2022-08-19 19:20                                                 ` Chris Murphy
  2 siblings, 2 replies; 58+ messages in thread
From: Ming Lei @ 2022-08-18  5:24 UTC (permalink / raw)
  To: Chris Murphy
  Cc: Nikolay Borisov, Jens Axboe, Jan Kara, Paolo Valente,
	Btrfs BTRFS, Linux-RAID, linux-block, linux-kernel, Josef Bacik

On Thu, Aug 18, 2022 at 12:27:04AM -0400, Chris Murphy wrote:
> 
> 
> On Thu, Aug 18, 2022, at 12:18 AM, Chris Murphy wrote:
> > On Thu, Aug 18, 2022, at 12:12 AM, Chris Murphy wrote:
> >> On Wed, Aug 17, 2022, at 11:41 PM, Ming Lei wrote:
> >>
> >>> OK, can you post the blk-mq debugfs log after you trigger it on v5.17?
> 
> Same boot, 3rd log. But the load is above 300 so I kinda need to sysrq+b soon.
> 
> https://drive.google.com/file/d/1375H558kqPTdng439rvG6LuXXWPXLToo/view?usp=sharing
> 

Also please test the following one too:


diff --git a/block/blk-mq.c b/block/blk-mq.c
index 5ee62b95f3e5..d01c64be08e2 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1991,7 +1991,8 @@ bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx *hctx, struct list_head *list,
 		if (!needs_restart ||
 		    (no_tag && list_empty_careful(&hctx->dispatch_wait.entry)))
 			blk_mq_run_hw_queue(hctx, true);
-		else if (needs_restart && needs_resource)
+		else if (needs_restart && (needs_resource ||
+					blk_mq_is_shared_tags(hctx->flags)))
 			blk_mq_delay_run_hw_queue(hctx, BLK_MQ_RESOURCE_DELAY);
 
 		blk_mq_update_dispatch_busy(hctx, true);


Thanks,
Ming


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* Re: stalling IO regression since linux 5.12, through 5.18
  2022-08-18  5:24                                               ` Ming Lei
@ 2022-08-18 13:50                                                 ` Chris Murphy
  2022-08-18 15:10                                                   ` Ming Lei
  2022-08-19 19:20                                                 ` Chris Murphy
  1 sibling, 1 reply; 58+ messages in thread
From: Chris Murphy @ 2022-08-18 13:50 UTC (permalink / raw)
  To: Ming Lei
  Cc: Nikolay Borisov, Jens Axboe, Jan Kara, Paolo Valente,
	Btrfs BTRFS, Linux-RAID, linux-block, linux-kernel, Josef Bacik



On Thu, Aug 18, 2022, at 1:24 AM, Ming Lei wrote:

>
> Also please test the following one too:
>
>
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 5ee62b95f3e5..d01c64be08e2 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1991,7 +1991,8 @@ bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx 
> *hctx, struct list_head *list,
>  		if (!needs_restart ||
>  		    (no_tag && list_empty_careful(&hctx->dispatch_wait.entry)))
>  			blk_mq_run_hw_queue(hctx, true);
> -		else if (needs_restart && needs_resource)
> +		else if (needs_restart && (needs_resource ||
> +					blk_mq_is_shared_tags(hctx->flags)))
>  			blk_mq_delay_run_hw_queue(hctx, BLK_MQ_RESOURCE_DELAY);
> 
>  		blk_mq_update_dispatch_busy(hctx, true);
>

Should I test both patches at the same time, or separately? On top of v5.17 clean, or with b6e68ee82585 still reverted?

-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: stalling IO regression since linux 5.12, through 5.18
  2022-08-18 13:50                                                 ` Chris Murphy
@ 2022-08-18 15:10                                                   ` Ming Lei
  0 siblings, 0 replies; 58+ messages in thread
From: Ming Lei @ 2022-08-18 15:10 UTC (permalink / raw)
  To: Chris Murphy
  Cc: Nikolay Borisov, Jens Axboe, Jan Kara, Paolo Valente,
	Btrfs BTRFS, Linux-RAID, linux-block, linux-kernel, Josef Bacik

On Thu, Aug 18, 2022 at 9:50 PM Chris Murphy <lists@colorremedies.com> wrote:
>
>
>
> On Thu, Aug 18, 2022, at 1:24 AM, Ming Lei wrote:
>
> >
> > Also please test the following one too:
> >
> >
> > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > index 5ee62b95f3e5..d01c64be08e2 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -1991,7 +1991,8 @@ bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx
> > *hctx, struct list_head *list,
> >               if (!needs_restart ||
> >                   (no_tag && list_empty_careful(&hctx->dispatch_wait.entry)))
> >                       blk_mq_run_hw_queue(hctx, true);
> > -             else if (needs_restart && needs_resource)
> > +             else if (needs_restart && (needs_resource ||
> > +                                     blk_mq_is_shared_tags(hctx->flags)))
> >                       blk_mq_delay_run_hw_queue(hctx, BLK_MQ_RESOURCE_DELAY);
> >
> >               blk_mq_update_dispatch_busy(hctx, true);
> >
>
> Should I test both patches at the same time, or separately? On top of v5.17 clean, or with b6e68ee82585 still reverted?

Please test it separately against v5.17.

thanks,


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: stalling IO regression since linux 5.12, through 5.18
  2022-08-18  5:15                                               ` Ming Lei
@ 2022-08-18 18:52                                                 ` Chris Murphy
  0 siblings, 0 replies; 58+ messages in thread
From: Chris Murphy @ 2022-08-18 18:52 UTC (permalink / raw)
  To: Ming Lei
  Cc: Nikolay Borisov, Jens Axboe, Jan Kara, Paolo Valente,
	Btrfs BTRFS, Linux-RAID, linux-block, linux-kernel, Josef Bacik



On Thu, Aug 18, 2022, at 1:15 AM, Ming Lei wrote:
> On Thu, Aug 18, 2022 at 12:27:04AM -0400, Chris Murphy wrote:
>> 
>> 
>> On Thu, Aug 18, 2022, at 12:18 AM, Chris Murphy wrote:
>> > On Thu, Aug 18, 2022, at 12:12 AM, Chris Murphy wrote:
>> >> On Wed, Aug 17, 2022, at 11:41 PM, Ming Lei wrote:
>> >>
>> >>> OK, can you post the blk-mq debugfs log after you trigger it on v5.17?
>> 
>> Same boot, 3rd log. But the load is above 300 so I kinda need to sysrq+b soon.
>> 
>> https://drive.google.com/file/d/1375H558kqPTdng439rvG6LuXXWPXLToo/view?usp=sharing
>> 
>
> Please test the following patch and see if it makes a difference:
>
> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> index a4f7c101b53b..8e8d77e79dd6 100644
> --- a/block/blk-mq-sched.c
> +++ b/block/blk-mq-sched.c
> @@ -44,7 +44,10 @@ void __blk_mq_sched_restart(struct blk_mq_hw_ctx *hctx)
>  	 */
>  	smp_mb();
> 
> -	blk_mq_run_hw_queue(hctx, true);
> +	if (blk_mq_is_shared_tags(hctx->flags))
> +		blk_mq_run_hw_queues(hctx->queue, true);
> +	else
> +		blk_mq_run_hw_queue(hctx, true);
>  }
> 
>  static int sched_rq_cmp(void *priv, const struct list_head *a,


I still get a stall. By the time I noticed it, I couldn't run any new commands (they just hang), so I had to sysrq+b. Let me know if I should rerun the test in order to capture the block debugfs log.


-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: stalling IO regression since linux 5.12, through 5.18
  2022-08-18  5:24                                               ` Ming Lei
  2022-08-18 13:50                                                 ` Chris Murphy
@ 2022-08-19 19:20                                                 ` Chris Murphy
  2022-08-20  7:00                                                   ` Ming Lei
  1 sibling, 1 reply; 58+ messages in thread
From: Chris Murphy @ 2022-08-19 19:20 UTC (permalink / raw)
  To: Ming Lei
  Cc: Nikolay Borisov, Jens Axboe, Jan Kara, Paolo Valente,
	Btrfs BTRFS, Linux-RAID, linux-block, linux-kernel, Josef Bacik



On Thu, Aug 18, 2022, at 1:24 AM, Ming Lei wrote:
> On Thu, Aug 18, 2022 at 12:27:04AM -0400, Chris Murphy wrote:
>> 
>> 
>> On Thu, Aug 18, 2022, at 12:18 AM, Chris Murphy wrote:
>> > On Thu, Aug 18, 2022, at 12:12 AM, Chris Murphy wrote:
>> >> On Wed, Aug 17, 2022, at 11:41 PM, Ming Lei wrote:
>> >>
>> >>> OK, can you post the blk-mq debugfs log after you trigger it on v5.17?
>> 
>> Same boot, 3rd log. But the load is above 300 so I kinda need to sysrq+b soon.
>> 
>> https://drive.google.com/file/d/1375H558kqPTdng439rvG6LuXXWPXLToo/view?usp=sharing
>> 
>
> Also please test the following one too:
>
>
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 5ee62b95f3e5..d01c64be08e2 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1991,7 +1991,8 @@ bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx 
> *hctx, struct list_head *list,
>  		if (!needs_restart ||
>  		    (no_tag && list_empty_careful(&hctx->dispatch_wait.entry)))
>  			blk_mq_run_hw_queue(hctx, true);
> -		else if (needs_restart && needs_resource)
> +		else if (needs_restart && (needs_resource ||
> +					blk_mq_is_shared_tags(hctx->flags)))
>  			blk_mq_delay_run_hw_queue(hctx, BLK_MQ_RESOURCE_DELAY);
> 
>  		blk_mq_update_dispatch_busy(hctx, true);
>


With just this patch on top of 5.17.0, it still hangs. I've captured block debugfs log:
https://drive.google.com/file/d/1ic4YHxoL9RrCdy_5FNdGfh_q_J3d_Ft0/view?usp=sharing



-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: stalling IO regression since linux 5.12, through 5.18
  2022-08-19 19:20                                                 ` Chris Murphy
@ 2022-08-20  7:00                                                   ` Ming Lei
  2022-09-01  7:02                                                     ` Yu Kuai
  0 siblings, 1 reply; 58+ messages in thread
From: Ming Lei @ 2022-08-20  7:00 UTC (permalink / raw)
  To: Chris Murphy
  Cc: Nikolay Borisov, Jens Axboe, Jan Kara, Paolo Valente,
	Btrfs BTRFS, Linux-RAID, linux-block, linux-kernel, Josef Bacik

On Fri, Aug 19, 2022 at 03:20:25PM -0400, Chris Murphy wrote:
> 
> 
> On Thu, Aug 18, 2022, at 1:24 AM, Ming Lei wrote:
> > On Thu, Aug 18, 2022 at 12:27:04AM -0400, Chris Murphy wrote:
> >> 
> >> 
> >> On Thu, Aug 18, 2022, at 12:18 AM, Chris Murphy wrote:
> >> > On Thu, Aug 18, 2022, at 12:12 AM, Chris Murphy wrote:
> >> >> On Wed, Aug 17, 2022, at 11:41 PM, Ming Lei wrote:
> >> >>
> >> >>> OK, can you post the blk-mq debugfs log after you trigger it on v5.17?
> >> 
> >> Same boot, 3rd log. But the load is above 300 so I kinda need to sysrq+b soon.
> >> 
> >> https://drive.google.com/file/d/1375H558kqPTdng439rvG6LuXXWPXLToo/view?usp=sharing
> >> 
> >
> > Also please test the following one too:
> >
> >
> > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > index 5ee62b95f3e5..d01c64be08e2 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -1991,7 +1991,8 @@ bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx 
> > *hctx, struct list_head *list,
> >  		if (!needs_restart ||
> >  		    (no_tag && list_empty_careful(&hctx->dispatch_wait.entry)))
> >  			blk_mq_run_hw_queue(hctx, true);
> > -		else if (needs_restart && needs_resource)
> > +		else if (needs_restart && (needs_resource ||
> > +					blk_mq_is_shared_tags(hctx->flags)))
> >  			blk_mq_delay_run_hw_queue(hctx, BLK_MQ_RESOURCE_DELAY);
> > 
> >  		blk_mq_update_dispatch_busy(hctx, true);
> >
> 
> 
> With just this patch on top of 5.17.0, it still hangs. I've captured block debugfs log:
> https://drive.google.com/file/d/1ic4YHxoL9RrCdy_5FNdGfh_q_J3d_Ft0/view?usp=sharing

The log is similar to before; the only difference is that RESTART is not set.

Another patch that was merged into v5.18 also fixes an io stall; feel free to test it too:

8f5fea65b06d blk-mq: avoid extending delays of active hctx from blk_mq_delay_run_hw_queues



Thanks, 
Ming


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: stalling IO regression since linux 5.12, through 5.18
  2022-08-20  7:00                                                   ` Ming Lei
@ 2022-09-01  7:02                                                     ` Yu Kuai
  2022-09-01  8:03                                                       ` Jan Kara
                                                                         ` (2 more replies)
  0 siblings, 3 replies; 58+ messages in thread
From: Yu Kuai @ 2022-09-01  7:02 UTC (permalink / raw)
  To: Ming Lei, Chris Murphy, Jan Kara
  Cc: Nikolay Borisov, Jens Axboe, Paolo Valente, Btrfs BTRFS,
	Linux-RAID, linux-block, linux-kernel, Josef Bacik, yukuai (C)

Hi, Chris

在 2022/08/20 15:00, Ming Lei 写道:
> On Fri, Aug 19, 2022 at 03:20:25PM -0400, Chris Murphy wrote:
>>
>>
>> On Thu, Aug 18, 2022, at 1:24 AM, Ming Lei wrote:
>>> On Thu, Aug 18, 2022 at 12:27:04AM -0400, Chris Murphy wrote:
>>>>
>>>>
>>>> On Thu, Aug 18, 2022, at 12:18 AM, Chris Murphy wrote:
>>>>> On Thu, Aug 18, 2022, at 12:12 AM, Chris Murphy wrote:
>>>>>> On Wed, Aug 17, 2022, at 11:41 PM, Ming Lei wrote:
>>>>>>
>>>>>>> OK, can you post the blk-mq debugfs log after you trigger it on v5.17?
>>>>
>>>> Same boot, 3rd log. But the load is above 300 so I kinda need to sysrq+b soon.
>>>>
>>>> https://drive.google.com/file/d/1375H558kqPTdng439rvG6LuXXWPXLToo/view?usp=sharing
>>>>
>>>
>>> Also please test the following one too:
>>>
>>>
>>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>>> index 5ee62b95f3e5..d01c64be08e2 100644
>>> --- a/block/blk-mq.c
>>> +++ b/block/blk-mq.c
>>> @@ -1991,7 +1991,8 @@ bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx
>>> *hctx, struct list_head *list,
>>>   		if (!needs_restart ||
>>>   		    (no_tag && list_empty_careful(&hctx->dispatch_wait.entry)))
>>>   			blk_mq_run_hw_queue(hctx, true);
>>> -		else if (needs_restart && needs_resource)
>>> +		else if (needs_restart && (needs_resource ||
>>> +					blk_mq_is_shared_tags(hctx->flags)))
>>>   			blk_mq_delay_run_hw_queue(hctx, BLK_MQ_RESOURCE_DELAY);
>>>
>>>   		blk_mq_update_dispatch_busy(hctx, true);
>>>
>>
>>
>> With just this patch on top of 5.17.0, it still hangs. I've captured block debugfs log:
>> https://drive.google.com/file/d/1ic4YHxoL9RrCdy_5FNdGfh_q_J3d_Ft0/view?usp=sharing
> 
> The log is similar with before, and the only difference is RESTART not
> set.
> 
> Also follows another patch merged to v5.18 and it fixes io stall too, feel free to test it:
> 
> 8f5fea65b06d blk-mq: avoid extending delays of active hctx from blk_mq_delay_run_hw_queues

Have you tried this patch?

We hit a similar problem in our tests, and I'm fairly sure about what
is happening.

Our test environment: NVMe with the BFQ I/O scheduler.

How the IO gets stalled:

1. hctx1 dispatches a request from the in-service bfq queue and the bfqq
becomes empty; the dispatch somehow fails, the request is inserted into
hctx1->dispatch, and a new run work is queued.

2. Another hctx tries to dispatch a request, but the in-service bfqq is
empty, so bfq_dispatch_request() returns NULL and
blk_mq_delay_run_hw_queues() is called.

3. Because of the problem described in the patch above, the run work for
hctx1 can be stalled.
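
For reference, here is a rough sketch of what commit 8f5fea65b06d
(mentioned above) does, reconstructed from its subject line rather than
copied from the upstream diff, so the exact code may differ. The idea is
that blk_mq_delay_run_hw_queues() should skip a hctx whose delayed run
work is already pending instead of re-arming it, since repeatedly pushing
the delay forward can keep the queued run from ever happening:

void blk_mq_delay_run_hw_queues(struct request_queue *q, unsigned long msecs)
{
	struct blk_mq_hw_ctx *hctx;
	unsigned long i;

	queue_for_each_hw_ctx(q, hctx, i) {
		if (blk_mq_hctx_stopped(hctx))
			continue;
		/*
		 * If a run work is already pending for this hctx, leave
		 * its existing delay untouched instead of extending it,
		 * so the already-queued run is not deferred again and
		 * again by later callers.
		 */
		if (delayed_work_pending(&hctx->run_work))
			continue;
		blk_mq_delay_run_hw_queue(hctx, msecs);
	}
}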

That patch should fix this IO stall. However, it seems to me that bfq
has a problem of its own: the in-service bfqq does not expire in the
following situation:

1. the dispatched requests do not complete
2. no new request is issued to bfq

Thanks,
Kuai
> 
> 
> 
> Thanks,
> Ming
> 
> .
> 


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: stalling IO regression since linux 5.12, through 5.18
  2022-09-01  7:02                                                     ` Yu Kuai
@ 2022-09-01  8:03                                                       ` Jan Kara
  2022-09-01  8:19                                                         ` Yu Kuai
  2022-09-02 16:53                                                       ` Chris Murphy
  2022-09-06  9:45                                                       ` Paolo Valente
  2 siblings, 1 reply; 58+ messages in thread
From: Jan Kara @ 2022-09-01  8:03 UTC (permalink / raw)
  To: Yu Kuai
  Cc: Ming Lei, Chris Murphy, Jan Kara, Nikolay Borisov, Jens Axboe,
	Paolo Valente, Btrfs BTRFS, Linux-RAID, linux-block,
	linux-kernel, Josef Bacik, yukuai (C)

On Thu 01-09-22 15:02:03, Yu Kuai wrote:
> Hi, Chris
> 
> On 2022/08/20 15:00, Ming Lei wrote:
> > On Fri, Aug 19, 2022 at 03:20:25PM -0400, Chris Murphy wrote:
> > > 
> > > 
> > > On Thu, Aug 18, 2022, at 1:24 AM, Ming Lei wrote:
> > > > On Thu, Aug 18, 2022 at 12:27:04AM -0400, Chris Murphy wrote:
> > > > > 
> > > > > 
> > > > > On Thu, Aug 18, 2022, at 12:18 AM, Chris Murphy wrote:
> > > > > > On Thu, Aug 18, 2022, at 12:12 AM, Chris Murphy wrote:
> > > > > > > On Wed, Aug 17, 2022, at 11:41 PM, Ming Lei wrote:
> > > > > > > 
> > > > > > > > OK, can you post the blk-mq debugfs log after you trigger it on v5.17?
> > > > > 
> > > > > Same boot, 3rd log. But the load is above 300 so I kinda need to sysrq+b soon.
> > > > > 
> > > > > https://drive.google.com/file/d/1375H558kqPTdng439rvG6LuXXWPXLToo/view?usp=sharing
> > > > > 
> > > > 
> > > > Also please test the following one too:
> > > > 
> > > > 
> > > > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > > > index 5ee62b95f3e5..d01c64be08e2 100644
> > > > --- a/block/blk-mq.c
> > > > +++ b/block/blk-mq.c
> > > > @@ -1991,7 +1991,8 @@ bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx
> > > > *hctx, struct list_head *list,
> > > >   		if (!needs_restart ||
> > > >   		    (no_tag && list_empty_careful(&hctx->dispatch_wait.entry)))
> > > >   			blk_mq_run_hw_queue(hctx, true);
> > > > -		else if (needs_restart && needs_resource)
> > > > +		else if (needs_restart && (needs_resource ||
> > > > +					blk_mq_is_shared_tags(hctx->flags)))
> > > >   			blk_mq_delay_run_hw_queue(hctx, BLK_MQ_RESOURCE_DELAY);
> > > > 
> > > >   		blk_mq_update_dispatch_busy(hctx, true);
> > > > 
> > > 
> > > 
> > > With just this patch on top of 5.17.0, it still hangs. I've captured block debugfs log:
> > > https://drive.google.com/file/d/1ic4YHxoL9RrCdy_5FNdGfh_q_J3d_Ft0/view?usp=sharing
> > 
> > The log is similar with before, and the only difference is RESTART not
> > set.
> > 
> > Also follows another patch merged to v5.18 and it fixes io stall too, feel free to test it:
> > 
> > 8f5fea65b06d blk-mq: avoid extending delays of active hctx from blk_mq_delay_run_hw_queues
> 
> Have you tried this patch?
> 
> We meet a similar problem in our test, and I'm pretty sure about the
> situation at the scene,
> 
> Our test environment:nvme with bfq ioscheduler,
> 
> How io is stalled:
> 
> 1. hctx1 dispatch rq from bfq in service queue, bfqq becomes empty,
> dispatch somehow fails and rq is inserted to hctx1->dispatch, new run
> work is queued.
> 
> 2. other hctx tries to dispatch rq, however, in service bfqq is
> empty, bfq_dispatch_request return NULL, thus
> blk_mq_delay_run_hw_queues is called.
> 
> 3. for the problem described in above patch,run work from "hctx1"
> can be stalled.
> 
> Above patch should fix this io stall, however, it seems to me bfq do
> have some problems that in service bfqq doesn't expire under following
> situation:
> 
> 1. dispatched rqs don't complete
> 2. no new rq is issued to bfq

And I guess:
3. there are requests queued in other bfqqs
?

Otherwise I don't see a point in expiring the current bfqq, because
there's nothing bfq could do anyway. But under normal circumstances
request completion should not take that long, so I don't think it would
really be worth implementing a special mechanism for this in bfq.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: stalling IO regression since linux 5.12, through 5.18
  2022-09-01  8:03                                                       ` Jan Kara
@ 2022-09-01  8:19                                                         ` Yu Kuai
  2022-09-06  9:49                                                           ` Paolo Valente
  0 siblings, 1 reply; 58+ messages in thread
From: Yu Kuai @ 2022-09-01  8:19 UTC (permalink / raw)
  To: Jan Kara, Yu Kuai
  Cc: Ming Lei, Chris Murphy, Nikolay Borisov, Jens Axboe,
	Paolo Valente, Btrfs BTRFS, Linux-RAID, linux-block,
	linux-kernel, Josef Bacik, yukuai (C)

On 2022/09/01 16:03, Jan Kara wrote:
> On Thu 01-09-22 15:02:03, Yu Kuai wrote:
>> Hi, Chris
>>
>> On 2022/08/20 15:00, Ming Lei wrote:
>>> On Fri, Aug 19, 2022 at 03:20:25PM -0400, Chris Murphy wrote:
>>>>
>>>>
>>>> On Thu, Aug 18, 2022, at 1:24 AM, Ming Lei wrote:
>>>>> On Thu, Aug 18, 2022 at 12:27:04AM -0400, Chris Murphy wrote:
>>>>>>
>>>>>>
>>>>>> On Thu, Aug 18, 2022, at 12:18 AM, Chris Murphy wrote:
>>>>>>> On Thu, Aug 18, 2022, at 12:12 AM, Chris Murphy wrote:
>>>>>>>> On Wed, Aug 17, 2022, at 11:41 PM, Ming Lei wrote:
>>>>>>>>
>>>>>>>>> OK, can you post the blk-mq debugfs log after you trigger it on v5.17?
>>>>>>
>>>>>> Same boot, 3rd log. But the load is above 300 so I kinda need to sysrq+b soon.
>>>>>>
>>>>>> https://drive.google.com/file/d/1375H558kqPTdng439rvG6LuXXWPXLToo/view?usp=sharing
>>>>>>
>>>>>
>>>>> Also please test the following one too:
>>>>>
>>>>>
>>>>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>>>>> index 5ee62b95f3e5..d01c64be08e2 100644
>>>>> --- a/block/blk-mq.c
>>>>> +++ b/block/blk-mq.c
>>>>> @@ -1991,7 +1991,8 @@ bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx
>>>>> *hctx, struct list_head *list,
>>>>>    		if (!needs_restart ||
>>>>>    		    (no_tag && list_empty_careful(&hctx->dispatch_wait.entry)))
>>>>>    			blk_mq_run_hw_queue(hctx, true);
>>>>> -		else if (needs_restart && needs_resource)
>>>>> +		else if (needs_restart && (needs_resource ||
>>>>> +					blk_mq_is_shared_tags(hctx->flags)))
>>>>>    			blk_mq_delay_run_hw_queue(hctx, BLK_MQ_RESOURCE_DELAY);
>>>>>
>>>>>    		blk_mq_update_dispatch_busy(hctx, true);
>>>>>
>>>>
>>>>
>>>> With just this patch on top of 5.17.0, it still hangs. I've captured block debugfs log:
>>>> https://drive.google.com/file/d/1ic4YHxoL9RrCdy_5FNdGfh_q_J3d_Ft0/view?usp=sharing
>>>
>>> The log is similar with before, and the only difference is RESTART not
>>> set.
>>>
>>> Also follows another patch merged to v5.18 and it fixes io stall too, feel free to test it:
>>>
>>> 8f5fea65b06d blk-mq: avoid extending delays of active hctx from blk_mq_delay_run_hw_queues
>>
>> Have you tried this patch?
>>
>> We meet a similar problem in our test, and I'm pretty sure about the
>> situation at the scene,
>>
>> Our test environment:nvme with bfq ioscheduler,
>>
>> How io is stalled:
>>
>> 1. hctx1 dispatch rq from bfq in service queue, bfqq becomes empty,
>> dispatch somehow fails and rq is inserted to hctx1->dispatch, new run
>> work is queued.
>>
>> 2. other hctx tries to dispatch rq, however, in service bfqq is
>> empty, bfq_dispatch_request return NULL, thus
>> blk_mq_delay_run_hw_queues is called.
>>
>> 3. for the problem described in above patch,run work from "hctx1"
>> can be stalled.
>>
>> Above patch should fix this io stall, however, it seems to me bfq do
>> have some problems that in service bfqq doesn't expire under following
>> situation:
>>
>> 1. dispatched rqs don't complete
>> 2. no new rq is issued to bfq
> 
> And I guess:
> 3. there are requests queued in other bfqqs
> ?

Yes, of course, other bfqqs still have requests. But the current
implementation has a flaw: even if other bfqqs have no requests,
bfq_asymmetric_scenario() can still return true because
num_groups_with_pending_reqs > 0. We tried to fix this, but there seems
to be some misunderstanding with Paolo, and the fix has not been applied
to mainline yet...
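
For context, here is a paraphrased sketch of the check in question in
bfq_asymmetric_scenario() (block/bfq-iosched.c). It is reconstructed
from memory rather than copied, so the exact upstream code may differ,
and the weight and I/O-class computations are elided:

static bool bfq_asymmetric_scenario(struct bfq_data *bfqd,
				    struct bfq_queue *bfqq)
{
	bool varied_queue_weights, multiple_classes_busy;

	/* ... weight and I/O-class computations elided ... */

	/*
	 * With group scheduling enabled, a single group with pending
	 * requests is enough to report the scenario as asymmetric,
	 * even if no other bfqq currently has queued requests; this is
	 * the flaw described above.
	 */
	return varied_queue_weights || multiple_classes_busy
#ifdef CONFIG_BFQ_GROUP_IOSCHED
	       || bfqd->num_groups_with_pending_reqs > 0
#endif
		;
}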

Thanks,
Kuai
> 
> Otherwise I don't see a point in expiring current bfqq because there's
> nothing bfq could do anyway. But under normal circumstances the request
> completion should not take so long so I don't think it would be really
> worth it to implement some special mechanism for this in bfq.
> 
> 								Honza
> 


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: stalling IO regression since linux 5.12, through 5.18
  2022-09-01  7:02                                                     ` Yu Kuai
  2022-09-01  8:03                                                       ` Jan Kara
@ 2022-09-02 16:53                                                       ` Chris Murphy
  2022-09-06  9:45                                                       ` Paolo Valente
  2 siblings, 0 replies; 58+ messages in thread
From: Chris Murphy @ 2022-09-02 16:53 UTC (permalink / raw)
  To: Yu Kuai, Ming Lei, Jan Kara
  Cc: Nikolay Borisov, Jens Axboe, Paolo Valente, Btrfs BTRFS,
	Linux-RAID, linux-block, linux-kernel, Josef Bacik, yukuai (C)



On Thu, Sep 1, 2022, at 3:02 AM, Yu Kuai wrote:
> Hi, Chris


>> Also follows another patch merged to v5.18 and it fixes io stall too, feel free to test it:
>> 
>> 8f5fea65b06d blk-mq: avoid extending delays of active hctx from blk_mq_delay_run_hw_queues
>
> Have you tried this patch?

The problem still happens on 5.18 series kernels, but it takes longer to show up. Once I regain access to this setup, I can try to reproduce it on 5.18 and 5.19 and provide block debugfs logs.


-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: stalling IO regression since linux 5.12, through 5.18
  2022-09-01  7:02                                                     ` Yu Kuai
  2022-09-01  8:03                                                       ` Jan Kara
  2022-09-02 16:53                                                       ` Chris Murphy
@ 2022-09-06  9:45                                                       ` Paolo Valente
  2 siblings, 0 replies; 58+ messages in thread
From: Paolo Valente @ 2022-09-06  9:45 UTC (permalink / raw)
  To: Yu Kuai
  Cc: Ming Lei, Chris Murphy, Jan Kara, Nikolay Borisov, Jens Axboe,
	Btrfs BTRFS, Linux-RAID, linux-block, linux-kernel, Josef Bacik,
	yukuai (C)



> On 1 Sep 2022, at 09:02, Yu Kuai <yukuai1@huaweicloud.com> wrote:
> 
> Hi, Chris
> 
> On 2022/08/20 15:00, Ming Lei wrote:
>> On Fri, Aug 19, 2022 at 03:20:25PM -0400, Chris Murphy wrote:
>>> 
>>> 
>>> On Thu, Aug 18, 2022, at 1:24 AM, Ming Lei wrote:
>>>> On Thu, Aug 18, 2022 at 12:27:04AM -0400, Chris Murphy wrote:
>>>>> 
>>>>> 
>>>>> On Thu, Aug 18, 2022, at 12:18 AM, Chris Murphy wrote:
>>>>>> On Thu, Aug 18, 2022, at 12:12 AM, Chris Murphy wrote:
>>>>>>> On Wed, Aug 17, 2022, at 11:41 PM, Ming Lei wrote:
>>>>>>> 
>>>>>>>> OK, can you post the blk-mq debugfs log after you trigger it on v5.17?
>>>>> 
>>>>> Same boot, 3rd log. But the load is above 300 so I kinda need to sysrq+b soon.
>>>>> 
>>>>> https://drive.google.com/file/d/1375H558kqPTdng439rvG6LuXXWPXLToo/view?usp=sharing
>>>>> 
>>>> 
>>>> Also please test the following one too:
>>>> 
>>>> 
>>>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>>>> index 5ee62b95f3e5..d01c64be08e2 100644
>>>> --- a/block/blk-mq.c
>>>> +++ b/block/blk-mq.c
>>>> @@ -1991,7 +1991,8 @@ bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx
>>>> *hctx, struct list_head *list,
>>>>  		if (!needs_restart ||
>>>>  		    (no_tag && list_empty_careful(&hctx->dispatch_wait.entry)))
>>>>  			blk_mq_run_hw_queue(hctx, true);
>>>> -		else if (needs_restart && needs_resource)
>>>> +		else if (needs_restart && (needs_resource ||
>>>> +					blk_mq_is_shared_tags(hctx->flags)))
>>>>  			blk_mq_delay_run_hw_queue(hctx, BLK_MQ_RESOURCE_DELAY);
>>>> 
>>>>  		blk_mq_update_dispatch_busy(hctx, true);
>>>> 
>>> 
>>> 
>>> With just this patch on top of 5.17.0, it still hangs. I've captured block debugfs log:
>>> https://drive.google.com/file/d/1ic4YHxoL9RrCdy_5FNdGfh_q_J3d_Ft0/view?usp=sharing
>> The log is similar with before, and the only difference is RESTART not
>> set.
>> Also follows another patch merged to v5.18 and it fixes io stall too, feel free to test it:
>> 8f5fea65b06d blk-mq: avoid extending delays of active hctx from blk_mq_delay_run_hw_queues
> 
> Have you tried this patch?
> 
> We meet a similar problem in our test, and I'm pretty sure about the
> situation at the scene,
> 
> Our test environment:nvme with bfq ioscheduler,
> 
> How io is stalled:
> 
> 1. hctx1 dispatch rq from bfq in service queue, bfqq becomes empty,
> dispatch somehow fails and rq is inserted to hctx1->dispatch, new run
> work is queued.
> 
> 2. other hctx tries to dispatch rq, however, in service bfqq is
> empty, bfq_dispatch_request return NULL, thus
> blk_mq_delay_run_hw_queues is called.
> 
> 3. for the problem described in above patch,run work from "hctx1"
> can be stalled.
> 
> Above patch should fix this io stall, however, it seems to me bfq do
> have some problems that in service bfqq doesn't expire under following
> situation:
> 
> 1. dispatched rqs don't complete
> 2. no new rq is issued to bfq
> 

There may be one more important problem: is bfq_finish_requeue_request
eventually invoked for the failed rq? If it is not, then a memory
leak follows, because refcounting gets unavoidably unbalanced.

In contrast, if bfq_finish_requeue_request is correctly invoked, then
no stall should occur.
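
For reference, a paraphrased sketch of how bfq wires up those hooks in
its elevator ops (not the exact upstream code): both the requeue and
the finish callbacks point at bfq_finish_requeue_request, so a request
that is completed or requeued gives bfq a chance to drop the references
taken at insertion time, while a request that vanishes without either
callback leaves the accounting unbalanced:

static struct elevator_type iosched_bfq_mq = {
	.ops = {
		/* other callbacks elided */
		.requeue_request	= bfq_finish_requeue_request,
		.finish_request		= bfq_finish_requeue_request,
	},
	/* other fields (name, attrs, ...) elided */
};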

Thanks,
Paolo

> Thanks,
> Kuai
>> Thanks,
>> Ming
>> .


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: stalling IO regression since linux 5.12, through 5.18
  2022-09-01  8:19                                                         ` Yu Kuai
@ 2022-09-06  9:49                                                           ` Paolo Valente
  0 siblings, 0 replies; 58+ messages in thread
From: Paolo Valente @ 2022-09-06  9:49 UTC (permalink / raw)
  To: Yu Kuai
  Cc: Jan Kara, Ming Lei, Chris Murphy, Nikolay Borisov, Jens Axboe,
	Btrfs BTRFS, Linux-RAID, linux-block, linux-kernel, Josef Bacik,
	yukuai (C)



> On 1 Sep 2022, at 10:19, Yu Kuai <yukuai1@huaweicloud.com> wrote:
> 
> On 2022/09/01 16:03, Jan Kara wrote:
>> On Thu 01-09-22 15:02:03, Yu Kuai wrote:
>>> Hi, Chris
>>> 
>>> On 2022/08/20 15:00, Ming Lei wrote:
>>>> On Fri, Aug 19, 2022 at 03:20:25PM -0400, Chris Murphy wrote:
>>>>> 
>>>>> 
>>>>> On Thu, Aug 18, 2022, at 1:24 AM, Ming Lei wrote:
>>>>>> On Thu, Aug 18, 2022 at 12:27:04AM -0400, Chris Murphy wrote:
>>>>>>> 
>>>>>>> 
>>>>>>> On Thu, Aug 18, 2022, at 12:18 AM, Chris Murphy wrote:
>>>>>>>> On Thu, Aug 18, 2022, at 12:12 AM, Chris Murphy wrote:
>>>>>>>>> On Wed, Aug 17, 2022, at 11:41 PM, Ming Lei wrote:
>>>>>>>>> 
>>>>>>>>>> OK, can you post the blk-mq debugfs log after you trigger it on v5.17?
>>>>>>> 
>>>>>>> Same boot, 3rd log. But the load is above 300 so I kinda need to sysrq+b soon.
>>>>>>> 
>>>>>>> https://drive.google.com/file/d/1375H558kqPTdng439rvG6LuXXWPXLToo/view?usp=sharing
>>>>>>> 
>>>>>> 
>>>>>> Also please test the following one too:
>>>>>> 
>>>>>> 
>>>>>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>>>>>> index 5ee62b95f3e5..d01c64be08e2 100644
>>>>>> --- a/block/blk-mq.c
>>>>>> +++ b/block/blk-mq.c
>>>>>> @@ -1991,7 +1991,8 @@ bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx
>>>>>> *hctx, struct list_head *list,
>>>>>>   		if (!needs_restart ||
>>>>>>   		    (no_tag && list_empty_careful(&hctx->dispatch_wait.entry)))
>>>>>>   			blk_mq_run_hw_queue(hctx, true);
>>>>>> -		else if (needs_restart && needs_resource)
>>>>>> +		else if (needs_restart && (needs_resource ||
>>>>>> +					blk_mq_is_shared_tags(hctx->flags)))
>>>>>>   			blk_mq_delay_run_hw_queue(hctx, BLK_MQ_RESOURCE_DELAY);
>>>>>> 
>>>>>>   		blk_mq_update_dispatch_busy(hctx, true);
>>>>>> 
>>>>> 
>>>>> 
>>>>> With just this patch on top of 5.17.0, it still hangs. I've captured block debugfs log:
>>>>> https://drive.google.com/file/d/1ic4YHxoL9RrCdy_5FNdGfh_q_J3d_Ft0/view?usp=sharing
>>>> 
>>>> The log is similar with before, and the only difference is RESTART not
>>>> set.
>>>> 
>>>> Also follows another patch merged to v5.18 and it fixes io stall too, feel free to test it:
>>>> 
>>>> 8f5fea65b06d blk-mq: avoid extending delays of active hctx from blk_mq_delay_run_hw_queues
>>> 
>>> Have you tried this patch?
>>> 
>>> We meet a similar problem in our test, and I'm pretty sure about the
>>> situation at the scene,
>>> 
>>> Our test environment:nvme with bfq ioscheduler,
>>> 
>>> How io is stalled:
>>> 
>>> 1. hctx1 dispatch rq from bfq in service queue, bfqq becomes empty,
>>> dispatch somehow fails and rq is inserted to hctx1->dispatch, new run
>>> work is queued.
>>> 
>>> 2. other hctx tries to dispatch rq, however, in service bfqq is
>>> empty, bfq_dispatch_request return NULL, thus
>>> blk_mq_delay_run_hw_queues is called.
>>> 
>>> 3. for the problem described in above patch,run work from "hctx1"
>>> can be stalled.
>>> 
>>> Above patch should fix this io stall, however, it seems to me bfq do
>>> have some problems that in service bfqq doesn't expire under following
>>> situation:
>>> 
>>> 1. dispatched rqs don't complete
>>> 2. no new rq is issued to bfq
>> And I guess:
>> 3. there are requests queued in other bfqqs
>> ?
> 
> Yes, of course, other bfqqs still have requests, but current
> implementation have flaws that even if other bfqqs doesn't have
> requests, bfq_asymmetric_scenario() can still return true because
> num_groups_with_pending_reqs > 0. We tried to fix this, however, there
> seems to be some misunderstanding with Paolo, and it's not applied to
> mainline yet...
> 

I think this is an unsolved performance issue (which Yu Kuai is
patiently working on), but not a functional flaw. Solving that issue
would probably remove this stall, but not the essential problem:
refcounting gets broken if requests disappear from bfq's point of view
without any notification.

Thanks,
Paolo

> Thanks,
> Kuai
>> Otherwise I don't see a point in expiring current bfqq because there's
>> nothing bfq could do anyway. But under normal circumstances the request
>> completion should not take so long so I don't think it would be really
>> worth it to implement some special mechanism for this in bfq.
>> 								Honza


^ permalink raw reply	[flat|nested] 58+ messages in thread

end of thread, other threads:[~2022-09-06  9:50 UTC | newest]

Thread overview: 58+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-08-10 16:35 stalling IO regression in linux 5.12 Chris Murphy
2022-08-10 17:48 ` Josef Bacik
2022-08-10 18:33   ` Chris Murphy
2022-08-10 18:42     ` Chris Murphy
2022-08-10 19:31       ` Josef Bacik
2022-08-10 19:34       ` Chris Murphy
2022-08-12 16:05         ` stalling IO regression since linux 5.12, through 5.18 Chris Murphy
2022-08-12 17:59           ` Josef Bacik
2022-08-12 18:02             ` Jens Axboe
2022-08-14 20:28               ` Chris Murphy
2022-08-16 14:22                 ` Chris Murphy
2022-08-16 15:25                   ` Nikolay Borisov
2022-08-16 15:34                     ` Chris Murphy
2022-08-17  9:52                       ` Holger Hoffstätte
2022-08-17 11:49                         ` Jan Kara
2022-08-17 14:37                           ` Chris Murphy
2022-08-17 15:09                           ` Chris Murphy
2022-08-17 16:30                             ` Jan Kara
2022-08-17 16:47                               ` Chris Murphy
2022-08-17 17:57                                 ` Chris Murphy
2022-08-17 18:15                                   ` Jan Kara
2022-08-17 18:18                                     ` Chris Murphy
2022-08-17 18:33                                       ` Jan Kara
2022-08-17 18:54                                         ` Chris Murphy
2022-08-17 19:23                                           ` Chris Murphy
2022-08-18  2:31                                           ` Chris Murphy
2022-08-17 18:21                                     ` Holger Hoffstätte
2022-08-17 11:57                         ` Chris Murphy
2022-08-17 12:31                           ` Holger Hoffstätte
2022-08-17 18:16                         ` Chris Murphy
2022-08-17 18:38                           ` Holger Hoffstätte
2022-08-17 12:06                       ` Ming Lei
2022-08-17 14:34                         ` Chris Murphy
2022-08-17 14:53                           ` Ming Lei
2022-08-17 15:02                             ` Chris Murphy
2022-08-17 15:34                               ` Ming Lei
2022-08-17 16:34                                 ` Chris Murphy
2022-08-18  1:03                                   ` Ming Lei
2022-08-18  2:30                                     ` Chris Murphy
2022-08-18  3:24                                       ` Ming Lei
2022-08-18  4:12                                         ` Chris Murphy
2022-08-18  4:18                                           ` Chris Murphy
2022-08-18  4:27                                             ` Chris Murphy
2022-08-18  4:32                                               ` Chris Murphy
2022-08-18  5:15                                               ` Ming Lei
2022-08-18 18:52                                                 ` Chris Murphy
2022-08-18  5:24                                               ` Ming Lei
2022-08-18 13:50                                                 ` Chris Murphy
2022-08-18 15:10                                                   ` Ming Lei
2022-08-19 19:20                                                 ` Chris Murphy
2022-08-20  7:00                                                   ` Ming Lei
2022-09-01  7:02                                                     ` Yu Kuai
2022-09-01  8:03                                                       ` Jan Kara
2022-09-01  8:19                                                         ` Yu Kuai
2022-09-06  9:49                                                           ` Paolo Valente
2022-09-02 16:53                                                       ` Chris Murphy
2022-09-06  9:45                                                       ` Paolo Valente
2022-08-15 11:25 ` stalling IO regression in linux 5.12 Thorsten Leemhuis

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).