linux-block.vger.kernel.org archive mirror
From: Yi Zhang <yi.zhang@redhat.com>
To: Paolo Valente <paolo.valente@linaro.org>
Cc: linux-block <linux-block@vger.kernel.org>
Subject: Re: [bisected] bfq regression on latest linux-block/for-next
Date: Wed, 7 Apr 2021 23:15:05 +0800	[thread overview]
Message-ID: <c7d23258-0890-f79f-cc6a-9cb24bbaa437@redhat.com> (raw)
In-Reply-To: <4F41414B-05F8-4E7D-A312-8A47B8468C78@linaro.org>



On 4/2/21 9:39 PM, Paolo Valente wrote:
>
>> Il giorno 1 apr 2021, alle ore 03:27, Yi Zhang <yi.zhang@redhat.com> ha scritto:
>>
>> Hi
> Hi
>
>> We reproduced this bfq regression [3] on ppc64le with blktests [2] on the latest linux-block/for-next branch. My bisect points to commit [1] as the likely cause; please help check it. Let me know if you need any testing for it, thanks.
>>
> Thanks for reporting this bug and finding the candidate offending commit. Could you try this test with my dev kernel, which might provide more information? The kernel is here:
> https://github.com/Algodev-github/bfq-mq
>
> Alternatively, I could try to provide you with patches to instrument your kernel.
Hi Paolo,
I tried your dev kernel, but had no luck reproducing the issue with it. Could
you provide the debug patch based on the latest linux-block/for-next?
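(For context, the bisect mentioned above follows the standard `git bisect run` workflow. The sketch below is a toy illustration in a throwaway repository — the repo, file contents, and pass/fail test are made up; in the real bisect each step means building and booting a kernel and running the blktests reproducer instead of the `test` command.)

```shell
# Toy illustration of the `git bisect run` workflow in a scratch repo.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email you@example.com
git config user.name "Bisect Demo"
# Build a linear history of five commits; pretend commit 3 introduced a bug.
for i in 1 2 3 4 5; do
    echo "$i" > file
    git add file
    git commit -qm "commit $i"
done
git bisect start
git bisect bad HEAD        # newest commit shows the regression
git bisect good HEAD~4     # oldest commit is known good
# The run script exits 0 for "good", non-zero for "bad";
# here commits with file contents >= 3 are "bad".
git bisect run sh -c 'test "$(cat file)" -lt 3' > /dev/null
first_bad=$(git bisect log | tail -n 1)
echo "$first_bad"
```

`git bisect run` automates the good/bad marking; the last line of `git bisect log` records the first bad commit it converged on.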

> The first option may be quicker.
>
> In both cases, having KASAN active could be rather helpful too.
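(For reference, generic KASAN can be enabled with a Kconfig fragment along these lines; option names are from the upstream Kconfig, and exact availability depends on the architecture and kernel version:)

```
CONFIG_KASAN=y
CONFIG_KASAN_GENERIC=y
# Inline instrumentation is faster; outline produces a smaller kernel.
CONFIG_KASAN_INLINE=y
```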
>
> Looking forward to your feedback,
> Paolo
>
>> [1]
>> commit 430a67f9d6169a7b3e328bceb2ef9542e4153c7c (HEAD, refs/bisect/bad)
>> Author: Paolo Valente <paolo.valente@linaro.org>
>> Date:   Thu Mar 4 18:46:27 2021 +0100
>>
>>     block, bfq: merge bursts of newly-created queues
>>
>> [2] blktests: nvme_trtype=tcp ./check nvme/011
>>
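(For anyone trying to reproduce, the blktests invocation in [2] expands to roughly the following; the clone path is assumed, and the test needs root plus the nvmet, nvme-tcp, and nvmet-tcp modules available:)

```shell
git clone https://github.com/osandov/blktests.git
cd blktests
# Run the nvme/011 test over the TCP transport, as in [2]:
nvme_trtype=tcp ./check nvme/011
```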
>> [3]
>> [  109.342525] run blktests nvme/011 at 2021-03-31 20:58:58
>> [  109.497429] nvmet: adding nsid 1 to subsystem blktests-subsystem-1
>> [  109.512868] nvmet_tcp: enabling port 0 (127.0.0.1:4420)
>> [  109.584932] nvmet: creating controller 1 for subsystem blktests-subsystem-1 for NQN nqn.2014-08.org.nvmexpress:uuid:9b73e8531afa4320963cc96571e0acb1.
>> [  109.585809] nvme nvme0: creating 128 I/O queues.
>> [  109.596570] nvme nvme0: mapped 128/0/0 default/read/poll queues.
>> [  109.654170] nvme nvme0: new ctrl: NQN "blktests-subsystem-1", addr 127.0.0.1:4420
>> [  155.366535] nvmet: ctrl 1 keep-alive timer (15 seconds) expired!
>> [  155.366568] nvmet: ctrl 1 fatal error occurred!
>> [  155.366608] nvme nvme0: starting error recovery
>> [  155.374861] block nvme0n1: no usable path - requeuing I/O
>> [  155.374891] block nvme0n1: no usable path - requeuing I/O
>> [  155.374911] block nvme0n1: no usable path - requeuing I/O
>> [  155.374923] block nvme0n1: no usable path - requeuing I/O
>> [  155.374934] block nvme0n1: no usable path - requeuing I/O
>> [  155.374954] block nvme0n1: no usable path - requeuing I/O
>> [  155.374973] block nvme0n1: no usable path - requeuing I/O
>> [  155.374984] block nvme0n1: no usable path - requeuing I/O
>> [  155.375004] block nvme0n1: no usable path - requeuing I/O
>> [  155.375024] block nvme0n1: no usable path - requeuing I/O
>> [  155.375674] nvme nvme0: Reconnecting in 10 seconds...
>> [  180.967462] nvmet: ctrl 2 keep-alive timer (15 seconds) expired!
>> [  180.967486] nvmet: ctrl 2 fatal error occurred!
>> [  193.427906] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
>> [  193.427934] rcu: 	40-...0: (1 GPs behind) idle=f3e/1/0x4000000000000000 softirq=535/535 fqs=2839
>> [  193.427966] rcu: 	42-...0: (1 GPs behind) idle=792/1/0x4000000000000000 softirq=160/160 fqs=2839
>> [  193.427995] 	(detected by 5, t=6002 jiffies, g=7961, q=7219)
>> [  193.428030] Sending NMI from CPU 5 to CPUs 40:
>> [  199.235530] CPU 40 didn't respond to backtrace IPI, inspecting paca.
>> [  199.235551] irq_soft_mask: 0x01 in_mce: 0 in_nmi: 0 current: 217 (kworker/40:0H)
>> [  199.235579] Back trace of paca->saved_r1 (0xc000000012973380) (possibly stale):
>> [  199.235594] Call Trace:
>> [  199.235601] [c000000012973380] [c0000000129733e0] 0xc0000000129733e0 (unreliable)
>> [  199.235633] [c000000012973420] [c000000000933008] bfq_allow_bio_merge+0x78/0x110
>> [  199.235673] [c000000012973460] [c0000000008d7470] elv_bio_merge_ok+0x90/0xb0
>> [  199.235711] [c000000012973490] [c0000000008d811c] elv_merge+0x6c/0x1b0
>> [  199.235747] [c0000000129734e0] [c0000000008ea1d8] blk_mq_sched_try_merge+0x48/0x250
>> [  199.235785] [c000000012973540] [c00000000092d29c] bfq_bio_merge+0xfc/0x230
>> [  199.235820] [c0000000129735c0] [c0000000008f9e24] __blk_mq_sched_bio_merge+0xa4/0x210
>> [  199.235848] [c000000012973620] [c0000000008f288c] blk_mq_submit_bio+0xec/0x710
>> [  199.235877] [c0000000129736c0] [c0000000008dfad4] submit_bio_noacct+0x534/0x680
>> [  199.235915] [c000000012973760] [c0000000005e0308] iomap_dio_submit_bio+0xd8/0x100
>> [  199.235953] [c000000012973790] [c0000000005e0ed4] iomap_dio_bio_actor+0x3c4/0x570
>> [  199.235991] [c000000012973830] [c0000000005db284] iomap_apply+0x1f4/0x3e0
>> [  199.236017] [c000000012973920] [c0000000005e06e4] __iomap_dio_rw+0x204/0x5b0
>> [  199.236054] [c0000000129739f0] [c0000000005e0ab0] iomap_dio_rw+0x20/0x80
>> [  199.236090] [c000000012973a10] [c00800000e4ea660] xfs_file_dio_write_aligned+0xb8/0x1b0 [xfs]
>> [  199.236230] [c000000012973a60] [c00800000e4eabfc] xfs_file_write_iter+0x104/0x190 [xfs]
>> [  199.236368] [c000000012973a90] [c008000010440724] nvmet_file_submit_bvec+0xfc/0x160 [nvmet]
>> [  199.236412] [c000000012973b10] [c008000010440b54] nvmet_file_execute_io+0x2cc/0x3b0 [nvmet]
>> [  199.236455] [c000000012973b90] [c00800000fa142d8] nvmet_tcp_io_work+0xce0/0xd10 [nvmet_tcp]
>> [  199.236493] [c000000012973c70] [c000000000179534] process_one_work+0x294/0x580
>> [  199.236533] [c000000012973d10] [c0000000001798c8] worker_thread+0xa8/0x650
>> [  199.236568] [c000000012973da0] [c000000000184180] kthread+0x190/0x1a0
>> [  199.236595] [c000000012973e10] [c00000000000d4ec] ret_from_kernel_thread+0x5c/0x70
>> [  199.236710] Sending NMI from CPU 5 to CPUs 42:
>> [  205.044364] CPU 42 didn't respond to backtrace IPI, inspecting paca.
>> [  205.044381] irq_soft_mask: 0x01 in_mce: 0 in_nmi: 0 current: 3004 (kworker/42:2H)
>> [  205.044408] Back trace of paca->saved_r1 (0xc00000085194b520) (possibly stale):
>> [  205.044423] Call Trace:
>> [  205.044440] [c00000085194b520] [c00000079ac7f700] 0xc00000079ac7f700 (unreliable)
>> [  205.044468] [c00000085194b540] [c00000000092d248] bfq_bio_merge+0xa8/0x230
>> [  205.044494] [c00000085194b5c0] [c0000000008f9e24] __blk_mq_sched_bio_merge+0xa4/0x210
>> [  205.044531] [c00000085194b620] [c0000000008f288c] blk_mq_submit_bio+0xec/0x710
>> [  205.044569] [c00000085194b6c0] [c0000000008dfad4] submit_bio_noacct+0x534/0x680
>> [  205.044607] [c00000085194b760] [c0000000005e0308] iomap_dio_submit_bio+0xd8/0x100
>> [  205.044645] [c00000085194b790] [c0000000005e0ed4] iomap_dio_bio_actor+0x3c4/0x570
>> [  205.044672] [c00000085194b830] [c0000000005db284] iomap_apply+0x1f4/0x3e0
>> [  205.044698] [c00000085194b920] [c0000000005e06e4] __iomap_dio_rw+0x204/0x5b0
>> [  205.044735] [c00000085194b9f0] [c0000000005e0ab0] iomap_dio_rw+0x20/0x80
>> [  205.044771] [c00000085194ba10] [c00800000e4ea660] xfs_file_dio_write_aligned+0xb8/0x1b0 [xfs]
>> [  205.044897] [c00000085194ba60] [c00800000e4eabfc] xfs_file_write_iter+0x104/0x190 [xfs]
>> [  205.045021] [c00000085194ba90] [c008000010440724] nvmet_file_submit_bvec+0xfc/0x160 [nvmet]
>> [  205.045064] [c00000085194bb10] [c008000010440b54] nvmet_file_execute_io+0x2cc/0x3b0 [nvmet]
>> [  205.045106] [c00000085194bb90] [c00800000fa142d8] nvmet_tcp_io_work+0xce0/0xd10 [nvmet_tcp]
>> [  205.045144] [c00000085194bc70] [c000000000179534] process_one_work+0x294/0x580
>> [  205.045181] [c00000085194bd10] [c0000000001798c8] worker_thread+0xa8/0x650
>> [  205.045216] [c00000085194bda0] [c000000000184180] kthread+0x190/0x1a0
>> [  205.045242] [c00000085194be10] [c00000000000d4ec] ret_from_kernel_thread+0x5c/0x70
>>
>>
>>
>>
>> Best Regards,
>>   Yi Zhang
>>
>>


Thread overview: 9+ messages
     [not found] <841664844.1202812.1617239260115.JavaMail.zimbra@redhat.com>
2021-04-01  1:27 ` [bisected] bfq regression on latest linux-block/for-next Yi Zhang
2021-04-01  1:55   ` Chaitanya Kulkarni
2021-04-01  2:05     ` Yi Zhang
2021-04-02 13:39   ` Paolo Valente
2021-04-07 15:15     ` Yi Zhang [this message]
     [not found]       ` <CAHj4cs9+q-vH9qar+MTP-aECb2whT7O8J5OmR240yss1y=kWKw@mail.gmail.com>
2021-04-20  7:33         ` Paolo Valente
2021-05-03  9:37           ` Paolo Valente
2021-05-07  8:30             ` Yi Zhang
2021-05-11 17:10               ` Paolo Valente
