* [bug report] block/005 hangs with NVMe device and linux-block/for-next
@ 2021-11-01  8:34 Shinichiro Kawasaki
  2021-11-01 12:41 ` Jens Axboe
  0 siblings, 1 reply; 13+ messages in thread
From: Shinichiro Kawasaki @ 2021-11-01  8:34 UTC (permalink / raw)
  To: linux-block; +Cc: Damien Le Moal, axboe

I tried the latest linux-block/for-next branch tip (git hash b43fadb6631f) and
observed a process hang during a blktests block/005 run on an NVMe device.
The kernel message reported "INFO: task check:1224 blocked for more than 122
seconds." with the call trace [1]. So far, the hang is 100% reproducible on my
system. The hang is not observed with HDDs or null_blk devices.

I bisected and found that commit 4f5022453acd ("nvme: wire up completion batching
for the IRQ path") triggers the hang. When I revert this commit from the
for-next branch tip, the hang disappears. The block/005 test case switches the IO
scheduler during IO, and the completion path change introduced by the commit
appears to affect the scheduler switch. Comments toward a solution would be
appreciated.
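
For reference, what block/005 essentially does is switch the IO scheduler
while IO is in flight. A rough sketch of the idea (not the actual test
script; the device path, runtime and fio options below are only illustrative):

fio --name=sched-switch --filename=/dev/nvme0n1 --direct=1 --rw=randread \
    --ioengine=libaio --iodepth=16 --time_based --runtime=20 &

# keep switching the scheduler until the background fio job exits
while kill -0 $! 2>/dev/null; do
        for sched in none mq-deadline kyber; do
                echo "$sched" > /sys/block/nvme0n1/queue/scheduler
        done
done
wait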

[1]

[  121.677521] run blktests block/005 at 2021-11-01 16:24:21
[  371.819391] INFO: task check:1224 blocked for more than 122 seconds.
[  371.826660]       Not tainted 5.15.0-rc6+ #41
[  371.831786] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  371.840349] task:check           state:D stack:    0 pid: 1224 ppid:  1223 flags:0x00000000
[  371.849471] Call Trace:
[  371.852680]  __schedule+0x9e2/0x2240
[  371.857052]  ? io_schedule_timeout+0x190/0x190
[  371.862330]  ? _raw_spin_unlock_irqrestore+0x37/0x40
[  371.868070]  schedule+0xdd/0x280
[  371.872062]  blk_mq_freeze_queue_wait+0xc0/0xf0
[  371.877351]  ? blk_mq_queue_inflight+0x70/0x70
[  371.882554]  ? blk_mq_run_hw_queues+0x13b/0x410
[  371.887856]  ? finish_wait+0x270/0x270
[  371.892363]  elevator_switch+0x4a/0xa0
[  371.896857]  elv_iosched_store+0x16e/0x3c0
[  371.901698]  ? elevator_init_mq+0x3f0/0x3f0
[  371.906643]  ? lock_is_held_type+0xe0/0x110
[  371.911556]  ? lock_is_held_type+0xe0/0x110
[  371.916473]  queue_attr_store+0x8b/0xd0
[  371.921050]  ? sysfs_file_ops+0x170/0x170
[  371.925789]  kernfs_fop_write_iter+0x2c7/0x460
[  371.930965]  new_sync_write+0x359/0x5e0
[  371.935533]  ? new_sync_read+0x5d0/0x5d0
[  371.940177]  ? ksys_write+0xe9/0x1b0
[  371.944483]  ? lock_release+0x690/0x690
[  371.949046]  ? __cond_resched+0x15/0x30
[  371.953604]  ? inode_security+0x56/0xf0
[  371.958171]  ? lock_is_held_type+0xe0/0x110
[  371.963075]  vfs_write+0x5e4/0x8e0
[  371.967196]  ksys_write+0xe9/0x1b0
[  371.971321]  ? __ia32_sys_read+0xa0/0xa0
[  371.975962]  ? syscall_enter_from_user_mode+0x21/0x70
[  371.981733]  do_syscall_64+0x3b/0x90
[  371.986028]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  371.991799] RIP: 0033:0x7f24adda7387
[  371.996085] RSP: 002b:00007ffd9c80f2e8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[  372.004366] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007f24adda7387
[  372.012213] RDX: 0000000000000005 RSI: 0000562156826120 RDI: 0000000000000001
[  372.020060] RBP: 0000562156826120 R08: 000000000000000a R09: 00007f24ade3d4e0
[  372.027901] R10: 00007f24ade3d3e0 R11: 0000000000000246 R12: 0000000000000005
[  372.035746] R13: 00007f24ade7a520 R14: 0000000000000005 R15: 00007f24ade7a700
[  372.043629] 
               Showing all locks held in the system:
[  372.051235] 1 lock held by khungtaskd/62:
[  372.055960]  #0: ffffffffa64483c0 (rcu_read_lock){....}-{1:2}, at: debug_show_all_locks+0x53/0x269
[  372.065649] 1 lock held by systemd-journal/584:
[  372.070902] 4 locks held by check/1224:
[  372.075452]  #0: ffff888118fc4460 (sb_writers#4){.+.+}-{0:0}, at: ksys_write+0xe9/0x1b0
[  372.084177]  #1: ffff888120717088 (&of->mutex){+.+.}-{3:3}, at: kernfs_fop_write_iter+0x216/0x460
[  372.093771]  #2: ffff8881480a0a00 (kn->active#131){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x239/0x460
[  372.103707]  #3: ffff888148d66858 (&q->sysfs_lock){+.+.}-{3:3}, at: queue_attr_store+0x5d/0xd0

[  372.115258] =============================================

-- 
Best Regards,
Shin'ichiro Kawasaki

* Re: [bug report] block/005 hangs with NVMe device and linux-block/for-next
  2021-11-01  8:34 [bug report] block/005 hangs with NVMe device and linux-block/for-next Shinichiro Kawasaki
@ 2021-11-01 12:41 ` Jens Axboe
  2021-11-01 23:01   ` Jens Axboe
  0 siblings, 1 reply; 13+ messages in thread
From: Jens Axboe @ 2021-11-01 12:41 UTC (permalink / raw)
  To: Shinichiro Kawasaki, linux-block; +Cc: Damien Le Moal

On 11/1/21 2:34 AM, Shinichiro Kawasaki wrote:
> I tried the latest linux-block/for-next branch tip (git hash b43fadb6631f and
> observed a process hang during blktests block/005 run on a NVMe device.
> Kernel message reported "INFO: task check:1224 blocked for more than 122
> seconds." with call trace [1]. So far, the hang is 100% reproducible with my
> system. This hang is not observed with HDDs or null_blk devices.
> 
> I bisected and found the commit 4f5022453acd ("nvme: wire up completion batching
> for the IRQ path") triggers the hang. When I revert this commit from the
> for-next branch tip, the hang disappears. The block/005 test case does IO
> scheduler switch during IO, and the completion path change by the commit looks
> affecting the scheduler switch. Comments for solution will be appreciated.

I'll take a look at this.

-- 
Jens Axboe


* Re: [bug report] block/005 hangs with NVMe device and linux-block/for-next
  2021-11-01 12:41 ` Jens Axboe
@ 2021-11-01 23:01   ` Jens Axboe
  2021-11-02  2:22     ` Shinichiro Kawasaki
  0 siblings, 1 reply; 13+ messages in thread
From: Jens Axboe @ 2021-11-01 23:01 UTC (permalink / raw)
  To: Shinichiro Kawasaki, linux-block; +Cc: Damien Le Moal

On 11/1/21 6:41 AM, Jens Axboe wrote:
> On 11/1/21 2:34 AM, Shinichiro Kawasaki wrote:
>> I tried the latest linux-block/for-next branch tip (git hash b43fadb6631f and
>> observed a process hang during blktests block/005 run on a NVMe device.
>> Kernel message reported "INFO: task check:1224 blocked for more than 122
>> seconds." with call trace [1]. So far, the hang is 100% reproducible with my
>> system. This hang is not observed with HDDs or null_blk devices.
>>
>> I bisected and found the commit 4f5022453acd ("nvme: wire up completion batching
>> for the IRQ path") triggers the hang. When I revert this commit from the
>> for-next branch tip, the hang disappears. The block/005 test case does IO
>> scheduler switch during IO, and the completion path change by the commit looks
>> affecting the scheduler switch. Comments for solution will be appreciated.
> 
> I'll take a look at this.

I've tried running various things most of the day, and I cannot
reproduce this issue, nor do I see what it could be. Even if requests are
split between batched completion and one-by-one completion, it works
just fine for me. No special care needs to be taken for put_many() on
the queue reference, as the wake_up() happens when the ref goes to zero.

Tell me more about your setup. What do the runtimes of the test look
like? Do you have all schedulers enabled? What kind of NVMe device is
this?

FWIW, this is upstream now, so testing with Linus -git would be
preferable.

-- 
Jens Axboe


* Re: [bug report] block/005 hangs with NVMe device and linux-block/for-next
  2021-11-01 23:01   ` Jens Axboe
@ 2021-11-02  2:22     ` Shinichiro Kawasaki
  2021-11-02  3:07       ` Chaitanya Kulkarni
  2021-11-02  3:44       ` Ming Lei
  0 siblings, 2 replies; 13+ messages in thread
From: Shinichiro Kawasaki @ 2021-11-02  2:22 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, Damien Le Moal

On Nov 01, 2021 / 17:01, Jens Axboe wrote:
> On 11/1/21 6:41 AM, Jens Axboe wrote:
> > On 11/1/21 2:34 AM, Shinichiro Kawasaki wrote:
> >> I tried the latest linux-block/for-next branch tip (git hash b43fadb6631f and
> >> observed a process hang during blktests block/005 run on a NVMe device.
> >> Kernel message reported "INFO: task check:1224 blocked for more than 122
> >> seconds." with call trace [1]. So far, the hang is 100% reproducible with my
> >> system. This hang is not observed with HDDs or null_blk devices.
> >>
> >> I bisected and found the commit 4f5022453acd ("nvme: wire up completion batching
> >> for the IRQ path") triggers the hang. When I revert this commit from the
> >> for-next branch tip, the hang disappears. The block/005 test case does IO
> >> scheduler switch during IO, and the completion path change by the commit looks
> >> affecting the scheduler switch. Comments for solution will be appreciated.
> > 
> > I'll take a look at this.
> 
> I've tried running various things most of the day, and I cannot
> reproduce this issue nor do I see what it could be. Even if requests are
> split between batched completion and one-by-one completion, it works
> just fine for me. No special care needs to be taken for put_many() on
> the queue reference, as the wake_up() happens for the ref going to zero.
> 
> Tell me more about your setup. What does the runtimes of the test look
> like? Do you have all schedulers enabled? What kind of NVMe device is
> this?

Thank you for spending your time on this. With a kernel that does not hang,
the test case completes in around 20 seconds. When the hang happens, the check
script process stops at blk_mq_freeze_queue_wait() during the scheduler change,
and the fio workload processes stop at __blkdev_direct_IO_simple(). The test
case does not end, so I need to reboot the system for the next trial. While
waiting for the test case to complete, the kernel repeats the same INFO message
every 2 minutes.

Regarding the scheduler, I compiled the kernel with mq-deadline and kyber.
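
In case it helps, this is roughly how I confirm what is enabled (a sketch;
the .config path and the device name depend on the setup):

grep -E 'CONFIG_MQ_IOSCHED_(DEADLINE|KYBER)|CONFIG_IOSCHED_BFQ' .config  # built-in schedulers
cat /sys/block/nvme0n1/queue/scheduler                                   # schedulers selectable at runtime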

The NVMe device I use is a U.2 NVMe ZNS SSD. It has a zoned namespace and
a regular namespace, and the hang is observed with both namespaces. I have
not yet tried other NVMe devices, so I will try them.

> 
> FWIW, this is upstream now, so testing with Linus -git would be
> preferable.

I see. I have switched from the linux-block for-next branch to Linus' upstream
tree. At git hash 879dbe9ffebc, the hang is still observed.

-- 
Best Regards,
Shin'ichiro Kawasaki

* Re: [bug report] block/005 hangs with NVMe device and linux-block/for-next
  2021-11-02  2:22     ` Shinichiro Kawasaki
@ 2021-11-02  3:07       ` Chaitanya Kulkarni
  2021-11-02  8:19         ` Shinichiro Kawasaki
  2021-11-02  3:44       ` Ming Lei
  1 sibling, 1 reply; 13+ messages in thread
From: Chaitanya Kulkarni @ 2021-11-02  3:07 UTC (permalink / raw)
  To: Shinichiro Kawasaki; +Cc: linux-block, Jens Axboe, Damien Le Moal


> The NVMe device I use is a U.2 NVMe ZNS SSD. It has a zoned name space and
> a regular name space, and the hang is observed with both name spaces. I have
> not yet tried other NVME devices, so I will try them.
> 

See if you can reproduce this with QEMU NVMe emulation (ZNS and non-ZNS
mode); if you can, it will be easier for everyone to reproduce.

>>
>> FWIW, this is upstream now, so testing with Linus -git would be
>> preferable.
> 
> I see. I have switched from linux-block for-next branch to the upstream branch
> of Linus. At git hash 879dbe9ffebc, and still the hang is observed.
> 
> --
> Best Regards,
> Shin'ichiro Kawasaki
> 

* Re: [bug report] block/005 hangs with NVMe device and linux-block/for-next
  2021-11-02  2:22     ` Shinichiro Kawasaki
  2021-11-02  3:07       ` Chaitanya Kulkarni
@ 2021-11-02  3:44       ` Ming Lei
  2021-11-02  8:28         ` Shinichiro Kawasaki
  1 sibling, 1 reply; 13+ messages in thread
From: Ming Lei @ 2021-11-02  3:44 UTC (permalink / raw)
  To: Shinichiro Kawasaki; +Cc: Jens Axboe, linux-block, Damien Le Moal, ming.lei

On Tue, Nov 02, 2021 at 02:22:15AM +0000, Shinichiro Kawasaki wrote:
> On Nov 01, 2021 / 17:01, Jens Axboe wrote:
> > On 11/1/21 6:41 AM, Jens Axboe wrote:
> > > On 11/1/21 2:34 AM, Shinichiro Kawasaki wrote:
> > >> I tried the latest linux-block/for-next branch tip (git hash b43fadb6631f and
> > >> observed a process hang during blktests block/005 run on a NVMe device.
> > >> Kernel message reported "INFO: task check:1224 blocked for more than 122
> > >> seconds." with call trace [1]. So far, the hang is 100% reproducible with my
> > >> system. This hang is not observed with HDDs or null_blk devices.
> > >>
> > >> I bisected and found the commit 4f5022453acd ("nvme: wire up completion batching
> > >> for the IRQ path") triggers the hang. When I revert this commit from the
> > >> for-next branch tip, the hang disappears. The block/005 test case does IO
> > >> scheduler switch during IO, and the completion path change by the commit looks
> > >> affecting the scheduler switch. Comments for solution will be appreciated.
> > > 
> > > I'll take a look at this.
> > 
> > I've tried running various things most of the day, and I cannot
> > reproduce this issue nor do I see what it could be. Even if requests are
> > split between batched completion and one-by-one completion, it works
> > just fine for me. No special care needs to be taken for put_many() on
> > the queue reference, as the wake_up() happens for the ref going to zero.
> > 
> > Tell me more about your setup. What does the runtimes of the test look
> > like? Do you have all schedulers enabled? What kind of NVMe device is
> > this?
> 
> Thank you for spending your precious time. With the kernel without the hang,
> the test case completes around 20 seconds. When the hang happens, the check
> script process stops at blk_mq_freeze_queue_wait() at scheduler change, and fio
> workload processes stop at __blkdev_direct_IO_simple(). The test case does not
> end, so I need to reboot the system for the next trial. While waiting the test
> case completion, the kernel repeats the same INFO message every 2 minutes.
> 
> Regarding the scheduler, I compiled the kernel with mq-deadline and kyber.
> 
> The NVMe device I use is a U.2 NVMe ZNS SSD. It has a zoned name space and
> a regular name space, and the hang is observed with both name spaces. I have
> not yet tried other NVME devices, so I will try them.
> 
> > 
> > FWIW, this is upstream now, so testing with Linus -git would be
> > preferable.
> 
> I see. I have switched from linux-block for-next branch to the upstream branch
> of Linus. At git hash 879dbe9ffebc, and still the hang is observed.

Can you post the blk-mq debugfs log after the hang is triggered?

(cd /sys/kernel/debug/block/nvme0n1 && find . -type f -exec grep -aH . {} \;)


Thanks 
Ming


* Re: [bug report] block/005 hangs with NVMe device and linux-block/for-next
  2021-11-02  3:07       ` Chaitanya Kulkarni
@ 2021-11-02  8:19         ` Shinichiro Kawasaki
  2021-11-02  8:28           ` Damien Le Moal
  0 siblings, 1 reply; 13+ messages in thread
From: Shinichiro Kawasaki @ 2021-11-02  8:19 UTC (permalink / raw)
  To: Chaitanya Kulkarni; +Cc: linux-block, Jens Axboe, Damien Le Moal

On Nov 02, 2021 / 03:07, Chaitanya Kulkarni wrote:
> 
> > The NVMe device I use is a U.2 NVMe ZNS SSD. It has a zoned name space and
> > a regular name space, and the hang is observed with both name spaces. I have
> > not yet tried other NVME devices, so I will try them.

I tried another NVMe device, WD Black SN750, but it did not recreate the hang.

> > 
> 
> See if you can produce this with QEMU NVMe emulation (ZNS and NON-ZNS
> mode), if you can then it will be easier to reproduce for everyone.

Chaitanya, thank you for the advice. I have managed to reproduce the hang with
QEMU NVMe emulation. Actually, ZNS mode is not required. I tried several device
setup configurations with QEMU, and the hang was recreated when a single NVMe
device has two namespaces. With a single namespace on a single NVMe device, the
hang is not observed.

So it looks like the number of namespaces may be related to the cause. The
WD Black SN750, which did not hang, has a single namespace. When I reduced the
number of namespaces on the U.2 NVMe ZNS SSD from 2 to 1, the hang was no
longer observed.

FYI, the QEMU command line options I used were as follows. They prepare
/dev/nvme0n1 and /dev/nvme0n2, and a block/005 run on /dev/nvme0n1 recreated
the hang.

-device nvme,id=nvme0,serial=qemunvme,logical_block_size=4096,physical_block_size=4096 \
-drive file=(path)/nvme0n1.img,id=nvme0n1,format=raw,if=none \
-device nvme-ns,drive=nvme0n1,bus=nvme0,nsid=1 \
-drive file=(path)/nvme0n2.img,id=nvme0n2,format=raw,if=none \
-device nvme-ns,drive=nvme0n2,bus=nvme0,nsid=2

Regarding the two image files, I created them beforehand with the command below:

$ qemu-img create -f raw "${image_file_path}" 1024M
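
In case it is useful, I run the test against the first emulated namespace
roughly like this (a sketch; the blktests checkout path is omitted and the
config file content is the minimum needed here):

# in the blktests checkout
cat > config <<'EOF'
TEST_DEVS=(/dev/nvme0n1)
EOF
./check block/005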

Hope this helps.

-- 
Best Regards,
Shin'ichiro Kawasaki

* Re: [bug report] block/005 hangs with NVMe device and linux-block/for-next
  2021-11-02  8:19         ` Shinichiro Kawasaki
@ 2021-11-02  8:28           ` Damien Le Moal
  0 siblings, 0 replies; 13+ messages in thread
From: Damien Le Moal @ 2021-11-02  8:28 UTC (permalink / raw)
  To: Shinichiro Kawasaki, Chaitanya Kulkarni
  Cc: linux-block, Jens Axboe, Damien Le Moal, linux-nvme, Keith Busch,
	Christoph Hellwig


+linux-nvme (Keith and Christoph)


On 2021/11/02 17:19, Shinichiro Kawasaki wrote:
> On Nov 02, 2021 / 03:07, Chaitanya Kulkarni wrote:
>>
>>> The NVMe device I use is a U.2 NVMe ZNS SSD. It has a zoned name space and
>>> a regular name space, and the hang is observed with both name spaces. I have
>>> not yet tried other NVME devices, so I will try them.
> 
> I tried another NVMe device, WD Black SN750, but it did not recreate the hang.
> 
>>>
>>
>> See if you can produce this with QEMU NVMe emulation (ZNS and NON-ZNS
>> mode), if you can then it will be easier to reproduce for everyone.
> 
> Chaitanya, thank you for the advice. I have managed to reproduce the hang with
> QEMU NVMe emulation. Actually, ZNS mode is not required. I tried some device
> set up configuration with QEMU, and the hang was recreated when a single NVMe
> device has two namespaces. With single namespace in a single NVMe device, the
> hang is not observed.
> 
> So it looks like that the number of namespaces may be related to the cause. The
> WD Black SN750 without hang has single namespace. I reduced the number of
> namespaces of the U.2 NVMe ZNS SSD from 2 to 1, then the hang was not observed.
> 
> FYI, the QEMU command line options that I used was as follows. It prepares
> /dev/nvme0n1 and /dev/nvme0n2, and the block/005 run on /dev/nvme0n1 recreated
> the hang.
> 
> -device nvme,id=nvme0,serial=qemunvme,logical_block_size=4096,physical_block_size=4096 \
> -drive file=(path)/nvme0n1.img,id=nvme0n1,format=raw,if=none \
> -device nvme-ns,drive=nvme0n1,bus=nvme0,nsid=1 \
> -drive file=(path)/nvme0n2.img,id=nvme0n2,format=raw,if=none \
> -device nvme-ns,drive=nvme0n2,bus=nvme0,nsid=2
> 
> Regarding the two image files, I created them beforehand with the command below:
> 
> $ qemu-img create -f raw "${image_file_path}" 1024M
> 
> Hope this helps.
> 



-- 
Damien Le Moal
Western Digital Research

* Re: [bug report] block/005 hangs with NVMe device and linux-block/for-next
  2021-11-02  3:44       ` Ming Lei
@ 2021-11-02  8:28         ` Shinichiro Kawasaki
  2021-11-02  9:02           ` Shinichiro Kawasaki
  0 siblings, 1 reply; 13+ messages in thread
From: Shinichiro Kawasaki @ 2021-11-02  8:28 UTC (permalink / raw)
  To: Ming Lei; +Cc: Jens Axboe, linux-block, Damien Le Moal

On Nov 02, 2021 / 11:44, Ming Lei wrote:
> On Tue, Nov 02, 2021 at 02:22:15AM +0000, Shinichiro Kawasaki wrote:
> > On Nov 01, 2021 / 17:01, Jens Axboe wrote:
> > > On 11/1/21 6:41 AM, Jens Axboe wrote:
> > > > On 11/1/21 2:34 AM, Shinichiro Kawasaki wrote:
> > > >> I tried the latest linux-block/for-next branch tip (git hash b43fadb6631f and
> > > >> observed a process hang during blktests block/005 run on a NVMe device.
> > > >> Kernel message reported "INFO: task check:1224 blocked for more than 122
> > > >> seconds." with call trace [1]. So far, the hang is 100% reproducible with my
> > > >> system. This hang is not observed with HDDs or null_blk devices.
> > > >>
> > > >> I bisected and found the commit 4f5022453acd ("nvme: wire up completion batching
> > > >> for the IRQ path") triggers the hang. When I revert this commit from the
> > > >> for-next branch tip, the hang disappears. The block/005 test case does IO
> > > >> scheduler switch during IO, and the completion path change by the commit looks
> > > >> affecting the scheduler switch. Comments for solution will be appreciated.
> > > > 
> > > > I'll take a look at this.
> > > 
> > > I've tried running various things most of the day, and I cannot
> > > reproduce this issue nor do I see what it could be. Even if requests are
> > > split between batched completion and one-by-one completion, it works
> > > just fine for me. No special care needs to be taken for put_many() on
> > > the queue reference, as the wake_up() happens for the ref going to zero.
> > > 
> > > Tell me more about your setup. What does the runtimes of the test look
> > > like? Do you have all schedulers enabled? What kind of NVMe device is
> > > this?
> > 
> > Thank you for spending your precious time. With the kernel without the hang,
> > the test case completes around 20 seconds. When the hang happens, the check
> > script process stops at blk_mq_freeze_queue_wait() at scheduler change, and fio
> > workload processes stop at __blkdev_direct_IO_simple(). The test case does not
> > end, so I need to reboot the system for the next trial. While waiting the test
> > case completion, the kernel repeats the same INFO message every 2 minutes.
> > 
> > Regarding the scheduler, I compiled the kernel with mq-deadline and kyber.
> > 
> > The NVMe device I use is a U.2 NVMe ZNS SSD. It has a zoned name space and
> > a regular name space, and the hang is observed with both name spaces. I have
> > not yet tried other NVME devices, so I will try them.
> > 
> > > 
> > > FWIW, this is upstream now, so testing with Linus -git would be
> > > preferable.
> > 
> > I see. I have switched from linux-block for-next branch to the upstream branch
> > of Linus. At git hash 879dbe9ffebc, and still the hang is observed.
> 
> Can you post the blk-mq debugfs log after the hang is triggered?
> 
> (cd /sys/kernel/debug/block/nvme0n1 && find . -type f -exec grep -aH . {} \;)

Thanks Ming. When I ran the command above, the grep command stopped when it
opened the tag-related files in the debugfs tree. That grep command looked to
be hanging as well, so I used the find command below instead to exclude the
tag-related files.

# find . -type f -not -name *tag* -exec grep -aH . {} \;
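
(As a side note, the same snapshot can be redirected to a file for sharing
with something like the following; quoting the glob keeps the shell from
expanding it, and the output path is arbitrary:)

# find . -type f -not -name '*tag*' -exec grep -aH . {} \; > /tmp/nvme0n1-debugfs.log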

Here I share the captured log.

./sched/queued:0 8 0
./sched/owned_by_driver:0 8 0
./sched/async_depth:192
./sched/starved:0
./sched/batching:1
./rqos/wbt/wb_background:4
./rqos/wbt/wb_normal:8
./rqos/wbt/unknown_cnt:0
./rqos/wbt/min_lat_nsec:2000000
./rqos/wbt/inflight:0: inflight 0
./rqos/wbt/inflight:1: inflight 0
./rqos/wbt/inflight:2: inflight 0
./rqos/wbt/id:0
./rqos/wbt/enabled:1
./rqos/wbt/curr_win_nsec:0
./hctx7/type:default
./hctx7/dispatch_busy:7
./hctx7/active:1024
./hctx7/run:5
./hctx7/ctx_map:00000000: 00
./hctx7/dispatch:000000003dfed3fd {.op=READ, .cmd_flags=, .rq_flags=STARTED|ELVPRIV|IO_STAT|22, .state=idle, .tag=-1, .internal_tag=192}
./hctx7/dispatch:0000000077876d9e {.op=READ, .cmd_flags=, .rq_flags=STARTED|ELVPRIV|IO_STAT|22, .state=idle, .tag=-1, .internal_tag=193}
./hctx7/flags:alloc_policy=FIFO SHOULD_MERGE|TAG_QUEUE_SHARED
./hctx7/state:TAG_ACTIVE|SCHED_RESTART
./hctx6/type:default
./hctx6/dispatch_busy:5
./hctx6/active:1025
./hctx6/run:4
./hctx6/ctx_map:00000000: 00
./hctx6/dispatch:00000000c0b8e1c9 {.op=READ, .cmd_flags=, .rq_flags=STARTED|ELVPRIV|IO_STAT|22, .state=idle, .tag=-1, .internal_tag=192}
./hctx6/flags:alloc_policy=FIFO SHOULD_MERGE|TAG_QUEUE_SHARED
./hctx6/state:TAG_ACTIVE|SCHED_RESTART
./hctx5/type:default
./hctx5/dispatch_busy:5
./hctx5/active:1024
./hctx5/run:4
./hctx5/ctx_map:00000000: 00
./hctx5/dispatch:00000000aaf1e364 {.op=READ, .cmd_flags=, .rq_flags=STARTED|ELVPRIV|IO_STAT|22, .state=idle, .tag=-1, .internal_tag=128}
./hctx5/flags:alloc_policy=FIFO SHOULD_MERGE|TAG_QUEUE_SHARED
./hctx5/state:TAG_ACTIVE|SCHED_RESTART
./hctx4/type:default
./hctx4/dispatch_busy:0
./hctx4/active:1023
./hctx4/run:1
./hctx4/ctx_map:00000000: 00
./hctx4/flags:alloc_policy=FIFO SHOULD_MERGE|TAG_QUEUE_SHARED
./hctx3/type:default
./hctx3/dispatch_busy:5
./hctx3/active:1024
./hctx3/run:4
./hctx3/ctx_map:00000000: 00
./hctx3/dispatch:000000008b07d5e1 {.op=READ, .cmd_flags=, .rq_flags=STARTED|ELVPRIV|IO_STAT|22, .state=idle, .tag=-1, .internal_tag=128}
./hctx3/flags:alloc_policy=FIFO SHOULD_MERGE|TAG_QUEUE_SHARED
./hctx3/state:TAG_ACTIVE|SCHED_RESTART
./hctx2/type:default
./hctx2/dispatch_busy:5
./hctx2/active:1024
./hctx2/run:4
./hctx2/ctx_map:00000000: 00
./hctx2/dispatch:00000000c4887013 {.op=READ, .cmd_flags=, .rq_flags=STARTED|ELVPRIV|IO_STAT|22, .state=idle, .tag=-1, .internal_tag=128}
./hctx2/flags:alloc_policy=FIFO SHOULD_MERGE|TAG_QUEUE_SHARED
./hctx2/state:TAG_ACTIVE|SCHED_RESTART
./hctx1/type:default
./hctx1/dispatch_busy:6
./hctx1/active:1024
./hctx1/run:5
./hctx1/ctx_map:00000000: 00
./hctx1/dispatch:00000000efe38e4e {.op=READ, .cmd_flags=, .rq_flags=STARTED|ELVPRIV|IO_STAT|22, .state=idle, .tag=-1, .internal_tag=0}
./hctx1/flags:alloc_policy=FIFO SHOULD_MERGE|TAG_QUEUE_SHARED
./hctx1/state:TAG_ACTIVE|SCHED_RESTART
./hctx0/type:default
./hctx0/dispatch_busy:5
./hctx0/active:1024
./hctx0/run:4
./hctx0/ctx_map:00000000: 00
./hctx0/dispatch:0000000015147095 {.op=READ, .cmd_flags=, .rq_flags=STARTED|ELVPRIV|IO_STAT|22, .state=idle, .tag=-1, .internal_tag=192}
./hctx0/flags:alloc_policy=FIFO SHOULD_MERGE|TAG_QUEUE_SHARED
./hctx0/state:TAG_ACTIVE|SCHED_RESTART
./write_hints:hint0: 0
./write_hints:hint1: 0
./write_hints:hint2: 0
./write_hints:hint3: 0
./write_hints:hint4: 0
./state:SAME_COMP|NONROT|IO_STAT|DISCARD|INIT_DONE|STATS|REGISTERED|PCI_P2PDMA|NOWAIT
./pm_only:0
./poll_stat:read  (512 Bytes): samples=0
./poll_stat:write (512 Bytes): samples=0
./poll_stat:read  (1024 Bytes): samples=0
./poll_stat:write (1024 Bytes): samples=0
./poll_stat:read  (2048 Bytes): samples=0
./poll_stat:write (2048 Bytes): samples=0
./poll_stat:read  (4096 Bytes): samples=0
./poll_stat:write (4096 Bytes): samples=0
./poll_stat:read  (8192 Bytes): samples=0
./poll_stat:write (8192 Bytes): samples=0
./poll_stat:read  (16384 Bytes): samples=0
./poll_stat:write (16384 Bytes): samples=0
./poll_stat:read  (32768 Bytes): samples=0
./poll_stat:write (32768 Bytes): samples=0
./poll_stat:read  (65536 Bytes): samples=0
./poll_stat:write (65536 Bytes): samples=0


-- 
Best Regards,
Shin'ichiro Kawasaki

* Re: [bug report] block/005 hangs with NVMe device and linux-block/for-next
  2021-11-02  8:28         ` Shinichiro Kawasaki
@ 2021-11-02  9:02           ` Shinichiro Kawasaki
  2021-11-02 10:48             ` Ming Lei
  0 siblings, 1 reply; 13+ messages in thread
From: Shinichiro Kawasaki @ 2021-11-02  9:02 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Damien Le Moal, linux-nvme, Keith Busch,
	Christoph Hellwig

Let me add linux-nvme, Keith and Christoph to the CC list.

-- 
Best Regards,
Shin'ichiro Kawasaki


On Nov 02, 2021 / 17:28, Shin'ichiro Kawasaki wrote:
> On Nov 02, 2021 / 11:44, Ming Lei wrote:
> > On Tue, Nov 02, 2021 at 02:22:15AM +0000, Shinichiro Kawasaki wrote:
> > > On Nov 01, 2021 / 17:01, Jens Axboe wrote:
> > > > On 11/1/21 6:41 AM, Jens Axboe wrote:
> > > > > On 11/1/21 2:34 AM, Shinichiro Kawasaki wrote:
> > > > >> I tried the latest linux-block/for-next branch tip (git hash b43fadb6631f and
> > > > >> observed a process hang during blktests block/005 run on a NVMe device.
> > > > >> Kernel message reported "INFO: task check:1224 blocked for more than 122
> > > > >> seconds." with call trace [1]. So far, the hang is 100% reproducible with my
> > > > >> system. This hang is not observed with HDDs or null_blk devices.
> > > > >>
> > > > >> I bisected and found the commit 4f5022453acd ("nvme: wire up completion batching
> > > > >> for the IRQ path") triggers the hang. When I revert this commit from the
> > > > >> for-next branch tip, the hang disappears. The block/005 test case does IO
> > > > >> scheduler switch during IO, and the completion path change by the commit looks
> > > > >> affecting the scheduler switch. Comments for solution will be appreciated.
> > > > > 
> > > > > I'll take a look at this.
> > > > 
> > > > I've tried running various things most of the day, and I cannot
> > > > reproduce this issue nor do I see what it could be. Even if requests are
> > > > split between batched completion and one-by-one completion, it works
> > > > just fine for me. No special care needs to be taken for put_many() on
> > > > the queue reference, as the wake_up() happens for the ref going to zero.
> > > > 
> > > > Tell me more about your setup. What does the runtimes of the test look
> > > > like? Do you have all schedulers enabled? What kind of NVMe device is
> > > > this?
> > > 
> > > Thank you for spending your precious time. With the kernel without the hang,
> > > the test case completes around 20 seconds. When the hang happens, the check
> > > script process stops at blk_mq_freeze_queue_wait() at scheduler change, and fio
> > > workload processes stop at __blkdev_direct_IO_simple(). The test case does not
> > > end, so I need to reboot the system for the next trial. While waiting the test
> > > case completion, the kernel repeats the same INFO message every 2 minutes.
> > > 
> > > Regarding the scheduler, I compiled the kernel with mq-deadline and kyber.
> > > 
> > > The NVMe device I use is a U.2 NVMe ZNS SSD. It has a zoned name space and
> > > a regular name space, and the hang is observed with both name spaces. I have
> > > not yet tried other NVME devices, so I will try them.
> > > 
> > > > 
> > > > FWIW, this is upstream now, so testing with Linus -git would be
> > > > preferable.
> > > 
> > > I see. I have switched from linux-block for-next branch to the upstream branch
> > > of Linus. At git hash 879dbe9ffebc, and still the hang is observed.
> > 
> > Can you post the blk-mq debugfs log after the hang is triggered?
> > 
> > (cd /sys/kernel/debug/block/nvme0n1 && find . -type f -exec grep -aH . {} \;)
> 
> Thanks Ming. When I ran the command above, the grep command stopped when it
> opened tag related files in the debugfs tree. That grep command looked hanking
> also. So I used the find command below instead to exclude the tag related files.
> 
> # find . -type f -not -name *tag* -exec grep -aH . {} \;
> 
> Here I share the captured log.
> 
> ./sched/queued:0 8 0
> ./sched/owned_by_driver:0 8 0
> ./sched/async_depth:192
> ./sched/starved:0
> ./sched/batching:1
> ./rqos/wbt/wb_background:4
> ./rqos/wbt/wb_normal:8
> ./rqos/wbt/unknown_cnt:0
> ./rqos/wbt/min_lat_nsec:2000000
> ./rqos/wbt/inflight:0: inflight 0
> ./rqos/wbt/inflight:1: inflight 0
> ./rqos/wbt/inflight:2: inflight 0
> ./rqos/wbt/id:0
> ./rqos/wbt/enabled:1
> ./rqos/wbt/curr_win_nsec:0
> ./hctx7/type:default
> ./hctx7/dispatch_busy:7
> ./hctx7/active:1024
> ./hctx7/run:5
> ./hctx7/ctx_map:00000000: 00
> ./hctx7/dispatch:000000003dfed3fd {.op=READ, .cmd_flags=, .rq_flags=STARTED|ELVPRIV|IO_STAT|22, .state=idle, .tag=-1, .internal_tag=192}
> ./hctx7/dispatch:0000000077876d9e {.op=READ, .cmd_flags=, .rq_flags=STARTED|ELVPRIV|IO_STAT|22, .state=idle, .tag=-1, .internal_tag=193}
> ./hctx7/flags:alloc_policy=FIFO SHOULD_MERGE|TAG_QUEUE_SHARED
> ./hctx7/state:TAG_ACTIVE|SCHED_RESTART
> ./hctx6/type:default
> ./hctx6/dispatch_busy:5
> ./hctx6/active:1025
> ./hctx6/run:4
> ./hctx6/ctx_map:00000000: 00
> ./hctx6/dispatch:00000000c0b8e1c9 {.op=READ, .cmd_flags=, .rq_flags=STARTED|ELVPRIV|IO_STAT|22, .state=idle, .tag=-1, .internal_tag=192}
> ./hctx6/flags:alloc_policy=FIFO SHOULD_MERGE|TAG_QUEUE_SHARED
> ./hctx6/state:TAG_ACTIVE|SCHED_RESTART
> ./hctx5/type:default
> ./hctx5/dispatch_busy:5
> ./hctx5/active:1024
> ./hctx5/run:4
> ./hctx5/ctx_map:00000000: 00
> ./hctx5/dispatch:00000000aaf1e364 {.op=READ, .cmd_flags=, .rq_flags=STARTED|ELVPRIV|IO_STAT|22, .state=idle, .tag=-1, .internal_tag=128}
> ./hctx5/flags:alloc_policy=FIFO SHOULD_MERGE|TAG_QUEUE_SHARED
> ./hctx5/state:TAG_ACTIVE|SCHED_RESTART
> ./hctx4/type:default
> ./hctx4/dispatch_busy:0
> ./hctx4/active:1023
> ./hctx4/run:1
> ./hctx4/ctx_map:00000000: 00
> ./hctx4/flags:alloc_policy=FIFO SHOULD_MERGE|TAG_QUEUE_SHARED
> ./hctx3/type:default
> ./hctx3/dispatch_busy:5
> ./hctx3/active:1024
> ./hctx3/run:4
> ./hctx3/ctx_map:00000000: 00
> ./hctx3/dispatch:000000008b07d5e1 {.op=READ, .cmd_flags=, .rq_flags=STARTED|ELVPRIV|IO_STAT|22, .state=idle, .tag=-1, .internal_tag=128}
> ./hctx3/flags:alloc_policy=FIFO SHOULD_MERGE|TAG_QUEUE_SHARED
> ./hctx3/state:TAG_ACTIVE|SCHED_RESTART
> ./hctx2/type:default
> ./hctx2/dispatch_busy:5
> ./hctx2/active:1024
> ./hctx2/run:4
> ./hctx2/ctx_map:00000000: 00
> ./hctx2/dispatch:00000000c4887013 {.op=READ, .cmd_flags=, .rq_flags=STARTED|ELVPRIV|IO_STAT|22, .state=idle, .tag=-1, .internal_tag=128}
> ./hctx2/flags:alloc_policy=FIFO SHOULD_MERGE|TAG_QUEUE_SHARED
> ./hctx2/state:TAG_ACTIVE|SCHED_RESTART
> ./hctx1/type:default
> ./hctx1/dispatch_busy:6
> ./hctx1/active:1024
> ./hctx1/run:5
> ./hctx1/ctx_map:00000000: 00
> ./hctx1/dispatch:00000000efe38e4e {.op=READ, .cmd_flags=, .rq_flags=STARTED|ELVPRIV|IO_STAT|22, .state=idle, .tag=-1, .internal_tag=0}
> ./hctx1/flags:alloc_policy=FIFO SHOULD_MERGE|TAG_QUEUE_SHARED
> ./hctx1/state:TAG_ACTIVE|SCHED_RESTART
> ./hctx0/type:default
> ./hctx0/dispatch_busy:5
> ./hctx0/active:1024
> ./hctx0/run:4
> ./hctx0/ctx_map:00000000: 00
> ./hctx0/dispatch:0000000015147095 {.op=READ, .cmd_flags=, .rq_flags=STARTED|ELVPRIV|IO_STAT|22, .state=idle, .tag=-1, .internal_tag=192}
> ./hctx0/flags:alloc_policy=FIFO SHOULD_MERGE|TAG_QUEUE_SHARED
> ./hctx0/state:TAG_ACTIVE|SCHED_RESTART
> ./write_hints:hint0: 0
> ./write_hints:hint1: 0
> ./write_hints:hint2: 0
> ./write_hints:hint3: 0
> ./write_hints:hint4: 0
> ./state:SAME_COMP|NONROT|IO_STAT|DISCARD|INIT_DONE|STATS|REGISTERED|PCI_P2PDMA|NOWAIT
> ./pm_only:0
> ./poll_stat:read  (512 Bytes): samples=0
> ./poll_stat:write (512 Bytes): samples=0
> ./poll_stat:read  (1024 Bytes): samples=0
> ./poll_stat:write (1024 Bytes): samples=0
> ./poll_stat:read  (2048 Bytes): samples=0
> ./poll_stat:write (2048 Bytes): samples=0
> ./poll_stat:read  (4096 Bytes): samples=0
> ./poll_stat:write (4096 Bytes): samples=0
> ./poll_stat:read  (8192 Bytes): samples=0
> ./poll_stat:write (8192 Bytes): samples=0
> ./poll_stat:read  (16384 Bytes): samples=0
> ./poll_stat:write (16384 Bytes): samples=0
> ./poll_stat:read  (32768 Bytes): samples=0
> ./poll_stat:write (32768 Bytes): samples=0
> ./poll_stat:read  (65536 Bytes): samples=0
> ./poll_stat:write (65536 Bytes): samples=0
> 
> 
> -- 
> Best Regards,
> Shin'ichiro Kawasaki

* Re: [bug report] block/005 hangs with NVMe device and linux-block/for-next
  2021-11-02  9:02           ` Shinichiro Kawasaki
@ 2021-11-02 10:48             ` Ming Lei
  2021-11-02 11:24               ` Shinichiro Kawasaki
  0 siblings, 1 reply; 13+ messages in thread
From: Ming Lei @ 2021-11-02 10:48 UTC (permalink / raw)
  To: Shinichiro Kawasaki
  Cc: Jens Axboe, linux-block, Damien Le Moal, linux-nvme, Keith Busch,
	Christoph Hellwig, ming.lei

On Tue, Nov 02, 2021 at 09:02:47AM +0000, Shinichiro Kawasaki wrote:
> Let me add linux-nvme, Keith and Christoph to the CC list.
> 
> -- 
> Best Regards,
> Shin'ichiro Kawasaki
> 
> 
> On Nov 02, 2021 / 17:28, Shin'ichiro Kawasaki wrote:
> > On Nov 02, 2021 / 11:44, Ming Lei wrote:
> > > On Tue, Nov 02, 2021 at 02:22:15AM +0000, Shinichiro Kawasaki wrote:
> > > > On Nov 01, 2021 / 17:01, Jens Axboe wrote:
> > > > > On 11/1/21 6:41 AM, Jens Axboe wrote:
> > > > > > On 11/1/21 2:34 AM, Shinichiro Kawasaki wrote:
> > > > > >> I tried the latest linux-block/for-next branch tip (git hash b43fadb6631f and
> > > > > >> observed a process hang during blktests block/005 run on a NVMe device.
> > > > > >> Kernel message reported "INFO: task check:1224 blocked for more than 122
> > > > > >> seconds." with call trace [1]. So far, the hang is 100% reproducible with my
> > > > > >> system. This hang is not observed with HDDs or null_blk devices.
> > > > > >>
> > > > > >> I bisected and found the commit 4f5022453acd ("nvme: wire up completion batching
> > > > > >> for the IRQ path") triggers the hang. When I revert this commit from the
> > > > > >> for-next branch tip, the hang disappears. The block/005 test case does IO
> > > > > >> scheduler switch during IO, and the completion path change by the commit looks
> > > > > >> affecting the scheduler switch. Comments for solution will be appreciated.
> > > > > > 
> > > > > > I'll take a look at this.
> > > > > 
> > > > > I've tried running various things most of the day, and I cannot
> > > > > reproduce this issue nor do I see what it could be. Even if requests are
> > > > > split between batched completion and one-by-one completion, it works
> > > > > just fine for me. No special care needs to be taken for put_many() on
> > > > > the queue reference, as the wake_up() happens for the ref going to zero.
> > > > > 
> > > > > Tell me more about your setup. What does the runtimes of the test look
> > > > > like? Do you have all schedulers enabled? What kind of NVMe device is
> > > > > this?
> > > > 
> > > > Thank you for spending your precious time. With the kernel without the hang,
> > > > the test case completes around 20 seconds. When the hang happens, the check
> > > > script process stops at blk_mq_freeze_queue_wait() at scheduler change, and fio
> > > > workload processes stop at __blkdev_direct_IO_simple(). The test case does not
> > > > end, so I need to reboot the system for the next trial. While waiting the test
> > > > case completion, the kernel repeats the same INFO message every 2 minutes.
> > > > 
> > > > Regarding the scheduler, I compiled the kernel with mq-deadline and kyber.
> > > > 
> > > > The NVMe device I use is a U.2 NVMe ZNS SSD. It has a zoned name space and
> > > > a regular name space, and the hang is observed with both name spaces. I have
> > > > not yet tried other NVME devices, so I will try them.
> > > > 
> > > > > 
> > > > > FWIW, this is upstream now, so testing with Linus -git would be
> > > > > preferable.
> > > > 
> > > > I see. I have switched from linux-block for-next branch to the upstream branch
> > > > of Linus. At git hash 879dbe9ffebc, and still the hang is observed.
> > > 
> > > Can you post the blk-mq debugfs log after the hang is triggered?
> > > 
> > > (cd /sys/kernel/debug/block/nvme0n1 && find . -type f -exec grep -aH . {} \;)
> > 
> > Thanks Ming. When I ran the command above, the grep command stopped when it
> > opened tag related files in the debugfs tree. That grep command looked hanking
> > also. So I used the find command below instead to exclude the tag related files.
> > 
> > # find . -type f -not -name *tag* -exec grep -aH . {} \;
> > 
> > Here I share the captured log.
> > 

It is a bit odd, since batched completion shouldn't be triggered when an IO
scheduler is in use, but blk_mq_end_request_batch() does not restart the hctx,
so maybe you can try the following patch:


diff --git a/block/blk-mq.c b/block/blk-mq.c
index 07eb1412760b..4c0c9af9235e 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -846,16 +846,20 @@ void blk_mq_end_request_batch(struct io_comp_batch *iob)
 		rq_qos_done(rq->q, rq);
 
 		if (nr_tags == TAG_COMP_BATCH || cur_hctx != rq->mq_hctx) {
-			if (cur_hctx)
+			if (cur_hctx) {
 				blk_mq_flush_tag_batch(cur_hctx, tags, nr_tags);
+				blk_mq_sched_restart(cur_hctx);
+			}
 			nr_tags = 0;
 			cur_hctx = rq->mq_hctx;
 		}
 		tags[nr_tags++] = rq->tag;
 	}
 
-	if (nr_tags)
+	if (nr_tags) {
 		blk_mq_flush_tag_batch(cur_hctx, tags, nr_tags);
+		blk_mq_sched_restart(cur_hctx);
+	}
 }
 EXPORT_SYMBOL_GPL(blk_mq_end_request_batch);
 


-- 
Ming


* Re: [bug report] block/005 hangs with NVMe device and linux-block/for-next
  2021-11-02 10:48             ` Ming Lei
@ 2021-11-02 11:24               ` Shinichiro Kawasaki
  2021-11-02 12:26                 ` Ming Lei
  0 siblings, 1 reply; 13+ messages in thread
From: Shinichiro Kawasaki @ 2021-11-02 11:24 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Damien Le Moal, linux-nvme, Keith Busch,
	Christoph Hellwig

On Nov 02, 2021 / 18:48, Ming Lei wrote:
> On Tue, Nov 02, 2021 at 09:02:47AM +0000, Shinichiro Kawasaki wrote:
> > Let me add linux-nvme, Keith and Christoph to the CC list.
> > 
> > -- 
> > Best Regards,
> > Shin'ichiro Kawasaki
> > 
> > 
> > On Nov 02, 2021 / 17:28, Shin'ichiro Kawasaki wrote:
> > > On Nov 02, 2021 / 11:44, Ming Lei wrote:
> > > > On Tue, Nov 02, 2021 at 02:22:15AM +0000, Shinichiro Kawasaki wrote:
> > > > > On Nov 01, 2021 / 17:01, Jens Axboe wrote:
> > > > > > On 11/1/21 6:41 AM, Jens Axboe wrote:
> > > > > > > On 11/1/21 2:34 AM, Shinichiro Kawasaki wrote:
> > > > > > >> I tried the latest linux-block/for-next branch tip (git hash b43fadb6631f and
> > > > > > >> observed a process hang during blktests block/005 run on a NVMe device.
> > > > > > >> Kernel message reported "INFO: task check:1224 blocked for more than 122
> > > > > > >> seconds." with call trace [1]. So far, the hang is 100% reproducible with my
> > > > > > >> system. This hang is not observed with HDDs or null_blk devices.
> > > > > > >>
> > > > > > >> I bisected and found the commit 4f5022453acd ("nvme: wire up completion batching
> > > > > > >> for the IRQ path") triggers the hang. When I revert this commit from the
> > > > > > >> for-next branch tip, the hang disappears. The block/005 test case does IO
> > > > > > >> scheduler switch during IO, and the completion path change by the commit looks
> > > > > > >> affecting the scheduler switch. Comments for solution will be appreciated.
> > > > > > > 
> > > > > > > I'll take a look at this.
> > > > > > 
> > > > > > I've tried running various things most of the day, and I cannot
> > > > > > reproduce this issue nor do I see what it could be. Even if requests are
> > > > > > split between batched completion and one-by-one completion, it works
> > > > > > just fine for me. No special care needs to be taken for put_many() on
> > > > > > the queue reference, as the wake_up() happens for the ref going to zero.
> > > > > > 
> > > > > > Tell me more about your setup. What does the runtimes of the test look
> > > > > > like? Do you have all schedulers enabled? What kind of NVMe device is
> > > > > > this?
> > > > > 
> > > > > Thank you for spending your precious time. With the kernel without the hang,
> > > > > the test case completes around 20 seconds. When the hang happens, the check
> > > > > script process stops at blk_mq_freeze_queue_wait() at scheduler change, and fio
> > > > > workload processes stop at __blkdev_direct_IO_simple(). The test case does not
> > > > > end, so I need to reboot the system for the next trial. While waiting the test
> > > > > case completion, the kernel repeats the same INFO message every 2 minutes.
> > > > > 
> > > > > Regarding the scheduler, I compiled the kernel with mq-deadline and kyber.
> > > > > 
> > > > > The NVMe device I use is a U.2 NVMe ZNS SSD. It has a zoned name space and
> > > > > a regular name space, and the hang is observed with both name spaces. I have
> > > > > not yet tried other NVME devices, so I will try them.
> > > > > 
> > > > > > 
> > > > > > FWIW, this is upstream now, so testing with Linus -git would be
> > > > > > preferable.
> > > > > 
> > > > > I see. I have switched from linux-block for-next branch to the upstream branch
> > > > > of Linus. At git hash 879dbe9ffebc, and still the hang is observed.
> > > > 
> > > > Can you post the blk-mq debugfs log after the hang is triggered?
> > > > 
> > > > (cd /sys/kernel/debug/block/nvme0n1 && find . -type f -exec grep -aH . {} \;)
> > > 
> > > Thanks Ming. When I ran the command above, the grep command stopped when it
> > > opened tag related files in the debugfs tree. That grep command looked hanking
> > > also. So I used the find command below instead to exclude the tag related files.
> > > 
> > > # find . -type f -not -name *tag* -exec grep -aH . {} \;
> > > 
> > > Here I share the captured log.
> > > 
> 
> It is a bit odd since batching completion shouldn't be triggered in case
> of io sched, but blk_mq_end_request_batch() does not restart hctx, so
> maybe you can try the following patch:
> 
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 07eb1412760b..4c0c9af9235e 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -846,16 +846,20 @@ void blk_mq_end_request_batch(struct io_comp_batch *iob)
>  		rq_qos_done(rq->q, rq);
>  
>  		if (nr_tags == TAG_COMP_BATCH || cur_hctx != rq->mq_hctx) {
> -			if (cur_hctx)
> +			if (cur_hctx) {
>  				blk_mq_flush_tag_batch(cur_hctx, tags, nr_tags);
> +				blk_mq_sched_restart(cur_hctx);
> +			}
>  			nr_tags = 0;
>  			cur_hctx = rq->mq_hctx;
>  		}
>  		tags[nr_tags++] = rq->tag;
>  	}
>  
> -	if (nr_tags)
> +	if (nr_tags) {
>  		blk_mq_flush_tag_batch(cur_hctx, tags, nr_tags);
> +		blk_mq_sched_restart(cur_hctx);
> +	}
>  }
>  EXPORT_SYMBOL_GPL(blk_mq_end_request_batch);

Ming, thank you, but unfortunately I still observe the hang with the patch
above.

-- 
Best Regards,
Shin'ichiro Kawasaki

* Re: [bug report] block/005 hangs with NVMe device and linux-block/for-next
  2021-11-02 11:24               ` Shinichiro Kawasaki
@ 2021-11-02 12:26                 ` Ming Lei
  0 siblings, 0 replies; 13+ messages in thread
From: Ming Lei @ 2021-11-02 12:26 UTC (permalink / raw)
  To: Shinichiro Kawasaki
  Cc: Jens Axboe, linux-block, Damien Le Moal, linux-nvme, Keith Busch,
	Christoph Hellwig, ming.lei

On Tue, Nov 02, 2021 at 11:24:06AM +0000, Shinichiro Kawasaki wrote:
> On Nov 02, 2021 / 18:48, Ming Lei wrote:
> > On Tue, Nov 02, 2021 at 09:02:47AM +0000, Shinichiro Kawasaki wrote:
> > > Let me add linux-nvme, Keith and Christoph to the CC list.
> > > 
> > > -- 
> > > Best Regards,
> > > Shin'ichiro Kawasaki
> > > 
> > > 
> > > On Nov 02, 2021 / 17:28, Shin'ichiro Kawasaki wrote:
> > > > On Nov 02, 2021 / 11:44, Ming Lei wrote:
> > > > > On Tue, Nov 02, 2021 at 02:22:15AM +0000, Shinichiro Kawasaki wrote:
> > > > > > On Nov 01, 2021 / 17:01, Jens Axboe wrote:
> > > > > > > On 11/1/21 6:41 AM, Jens Axboe wrote:
> > > > > > > > On 11/1/21 2:34 AM, Shinichiro Kawasaki wrote:
> > > > > > > >> I tried the latest linux-block/for-next branch tip (git hash b43fadb6631f and
> > > > > > > >> observed a process hang during blktests block/005 run on a NVMe device.
> > > > > > > >> Kernel message reported "INFO: task check:1224 blocked for more than 122
> > > > > > > >> seconds." with call trace [1]. So far, the hang is 100% reproducible with my
> > > > > > > >> system. This hang is not observed with HDDs or null_blk devices.
> > > > > > > >>
> > > > > > > >> I bisected and found the commit 4f5022453acd ("nvme: wire up completion batching
> > > > > > > >> for the IRQ path") triggers the hang. When I revert this commit from the
> > > > > > > >> for-next branch tip, the hang disappears. The block/005 test case does IO
> > > > > > > >> scheduler switch during IO, and the completion path change by the commit looks
> > > > > > > >> affecting the scheduler switch. Comments for solution will be appreciated.
> > > > > > > > 
> > > > > > > > I'll take a look at this.
> > > > > > > 
> > > > > > > I've tried running various things most of the day, and I cannot
> > > > > > > reproduce this issue nor do I see what it could be. Even if requests are
> > > > > > > split between batched completion and one-by-one completion, it works
> > > > > > > just fine for me. No special care needs to be taken for put_many() on
> > > > > > > the queue reference, as the wake_up() happens for the ref going to zero.
> > > > > > > 
> > > > > > > Tell me more about your setup. What does the runtimes of the test look
> > > > > > > like? Do you have all schedulers enabled? What kind of NVMe device is
> > > > > > > this?
> > > > > > 
> > > > > > Thank you for spending your precious time. With the kernel without the hang,
> > > > > > the test case completes around 20 seconds. When the hang happens, the check
> > > > > > script process stops at blk_mq_freeze_queue_wait() at scheduler change, and fio
> > > > > > workload processes stop at __blkdev_direct_IO_simple(). The test case does not
> > > > > > end, so I need to reboot the system for the next trial. While waiting the test
> > > > > > case completion, the kernel repeats the same INFO message every 2 minutes.
> > > > > > 
> > > > > > Regarding the scheduler, I compiled the kernel with mq-deadline and kyber.
> > > > > > 
> > > > > > The NVMe device I use is a U.2 NVMe ZNS SSD. It has a zoned name space and
> > > > > > a regular name space, and the hang is observed with both name spaces. I have
> > > > > > not yet tried other NVME devices, so I will try them.
> > > > > > 
> > > > > > > 
> > > > > > > FWIW, this is upstream now, so testing with Linus -git would be
> > > > > > > preferable.
> > > > > > 
> > > > > > I see. I have switched from linux-block for-next branch to the upstream branch
> > > > > > of Linus. At git hash 879dbe9ffebc, and still the hang is observed.
> > > > > 
> > > > > Can you post the blk-mq debugfs log after the hang is triggered?
> > > > > 
> > > > > (cd /sys/kernel/debug/block/nvme0n1 && find . -type f -exec grep -aH . {} \;)
> > > > 
> > > > Thanks Ming. When I ran the command above, the grep command stopped when it
> > > > opened tag related files in the debugfs tree. That grep command looked hanking
> > > > also. So I used the find command below instead to exclude the tag related files.
> > > > 
> > > > # find . -type f -not -name *tag* -exec grep -aH . {} \;
> > > > 
> > > > Here I share the captured log.
> > > > 
> > 
> > It is a bit odd since batching completion shouldn't be triggered in case
> > of io sched, but blk_mq_end_request_batch() does not restart hctx, so
> > maybe you can try the following patch:
> > 
> > 
> > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > index 07eb1412760b..4c0c9af9235e 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -846,16 +846,20 @@ void blk_mq_end_request_batch(struct io_comp_batch *iob)
> >  		rq_qos_done(rq->q, rq);
> >  
> >  		if (nr_tags == TAG_COMP_BATCH || cur_hctx != rq->mq_hctx) {
> > -			if (cur_hctx)
> > +			if (cur_hctx) {
> >  				blk_mq_flush_tag_batch(cur_hctx, tags, nr_tags);
> > +				blk_mq_sched_restart(cur_hctx);
> > +			}
> >  			nr_tags = 0;
> >  			cur_hctx = rq->mq_hctx;
> >  		}
> >  		tags[nr_tags++] = rq->tag;
> >  	}
> >  
> > -	if (nr_tags)
> > +	if (nr_tags) {
> >  		blk_mq_flush_tag_batch(cur_hctx, tags, nr_tags);
> > +		blk_mq_sched_restart(cur_hctx);
> > +	}
> >  }
> >  EXPORT_SYMBOL_GPL(blk_mq_end_request_batch);
> 
> Ming, thank you, but unfortunately, still I observe the hang with the patch
> above.

OK, thanks for your test. Care to try the following patch?


diff --git a/block/blk-mq.c b/block/blk-mq.c
index 888c43aff1df..ef767d4e65da 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -837,6 +837,9 @@ void blk_mq_end_request_batch(struct io_comp_batch *iob)
 		if (iob->need_ts)
 			__blk_mq_end_request_acct(rq, now);
 
+		if (rq->rq_flags & RQF_MQ_INFLIGHT)
+			__blk_mq_dec_active_requests(rq->mq_hctx);
+
 		WRITE_ONCE(rq->state, MQ_RQ_IDLE);
 		if (!refcount_dec_and_test(&rq->ref))
 			continue;

Thanks, 
Ming

