Linux-RDMA Archive on lore.kernel.org
 help / color / Atom feed
* Hang at NVME Host caused by Controller reset
@ 2020-07-27 18:19 Krishnamraju Eraparaju
  2020-07-27 18:47 ` Sagi Grimberg
  0 siblings, 1 reply; 8+ messages in thread
From: Krishnamraju Eraparaju @ 2020-07-27 18:19 UTC (permalink / raw)
  To: linux-nvme; +Cc: linux-rdma


kernel hang observed on NVME Host(TCP) while running iozone with link
toggle:

    
[ +42.773018] INFO: task kworker/u24:5:1243 blocked for more than 122
seconds.
[  +0.000124]       Not tainted 5.8.0-rc4ekr+ #19
[  +0.000105] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[  +0.000164] kworker/u24:5   D12600  1243      2 0x00004000
[  +0.000114] Workqueue: nvme-reset-wq nvme_reset_ctrl_work [nvme_tcp]
[  +0.000109] Call Trace:
[  +0.000105]  __schedule+0x270/0x5d0
[  +0.000105]  schedule+0x45/0xb0
[  +0.000125]  blk_mq_freeze_queue_wait+0x41/0xa0
[  +0.000122]  ? wait_woken+0x80/0x80
[  +0.000116]  blk_mq_update_nr_hw_queues+0x8a/0x380
[  +0.000109]  nvme_tcp_setup_ctrl+0x345/0x510 [nvme_tcp]
[  +0.000108]  nvme_reset_ctrl_work+0x45/0x60 [nvme_tcp]
[  +0.000135]  process_one_work+0x149/0x380
[  +0.000107]  worker_thread+0x1ae/0x3a0
[  +0.000107]  ? process_one_work+0x380/0x380
[  +0.000108]  kthread+0xf7/0x130
[  +0.000135]  ? kthread_bind+0x10/0x10
[  +0.000121]  ret_from_fork+0x22/0x30
[  +0.000134] INFO: task bash:6000 blocked for more than 122 seconds.
[  +0.000122]       Not tainted 5.8.0-rc4ekr+ #19
[  +0.000109] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[  +0.000195] bash            D14232  6000   5967 0x00000080
[  +0.000115] Call Trace:
[  +0.000106]  __schedule+0x270/0x5d0
[  +0.000138]  ? terminate_walk+0x8a/0x90
[  +0.000123]  schedule+0x45/0xb0
[  +0.000108]  schedule_timeout+0x1d6/0x290
[  +0.000121]  wait_for_completion+0x82/0xe0
[  +0.000120]  __flush_work.isra.37+0x10c/0x180
[  +0.000115]  ? flush_workqueue_prep_pwqs+0x110/0x110
[  +0.000119]  nvme_reset_ctrl_sync+0x1c/0x30 [nvme_core]
[  +0.000110]  nvme_sysfs_reset+0xd/0x20 [nvme_core]
[  +0.000137]  kernfs_fop_write+0x10a/0x1a0
[  +0.000124]  vfs_write+0xa8/0x1a0
[  +0.000122]  ksys_write+0x50/0xc0
[  +0.000117]  do_syscall_64+0x3e/0x70
[  +0.000108]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  +0.000111] RIP: 0033:0x7f4ed689dc60
[  +0.000107] Code: Bad RIP value.
[  +0.000105] RSP: 002b:00007ffe636b6fe8 EFLAGS: 00000246 ORIG_RAX:
0000000000000001
[  +0.000188] RAX: ffffffffffffffda RBX: 0000000000000002 RCX:
00007f4ed689dc60
----

This issue got uncovered after commit fe35ec58f0d3, which does
freeze-queue operation if set->nr_maps is greater than '1'(all nvmef
trasnports sets nr_maps to '2' by default). 
  
Issue will not occur with multipath enabled.
Issue observed with RDMA transports also.

Steps to reproduce: 
nvme connect -t tcp -a 102.1.1.6 -s 4420 -n nvme-ram0 -i 1

Run below each while loop in different terminals parallelly, to reprodue
instantaneously.
while [ 1 ]; do echo 1 > /sys/block/nvme0n1/device/reset_controller;
done
while [ 1 ]; do  nvme write-zeroes /dev/nvme0n1 -s 1 -c 1; done



My understanding is: while performing reset-controller, nvme-write task
tries to submit IO/blk_queue_enter, but fails at blk_mq_run_hw_queue()
after seeing blk_queue_quiesced.
And never succeeded to blk_queue_exit, may be due to out-of-sync percpu
counter operations(q_usage_counter), causing this hang at freeze_queue.

Thanks,
Krishna.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Hang at NVME Host caused by Controller reset
  2020-07-27 18:19 Hang at NVME Host caused by Controller reset Krishnamraju Eraparaju
@ 2020-07-27 18:47 ` Sagi Grimberg
  2020-07-28 11:59   ` Krishnamraju Eraparaju
  0 siblings, 1 reply; 8+ messages in thread
From: Sagi Grimberg @ 2020-07-27 18:47 UTC (permalink / raw)
  To: Krishnamraju Eraparaju, linux-nvme; +Cc: linux-rdma


> kernel hang observed on NVME Host(TCP) while running iozone with link
> toggle:

Hi Krishnamraju,

Good timing, please see my patch:
[PATCH 1/2] nvme-tcp: fix controller reset hang during traffic

It will be great to get your tested-by.

> 
>      
> [ +42.773018] INFO: task kworker/u24:5:1243 blocked for more than 122
> seconds.
> [  +0.000124]       Not tainted 5.8.0-rc4ekr+ #19
> [  +0.000105] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [  +0.000164] kworker/u24:5   D12600  1243      2 0x00004000
> [  +0.000114] Workqueue: nvme-reset-wq nvme_reset_ctrl_work [nvme_tcp]
> [  +0.000109] Call Trace:
> [  +0.000105]  __schedule+0x270/0x5d0
> [  +0.000105]  schedule+0x45/0xb0
> [  +0.000125]  blk_mq_freeze_queue_wait+0x41/0xa0
> [  +0.000122]  ? wait_woken+0x80/0x80
> [  +0.000116]  blk_mq_update_nr_hw_queues+0x8a/0x380
> [  +0.000109]  nvme_tcp_setup_ctrl+0x345/0x510 [nvme_tcp]
> [  +0.000108]  nvme_reset_ctrl_work+0x45/0x60 [nvme_tcp]
> [  +0.000135]  process_one_work+0x149/0x380
> [  +0.000107]  worker_thread+0x1ae/0x3a0
> [  +0.000107]  ? process_one_work+0x380/0x380
> [  +0.000108]  kthread+0xf7/0x130
> [  +0.000135]  ? kthread_bind+0x10/0x10
> [  +0.000121]  ret_from_fork+0x22/0x30
> [  +0.000134] INFO: task bash:6000 blocked for more than 122 seconds.
> [  +0.000122]       Not tainted 5.8.0-rc4ekr+ #19
> [  +0.000109] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [  +0.000195] bash            D14232  6000   5967 0x00000080
> [  +0.000115] Call Trace:
> [  +0.000106]  __schedule+0x270/0x5d0
> [  +0.000138]  ? terminate_walk+0x8a/0x90
> [  +0.000123]  schedule+0x45/0xb0
> [  +0.000108]  schedule_timeout+0x1d6/0x290
> [  +0.000121]  wait_for_completion+0x82/0xe0
> [  +0.000120]  __flush_work.isra.37+0x10c/0x180
> [  +0.000115]  ? flush_workqueue_prep_pwqs+0x110/0x110
> [  +0.000119]  nvme_reset_ctrl_sync+0x1c/0x30 [nvme_core]
> [  +0.000110]  nvme_sysfs_reset+0xd/0x20 [nvme_core]
> [  +0.000137]  kernfs_fop_write+0x10a/0x1a0
> [  +0.000124]  vfs_write+0xa8/0x1a0
> [  +0.000122]  ksys_write+0x50/0xc0
> [  +0.000117]  do_syscall_64+0x3e/0x70
> [  +0.000108]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [  +0.000111] RIP: 0033:0x7f4ed689dc60
> [  +0.000107] Code: Bad RIP value.
> [  +0.000105] RSP: 002b:00007ffe636b6fe8 EFLAGS: 00000246 ORIG_RAX:
> 0000000000000001
> [  +0.000188] RAX: ffffffffffffffda RBX: 0000000000000002 RCX:
> 00007f4ed689dc60
> ----
> 
> This issue got uncovered after commit fe35ec58f0d3, which does
> freeze-queue operation if set->nr_maps is greater than '1'(all nvmef
> trasnports sets nr_maps to '2' by default).
>    
> Issue will not occur with multipath enabled.
> Issue observed with RDMA transports also.
> 
> Steps to reproduce:
> nvme connect -t tcp -a 102.1.1.6 -s 4420 -n nvme-ram0 -i 1
> 
> Run below each while loop in different terminals parallelly, to reprodue
> instantaneously.
> while [ 1 ]; do echo 1 > /sys/block/nvme0n1/device/reset_controller;
> done
> while [ 1 ]; do  nvme write-zeroes /dev/nvme0n1 -s 1 -c 1; done
> 
> 
> 
> My understanding is: while performing reset-controller, nvme-write task
> tries to submit IO/blk_queue_enter, but fails at blk_mq_run_hw_queue()
> after seeing blk_queue_quiesced.
> And never succeeded to blk_queue_exit, may be due to out-of-sync percpu
> counter operations(q_usage_counter), causing this hang at freeze_queue.
> 
> Thanks,
> Krishna.
> 

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Hang at NVME Host caused by Controller reset
  2020-07-27 18:47 ` Sagi Grimberg
@ 2020-07-28 11:59   ` Krishnamraju Eraparaju
  2020-07-28 15:54     ` Sagi Grimberg
  0 siblings, 1 reply; 8+ messages in thread
From: Krishnamraju Eraparaju @ 2020-07-28 11:59 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: linux-nvme, linux-rdma, bharat

Sagi,
With the given patch, I am no more seeing the freeze_queue_wait hang
issue, but I am seeing another hang issue:
dmesg:
[Jul28 11:01] igb 0000:03:00.0 enp3s0f0: igb: enp3s0f0 NIC Link is Up
1000 Mbps Full Duplex, Flow Control: RX
[  +0.000137] IPv6: ADDRCONF(NETDEV_CHANGE): enp3s0f0: link becomes
ready
[Jul28 11:17] cxgb4 0000:02:00.4 enp2s0f4: passive DA module inserted
[  +0.579450] cxgb4 0000:02:00.4 enp2s0f4: link up, 40Gbps, full-duplex,
Tx/Rx PAUSE
[  +0.000683] IPv6: ADDRCONF(NETDEV_CHANGE): enp2s0f4: link becomes
ready
[Jul28 11:19] nvme nvme0: Please enable CONFIG_NVME_MULTIPATH for full
support of multi-port devices.
[  +0.000159] nvme nvme0: creating 1 I/O queues.
[  +0.000350] nvme nvme0: mapped 1/0/0 default/read/poll queues.
[  +0.001316] nvme nvme0: new ctrl: NQN "nvme-ram0", addr 102.1.1.6:4420
[Jul28 11:20] DEBUG: cpu: 3: blk_queue_enter:448 process is "nvme" (pid
4011)
	q->mq_freeze_depth: 1
	(pm || (blk_pm_request_resume(q),!blk_queue_pm_only(q)))): 1
	blk_queue_dying(q): 0            
[ +21.511514] cxgb4 0000:02:00.4: Port 0 link down, reason: Link Down
[  +0.560355] cxgb4 0000:02:00.4 enp2s0f4: link up, 40Gbps, full-duplex,
Tx/Rx PAUSE
[  +0.000941] IPv6: ADDRCONF(NETDEV_CHANGE): enp2s0f4: link becomes
ready
[Jul28 11:21] cxgb4 0000:02:00.4: Port 0 link down, reason: Link Down
[  +0.552934] cxgb4 0000:02:00.4 enp2s0f4: link up, 40Gbps, full-duplex,
Tx/Rx PAUSE
[  +0.001076] IPv6: ADDRCONF(NETDEV_CHANGE): enp2s0f4: link becomes
ready
[Jul28 11:22] cxgb4 0000:02:00.4: Port 0 link down, reason: Link Down
[  +0.615365] cxgb4 0000:02:00.4 enp2s0f4: link up, 40Gbps, full-duplex,
Tx/Rx PAUSE
[  +0.000886] IPv6: ADDRCONF(NETDEV_CHANGE): enp2s0f4: link becomes
ready
[Jul28 11:23] cxgb4 0000:02:00.4: Port 0 link down, reason: Link Down
[  +0.556661] cxgb4 0000:02:00.4 enp2s0f4: link up, 40Gbps, full-duplex,
Tx/Rx PAUSE
[  +0.000837] IPv6: ADDRCONF(NETDEV_CHANGE): enp2s0f4: link becomes
ready
[  +3.765550] INFO: task bash:3014 blocked for more than 122 seconds.
[  +0.000067]       Not tainted 5.8.0-rc7ekr+ #2
[  +0.000057] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[  +0.000064] bash            D14272  3014   2417 0x00000000
[  +0.000066] Call Trace:
[  +0.000064]  __schedule+0x32b/0x670
[  +0.000060]  schedule+0x45/0xb0
[  +0.000059]  schedule_timeout+0x216/0x330
[  +0.000060]  ? enqueue_task_fair+0x196/0x7e0
[  +0.000059]  wait_for_completion+0x81/0xe0
[  +0.000061]  __flush_work+0x114/0x1c0
[  +0.000058]  ? flush_workqueue_prep_pwqs+0x130/0x130
[  +0.000066]  nvme_reset_ctrl_sync+0x25/0x40 [nvme_core]
[  +0.000125]  nvme_sysfs_reset+0xd/0x20 [nvme_core]
[  +0.000137]  kernfs_fop_write+0xbc/0x1a0
[  +0.000114]  vfs_write+0xc2/0x1f0
[  +0.000120]  ksys_write+0x5a/0xd0
[  +0.000106]  do_syscall_64+0x3e/0x70
[  +0.000122]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  +0.000115] RIP: 0033:0x7f8124b93317
[  +0.000110] Code: Bad RIP value.
[  +0.000109] RSP: 002b:00007ffdbbbff1c8 EFLAGS: 00000246 ORIG_RAX:
0000000000000001
[  +0.000182] RAX: ffffffffffffffda RBX: 0000000000000002 RCX:
00007f8124b93317
[  +0.000138] RDX: 0000000000000002 RSI: 0000559345c156d0 RDI:
0000000000000001
[  +0.000125] RBP: 0000559345c156d0 R08: 000000000000000a R09:
0000000000000001
[  +0.000117] R10: 00005593453d1471 R11: 0000000000000246 R12:
0000000000000002
[  +0.000116] R13: 00007f8124c6d6a0 R14: 00007f8124c6e4a0 R15:
00007f8124c6d8a0
[  +0.000121] INFO: task nvme:4011 blocked for more than 122 seconds.
[  +0.000118]       Not tainted 5.8.0-rc7ekr+ #2
[  +0.000114] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[  +0.000190] nvme            D14392  4011   2326 0x00004000
[  +0.000132] Call Trace:
[  +0.000117]  __schedule+0x32b/0x670
[  +0.000109]  schedule+0x45/0xb0
[  +0.000108]  blk_queue_enter+0x1e9/0x250
[  +0.000109]  ? wait_woken+0x70/0x70
[  +0.000108]  blk_mq_alloc_request+0x53/0xc0
[  +0.000112]  nvme_alloc_request+0x61/0x70 [nvme_core]
[  +0.000118]  nvme_submit_user_cmd+0x50/0x310 [nvme_core]
[  +0.000126]  nvme_user_cmd+0x12e/0x1c0 [nvme_core]
[  +0.000124]  ? _copy_to_user+0x22/0x30
[  +0.000108]  blkdev_ioctl+0x100/0x250
[  +0.000109]  block_ioctl+0x34/0x40
[  +0.000110]  ksys_ioctl+0x82/0xc0
[  +0.000106]  __x64_sys_ioctl+0x11/0x20
[  +0.000126]  do_syscall_64+0x3e/0x70
[  +0.000113]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  +0.000132] RIP: 0033:0x7fed0bd2967b
[  +0.000134] Code: Bad RIP value.
[  +0.000107] RSP: 002b:00007fff55b568a8 EFLAGS: 00000246 ORIG_RAX:
0000000000000010
[  +0.000172] RAX: ffffffffffffffda RBX: 0000000000000003 RCX:
00007fed0bd2967b
[  +0.000112] RDX: 00007fff55b568b0 RSI: 00000000c0484e43 RDI:
0000000000000003
[  +0.000113] RBP: 0000000000000000 R08: 0000000000000001 R09:
0000000000000000
[  +0.000130] R10: 0000000000000000 R11: 0000000000000246 R12:
00007fff55b5878a
[  +0.000119] R13: 0000000000000006 R14: 00007fff55b56f60 R15:
00005595f54554a0
[  +0.000135] Kernel panic - not syncing: hung_task: blocked tasks
[  +0.000141] CPU: 8 PID: 520 Comm: khungtaskd Not tainted 5.8.0-rc7ekr+
#2



Testcase:
 while [ 1 ]; do  nvme write-zeroes /dev/nvme0n1 -s 1 -c 1; done
 while [ 1 ]; do echo 1 > /sys/block/nvme0n1/device/reset_controller;
done
 while [ 1 ]; do ifconfig enp2s0f4 down; sleep 24; ifconfig enp2s0f4 up;
sleep 28;  done



^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Hang at NVME Host caused by Controller reset
  2020-07-28 11:59   ` Krishnamraju Eraparaju
@ 2020-07-28 15:54     ` Sagi Grimberg
  2020-07-28 17:42       ` Krishnamraju Eraparaju
  0 siblings, 1 reply; 8+ messages in thread
From: Sagi Grimberg @ 2020-07-28 15:54 UTC (permalink / raw)
  To: Krishnamraju Eraparaju; +Cc: linux-nvme, linux-rdma, bharat



On 7/28/20 4:59 AM, Krishnamraju Eraparaju wrote:
> Sagi,
> With the given patch, I am no more seeing the freeze_queue_wait hang
> issue, but I am seeing another hang issue:

The trace suggest that you are not running with multipath right?

I think you need the patch:
[PATCH] nvme-fabrics: allow to queue requests for live queues

You can find it in linux-nvme

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Hang at NVME Host caused by Controller reset
  2020-07-28 15:54     ` Sagi Grimberg
@ 2020-07-28 17:42       ` Krishnamraju Eraparaju
  2020-07-28 18:35         ` Sagi Grimberg
  0 siblings, 1 reply; 8+ messages in thread
From: Krishnamraju Eraparaju @ 2020-07-28 17:42 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: linux-nvme, linux-rdma, bharat



Sagi,

Yes, Multipath is disabled.
This time, with "nvme-fabrics: allow to queue requests for live queues"
patch applied, I see hang only at blk_queue_enter():
[Jul28 17:25] INFO: task nvme:21119 blocked for more than 122 seconds.
[  +0.000061]       Not tainted 5.8.0-rc7ekr+ #2
[  +0.000052] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[  +0.000059] nvme            D14392 21119   2456 0x00004000
[  +0.000059] Call Trace:
[  +0.000110]  __schedule+0x32b/0x670
[  +0.000108]  schedule+0x45/0xb0
[  +0.000107]  blk_queue_enter+0x1e9/0x250
[  +0.000109]  ? wait_woken+0x70/0x70
[  +0.000110]  blk_mq_alloc_request+0x53/0xc0
[  +0.000111]  nvme_alloc_request+0x61/0x70 [nvme_core]
[  +0.000121]  nvme_submit_user_cmd+0x50/0x310 [nvme_core]
[  +0.000118]  nvme_user_cmd+0x12e/0x1c0 [nvme_core]
[  +0.000163]  ? _copy_to_user+0x22/0x30
[  +0.000113]  blkdev_ioctl+0x100/0x250
[  +0.000115]  block_ioctl+0x34/0x40
[  +0.000110]  ksys_ioctl+0x82/0xc0
[  +0.000109]  __x64_sys_ioctl+0x11/0x20
[  +0.000109]  do_syscall_64+0x3e/0x70
[  +0.000120]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  +0.000112] RIP: 0033:0x7fbe9cdbb67b
[  +0.000110] Code: Bad RIP value.
[  +0.000124] RSP: 002b:00007ffd61ff5778 EFLAGS: 00000246 ORIG_RAX:
0000000000000010
[  +0.000170] RAX: ffffffffffffffda RBX: 0000000000000003 RCX:
00007fbe9cdbb67b
[  +0.000114] RDX: 00007ffd61ff5780 RSI: 00000000c0484e43 RDI:
0000000000000003
[  +0.000113] RBP: 0000000000000000 R08: 0000000000000001 R09:
0000000000000000
[  +0.000115] R10: 0000000000000000 R11: 0000000000000246 R12:
00007ffd61ff7219
[  +0.000123] R13: 0000000000000006 R14: 00007ffd61ff5e30 R15:
000055e09c1854a0
[  +0.000115] Kernel panic - not syncing: hung_task: blocked tasks

You could easily reproduce this by running below, parallelly, for 10min:
 while [ 1 ]; do  nvme write-zeroes /dev/nvme0n1 -s 1 -c 1; done
 while [ 1 ]; do echo 1 > /sys/block/nvme0n1/device/reset_controller;
done
 while [ 1 ]; do ifconfig enp2s0f4 down; sleep 24; ifconfig enp2s0f4 up;
sleep 28; done
 
 Not sure using nvme-write this way is valid or not..
 
 Thanks,
 Krishna.
On Tuesday, July 07/28/20, 2020 at 08:54:18 -0700, Sagi Grimberg wrote:
> 
> 
> On 7/28/20 4:59 AM, Krishnamraju Eraparaju wrote:
> >Sagi,
> >With the given patch, I am no more seeing the freeze_queue_wait hang
> >issue, but I am seeing another hang issue:
> 
> The trace suggest that you are not running with multipath right?
> 
> I think you need the patch:
> [PATCH] nvme-fabrics: allow to queue requests for live queues
> 
> You can find it in linux-nvme

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Hang at NVME Host caused by Controller reset
  2020-07-28 17:42       ` Krishnamraju Eraparaju
@ 2020-07-28 18:35         ` Sagi Grimberg
  2020-07-28 20:20           ` Sagi Grimberg
  0 siblings, 1 reply; 8+ messages in thread
From: Sagi Grimberg @ 2020-07-28 18:35 UTC (permalink / raw)
  To: Krishnamraju Eraparaju; +Cc: linux-nvme, linux-rdma, bharat

> Sagi,
> 
> Yes, Multipath is disabled.

Thanks.

> This time, with "nvme-fabrics: allow to queue requests for live queues"
> patch applied, I see hang only at blk_queue_enter():

Interesting, does the reset loop hang? or is it able to make forward
progress?

> [Jul28 17:25] INFO: task nvme:21119 blocked for more than 122 seconds.
> [  +0.000061]       Not tainted 5.8.0-rc7ekr+ #2
> [  +0.000052] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [  +0.000059] nvme            D14392 21119   2456 0x00004000
> [  +0.000059] Call Trace:
> [  +0.000110]  __schedule+0x32b/0x670
> [  +0.000108]  schedule+0x45/0xb0
> [  +0.000107]  blk_queue_enter+0x1e9/0x250
> [  +0.000109]  ? wait_woken+0x70/0x70
> [  +0.000110]  blk_mq_alloc_request+0x53/0xc0
> [  +0.000111]  nvme_alloc_request+0x61/0x70 [nvme_core]
> [  +0.000121]  nvme_submit_user_cmd+0x50/0x310 [nvme_core]
> [  +0.000118]  nvme_user_cmd+0x12e/0x1c0 [nvme_core]
> [  +0.000163]  ? _copy_to_user+0x22/0x30
> [  +0.000113]  blkdev_ioctl+0x100/0x250
> [  +0.000115]  block_ioctl+0x34/0x40
> [  +0.000110]  ksys_ioctl+0x82/0xc0
> [  +0.000109]  __x64_sys_ioctl+0x11/0x20
> [  +0.000109]  do_syscall_64+0x3e/0x70
> [  +0.000120]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [  +0.000112] RIP: 0033:0x7fbe9cdbb67b
> [  +0.000110] Code: Bad RIP value.
> [  +0.000124] RSP: 002b:00007ffd61ff5778 EFLAGS: 00000246 ORIG_RAX:
> 0000000000000010
> [  +0.000170] RAX: ffffffffffffffda RBX: 0000000000000003 RCX:
> 00007fbe9cdbb67b
> [  +0.000114] RDX: 00007ffd61ff5780 RSI: 00000000c0484e43 RDI:
> 0000000000000003
> [  +0.000113] RBP: 0000000000000000 R08: 0000000000000001 R09:
> 0000000000000000
> [  +0.000115] R10: 0000000000000000 R11: 0000000000000246 R12:
> 00007ffd61ff7219
> [  +0.000123] R13: 0000000000000006 R14: 00007ffd61ff5e30 R15:
> 000055e09c1854a0
> [  +0.000115] Kernel panic - not syncing: hung_task: blocked tasks

For some reason the ioctl is not woken up when unfreezing the queue...

> You could easily reproduce this by running below, parallelly, for 10min:
>   while [ 1 ]; do  nvme write-zeroes /dev/nvme0n1 -s 1 -c 1; done
>   while [ 1 ]; do echo 1 > /sys/block/nvme0n1/device/reset_controller;
> done
>   while [ 1 ]; do ifconfig enp2s0f4 down; sleep 24; ifconfig enp2s0f4 up;
> sleep 28; done
>   
>   Not sure using nvme-write this way is valid or not..

sure it is, its I/O just like fs I/O.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Hang at NVME Host caused by Controller reset
  2020-07-28 18:35         ` Sagi Grimberg
@ 2020-07-28 20:20           ` Sagi Grimberg
  2020-07-29  8:57             ` Krishnamraju Eraparaju
  0 siblings, 1 reply; 8+ messages in thread
From: Sagi Grimberg @ 2020-07-28 20:20 UTC (permalink / raw)
  To: Krishnamraju Eraparaju; +Cc: linux-nvme, linux-rdma, bharat


>> This time, with "nvme-fabrics: allow to queue requests for live queues"
>> patch applied, I see hang only at blk_queue_enter():
> 
> Interesting, does the reset loop hang? or is it able to make forward
> progress?

Looks like the freeze depth is messed up with the timeout handler.
We shouldn't call nvme_tcp_teardown_io_queues in the timeout handler
because it messes with the freeze depth, causing the unfreeze to not
wake the waiter (blk_queue_enter). We should simply stop the queue
and complete the I/O, and the condition was wrong too, because we
need to do it only for the connect command (which cannot reset the
timer). So we should check for reserved in the timeout handler.

Can you please try this patch?
--
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 62fbaecdc960..c3288dd2c92f 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -464,6 +464,7 @@ static void nvme_tcp_error_recovery(struct nvme_ctrl 
*ctrl)
         if (!nvme_change_ctrl_state(ctrl, NVME_CTRL_RESETTING))
                 return;

+       dev_warn(ctrl->device, "starting error recovery\n");
         queue_work(nvme_reset_wq, &to_tcp_ctrl(ctrl)->err_work);
  }

@@ -2156,33 +2157,37 @@ nvme_tcp_timeout(struct request *rq, bool reserved)
         struct nvme_tcp_ctrl *ctrl = req->queue->ctrl;
         struct nvme_tcp_cmd_pdu *pdu = req->pdu;

-       /*
-        * Restart the timer if a controller reset is already scheduled. Any
-        * timed out commands would be handled before entering the 
connecting
-        * state.
-        */
-       if (ctrl->ctrl.state == NVME_CTRL_RESETTING)
-               return BLK_EH_RESET_TIMER;
-
         dev_warn(ctrl->ctrl.device,
                 "queue %d: timeout request %#x type %d\n",
                 nvme_tcp_queue_id(req->queue), rq->tag, pdu->hdr.type);

-       if (ctrl->ctrl.state != NVME_CTRL_LIVE) {
+       switch (ctrl->ctrl.state) {
+       case NVME_CTRL_RESETTING:
                 /*
-                * Teardown immediately if controller times out while 
starting
-                * or we are already started error recovery. all outstanding
-                * requests are completed on shutdown, so we return 
BLK_EH_DONE.
+                * Restart the timer if a controller reset is already 
scheduled.
+                * Any timed out commands would be handled before 
entering the
+                * connecting state.
                  */
-               flush_work(&ctrl->err_work);
-               nvme_tcp_teardown_io_queues(&ctrl->ctrl, false);
-               nvme_tcp_teardown_admin_queue(&ctrl->ctrl, false);
-               return BLK_EH_DONE;
+               return BLK_EH_RESET_TIMER;
+       case NVME_CTRL_CONNECTING:
+               if (reserved) {
+                       /*
+                        * stop queue immediately if controller times 
out while connecting
+                        * or we are already started error recovery. all 
outstanding
+                        * requests are completed on shutdown, so we 
return BLK_EH_DONE.
+                        */
+                       nvme_tcp_stop_queue(&ctrl->ctrl, 
nvme_tcp_queue_id(req->queue));
+                       nvme_req(rq)->flags |= NVME_REQ_CANCELLED;
+                       nvme_req(rq)->status = NVME_SC_HOST_ABORTED_CMD;
+                       blk_mq_complete_request(rq);
+                       return BLK_EH_DONE;
+               }
+               /* fallthru */
+       default:
+       case NVME_CTRL_LIVE:
+               nvme_tcp_error_recovery(&ctrl->ctrl);
         }

-       dev_warn(ctrl->ctrl.device, "starting error recovery\n");
-       nvme_tcp_error_recovery(&ctrl->ctrl);
-
         return BLK_EH_RESET_TIMER;
  }
--

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Hang at NVME Host caused by Controller reset
  2020-07-28 20:20           ` Sagi Grimberg
@ 2020-07-29  8:57             ` Krishnamraju Eraparaju
  0 siblings, 0 replies; 8+ messages in thread
From: Krishnamraju Eraparaju @ 2020-07-29  8:57 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: linux-nvme, linux-rdma, bharat

On Tuesday, July 07/28/20, 2020 at 13:20:38 -0700, Sagi Grimberg wrote:
> 
> >>This time, with "nvme-fabrics: allow to queue requests for live queues"
> >>patch applied, I see hang only at blk_queue_enter():
> >
> >Interesting, does the reset loop hang? or is it able to make forward
> >progress?
> 
> Looks like the freeze depth is messed up with the timeout handler.
> We shouldn't call nvme_tcp_teardown_io_queues in the timeout handler
> because it messes with the freeze depth, causing the unfreeze to not
> wake the waiter (blk_queue_enter). We should simply stop the queue
> and complete the I/O, and the condition was wrong too, because we
> need to do it only for the connect command (which cannot reset the
> timer). So we should check for reserved in the timeout handler.
> 
> Can you please try this patch?
Even with this patch I see hangs, as shown below:
[  +0.000174] nvme nvme0: creating 1 I/O queues.
[  +0.001145] nvme nvme0: mapped 1/0/0 default/read/poll queues.
[ +24.026614] cxgb4 0000:02:00.4: Port 0 link down, reason: Link Down
[  +0.553164] cxgb4 0000:02:00.4 enp2s0f4: link up, 40Gbps, full-duplex,
Tx/Rx PAUSE
[  +0.000933] IPv6: ADDRCONF(NETDEV_CHANGE): enp2s0f4: link becomes
ready
[Jul29 07:17] nvme nvme0: queue 0: timeout request 0xe type 4
[ +13.816015] cxgb4 0000:02:00.4: Port 0 link down, reason: Link Down
[  +1.782975] cxgb4 0000:02:00.4 enp2s0f4: link up, 40Gbps, full-duplex,
Tx/Rx PAUSE
[  +0.000882] IPv6: ADDRCONF(NETDEV_CHANGE): enp2s0f4: link becomes
ready
[Jul29 07:18] nvme nvme0: queue 0: timeout request 0xe type 4
[  +4.413969] cxgb4 0000:02:00.4: Port 0 link down, reason: Link Down
[  +0.555863] cxgb4 0000:02:00.4 enp2s0f4: link up, 40Gbps, full-duplex,
Tx/Rx PAUSE
[  +0.001006] IPv6: ADDRCONF(NETDEV_CHANGE): enp2s0f4: link becomes
ready
[Jul29 07:19] cxgb4 0000:02:00.4: Port 0 link down, reason: Link Down
[  +0.617088] cxgb4 0000:02:00.4 enp2s0f4: link up, 40Gbps, full-duplex,
Tx/Rx PAUSE
[  +0.000864] IPv6: ADDRCONF(NETDEV_CHANGE): enp2s0f4: link becomes
ready
[  +4.369233] nvme nvme0: queue 0: timeout request 0xe type 4
[ +12.288452] INFO: task bash:2749 blocked for more than 122 seconds.
[  +0.000125]       Not tainted 5.8.0-rc7ekr+ #2
[  +0.000115] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[  +0.000181] bash            D13600  2749   2748 0x00000000
[  +0.000123] Call Trace:
[  +0.000122]  __schedule+0x32b/0x670
[  +0.000117]  schedule+0x45/0xb0
[  +0.000137]  schedule_timeout+0x216/0x330
[  +0.000157]  ? enqueue_task_fair+0x196/0x7e0
[  +0.000150]  wait_for_completion+0x81/0xe0
[  +0.000164]  __flush_work+0x114/0x1c0
[  +0.000121]  ? flush_workqueue_prep_pwqs+0x130/0x130
[  +0.000129]  nvme_reset_ctrl_sync+0x25/0x40 [nvme_core]
[  +0.000146]  nvme_sysfs_reset+0xd/0x20 [nvme_core]
[  +0.000133]  kernfs_fop_write+0xbc/0x1a0
[  +0.000122]  vfs_write+0xc2/0x1f0
[  +0.000115]  ksys_write+0x5a/0xd0
[  +0.000117]  do_syscall_64+0x3e/0x70
[  +0.000117]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  +0.000119] RIP: 0033:0x7f37701ca317
[  +0.000113] Code: Bad RIP value.
[  +0.000112] RSP: 002b:00007fff64bb25e8 EFLAGS: 00000246 ORIG_RAX:
0000000000000001
[  +0.000212] RAX: ffffffffffffffda RBX: 0000000000000002 RCX:
00007f37701ca317
[  +0.000132] RDX: 0000000000000002 RSI: 000055b0821a8140 RDI:
0000000000000001
[  +0.000121] RBP: 000055b0821a8140 R08: 000000000000000a R09:
0000000000000001
[  +0.000120] R10: 000055b081b4d471 R11: 0000000000000246 R12:
0000000000000002
[  +0.000148] R13: 00007f37702a46a0 R14: 00007f37702a54a0 R15:
00007f37702a48a0
[  +0.000124] INFO: task nvme:21284 blocked for more than 122 seconds.
[  +0.000119]       Not tainted 5.8.0-rc7ekr+ #2
[  +0.000115] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[  +0.000219] nvme            D14392 21284   2740 0x00004000
[  +0.000121] Call Trace:
[  +0.000123]  __schedule+0x32b/0x670
[  +0.000136]  schedule+0x45/0xb0
[  +0.000130]  blk_queue_enter+0x1e9/0x250
[  +0.000118]  ? wait_woken+0x70/0x70
[  +0.000117]  blk_mq_alloc_request+0x53/0xc0
[  +0.000142]  nvme_alloc_request+0x61/0x70 [nvme_core]
[  +0.000125]  nvme_submit_user_cmd+0x50/0x310 [nvme_core]
[  +0.000128]  nvme_user_cmd+0x12e/0x1c0 [nvme_core]
[  +0.000126]  ? _copy_to_user+0x22/0x30
[  +0.000118]  blkdev_ioctl+0x100/0x250
[  +0.000119]  block_ioctl+0x34/0x40
[  +0.000118]  ksys_ioctl+0x82/0xc0
[  +0.000117]  __x64_sys_ioctl+0x11/0x20
[  +0.000125]  do_syscall_64+0x3e/0x70
[  +0.000143]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  +0.000125] RIP: 0033:0x7f954ee8567b
[  +0.000134] Code: Bad RIP value.
[  +0.000138] RSP: 002b:00007ffcd9be20a8 EFLAGS: 00000246 ORIG_RAX:
0000000000000010
[  +0.000184] RAX: ffffffffffffffda RBX: 0000000000000003 RCX:
00007f954ee8567b
[  +0.000121] RDX: 00007ffcd9be20b0 RSI: 00000000c0484e43 RDI:
0000000000000003
[  +0.000120] RBP: 0000000000000000 R08: 0000000000000001 R09:
0000000000000000
[  +0.000151] R10: 0000000000000000 R11: 0000000000000246 R12:
00007ffcd9be3219
[  +0.000128] R13: 0000000000000006 R14: 00007ffcd9be2760 R15:
000055aad89024a0


> --
> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> index 62fbaecdc960..c3288dd2c92f 100644
> --- a/drivers/nvme/host/tcp.c
> +++ b/drivers/nvme/host/tcp.c
> @@ -464,6 +464,7 @@ static void nvme_tcp_error_recovery(struct
> nvme_ctrl *ctrl)
>         if (!nvme_change_ctrl_state(ctrl, NVME_CTRL_RESETTING))
>                 return;
> 
> +       dev_warn(ctrl->device, "starting error recovery\n");
>         queue_work(nvme_reset_wq, &to_tcp_ctrl(ctrl)->err_work);
>  }
> 
> @@ -2156,33 +2157,37 @@ nvme_tcp_timeout(struct request *rq, bool reserved)
>         struct nvme_tcp_ctrl *ctrl = req->queue->ctrl;
>         struct nvme_tcp_cmd_pdu *pdu = req->pdu;
> 
> -       /*
> -        * Restart the timer if a controller reset is already scheduled. Any
> -        * timed out commands would be handled before entering the
> connecting
> -        * state.
> -        */
> -       if (ctrl->ctrl.state == NVME_CTRL_RESETTING)
> -               return BLK_EH_RESET_TIMER;
> -
>         dev_warn(ctrl->ctrl.device,
>                 "queue %d: timeout request %#x type %d\n",
>                 nvme_tcp_queue_id(req->queue), rq->tag, pdu->hdr.type);
> 
> -       if (ctrl->ctrl.state != NVME_CTRL_LIVE) {
> +       switch (ctrl->ctrl.state) {
> +       case NVME_CTRL_RESETTING:
>                 /*
> -                * Teardown immediately if controller times out
> while starting
> -                * or we are already started error recovery. all outstanding
> -                * requests are completed on shutdown, so we return
> BLK_EH_DONE.
> +                * Restart the timer if a controller reset is
> already scheduled.
> +                * Any timed out commands would be handled before
> entering the
> +                * connecting state.
>                  */
> -               flush_work(&ctrl->err_work);
> -               nvme_tcp_teardown_io_queues(&ctrl->ctrl, false);
> -               nvme_tcp_teardown_admin_queue(&ctrl->ctrl, false);
> -               return BLK_EH_DONE;
> +               return BLK_EH_RESET_TIMER;
> +       case NVME_CTRL_CONNECTING:
> +               if (reserved) {
> +                       /*
> +                        * stop queue immediately if controller
> times out while connecting
> +                        * or we are already started error recovery.
> all outstanding
> +                        * requests are completed on shutdown, so we
> return BLK_EH_DONE.
> +                        */
> +                       nvme_tcp_stop_queue(&ctrl->ctrl,
> nvme_tcp_queue_id(req->queue));
> +                       nvme_req(rq)->flags |= NVME_REQ_CANCELLED;
> +                       nvme_req(rq)->status = NVME_SC_HOST_ABORTED_CMD;
> +                       blk_mq_complete_request(rq);
> +                       return BLK_EH_DONE;
> +               }
> +               /* fallthru */
> +       default:
> +       case NVME_CTRL_LIVE:
> +               nvme_tcp_error_recovery(&ctrl->ctrl);
>         }
> 
> -       dev_warn(ctrl->ctrl.device, "starting error recovery\n");
> -       nvme_tcp_error_recovery(&ctrl->ctrl);
> -
>         return BLK_EH_RESET_TIMER;
>  }
> --

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, back to index

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-07-27 18:19 Hang at NVME Host caused by Controller reset Krishnamraju Eraparaju
2020-07-27 18:47 ` Sagi Grimberg
2020-07-28 11:59   ` Krishnamraju Eraparaju
2020-07-28 15:54     ` Sagi Grimberg
2020-07-28 17:42       ` Krishnamraju Eraparaju
2020-07-28 18:35         ` Sagi Grimberg
2020-07-28 20:20           ` Sagi Grimberg
2020-07-29  8:57             ` Krishnamraju Eraparaju

Linux-RDMA Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-rdma/0 linux-rdma/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-rdma linux-rdma/ https://lore.kernel.org/linux-rdma \
		linux-rdma@vger.kernel.org
	public-inbox-index linux-rdma

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-rdma


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git