From mboxrd@z Thu Jan 1 00:00:00 1970 From: swise@opengridcomputing.com (Steve Wise) Date: Wed, 17 Aug 2016 09:33:48 -0500 Subject: nvme/rdma initiator stuck on reboot In-Reply-To: <2202d08c-2b4c-3bd9-6340-d630b8e2f8b5@grimberg.me> References: <043901d1f7f5$fb5f73c0$f21e5b40$@opengridcomputing.com> <2202d08c-2b4c-3bd9-6340-d630b8e2f8b5@grimberg.me> Message-ID: <073301d1f894$5ddb81d0$19928570$@opengridcomputing.com> > > Hey Sagi, > > > > Here is another issue I'm seeing doing reboot testing. The test does this: > > > > 1) connect 10 ram devices over iw_cxgb4 > > 2) reboot the target node > > 3) the initiator goes into recovery/reconnect mode > > 4) reboot the inititator at this point. > > > > The initiator gets stuck doing this continually and the system never reboots: > > > > [ 596.411842] nvme nvme1: Failed reconnect attempt, requeueing... > > [ 596.907865] nvme nvme9: rdma_resolve_addr wait failed (-104). > > [ 596.914461] nvme nvme9: Failed reconnect attempt, requeueing... > > [ 597.939935] nvme nvme10: rdma_resolve_addr wait failed (-104). > > [ 597.946625] nvme nvme10: Failed reconnect attempt, requeueing... > > [ 598.963995] nvme nvme2: rdma_resolve_addr wait failed (-110). > > [ 598.971968] nvme nvme2: Failed reconnect attempt, requeueing... > > [ 602.036135] nvme nvme3: rdma_resolve_addr wait failed (-104). > > [ 602.043797] nvme nvme3: Failed reconnect attempt, requeueing... > > [ 603.060171] nvme nvme4: rdma_resolve_addr wait failed (-104). > > [ 603.068153] nvme nvme4: Failed reconnect attempt, requeueing... > > [ 604.084223] nvme nvme5: rdma_resolve_addr wait failed (-104). > > [ 604.092191] nvme nvme5: Failed reconnect attempt, requeueing... > > [ 605.108294] nvme nvme6: rdma_resolve_addr wait failed (-104). > > [ 605.116251] nvme nvme6: Failed reconnect attempt, requeueing... > > > > Debugging now... > > Hmm... > > Does this reproduce also when you simply delete all the > controllers (via sysfs)? > Yes. > Do you see the hung task watchdog? can you share the > threads state? (echo t > /proc/sysrq-trigger) The deleting thread is stuck: Aug 17 07:21:32 stevo1 kernel: [ 2213.279798] INFO: task bash:4066 blocked for more than 120 seconds. Aug 17 07:21:32 stevo1 kernel: [ 2213.279802] Tainted: G E 4.8.0-rc2-nvmf-4.8-rc-rebased-rc2+ #15 Aug 17 07:21:32 stevo1 kernel: [ 2213.279803] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Aug 17 07:21:32 stevo1 kernel: [ 2213.279811] bash D ffff881034547ac8 0 4066 4056 0x10000080 Aug 17 07:21:32 stevo1 kernel: [ 2213.279819] ffff881034547ac8 ffff881033f04140 ffff88102bc80000 ffff881032b66250 Aug 17 07:21:32 stevo1 kernel: [ 2213.279824] ffff880fefe64230 ffff88103e88aa50 ffff881034544008 ffff881034547c98 Aug 17 07:21:32 stevo1 kernel: [ 2213.279828] ffff881034547c08 ffff88102bc80000 7fffffffffffffff ffff881034547b18 Aug 17 07:21:32 stevo1 kernel: [ 2213.279829] Call Trace: Aug 17 07:21:32 stevo1 kernel: [ 2213.279842] [] schedule+0x40/0xb0 Aug 17 07:21:32 stevo1 kernel: [ 2213.279850] [] ? try_to_wake_up+0x54/0x240 Aug 17 07:21:32 stevo1 kernel: [ 2213.279856] [] schedule_timeout+0x2ad/0x410 Aug 17 07:21:32 stevo1 kernel: [ 2213.279861] [] ? wake_up_process+0x15/0x20 Aug 17 07:21:32 stevo1 kernel: [ 2213.279868] [] ? wake_up_worker+0x30/0x40 Aug 17 07:21:32 stevo1 kernel: [ 2213.279873] [] ? insert_work+0x6c/0xb0 Aug 17 07:21:32 stevo1 kernel: [ 2213.279878] [] wait_for_completion+0xde/0x110 Aug 17 07:21:32 stevo1 kernel: [ 2213.279882] [] ? try_to_wake_up+0x240/0x240 Aug 17 07:21:32 stevo1 kernel: [ 2213.279887] [] flush_work+0x2b/0x40 Aug 17 07:21:32 stevo1 kernel: [ 2213.279892] [] ? worker_detach_from_pool+0xa0/0xa0 Aug 17 07:21:32 stevo1 kernel: [ 2213.279902] [] ? __nvme_rdma_del_ctrl+0x4a/0x60 [nvme_rdma] Aug 17 07:21:32 stevo1 kernel: [ 2213.279908] [] nvme_rdma_del_ctrl+0x38/0x50 [nvme_rdma] Aug 17 07:21:32 stevo1 kernel: [ 2213.279918] [] nvme_sysfs_delete+0x34/0x50 [nvme_core] Aug 17 07:21:32 stevo1 kernel: [ 2213.279923] [] dev_attr_store+0x20/0x30 Aug 17 07:21:32 stevo1 kernel: [ 2213.279931] [] sysfs_kf_write+0x3e/0x40 Aug 17 07:21:32 stevo1 kernel: [ 2213.279935] [] kernfs_fop_write+0x113/0x1b0 Aug 17 07:21:32 stevo1 kernel: [ 2213.279942] [] ? trace_buffer_unlock_commit_regs+0x68/0x80 Aug 17 07:21:32 stevo1 kernel: [ 2213.279949] [] __vfs_write+0x34/0x120 Aug 17 07:21:32 stevo1 kernel: [ 2213.279957] [] ? trace_event_raw_event_sys_enter+0xb5/0x130 Aug 17 07:21:32 stevo1 kernel: [ 2213.279963] [] ? percpu_down_read+0x16/0x60 Aug 17 07:21:32 stevo1 kernel: [ 2213.279967] [] vfs_write+0xc1/0x130 Aug 17 07:21:32 stevo1 kernel: [ 2213.279974] [] ? __fdget+0x13/0x20 Aug 17 07:21:32 stevo1 kernel: [ 2213.279978] [] SyS_write+0x56/0xc0 Aug 17 07:21:32 stevo1 kernel: [ 2213.279983] [] do_syscall_64+0x7d/0x230 Aug 17 07:21:32 stevo1 kernel: [ 2213.279991] [] ? do_page_fault+0x37/0x90 Aug 17 07:21:32 stevo1 kernel: [ 2213.279997] [] entry_SYSCALL64_slow_path+0x25/0x25 Nvme_rdma_wq is doing nothing: Aug 17 07:20:08 stevo1 kernel: [ 2130.461614] nvme_rdma_wq S ffff880fedd7bd58 0 4607 2 0x10000080 Aug 17 07:20:08 stevo1 kernel: [ 2130.469572] ffff880fedd7bd58 ffff8810381a15c0 ffff88102bdb5700 ffff881032b66250 Aug 17 07:20:08 stevo1 kernel: [ 2130.477972] ffffffff81038959 ffff880fedd7bd48 ffff880fedd78008 ffff8810206dde00 Aug 17 07:20:08 stevo1 kernel: [ 2130.486357] ffff88102bdb5700 ffffffff810a1730 ffff8810381abed8 ffff880fedd7bda8 Aug 17 07:20:08 stevo1 kernel: [ 2130.494755] Call Trace: Aug 17 07:20:08 stevo1 kernel: [ 2130.498088] [] ? sched_clock+0x9/0x10 Aug 17 07:20:08 stevo1 kernel: [ 2130.504287] [] ? process_one_work+0x4d0/0x4d0 Aug 17 07:20:08 stevo1 kernel: [ 2130.511168] [] schedule+0x40/0xb0 Aug 17 07:20:08 stevo1 kernel: [ 2130.517002] [] ? set_user_nice+0x114/0x1b0 Aug 17 07:20:08 stevo1 kernel: [ 2130.523623] [] rescuer_thread+0x2f2/0x380 Aug 17 07:20:08 stevo1 kernel: [ 2130.530146] [] ? __schedule+0x1f0/0x5b0 Aug 17 07:20:08 stevo1 kernel: [ 2130.536501] [] ? __wake_up_common+0x56/0x90 Aug 17 07:20:08 stevo1 kernel: [ 2130.543173] [] ? process_one_work+0x4d0/0x4d0 Aug 17 07:20:08 stevo1 kernel: [ 2130.550013] [] ? schedule+0x40/0xb0 Aug 17 07:20:08 stevo1 kernel: [ 2130.555971] [] ? process_one_work+0x4d0/0x4d0 Aug 17 07:20:08 stevo1 kernel: [ 2130.562800] [] kthread+0xcc/0xf0 Aug 17 07:20:08 stevo1 kernel: [ 2130.568485] [] ? schedule_tail+0x1e/0xc0 Aug 17 07:20:08 stevo1 kernel: [ 2130.574871] [] ret_from_fork+0x1f/0x40 Aug 17 07:20:08 stevo1 kernel: [ 2130.581073] [] ? kthread_freezable_should_stop+0x70/0x70 Here is another thread in the delete controller sysfs path: Aug 17 07:20:05 stevo1 kernel: [ 2127.184581] bash D ffff881034547ac8 0 4066 4056 0x10000080 Aug 17 07:20:05 stevo1 kernel: [ 2127.192445] ffff881034547ac8 ffff881033f04140 ffff88102bc80000 ffff881032b66250 Aug 17 07:20:05 stevo1 kernel: [ 2127.200730] ffff880fefe64230 ffff88103e88aa50 ffff881034544008 ffff881034547c98 Aug 17 07:20:05 stevo1 kernel: [ 2127.209022] ffff881034547c08 ffff88102bc80000 7fffffffffffffff ffff881034547b18 Aug 17 07:20:05 stevo1 kernel: [ 2127.217278] Call Trace: Aug 17 07:20:05 stevo1 kernel: [ 2127.220523] [] schedule+0x40/0xb0 Aug 17 07:20:05 stevo1 kernel: [ 2127.226244] [] ? try_to_wake_up+0x54/0x240 Aug 17 07:20:05 stevo1 kernel: [ 2127.232753] [] schedule_timeout+0x2ad/0x410 Aug 17 07:20:05 stevo1 kernel: [ 2127.239353] [] ? wake_up_process+0x15/0x20 Aug 17 07:20:05 stevo1 kernel: [ 2127.245871] [] ? wake_up_worker+0x30/0x40 Aug 17 07:20:05 stevo1 kernel: [ 2127.252285] [] ? insert_work+0x6c/0xb0 Aug 17 07:20:05 stevo1 kernel: [ 2127.258432] [] wait_for_completion+0xde/0x110 Aug 17 07:20:05 stevo1 kernel: [ 2127.263351] nvme nvme4: rdma_resolve_addr wait failed (-110). Aug 17 07:20:05 stevo1 kernel: [ 2127.263355] nvme nvme4: Failed reconnect attempt, requeueing... Aug 17 07:20:05 stevo1 kernel: [ 2127.278353] [] ? try_to_wake_up+0x240/0x240 Aug 17 07:20:05 stevo1 kernel: [ 2127.284912] [] flush_work+0x2b/0x40 Aug 17 07:20:05 stevo1 kernel: [ 2127.290771] [] ? worker_detach_from_pool+0xa0/0xa0 Aug 17 07:20:05 stevo1 kernel: [ 2127.297970] [] ? __nvme_rdma_del_ctrl+0x4a/0x60 [nvme_rdma] Aug 17 07:20:05 stevo1 kernel: [ 2127.305945] [] nvme_rdma_del_ctrl+0x38/0x50 [nvme_rdma] Aug 17 07:20:05 stevo1 kernel: [ 2127.313577] [] nvme_sysfs_delete+0x34/0x50 [nvme_core] Aug 17 07:20:05 stevo1 kernel: [ 2127.321111] [] dev_attr_store+0x20/0x30 Aug 17 07:20:05 stevo1 kernel: [ 2127.327343] [] sysfs_kf_write+0x3e/0x40 Aug 17 07:20:05 stevo1 kernel: [ 2127.333551] [] kernfs_fop_write+0x113/0x1b0 Aug 17 07:20:05 stevo1 kernel: [ 2127.340127] [] ? trace_buffer_unlock_commit_regs+0x68/0x80 Aug 17 07:20:05 stevo1 kernel: [ 2127.348017] [] __vfs_write+0x34/0x120 Aug 17 07:20:05 stevo1 kernel: [ 2127.354083] [] ? trace_event_raw_event_sys_enter+0xb5/0x130 Aug 17 07:20:05 stevo1 kernel: [ 2127.362060] [] ? percpu_down_read+0x16/0x60 Aug 17 07:20:05 stevo1 kernel: [ 2127.368667] [] vfs_write+0xc1/0x130 Aug 17 07:20:05 stevo1 kernel: [ 2127.374559] [] ? __fdget+0x13/0x20 Aug 17 07:20:05 stevo1 kernel: [ 2127.380367] [] SyS_write+0x56/0xc0 Aug 17 07:20:05 stevo1 kernel: [ 2127.386201] [] do_syscall_64+0x7d/0x230 Aug 17 07:20:05 stevo1 kernel: [ 2127.392485] [] ? do_page_fault+0x37/0x90 Aug 17 07:20:05 stevo1 kernel: [ 2127.398866] [] entry_SYSCALL64_slow_path+0x25/0x25 Here is the reconnect workq thread stuck trying to connect: Aug 17 07:19:45 stevo1 kernel: [ 2107.446780] Workqueue: nvme_rdma_wq nvme_rdma_reconnect_ctrl_work [nvme_rdma] Aug 17 07:19:45 stevo1 kernel: [ 2107.454764] ffff881033a57aa8 ffff88103815c140 ffff8810371dab80 ffff881032b66250 Aug 17 07:19:45 stevo1 kernel: [ 2107.463091] 00000001001b93e9 0000000000000001 ffff881033a54008 ffff88103ee500c0 Aug 17 07:19:45 stevo1 kernel: [ 2107.471431] 00000001001b93e9 0000000000000000 ffff88103ee500c0 ffff881033a57af8 Aug 17 07:19:45 stevo1 kernel: [ 2107.479805] Call Trace: Aug 17 07:19:45 stevo1 kernel: [ 2107.483076] [] schedule+0x40/0xb0 Aug 17 07:19:45 stevo1 kernel: [ 2107.488892] [] ? internal_add_timer+0x70/0x80 Aug 17 07:19:45 stevo1 kernel: [ 2107.495765] [] schedule_timeout+0x21b/0x410 Aug 17 07:19:45 stevo1 kernel: [ 2107.502464] [] ? del_timer+0x70/0x70 Aug 17 07:19:45 stevo1 kernel: [ 2107.508547] [] ? rdma_resolve_ip+0x1cf/0x1e0 [ib_core] Aug 17 07:19:45 stevo1 kernel: [ 2107.516183] [] ? cma_work_handler+0xa0/0xa0 [rdma_cm] Aug 17 07:19:45 stevo1 kernel: [ 2107.523721] [] wait_for_completion_interruptible_timeout+0xef/0x130 Aug 17 07:19:45 stevo1 kernel: [ 2107.532472] [] ? try_to_wake_up+0x240/0x240 Aug 17 07:19:45 stevo1 kernel: [ 2107.539152] [] nvme_rdma_init_queue+0xe2/0x180 [nvme_rdma] Aug 17 07:19:45 stevo1 kernel: [ 2107.547132] [] nvme_rdma_reconnect_ctrl_work+0x9c/0x1f0 [nvme_rdma] Aug 17 07:19:45 stevo1 kernel: [ 2107.555908] [] process_one_work+0x183/0x4d0 Aug 17 07:19:45 stevo1 kernel: [ 2107.562592] [] ? __schedule+0x1f0/0x5b0 Aug 17 07:19:45 stevo1 kernel: [ 2107.568938] [] ? schedule+0x40/0xb0 Aug 17 07:19:45 stevo1 kernel: [ 2107.574927] [] worker_thread+0x16d/0x530 Aug 17 07:19:45 stevo1 kernel: [ 2107.581346] [] ? maybe_create_worker+0x120/0x120 Aug 17 07:19:45 stevo1 kernel: [ 2107.588474] [] ? __schedule+0x1f0/0x5b0 Aug 17 07:19:45 stevo1 kernel: [ 2107.594813] [] ? __wake_up_common+0x56/0x90 Aug 17 07:19:45 stevo1 kernel: [ 2107.601488] [] ? maybe_create_worker+0x120/0x120 Aug 17 07:19:45 stevo1 kernel: [ 2107.608625] [] ? schedule+0x40/0xb0 Aug 17 07:19:45 stevo1 kernel: [ 2107.614626] [] ? maybe_create_worker+0x120/0x120 Aug 17 07:19:45 stevo1 kernel: [ 2107.621745] [] kthread+0xcc/0xf0 Aug 17 07:19:45 stevo1 kernel: [ 2107.627464] [] ? schedule_tail+0x1e/0xc0 Aug 17 07:19:45 stevo1 kernel: [ 2107.633886] [] ret_from_fork+0x1f/0x40 Aug 17 07:19:45 stevo1 kernel: [ 2107.640121] [] ? kthread_freezable_should_stop+0x70/0x70 Here is controller delete workq thread: Aug 17 07:19:42 stevo1 kernel: [ 2104.515910] Workqueue: nvme_rdma_wq nvme_rdma_del_ctrl_work [nvme_rdma] Aug 17 07:19:42 stevo1 kernel: [ 2104.523373] ffff881033f13888 ffff88103817ab80 ffff881033f04140 ffff881032b66250 Aug 17 07:19:42 stevo1 kernel: [ 2104.531744] 00000001001c1a60 0000000000000005 ffff881033f10008 ffff88103ef500c0 Aug 17 07:19:42 stevo1 kernel: [ 2104.540116] 00000001001c1a60 0000000000000000 ffff88103ef500c0 ffff881033f138d8 Aug 17 07:19:42 stevo1 kernel: [ 2104.548501] Call Trace: Aug 17 07:19:42 stevo1 kernel: [ 2104.551777] [] schedule+0x40/0xb0 Aug 17 07:19:42 stevo1 kernel: [ 2104.557593] [] ? internal_add_timer+0x70/0x80 Aug 17 07:19:42 stevo1 kernel: [ 2104.564445] [] schedule_timeout+0x21b/0x410 Aug 17 07:19:42 stevo1 kernel: [ 2104.571117] [] ? del_timer+0x70/0x70 Aug 17 07:19:42 stevo1 kernel: [ 2104.577191] [] ? ktime_get+0x4c/0xc0 Aug 17 07:19:42 stevo1 kernel: [ 2104.583248] [] io_schedule_timeout+0xa4/0x110 Aug 17 07:19:42 stevo1 kernel: [ 2104.590083] [] wait_for_completion_io_timeout+0xe3/0x120 Aug 17 07:19:42 stevo1 kernel: [ 2104.597943] [] ? try_to_wake_up+0x240/0x240 Aug 17 07:19:42 stevo1 kernel: [ 2104.604613] [] ? blk_execute_rq_nowait+0x7d/0x160 Aug 17 07:19:42 stevo1 kernel: [ 2104.611810] [] blk_execute_rq+0x8c/0x110 Aug 17 07:19:42 stevo1 kernel: [ 2104.618221] [] __nvme_submit_sync_cmd+0x85/0xe0 [nvme_core] Aug 17 07:19:42 stevo1 kernel: [ 2104.626293] [] nvmf_reg_write32+0x66/0xb0 [nvme_fabrics] Aug 17 07:19:42 stevo1 kernel: [ 2104.634086] [] nvme_shutdown_ctrl+0x4e/0xe0 [nvme_core] Aug 17 07:19:42 stevo1 kernel: [ 2104.641796] [] nvme_rdma_shutdown_ctrl+0xa1/0xe0 [nvme_rdma] Aug 17 07:19:42 stevo1 kernel: [ 2104.649940] [] __nvme_rdma_remove_ctrl+0x7c/0x80 [nvme_rdma] Aug 17 07:19:42 stevo1 kernel: [ 2104.658093] [] nvme_rdma_del_ctrl_work+0x1a/0x20 [nvme_rdma] Aug 17 07:19:42 stevo1 kernel: [ 2104.666248] [] process_one_work+0x183/0x4d0 Aug 17 07:19:42 stevo1 kernel: [ 2104.672936] [] ? __schedule+0x1f0/0x5b0 Aug 17 07:19:42 stevo1 kernel: [ 2104.679269] [] ? move_linked_works+0x62/0x90 Aug 17 07:19:42 stevo1 kernel: [ 2104.686032] [] worker_thread+0x2f4/0x530 Aug 17 07:19:42 stevo1 kernel: [ 2104.692441] [] ? maybe_create_worker+0x120/0x120 Aug 17 07:19:42 stevo1 kernel: [ 2104.699540] [] ? __schedule+0x1f0/0x5b0 Aug 17 07:19:42 stevo1 kernel: [ 2104.705867] [] ? __wake_up_common+0x56/0x90 Aug 17 07:19:42 stevo1 kernel: [ 2104.712534] [] ? maybe_create_worker+0x120/0x120 Aug 17 07:19:42 stevo1 kernel: [ 2104.719257] nvme nvme4: rdma_resolve_addr wait failed (-104). Aug 17 07:19:42 stevo1 kernel: [ 2104.719261] nvme nvme4: Failed reconnect attempt, requeueing... Aug 17 07:19:42 stevo1 kernel: [ 2104.733085] [] ? schedule+0x40/0xb0 Aug 17 07:19:42 stevo1 kernel: [ 2104.739063] [] ? maybe_create_worker+0x120/0x120 Aug 17 07:19:42 stevo1 kernel: [ 2104.746163] [] kthread+0xcc/0xf0 Aug 17 07:19:42 stevo1 kernel: [ 2104.751878] [] ? schedule_tail+0x1e/0xc0