* nvmf host shutdown hangs when nvmf controllers are in recovery/reconnect
@ 2016-08-23 14:46 Steve Wise
  2016-08-24 10:40 ` Sagi Grimberg
  0 siblings, 1 reply; 9+ messages in thread
From: Steve Wise @ 2016-08-23 14:46 UTC (permalink / raw)


Hey guys, when I force an nvmf host into kato recovery/reconnect mode by killing
the target, and then reboot the host, it hangs forever because the nvmf host
controllers never get a delete command, so they stay stuck in reconnect state.  

Here is the dmesg log:

<... one nvmf device connected...>

[  255.079939] nvme nvme1: creating 32 I/O queues.
[  255.377218] nvme nvme1: new ctrl: NQN "test-ram0", addr 10.0.1.14:4420


<... target rebooted here via 'reboot -f'...>

[  264.768555] cxgb4 0000:83:00.4: Port 0 link down, reason: Link Down
[  264.777520] cxgb4 0000:83:00.4 eth10: link down
[  265.177225] nvme nvme1: RECV for CQE 0xffff88101d6f3568 failed with status WR
flushed (5)
[  265.177306] nvme nvme1: reconnecting in 10 seconds
[  265.748213] cxgb4 0000:82:00.4: Port 0 link down, reason: Link Down
[  265.755478] cxgb4 0000:82:00.4 eth2: link down
[  266.183927] mlx4_en: eth14: Link Down
[  276.387127] nvme nvme1: rdma_resolve_addr wait failed (-110).
[  283.116153] nvme nvme1: Failed reconnect attempt, requeueing...

<... host 'reboot' issued here...>

Stopping certmonger: [  OK  ]

Running guests on default URI: no running guests.

Stopping libvirtd daemon: [  OK  ]
Stopping atd: [  OK  ]
Shutting down console mouse services: [  OK  ]
Stopping ksmtuned: [  OK  ]
Stopping abrt daemon: [  OK  ]
Stopping sshd: [  OK  ]
Stopping mcelog
Stopping xinetd: [  OK  ]
Stopping crond: [  OK  ]
Stopping automount: [  OK  ]
Stopping HAL daemon: [  OK  ]
Stopping block device availability: Deactivating block devices:
[  OK  ]
Stopping cgdcbxd: [  OK  ]
Stopping lldpad: [  OK  ]
Stopping system message bus: [  OK  ]
Shutting down ca[  290.560113] CacheFiles: File cache on sda2 unregistering
chefilesd: [  290.566076] FS-Cache: Withdrawing cache "mycache"
[  OK  ]
Stopping rpcbind: [  OK  ]
Stopping auditd: [  290.809894] audit: type=1305 audit(1471963093.850:82):
audit_pid=0 old=3011 auid=4294967295 ses=4294967295 res=1
[  OK  ]
[  290.908238] audit: type=1305 audit(1471963093.948:83): audit_enabled=0 old=1
auid=4294967295 ses=4294967295 res=1
Shutting down system logger: [  OK  ]
Shutting down interface eth8:  [  OK  ]
Shutting down loopback interface:  [  OK  ]
Stopping cgconfig service: [  OK  ]
Stopping virt-who: [  OK  ]
[  294.307812] nvme nvme1: rdma_resolve_addr wait failed (-110).
[  301.035260] nvme nvme1: Failed reconnect attempt, requeueing...
[  312.228468] nvme nvme1: rdma_resolve_addr wait failed (-110).
[  312.234310] nvme nvme1: Failed reconnect attempt, requeueing...
[  323.492871] nvme nvme1: rdma_resolve_addr wait failed (-110).
[  323.498713] nvme nvme1: Failed reconnect attempt, requeueing...
[  334.757296] nvme nvme1: rdma_resolve_addr wait failed (-110).
[  334.763162] nvme nvme1: Failed reconnect attempt, requeueing...

<..stuck forever...>


* nvmf host shutdown hangs when nvmf controllers are in recovery/reconnect
  2016-08-23 14:46 nvmf host shutdown hangs when nvmf controllers are in recovery/reconnect Steve Wise
@ 2016-08-24 10:40 ` Sagi Grimberg
  2016-08-24 11:20   ` Sagi Grimberg
  0 siblings, 1 reply; 9+ messages in thread
From: Sagi Grimberg @ 2016-08-24 10:40 UTC (permalink / raw)



> Hey guys, when I force an nvmf host into kato recovery/reconnect mode by killing
> the target, and then reboot the host, it hangs forever because the nvmf host
> controllers never get a delete command, so they stay stuck in reconnect state.

Hey Steve,

For some reason I can't reproduce this on my setup...

So I'm wondering: where is the nvme_rdma_del_ctrl() thread stuck?
Probably a dump of all the kworkers would be helpful here:

$ pids=`ps -ef | grep kworker | grep -v grep | awk {'print $2'}`
$ for p in $pids; do echo "$p:" ;cat /proc/$p/stack; done

The fact that nvme1 keeps reconnecting forever means that
del_ctrl() never changes the controller state. Is there an
nvme0 on the system that is also being removed, and you just
don't see that the reconnecting thread keeps on going?

My expectation would be that del_ctrl() would move the ctrl state
to DELETING, the reconnect thread would bail out, and then the
delete_work should fire and delete the controller. Obviously
something is not happening as it should.
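
Roughly, the flow I expect looks like this -- just a sketch, not the
actual driver source (nvme_rdma_wq and delete_work are the names I'm
assuming here; nvme_change_ctrl_state() and NVME_CTRL_DELETING are the
nvme core symbols):

--
static int nvme_rdma_del_ctrl_sketch(struct nvme_rdma_ctrl *ctrl)
{
        /* 1. mark the controller DELETING; fails if it is already deleting */
        if (!nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_DELETING))
                return -EBUSY;

        /*
         * 2. the periodic reconnect work is expected to notice the state
         *    change and stop requeueing itself.
         */

        /* 3. delete_work fires, tears down the queues and frees the ctrl */
        if (!queue_work(nvme_rdma_wq, &ctrl->delete_work))
                return -EBUSY;

        return 0;
}
--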


* nvmf host shutdown hangs when nvmf controllers are in recovery/reconnect
  2016-08-24 10:40 ` Sagi Grimberg
@ 2016-08-24 11:20   ` Sagi Grimberg
  2016-08-24 20:25     ` Steve Wise
                       ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: Sagi Grimberg @ 2016-08-24 11:20 UTC (permalink / raw)



> Hey Steve,
>
> For some reason I can't reproduce this on my setup...
>
> So I'm wondering: where is the nvme_rdma_del_ctrl() thread stuck?
> Probably a dump of all the kworkers would be helpful here:
>
> $ pids=`ps -ef | grep kworker | grep -v grep | awk {'print $2'}`
> $ for p in $pids; do echo "$p:" ;cat /proc/$p/stack; done
>
> The fact that nvme1 keeps reconnecting forever means that
> del_ctrl() never changes the controller state. Is there an
> nvme0 on the system that is also being removed, and you just
> don't see that the reconnecting thread keeps on going?
>
> My expectation would be that del_ctrl() would move the ctrl state
> to DELETING, the reconnect thread would bail out, and then the
> delete_work should fire and delete the controller. Obviously
> something is not happening as it should.

I suspect I know what is going on...

When we get a surprise disconnect from the target we queue
a periodic reconnect (which is the sane thing to do...).

We only move the queues out of CONNECTED when we retry
to reconnect (after 10 seconds in the default case) but we stop
the blk queues immediately so we are not bothered with traffic from
now on. If delete() is kicking off in this period the queues are still
in CONNECTED state.

Part of the delete sequence tries to issue a ctrl shutdown if the
admin queue is CONNECTED (which it is!). This request is issued but
gets stuck in blk-mq waiting for the queues to start again. This might
be what is preventing us from making forward progress...
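
The path I mean looks roughly like this -- an abbreviated sketch, not
verbatim source (nvme_shutdown_ctrl() is the nvme core helper that
issues the shutdown over the admin queue):

--
static void nvme_rdma_shutdown_ctrl_sketch(struct nvme_rdma_ctrl *ctrl)
{
        /* during the reconnect window the admin queue is still CONNECTED */
        if (test_bit(NVME_RDMA_Q_CONNECTED, &ctrl->queues[0].flags))
                nvme_shutdown_ctrl(&ctrl->ctrl);  /* admin command: stuck in blk-mq */

        blk_mq_stop_hw_queues(ctrl->ctrl.admin_q);
        /* ... free the admin queue ... */
}
--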

Steve, care to check if the below patch makes things better?

The patch tries to separate the queue flags into CONNECTED and
DELETING. Now we will move out of CONNECTED as soon as error recovery
kicks in (before stopping the queues), and DELETING is set when
we start the queue deletion.

--
diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 23297c5f85ed..75b49c29b890 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -86,6 +86,7 @@ struct nvme_rdma_request {

  enum nvme_rdma_queue_flags {
         NVME_RDMA_Q_CONNECTED = (1 << 0),
+       NVME_RDMA_Q_DELETING  = (1 << 1),
  };

  struct nvme_rdma_queue {
@@ -612,7 +613,7 @@ static void nvme_rdma_free_queue(struct 
nvme_rdma_queue *queue)

  static void nvme_rdma_stop_and_free_queue(struct nvme_rdma_queue *queue)
  {
-       if (!test_and_clear_bit(NVME_RDMA_Q_CONNECTED, &queue->flags))
+       if (test_and_set_bit(NVME_RDMA_Q_DELETING, &queue->flags))
                 return;
         nvme_rdma_stop_queue(queue);
         nvme_rdma_free_queue(queue);
@@ -764,8 +765,13 @@ static void nvme_rdma_error_recovery_work(struct 
work_struct *work)
  {
         struct nvme_rdma_ctrl *ctrl = container_of(work,
                         struct nvme_rdma_ctrl, err_work);
+       int i;

         nvme_stop_keep_alive(&ctrl->ctrl);
+
+       for (i = 0; i < ctrl->queue_count; i++)
+               clear_bit(NVME_RDMA_Q_CONNECTED, &ctrl->queues[i].flags);
+
         if (ctrl->queue_count > 1)
                 nvme_stop_queues(&ctrl->ctrl);
         blk_mq_stop_hw_queues(ctrl->ctrl.admin_q);
@@ -1331,7 +1337,7 @@ static int nvme_rdma_device_unplug(struct 
nvme_rdma_queue *queue)
         cancel_delayed_work_sync(&ctrl->reconnect_work);

         /* Disable the queue so ctrl delete won't free it */
-       if (test_and_clear_bit(NVME_RDMA_Q_CONNECTED, &queue->flags)) {
+       if (!test_and_set_bit(NVME_RDMA_Q_DELETING, &queue->flags)) {
                 /* Free this queue ourselves */
                 nvme_rdma_stop_queue(queue);
                 nvme_rdma_destroy_queue_ib(queue);
--


* nvmf host shutdown hangs when nvmf controllers are in recovery/reconnect
  2016-08-24 11:20   ` Sagi Grimberg
@ 2016-08-24 20:25     ` Steve Wise
       [not found]     ` <021d01d1fe45$af92ff60$0eb8fe20$@opengridcomputing.com>
  2016-08-25 21:58     ` Sagi Grimberg
  2 siblings, 0 replies; 9+ messages in thread
From: Steve Wise @ 2016-08-24 20:25 UTC (permalink / raw)


> > Hey Steve,
> >
> > For some reason I can't reproduce this on my setup...
> >
> > So I'm wondering: where is the nvme_rdma_del_ctrl() thread stuck?
> > Probably a dump of all the kworkers would be helpful here:
> >
> > $ pids=`ps -ef | grep kworker | grep -v grep | awk {'print $2'}`
> > $ for p in $pids; do echo "$p:" ;cat /proc/$p/stack; done
> >

I can't do this because the system is crippled due to shutting down.  I get the
feeling though that the del_ctrl thread isn't getting scheduled. Note that the
difference between 'reboot' and 'reboot -f' is that without the -f, iw_cxgb4
isn't unloaded before we get stuck.  So there has to be some part of 'reboot'
that deletes the controllers for it to work.  But I still don't know what is
stalling the reboot anyway.  Some I/O pending I guess?

> > The fact that nvme1 keeps reconnecting forever means that
> > del_ctrl() never changes the controller state. Is there an
> > nvme0 on the system that is also being removed, and you just
> > don't see that the reconnecting thread keeps on going?
> >

nvme0 is a local nvme device on my setup.

> > My expectation would be that del_ctrl() would move the ctrl state
> > to DELETING, the reconnect thread would bail out, and then the
> > delete_work should fire and delete the controller. Obviously
> > something is not happening as it should.
> 
> I suspect I know what is going on...
> 
> When we get a surprise disconnect from the target we queue
> a periodic reconnect (which is the sane thing to do...).
> 

Or a kato timeout.

> We only move the queues out of CONNECTED when we retry
> to reconnect (after 10 seconds in the default case) but we stop
> the blk queues immediately so we are not bothered with traffic from
> now on. If delete() is kicking off in this period the queues are still
> in CONNECTED state.
> 
> Part of the delete sequence tries to issue a ctrl shutdown if the
> admin queue is CONNECTED (which it is!). This request is issued but
> gets stuck in blk-mq waiting for the queues to start again. This might
> be what is preventing us from making forward progress...
> 
> Steve, care to check if the below patch makes things better?
>

This doesn't help.  I'm debugging to get more details.  But can you answer this:
What code initiates the ctrl deletes for the active devices as part of a
'reboot'?  
 
> The patch tries to separate the queue flags into CONNECTED and
> DELETING. Now we will move out of CONNECTED as soon as error recovery
> kicks in (before stopping the queues), and DELETING is set when
> we start the queue deletion.
> 
> --
> diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
> index 23297c5f85ed..75b49c29b890 100644
> --- a/drivers/nvme/host/rdma.c
> +++ b/drivers/nvme/host/rdma.c
> @@ -86,6 +86,7 @@ struct nvme_rdma_request {
> 
>   enum nvme_rdma_queue_flags {
>          NVME_RDMA_Q_CONNECTED = (1 << 0),
> +       NVME_RDMA_Q_DELETING  = (1 << 1),
>   };
> 
>   struct nvme_rdma_queue {
> @@ -612,7 +613,7 @@ static void nvme_rdma_free_queue(struct
> nvme_rdma_queue *queue)
> 
>   static void nvme_rdma_stop_and_free_queue(struct nvme_rdma_queue *queue)
>   {
> -       if (!test_and_clear_bit(NVME_RDMA_Q_CONNECTED, &queue->flags))
> +       if (test_and_set_bit(NVME_RDMA_Q_DELETING, &queue->flags))
>                  return;
>          nvme_rdma_stop_queue(queue);
>          nvme_rdma_free_queue(queue);
> @@ -764,8 +765,13 @@ static void nvme_rdma_error_recovery_work(struct
> work_struct *work)
>   {
>          struct nvme_rdma_ctrl *ctrl = container_of(work,
>                          struct nvme_rdma_ctrl, err_work);
> +       int i;
> 
>          nvme_stop_keep_alive(&ctrl->ctrl);
> +
> +       for (i = 0; i < ctrl->queue_count; i++)
> +               clear_bit(NVME_RDMA_Q_CONNECTED, &ctrl->queues[i].flags);
> +
>          if (ctrl->queue_count > 1)
>                  nvme_stop_queues(&ctrl->ctrl);
>          blk_mq_stop_hw_queues(ctrl->ctrl.admin_q);
> @@ -1331,7 +1337,7 @@ static int nvme_rdma_device_unplug(struct
> nvme_rdma_queue *queue)
>          cancel_delayed_work_sync(&ctrl->reconnect_work);
> 
>          /* Disable the queue so ctrl delete won't free it */
> -       if (test_and_clear_bit(NVME_RDMA_Q_CONNECTED, &queue->flags)) {
> +       if (!test_and_set_bit(NVME_RDMA_Q_DELETING, &queue->flags)) {
>                  /* Free this queue ourselves */
>                  nvme_rdma_stop_queue(queue);
>                  nvme_rdma_destroy_queue_ib(queue);
> --
> 
> _______________________________________________
> Linux-nvme mailing list
> Linux-nvme at lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-nvme


* nvmf host shutdown hangs when nvmf controllers are in recovery/reconnect
       [not found]     ` <021d01d1fe45$af92ff60$0eb8fe20$@opengridcomputing.com>
@ 2016-08-24 20:34       ` Steve Wise
       [not found]       ` <022201d1fe46$e85649f0$b902ddd0$@opengridcomputing.com>
  1 sibling, 0 replies; 9+ messages in thread
From: Steve Wise @ 2016-08-24 20:34 UTC (permalink / raw)


> > > Hey Steve,
> > >
> > > For some reason I can't reproduce this on my setup...
> > >
> > > So I'm wondering: where is the nvme_rdma_del_ctrl() thread stuck?
> > > Probably a dump of all the kworkers would be helpful here:
> > >
> > > $ pids=`ps -ef | grep kworker | grep -v grep | awk {'print $2'}`
> > > $ for p in $pids; do echo "$p:" ;cat /proc/$p/stack; done
> > >
> 
> I can't do this because the system is crippled due to shutting down.  I
> get the feeling though that the del_ctrl thread isn't getting scheduled.
> Note that the difference between 'reboot' and 'reboot -f' is that without
> the -f, iw_cxgb4 isn't unloaded before we get stuck.  So there has to be
> some part of 'reboot' that deletes the controllers for it to work.  But I
> still don't know what is stalling the reboot anyway.  Some I/O pending I
> guess?

According to the hung task detector, this is the only thread stuck:

[  861.638248] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
message.
[  861.647826] vgs             D ffff880ff6e5b8e8     0  4849   4848 0x10000080
[  861.656702]  ffff880ff6e5b8e8 ffff8810381a15c0 ffff88103343ab80
ffff8810283a6f10
[  861.665829]  00000001e0941240 ffff880ff6e5b8b8 ffff880ff6e58008
ffff88103f059300
[  861.674882]  7fffffffffffffff 0000000000000000 0000000000000000
ffff880ff6e5b938
[  861.683819] Call Trace:
[  861.687677]  [<ffffffff816ddde0>] schedule+0x40/0xb0
[  861.694078]  [<ffffffff816e0a8d>] schedule_timeout+0x2ad/0x410
[  861.701279]  [<ffffffff8132d6d2>] ? blk_flush_plug_list+0x132/0x2e0
[  861.708924]  [<ffffffff810fe67c>] ? ktime_get+0x4c/0xc0
[  861.715452]  [<ffffffff8132c92c>] ? generic_make_request+0xfc/0x1d0
[  861.723060]  [<ffffffff816dd6c4>] io_schedule_timeout+0xa4/0x110
[  861.730319]  [<ffffffff81269cb9>] dio_await_one+0x99/0xe0
[  861.736951]  [<ffffffff8126d359>] do_blockdev_direct_IO+0x919/0xc00
[  861.744402]  [<ffffffff81267350>] ? I_BDEV+0x20/0x20
[  861.750569]  [<ffffffff81267350>] ? I_BDEV+0x20/0x20
[  861.756677]  [<ffffffff8115527b>] ? rb_reserve_next_event+0xdb/0x230
[  861.764155]  [<ffffffff811547ba>] ? rb_commit+0x10a/0x1a0
[  861.770642]  [<ffffffff8126d67a>] __blockdev_direct_IO+0x3a/0x40
[  861.777729]  [<ffffffff81267b83>] blkdev_direct_IO+0x43/0x50
[  861.784439]  [<ffffffff81199ef7>] generic_file_read_iter+0xf7/0x110
[  861.791727]  [<ffffffff81267657>] blkdev_read_iter+0x37/0x40
[  861.798404]  [<ffffffff8122b15c>] __vfs_read+0xfc/0x120
[  861.804624]  [<ffffffff8122b22e>] vfs_read+0xae/0xf0
[  861.810544]  [<ffffffff81249633>] ? __fdget+0x13/0x20
[  861.816539]  [<ffffffff8122bd36>] SyS_read+0x56/0xc0
[  861.822437]  [<ffffffff81003e7d>] do_syscall_64+0x7d/0x230
[  861.828863]  [<ffffffff8106f057>] ? do_page_fault+0x37/0x90
[  861.835313]  [<ffffffff816e1921>] entry_SYSCALL64_slow_path+0x25/0x25


* nvmf host shutdown hangs when nvmf controllers are in recovery/reconnect
       [not found]       ` <022201d1fe46$e85649f0$b902ddd0$@opengridcomputing.com>
@ 2016-08-24 20:47         ` Steve Wise
  0 siblings, 0 replies; 9+ messages in thread
From: Steve Wise @ 2016-08-24 20:47 UTC (permalink / raw)


> > > > Hey Steve,
> > > >
> > > > For some reason I can't reproduce this on my setup...
> > > >
> > > > So I'm wondering: where is the nvme_rdma_del_ctrl() thread stuck?
> > > > Probably a dump of all the kworkers would be helpful here:
> > > >
> > > > $ pids=`ps -ef | grep kworker | grep -v grep | awk {'print $2'}`
> > > > $ for p in $pids; do echo "$p:" ;cat /proc/$p/stack; done
> > > >
> >
> > I can't do this because the system is crippled due to shutting down.  I
> > get the feeling though that the del_ctrl thread isn't getting scheduled.
> > Note that the difference between 'reboot' and 'reboot -f' is that
> without
> > the -f, iw_cxgb4 isn't unloaded before we get stuck.  So there has to be
> > some part of 'reboot' that deletes the controllers for it to work.  But
> I
> > still don't know what is stalling the reboot anyway.  Some I/O pending I
> > guess?
> 
> According to the hung task detector, this is the only thread stuck:
> 
> [  861.638248] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
> this message.
> [  861.647826] vgs             D ffff880ff6e5b8e8     0  4849   4848
> 0x10000080
> [  861.656702]  ffff880ff6e5b8e8 ffff8810381a15c0 ffff88103343ab80
> ffff8810283a6f10
> [  861.665829]  00000001e0941240 ffff880ff6e5b8b8 ffff880ff6e58008
> ffff88103f059300
> [  861.674882]  7fffffffffffffff 0000000000000000 0000000000000000
> ffff880ff6e5b938
> [  861.683819] Call Trace:
> [  861.687677]  [<ffffffff816ddde0>] schedule+0x40/0xb0
> [  861.694078]  [<ffffffff816e0a8d>] schedule_timeout+0x2ad/0x410
> [  861.701279]  [<ffffffff8132d6d2>] ? blk_flush_plug_list+0x132/0x2e0
> [  861.708924]  [<ffffffff810fe67c>] ? ktime_get+0x4c/0xc0
> [  861.715452]  [<ffffffff8132c92c>] ? generic_make_request+0xfc/0x1d0
> [  861.723060]  [<ffffffff816dd6c4>] io_schedule_timeout+0xa4/0x110
> [  861.730319]  [<ffffffff81269cb9>] dio_await_one+0x99/0xe0
> [  861.736951]  [<ffffffff8126d359>] do_blockdev_direct_IO+0x919/0xc00
> [  861.744402]  [<ffffffff81267350>] ? I_BDEV+0x20/0x20
> [  861.750569]  [<ffffffff81267350>] ? I_BDEV+0x20/0x20
> [  861.756677]  [<ffffffff8115527b>] ? rb_reserve_next_event+0xdb/0x230
> [  861.764155]  [<ffffffff811547ba>] ? rb_commit+0x10a/0x1a0
> [  861.770642]  [<ffffffff8126d67a>] __blockdev_direct_IO+0x3a/0x40
> [  861.777729]  [<ffffffff81267b83>] blkdev_direct_IO+0x43/0x50
> [  861.784439]  [<ffffffff81199ef7>] generic_file_read_iter+0xf7/0x110
> [  861.791727]  [<ffffffff81267657>] blkdev_read_iter+0x37/0x40
> [  861.798404]  [<ffffffff8122b15c>] __vfs_read+0xfc/0x120
> [  861.804624]  [<ffffffff8122b22e>] vfs_read+0xae/0xf0
> [  861.810544]  [<ffffffff81249633>] ? __fdget+0x13/0x20
> [  861.816539]  [<ffffffff8122bd36>] SyS_read+0x56/0xc0
> [  861.822437]  [<ffffffff81003e7d>] do_syscall_64+0x7d/0x230
> [  861.828863]  [<ffffffff8106f057>] ? do_page_fault+0x37/0x90
> [  861.835313]  [<ffffffff816e1921>] entry_SYSCALL64_slow_path+0x25/0x25
> 
> 

vgs is part of a RHEL6 service that is started to monitor volume groups.  I
force-stopped it and then did my reboot experiment, and it worked fine.  So for
some reason the vgs program is reading the nvmf devices and gets stuck when the
target goes away.  And I guess stopping this service is part of reboot, so it got
stuck too.  The service is named lvm2-monitor.  So I did this:

service lvm2-monitor force-stop

Perhaps that is why you couldn't reproduce it? :)  With lvm2-monitor stopped,
the reboot eventually removes iw_cxgb4, which triggers the device removal
code in nvme_rdma, and that is how the controllers get deleted.  So the question
in my mind is: should nvme-rdma have some sort of shutdown handler to delete the
controllers?
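
Something like a reboot notifier that walks the controller list and
queues the deletes, maybe?  A rough sketch of what I mean (not tested;
nvme_rdma_ctrl_list, nvme_rdma_ctrl_mutex, __nvme_rdma_del_ctrl() and
nvme_rdma_wq are the names I'm assuming from the current driver):

--
#include <linux/notifier.h>
#include <linux/reboot.h>

static int nvme_rdma_reboot_notify(struct notifier_block *nb,
                unsigned long action, void *data)
{
        struct nvme_rdma_ctrl *ctrl;

        /* queue a delete for every controller the driver knows about */
        mutex_lock(&nvme_rdma_ctrl_mutex);
        list_for_each_entry(ctrl, &nvme_rdma_ctrl_list, list)
                __nvme_rdma_del_ctrl(ctrl);
        mutex_unlock(&nvme_rdma_ctrl_mutex);

        /* the deletes run from a workqueue; wait for them to finish */
        flush_workqueue(nvme_rdma_wq);

        return NOTIFY_OK;
}

static struct notifier_block nvme_rdma_reboot_nb = {
        .notifier_call  = nvme_rdma_reboot_notify,
};

/*
 * register_reboot_notifier(&nvme_rdma_reboot_nb) from module init,
 * unregister_reboot_notifier(&nvme_rdma_reboot_nb) from module exit.
 */
--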


* nvmf host shutdown hangs when nvmf controllers are in recovery/reconnect
  2016-08-24 11:20   ` Sagi Grimberg
  2016-08-24 20:25     ` Steve Wise
       [not found]     ` <021d01d1fe45$af92ff60$0eb8fe20$@opengridcomputing.com>
@ 2016-08-25 21:58     ` Sagi Grimberg
  2016-08-25 22:05       ` Steve Wise
  2 siblings, 1 reply; 9+ messages in thread
From: Sagi Grimberg @ 2016-08-25 21:58 UTC (permalink / raw)



> I suspect I know what is going on...
>
> When we get a surprise disconnect from the target we queue
> a periodic reconnect (which is the sane thing to do...).
>
> We only move the queues out of CONNECTED when we retry
> to reconnect (after 10 seconds in the default case) but we stop
> the blk queues immediately so we are not bothered with traffic from
> now on. If delete() is kicking off in this period the queues are still
> in CONNECTED state.
>
> Part of the delete sequence tries to issue a ctrl shutdown if the
> admin queue is CONNECTED (which it is!). This request is issued but
> gets stuck in blk-mq waiting for the queues to start again. This might
> be what is preventing us from making forward progress...
>
> Steve, care to check if the below patch makes things better?
>
> The patch tries to separate the queue flags into CONNECTED and
> DELETING. Now we will move out of CONNECTED as soon as error recovery
> kicks in (before stopping the queues), and DELETING is set when
> we start the queue deletion.

Steve, did you get around to having a look at this?

I managed to reproduce this on my setup and the patch
makes it go away...

>
> --
> diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
> index 23297c5f85ed..75b49c29b890 100644
> --- a/drivers/nvme/host/rdma.c
> +++ b/drivers/nvme/host/rdma.c
> @@ -86,6 +86,7 @@ struct nvme_rdma_request {
>
>  enum nvme_rdma_queue_flags {
>         NVME_RDMA_Q_CONNECTED = (1 << 0),
> +       NVME_RDMA_Q_DELETING  = (1 << 1),
>  };
>
>  struct nvme_rdma_queue {
> @@ -612,7 +613,7 @@ static void nvme_rdma_free_queue(struct
> nvme_rdma_queue *queue)
>
>  static void nvme_rdma_stop_and_free_queue(struct nvme_rdma_queue *queue)
>  {
> -       if (!test_and_clear_bit(NVME_RDMA_Q_CONNECTED, &queue->flags))
> +       if (test_and_set_bit(NVME_RDMA_Q_DELETING, &queue->flags))
>                 return;
>         nvme_rdma_stop_queue(queue);
>         nvme_rdma_free_queue(queue);
> @@ -764,8 +765,13 @@ static void nvme_rdma_error_recovery_work(struct
> work_struct *work)
>  {
>         struct nvme_rdma_ctrl *ctrl = container_of(work,
>                         struct nvme_rdma_ctrl, err_work);
> +       int i;
>
>         nvme_stop_keep_alive(&ctrl->ctrl);
> +
> +       for (i = 0; i < ctrl->queue_count; i++)
> +               clear_bit(NVME_RDMA_Q_CONNECTED, &ctrl->queues[i].flags);
> +
>         if (ctrl->queue_count > 1)
>                 nvme_stop_queues(&ctrl->ctrl);
>         blk_mq_stop_hw_queues(ctrl->ctrl.admin_q);
> @@ -1331,7 +1337,7 @@ static int nvme_rdma_device_unplug(struct
> nvme_rdma_queue *queue)
>         cancel_delayed_work_sync(&ctrl->reconnect_work);
>
>         /* Disable the queue so ctrl delete won't free it */
> -       if (test_and_clear_bit(NVME_RDMA_Q_CONNECTED, &queue->flags)) {
> +       if (!test_and_set_bit(NVME_RDMA_Q_DELETING, &queue->flags)) {
>                 /* Free this queue ourselves */
>                 nvme_rdma_stop_queue(queue);
>                 nvme_rdma_destroy_queue_ib(queue);
> --


* nvmf host shutdown hangs when nvmf controllers are in recovery/reconnect
  2016-08-25 21:58     ` Sagi Grimberg
@ 2016-08-25 22:05       ` Steve Wise
  0 siblings, 0 replies; 9+ messages in thread
From: Steve Wise @ 2016-08-25 22:05 UTC (permalink / raw)


> > I suspect I know what is going on...
> >
> > When we get a surprise disconnect from the target we queue
> > a periodic reconnect (which is the sane thing to do...).
> >
> > We only move the queues out of CONNECTED when we retry
> > to reconnect (after 10 seconds in the default case) but we stop
> > the blk queues immediately so we are not bothered with traffic from
> > now on. If delete() is kicking off in this period the queues are still
> > in CONNECTED state.
> >
> > Part of the delete sequence tries to issue a ctrl shutdown if the
> > admin queue is CONNECTED (which it is!). This request is issued but
> > gets stuck in blk-mq waiting for the queues to start again. This might
> > be what is preventing us from making forward progress...
> >
> > Steve, care to check if the below patch makes things better?
> >
> > The patch tries to separate the queue flags into CONNECTED and
> > DELETING. Now we will move out of CONNECTED as soon as error recovery
> > kicks in (before stopping the queues), and DELETING is set when
> > we start the queue deletion.
> 
> Steve, did you get around to have a look at this?
> 
> I managed to reproduce this on my setup and the patch
> makes it go away...

Yes, I think it is needed.

Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>

Thanks!!

Steve.


* nvmf host shutdown hangs when nvmf controllers are in recovery/reconnect
       [not found] <00df01d1fd4d$10ea8890$32bf99b0$@opengridcomputing.com>
@ 2016-08-23 14:58 ` Steve Wise
  0 siblings, 0 replies; 9+ messages in thread
From: Steve Wise @ 2016-08-23 14:58 UTC (permalink / raw)


> Hey guys, when I force an nvmf host into kato recovery/reconnect mode by
> killing the target, and then reboot the host, it hangs forever because the
> nvmf host controllers never get a delete command, so they stay stuck in
> reconnect state.
> 
> Here is the dmesg log:
> 
> <... one nvmf device connected...>
> 
> [  255.079939] nvme nvme1: creating 32 I/O queues.
> [  255.377218] nvme nvme1: new ctrl: NQN "test-ram0", addr 10.0.1.14:4420
> 
> 
> <... target rebooted here via 'reboot -f'...>
> 
> [  264.768555] cxgb4 0000:83:00.4: Port 0 link down, reason: Link Down
> [  264.777520] cxgb4 0000:83:00.4 eth10: link down
> [  265.177225] nvme nvme1: RECV for CQE 0xffff88101d6f3568 failed with
> status WR flushed (5)
> [  265.177306] nvme nvme1: reconnecting in 10 seconds
> [  265.748213] cxgb4 0000:82:00.4: Port 0 link down, reason: Link Down
> [  265.755478] cxgb4 0000:82:00.4 eth2: link down
> [  266.183927] mlx4_en: eth14: Link Down
> [  276.387127] nvme nvme1: rdma_resolve_addr wait failed (-110).
> [  283.116153] nvme nvme1: Failed reconnect attempt, requeueing...
> 
> <... host 'reboot' issued here...>
> 
> Stopping certmonger: [  OK  ]
> 
> Running guests on default URI: no running guests.
> 
> Stopping libvirtd daemon: [  OK  ]
> Stopping atd: [  OK  ]
> Shutting down console mouse services: [  OK  ]
> Stopping ksmtuned: [  OK  ]
> Stopping abrt daemon: [  OK  ]
> Stopping sshd: [  OK  ]
> Stopping mcelog
> Stopping xinetd: [  OK  ]
> Stopping crond: [  OK  ]
> Stopping automount: [  OK  ]
> Stopping HAL daemon: [  OK  ]
> Stopping block device availability: Deactivating block devices:
> [  OK  ]
> Stopping cgdcbxd: [  OK  ]
> Stopping lldpad: [  OK  ]
> Stopping system message bus: [  OK  ]
> Shutting down ca[  290.560113] CacheFiles: File cache on sda2
> unregistering
> chefilesd: [  290.566076] FS-Cache: Withdrawing cache "mycache"
> [  OK  ]
> Stopping rpcbind: [  OK  ]
> Stopping auditd: [  290.809894] audit: type=1305 audit(1471963093.850:82):
> audit_pid=0 old=3011 auid=4294967295 ses=4294967295 res=1
> [  OK  ]
> [  290.908238] audit: type=1305 audit(1471963093.948:83): audit_enabled=0
> old=1 auid=4294967295 ses=4294967295 res=1
> Shutting down system logger: [  OK  ]
> Shutting down interface eth8:  [  OK  ]
> Shutting down loopback interface:  [  OK  ]
> Stopping cgconfig service: [  OK  ]
> Stopping virt-who: [  OK  ]
> [  294.307812] nvme nvme1: rdma_resolve_addr wait failed (-110).
> [  301.035260] nvme nvme1: Failed reconnect attempt, requeueing...
> [  312.228468] nvme nvme1: rdma_resolve_addr wait failed (-110).
> [  312.234310] nvme nvme1: Failed reconnect attempt, requeueing...
> [  323.492871] nvme nvme1: rdma_resolve_addr wait failed (-110).
> [  323.498713] nvme nvme1: Failed reconnect attempt, requeueing...
> [  334.757296] nvme nvme1: rdma_resolve_addr wait failed (-110).
> [  334.763162] nvme nvme1: Failed reconnect attempt, requeueing...
> 
> <..stuck forever...>
> 

Eventually I see this stuck thread:

[  492.971125] INFO: task vgs:4755 blocked for more than 120 seconds.
[  492.977409]       Tainted: G            E
4.8.0-rc2nvmf-4.8-rc-rebased-rc2-harsha+ #16
[  492.985606] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
message.
[  492.993538] vgs             D ffff880fefadf8e8     0  4755   4754 0x10000080
[  493.000749]  ffff880fefadf8e8 ffffffff81c0d4c0 ffff880fefa1ab80
ffff88103415fd30
[  493.010277]  00000001e7fb1240 ffff880fefadf8b8 ffff880fefadc008
ffff88103ee19300
[  493.019728]  7fffffffffffffff 0000000000000000 0000000000000000
ffff880fefadf938
[  493.029121] Call Trace:
[  493.033295]  [<ffffffff816ddde0>] schedule+0x40/0xb0
[  493.039937]  [<ffffffff816e0a8d>] schedule_timeout+0x2ad/0x410
[  493.047364]  [<ffffffff8132d6d2>] ? blk_flush_plug_list+0x132/0x2e0
[  493.055234]  [<ffffffff810fe67c>] ? ktime_get+0x4c/0xc0
[  493.061957]  [<ffffffff8132c92c>] ? generic_make_request+0xfc/0x1d0
[  493.069721]  [<ffffffff816dd6c4>] io_schedule_timeout+0xa4/0x110
[  493.077160]  [<ffffffff81269cb9>] dio_await_one+0x99/0xe0
[  493.083973]  [<ffffffff8126d359>] do_blockdev_direct_IO+0x919/0xc00
[  493.091636]  [<ffffffff81267350>] ? I_BDEV+0x20/0x20
[  493.097946]  [<ffffffff81267350>] ? I_BDEV+0x20/0x20
[  493.104195]  [<ffffffff8115527b>] ? rb_reserve_next_event+0xdb/0x230
[  493.111831]  [<ffffffff811547ba>] ? rb_commit+0x10a/0x1a0
[  493.118450]  [<ffffffff8126d67a>] __blockdev_direct_IO+0x3a/0x40
[  493.125657]  [<ffffffff81267b83>] blkdev_direct_IO+0x43/0x50
[  493.132468]  [<ffffffff81199ef7>] generic_file_read_iter+0xf7/0x110
[  493.139890]  [<ffffffff81267657>] blkdev_read_iter+0x37/0x40
[  493.146664]  [<ffffffff8122b15c>] __vfs_read+0xfc/0x120
[  493.153009]  [<ffffffff8122b22e>] vfs_read+0xae/0xf0
[  493.158877]  [<ffffffff81249633>] ? __fdget+0x13/0x20
[  493.164810]  [<ffffffff8122bd36>] SyS_read+0x56/0xc0
[  493.170651]  [<ffffffff81003e7d>] do_syscall_64+0x7d/0x230
[  493.177027]  [<ffffffff8106f057>] ? do_page_fault+0x37/0x90
[  493.183474]  [<ffffffff816e1921>] entry_SYSCALL64_slow_path+0x25/0x25


