* Host reconnecting needs more than 60s to start after nvmetcli clear on target
       [not found] <506214967.5315130.1504869110572.JavaMail.zimbra@redhat.com>
@ 2017-09-08 11:44 ` Yi Zhang
  2017-09-18 15:58   ` Christoph Hellwig
  0 siblings, 1 reply; 7+ messages in thread
From: Yi Zhang @ 2017-09-08 11:44 UTC (permalink / raw)


Hi

I found this issue on the latest 4.13. Is this by design? I cannot reproduce it on 4.12.

Here is the log from host:
4.13
[  637.246798] nvme nvme0: new ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery", addr 172.31.0.90:4420
[  637.436315] nvme nvme0: creating 40 I/O queues.
[  637.939988] nvme nvme0: new ctrl: NQN "testnqn", addr 172.31.0.90:4420
[  645.319803] nvme nvme0: rescanning

--> it takes more than 60 seconds before the reconnect starts

[  706.073551] nvme nvme0: Reconnecting in 10 seconds...
[  717.740550] nvme nvme0: Connect rejected: status 8 (invalid service ID).
[  717.748061] nvme nvme0: rdma_resolve_addr wait failed (-104).
[  717.754495] nvme nvme0: Failed reconnect attempt 1
[  717.759853] nvme nvme0: Reconnecting in 10 seconds...
[  728.530246] nvme nvme0: Connect rejected: status 8 (invalid service ID).
[  728.537757] nvme nvme0: rdma_resolve_addr wait failed (-104).

4.12
[  106.264737] nvme nvme0: new ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery", addr 172.31.0.90:4420
[  106.390110] nvme nvme0: creating 40 I/O queues.
[  106.903876] nvme nvme0: new ctrl: NQN "testnqn", addr 172.31.0.90:4420
[  116.865211] nvme nvme0: rescanning
[  116.912470] nvme nvme0: Reconnecting in 10 seconds...
[  127.129986] nvme nvme0: Connect rejected: status 8 (invalid service ID).
[  127.137492] nvme nvme0: rdma_resolve_addr wait failed (-104).
[  127.144116] nvme nvme0: Failed reconnect attempt 1
[  127.149474] nvme nvme0: Reconnecting in 10 seconds...
[  137.343403] nvme nvme0: Connect rejected: status 8 (invalid service ID).
[  137.350904] nvme nvme0: rdma_resolve_addr wait failed (-104).


Best Regards,
  Yi Zhang


* Host reconnecting needs more than 60s to start after nvmetcli clear on target
  2017-09-08 11:44 ` Host reconnecting needs more than 60s to start after nvmetcli clear on target Yi Zhang
@ 2017-09-18 15:58   ` Christoph Hellwig
  2017-09-19  9:09     ` Yi Zhang
  0 siblings, 1 reply; 7+ messages in thread
From: Christoph Hellwig @ 2017-09-18 15:58 UTC (permalink / raw)


On Fri, Sep 08, 2017 at 07:44:37AM -0400, Yi Zhang wrote:
> Hi
> 
> I found this issue on the latest 4.13. Is this by design? I cannot reproduce it on 4.12.
> 
> Here is the log from host:
> 4.13
> [  637.246798] nvme nvme0: new ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery", addr 172.31.0.90:4420
> [  637.436315] nvme nvme0: creating 40 I/O queues.
> [  637.939988] nvme nvme0: new ctrl: NQN "testnqn", addr 172.31.0.90:4420
> [  645.319803] nvme nvme0: rescanning
> 
> --> it takes more than 60 seconds before the reconnect starts
> 
> [  706.073551] nvme nvme0: Reconnecting in 10 seconds...

How did you initiate the reconnect?  Cable drop?


* Host reconnecting needs more than 60s to start after nvmetcli clear on target
  2017-09-18 15:58   ` Christoph Hellwig
@ 2017-09-19  9:09     ` Yi Zhang
  2017-09-19 14:18       ` Christoph Hellwig
  0 siblings, 1 reply; 7+ messages in thread
From: Yi Zhang @ 2017-09-19  9:09 UTC (permalink / raw)




On 09/18/2017 11:58 PM, Christoph Hellwig wrote:
> On Fri, Sep 08, 2017 at 07:44:37AM -0400, Yi Zhang wrote:
>> Hi
>>
>> I found this issue on the latest 4.13. Is this by design? I cannot reproduce it on 4.12.
>>
>> Here is the log from host:
>> 4.13
>> [  637.246798] nvme nvme0: new ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery", addr 172.31.0.90:4420
>> [  637.436315] nvme nvme0: creating 40 I/O queues.
>> [  637.939988] nvme nvme0: new ctrl: NQN "testnqn", addr 172.31.0.90:4420
>> [  645.319803] nvme nvme0: rescanning
>>
>> --> it takes more than 60 seconds before the reconnect starts
>>
>> [  706.073551] nvme nvme0: Reconnecting in 10 seconds...
> How did you initiate the reconnect?  Cable drop?

Just execute "nvmetcli clear" on the target side, and check the log on the
host side.



* Host reconnecting needs more than 60s to start after nvmetcli clear on target
  2017-09-19  9:09     ` Yi Zhang
@ 2017-09-19 14:18       ` Christoph Hellwig
  2017-09-20  0:50         ` Yi Zhang
  2017-09-20 11:26         ` Sagi Grimberg
  0 siblings, 2 replies; 7+ messages in thread
From: Christoph Hellwig @ 2017-09-19 14:18 UTC (permalink / raw)


On Tue, Sep 19, 2017 at 05:09:05PM +0800, Yi Zhang wrote:
> 
> 
> On 09/18/2017 11:58 PM, Christoph Hellwig wrote:
> > On Fri, Sep 08, 2017 at 07:44:37AM -0400, Yi Zhang wrote:
> > > Hi
> > > 
> > > I found this issue on the latest 4.13. Is this by design? I cannot reproduce it on 4.12.
> > > 
> > > Here is the log from host:
> > > 4.13
> > > [  637.246798] nvme nvme0: new ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery", addr 172.31.0.90:4420
> > > [  637.436315] nvme nvme0: creating 40 I/O queues.
> > > [  637.939988] nvme nvme0: new ctrl: NQN "testnqn", addr 172.31.0.90:4420
> > > [  645.319803] nvme nvme0: rescanning
> > > 
> > > --> it takes more than 60 seconds before the reconnect starts
> > > 
> > > [  706.073551] nvme nvme0: Reconnecting in 10 seconds...
> > How did you initiate the reconnect?  Cable drop?
> 
> Just execute "nvmetcli clear" on the target side, and check the log on the
> host side.

Ok.  60 seconds is when the first commands will time out, so that's
expected.  The NVMeoF protocol has no way to notify the host that
a connection went away, so if you aren't on a protocol that supports
link up/down notifications we'll have to wait for timeouts.
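
For context, a rough sketch of the host-side code behind this (approximate and
from memory, not verbatim 4.13 source; the constant and helper names below are
recalled, not quoted from this thread): fabrics admin commands default to a
60-second timeout, and when no fabric-level link-down notification arrives it
is a timed-out request that kicks the RDMA transport into error recovery.

/* Approximate 4.13 host code, for illustration only. */
#define ADMIN_TIMEOUT	(60 * HZ)	/* default admin command timeout */

static enum blk_eh_timer_return
nvme_rdma_timeout(struct request *rq, bool reserved)
{
	struct nvme_rdma_request *req = blk_mq_rq_to_pdu(rq);

	/* No link-down event was seen; the expiring request is what
	 * schedules error recovery (and eventually the reconnect). */
	nvme_rdma_error_recovery(req->queue->ctrl);

	/* Fail the timed-out command so upper layers can make progress. */
	nvme_req(rq)->status = NVME_SC_ABORT_REQ | NVME_SC_DNR;

	return BLK_EH_HANDLED;
}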


* Host reconnecting needs more than 60s to start after nvmetcli clear on target
  2017-09-19 14:18       ` Christoph Hellwig
@ 2017-09-20  0:50         ` Yi Zhang
  2017-09-20 11:26         ` Sagi Grimberg
  1 sibling, 0 replies; 7+ messages in thread
From: Yi Zhang @ 2017-09-20  0:50 UTC (permalink / raw)




On 09/19/2017 10:18 PM, Christoph Hellwig wrote:
On Tue, Sep 19, 2017 at 05:09:05PM +0800, Yi Zhang wrote:
>>
>> On 09/18/2017 11:58 PM, Christoph Hellwig wrote:
>>> On Fri, Sep 08, 2017 at 07:44:37AM -0400, Yi Zhang wrote:
>>>> Hi
>>>>
>>>> I found this issue on the latest 4.13. Is this by design? I cannot reproduce it on 4.12.
>>>>
>>>> Here is the log from host:
>>>> 4.13
>>>> [  637.246798] nvme nvme0: new ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery", addr 172.31.0.90:4420
>>>> [  637.436315] nvme nvme0: creating 40 I/O queues.
>>>> [  637.939988] nvme nvme0: new ctrl: NQN "testnqn", addr 172.31.0.90:4420
>>>> [  645.319803] nvme nvme0: rescanning
>>>>
>>>> --> it takes more than 60 seconds before the reconnect starts
>>>>
>>>> [  706.073551] nvme nvme0: Reconnecting in 10 seconds...
>>> How did you initiate the reconnect?  Cable drop?
>> Just execute "nvmetcli clear" on the target side, and check the log on the
>> host side.
> Ok.  60 seconds is when the first commands will time out, so that's
> expected.  The NVMeoF protocol has no way to notify the host that
> a connection went away, so if you aren't on a protocol that supports
> link up/down notifications we'll have to wait for timeouts.
Got it, thanks Christoph.


* Host reconnecting needs more than 60s to start after nvmetcli clear on target
  2017-09-19 14:18       ` Christoph Hellwig
  2017-09-20  0:50         ` Yi Zhang
@ 2017-09-20 11:26         ` Sagi Grimberg
  2017-09-21  3:25           ` Yi Zhang
  1 sibling, 1 reply; 7+ messages in thread
From: Sagi Grimberg @ 2017-09-20 11:26 UTC (permalink / raw)



>>>> Hi
>>>>
>>>> I found this issue on the latest 4.13. Is this by design? I cannot reproduce it on 4.12.
>>>>
>>>> Here is the log from host:
>>>> 4.13
>>>> [  637.246798] nvme nvme0: new ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery", addr 172.31.0.90:4420
>>>> [  637.436315] nvme nvme0: creating 40 I/O queues.
>>>> [  637.939988] nvme nvme0: new ctrl: NQN "testnqn", addr 172.31.0.90:4420
>>>> [  645.319803] nvme nvme0: rescanning
>>>>
>>>> --> it takes more than 60 seconds before the reconnect starts
>>>>
>>>> [  706.073551] nvme nvme0: Reconnecting in 10 seconds...
>>> How did you initiate the reconnect?  Cable drop?
>>
>> Just execute "nvmetcli clear" on the target side, and check the log on the
>> host side.
> 
> Ok.  60 seconds is when the first commands will time out, so that's
> expected.  The NVMeoF protocol has no way to notify the host that
> a connection went away, so if you aren't on a protocol that supports
> link up/down notifications we'll have to wait for timeouts.

That's not entirely true.

Yes, there is no explicit indication, but the keep-alive should expire
faster than 60 seconds (it's actually 5 seconds by default). The point
here is that this isn't really a cable pull; it's removal of the subsystem,
with the namespaces removed just before it, which triggers a rescan.

In RDMA error recovery we first of all call nvme_stop_ctrl(), which flushes
the scan_work, and that waits for the Identify command to time out (60-second
admin timeout). But in error recovery we shouldn't really call the full
stop_ctrl; we just need to stop the keep-alive so it gets out
of the way...

Does this fix your issue?
--
diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 4460ec3a2c0f..2d2afb5e8102 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -966,7 +966,7 @@ static void nvme_rdma_error_recovery_work(struct work_struct *work)
         struct nvme_rdma_ctrl *ctrl = container_of(work,
                         struct nvme_rdma_ctrl, err_work);

-       nvme_stop_ctrl(&ctrl->ctrl);
+       nvme_stop_keep_alive(ctrl);

         if (ctrl->ctrl.queue_count > 1) {
                 nvme_stop_queues(&ctrl->ctrl);
--

This was the original code, I replaced it (incorrectly I think) when
introducing nvme_stop_ctrl.
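
For reference, a rough sketch of what the two helpers do (approximate and from
memory, not verbatim core.c), which shows why the full stop blocks behind the
60-second admin timeout while stopping only the keep-alive returns quickly:

/* Approximate core helpers, for illustration only. */
void nvme_stop_keep_alive(struct nvme_ctrl *ctrl)
{
	/* Only cancels the periodic keep-alive work; does not block on
	 * outstanding admin commands. */
	cancel_delayed_work_sync(&ctrl->ka_work);
}

void nvme_stop_ctrl(struct nvme_ctrl *ctrl)
{
	nvme_stop_keep_alive(ctrl);
	flush_work(&ctrl->async_event_work);
	/* Flushing scan_work waits for the in-flight Identify to time out
	 * (the 60-second admin timeout), which is the delay in the log. */
	flush_work(&ctrl->scan_work);
}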


* Host reconnecting needs more than 60s to start after nvmetcli clear on target
  2017-09-20 11:26         ` Sagi Grimberg
@ 2017-09-21  3:25           ` Yi Zhang
  0 siblings, 0 replies; 7+ messages in thread
From: Yi Zhang @ 2017-09-21  3:25 UTC (permalink / raw)




On 09/20/2017 07:26 PM, Sagi Grimberg wrote:
>
>>>>> Hi
>>>>>
>>>>> I found this issue on the latest 4.13. Is this by design? I cannot
>>>>> reproduce it on 4.12.
>>>>>
>>>>> Here is the log from host:
>>>>> 4.13
>>>>> [  637.246798] nvme nvme0: new ctrl: NQN 
>>>>> "nqn.2014-08.org.nvmexpress.discovery", addr 172.31.0.90:4420
>>>>> [  637.436315] nvme nvme0: creating 40 I/O queues.
>>>>> [  637.939988] nvme nvme0: new ctrl: NQN "testnqn", addr 
>>>>> 172.31.0.90:4420
>>>>> [  645.319803] nvme nvme0: rescanning
>>>>>
>>>>> --> it takes more than 60 seconds before the reconnect starts
>>>>>
>>>>> [  706.073551] nvme nvme0: Reconnecting in 10 seconds...
>>>> How did you initiate the reconnect?  Cable drop?
>>>
>>> Just execute "nvmetcli clear" on the target side, and check the log on the
>>> host side.
>>
>> Ok.  60 seconds is when the first commands will time out, so that's
>> expected.  The NVMeoF protocol has no way to notify the host that
>> a connection went away, so if you aren't on a protocol that supports
>> link up/down notifications we'll have to wait for timeouts.
>
> That's not entirely true.
>
> Yes, there is no explicit indication, but the keep-alive should expire
> faster than 60 seconds (it's actually 5 seconds by default). The point
> here is that this isn't really a cable pull; it's removal of the subsystem,
> with the namespaces removed just before it, which triggers a rescan.
>
> In RDMA error recovery we first of all call nvme_stop_ctrl(), which flushes
> the scan_work, and that waits for the Identify command to time out (60-second
> admin timeout). But in error recovery we shouldn't really call the full
> stop_ctrl; we just need to stop the keep-alive so it gets out
> of the way...
>
> Does this fix your issue?
Hi Sagi

Your patch works; actually we should use nvme_stop_keep_alive(&ctrl->ctrl)
instead of nvme_stop_keep_alive(ctrl). :)
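
For completeness, a sketch of the hunk with that type fix applied
(nvme_stop_keep_alive() takes a struct nvme_ctrl *, hence &ctrl->ctrl):

-       nvme_stop_ctrl(&ctrl->ctrl);
+       nvme_stop_keep_alive(&ctrl->ctrl);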

Here is the log:
[  599.979081] nvme nvme0: new ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery", addr 172.31.0.90:4420
[  600.116311] nvme nvme0: creating 40 I/O queues.
[  600.630916] nvme nvme0: new ctrl: NQN "testnqn", addr 172.31.0.90:4420
[  606.107455] nvme nvme0: rescanning
[  606.265619] nvme nvme0: Reconnecting in 10 seconds...
[  616.367326] nvme nvme0: Connect rejected: status 8 (invalid service ID).
[  616.374831] nvme nvme0: rdma_resolve_addr wait failed (-104).
[  616.381262] nvme nvme0: Failed reconnect attempt 1
[  616.386626] nvme nvme0: Reconnecting in 10 seconds...
[  626.595572] nvme nvme0: Connect rejected: status 8 (invalid service ID).
[  626.603073] nvme nvme0: rdma_resolve_addr wait failed (-104).
[  626.609507] nvme nvme0: Failed reconnect attempt 2
[  626.614899] nvme nvme0: Reconnecting in 10 seconds...
[  636.835354] nvme nvme0: Connect rejected: status 8 (invalid service ID).
[  636.842856] nvme nvme0: rdma_resolve_addr wait failed (-104).

Thanks
Yi

> -- 
> diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
> index 4460ec3a2c0f..2d2afb5e8102 100644
> --- a/drivers/nvme/host/rdma.c
> +++ b/drivers/nvme/host/rdma.c
> @@ -966,7 +966,7 @@ static void nvme_rdma_error_recovery_work(struct work_struct *work)
>         struct nvme_rdma_ctrl *ctrl = container_of(work,
>                         struct nvme_rdma_ctrl, err_work);
>
> -       nvme_stop_ctrl(&ctrl->ctrl);
> +       nvme_stop_keep_alive(ctrl);
>
>         if (ctrl->ctrl.queue_count > 1) {
>                 nvme_stop_queues(&ctrl->ctrl);
> -- 
>
> This was the original code, I replaced it (incorrectly I think) when
> introducing nvme_stop_ctrl.
>


Thread overview: 7+ messages
     [not found] <506214967.5315130.1504869110572.JavaMail.zimbra@redhat.com>
2017-09-08 11:44 ` Host reconnecting needs more than 60s to start after nvmetcli clear on target Yi Zhang
2017-09-18 15:58   ` Christoph Hellwig
2017-09-19  9:09     ` Yi Zhang
2017-09-19 14:18       ` Christoph Hellwig
2017-09-20  0:50         ` Yi Zhang
2017-09-20 11:26         ` Sagi Grimberg
2017-09-21  3:25           ` Yi Zhang
