* Host reconnecting need more than 60s to start after nvmetcli clear on target
[not found] <506214967.5315130.1504869110572.JavaMail.zimbra@redhat.com>
@ 2017-09-08 11:44 ` Yi Zhang
2017-09-18 15:58 ` Christoph Hellwig
0 siblings, 1 reply; 7+ messages in thread
From: Yi Zhang @ 2017-09-08 11:44 UTC (permalink / raw)
Hi
I found this issue on the latest 4.13. Is it by design? I cannot reproduce it on 4.12.
Here is the log from host:
4.13
[ 637.246798] nvme nvme0: new ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery", addr 172.31.0.90:4420
[ 637.436315] nvme nvme0: creating 40 I/O queues.
[ 637.939988] nvme nvme0: new ctrl: NQN "testnqn", addr 172.31.0.90:4420
[ 645.319803] nvme nvme0: rescanning
-->need more than 60 seconds to start reconnect
[ 706.073551] nvme nvme0: Reconnecting in 10 seconds...
[ 717.740550] nvme nvme0: Connect rejected: status 8 (invalid service ID).
[ 717.748061] nvme nvme0: rdma_resolve_addr wait failed (-104).
[ 717.754495] nvme nvme0: Failed reconnect attempt 1
[ 717.759853] nvme nvme0: Reconnecting in 10 seconds...
[ 728.530246] nvme nvme0: Connect rejected: status 8 (invalid service ID).
[ 728.537757] nvme nvme0: rdma_resolve_addr wait failed (-104).
4.12
[ 106.264737] nvme nvme0: new ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery", addr 172.31.0.90:4420
[ 106.390110] nvme nvme0: creating 40 I/O queues.
[ 106.903876] nvme nvme0: new ctrl: NQN "testnqn", addr 172.31.0.90:4420
[ 116.865211] nvme nvme0: rescanning
[ 116.912470] nvme nvme0: Reconnecting in 10 seconds...
[ 127.129986] nvme nvme0: Connect rejected: status 8 (invalid service ID).
[ 127.137492] nvme nvme0: rdma_resolve_addr wait failed (-104).
[ 127.144116] nvme nvme0: Failed reconnect attempt 1
[ 127.149474] nvme nvme0: Reconnecting in 10 seconds...
[ 137.343403] nvme nvme0: Connect rejected: status 8 (invalid service ID).
[ 137.350904] nvme nvme0: rdma_resolve_addr wait failed (-104).
Best Regards,
Yi Zhang
^ permalink raw reply [flat|nested] 7+ messages in thread
* Host reconnecting need more than 60s to start after nvmetcli clear on target
2017-09-08 11:44 ` Host reconnecting need more than 60s to start after nvmetcli clear on target Yi Zhang
@ 2017-09-18 15:58 ` Christoph Hellwig
2017-09-19 9:09 ` Yi Zhang
0 siblings, 1 reply; 7+ messages in thread
From: Christoph Hellwig @ 2017-09-18 15:58 UTC (permalink / raw)
On Fri, Sep 08, 2017@07:44:37AM -0400, Yi Zhang wrote:
> Hi
>
> I found this issue on the latest 4.13. Is it by design? I cannot reproduce it on 4.12.
>
> Here is the log from host:
> 4.13
> [ 637.246798] nvme nvme0: new ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery", addr 172.31.0.90:4420
> [ 637.436315] nvme nvme0: creating 40 I/O queues.
> [ 637.939988] nvme nvme0: new ctrl: NQN "testnqn", addr 172.31.0.90:4420
> [ 645.319803] nvme nvme0: rescanning
>
> -->need more than 60 seconds to start reconnect
>
> [ 706.073551] nvme nvme0: Reconnecting in 10 seconds...
How did you initiate the reconnect? Cable drop?
* Host reconnecting need more than 60s to start after nvmetcli clear on target
2017-09-18 15:58 ` Christoph Hellwig
@ 2017-09-19 9:09 ` Yi Zhang
2017-09-19 14:18 ` Christoph Hellwig
0 siblings, 1 reply; 7+ messages in thread
From: Yi Zhang @ 2017-09-19 9:09 UTC (permalink / raw)
On 09/18/2017 11:58 PM, Christoph Hellwig wrote:
> On Fri, Sep 08, 2017@07:44:37AM -0400, Yi Zhang wrote:
>> Hi
>>
>> I found this issue on the latest 4.13. Is it by design? I cannot reproduce it on 4.12.
>>
>> Here is the log from host:
>> 4.13
>> [ 637.246798] nvme nvme0: new ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery", addr 172.31.0.90:4420
>> [ 637.436315] nvme nvme0: creating 40 I/O queues.
>> [ 637.939988] nvme nvme0: new ctrl: NQN "testnqn", addr 172.31.0.90:4420
>> [ 645.319803] nvme nvme0: rescanning
>>
>> -->need more than 60 seconds to start reconnect
>>
>> [ 706.073551] nvme nvme0: Reconnecting in 10 seconds...
> How did you initiate the reconnect? Cable drop?
Just execute "nvmetcli clear" on the target side, and check the log on the
host side.
> _______________________________________________
> Linux-nvme mailing list
> Linux-nvme at lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-nvme
* Host reconnecting need more than 60s to start after nvmetcli clear on target
2017-09-19 9:09 ` Yi Zhang
@ 2017-09-19 14:18 ` Christoph Hellwig
2017-09-20 0:50 ` Yi Zhang
2017-09-20 11:26 ` Sagi Grimberg
0 siblings, 2 replies; 7+ messages in thread
From: Christoph Hellwig @ 2017-09-19 14:18 UTC (permalink / raw)
On Tue, Sep 19, 2017@05:09:05PM +0800, Yi Zhang wrote:
>
>
> On 09/18/2017 11:58 PM, Christoph Hellwig wrote:
> > On Fri, Sep 08, 2017@07:44:37AM -0400, Yi Zhang wrote:
> > > Hi
> > >
> > > I found this issue on the latest 4.13. Is it by design? I cannot reproduce it on 4.12.
> > >
> > > Here is the log from host:
> > > 4.13
> > > [ 637.246798] nvme nvme0: new ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery", addr 172.31.0.90:4420
> > > [ 637.436315] nvme nvme0: creating 40 I/O queues.
> > > [ 637.939988] nvme nvme0: new ctrl: NQN "testnqn", addr 172.31.0.90:4420
> > > [ 645.319803] nvme nvme0: rescanning
> > >
> > > -->need more than 60 seconds to start reconnect
> > >
> > > [ 706.073551] nvme nvme0: Reconnecting in 10 seconds...
> > How did you initiate the reconnect? Cable drop?
>
> Just execute "nvmetcli clear" on the target side, and check the log on
> the host side.
Ok. 60 seconds is when the first commands will time out, so that's
expected. The NVMeoF protocol has no way to notify the host that
a connection went away, so if you aren't on a protocol that supports
link up/down notifications we'll have to wait for timeouts.
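[Editorial aside: the timestamps in the 4.13 log above are consistent with the admin-timeout explanation; the gap between the rescan and the first reconnect message is roughly the 60-second default admin command timeout. A quick check, with the two timestamps copied from the log:]

```python
# Timestamps (in seconds) copied from the 4.13 host log above.
rescan_ts = 645.319803      # "nvme nvme0: rescanning"
reconnect_ts = 706.073551   # "Reconnecting in 10 seconds..."

delay = reconnect_ts - rescan_ts
print(f"gap before reconnect starts: {delay:.1f}s")  # ~60.8s
```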
* Host reconnecting need more than 60s to start after nvmetcli clear on target
2017-09-19 14:18 ` Christoph Hellwig
@ 2017-09-20 0:50 ` Yi Zhang
2017-09-20 11:26 ` Sagi Grimberg
1 sibling, 0 replies; 7+ messages in thread
From: Yi Zhang @ 2017-09-20 0:50 UTC (permalink / raw)
On 09/19/2017 10:18 PM, Christoph Hellwig wrote:
> On Tue, Sep 19, 2017@05:09:05PM +0800, Yi Zhang wrote:
>>
>> On 09/18/2017 11:58 PM, Christoph Hellwig wrote:
>>> On Fri, Sep 08, 2017@07:44:37AM -0400, Yi Zhang wrote:
>>>> Hi
>>>>
>>>> I found this issue on the latest 4.13. Is it by design? I cannot reproduce it on 4.12.
>>>>
>>>> Here is the log from host:
>>>> 4.13
>>>> [ 637.246798] nvme nvme0: new ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery", addr 172.31.0.90:4420
>>>> [ 637.436315] nvme nvme0: creating 40 I/O queues.
>>>> [ 637.939988] nvme nvme0: new ctrl: NQN "testnqn", addr 172.31.0.90:4420
>>>> [ 645.319803] nvme nvme0: rescanning
>>>>
>>>> -->need more than 60 seconds to start reconnect
>>>>
>>>> [ 706.073551] nvme nvme0: Reconnecting in 10 seconds...
>>> How did you initiate the reconnect? Cable drop?
>> Just execute "nvmetcli clear" on the target side, and check the log on
>> the host side.
> Ok. 60 seconds is when the first commands will time out, so that's
> expected. The NVMeoF protocol has no way to notify the host that
> a connection went away, so if you aren't on a protocol that supports
> link up/down notifications we'll have to wait for timeouts.
Got it, thanks Christoph.
* Host reconnecting need more than 60s to start after nvmetcli clear on target
2017-09-19 14:18 ` Christoph Hellwig
2017-09-20 0:50 ` Yi Zhang
@ 2017-09-20 11:26 ` Sagi Grimberg
2017-09-21 3:25 ` Yi Zhang
1 sibling, 1 reply; 7+ messages in thread
From: Sagi Grimberg @ 2017-09-20 11:26 UTC (permalink / raw)
>>>> Hi
>>>>
>>>> I found this issue on the latest 4.13. Is it by design? I cannot reproduce it on 4.12.
>>>>
>>>> Here is the log from host:
>>>> 4.13
>>>> [ 637.246798] nvme nvme0: new ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery", addr 172.31.0.90:4420
>>>> [ 637.436315] nvme nvme0: creating 40 I/O queues.
>>>> [ 637.939988] nvme nvme0: new ctrl: NQN "testnqn", addr 172.31.0.90:4420
>>>> [ 645.319803] nvme nvme0: rescanning
>>>>
>>>> -->need more than 60 seconds to start reconnect
>>>>
>>>> [ 706.073551] nvme nvme0: Reconnecting in 10 seconds...
>>> How did you initiate the reconnect? Cable drop?
>>
>> Just execute "nvmetcli clear" on the target side, and check the log on
>> the host side.
>
> Ok. 60 seconds is when the first commands will time out, so that's
> expected. The NVMeoF protocol has no way to notify the host that
> a connection went away, so if you aren't on a protocol that supports
> link up/down notifications we'll have to wait for timeouts.
That's not entirely true.
Yes, there is no explicit indication, but the keep-alive should expire
faster than 60 seconds (it's actually 5 seconds by default). The point
here is that it's not really a cable pull; it's a removal of the
subsystem, and the namespace removal just before that triggers a rescan.
In RDMA error recovery we first of all call nvme_stop_ctrl(), which
flushes scan_work, and that waits for the in-flight Identify to time
out (60-second admin timeout). But in error recovery we shouldn't
really call the full stop_ctrl(); we just need to stop the keep-alive
so it gets out of the way...
Does this fix your issue?
--
diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 4460ec3a2c0f..2d2afb5e8102 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -966,7 +966,7 @@ static void nvme_rdma_error_recovery_work(struct work_struct *work)
 	struct nvme_rdma_ctrl *ctrl = container_of(work,
 			struct nvme_rdma_ctrl, err_work);

-	nvme_stop_ctrl(&ctrl->ctrl);
+	nvme_stop_keep_alive(ctrl);

 	if (ctrl->ctrl.queue_count > 1) {
 		nvme_stop_queues(&ctrl->ctrl);
--
This was the original code; I replaced it (incorrectly, I think) when
introducing nvme_stop_ctrl.
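[Editorial aside: to make the ordering issue concrete, here is a toy model of the two recovery paths, illustrative Python rather than kernel code, using the default timings mentioned in this thread. Flushing scan_work first serializes recovery behind the admin timeout; stopping only the keep-alive lets the reconnect start right away.]

```python
# Toy model (not kernel code) of the two error-recovery orderings.
# The constants are the defaults discussed above, assumed for illustration.

ADMIN_TIMEOUT = 60.0     # default admin command timeout, seconds
RECONNECT_DELAY = 10.0   # default reconnect delay, seconds

def seconds_until_first_reconnect(flush_scan_work: bool) -> float:
    """Time from error detection to the first reconnect attempt."""
    t = 0.0
    if flush_scan_work:
        # nvme_stop_ctrl() flushes scan_work, so recovery blocks until
        # the in-flight Identify from the rescan hits the admin timeout.
        t += ADMIN_TIMEOUT
    # Either way, recovery then schedules the reconnect attempt.
    t += RECONNECT_DELAY
    return t

print(seconds_until_first_reconnect(True))   # 70.0 -> the 4.13 behaviour
print(seconds_until_first_reconnect(False))  # 10.0 -> with the patch
```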
* Host reconnecting need more than 60s to start after nvmetcli clear on target
2017-09-20 11:26 ` Sagi Grimberg
@ 2017-09-21 3:25 ` Yi Zhang
0 siblings, 0 replies; 7+ messages in thread
From: Yi Zhang @ 2017-09-21 3:25 UTC (permalink / raw)
On 09/20/2017 07:26 PM, Sagi Grimberg wrote:
>
>>>>> Hi
>>>>>
>>>>> I found this issue on the latest 4.13. Is it by design? I cannot
>>>>> reproduce it on 4.12.
>>>>>
>>>>> Here is the log from host:
>>>>> 4.13
>>>>> [ 637.246798] nvme nvme0: new ctrl: NQN
>>>>> "nqn.2014-08.org.nvmexpress.discovery", addr 172.31.0.90:4420
>>>>> [ 637.436315] nvme nvme0: creating 40 I/O queues.
>>>>> [ 637.939988] nvme nvme0: new ctrl: NQN "testnqn", addr
>>>>> 172.31.0.90:4420
>>>>> [ 645.319803] nvme nvme0: rescanning
>>>>>
>>>>> -->need more than 60 seconds to start reconnect
>>>>>
>>>>> [ 706.073551] nvme nvme0: Reconnecting in 10 seconds...
>>>> How did you initiate the reconnect? Cable drop?
>>>
>>> Just execute "nvmetcli clear" on the target side, and check the log on
>>> the host side.
>>
>> Ok. 60 seconds is when the first commands will time out, so that's
>> expected. The NVMeoF protocol has no way to notify the host that
>> a connection went away, so if you aren't on a protocol that supports
>> link up/down notifications we'll have to wait for timeouts.
>
> That's not entirely true.
>
> Yes, there is no explicit indication, but the keep-alive should expire
> faster than 60 seconds (it's actually 5 seconds by default). The point
> here is that it's not really a cable pull; it's a removal of the
> subsystem, and the namespace removal just before that triggers a rescan.
>
> In RDMA error recovery we first of all call nvme_stop_ctrl(), which
> flushes scan_work, and that waits for the in-flight Identify to time
> out (60-second admin timeout). But in error recovery we shouldn't
> really call the full stop_ctrl(); we just need to stop the keep-alive
> so it gets out of the way...
>
> Does this fix your issue?
Hi Sagi
Your patch works. Actually, we should use
nvme_stop_keep_alive(&ctrl->ctrl) instead of nvme_stop_keep_alive(ctrl). :)
Here is the log:
[ 599.979081] nvme nvme0: new ctrl: NQN
"nqn.2014-08.org.nvmexpress.discovery", addr 172.31.0.90:4420
[ 600.116311] nvme nvme0: creating 40 I/O queues.
[ 600.630916] nvme nvme0: new ctrl: NQN "testnqn", addr 172.31.0.90:4420
[ 606.107455] nvme nvme0: rescanning
[ 606.265619] nvme nvme0: Reconnecting in 10 seconds...
[ 616.367326] nvme nvme0: Connect rejected: status 8 (invalid service ID).
[ 616.374831] nvme nvme0: rdma_resolve_addr wait failed (-104).
[ 616.381262] nvme nvme0: Failed reconnect attempt 1
[ 616.386626] nvme nvme0: Reconnecting in 10 seconds...
[ 626.595572] nvme nvme0: Connect rejected: status 8 (invalid service ID).
[ 626.603073] nvme nvme0: rdma_resolve_addr wait failed (-104).
[ 626.609507] nvme nvme0: Failed reconnect attempt 2
[ 626.614899] nvme nvme0: Reconnecting in 10 seconds...
[ 636.835354] nvme nvme0: Connect rejected: status 8 (invalid service ID).
[ 636.842856] nvme nvme0: rdma_resolve_addr wait failed (-104).
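[Editorial aside: for reference, the changed line of the hunk with the &ctrl->ctrl argument fix applied, as tested above, would read:]

```diff
-	nvme_stop_ctrl(&ctrl->ctrl);
+	nvme_stop_keep_alive(&ctrl->ctrl);
```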
Thanks
Yi
> --
> diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
> index 4460ec3a2c0f..2d2afb5e8102 100644
> --- a/drivers/nvme/host/rdma.c
> +++ b/drivers/nvme/host/rdma.c
> @@ -966,7 +966,7 @@ static void nvme_rdma_error_recovery_work(struct work_struct *work)
>  	struct nvme_rdma_ctrl *ctrl = container_of(work,
>  			struct nvme_rdma_ctrl, err_work);
>
> -	nvme_stop_ctrl(&ctrl->ctrl);
> +	nvme_stop_keep_alive(ctrl);
>
>  	if (ctrl->ctrl.queue_count > 1) {
>  		nvme_stop_queues(&ctrl->ctrl);
> --
>
> This was the original code; I replaced it (incorrectly, I think) when
> introducing nvme_stop_ctrl.
>
end of thread, other threads:[~2017-09-21 3:25 UTC | newest]
Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
[not found] <506214967.5315130.1504869110572.JavaMail.zimbra@redhat.com>
2017-09-08 11:44 ` Host reconnecting need more than 60s to start after nvmetcli clear on target Yi Zhang
2017-09-18 15:58 ` Christoph Hellwig
2017-09-19 9:09 ` Yi Zhang
2017-09-19 14:18 ` Christoph Hellwig
2017-09-20 0:50 ` Yi Zhang
2017-09-20 11:26 ` Sagi Grimberg
2017-09-21 3:25 ` Yi Zhang