* NVMeoF: multipath stuck after bringing one ethernet port down
@ 2017-05-17 16:08 Alex Turin
2017-05-17 17:28 ` Sagi Grimberg
0 siblings, 1 reply; 18+ messages in thread
From: Alex Turin @ 2017-05-17 16:08 UTC (permalink / raw)
I am trying to test failure scenarios of NVMeoF + multipath. I bring
one of the ports down and expect to see failed paths in "multipath
-ll". Instead, "multipath -ll" gets stuck.
reproduce:
1. Connected to NVMeoF device through 2 ports.
2. Bind them with multipath.
3. Bring one port down (ifconfig eth3 down)
4. Execute "multipath -ll" command and it will get stuck.
From strace I see that multipath is stuck in io_destroy() while
releasing resources. As I understand it, io_destroy() is stuck because
io_cancel() failed, and io_cancel() failed because of the port that
was disabled in step 3.
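For reference, the AIO sequence visible in the strace below can be replayed in miniature with a standalone program. This is only a sketch using the raw Linux AIO syscalls against a temporary file; aio_pread_demo is an illustrative name, not multipath code, and here the read completes in time instead of hanging:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/aio_abi.h>

/* Thin wrappers over the raw Linux AIO syscalls (glibc does not export them). */
static long sys_io_setup(unsigned nr, aio_context_t *ctx)
{
	return syscall(SYS_io_setup, nr, ctx);
}
static long sys_io_submit(aio_context_t ctx, long n, struct iocb **iocbpp)
{
	return syscall(SYS_io_submit, ctx, n, iocbpp);
}
static long sys_io_getevents(aio_context_t ctx, long min_nr, long nr,
			     struct io_event *ev, struct timespec *ts)
{
	return syscall(SYS_io_getevents, ctx, min_nr, nr, ev, ts);
}
static long sys_io_destroy(aio_context_t ctx)
{
	return syscall(SYS_io_destroy, ctx);
}

/*
 * Submit one 4 KiB pread against a temp file, wait for it with the same
 * 2-second timeout seen in the strace ({2, 0}), then tear down the
 * context.  Returns bytes read, or -1 on error.
 */
int aio_pread_demo(void)
{
	char path[] = "/tmp/aio-demo-XXXXXX";
	char buf[4096];
	struct iocb cb;
	struct iocb *cbs[1] = { &cb };
	struct io_event ev;
	struct timespec timeout = { 2, 0 };
	aio_context_t ctx = 0;
	long nread = -1;
	int fd = mkstemp(path);

	if (fd < 0)
		return -1;
	unlink(path);
	memset(buf, 'x', sizeof(buf));
	if (write(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf))
		goto out;

	if (sys_io_setup(1, &ctx))
		goto out;
	memset(&cb, 0, sizeof(cb));
	cb.aio_lio_opcode = IOCB_CMD_PREAD;
	cb.aio_fildes = fd;
	cb.aio_buf = (unsigned long)buf;
	cb.aio_nbytes = sizeof(buf);
	cb.aio_offset = 0;
	if (sys_io_submit(ctx, 1, cbs) != 1)
		goto destroy;

	/* On the broken path this times out (returns 0) and multipath falls
	 * into io_cancel()/io_destroy(); here the read completes in time. */
	if (sys_io_getevents(ctx, 1, 1, &ev, &timeout) == 1)
		nread = ev.res;
destroy:
	/* io_destroy() waits for outstanding I/O; this is the call the
	 * reporter's multipath run never returns from. */
	sys_io_destroy(ctx);
out:
	close(fd);
	return (int)nread;
}
```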
environment:
Kernel: 4.11.1-1.el7.elrepo.x86_64
Network card: ConnectX-4 Lx
rdma-core: version 14 (commit 240c019)
nvme-cli: master - commit 4b5b4d2 (https://github.com/linux-nvme/nvme-cli.git)
/var/log/messages
May 17 11:15:00 localhost NetworkManager[1352]: <info>
[1495034100.7162] device (eth3): state change: activated ->
unavailable (reason 'carrier-changed') [100 20 40]
May 17 11:15:00 localhost dbus-daemon: dbus[1300]: [system] Activating
via systemd: service name='org.freedesktop.nm_dispatcher'
unit='dbus-org.freedesktop.nm-dispatcher.service'
May 17 11:15:00 localhost dbus[1300]: [system] Activating via systemd:
service name='org.freedesktop.nm_dispatcher'
unit='dbus-org.freedesktop.nm-dispatcher.service'
May 17 11:15:00 localhost systemd: Starting Network Manager Script
Dispatcher Service...
May 17 11:15:00 localhost dbus[1300]: [system] Successfully activated
service 'org.freedesktop.nm_dispatcher'
May 17 11:15:00 localhost dbus-daemon: dbus[1300]: [system]
Successfully activated service 'org.freedesktop.nm_dispatcher'
May 17 11:15:00 localhost systemd: Started Network Manager Script
Dispatcher Service.
May 17 11:15:00 localhost nm-dispatcher: req:1 'down' [eth3]: new
request (3 scripts)
May 17 11:15:00 localhost nm-dispatcher: req:1 'down' [eth3]: start
running ordered scripts...
May 17 11:15:09 localhost kernel: nvme nvme2: failed
nvme_keep_alive_end_io error=16391
May 17 11:15:09 localhost kernel: nvme nvme1: failed
nvme_keep_alive_end_io error=16391
May 17 11:15:09 localhost kernel: nvme nvme2: reconnecting in 10 seconds
May 17 11:15:09 localhost kernel: nvme nvme1: reconnecting in 10 seconds
# strace multipath -ll 2>&1 | grep "io_\|open(\"/dev/nvme"
io_setup(1, {140204923695104}) = 0
io_submit(140204923695104, 1, {{pread, filedes:4, buf:0x1d14000,
nbytes:4096, offset:0}}) = 1
io_getevents(140204923695104, 1, 1, {{(nil), 0x1d11ba8, 4096, 0}}, {2, 0}) = 1
open("/dev/nvme0n1", O_RDONLY) = 5
io_setup(1, {140204923682816}) = 0
io_submit(140204923682816, 1, {{pread, filedes:5, buf:0x1d22000,
nbytes:4096, offset:0}}) = 1
io_getevents(140204923682816, 1, 1, {{(nil), 0x1d21548, 4096, 0}}, {2, 0}) = 1
open("/dev/nvme1n1", O_RDONLY) = 6
io_setup(1, {140204923670528}) = 0
io_submit(140204923670528, 1, {{pread, filedes:6, buf:0x1d25000,
nbytes:4096, offset:0}}) = 1
io_getevents(140204923670528, 1, 1, {}{2, 0}) = 0
io_cancel(140204923670528, {(nil), 0, 0, 0, 6}, {...}) = -1 EINVAL
(Invalid argument)
open("/dev/nvme2n1", O_RDONLY) = 7
io_setup(1, {140204923355136}) = 0
io_submit(140204923355136, 1, {{pread, filedes:7, buf:0x1d29000,
nbytes:4096, offset:0}}) = 1
io_getevents(140204923355136, 1, 1, {}{2, 0}) = 0
io_cancel(140204923355136, {(nil), 0, 0, 0, 7}, {...}) = -1 EINVAL
(Invalid argument)
open("/dev/nvme3n1", O_RDONLY) = 8
io_setup(1, {140204923342848}) = 0
io_submit(140204923342848, 1, {{pread, filedes:8, buf:0x1d2d000,
nbytes:4096, offset:0}}) = 1
io_getevents(140204923342848, 1, 1, {{(nil), 0x1d2af28, 4096, 0}}, {2, 0}) = 1
io_destroy(140204923695104) = 0
io_destroy(140204923682816) = 0
io_destroy(140204923670528
^ permalink raw reply [flat|nested] 18+ messages in thread
* NVMeoF: multipath stuck after bringing one ethernet port down
2017-05-17 16:08 NVMeoF: multipath stuck after bringing one ethernet port down Alex Turin
@ 2017-05-17 17:28 ` Sagi Grimberg
2017-05-18 5:52 ` shahar.salzman
0 siblings, 1 reply; 18+ messages in thread
From: Sagi Grimberg @ 2017-05-17 17:28 UTC (permalink / raw)
Hi Alex,
> I am trying to test failure scenarios of NVMeoF + multipath. I bring
> one of the ports down and expect to see failed paths in "multipath
> -ll". Instead, "multipath -ll" gets stuck.
>
> reproduce:
> 1. Connected to NVMeoF device through 2 ports.
> 2. Bind them with multipath.
> 3. Bring one port down (ifconfig eth3 down)
> 4. Execute "multipath -ll" command and it will get stuck.
> From strace I see that multipath is stuck in io_destroy() while
> releasing resources. As I understand it, io_destroy() is stuck because
> io_cancel() failed, and io_cancel() failed because of the port that
> was disabled in step 3.
Hmm, it looks like we do take care of failing fast pending IO, but once
we schedule periodic reconnects the request queues are already stopped
and new incoming requests may block until we successfully reconnect.
I don't have too much time for it at the moment, but here is an untested
patch for you to try out:
--
[PATCH] nvme-rdma: restart queues after error recovery to fast-fail
incoming io
When we encounter a transport/controller error, error recovery
kicks in, which performs:
1. stops io/admin queues
2. moves transport queues out of LIVE state
3. fast fail pending io
4. schedule periodic reconnects.
But we also need to fast-fail incoming IO that enters after error
recovery has already been scheduled. Given that our queues are not LIVE
anymore, simply restart the request queues so the IO fails in .queue_rq
Signed-off-by: Sagi Grimberg <sagi at grimberg.me>
---
drivers/nvme/host/rdma.c | 20 +++++++++++---------
1 file changed, 11 insertions(+), 9 deletions(-)
diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index dd1c6deef82f..a0aa2bfb91ee 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -753,28 +753,26 @@ static void nvme_rdma_reconnect_ctrl_work(struct
work_struct *work)
if (ret)
goto requeue;
- blk_mq_start_stopped_hw_queues(ctrl->ctrl.admin_q, true);
-
ret = nvmf_connect_admin_queue(&ctrl->ctrl);
if (ret)
- goto stop_admin_q;
+ goto requeue;
set_bit(NVME_RDMA_Q_LIVE, &ctrl->queues[0].flags);
ret = nvme_enable_ctrl(&ctrl->ctrl, ctrl->cap);
if (ret)
- goto stop_admin_q;
+ goto requeue;
nvme_start_keep_alive(&ctrl->ctrl);
if (ctrl->queue_count > 1) {
ret = nvme_rdma_init_io_queues(ctrl);
if (ret)
- goto stop_admin_q;
+ goto requeue;
ret = nvme_rdma_connect_io_queues(ctrl);
if (ret)
- goto stop_admin_q;
+ goto requeue;
}
changed = nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_LIVE);
@@ -782,7 +780,6 @@ static void nvme_rdma_reconnect_ctrl_work(struct
work_struct *work)
ctrl->ctrl.opts->nr_reconnects = 0;
if (ctrl->queue_count > 1) {
- nvme_start_queues(&ctrl->ctrl);
nvme_queue_scan(&ctrl->ctrl);
nvme_queue_async_events(&ctrl->ctrl);
}
@@ -791,8 +788,6 @@ static void nvme_rdma_reconnect_ctrl_work(struct
work_struct *work)
return;
-stop_admin_q:
- blk_mq_stop_hw_queues(ctrl->ctrl.admin_q);
requeue:
dev_info(ctrl->ctrl.device, "Failed reconnect attempt %d\n",
ctrl->ctrl.opts->nr_reconnects);
@@ -823,6 +818,13 @@ static void nvme_rdma_error_recovery_work(struct
work_struct *work)
blk_mq_tagset_busy_iter(&ctrl->admin_tag_set,
nvme_cancel_request, &ctrl->ctrl);
+ /*
+ * queues are not live anymore, so restart the queues to fast-fail
+ * new IO
+ */
+ blk_mq_start_stopped_hw_queues(ctrl->ctrl.admin_q, true);
+ nvme_start_queues(&ctrl->ctrl);
+
nvme_rdma_reconnect_or_remove(ctrl);
}
--
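The queue-state reasoning behind the patch can be caricatured in a few lines. This is a toy model with invented names, not the blk-mq API: a stopped queue parks new requests until reconnect, while a started-but-not-LIVE queue lets .queue_rq reject them immediately.

```c
#include <assert.h>

/*
 * Toy model of the queue states described above -- invented names,
 * not kernel code.
 */
enum q_state { Q_LIVE, Q_STOPPED, Q_STARTED_NOT_LIVE };
enum rq_status { RQ_OK, RQ_BLOCKED, RQ_FAST_FAIL };

/* Fate of a request that arrives while the queue is in state s. */
enum rq_status queue_rq_model(enum q_state s)
{
	switch (s) {
	case Q_LIVE:
		return RQ_OK;		/* normal submission */
	case Q_STOPPED:
		return RQ_BLOCKED;	/* the bug: blocks until reconnect */
	case Q_STARTED_NOT_LIVE:
		return RQ_FAST_FAIL;	/* the fix: fail fast in .queue_rq */
	}
	return RQ_BLOCKED;
}
```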
* NVMeoF: multipath stuck after bringing one ethernet port down
2017-05-17 17:28 ` Sagi Grimberg
@ 2017-05-18 5:52 ` shahar.salzman
2017-05-23 17:31 ` shahar.salzman
0 siblings, 1 reply; 18+ messages in thread
From: shahar.salzman @ 2017-05-18 5:52 UTC (permalink / raw)
I have seen this too, but was internally investigating it as I am
running 4.9.6 and not upstream. I will be able to check this patch on
Sunday/Monday.
On my setup the IOs never complete even after the path is reconnected;
the process remains stuck in D state, and the path is unusable until I
power cycle the server.
Some more information, hope it helps:
Dumping the stuck process stack (multipath -ll):
[root at kblock01-knode02 ~]# cat /proc/9715/stack
[<ffffffff94354727>] blk_execute_rq+0x97/0x110
[<ffffffffc0a72f8a>] __nvme_submit_user_cmd+0xca/0x300 [nvme_core]
[<ffffffffc0a73369>] nvme_submit_user_cmd+0x29/0x30 [nvme_core]
[<ffffffffc0a734bd>] nvme_user_cmd+0x14d/0x180 [nvme_core]
[<ffffffffc0a7376c>] nvme_ioctl+0x7c/0xa0 [nvme_core]
[<ffffffff9435d548>] __blkdev_driver_ioctl+0x28/0x30
[<ffffffff9435dce1>] blkdev_ioctl+0x131/0x8f0
[<ffffffff9428993c>] block_ioctl+0x3c/0x40
[<ffffffff94260ae8>] vfs_ioctl+0x18/0x30
[<ffffffff94261201>] do_vfs_ioctl+0x161/0x600
[<ffffffff94261732>] SyS_ioctl+0x92/0xa0
[<ffffffff9479cc77>] entry_SYSCALL_64_fastpath+0x1a/0xa9
[<ffffffffffffffff>] 0xffffffffffffffff
Looking at the disassembly:
0000000000000170 <blk_execute_rq>:
...
1fa: e8 00 00 00 00 callq 1ff <blk_execute_rq+0x8f>
/* Prevent hang_check timer from firing at us during very long
I/O */
hang_check = sysctl_hung_task_timeout_secs;
if (hang_check)
while (!wait_for_completion_io_timeout(&wait,
hang_check * (HZ/2)));
else
wait_for_completion_io(&wait);
1ff: 4c 89 e7 mov %r12,%rdi
202: e8 00 00 00 00 callq 207 <blk_execute_rq+0x97>
if (rq->errors)
207: 83 bb 04 01 00 00 01 cmpl $0x1,0x104(%rbx)
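The snippet above is why the process can sit in D state forever: with hang_check set, blk_execute_rq just re-arms the timed wait. A userspace caricature of that loop (the names mirror the quoted kernel source, but the polling implementation is invented for illustration):

```c
#include <assert.h>
#include <time.h>

/* Minimal stand-in for the kernel's completion -- not kernel code. */
struct completion { int done; };

static void complete(struct completion *x)
{
	x->done = 1;
}

/* Poll for up to timeout_ms; returns nonzero iff completed in time.
 * The real wait_for_completion_io_timeout() sleeps on a wait queue. */
static int wait_for_completion_io_timeout(struct completion *x, long timeout_ms)
{
	struct timespec tick = { 0, 1000000 }; /* 1 ms */
	for (long waited = 0; waited < timeout_ms; waited++) {
		if (x->done)
			return 1;
		nanosleep(&tick, NULL);
	}
	return x->done;
}

/* The wait logic from the quoted blk_execute_rq source: with a
 * hang-check period it re-arms the timed wait forever, so a request
 * that is never completed keeps its caller waiting indefinitely. */
static void blk_execute_rq_wait(struct completion *wait, long hang_check_secs)
{
	if (hang_check_secs)
		while (!wait_for_completion_io_timeout(wait,
						       hang_check_secs * 500 /* ~HZ/2 */))
			;
	else
		while (!wait_for_completion_io_timeout(wait, 1))
			; /* wait_for_completion_io(): no timeout at all */
}
```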
I ran btrace, and it seems that the IO running before the path
disconnect is properly cleaned up, and only the IO issued after it gets
stuck (in the btrace example below I ran both multipath -ll and a dd):
259,2 15 27548 4.790727567 8527 I RS 764837376 + 8 [fio]
259,2 15 27549 4.790728103 8527 D RS 764837376 + 8 [fio]
259,2 15 27550 4.791244566 8527 A R 738007008 + 8 <-
(253,0) 738007008
259,2 15 27551 4.791245136 8527 I RS 738007008 + 8 [fio]
259,2 15 27552 4.791245803 8527 D RS 738007008 + 8 [fio]
<<<<< From this stage there are no more fio Queue requests, only completions
259,2 8 9118 4.758116423 8062 C RS 2294539440 + 8 [0]
259,2 8 9119 4.758559461 8062 C RS 160307464 + 8 [0]
259,2 8 9120 4.759009990 0 C RS 2330623512 + 8 [0]
259,2 8 9121 4.759457400 0 C RS 2657710096 + 8 [0]
259,2 8 9122 4.759928113 0 C RS 2988256176 + 8 [0]
259,2 8 9123 4.760388009 0 C RS 789190936 + 8 [0]
259,2 8 9124 4.760835889 0 C RS 2915035352 + 8 [0]
259,2 8 9125 4.761304282 0 C RS 2458313624 + 8 [0]
259,2 8 9126 4.761779694 0 C RS 2280411456 + 8 [0]
259,2 8 9127 4.762286741 0 C RS 2082207376 + 8 [0]
259,2 8 9128 4.762783722 0 C RS 2874601688 + 8 [0]
....
259,2 8 9173 4.785798485 0 C RS 828737568 + 8 [0]
259,2 8 9174 4.786301391 0 C RS 229168888 + 8 [0]
259,2 8 9175 4.786769105 0 C RS 893604096 + 8 [0]
259,2 8 9176 4.787270594 0 C RS 2308778728 + 8 [0]
259,2 8 9177 4.787797448 0 C RS 2190029912 + 8 [0]
259,2 8 9178 4.788298956 0 C RS 2652012944 + 8 [0]
259,2 8 9179 4.788803759 0 C RS 2205242008 + 8 [0]
259,2 8 9180 4.789309423 0 C RS 1948528416 + 8 [0]
259,2 8 9181 4.789831240 0 C RS 2003421288 + 8 [0]
259,2 8 9182 4.790343528 0 C RS 2464964728 + 8 [0]
259,2 8 9183 4.790854684 0 C RS 764837376 + 8 [0]
<<<<<< This is probably where the port had been disconnected
259,2 21 7 9.413088410 8574 Q R 0 + 8 [multipathd]
259,2 21 8 9.413089387 8574 G R 0 + 8 [multipathd]
259,2 21 9 9.413095030 8574 U N [multipathd] 1
259,2 21 10 9.413095367 8574 I RS 0 + 8 [multipathd]
259,2 21 11 9.413096534 8574 D RS 0 + 8 [multipathd]
259,2 10 1 14.381330190 2016 C RS 738007008 + 8 [7]
259,2 21 12 14.381394890 0 R RS 0 + 8 [7]
259,2 8 9184 80.500537429 8657 Q R 0 + 8 [dd]
259,2 8 9185 80.500540122 8657 G R 0 + 8 [dd]
259,2 8 9186 80.500545289 8657 U N [dd] 1
259,2 8 9187 80.500545792 8657 I RS 0 + 8 [dd]
<<<<<< At this stage, I reconnected the path:
259,2 4 1 8381.791090134 11127 Q R 0 + 8 [dd]
259,2 4 2 8381.791093611 11127 G R 0 + 8 [dd]
259,2 4 3 8381.791098288 11127 U N [dd] 1
259,2 4 4 8381.791098791 11127 I RS 0 + 8 [dd]
As I wrote above, I will check the patch on Sunday/Monday.
On 05/17/2017 08:28 PM, Sagi Grimberg wrote:
> Hi Alex,
>
>> I am trying to test failure scenarios of NVMeoF + multipath. I bring
>> one of the ports down and expect to see failed paths in "multipath
>> -ll". Instead, "multipath -ll" gets stuck.
>>
>> reproduce:
>> 1. Connected to NVMeoF device through 2 ports.
>> 2. Bind them with multipath.
>> 3. Bring one port down (ifconfig eth3 down)
>> 4. Execute "multipath -ll" command and it will get stuck.
>> From strace I see that multipath is stuck in io_destroy() while
>> releasing resources. As I understand it, io_destroy() is stuck because
>> io_cancel() failed, and io_cancel() failed because of the port that
>> was disabled in step 3.
>
> Hmm, it looks like we do take care of failing fast pending IO, but once
> we schedule periodic reconnects the request queues are already stopped
> and new incoming requests may block until we successfully reconnect.
>
> I don't have too much time for it at the moment, but here is an untested
> patch for you to try out:
>
> --
> [PATCH] nvme-rdma: restart queues after error recovery to fast-fail
> incoming io
>
> When we encounter a transport/controller error, error recovery
> kicks in, which performs:
> 1. stops io/admin queues
> 2. moves transport queues out of LIVE state
> 3. fast fail pending io
> 4. schedule periodic reconnects.
>
> But we also need to fast-fail incoming IO that enters after error
> recovery has already been scheduled. Given that our queues are not LIVE
> anymore, simply restart the request queues so the IO fails in .queue_rq
>
> Signed-off-by: Sagi Grimberg <sagi at grimberg.me>
> ---
> drivers/nvme/host/rdma.c | 20 +++++++++++---------
> 1 file changed, 11 insertions(+), 9 deletions(-)
>
> diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
> index dd1c6deef82f..a0aa2bfb91ee 100644
> --- a/drivers/nvme/host/rdma.c
> +++ b/drivers/nvme/host/rdma.c
> @@ -753,28 +753,26 @@ static void nvme_rdma_reconnect_ctrl_work(struct
> work_struct *work)
> if (ret)
> goto requeue;
>
> - blk_mq_start_stopped_hw_queues(ctrl->ctrl.admin_q, true);
> -
> ret = nvmf_connect_admin_queue(&ctrl->ctrl);
> if (ret)
> - goto stop_admin_q;
> + goto requeue;
>
> set_bit(NVME_RDMA_Q_LIVE, &ctrl->queues[0].flags);
>
> ret = nvme_enable_ctrl(&ctrl->ctrl, ctrl->cap);
> if (ret)
> - goto stop_admin_q;
> + goto requeue;
>
> nvme_start_keep_alive(&ctrl->ctrl);
>
> if (ctrl->queue_count > 1) {
> ret = nvme_rdma_init_io_queues(ctrl);
> if (ret)
> - goto stop_admin_q;
> + goto requeue;
>
> ret = nvme_rdma_connect_io_queues(ctrl);
> if (ret)
> - goto stop_admin_q;
> + goto requeue;
> }
>
> changed = nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_LIVE);
> @@ -782,7 +780,6 @@ static void nvme_rdma_reconnect_ctrl_work(struct
> work_struct *work)
> ctrl->ctrl.opts->nr_reconnects = 0;
>
> if (ctrl->queue_count > 1) {
> - nvme_start_queues(&ctrl->ctrl);
> nvme_queue_scan(&ctrl->ctrl);
> nvme_queue_async_events(&ctrl->ctrl);
> }
> @@ -791,8 +788,6 @@ static void nvme_rdma_reconnect_ctrl_work(struct
> work_struct *work)
>
> return;
>
> -stop_admin_q:
> - blk_mq_stop_hw_queues(ctrl->ctrl.admin_q);
> requeue:
> dev_info(ctrl->ctrl.device, "Failed reconnect attempt %d\n",
> ctrl->ctrl.opts->nr_reconnects);
> @@ -823,6 +818,13 @@ static void nvme_rdma_error_recovery_work(struct
> work_struct *work)
> blk_mq_tagset_busy_iter(&ctrl->admin_tag_set,
> nvme_cancel_request, &ctrl->ctrl);
>
> + /*
> + * queues are not live anymore, so restart the queues to fast-fail
> + * new IO
> + */
> + blk_mq_start_stopped_hw_queues(ctrl->ctrl.admin_q, true);
> + nvme_start_queues(&ctrl->ctrl);
> +
> nvme_rdma_reconnect_or_remove(ctrl);
> }
>
> --
>
> _______________________________________________
> Linux-nvme mailing list
> Linux-nvme at lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-nvme
* NVMeoF: multipath stuck after bringing one ethernet port down
2017-05-18 5:52 ` shahar.salzman
@ 2017-05-23 17:31 ` shahar.salzman
2017-05-25 11:06 ` shahar.salzman
2017-05-30 12:05 ` Sagi Grimberg
0 siblings, 2 replies; 18+ messages in thread
From: shahar.salzman @ 2017-05-23 17:31 UTC (permalink / raw)
I used the patch with my Mellanox OFED (mlnx-nvme-rdma); it partially
helps, but the overall behaviour is still not good...
If I do a dd (direct IO) or multipath -ll during the path disconnect, I
get indefinite retries with nearly zero time between retries until the
path is reconnected; then dd completes successfully and the paths come
back to life, returning OK status to the process, which completes
successfully. I would expect the IO to fail after a certain timeout, and
the retries to be paced somewhat.
If I do an nvme list (blk_execute_rq), the process stays in D state (I
still see the retries), the block layer (at least for this device) seems
to be stuck, and I have to power cycle the server to get it to work...
Just a reminder: I am using Mellanox OFED 4.0 and kernel 4.9.6. The plan
here is to upgrade to the latest stable kernel, but I am not sure when
we'll get to it.
On 05/18/2017 08:52 AM, shahar.salzman wrote:
> I have seen this too, but was internally investigating it as I am
> running 4.9.6 and not upstream. I will be able to check this patch on
> Sunday/Monday.
>
> On my setup the IOs never complete even after the path is reconnected;
> the process remains stuck in D state, and the path is unusable until
> I power cycle the server.
>
> Some more information, hope it helps:
>
> Dumping the stuck process stack (multipath -ll):
>
> [root at kblock01-knode02 ~]# cat /proc/9715/stack
> [<ffffffff94354727>] blk_execute_rq+0x97/0x110
> [<ffffffffc0a72f8a>] __nvme_submit_user_cmd+0xca/0x300 [nvme_core]
> [<ffffffffc0a73369>] nvme_submit_user_cmd+0x29/0x30 [nvme_core]
> [<ffffffffc0a734bd>] nvme_user_cmd+0x14d/0x180 [nvme_core]
> [<ffffffffc0a7376c>] nvme_ioctl+0x7c/0xa0 [nvme_core]
> [<ffffffff9435d548>] __blkdev_driver_ioctl+0x28/0x30
> [<ffffffff9435dce1>] blkdev_ioctl+0x131/0x8f0
> [<ffffffff9428993c>] block_ioctl+0x3c/0x40
> [<ffffffff94260ae8>] vfs_ioctl+0x18/0x30
> [<ffffffff94261201>] do_vfs_ioctl+0x161/0x600
> [<ffffffff94261732>] SyS_ioctl+0x92/0xa0
> [<ffffffff9479cc77>] entry_SYSCALL_64_fastpath+0x1a/0xa9
> [<ffffffffffffffff>] 0xffffffffffffffff
>
> Looking at the disassembly:
>
> 0000000000000170 <blk_execute_rq>:
>
> ...
>
> 1fa: e8 00 00 00 00 callq 1ff <blk_execute_rq+0x8f>
> /* Prevent hang_check timer from firing at us during very long
> I/O */
> hang_check = sysctl_hung_task_timeout_secs;
> if (hang_check)
> while (!wait_for_completion_io_timeout(&wait,
> hang_check * (HZ/2)));
> else
> wait_for_completion_io(&wait);
> 1ff: 4c 89 e7 mov %r12,%rdi
> 202: e8 00 00 00 00 callq 207 <blk_execute_rq+0x97>
>
> if (rq->errors)
> 207: 83 bb 04 01 00 00 01 cmpl $0x1,0x104(%rbx)
>
> I ran btrace, and it seems that the IO running before the path
> disconnect is properly cleaned up, and only the IO issued after it gets
> stuck (in the btrace example below I ran both multipath -ll and a dd):
>
> 259,2 15 27548 4.790727567 8527 I RS 764837376 + 8 [fio]
> 259,2 15 27549 4.790728103 8527 D RS 764837376 + 8 [fio]
> 259,2 15 27550 4.791244566 8527 A R 738007008 + 8 <-
> (253,0) 738007008
> 259,2 15 27551 4.791245136 8527 I RS 738007008 + 8 [fio]
> 259,2 15 27552 4.791245803 8527 D RS 738007008 + 8 [fio]
> <<<<< From this stage there are no more fio Queue requests, only
> completions
> 259,2 8 9118 4.758116423 8062 C RS 2294539440 + 8 [0]
> 259,2 8 9119 4.758559461 8062 C RS 160307464 + 8 [0]
> 259,2 8 9120 4.759009990 0 C RS 2330623512 + 8 [0]
> 259,2 8 9121 4.759457400 0 C RS 2657710096 + 8 [0]
> 259,2 8 9122 4.759928113 0 C RS 2988256176 + 8 [0]
> 259,2 8 9123 4.760388009 0 C RS 789190936 + 8 [0]
> 259,2 8 9124 4.760835889 0 C RS 2915035352 + 8 [0]
> 259,2 8 9125 4.761304282 0 C RS 2458313624 + 8 [0]
> 259,2 8 9126 4.761779694 0 C RS 2280411456 + 8 [0]
> 259,2 8 9127 4.762286741 0 C RS 2082207376 + 8 [0]
> 259,2 8 9128 4.762783722 0 C RS 2874601688 + 8 [0]
> ....
> 259,2 8 9173 4.785798485 0 C RS 828737568 + 8 [0]
> 259,2 8 9174 4.786301391 0 C RS 229168888 + 8 [0]
> 259,2 8 9175 4.786769105 0 C RS 893604096 + 8 [0]
> 259,2 8 9176 4.787270594 0 C RS 2308778728 + 8 [0]
> 259,2 8 9177 4.787797448 0 C RS 2190029912 + 8 [0]
> 259,2 8 9178 4.788298956 0 C RS 2652012944 + 8 [0]
> 259,2 8 9179 4.788803759 0 C RS 2205242008 + 8 [0]
> 259,2 8 9180 4.789309423 0 C RS 1948528416 + 8 [0]
> 259,2 8 9181 4.789831240 0 C RS 2003421288 + 8 [0]
> 259,2 8 9182 4.790343528 0 C RS 2464964728 + 8 [0]
> 259,2 8 9183 4.790854684 0 C RS 764837376 + 8 [0]
> <<<<<< This is probably where the port had been disconnected
> 259,2 21 7 9.413088410 8574 Q R 0 + 8 [multipathd]
> 259,2 21 8 9.413089387 8574 G R 0 + 8 [multipathd]
> 259,2 21 9 9.413095030 8574 U N [multipathd] 1
> 259,2 21 10 9.413095367 8574 I RS 0 + 8 [multipathd]
> 259,2 21 11 9.413096534 8574 D RS 0 + 8 [multipathd]
> 259,2 10 1 14.381330190 2016 C RS 738007008 + 8 [7]
> 259,2 21 12 14.381394890 0 R RS 0 + 8 [7]
> 259,2 8 9184 80.500537429 8657 Q R 0 + 8 [dd]
> 259,2 8 9185 80.500540122 8657 G R 0 + 8 [dd]
> 259,2 8 9186 80.500545289 8657 U N [dd] 1
> 259,2 8 9187 80.500545792 8657 I RS 0 + 8 [dd]
> <<<<<< At this stage, I reconnected the path:
> 259,2 4 1 8381.791090134 11127 Q R 0 + 8 [dd]
> 259,2 4 2 8381.791093611 11127 G R 0 + 8 [dd]
> 259,2 4 3 8381.791098288 11127 U N [dd] 1
> 259,2 4 4 8381.791098791 11127 I RS 0 + 8 [dd]
>
>
> As I wrote above, I will check the patch on Sunday/Monday.
>
> On 05/17/2017 08:28 PM, Sagi Grimberg wrote:
>> Hi Alex,
>>
>>> I am trying to test failure scenarios of NVMeoF + multipath. I bring
>>> one of the ports down and expect to see failed paths in "multipath
>>> -ll". Instead, "multipath -ll" gets stuck.
>>>
>>> reproduce:
>>> 1. Connected to NVMeoF device through 2 ports.
>>> 2. Bind them with multipath.
>>> 3. Bring one port down (ifconfig eth3 down)
>>> 4. Execute "multipath -ll" command and it will get stuck.
>>> From strace I see that multipath is stuck in io_destroy() while
>>> releasing resources. As I understand it, io_destroy() is stuck because
>>> io_cancel() failed, and io_cancel() failed because of the port that
>>> was disabled in step 3.
>>
>> Hmm, it looks like we do take care of failing fast pending IO, but once
>> we schedule periodic reconnects the request queues are already stopped
>> and new incoming requests may block until we successfully reconnect.
>>
>> I don't have too much time for it at the moment, but here is an untested
>> patch for you to try out:
>>
>> --
>> [PATCH] nvme-rdma: restart queues after error recovery to fast-fail
>> incoming io
>>
>> When we encounter a transport/controller error, error recovery
>> kicks in, which performs:
>> 1. stops io/admin queues
>> 2. moves transport queues out of LIVE state
>> 3. fast fail pending io
>> 4. schedule periodic reconnects.
>>
>> But we also need to fast-fail incoming IO that enters after error
>> recovery has already been scheduled. Given that our queues are not LIVE
>> anymore, simply restart the request queues so the IO fails in .queue_rq
>>
>> Signed-off-by: Sagi Grimberg <sagi at grimberg.me>
>> ---
>> drivers/nvme/host/rdma.c | 20 +++++++++++---------
>> 1 file changed, 11 insertions(+), 9 deletions(-)
>>
>> diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
>> index dd1c6deef82f..a0aa2bfb91ee 100644
>> --- a/drivers/nvme/host/rdma.c
>> +++ b/drivers/nvme/host/rdma.c
>> @@ -753,28 +753,26 @@ static void
>> nvme_rdma_reconnect_ctrl_work(struct work_struct *work)
>> if (ret)
>> goto requeue;
>>
>> - blk_mq_start_stopped_hw_queues(ctrl->ctrl.admin_q, true);
>> -
>> ret = nvmf_connect_admin_queue(&ctrl->ctrl);
>> if (ret)
>> - goto stop_admin_q;
>> + goto requeue;
>>
>> set_bit(NVME_RDMA_Q_LIVE, &ctrl->queues[0].flags);
>>
>> ret = nvme_enable_ctrl(&ctrl->ctrl, ctrl->cap);
>> if (ret)
>> - goto stop_admin_q;
>> + goto requeue;
>>
>> nvme_start_keep_alive(&ctrl->ctrl);
>>
>> if (ctrl->queue_count > 1) {
>> ret = nvme_rdma_init_io_queues(ctrl);
>> if (ret)
>> - goto stop_admin_q;
>> + goto requeue;
>>
>> ret = nvme_rdma_connect_io_queues(ctrl);
>> if (ret)
>> - goto stop_admin_q;
>> + goto requeue;
>> }
>>
>> changed = nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_LIVE);
>> @@ -782,7 +780,6 @@ static void nvme_rdma_reconnect_ctrl_work(struct
>> work_struct *work)
>> ctrl->ctrl.opts->nr_reconnects = 0;
>>
>> if (ctrl->queue_count > 1) {
>> - nvme_start_queues(&ctrl->ctrl);
>> nvme_queue_scan(&ctrl->ctrl);
>> nvme_queue_async_events(&ctrl->ctrl);
>> }
>> @@ -791,8 +788,6 @@ static void nvme_rdma_reconnect_ctrl_work(struct
>> work_struct *work)
>>
>> return;
>>
>> -stop_admin_q:
>> - blk_mq_stop_hw_queues(ctrl->ctrl.admin_q);
>> requeue:
>> dev_info(ctrl->ctrl.device, "Failed reconnect attempt %d\n",
>> ctrl->ctrl.opts->nr_reconnects);
>> @@ -823,6 +818,13 @@ static void nvme_rdma_error_recovery_work(struct
>> work_struct *work)
>> blk_mq_tagset_busy_iter(&ctrl->admin_tag_set,
>> nvme_cancel_request, &ctrl->ctrl);
>>
>> + /*
>> + * queues are not live anymore, so restart the queues to fast-fail
>> + * new IO
>> + */
>> + blk_mq_start_stopped_hw_queues(ctrl->ctrl.admin_q, true);
>> + nvme_start_queues(&ctrl->ctrl);
>> +
>> nvme_rdma_reconnect_or_remove(ctrl);
>> }
>>
>> --
>>
>
* NVMeoF: multipath stuck after bringing one ethernet port down
2017-05-23 17:31 ` shahar.salzman
@ 2017-05-25 11:06 ` shahar.salzman
2017-05-25 12:27 ` shahar.salzman
2017-05-30 12:11 ` Sagi Grimberg
2017-05-30 12:05 ` Sagi Grimberg
1 sibling, 2 replies; 18+ messages in thread
From: shahar.salzman @ 2017-05-25 11:06 UTC (permalink / raw)
OK, so the indefinite retries are due to
nvme/host/rdma.c:nvme_rdma_queue_rq() returning BLK_MQ_RQ_QUEUE_BUSY
when the queue is not ready. Wouldn't it be better to return
BLK_MQ_RQ_QUEUE_ERROR so that the application can handle it? Or,
alternatively, to return an error after several retries (per queue or
something)?
I tested a patch (on top of Sagi's) converting the return value to
QUEUE_ERROR, and both dd and multipath -ll return instantly.
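The trade-off being proposed can be sketched as a tiny decision function. This is hypothetical: the constants only mirror the blk-mq names, and the real .queue_rq return handling is richer than this. BUSY means the block layer requeues and retries (the observed retry storm); ERROR completes the request with an error the application actually sees.

```c
#include <assert.h>

/* Hypothetical policy sketch -- names mirror blk-mq, but this is not
 * kernel code. */
enum { BLK_MQ_RQ_QUEUE_OK, BLK_MQ_RQ_QUEUE_BUSY, BLK_MQ_RQ_QUEUE_ERROR };

int queue_rq_policy(int queue_ready, int fail_fast)
{
	if (queue_ready)
		return BLK_MQ_RQ_QUEUE_OK;
	/* BUSY -> silent requeue/retry; ERROR -> fail up to the caller. */
	return fail_fast ? BLK_MQ_RQ_QUEUE_ERROR : BLK_MQ_RQ_QUEUE_BUSY;
}
```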
On 05/23/2017 08:31 PM, shahar.salzman wrote:
> I used the patch with my Mellanox OFED (mlnx-nvme-rdma); it partially
> helps, but the overall behaviour is still not good...
>
> If I do a dd (direct IO) or multipath -ll during the path disconnect,
> I get indefinite retries with nearly zero time between retries until
> the path is reconnected; then dd completes successfully and the paths
> come back to life, returning OK status to the process, which completes
> successfully. I would expect the IO to fail after a certain timeout,
> and the retries to be paced somewhat.
>
> If I do an nvme list (blk_execute_rq), then the process stays in D
> state (I still see the retries), the block layer (at least for this
> device) seems to be stuck, and I have to power cycle the server to get
> it to work...
>
> Just a reminder, I am using Mellanox OFED 4.0, and kernel 4.9.6, the
> plan here is to upgrade to the latest stable kernel but not sure when
> we'll get to it.
>
>
> On 05/18/2017 08:52 AM, shahar.salzman wrote:
>> I have seen this too, but was internally investigating it as I am
>> running 4.9.6 and not upstream. I will be able to check this patch on
>> Sunday/Monday.
>>
>> On my setup the IOs never complete even after the path is
>> reconnected; the process remains stuck in D state, and the path is
>> unusable until I power cycle the server.
>>
>> Some more information, hope it helps:
>>
>> Dumping the stuck process stack (multipath -ll):
>>
>> [root at kblock01-knode02 ~]# cat /proc/9715/stack
>> [<ffffffff94354727>] blk_execute_rq+0x97/0x110
>> [<ffffffffc0a72f8a>] __nvme_submit_user_cmd+0xca/0x300 [nvme_core]
>> [<ffffffffc0a73369>] nvme_submit_user_cmd+0x29/0x30 [nvme_core]
>> [<ffffffffc0a734bd>] nvme_user_cmd+0x14d/0x180 [nvme_core]
>> [<ffffffffc0a7376c>] nvme_ioctl+0x7c/0xa0 [nvme_core]
>> [<ffffffff9435d548>] __blkdev_driver_ioctl+0x28/0x30
>> [<ffffffff9435dce1>] blkdev_ioctl+0x131/0x8f0
>> [<ffffffff9428993c>] block_ioctl+0x3c/0x40
>> [<ffffffff94260ae8>] vfs_ioctl+0x18/0x30
>> [<ffffffff94261201>] do_vfs_ioctl+0x161/0x600
>> [<ffffffff94261732>] SyS_ioctl+0x92/0xa0
>> [<ffffffff9479cc77>] entry_SYSCALL_64_fastpath+0x1a/0xa9
>> [<ffffffffffffffff>] 0xffffffffffffffff
>>
>> Looking at the disassembly:
>>
>> 0000000000000170 <blk_execute_rq>:
>>
>> ...
>>
>> 1fa: e8 00 00 00 00 callq 1ff <blk_execute_rq+0x8f>
>> /* Prevent hang_check timer from firing at us during very
>> long I/O */
>> hang_check = sysctl_hung_task_timeout_secs;
>> if (hang_check)
>> while (!wait_for_completion_io_timeout(&wait,
>> hang_check * (HZ/2)));
>> else
>> wait_for_completion_io(&wait);
>> 1ff: 4c 89 e7 mov %r12,%rdi
>> 202: e8 00 00 00 00 callq 207 <blk_execute_rq+0x97>
>>
>> if (rq->errors)
>> 207: 83 bb 04 01 00 00 01 cmpl $0x1,0x104(%rbx)
>>
>> I ran btrace, and it seems that the IO running before the path
>> disconnect is properly cleaned up, and only the IO issued after it
>> gets stuck (in the btrace example below I ran both multipath -ll, and a dd):
>>
>> 259,2 15 27548 4.790727567 8527 I RS 764837376 + 8 [fio]
>> 259,2 15 27549 4.790728103 8527 D RS 764837376 + 8 [fio]
>> 259,2 15 27550 4.791244566 8527 A R 738007008 + 8 <-
>> (253,0) 738007008
>> 259,2 15 27551 4.791245136 8527 I RS 738007008 + 8 [fio]
>> 259,2 15 27552 4.791245803 8527 D RS 738007008 + 8 [fio]
>> <<<<< From this stage there are no more fio Queue requests, only
>> completions
>> 259,2 8 9118 4.758116423 8062 C RS 2294539440 + 8 [0]
>> 259,2 8 9119 4.758559461 8062 C RS 160307464 + 8 [0]
>> 259,2 8 9120 4.759009990 0 C RS 2330623512 + 8 [0]
>> 259,2 8 9121 4.759457400 0 C RS 2657710096 + 8 [0]
>> 259,2 8 9122 4.759928113 0 C RS 2988256176 + 8 [0]
>> 259,2 8 9123 4.760388009 0 C RS 789190936 + 8 [0]
>> 259,2 8 9124 4.760835889 0 C RS 2915035352 + 8 [0]
>> 259,2 8 9125 4.761304282 0 C RS 2458313624 + 8 [0]
>> 259,2 8 9126 4.761779694 0 C RS 2280411456 + 8 [0]
>> 259,2 8 9127 4.762286741 0 C RS 2082207376 + 8 [0]
>> 259,2 8 9128 4.762783722 0 C RS 2874601688 + 8 [0]
>> ....
>> 259,2 8 9173 4.785798485 0 C RS 828737568 + 8 [0]
>> 259,2 8 9174 4.786301391 0 C RS 229168888 + 8 [0]
>> 259,2 8 9175 4.786769105 0 C RS 893604096 + 8 [0]
>> 259,2 8 9176 4.787270594 0 C RS 2308778728 + 8 [0]
>> 259,2 8 9177 4.787797448 0 C RS 2190029912 + 8 [0]
>> 259,2 8 9178 4.788298956 0 C RS 2652012944 + 8 [0]
>> 259,2 8 9179 4.788803759 0 C RS 2205242008 + 8 [0]
>> 259,2 8 9180 4.789309423 0 C RS 1948528416 + 8 [0]
>> 259,2 8 9181 4.789831240 0 C RS 2003421288 + 8 [0]
>> 259,2 8 9182 4.790343528 0 C RS 2464964728 + 8 [0]
>> 259,2 8 9183 4.790854684 0 C RS 764837376 + 8 [0]
>> <<<<<< This is probably where the port had been disconnected
>> 259,2 21 7 9.413088410 8574 Q R 0 + 8 [multipathd]
>> 259,2 21 8 9.413089387 8574 G R 0 + 8 [multipathd]
>> 259,2 21 9 9.413095030 8574 U N [multipathd] 1
>> 259,2 21 10 9.413095367 8574 I RS 0 + 8 [multipathd]
>> 259,2 21 11 9.413096534 8574 D RS 0 + 8 [multipathd]
>> 259,2 10 1 14.381330190 2016 C RS 738007008 + 8 [7]
>> 259,2 21 12 14.381394890 0 R RS 0 + 8 [7]
>> 259,2 8 9184 80.500537429 8657 Q R 0 + 8 [dd]
>> 259,2 8 9185 80.500540122 8657 G R 0 + 8 [dd]
>> 259,2 8 9186 80.500545289 8657 U N [dd] 1
>> 259,2 8 9187 80.500545792 8657 I RS 0 + 8 [dd]
>> <<<<<< At this stage, I reconnected the path:
>> 259,2 4 1 8381.791090134 11127 Q R 0 + 8 [dd]
>> 259,2 4 2 8381.791093611 11127 G R 0 + 8 [dd]
>> 259,2 4 3 8381.791098288 11127 U N [dd] 1
>> 259,2 4 4 8381.791098791 11127 I RS 0 + 8 [dd]
>>
>>
>> As I wrote above, I will check the patch on Sunday/Monday.
>>
>> On 05/17/2017 08:28 PM, Sagi Grimberg wrote:
>>> Hi Alex,
>>>
>>>> I am trying to test failure scenarios of NVMeoF + multipath. I bring
>>>> one of the ports down and expect to see failed paths using "multipath
>>>> -ll". Instead I see that "multipath -ll" gets stuck.
>>>>
>>>> reproduce:
>>>> 1. Connected to NVMeoF device through 2 ports.
>>>> 2. Bind them with multipath.
>>>> 3. Bring one port down (ifconfig eth3 down)
>>>> 4. Execute "multipath -ll" command and it will get stuck.
>>>> From strace I see that multipath is stuck in io_destroy() during
>>>> release of resources. As I understand io_destroy is stuck because of
>>>> io_cancel() that failed. And io_cancel() failed because of the port
>>>> that was disabled in step 3.
>>>
>>> Hmm, it looks like we do take care of failing fast pending IO, but once
>>> we schedule periodic reconnects the request queues are already stopped
>>> and new incoming requests may block until we successfully reconnect.
>>>
>>> I don't have too much time for it at the moment, but here is an
>>> untested
>>> patch for you to try out:
>>>
>>> --
>>> [PATCH] nvme-rdma: restart queues at error recovery to fast
>>> fail incoming io
>>>
>>> When we encounter transport/controller errors, error recovery
>>> kicks in which performs:
>>> 1. stops io/admin queues
>>> 2. moves transport queues out of LIVE state
>>> 3. fast fail pending io
>>> 4. schedule periodic reconnects.
>>>
>>> But we also need to fast fail incoming IO that enters after we
>>> already scheduled. Given that our queue is not LIVE anymore, simply
>>> restart the request queues to fail in .queue_rq
>>>
>>> Signed-off-by: Sagi Grimberg <sagi at grimberg.me>
>>> ---
>>> drivers/nvme/host/rdma.c | 20 +++++++++++---------
>>> 1 file changed, 11 insertions(+), 9 deletions(-)
>>>
>>> diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
>>> index dd1c6deef82f..a0aa2bfb91ee 100644
>>> --- a/drivers/nvme/host/rdma.c
>>> +++ b/drivers/nvme/host/rdma.c
>>> @@ -753,28 +753,26 @@ static void
>>> nvme_rdma_reconnect_ctrl_work(struct work_struct *work)
>>> if (ret)
>>> goto requeue;
>>>
>>> - blk_mq_start_stopped_hw_queues(ctrl->ctrl.admin_q, true);
>>> -
>>> ret = nvmf_connect_admin_queue(&ctrl->ctrl);
>>> if (ret)
>>> - goto stop_admin_q;
>>> + goto requeue;
>>>
>>> set_bit(NVME_RDMA_Q_LIVE, &ctrl->queues[0].flags);
>>>
>>> ret = nvme_enable_ctrl(&ctrl->ctrl, ctrl->cap);
>>> if (ret)
>>> - goto stop_admin_q;
>>> + goto requeue;
>>>
>>> nvme_start_keep_alive(&ctrl->ctrl);
>>>
>>> if (ctrl->queue_count > 1) {
>>> ret = nvme_rdma_init_io_queues(ctrl);
>>> if (ret)
>>> - goto stop_admin_q;
>>> + goto requeue;
>>>
>>> ret = nvme_rdma_connect_io_queues(ctrl);
>>> if (ret)
>>> - goto stop_admin_q;
>>> + goto requeue;
>>> }
>>>
>>> changed = nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_LIVE);
>>> @@ -782,7 +780,6 @@ static void nvme_rdma_reconnect_ctrl_work(struct
>>> work_struct *work)
>>> ctrl->ctrl.opts->nr_reconnects = 0;
>>>
>>> if (ctrl->queue_count > 1) {
>>> - nvme_start_queues(&ctrl->ctrl);
>>> nvme_queue_scan(&ctrl->ctrl);
>>> nvme_queue_async_events(&ctrl->ctrl);
>>> }
>>> @@ -791,8 +788,6 @@ static void nvme_rdma_reconnect_ctrl_work(struct
>>> work_struct *work)
>>>
>>> return;
>>>
>>> -stop_admin_q:
>>> - blk_mq_stop_hw_queues(ctrl->ctrl.admin_q);
>>> requeue:
>>> dev_info(ctrl->ctrl.device, "Failed reconnect attempt %d\n",
>>> ctrl->ctrl.opts->nr_reconnects);
>>> @@ -823,6 +818,13 @@ static void
>>> nvme_rdma_error_recovery_work(struct work_struct *work)
>>> blk_mq_tagset_busy_iter(&ctrl->admin_tag_set,
>>> nvme_cancel_request, &ctrl->ctrl);
>>>
>>> + /*
>>> + * queues are not alive anymore, so restart the queues to
>>> fail fast
>>> + * new IO
>>> + */
>>> + blk_mq_start_stopped_hw_queues(ctrl->ctrl.admin_q, true);
>>> + nvme_start_queues(&ctrl->ctrl);
>>> +
>>> nvme_rdma_reconnect_or_remove(ctrl);
>>> }
>>>
>>> --
>>>
>>> _______________________________________________
>>> Linux-nvme mailing list
>>> Linux-nvme at lists.infradead.org
>>> http://lists.infradead.org/mailman/listinfo/linux-nvme
>>
>
^ permalink raw reply [flat|nested] 18+ messages in thread
* NVMeoF: multipath stuck after bringing one ethernet port down
2017-05-25 11:06 ` shahar.salzman
@ 2017-05-25 12:27 ` shahar.salzman
2017-05-25 13:08 ` shahar.salzman
2017-05-30 12:14 ` Sagi Grimberg
2017-05-30 12:11 ` Sagi Grimberg
1 sibling, 2 replies; 18+ messages in thread
From: shahar.salzman @ 2017-05-25 12:27 UTC (permalink / raw)
Returning the QUEUE_ERROR also solved the "nvme list" issue reported,
but the paths were not coming back after I reconnected the ports.
I played with the keep alive thread so that it didn't exit if it got an
error, and that was resolved as well.
On 05/25/2017 02:06 PM, shahar.salzman wrote:
> OK, so the indefinite retries are due to the
> nvme/host/rdma.c:nvme_rdma_queue_rq returning BLK_MQ_RQ_QUEUE_BUSY
> when the queue is not ready, wouldn't it be better to return
> BLK_MQ_RQ_QUEUE_ERROR so that the application may handle it? Or
> alternatively return an error after several retries (per Q or something).
>
> I tested a patch (on top of Sagi's) converting the return value to
> QUEUE_ERROR, and both dd, and multipath -ll return instantly.
>
>
> On 05/23/2017 08:31 PM, shahar.salzman wrote:
>> I used the patch with my Mellanox OFED (mlnx-nvme-rdma), it partially
>> helps but the overall behaviour is still not good...
>>
>> If I do a dd (direct_IO) or multipath -ll during the path disconnect,
>> I get indefinite retries with nearly zero time between retries, until
>> the path is reconnected, and then dd is completed successfully, paths
>> come back to life, returning OK status to the process which completes
>> successfully. I would expect IO to fail after a certain timeout, and
>> that the retries should be paced somewhat.
>>
>> If I do an nvme list (blk_execute_rq), then the process stays in D
>> state (I still see the retries), the block layer (at least for this
>> device) seems to be stuck, and I have to power cycle the server to
>> get it to work...
>>
>> Just a reminder, I am using Mellanox OFED 4.0, and kernel 4.9.6, the
>> plan here is to upgrade to the latest stable kernel but not sure when
>> we'll get to it.
>>
>>
>> On 05/18/2017 08:52 AM, shahar.salzman wrote:
>>> I have seen this too, but was internally investigating it as I am
>>> running 4.9.6 and not upstream. I will be able to check this patch
>>> on Sunday/Monday.
>>>
>>> On my setup the IOs never complete even after the path is
>>> reconnected, and the process remains stuck in D state, and the path
>>> unusable until I power cycle the server.
>>>
>>> Some more information, hope it helps:
>>>
>>> Dumping the stuck process stack (multipath -ll):
>>>
>>> [root at kblock01-knode02 ~]# cat /proc/9715/stack
>>> [<ffffffff94354727>] blk_execute_rq+0x97/0x110
>>> [<ffffffffc0a72f8a>] __nvme_submit_user_cmd+0xca/0x300 [nvme_core]
>>> [<ffffffffc0a73369>] nvme_submit_user_cmd+0x29/0x30 [nvme_core]
>>> [<ffffffffc0a734bd>] nvme_user_cmd+0x14d/0x180 [nvme_core]
>>> [<ffffffffc0a7376c>] nvme_ioctl+0x7c/0xa0 [nvme_core]
>>> [<ffffffff9435d548>] __blkdev_driver_ioctl+0x28/0x30
>>> [<ffffffff9435dce1>] blkdev_ioctl+0x131/0x8f0
>>> [<ffffffff9428993c>] block_ioctl+0x3c/0x40
>>> [<ffffffff94260ae8>] vfs_ioctl+0x18/0x30
>>> [<ffffffff94261201>] do_vfs_ioctl+0x161/0x600
>>> [<ffffffff94261732>] SyS_ioctl+0x92/0xa0
>>> [<ffffffff9479cc77>] entry_SYSCALL_64_fastpath+0x1a/0xa9
>>> [<ffffffffffffffff>] 0xffffffffffffffff
>>>
>>> Looking at the disassembly:
>>>
>>> 0000000000000170 <blk_execute_rq>:
>>>
>>> ...
>>>
>>> 1fa: e8 00 00 00 00 callq 1ff <blk_execute_rq+0x8f>
>>> /* Prevent hang_check timer from firing at us during very
>>> long I/O */
>>> hang_check = sysctl_hung_task_timeout_secs;
>>> if (hang_check)
>>> while (!wait_for_completion_io_timeout(&wait,
>>> hang_check * (HZ/2)));
>>> else
>>> wait_for_completion_io(&wait);
>>> 1ff: 4c 89 e7 mov %r12,%rdi
>>> 202: e8 00 00 00 00 callq 207 <blk_execute_rq+0x97>
>>>
>>> if (rq->errors)
>>> 207: 83 bb 04 01 00 00 01 cmpl $0x1,0x104(%rbx)
>>>
>>> I ran btrace, and it seems that the IO running before the path
>>> disconnect is properly cleaned, and only the IO running after gets
>>> stuck (in the btrace example below I run both multipath -ll, and a
>>> dd):
>>>
>>> 259,2 15 27548 4.790727567 8527 I RS 764837376 + 8 [fio]
>>> 259,2 15 27549 4.790728103 8527 D RS 764837376 + 8 [fio]
>>> 259,2 15 27550 4.791244566 8527 A R 738007008 + 8 <-
>>> (253,0) 738007008
>>> 259,2 15 27551 4.791245136 8527 I RS 738007008 + 8 [fio]
>>> 259,2 15 27552 4.791245803 8527 D RS 738007008 + 8 [fio]
>>> <<<<< From this stage there are no more fio Queue requests, only
>>> completions
>>> 259,2 8 9118 4.758116423 8062 C RS 2294539440 + 8 [0]
>>> 259,2 8 9119 4.758559461 8062 C RS 160307464 + 8 [0]
>>> 259,2 8 9120 4.759009990 0 C RS 2330623512 + 8 [0]
>>> 259,2 8 9121 4.759457400 0 C RS 2657710096 + 8 [0]
>>> 259,2 8 9122 4.759928113 0 C RS 2988256176 + 8 [0]
>>> 259,2 8 9123 4.760388009 0 C RS 789190936 + 8 [0]
>>> 259,2 8 9124 4.760835889 0 C RS 2915035352 + 8 [0]
>>> 259,2 8 9125 4.761304282 0 C RS 2458313624 + 8 [0]
>>> 259,2 8 9126 4.761779694 0 C RS 2280411456 + 8 [0]
>>> 259,2 8 9127 4.762286741 0 C RS 2082207376 + 8 [0]
>>> 259,2 8 9128 4.762783722 0 C RS 2874601688 + 8 [0]
>>> ....
>>> 259,2 8 9173 4.785798485 0 C RS 828737568 + 8 [0]
>>> 259,2 8 9174 4.786301391 0 C RS 229168888 + 8 [0]
>>> 259,2 8 9175 4.786769105 0 C RS 893604096 + 8 [0]
>>> 259,2 8 9176 4.787270594 0 C RS 2308778728 + 8 [0]
>>> 259,2 8 9177 4.787797448 0 C RS 2190029912 + 8 [0]
>>> 259,2 8 9178 4.788298956 0 C RS 2652012944 + 8 [0]
>>> 259,2 8 9179 4.788803759 0 C RS 2205242008 + 8 [0]
>>> 259,2 8 9180 4.789309423 0 C RS 1948528416 + 8 [0]
>>> 259,2 8 9181 4.789831240 0 C RS 2003421288 + 8 [0]
>>> 259,2 8 9182 4.790343528 0 C RS 2464964728 + 8 [0]
>>> 259,2 8 9183 4.790854684 0 C RS 764837376 + 8 [0]
>>> <<<<<< This is probably where the port had been disconnected
>>> 259,2 21 7 9.413088410 8574 Q R 0 + 8 [multipathd]
>>> 259,2 21 8 9.413089387 8574 G R 0 + 8 [multipathd]
>>> 259,2 21 9 9.413095030 8574 U N [multipathd] 1
>>> 259,2 21 10 9.413095367 8574 I RS 0 + 8 [multipathd]
>>> 259,2 21 11 9.413096534 8574 D RS 0 + 8 [multipathd]
>>> 259,2 10 1 14.381330190 2016 C RS 738007008 + 8 [7]
>>> 259,2 21 12 14.381394890 0 R RS 0 + 8 [7]
>>> 259,2 8 9184 80.500537429 8657 Q R 0 + 8 [dd]
>>> 259,2 8 9185 80.500540122 8657 G R 0 + 8 [dd]
>>> 259,2 8 9186 80.500545289 8657 U N [dd] 1
>>> 259,2 8 9187 80.500545792 8657 I RS 0 + 8 [dd]
>>> <<<<<< At this stage, I reconnected the path:
>>> 259,2 4 1 8381.791090134 11127 Q R 0 + 8 [dd]
>>> 259,2 4 2 8381.791093611 11127 G R 0 + 8 [dd]
>>> 259,2 4 3 8381.791098288 11127 U N [dd] 1
>>> 259,2 4 4 8381.791098791 11127 I RS 0 + 8 [dd]
>>>
>>>
>>> As I wrote above, I will check the patch on Sunday/Monday.
>>>
>>> On 05/17/2017 08:28 PM, Sagi Grimberg wrote:
>>>> Hi Alex,
>>>>
>>>>> I am trying to test failure scenarios of NVMeoF + multipath. I bring
>>>>> one of the ports down and expect to see failed paths using "multipath
>>>>> -ll". Instead I see that "multipath -ll" gets stuck.
>>>>>
>>>>> reproduce:
>>>>> 1. Connected to NVMeoF device through 2 ports.
>>>>> 2. Bind them with multipath.
>>>>> 3. Bring one port down (ifconfig eth3 down)
>>>>> 4. Execute "multipath -ll" command and it will get stuck.
>>>>> From strace I see that multipath is stuck in io_destroy() during
>>>>> release of resources. As I understand io_destroy is stuck because of
>>>>> io_cancel() that failed. And io_cancel() failed because of the port
>>>>> that was disabled in step 3.
>>>>
>>>> Hmm, it looks like we do take care of failing fast pending IO, but
>>>> once
>>>> we schedule periodic reconnects the request queues are already stopped
>>>> and new incoming requests may block until we successfully reconnect.
>>>>
>>>> I don't have too much time for it at the moment, but here is an
>>>> untested
>>>> patch for you to try out:
>>>>
>>>> --
>>>> [PATCH] nvme-rdma: restart queues at error recovery to fast
>>>> fail incoming io
>>>>
>>>> When we encounter transport/controller errors, error recovery
>>>> kicks in which performs:
>>>> 1. stops io/admin queues
>>>> 2. moves transport queues out of LIVE state
>>>> 3. fast fail pending io
>>>> 4. schedule periodic reconnects.
>>>>
>>>> But we also need to fast fail incoming IO that enters after we
>>>> already scheduled. Given that our queue is not LIVE anymore, simply
>>>> restart the request queues to fail in .queue_rq
>>>>
>>>> Signed-off-by: Sagi Grimberg <sagi at grimberg.me>
>>>> ---
>>>> drivers/nvme/host/rdma.c | 20 +++++++++++---------
>>>> 1 file changed, 11 insertions(+), 9 deletions(-)
>>>>
>>>> diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
>>>> index dd1c6deef82f..a0aa2bfb91ee 100644
>>>> --- a/drivers/nvme/host/rdma.c
>>>> +++ b/drivers/nvme/host/rdma.c
>>>> @@ -753,28 +753,26 @@ static void
>>>> nvme_rdma_reconnect_ctrl_work(struct work_struct *work)
>>>> if (ret)
>>>> goto requeue;
>>>>
>>>> - blk_mq_start_stopped_hw_queues(ctrl->ctrl.admin_q, true);
>>>> -
>>>> ret = nvmf_connect_admin_queue(&ctrl->ctrl);
>>>> if (ret)
>>>> - goto stop_admin_q;
>>>> + goto requeue;
>>>>
>>>> set_bit(NVME_RDMA_Q_LIVE, &ctrl->queues[0].flags);
>>>>
>>>> ret = nvme_enable_ctrl(&ctrl->ctrl, ctrl->cap);
>>>> if (ret)
>>>> - goto stop_admin_q;
>>>> + goto requeue;
>>>>
>>>> nvme_start_keep_alive(&ctrl->ctrl);
>>>>
>>>> if (ctrl->queue_count > 1) {
>>>> ret = nvme_rdma_init_io_queues(ctrl);
>>>> if (ret)
>>>> - goto stop_admin_q;
>>>> + goto requeue;
>>>>
>>>> ret = nvme_rdma_connect_io_queues(ctrl);
>>>> if (ret)
>>>> - goto stop_admin_q;
>>>> + goto requeue;
>>>> }
>>>>
>>>> changed = nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_LIVE);
>>>> @@ -782,7 +780,6 @@ static void
>>>> nvme_rdma_reconnect_ctrl_work(struct work_struct *work)
>>>> ctrl->ctrl.opts->nr_reconnects = 0;
>>>>
>>>> if (ctrl->queue_count > 1) {
>>>> - nvme_start_queues(&ctrl->ctrl);
>>>> nvme_queue_scan(&ctrl->ctrl);
>>>> nvme_queue_async_events(&ctrl->ctrl);
>>>> }
>>>> @@ -791,8 +788,6 @@ static void
>>>> nvme_rdma_reconnect_ctrl_work(struct work_struct *work)
>>>>
>>>> return;
>>>>
>>>> -stop_admin_q:
>>>> - blk_mq_stop_hw_queues(ctrl->ctrl.admin_q);
>>>> requeue:
>>>> dev_info(ctrl->ctrl.device, "Failed reconnect attempt %d\n",
>>>> ctrl->ctrl.opts->nr_reconnects);
>>>> @@ -823,6 +818,13 @@ static void
>>>> nvme_rdma_error_recovery_work(struct work_struct *work)
>>>> blk_mq_tagset_busy_iter(&ctrl->admin_tag_set,
>>>> nvme_cancel_request, &ctrl->ctrl);
>>>>
>>>> + /*
>>>> + * queues are not alive anymore, so restart the queues to
>>>> fail fast
>>>> + * new IO
>>>> + */
>>>> + blk_mq_start_stopped_hw_queues(ctrl->ctrl.admin_q, true);
>>>> + nvme_start_queues(&ctrl->ctrl);
>>>> +
>>>> nvme_rdma_reconnect_or_remove(ctrl);
>>>> }
>>>>
>>>> --
>>>>
>>>> _______________________________________________
>>>> Linux-nvme mailing list
>>>> Linux-nvme at lists.infradead.org
>>>> http://lists.infradead.org/mailman/listinfo/linux-nvme
>>>
>>
>
^ permalink raw reply [flat|nested] 18+ messages in thread
* NVMeoF: multipath stuck after bringing one ethernet port down
2017-05-25 12:27 ` shahar.salzman
@ 2017-05-25 13:08 ` shahar.salzman
2017-05-30 12:14 ` Sagi Grimberg
1 sibling, 0 replies; 18+ messages in thread
From: shahar.salzman @ 2017-05-25 13:08 UTC (permalink / raw)
This is the patch I applied (Sagi's + my additions), for the
mlnx-nvme-rdma source RPM:
--- mlnx-nvme-rdma-4.0/source/rdma.c.orig 2017-01-29
22:26:53.000000000 +0200
+++ mlnx-nvme-rdma-4.0/source/rdma.c 2017-05-25 11:12:00.703099000 +0300
@@ -715,35 +715,32 @@
if (ret)
goto requeue;
- blk_mq_start_stopped_hw_queues(ctrl->ctrl.admin_q, true);
-
ret = nvmf_connect_admin_queue(&ctrl->ctrl);
if (ret)
- goto stop_admin_q;
+ goto requeue;
set_bit(NVME_RDMA_Q_LIVE, &ctrl->queues[0].flags);
ret = nvme_enable_ctrl(&ctrl->ctrl, ctrl->cap);
if (ret)
- goto stop_admin_q;
+ goto requeue;
nvme_start_keep_alive(&ctrl->ctrl);
if (ctrl->queue_count > 1) {
ret = nvme_rdma_init_io_queues(ctrl);
if (ret)
- goto stop_admin_q;
+ goto requeue;
ret = nvme_rdma_connect_io_queues(ctrl);
if (ret)
- goto stop_admin_q;
+ goto requeue;
}
changed = nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_LIVE);
WARN_ON_ONCE(!changed);
if (ctrl->queue_count > 1) {
- nvme_start_queues(&ctrl->ctrl);
nvme_queue_scan(&ctrl->ctrl);
nvme_queue_async_events(&ctrl->ctrl);
}
@@ -752,8 +749,6 @@
return;
-stop_admin_q:
- blk_mq_stop_hw_queues(ctrl->ctrl.admin_q);
requeue:
/* Make sure we are not resetting/deleting */
if (ctrl->ctrl.state == NVME_CTRL_RECONNECTING) {
@@ -788,6 +783,13 @@
blk_mq_tagset_busy_iter(&ctrl->admin_tag_set,
nvme_cancel_request, &ctrl->ctrl);
+ /*
+ * queues are not alive anymore, so restart the queues to fail fast
+ * new IO
+ */
+ blk_mq_start_stopped_hw_queues(ctrl->ctrl.admin_q, true);
+ nvme_start_queues(&ctrl->ctrl);
+
dev_info(ctrl->ctrl.device, "reconnecting in %d seconds\n",
ctrl->reconnect_delay);
@@ -1426,7 +1428,7 @@
WARN_ON_ONCE(rq->tag < 0);
if (!nvme_rdma_queue_is_ready(queue, rq))
- return BLK_MQ_RQ_QUEUE_BUSY;
+ return BLK_MQ_RQ_QUEUE_ERROR;
dev = queue->device->dev;
ib_dma_sync_single_for_cpu(dev, sqe->dma,
@@ -2031,6 +2033,8 @@
{
int ret;
+ pr_info("Loading nvme-rdma with Sagi G. patch to support "
+ "multipath\n");
nvme_rdma_wq = create_workqueue("nvme_rdma_wq");
if (!nvme_rdma_wq)
return -ENOMEM;
--- mlnx-nvme-rdma-4.0/source/core.c.orig 2017-05-25
14:50:22.941022000 +0300
+++ mlnx-nvme-rdma-4.0/source/core.c 2017-05-25 14:50:30.563588000 +0300
@@ -488,7 +488,6 @@
if (error) {
dev_err(ctrl->device,
"failed nvme_keep_alive_end_io error=%d\n", error);
- return;
}
schedule_delayed_work(&ctrl->ka_work, ctrl->kato * HZ);
On 05/25/2017 03:27 PM, shahar.salzman wrote:
> Returning the QUEUE_ERROR also solved the "nvme list" issue reported,
> but the paths were not coming back after I reconnected the ports.
>
> I played with the keep alive thread so that it didn't exit if it got
> an error, and that was resolved as well.
>
>
> On 05/25/2017 02:06 PM, shahar.salzman wrote:
>> OK, so the indefinite retries are due to the
>> nvme/host/rdma.c:nvme_rdma_queue_rq returning BLK_MQ_RQ_QUEUE_BUSY
>> when the queue is not ready, wouldn't it be better to return
>> BLK_MQ_RQ_QUEUE_ERROR so that the application may handle it? Or
>> alternatively return an error after several retries (per Q or
>> something).
>>
>> I tested a patch (on top of Sagi's) converting the return value to
>> QUEUE_ERROR, and both dd, and multipath -ll return instantly.
>>
>>
>> On 05/23/2017 08:31 PM, shahar.salzman wrote:
>>> I used the patch with my Mellanox OFED (mlnx-nvme-rdma), it
>>> partially helps but the overall behaviour is still not good...
>>>
>>> If I do a dd (direct_IO) or multipath -ll during the path
>>> disconnect, I get indefinite retries with nearly zero time between
>>> retries, until the path is reconnected, and then dd is completed
>>> successfully, paths come back to life, returning OK status to the
>>> process which completes successfully. I would expect IO to fail after
>>> a certain timeout, and that the retries should be paced somewhat.
>>>
>>> If I do an nvme list (blk_execute_rq), then the process stays in D
>>> state (I still see the retries), the block layer (at least for this
>>> device) seems to be stuck, and I have to power cycle the server to
>>> get it to work...
>>>
>>> Just a reminder, I am using Mellanox OFED 4.0, and kernel 4.9.6, the
>>> plan here is to upgrade to the latest stable kernel but not sure
>>> when we'll get to it.
>>>
>>>
>>> On 05/18/2017 08:52 AM, shahar.salzman wrote:
>>>> I have seen this too, but was internally investigating it as I am
>>>> running 4.9.6 and not upstream. I will be able to check this patch
>>>> on Sunday/Monday.
>>>>
>>>> On my setup the IOs never complete even after the path is
>>>> reconnected, and the process remains stuck in D state, and the path
>>>> unusable until I power cycle the server.
>>>>
>>>> Some more information, hope it helps:
>>>>
>>>> Dumping the stuck process stack (multipath -ll):
>>>>
>>>> [root at kblock01-knode02 ~]# cat /proc/9715/stack
>>>> [<ffffffff94354727>] blk_execute_rq+0x97/0x110
>>>> [<ffffffffc0a72f8a>] __nvme_submit_user_cmd+0xca/0x300 [nvme_core]
>>>> [<ffffffffc0a73369>] nvme_submit_user_cmd+0x29/0x30 [nvme_core]
>>>> [<ffffffffc0a734bd>] nvme_user_cmd+0x14d/0x180 [nvme_core]
>>>> [<ffffffffc0a7376c>] nvme_ioctl+0x7c/0xa0 [nvme_core]
>>>> [<ffffffff9435d548>] __blkdev_driver_ioctl+0x28/0x30
>>>> [<ffffffff9435dce1>] blkdev_ioctl+0x131/0x8f0
>>>> [<ffffffff9428993c>] block_ioctl+0x3c/0x40
>>>> [<ffffffff94260ae8>] vfs_ioctl+0x18/0x30
>>>> [<ffffffff94261201>] do_vfs_ioctl+0x161/0x600
>>>> [<ffffffff94261732>] SyS_ioctl+0x92/0xa0
>>>> [<ffffffff9479cc77>] entry_SYSCALL_64_fastpath+0x1a/0xa9
>>>> [<ffffffffffffffff>] 0xffffffffffffffff
>>>>
>>>> Looking at the disassembly:
>>>>
>>>> 0000000000000170 <blk_execute_rq>:
>>>>
>>>> ...
>>>>
>>>> 1fa: e8 00 00 00 00 callq 1ff <blk_execute_rq+0x8f>
>>>> /* Prevent hang_check timer from firing at us during very
>>>> long I/O */
>>>> hang_check = sysctl_hung_task_timeout_secs;
>>>> if (hang_check)
>>>> while (!wait_for_completion_io_timeout(&wait,
>>>> hang_check * (HZ/2)));
>>>> else
>>>> wait_for_completion_io(&wait);
>>>> 1ff: 4c 89 e7 mov %r12,%rdi
>>>> 202: e8 00 00 00 00 callq 207 <blk_execute_rq+0x97>
>>>>
>>>> if (rq->errors)
>>>> 207: 83 bb 04 01 00 00 01 cmpl $0x1,0x104(%rbx)
>>>>
>>>> I ran btrace, and it seems that the IO running before the path
>>>> disconnect is properly cleaned, and only the IO running after gets
>>>> stuck (in the btrace example below I run both multipath -ll, and a
>>>> dd):
>>>>
>>>> 259,2 15 27548 4.790727567 8527 I RS 764837376 + 8 [fio]
>>>> 259,2 15 27549 4.790728103 8527 D RS 764837376 + 8 [fio]
>>>> 259,2 15 27550 4.791244566 8527 A R 738007008 + 8 <-
>>>> (253,0) 738007008
>>>> 259,2 15 27551 4.791245136 8527 I RS 738007008 + 8 [fio]
>>>> 259,2 15 27552 4.791245803 8527 D RS 738007008 + 8 [fio]
>>>> <<<<< From this stage there are no more fio Queue requests, only
>>>> completions
>>>> 259,2 8 9118 4.758116423 8062 C RS 2294539440 + 8 [0]
>>>> 259,2 8 9119 4.758559461 8062 C RS 160307464 + 8 [0]
>>>> 259,2 8 9120 4.759009990 0 C RS 2330623512 + 8 [0]
>>>> 259,2 8 9121 4.759457400 0 C RS 2657710096 + 8 [0]
>>>> 259,2 8 9122 4.759928113 0 C RS 2988256176 + 8 [0]
>>>> 259,2 8 9123 4.760388009 0 C RS 789190936 + 8 [0]
>>>> 259,2 8 9124 4.760835889 0 C RS 2915035352 + 8 [0]
>>>> 259,2 8 9125 4.761304282 0 C RS 2458313624 + 8 [0]
>>>> 259,2 8 9126 4.761779694 0 C RS 2280411456 + 8 [0]
>>>> 259,2 8 9127 4.762286741 0 C RS 2082207376 + 8 [0]
>>>> 259,2 8 9128 4.762783722 0 C RS 2874601688 + 8 [0]
>>>> ....
>>>> 259,2 8 9173 4.785798485 0 C RS 828737568 + 8 [0]
>>>> 259,2 8 9174 4.786301391 0 C RS 229168888 + 8 [0]
>>>> 259,2 8 9175 4.786769105 0 C RS 893604096 + 8 [0]
>>>> 259,2 8 9176 4.787270594 0 C RS 2308778728 + 8 [0]
>>>> 259,2 8 9177 4.787797448 0 C RS 2190029912 + 8 [0]
>>>> 259,2 8 9178 4.788298956 0 C RS 2652012944 + 8 [0]
>>>> 259,2 8 9179 4.788803759 0 C RS 2205242008 + 8 [0]
>>>> 259,2 8 9180 4.789309423 0 C RS 1948528416 + 8 [0]
>>>> 259,2 8 9181 4.789831240 0 C RS 2003421288 + 8 [0]
>>>> 259,2 8 9182 4.790343528 0 C RS 2464964728 + 8 [0]
>>>> 259,2 8 9183 4.790854684 0 C RS 764837376 + 8 [0]
>>>> <<<<<< This is probably where the port had been disconnected
>>>> 259,2 21 7 9.413088410 8574 Q R 0 + 8 [multipathd]
>>>> 259,2 21 8 9.413089387 8574 G R 0 + 8 [multipathd]
>>>> 259,2 21 9 9.413095030 8574 U N [multipathd] 1
>>>> 259,2 21 10 9.413095367 8574 I RS 0 + 8 [multipathd]
>>>> 259,2 21 11 9.413096534 8574 D RS 0 + 8 [multipathd]
>>>> 259,2 10 1 14.381330190 2016 C RS 738007008 + 8 [7]
>>>> 259,2 21 12 14.381394890 0 R RS 0 + 8 [7]
>>>> 259,2 8 9184 80.500537429 8657 Q R 0 + 8 [dd]
>>>> 259,2 8 9185 80.500540122 8657 G R 0 + 8 [dd]
>>>> 259,2 8 9186 80.500545289 8657 U N [dd] 1
>>>> 259,2 8 9187 80.500545792 8657 I RS 0 + 8 [dd]
>>>> <<<<<< At this stage, I reconnected the path:
>>>> 259,2 4 1 8381.791090134 11127 Q R 0 + 8 [dd]
>>>> 259,2 4 2 8381.791093611 11127 G R 0 + 8 [dd]
>>>> 259,2 4 3 8381.791098288 11127 U N [dd] 1
>>>> 259,2 4 4 8381.791098791 11127 I RS 0 + 8 [dd]
>>>>
>>>>
>>>> As I wrote above, I will check the patch on Sunday/Monday.
>>>>
>>>> On 05/17/2017 08:28 PM, Sagi Grimberg wrote:
>>>>> Hi Alex,
>>>>>
>>>>>> I am trying to test failure scenarios of NVMeoF + multipath. I bring
>>>>>> one of the ports down and expect to see failed paths using "multipath
>>>>>> -ll". Instead I see that "multipath -ll" gets stuck.
>>>>>>
>>>>>> reproduce:
>>>>>> 1. Connected to NVMeoF device through 2 ports.
>>>>>> 2. Bind them with multipath.
>>>>>> 3. Bring one port down (ifconfig eth3 down)
>>>>>> 4. Execute "multipath -ll" command and it will get stuck.
>>>>>> From strace I see that multipath is stuck in io_destroy() during
>>>>>> release of resources. As I understand io_destroy is stuck because of
>>>>>> io_cancel() that failed. And io_cancel() failed because of the port
>>>>>> that was disabled in step 3.
>>>>>
>>>>> Hmm, it looks like we do take care of failing fast pending IO, but
>>>>> once
>>>>> we schedule periodic reconnects the request queues are already
>>>>> stopped
>>>>> and new incoming requests may block until we successfully reconnect.
>>>>>
>>>>> I don't have too much time for it at the moment, but here is an
>>>>> untested
>>>>> patch for you to try out:
>>>>>
>>>>> --
>>>>> [PATCH] nvme-rdma: restart queues at error recovery to fast
>>>>> fail incoming io
>>>>>
>>>>> When we encounter transport/controller errors, error recovery
>>>>> kicks in which performs:
>>>>> 1. stops io/admin queues
>>>>> 2. moves transport queues out of LIVE state
>>>>> 3. fast fail pending io
>>>>> 4. schedule periodic reconnects.
>>>>>
>>>>> But we also need to fast fail incoming IO that enters after we
>>>>> already scheduled. Given that our queue is not LIVE anymore, simply
>>>>> restart the request queues to fail in .queue_rq
>>>>>
>>>>> Signed-off-by: Sagi Grimberg <sagi at grimberg.me>
>>>>> ---
>>>>> drivers/nvme/host/rdma.c | 20 +++++++++++---------
>>>>> 1 file changed, 11 insertions(+), 9 deletions(-)
>>>>>
>>>>> diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
>>>>> index dd1c6deef82f..a0aa2bfb91ee 100644
>>>>> --- a/drivers/nvme/host/rdma.c
>>>>> +++ b/drivers/nvme/host/rdma.c
>>>>> @@ -753,28 +753,26 @@ static void
>>>>> nvme_rdma_reconnect_ctrl_work(struct work_struct *work)
>>>>> if (ret)
>>>>> goto requeue;
>>>>>
>>>>> - blk_mq_start_stopped_hw_queues(ctrl->ctrl.admin_q, true);
>>>>> -
>>>>> ret = nvmf_connect_admin_queue(&ctrl->ctrl);
>>>>> if (ret)
>>>>> - goto stop_admin_q;
>>>>> + goto requeue;
>>>>>
>>>>> set_bit(NVME_RDMA_Q_LIVE, &ctrl->queues[0].flags);
>>>>>
>>>>> ret = nvme_enable_ctrl(&ctrl->ctrl, ctrl->cap);
>>>>> if (ret)
>>>>> - goto stop_admin_q;
>>>>> + goto requeue;
>>>>>
>>>>> nvme_start_keep_alive(&ctrl->ctrl);
>>>>>
>>>>> if (ctrl->queue_count > 1) {
>>>>> ret = nvme_rdma_init_io_queues(ctrl);
>>>>> if (ret)
>>>>> - goto stop_admin_q;
>>>>> + goto requeue;
>>>>>
>>>>> ret = nvme_rdma_connect_io_queues(ctrl);
>>>>> if (ret)
>>>>> - goto stop_admin_q;
>>>>> + goto requeue;
>>>>> }
>>>>>
>>>>> changed = nvme_change_ctrl_state(&ctrl->ctrl,
>>>>> NVME_CTRL_LIVE);
>>>>> @@ -782,7 +780,6 @@ static void
>>>>> nvme_rdma_reconnect_ctrl_work(struct work_struct *work)
>>>>> ctrl->ctrl.opts->nr_reconnects = 0;
>>>>>
>>>>> if (ctrl->queue_count > 1) {
>>>>> - nvme_start_queues(&ctrl->ctrl);
>>>>> nvme_queue_scan(&ctrl->ctrl);
>>>>> nvme_queue_async_events(&ctrl->ctrl);
>>>>> }
>>>>> @@ -791,8 +788,6 @@ static void
>>>>> nvme_rdma_reconnect_ctrl_work(struct work_struct *work)
>>>>>
>>>>> return;
>>>>>
>>>>> -stop_admin_q:
>>>>> - blk_mq_stop_hw_queues(ctrl->ctrl.admin_q);
>>>>> requeue:
>>>>> dev_info(ctrl->ctrl.device, "Failed reconnect attempt %d\n",
>>>>> ctrl->ctrl.opts->nr_reconnects);
>>>>> @@ -823,6 +818,13 @@ static void
>>>>> nvme_rdma_error_recovery_work(struct work_struct *work)
>>>>> blk_mq_tagset_busy_iter(&ctrl->admin_tag_set,
>>>>> nvme_cancel_request, &ctrl->ctrl);
>>>>>
>>>>> + /*
>>>>> + * queues are not alive anymore, so restart the queues to
>>>>> fail fast
>>>>> + * new IO
>>>>> + */
>>>>> + blk_mq_start_stopped_hw_queues(ctrl->ctrl.admin_q, true);
>>>>> + nvme_start_queues(&ctrl->ctrl);
>>>>> +
>>>>> nvme_rdma_reconnect_or_remove(ctrl);
>>>>> }
>>>>>
>>>>> --
>>>>>
>>>>> _______________________________________________
>>>>> Linux-nvme mailing list
>>>>> Linux-nvme at lists.infradead.org
>>>>> http://lists.infradead.org/mailman/listinfo/linux-nvme
>>>>
>>>
>>
>
^ permalink raw reply [flat|nested] 18+ messages in thread
* NVMeoF: multipath stuck after bringing one ethernet port down
2017-05-23 17:31 ` shahar.salzman
2017-05-25 11:06 ` shahar.salzman
@ 2017-05-30 12:05 ` Sagi Grimberg
2017-05-30 13:37 ` Max Gurtovoy
1 sibling, 1 reply; 18+ messages in thread
From: Sagi Grimberg @ 2017-05-30 12:05 UTC (permalink / raw)
> I used the patch with my Mellanox OFED (mlnx-nvme-rdma), it partially
> helps but the overall behaviour is still not good...
Can you test with upstream kernel code? Otherwise I suggest approaching
Mellanox support.
We can't support non-upstream code really...
^ permalink raw reply [flat|nested] 18+ messages in thread
* NVMeoF: multipath stuck after bringing one ethernet port down
2017-05-25 11:06 ` shahar.salzman
2017-05-25 12:27 ` shahar.salzman
@ 2017-05-30 12:11 ` Sagi Grimberg
1 sibling, 0 replies; 18+ messages in thread
From: Sagi Grimberg @ 2017-05-30 12:11 UTC (permalink / raw)
> OK, so the indefinite retries are due to the
> nvme/host/rdma.c:nvme_rdma_queue_rq returning BLK_MQ_RQ_QUEUE_BUSY when
> the queue is not ready, wouldn't it be better to return
> BLK_MQ_RQ_QUEUE_ERROR so that the application may handle it? Or
> alternatively return an error after several retries (per Q or something).
>
> I tested a patch (on top of Sagi's) converting the return value to
> QUEUE_ERROR, and both dd, and multipath -ll return instantly.
We shouldn't return QUEUE_ERROR unconditionally, probably only if
the controller is in state RECONNECTING (RESETTING is a quick phase
and DELETING is going to kill the request anyway).
Does this keep the existing behavior:
--
diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 28bd255c144d..f0d700220ec1 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -1433,7 +1433,7 @@ nvme_rdma_timeout(struct request *rq, bool reserved)
/*
* We cannot accept any other command until the Connect command has
completed.
*/
-static inline bool nvme_rdma_queue_is_ready(struct nvme_rdma_queue *queue,
+static inline int nvme_rdma_queue_is_ready(struct nvme_rdma_queue *queue,
struct request *rq)
{
if (unlikely(!test_bit(NVME_RDMA_Q_LIVE, &queue->flags))) {
@@ -1441,11 +1441,15 @@ static inline bool
nvme_rdma_queue_is_ready(struct nvme_rdma_queue *queue,
if (!blk_rq_is_passthrough(rq) ||
cmd->common.opcode != nvme_fabrics_command ||
- cmd->fabrics.fctype != nvme_fabrics_type_connect)
- return false;
+ cmd->fabrics.fctype != nvme_fabrics_type_connect) {
+ if (queue->ctrl->ctrl->state ==
NVME_CTRL_RECONNECTING)
+ return -EIO;
+ else
+ return -EAGAIN;
+ }
}
- return true;
+ return 0;
}
static int nvme_rdma_queue_rq(struct blk_mq_hw_ctx *hctx,
@@ -1463,8 +1467,9 @@ static int nvme_rdma_queue_rq(struct blk_mq_hw_ctx
*hctx,
WARN_ON_ONCE(rq->tag < 0);
- if (!nvme_rdma_queue_is_ready(queue, rq))
- return BLK_MQ_RQ_QUEUE_BUSY;
+ ret = nvme_rdma_queue_is_ready(queue, rq);
+ if (unlikely(ret))
+ goto err;
dev = queue->device->dev;
ib_dma_sync_single_for_cpu(dev, sqe->dma,
--
^ permalink raw reply related [flat|nested] 18+ messages in thread
* NVMeoF: multipath stuck after bringing one ethernet port down
2017-05-25 12:27 ` shahar.salzman
2017-05-25 13:08 ` shahar.salzman
@ 2017-05-30 12:14 ` Sagi Grimberg
1 sibling, 0 replies; 18+ messages in thread
From: Sagi Grimberg @ 2017-05-30 12:14 UTC (permalink / raw)
> I played with the keep alive thread so that it didn't exit if it got an
> error, and that was resolved as well.
Why is that needed? There is no reason to keep scheduling keep-alives if
they fail.
^ permalink raw reply [flat|nested] 18+ messages in thread
* NVMeoF: multipath stuck after bringing one ethernet port down
2017-05-30 12:05 ` Sagi Grimberg
@ 2017-05-30 13:37 ` Max Gurtovoy
2017-05-30 14:17 ` Sagi Grimberg
0 siblings, 1 reply; 18+ messages in thread
From: Max Gurtovoy @ 2017-05-30 13:37 UTC (permalink / raw)
On 5/30/2017 3:05 PM, Sagi Grimberg wrote:
>
>> I used the patch with my Mellanox OFED (mlnx-nvme-rdma), it partially
>> helps but the overall behaviour is still not good...
>
> Can you test with upstream kernel code? Otherwise I suggest approaching
> Mellanox support.
>
> We can't support non-upstream code really...
Hi guys,
this is a known issue in upstream code and Mellanox OFED code as well.
I agree with Sagi's approach for future issues using our package.
For this one, we will test the proposed fixes and update you regarding
the results.
Max.
>
> _______________________________________________
> Linux-nvme mailing list
> Linux-nvme at lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-nvme
^ permalink raw reply [flat|nested] 18+ messages in thread
* NVMeoF: multipath stuck after bringing one ethernet port down
2017-05-30 13:37 ` Max Gurtovoy
@ 2017-05-30 14:17 ` Sagi Grimberg
2017-06-05 7:11 ` shahar.salzman
2017-06-05 8:40 ` Christoph Hellwig
0 siblings, 2 replies; 18+ messages in thread
From: Sagi Grimberg @ 2017-05-30 14:17 UTC (permalink / raw)
> Hi guys,
> this is a known issue in upstream code and Mellanox OFED code as well.
> I agree with Sagi's approach for future issues using our package.
> For this one, we will test the proposed fixes and update you regarding
> the results.
You can try:
--
[PATCH] nvme-rdma: fast fail incoming requests while we reconnect
When we encounter transport/controller errors, error recovery
kicks in which performs:
1. stops io/admin queues
2. moves transport queues out of LIVE state
3. fast fail pending io
4. schedule periodic reconnects.
But we also need to fast fail incoming IO that enters after we
already scheduled. Given that our queue is not LIVE anymore, simply
restart the request queues to fail in .queue_rq
Signed-off-by: Sagi Grimberg <sagi at grimberg.me>
---
drivers/nvme/host/rdma.c | 37 ++++++++++++++++++++++---------------
1 file changed, 22 insertions(+), 15 deletions(-)
diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 28bd255c144d..ce8f1e992e64 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -753,28 +753,26 @@ static void nvme_rdma_reconnect_ctrl_work(struct work_struct *work)
if (ret)
goto requeue;
- blk_mq_start_stopped_hw_queues(ctrl->ctrl.admin_q, true);
-
ret = nvmf_connect_admin_queue(&ctrl->ctrl);
if (ret)
- goto stop_admin_q;
+ goto requeue;
set_bit(NVME_RDMA_Q_LIVE, &ctrl->queues[0].flags);
ret = nvme_enable_ctrl(&ctrl->ctrl, ctrl->cap);
if (ret)
- goto stop_admin_q;
+ goto requeue;
nvme_start_keep_alive(&ctrl->ctrl);
if (ctrl->queue_count > 1) {
ret = nvme_rdma_init_io_queues(ctrl);
if (ret)
- goto stop_admin_q;
+ goto requeue;
ret = nvme_rdma_connect_io_queues(ctrl);
if (ret)
- goto stop_admin_q;
+ goto requeue;
}
changed = nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_LIVE);
@@ -782,7 +780,6 @@ static void nvme_rdma_reconnect_ctrl_work(struct work_struct *work)
ctrl->ctrl.opts->nr_reconnects = 0;
if (ctrl->queue_count > 1) {
- nvme_start_queues(&ctrl->ctrl);
nvme_queue_scan(&ctrl->ctrl);
nvme_queue_async_events(&ctrl->ctrl);
}
@@ -791,8 +788,6 @@ static void nvme_rdma_reconnect_ctrl_work(struct work_struct *work)
return;
-stop_admin_q:
- blk_mq_stop_hw_queues(ctrl->ctrl.admin_q);
requeue:
dev_info(ctrl->ctrl.device, "Failed reconnect attempt %d\n",
ctrl->ctrl.opts->nr_reconnects);
@@ -823,6 +818,13 @@ static void nvme_rdma_error_recovery_work(struct work_struct *work)
blk_mq_tagset_busy_iter(&ctrl->admin_tag_set,
nvme_cancel_request, &ctrl->ctrl);
+ /*
+ * queues are not live anymore, so restart the queues to fast fail
+ * new IO
+ */
+ blk_mq_start_stopped_hw_queues(ctrl->ctrl.admin_q, true);
+ nvme_start_queues(&ctrl->ctrl);
+
nvme_rdma_reconnect_or_remove(ctrl);
}
@@ -1433,7 +1435,7 @@ nvme_rdma_timeout(struct request *rq, bool reserved)
/*
* We cannot accept any other command until the Connect command has completed.
*/
-static inline bool nvme_rdma_queue_is_ready(struct nvme_rdma_queue *queue,
+static inline int nvme_rdma_queue_is_ready(struct nvme_rdma_queue *queue,
struct request *rq)
{
if (unlikely(!test_bit(NVME_RDMA_Q_LIVE, &queue->flags))) {
@@ -1441,11 +1443,15 @@ static inline bool nvme_rdma_queue_is_ready(struct nvme_rdma_queue *queue,
if (!blk_rq_is_passthrough(rq) ||
cmd->common.opcode != nvme_fabrics_command ||
- cmd->fabrics.fctype != nvme_fabrics_type_connect)
- return false;
+ cmd->fabrics.fctype != nvme_fabrics_type_connect) {
+ if (queue->ctrl->ctrl.state == NVME_CTRL_RECONNECTING)
+ return -EIO;
+ else
+ return -EAGAIN;
+ }
}
- return true;
+ return 0;
}
static int nvme_rdma_queue_rq(struct blk_mq_hw_ctx *hctx,
@@ -1463,8 +1469,9 @@ static int nvme_rdma_queue_rq(struct blk_mq_hw_ctx *hctx,
WARN_ON_ONCE(rq->tag < 0);
- if (!nvme_rdma_queue_is_ready(queue, rq))
- return BLK_MQ_RQ_QUEUE_BUSY;
+ ret = nvme_rdma_queue_is_ready(queue, rq);
+ if (unlikely(ret))
+ goto err;
dev = queue->device->dev;
ib_dma_sync_single_for_cpu(dev, sqe->dma,
--
^ permalink raw reply related [flat|nested] 18+ messages in thread
* NVMeoF: multipath stuck after bringing one ethernet port down
2017-05-30 14:17 ` Sagi Grimberg
@ 2017-06-05 7:11 ` shahar.salzman
2017-06-05 8:14 ` Sagi Grimberg
2017-06-05 8:40 ` Christoph Hellwig
1 sibling, 1 reply; 18+ messages in thread
From: shahar.salzman @ 2017-06-05 7:11 UTC (permalink / raw)
I tested the patch, works great.
IO (dd), "multipath -ll", and "nvme list" all return instantaneously
with an IO error, and multipath is reinstated as soon as the path is reconnected.
Sagi, thanks for the fix!
On 05/30/2017 05:17 PM, Sagi Grimberg wrote:
>
>> Hi guys,
>> this is a known issue in upstream code and Mellanox OFED code as well.
>> I agree with Sagi's approach for future issues using our package.
>> For this one, we will test the proposed fixes and update you regarding
>> the results.
>
> You can try:
>
> --
> [PATCH] nvme-rdma: fast fail incoming requests while we reconnect
>
> When we encounter transport/controller errors, error recovery
> kicks in which performs:
> 1. stops io/admin queues
> 2. moves transport queues out of LIVE state
> 3. fast fail pending io
> 4. schedule periodic reconnects.
>
> But we also need to fast fail incoming IO that enters after we
> already scheduled. Given that our queue is not LIVE anymore, simply
> restart the request queues to fail in .queue_rq
>
> Signed-off-by: Sagi Grimberg <sagi at grimberg.me>
> ---
> drivers/nvme/host/rdma.c | 37 ++++++++++++++++++++++---------------
> 1 file changed, 22 insertions(+), 15 deletions(-)
>
> diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
> index 28bd255c144d..ce8f1e992e64 100644
> --- a/drivers/nvme/host/rdma.c
> +++ b/drivers/nvme/host/rdma.c
> @@ -753,28 +753,26 @@ static void nvme_rdma_reconnect_ctrl_work(struct
> work_struct *work)
> if (ret)
> goto requeue;
>
> - blk_mq_start_stopped_hw_queues(ctrl->ctrl.admin_q, true);
> -
> ret = nvmf_connect_admin_queue(&ctrl->ctrl);
> if (ret)
> - goto stop_admin_q;
> + goto requeue;
>
> set_bit(NVME_RDMA_Q_LIVE, &ctrl->queues[0].flags);
>
> ret = nvme_enable_ctrl(&ctrl->ctrl, ctrl->cap);
> if (ret)
> - goto stop_admin_q;
> + goto requeue;
>
> nvme_start_keep_alive(&ctrl->ctrl);
>
> if (ctrl->queue_count > 1) {
> ret = nvme_rdma_init_io_queues(ctrl);
> if (ret)
> - goto stop_admin_q;
> + goto requeue;
>
> ret = nvme_rdma_connect_io_queues(ctrl);
> if (ret)
> - goto stop_admin_q;
> + goto requeue;
> }
>
> changed = nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_LIVE);
> @@ -782,7 +780,6 @@ static void nvme_rdma_reconnect_ctrl_work(struct
> work_struct *work)
> ctrl->ctrl.opts->nr_reconnects = 0;
>
> if (ctrl->queue_count > 1) {
> - nvme_start_queues(&ctrl->ctrl);
> nvme_queue_scan(&ctrl->ctrl);
> nvme_queue_async_events(&ctrl->ctrl);
> }
> @@ -791,8 +788,6 @@ static void nvme_rdma_reconnect_ctrl_work(struct
> work_struct *work)
>
> return;
>
> -stop_admin_q:
> - blk_mq_stop_hw_queues(ctrl->ctrl.admin_q);
> requeue:
> dev_info(ctrl->ctrl.device, "Failed reconnect attempt %d\n",
> ctrl->ctrl.opts->nr_reconnects);
> @@ -823,6 +818,13 @@ static void nvme_rdma_error_recovery_work(struct
> work_struct *work)
> blk_mq_tagset_busy_iter(&ctrl->admin_tag_set,
> nvme_cancel_request, &ctrl->ctrl);
>
> + /*
> + * queues are not live anymore, so restart the queues to fast fail
> + * new IO
> + */
> + blk_mq_start_stopped_hw_queues(ctrl->ctrl.admin_q, true);
> + nvme_start_queues(&ctrl->ctrl);
> +
> nvme_rdma_reconnect_or_remove(ctrl);
> }
>
> @@ -1433,7 +1435,7 @@ nvme_rdma_timeout(struct request *rq, bool
> reserved)
> /*
> * We cannot accept any other command until the Connect command has
> completed.
> */
> -static inline bool nvme_rdma_queue_is_ready(struct nvme_rdma_queue
> *queue,
> +static inline int nvme_rdma_queue_is_ready(struct nvme_rdma_queue
> *queue,
> struct request *rq)
> {
> if (unlikely(!test_bit(NVME_RDMA_Q_LIVE, &queue->flags))) {
> @@ -1441,11 +1443,15 @@ static inline bool
> nvme_rdma_queue_is_ready(struct nvme_rdma_queue *queue,
>
> if (!blk_rq_is_passthrough(rq) ||
> cmd->common.opcode != nvme_fabrics_command ||
> - cmd->fabrics.fctype != nvme_fabrics_type_connect)
> - return false;
> + cmd->fabrics.fctype != nvme_fabrics_type_connect) {
> + if (queue->ctrl->ctrl.state == NVME_CTRL_RECONNECTING)
> + return -EIO;
> + else
> + return -EAGAIN;
> + }
> }
>
> - return true;
> + return 0;
> }
>
> static int nvme_rdma_queue_rq(struct blk_mq_hw_ctx *hctx,
> @@ -1463,8 +1469,9 @@ static int nvme_rdma_queue_rq(struct
> blk_mq_hw_ctx *hctx,
>
> WARN_ON_ONCE(rq->tag < 0);
>
> - if (!nvme_rdma_queue_is_ready(queue, rq))
> - return BLK_MQ_RQ_QUEUE_BUSY;
> + ret = nvme_rdma_queue_is_ready(queue, rq);
> + if (unlikely(ret))
> + goto err;
>
> dev = queue->device->dev;
> ib_dma_sync_single_for_cpu(dev, sqe->dma,
> --
^ permalink raw reply [flat|nested] 18+ messages in thread
* NVMeoF: multipath stuck after bringing one ethernet port down
2017-06-05 7:11 ` shahar.salzman
@ 2017-06-05 8:14 ` Sagi Grimberg
0 siblings, 0 replies; 18+ messages in thread
From: Sagi Grimberg @ 2017-06-05 8:14 UTC (permalink / raw)
> I tested the patch, works great.
>
> IO (dd), "multipath -ll", and "nvme list" all return instantaneously
> with an IO error, and multipath is reinstated as soon as the path is reconnected.
Thanks Shahar,
Alex, does the patch work for you as well?
Christoph, any feedback on this patch? if not, I'll submit a formal
patch soon for 4.12-rc.
^ permalink raw reply [flat|nested] 18+ messages in thread
* NVMeoF: multipath stuck after bringing one ethernet port down
2017-05-30 14:17 ` Sagi Grimberg
2017-06-05 7:11 ` shahar.salzman
@ 2017-06-05 8:40 ` Christoph Hellwig
2017-06-05 8:53 ` Sagi Grimberg
1 sibling, 1 reply; 18+ messages in thread
From: Christoph Hellwig @ 2017-06-05 8:40 UTC (permalink / raw)
On Tue, May 30, 2017@05:17:40PM +0300, Sagi Grimberg wrote:
> [PATCH] nvme-rdma: fast fail incoming requests while we reconnect
>
> When we encounter transport/controller errors, error recovery
> kicks in which performs:
> 1. stops io/admin queues
> 2. moves transport queues out of LIVE state
> 3. fast fail pending io
> 4. schedule periodic reconnects.
>
> But we also need to fast fail incoming IO that enters after we
> already scheduled. Given that our queue is not LIVE anymore, simply
> restart the request queues to fail in .queue_rq
But we shouldn't _fail_ I/O just because we're reconnecting, we
need to be able to retry it once reconnected.
> + cmd->fabrics.fctype != nvme_fabrics_type_connect) {
> + if (queue->ctrl->ctrl.state == NVME_CTRL_RECONNECTING)
> + return -EIO;
> + else
> + return -EAGAIN;
> + }
So this looks somewhat bogus to me, while the rest looks ok.
^ permalink raw reply [flat|nested] 18+ messages in thread
* NVMeoF: multipath stuck after bringing one ethernet port down
2017-06-05 8:40 ` Christoph Hellwig
@ 2017-06-05 8:53 ` Sagi Grimberg
2017-06-05 15:07 ` Christoph Hellwig
0 siblings, 1 reply; 18+ messages in thread
From: Sagi Grimberg @ 2017-06-05 8:53 UTC (permalink / raw)
>> [PATCH] nvme-rdma: fast fail incoming requests while we reconnect
>>
>> When we encounter transport/controller errors, error recovery
>> kicks in which performs:
>> 1. stops io/admin queues
>> 2. moves transport queues out of LIVE state
>> 3. fast fail pending io
>> 4. schedule periodic reconnects.
>>
>> But we also need to fast fail incoming IO that enters after we
>> already scheduled. Given that our queue is not LIVE anymore, simply
>> restart the request queues to fail in .queue_rq
>
> But we shouldn't _fail_ I/O just because we're reconnecting, we
> need to be able to retry it once reconnected.
I'm not sure; the point is to fail fast so that dm or the user can
fail over traffic. Besides, we iterate over and cancel all inflight IO;
this attempts to give the same treatment to IO that arrives later...
In several scsi transports, we have the concept of fast_io_fail_tmo
which we could add to nvme, but from my experience, people usually set
it to a minimum to achieve fast failover (usually smaller than the very
first reconnect attempt).
We have nvme_max_retries modparam, so we could simply fail it fast until
we hit this modparam, but I suspect it'll expire very fast.
>
>> + cmd->fabrics.fctype != nvme_fabrics_type_connect) {
>> + if (queue->ctrl->ctrl.state == NVME_CTRL_RECONNECTING)
>> + return -EIO;
>> + else
>> + return -EAGAIN;
>> + }
>
> So this looks somewhat bogus to me, while the rest looks ok.
The point here is that RECONNECTING is a ctrl state that has a
potential to linger for a long time (unlike RESETTING or DELETING),
so we don't want to trigger requeue right away.
I'm open to other ideas. I just want to prevent triggering a redundant
loop of queue_rq -> fail with BUSY -> queue_rq -> fail with BUSY ...
Thoughts?
^ permalink raw reply [flat|nested] 18+ messages in thread
* NVMeoF: multipath stuck after bringing one ethernet port down
2017-06-05 8:53 ` Sagi Grimberg
@ 2017-06-05 15:07 ` Christoph Hellwig
2017-06-05 17:23 ` Sagi Grimberg
0 siblings, 1 reply; 18+ messages in thread
From: Christoph Hellwig @ 2017-06-05 15:07 UTC (permalink / raw)
On Mon, Jun 05, 2017@11:53:58AM +0300, Sagi Grimberg wrote:
>> So this looks somewhat bogus to me, while the rest looks ok.
>
> The point here is that RECONNECTING is a ctrl state that has a
> potential to linger for a long time (unlike RESETTING or DELETING),
> so we don't want to trigger requeue right away.
>
> I'm open to other ideas. I just want to prevent triggering a redundant
> loop of queue_rq -> fail with BUSY -> queue_rq -> fail with BUSY ...
>
> Thoughts?
Let's get this patch in, then sort out a common strategy for the
dev_loss_tmo for all drivers, as FC is already doing some work in
that area.
^ permalink raw reply [flat|nested] 18+ messages in thread
* NVMeoF: multipath stuck after bringing one ethernet port down
2017-06-05 15:07 ` Christoph Hellwig
@ 2017-06-05 17:23 ` Sagi Grimberg
0 siblings, 0 replies; 18+ messages in thread
From: Sagi Grimberg @ 2017-06-05 17:23 UTC (permalink / raw)
>>> So this looks somewhat bogus to me, while the rest looks ok.
>>
>> The point here is that RECONNECTING is a ctrl state that has a
>> potential to linger for a long time (unlike RESETTING or DELETING),
>> so we don't want to trigger requeue right away.
>>
>> I'm open to other ideas. I just want to prevent triggering a redundant
>> loop of queue_rq -> fail with BUSY -> queue_rq -> fail with BUSY ...
>>
>> Thoughts?
>
> Let's get this patch in, then sort out a common stratefy for the
> dev_loss_tmo for all drivers, as FC is already doing some work in
> that area.
It's not dev_loss_tmo (timeout to give up on reconnect attempts),
but yeah, I agree. I'll send a patch.
^ permalink raw reply [flat|nested] 18+ messages in thread
end of thread, other threads:[~2017-06-05 17:23 UTC | newest]
Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-05-17 16:08 NVMeoF: multipath stuck after bringing one ethernet port down Alex Turin
2017-05-17 17:28 ` Sagi Grimberg
2017-05-18 5:52 ` shahar.salzman
2017-05-23 17:31 ` shahar.salzman
2017-05-25 11:06 ` shahar.salzman
2017-05-25 12:27 ` shahar.salzman
2017-05-25 13:08 ` shahar.salzman
2017-05-30 12:14 ` Sagi Grimberg
2017-05-30 12:11 ` Sagi Grimberg
2017-05-30 12:05 ` Sagi Grimberg
2017-05-30 13:37 ` Max Gurtovoy
2017-05-30 14:17 ` Sagi Grimberg
2017-06-05 7:11 ` shahar.salzman
2017-06-05 8:14 ` Sagi Grimberg
2017-06-05 8:40 ` Christoph Hellwig
2017-06-05 8:53 ` Sagi Grimberg
2017-06-05 15:07 ` Christoph Hellwig
2017-06-05 17:23 ` Sagi Grimberg