* NVMeoF: multipath stuck after bringing one ethernet port down
@ 2017-05-17 16:08 Alex Turin
  2017-05-17 17:28 ` Sagi Grimberg
  0 siblings, 1 reply; 18+ messages in thread
From: Alex Turin @ 2017-05-17 16:08 UTC (permalink / raw)


I am trying to test failure scenarios of NVMeoF + multipath. I bring
one of the ports down and expect to see failed paths using "multipath
-ll". Instead I see that "multipath -ll" gets stuck.

To reproduce:
1. Connect to the NVMeoF device through 2 ports.
2. Bind them with multipath.
3. Bring one port down (ifconfig eth3 down).
4. Run the "multipath -ll" command; it gets stuck.

From strace I see that multipath is stuck in io_destroy() while
releasing resources. As I understand it, io_destroy() is stuck because
io_cancel() failed, and io_cancel() failed because of the port that
was disabled in step 3.
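
For reference, the strace below matches the libaio pattern that
multipath's directio checker uses: submit one small read, wait with a
timeout, try to cancel on timeout, then destroy the context. Here is a
minimal, self-contained sketch of that pattern (not multipath's actual
code; the device path and sizes are only examples):

#define _GNU_SOURCE
#include <libaio.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

int main(void)
{
	io_context_t ctx;
	struct iocb cb, *cbs[1] = { &cb };
	struct io_event ev;
	struct timespec timeout = { 2, 0 };
	void *buf;
	int fd, ret;

	memset(&ctx, 0, sizeof(ctx));
	if (io_setup(1, &ctx) < 0)	/* one in-flight request, like the checker */
		return 1;

	fd = open("/dev/nvme2n1", O_RDONLY | O_DIRECT);	/* example device */
	if (fd < 0 || posix_memalign(&buf, 4096, 4096))
		return 1;

	io_prep_pread(&cb, fd, buf, 4096, 0);	/* 4K read at offset 0 */
	if (io_submit(ctx, 1, cbs) != 1)
		return 1;

	/* Wait up to 2 seconds, like the io_getevents(..., {2, 0}) calls below. */
	ret = io_getevents(ctx, 1, 1, &ev, &timeout);
	if (ret == 0) {
		/* Timed out: the path is considered down and a cancel is
		 * attempted; with the port down this fails (EINVAL below). */
		io_cancel(ctx, &cb, &ev);
	}

	/* io_destroy() waits for outstanding requests to finish; if the read
	 * never completes, this is where "multipath -ll" hangs. */
	io_destroy(ctx);
	return ret == 1 ? 0 : 1;
}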


environment:
Kernel: 4.11.1-1.el7.elrepo.x86_64
Network card: ConnectX-4 Lx
rdma-core: version 14 (commit 240c019)
nvme-cli: master - commit 4b5b4d2 (https://github.com/linux-nvme/nvme-cli.git)


/var/log/messages
May 17 11:15:00 localhost NetworkManager[1352]: <info> [1495034100.7162] device (eth3): state change: activated -> unavailable (reason 'carrier-changed') [100 20 40]
May 17 11:15:00 localhost dbus-daemon: dbus[1300]: [system] Activating via systemd: service name='org.freedesktop.nm_dispatcher' unit='dbus-org.freedesktop.nm-dispatcher.service'
May 17 11:15:00 localhost dbus[1300]: [system] Activating via systemd: service name='org.freedesktop.nm_dispatcher' unit='dbus-org.freedesktop.nm-dispatcher.service'
May 17 11:15:00 localhost systemd: Starting Network Manager Script Dispatcher Service...
May 17 11:15:00 localhost dbus[1300]: [system] Successfully activated service 'org.freedesktop.nm_dispatcher'
May 17 11:15:00 localhost dbus-daemon: dbus[1300]: [system] Successfully activated service 'org.freedesktop.nm_dispatcher'
May 17 11:15:00 localhost systemd: Started Network Manager Script Dispatcher Service.
May 17 11:15:00 localhost nm-dispatcher: req:1 'down' [eth3]: new request (3 scripts)
May 17 11:15:00 localhost nm-dispatcher: req:1 'down' [eth3]: start running ordered scripts...
May 17 11:15:09 localhost kernel: nvme nvme2: failed nvme_keep_alive_end_io error=16391
May 17 11:15:09 localhost kernel: nvme nvme1: failed nvme_keep_alive_end_io error=16391
May 17 11:15:09 localhost kernel: nvme nvme2: reconnecting in 10 seconds
May 17 11:15:09 localhost kernel: nvme nvme1: reconnecting in 10 seconds


# strace multipath -ll 2>&1 | grep  "io_\|open(\"/dev/nvme"
io_setup(1, {140204923695104})          = 0
io_submit(140204923695104, 1, {{pread, filedes:4, buf:0x1d14000, nbytes:4096, offset:0}}) = 1
io_getevents(140204923695104, 1, 1, {{(nil), 0x1d11ba8, 4096, 0}}, {2, 0}) = 1
open("/dev/nvme0n1", O_RDONLY)          = 5
io_setup(1, {140204923682816})          = 0
io_submit(140204923682816, 1, {{pread, filedes:5, buf:0x1d22000, nbytes:4096, offset:0}}) = 1
io_getevents(140204923682816, 1, 1, {{(nil), 0x1d21548, 4096, 0}}, {2, 0}) = 1
open("/dev/nvme1n1", O_RDONLY)          = 6
io_setup(1, {140204923670528})          = 0
io_submit(140204923670528, 1, {{pread, filedes:6, buf:0x1d25000, nbytes:4096, offset:0}}) = 1
io_getevents(140204923670528, 1, 1, {}{2, 0}) = 0
io_cancel(140204923670528, {(nil), 0, 0, 0, 6}, {...}) = -1 EINVAL (Invalid argument)
open("/dev/nvme2n1", O_RDONLY)          = 7
io_setup(1, {140204923355136})          = 0
io_submit(140204923355136, 1, {{pread, filedes:7, buf:0x1d29000, nbytes:4096, offset:0}}) = 1
io_getevents(140204923355136, 1, 1, {}{2, 0}) = 0
io_cancel(140204923355136, {(nil), 0, 0, 0, 7}, {...}) = -1 EINVAL (Invalid argument)
open("/dev/nvme3n1", O_RDONLY)          = 8
io_setup(1, {140204923342848})          = 0
io_submit(140204923342848, 1, {{pread, filedes:8, buf:0x1d2d000, nbytes:4096, offset:0}}) = 1
io_getevents(140204923342848, 1, 1, {{(nil), 0x1d2af28, 4096, 0}}, {2, 0}) = 1
io_destroy(140204923695104)             = 0
io_destroy(140204923682816)             = 0
io_destroy(140204923670528


* NVMeoF: multipath stuck after bringing one ethernet port down
  2017-05-17 16:08 NVMeoF: multipath stuck after bringing one ethernet port down Alex Turin
@ 2017-05-17 17:28 ` Sagi Grimberg
  2017-05-18  5:52   ` shahar.salzman
  0 siblings, 1 reply; 18+ messages in thread
From: Sagi Grimberg @ 2017-05-17 17:28 UTC (permalink / raw)


Hi Alex,

> I am trying to test failure scenarios of NVMeoF + multipath. I bring
> one of ports down and expect to see failed paths using "multipath
> -ll". Instead I see that "multipath -ll" get stuck.
>
> reproduce:
> 1. Connected to NVMeoF device through 2 ports.
> 2. Bind them with multipath.
> 3. Bring one port down (ifconfig eth3 down)
> 4. Execute "multipath -ll" command and it will get stuck.
> From strace I see that multipath is stuck in io_destroy() during
> release of resources. As I understand io_destroy is stuck because of
> io_cancel() that failed. And io_cancel() failed because of port that
> was disabled in step 3.

Hmm, it looks like we do take care of fast-failing pending IO, but once
we schedule periodic reconnects the request queues are already stopped,
and new incoming requests may block until we successfully reconnect.

I don't have too much time for it at the moment, but here is an untested
patch for you to try out:

--
[PATCH] nvme-rdma: restart queues at error recovery to fast fail
  incoming io

When we encounter a transport/controller error, error recovery
kicks in, which performs:
1. stop io/admin queues
2. move transport queues out of LIVE state
3. fast fail pending io
4. schedule periodic reconnects.

But we also need to fast fail incoming IO that enters after we
already scheduled. Given that our queue is not LIVE anymore, simply
restart the request queues to fail in .queue_rq

Signed-off-by: Sagi Grimberg <sagi at grimberg.me>
---
  drivers/nvme/host/rdma.c | 20 +++++++++++---------
  1 file changed, 11 insertions(+), 9 deletions(-)

diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index dd1c6deef82f..a0aa2bfb91ee 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -753,28 +753,26 @@ static void nvme_rdma_reconnect_ctrl_work(struct work_struct *work)
         if (ret)
                 goto requeue;

-       blk_mq_start_stopped_hw_queues(ctrl->ctrl.admin_q, true);
-
         ret = nvmf_connect_admin_queue(&ctrl->ctrl);
         if (ret)
-               goto stop_admin_q;
+               goto requeue;

         set_bit(NVME_RDMA_Q_LIVE, &ctrl->queues[0].flags);

         ret = nvme_enable_ctrl(&ctrl->ctrl, ctrl->cap);
         if (ret)
-               goto stop_admin_q;
+               goto requeue;

         nvme_start_keep_alive(&ctrl->ctrl);

         if (ctrl->queue_count > 1) {
                 ret = nvme_rdma_init_io_queues(ctrl);
                 if (ret)
-                       goto stop_admin_q;
+                       goto requeue;

                 ret = nvme_rdma_connect_io_queues(ctrl);
                 if (ret)
-                       goto stop_admin_q;
+                       goto requeue;
         }

         changed = nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_LIVE);
@@ -782,7 +780,6 @@ static void nvme_rdma_reconnect_ctrl_work(struct work_struct *work)
         ctrl->ctrl.opts->nr_reconnects = 0;

         if (ctrl->queue_count > 1) {
-               nvme_start_queues(&ctrl->ctrl);
                 nvme_queue_scan(&ctrl->ctrl);
                 nvme_queue_async_events(&ctrl->ctrl);
         }
@@ -791,8 +788,6 @@ static void nvme_rdma_reconnect_ctrl_work(struct work_struct *work)

         return;

-stop_admin_q:
-       blk_mq_stop_hw_queues(ctrl->ctrl.admin_q);
  requeue:
         dev_info(ctrl->ctrl.device, "Failed reconnect attempt %d\n",
                         ctrl->ctrl.opts->nr_reconnects);
@@ -823,6 +818,13 @@ static void nvme_rdma_error_recovery_work(struct work_struct *work)
         blk_mq_tagset_busy_iter(&ctrl->admin_tag_set,
                                 nvme_cancel_request, &ctrl->ctrl);

+       /*
+        * queues are not a live anymore, so restart the queues to fail fast
+        * new IO
+        */
+       blk_mq_start_stopped_hw_queues(ctrl->ctrl.admin_q, true);
+       nvme_start_queues(&ctrl->ctrl);
+
         nvme_rdma_reconnect_or_remove(ctrl);
  }

--


* NVMeoF: multipath stuck after bringing one ethernet port down
  2017-05-17 17:28 ` Sagi Grimberg
@ 2017-05-18  5:52   ` shahar.salzman
  2017-05-23 17:31     ` shahar.salzman
  0 siblings, 1 reply; 18+ messages in thread
From: shahar.salzman @ 2017-05-18  5:52 UTC (permalink / raw)


I have seen this too, but was internally investigating it as I am 
running 4.9.6 and not upstream. I will be able to check this patch on 
Sunday/Monday.

On my setup the IOs never complete even after the path is reconnected; 
the process remains stuck in D state, and the path is unusable until I 
power cycle the server.

Some more information, hope it helps:

Dumping the stuck process stack (multipath -ll):

[root at kblock01-knode02 ~]# cat /proc/9715/stack
[<ffffffff94354727>] blk_execute_rq+0x97/0x110
[<ffffffffc0a72f8a>] __nvme_submit_user_cmd+0xca/0x300 [nvme_core]
[<ffffffffc0a73369>] nvme_submit_user_cmd+0x29/0x30 [nvme_core]
[<ffffffffc0a734bd>] nvme_user_cmd+0x14d/0x180 [nvme_core]
[<ffffffffc0a7376c>] nvme_ioctl+0x7c/0xa0 [nvme_core]
[<ffffffff9435d548>] __blkdev_driver_ioctl+0x28/0x30
[<ffffffff9435dce1>] blkdev_ioctl+0x131/0x8f0
[<ffffffff9428993c>] block_ioctl+0x3c/0x40
[<ffffffff94260ae8>] vfs_ioctl+0x18/0x30
[<ffffffff94261201>] do_vfs_ioctl+0x161/0x600
[<ffffffff94261732>] SyS_ioctl+0x92/0xa0
[<ffffffff9479cc77>] entry_SYSCALL_64_fastpath+0x1a/0xa9
[<ffffffffffffffff>] 0xffffffffffffffff

Looking at the disassembly:

0000000000000170 <blk_execute_rq>:

...

  1fa:   e8 00 00 00 00          callq  1ff <blk_execute_rq+0x8f>
         /* Prevent hang_check timer from firing at us during very long I/O */
         hang_check = sysctl_hung_task_timeout_secs;
         if (hang_check)
                 while (!wait_for_completion_io_timeout(&wait, hang_check * (HZ/2)));
         else
                 wait_for_completion_io(&wait);
  1ff:   4c 89 e7                mov    %r12,%rdi
  202:   e8 00 00 00 00          callq  207 <blk_execute_rq+0x97>

         if (rq->errors)
  207:   83 bb 04 01 00 00 01    cmpl   $0x1,0x104(%rbx)

I ran btrace, and it seems that the IO running before the path 
disconnect is properly cleaned up, and only the IO running afterwards gets 
stuck (in the btrace example below I run both multipath -ll and a dd):

259,2   15    27548     4.790727567  8527  I  RS 764837376 + 8 [fio]
259,2   15    27549     4.790728103  8527  D  RS 764837376 + 8 [fio]
259,2   15    27550     4.791244566  8527  A   R 738007008 + 8 <- (253,0) 738007008
259,2   15    27551     4.791245136  8527  I  RS 738007008 + 8 [fio]
259,2   15    27552     4.791245803  8527  D  RS 738007008 + 8 [fio]
<<<<< From this stage there are no more fio Queue requests, only completions
259,2    8     9118     4.758116423  8062  C  RS 2294539440 + 8 [0]
259,2    8     9119     4.758559461  8062  C  RS 160307464 + 8 [0]
259,2    8     9120     4.759009990     0  C  RS 2330623512 + 8 [0]
259,2    8     9121     4.759457400     0  C  RS 2657710096 + 8 [0]
259,2    8     9122     4.759928113     0  C  RS 2988256176 + 8 [0]
259,2    8     9123     4.760388009     0  C  RS 789190936 + 8 [0]
259,2    8     9124     4.760835889     0  C  RS 2915035352 + 8 [0]
259,2    8     9125     4.761304282     0  C  RS 2458313624 + 8 [0]
259,2    8     9126     4.761779694     0  C  RS 2280411456 + 8 [0]
259,2    8     9127     4.762286741     0  C  RS 2082207376 + 8 [0]
259,2    8     9128     4.762783722     0  C  RS 2874601688 + 8 [0]
....
259,2    8     9173     4.785798485     0  C  RS 828737568 + 8 [0]
259,2    8     9174     4.786301391     0  C  RS 229168888 + 8 [0]
259,2    8     9175     4.786769105     0  C  RS 893604096 + 8 [0]
259,2    8     9176     4.787270594     0  C  RS 2308778728 + 8 [0]
259,2    8     9177     4.787797448     0  C  RS 2190029912 + 8 [0]
259,2    8     9178     4.788298956     0  C  RS 2652012944 + 8 [0]
259,2    8     9179     4.788803759     0  C  RS 2205242008 + 8 [0]
259,2    8     9180     4.789309423     0  C  RS 1948528416 + 8 [0]
259,2    8     9181     4.789831240     0  C  RS 2003421288 + 8 [0]
259,2    8     9182     4.790343528     0  C  RS 2464964728 + 8 [0]
259,2    8     9183     4.790854684     0  C  RS 764837376 + 8 [0]
<<<<<< This is probably where the port had been disconnected
259,2   21        7     9.413088410  8574  Q   R 0 + 8 [multipathd]
259,2   21        8     9.413089387  8574  G   R 0 + 8 [multipathd]
259,2   21        9     9.413095030  8574  U   N [multipathd] 1
259,2   21       10     9.413095367  8574  I  RS 0 + 8 [multipathd]
259,2   21       11     9.413096534  8574  D  RS 0 + 8 [multipathd]
259,2   10        1    14.381330190  2016  C  RS 738007008 + 8 [7]
259,2   21       12    14.381394890     0  R  RS 0 + 8 [7]
259,2    8     9184    80.500537429  8657  Q   R 0 + 8 [dd]
259,2    8     9185    80.500540122  8657  G   R 0 + 8 [dd]
259,2    8     9186    80.500545289  8657  U   N [dd] 1
259,2    8     9187    80.500545792  8657  I  RS 0 + 8 [dd]
<<<<<< At this stage, I reconnected the path:
259,2    4        1  8381.791090134 11127  Q   R 0 + 8 [dd]
259,2    4        2  8381.791093611 11127  G   R 0 + 8 [dd]
259,2    4        3  8381.791098288 11127  U   N [dd] 1
259,2    4        4  8381.791098791 11127  I  RS 0 + 8 [dd]


As I wrote above, I will check the patch on Sunday/Monday.

On 05/17/2017 08:28 PM, Sagi Grimberg wrote:
> Hi Alex,
>
>> I am trying to test failure scenarios of NVMeoF + multipath. I bring
>> one of ports down and expect to see failed paths using "multipath
>> -ll". Instead I see that "multipath -ll" get stuck.
>>
>> reproduce:
>> 1. Connected to NVMeoF device through 2 ports.
>> 2. Bind them with multipath.
>> 3. Bring one port down (ifconfig eth3 down)
>> 4. Execute "multipath -ll" command and it will get stuck.
>> From strace I see that multipath is stuck in io_destroy() during
>> release of resources. As I understand io_destroy is stuck because of
>> io_cancel() that failed. And io_cancel() failed because of port that
>> was disabled in step 3.
>
> Hmm, it looks like we do take care of failing fast pending IO, but once
> we schedule periodic reconnects the request queues are already stopped
> and new incoming requests may block until we successfully reconnect.
>
> I don't have too much time for it at the moment, but here is an untested
> patch for you to try out:
>
> -- 
> [PATCH] nvme-rdma: restart queues after we at error recovery to fast
>  fail incoming io
>
> When we encounter an transport/controller errors, error recovery
> kicks in which performs:
> 1. stops io/admin queues
> 2. moves transport queues out of LIVE state
> 3. fast fail pending io
> 4. schedule periodic reconnects.
>
> But we also need to fast fail incoming IO taht enters after we
> already scheduled. Given that our queue is not LIVE anymore, simply
> restart the request queues to fail in .queue_rq
>
> Signed-off-by: Sagi Grimberg <sagi at grimberg.me>
> ---
>  drivers/nvme/host/rdma.c | 20 +++++++++++---------
>  1 file changed, 11 insertions(+), 9 deletions(-)
>
> diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
> index dd1c6deef82f..a0aa2bfb91ee 100644
> --- a/drivers/nvme/host/rdma.c
> +++ b/drivers/nvme/host/rdma.c
> @@ -753,28 +753,26 @@ static void nvme_rdma_reconnect_ctrl_work(struct 
> work_struct *work)
>         if (ret)
>                 goto requeue;
>
> -       blk_mq_start_stopped_hw_queues(ctrl->ctrl.admin_q, true);
> -
>         ret = nvmf_connect_admin_queue(&ctrl->ctrl);
>         if (ret)
> -               goto stop_admin_q;
> +               goto requeue;
>
>         set_bit(NVME_RDMA_Q_LIVE, &ctrl->queues[0].flags);
>
>         ret = nvme_enable_ctrl(&ctrl->ctrl, ctrl->cap);
>         if (ret)
> -               goto stop_admin_q;
> +               goto requeue;
>
>         nvme_start_keep_alive(&ctrl->ctrl);
>
>         if (ctrl->queue_count > 1) {
>                 ret = nvme_rdma_init_io_queues(ctrl);
>                 if (ret)
> -                       goto stop_admin_q;
> +                       goto requeue;
>
>                 ret = nvme_rdma_connect_io_queues(ctrl);
>                 if (ret)
> -                       goto stop_admin_q;
> +                       goto requeue;
>         }
>
>         changed = nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_LIVE);
> @@ -782,7 +780,6 @@ static void nvme_rdma_reconnect_ctrl_work(struct 
> work_struct *work)
>         ctrl->ctrl.opts->nr_reconnects = 0;
>
>         if (ctrl->queue_count > 1) {
> -               nvme_start_queues(&ctrl->ctrl);
>                 nvme_queue_scan(&ctrl->ctrl);
>                 nvme_queue_async_events(&ctrl->ctrl);
>         }
> @@ -791,8 +788,6 @@ static void nvme_rdma_reconnect_ctrl_work(struct 
> work_struct *work)
>
>         return;
>
> -stop_admin_q:
> -       blk_mq_stop_hw_queues(ctrl->ctrl.admin_q);
>  requeue:
>         dev_info(ctrl->ctrl.device, "Failed reconnect attempt %d\n",
>                         ctrl->ctrl.opts->nr_reconnects);
> @@ -823,6 +818,13 @@ static void nvme_rdma_error_recovery_work(struct 
> work_struct *work)
>         blk_mq_tagset_busy_iter(&ctrl->admin_tag_set,
>                                 nvme_cancel_request, &ctrl->ctrl);
>
> +       /*
> +        * queues are not a live anymore, so restart the queues to 
> fail fast
> +        * new IO
> +        */
> +       blk_mq_start_stopped_hw_queues(ctrl->ctrl.admin_q, true);
> +       nvme_start_queues(&ctrl->ctrl);
> +
>         nvme_rdma_reconnect_or_remove(ctrl);
>  }
>
> -- 
>
> _______________________________________________
> Linux-nvme mailing list
> Linux-nvme at lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-nvme


* NVMeoF: multipath stuck after bringing one ethernet port down
  2017-05-18  5:52   ` shahar.salzman
@ 2017-05-23 17:31     ` shahar.salzman
  2017-05-25 11:06       ` shahar.salzman
  2017-05-30 12:05       ` Sagi Grimberg
  0 siblings, 2 replies; 18+ messages in thread
From: shahar.salzman @ 2017-05-23 17:31 UTC (permalink / raw)


I used the patch with my Mellanox OFED (mlnx-nvme-rdma); it partially 
helps, but the overall behaviour is still not good...

If I do a dd (direct_IO) or multipath -ll during the path disconnect, I 
get indefinite retries with nearly zero time between retries until the 
path is reconnected; then dd completes successfully, the paths come 
back to life, and OK status is returned to the process, which completes 
successfully. I would expect IO to fail after a certain timeout, and 
the retries to be paced somewhat.

If I do an nvme list (blk_execute_rq), then the process stays in D state 
(I still see the retries), the block layer (at least for this 
device) seems to be stuck, and I have to power cycle the server to get it to work...

Just a reminder, I am using Mellanox OFED 4.0 and kernel 4.9.6; the 
plan here is to upgrade to the latest stable kernel, but I'm not sure 
when we'll get to it.


On 05/18/2017 08:52 AM, shahar.salzman wrote:
> I have seen this too, but was internally investigating it as I am 
> running 4.9.6 and not upstream. I will be able to check this patch on 
> Sunday/Monday.
>
> On my setup the IOs never complete even after the path is reconnected, 
> and the process remains stuck in D state, and the path unusable until 
> I power cycle the server.
>
> Some more information, hope it helps:
>
> Dumping the stuck process stack (multipath -ll):
>
> [root at kblock01-knode02 ~]# cat /proc/9715/stack
> [<ffffffff94354727>] blk_execute_rq+0x97/0x110
> [<ffffffffc0a72f8a>] __nvme_submit_user_cmd+0xca/0x300 [nvme_core]
> [<ffffffffc0a73369>] nvme_submit_user_cmd+0x29/0x30 [nvme_core]
> [<ffffffffc0a734bd>] nvme_user_cmd+0x14d/0x180 [nvme_core]
> [<ffffffffc0a7376c>] nvme_ioctl+0x7c/0xa0 [nvme_core]
> [<ffffffff9435d548>] __blkdev_driver_ioctl+0x28/0x30
> [<ffffffff9435dce1>] blkdev_ioctl+0x131/0x8f0
> [<ffffffff9428993c>] block_ioctl+0x3c/0x40
> [<ffffffff94260ae8>] vfs_ioctl+0x18/0x30
> [<ffffffff94261201>] do_vfs_ioctl+0x161/0x600
> [<ffffffff94261732>] SyS_ioctl+0x92/0xa0
> [<ffffffff9479cc77>] entry_SYSCALL_64_fastpath+0x1a/0xa9
> [<ffffffffffffffff>] 0xffffffffffffffff
>
> Looking at the dissasembly:
>
> 0000000000000170 <blk_execute_rq>:
>
> ...
>
>  1fa:   e8 00 00 00 00          callq  1ff <blk_execute_rq+0x8f>
>         /* Prevent hang_check timer from firing at us during very long 
> I/O */
>         hang_check = sysctl_hung_task_timeout_secs;
>         if (hang_check)
>                 while (!wait_for_completion_io_timeout(&wait, 
> hang_check * (HZ/2)));
>         else
>                 wait_for_completion_io(&wait);
>  1ff:   4c 89 e7                mov    %r12,%rdi
>  202:   e8 00 00 00 00          callq  207 <blk_execute_rq+0x97>
>
>         if (rq->errors)
>  207:   83 bb 04 01 00 00 01    cmpl   $0x1,0x104(%rbx)
>
> I ran btrace, and it seems that the IO running before the path 
> disconnect is properly cleaned, and only the IO running after gets 
> stuck (in the btrace example bellow I run both multipath -ll, and a dd):
>
> 259,2   15    27548     4.790727567  8527  I  RS 764837376 + 8 [fio]
> 259,2   15    27549     4.790728103  8527  D  RS 764837376 + 8 [fio]
> 259,2   15    27550     4.791244566  8527  A   R 738007008 + 8 <- 
> (253,0) 738007008
> 259,2   15    27551     4.791245136  8527  I  RS 738007008 + 8 [fio]
> 259,2   15    27552     4.791245803  8527  D  RS 738007008 + 8 [fio]
> <<<<< From this stage there are no more fio Queue requests, only 
> completions
> 259,2    8     9118     4.758116423  8062  C  RS 2294539440 + 8 [0]
> 259,2    8     9119     4.758559461  8062  C  RS 160307464 + 8 [0]
> 259,2    8     9120     4.759009990     0  C  RS 2330623512 + 8 [0]
> 259,2    8     9121     4.759457400     0  C  RS 2657710096 + 8 [0]
> 259,2    8     9122     4.759928113     0  C  RS 2988256176 + 8 [0]
> 259,2    8     9123     4.760388009     0  C  RS 789190936 + 8 [0]
> 259,2    8     9124     4.760835889     0  C  RS 2915035352 + 8 [0]
> 259,2    8     9125     4.761304282     0  C  RS 2458313624 + 8 [0]
> 259,2    8     9126     4.761779694     0  C  RS 2280411456 + 8 [0]
> 259,2    8     9127     4.762286741     0  C  RS 2082207376 + 8 [0]
> 259,2    8     9128     4.762783722     0  C  RS 2874601688 + 8 [0]
> ....
> 259,2    8     9173     4.785798485     0  C  RS 828737568 + 8 [0]
> 259,2    8     9174     4.786301391     0  C  RS 229168888 + 8 [0]
> 259,2    8     9175     4.786769105     0  C  RS 893604096 + 8 [0]
> 259,2    8     9176     4.787270594     0  C  RS 2308778728 + 8 [0]
> 259,2    8     9177     4.787797448     0  C  RS 2190029912 + 8 [0]
> 259,2    8     9178     4.788298956     0  C  RS 2652012944 + 8 [0]
> 259,2    8     9179     4.788803759     0  C  RS 2205242008 + 8 [0]
> 259,2    8     9180     4.789309423     0  C  RS 1948528416 + 8 [0]
> 259,2    8     9181     4.789831240     0  C  RS 2003421288 + 8 [0]
> 259,2    8     9182     4.790343528     0  C  RS 2464964728 + 8 [0]
> 259,2    8     9183     4.790854684     0  C  RS 764837376 + 8 [0]
> <<<<<< This is probably where the port had been disconnected
> 259,2   21        7     9.413088410  8574  Q   R 0 + 8 [multipathd]
> 259,2   21        8     9.413089387  8574  G   R 0 + 8 [multipathd]
> 259,2   21        9     9.413095030  8574  U   N [multipathd] 1
> 259,2   21       10     9.413095367  8574  I  RS 0 + 8 [multipathd]
> 259,2   21       11     9.413096534  8574  D  RS 0 + 8 [multipathd]
> 259,2   10        1    14.381330190  2016  C  RS 738007008 + 8 [7]
> 259,2   21       12    14.381394890     0  R  RS 0 + 8 [7]
> 259,2    8     9184    80.500537429  8657  Q   R 0 + 8 [dd]
> 259,2    8     9185    80.500540122  8657  G   R 0 + 8 [dd]
> 259,2    8     9186    80.500545289  8657  U   N [dd] 1
> 259,2    8     9187    80.500545792  8657  I  RS 0 + 8 [dd]
> <<<<<< At this stage, I reconnected the path:
> 259,2    4        1  8381.791090134 11127  Q   R 0 + 8 [dd]
> 259,2    4        2  8381.791093611 11127  G   R 0 + 8 [dd]
> 259,2    4        3  8381.791098288 11127  U   N [dd] 1
> 259,2    4        4  8381.791098791 11127  I  RS 0 + 8 [dd]
>
>
> As I wrote above, I will check the patch on Sunday/Monday.
>
> On 05/17/2017 08:28 PM, Sagi Grimberg wrote:
>> Hi Alex,
>>
>>> I am trying to test failure scenarios of NVMeoF + multipath. I bring
>>> one of ports down and expect to see failed paths using "multipath
>>> -ll". Instead I see that "multipath -ll" get stuck.
>>>
>>> reproduce:
>>> 1. Connected to NVMeoF device through 2 ports.
>>> 2. Bind them with multipath.
>>> 3. Bring one port down (ifconfig eth3 down)
>>> 4. Execute "multipath -ll" command and it will get stuck.
>>> From strace I see that multipath is stuck in io_destroy() during
>>> release of resources. As I understand io_destroy is stuck because of
>>> io_cancel() that failed. And io_cancel() failed because of port that
>>> was disabled in step 3.
>>
>> Hmm, it looks like we do take care of failing fast pending IO, but once
>> we schedule periodic reconnects the request queues are already stopped
>> and new incoming requests may block until we successfully reconnect.
>>
>> I don't have too much time for it at the moment, but here is an untested
>> patch for you to try out:
>>
>> -- 
>> [PATCH] nvme-rdma: restart queues after we at error recovery to fast
>>  fail incoming io
>>
>> When we encounter an transport/controller errors, error recovery
>> kicks in which performs:
>> 1. stops io/admin queues
>> 2. moves transport queues out of LIVE state
>> 3. fast fail pending io
>> 4. schedule periodic reconnects.
>>
>> But we also need to fast fail incoming IO taht enters after we
>> already scheduled. Given that our queue is not LIVE anymore, simply
>> restart the request queues to fail in .queue_rq
>>
>> Signed-off-by: Sagi Grimberg <sagi at grimberg.me>
>> ---
>>  drivers/nvme/host/rdma.c | 20 +++++++++++---------
>>  1 file changed, 11 insertions(+), 9 deletions(-)
>>
>> diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
>> index dd1c6deef82f..a0aa2bfb91ee 100644
>> --- a/drivers/nvme/host/rdma.c
>> +++ b/drivers/nvme/host/rdma.c
>> @@ -753,28 +753,26 @@ static void 
>> nvme_rdma_reconnect_ctrl_work(struct work_struct *work)
>>         if (ret)
>>                 goto requeue;
>>
>> -       blk_mq_start_stopped_hw_queues(ctrl->ctrl.admin_q, true);
>> -
>>         ret = nvmf_connect_admin_queue(&ctrl->ctrl);
>>         if (ret)
>> -               goto stop_admin_q;
>> +               goto requeue;
>>
>>         set_bit(NVME_RDMA_Q_LIVE, &ctrl->queues[0].flags);
>>
>>         ret = nvme_enable_ctrl(&ctrl->ctrl, ctrl->cap);
>>         if (ret)
>> -               goto stop_admin_q;
>> +               goto requeue;
>>
>>         nvme_start_keep_alive(&ctrl->ctrl);
>>
>>         if (ctrl->queue_count > 1) {
>>                 ret = nvme_rdma_init_io_queues(ctrl);
>>                 if (ret)
>> -                       goto stop_admin_q;
>> +                       goto requeue;
>>
>>                 ret = nvme_rdma_connect_io_queues(ctrl);
>>                 if (ret)
>> -                       goto stop_admin_q;
>> +                       goto requeue;
>>         }
>>
>>         changed = nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_LIVE);
>> @@ -782,7 +780,6 @@ static void nvme_rdma_reconnect_ctrl_work(struct 
>> work_struct *work)
>>         ctrl->ctrl.opts->nr_reconnects = 0;
>>
>>         if (ctrl->queue_count > 1) {
>> -               nvme_start_queues(&ctrl->ctrl);
>>                 nvme_queue_scan(&ctrl->ctrl);
>>                 nvme_queue_async_events(&ctrl->ctrl);
>>         }
>> @@ -791,8 +788,6 @@ static void nvme_rdma_reconnect_ctrl_work(struct 
>> work_struct *work)
>>
>>         return;
>>
>> -stop_admin_q:
>> -       blk_mq_stop_hw_queues(ctrl->ctrl.admin_q);
>>  requeue:
>>         dev_info(ctrl->ctrl.device, "Failed reconnect attempt %d\n",
>>                         ctrl->ctrl.opts->nr_reconnects);
>> @@ -823,6 +818,13 @@ static void nvme_rdma_error_recovery_work(struct 
>> work_struct *work)
>>         blk_mq_tagset_busy_iter(&ctrl->admin_tag_set,
>>                                 nvme_cancel_request, &ctrl->ctrl);
>>
>> +       /*
>> +        * queues are not a live anymore, so restart the queues to 
>> fail fast
>> +        * new IO
>> +        */
>> +       blk_mq_start_stopped_hw_queues(ctrl->ctrl.admin_q, true);
>> +       nvme_start_queues(&ctrl->ctrl);
>> +
>>         nvme_rdma_reconnect_or_remove(ctrl);
>>  }
>>
>> -- 
>>
>> _______________________________________________
>> Linux-nvme mailing list
>> Linux-nvme at lists.infradead.org
>> http://lists.infradead.org/mailman/listinfo/linux-nvme
>


* NVMeoF: multipath stuck after bringing one ethernet port down
  2017-05-23 17:31     ` shahar.salzman
@ 2017-05-25 11:06       ` shahar.salzman
  2017-05-25 12:27         ` shahar.salzman
  2017-05-30 12:11         ` Sagi Grimberg
  2017-05-30 12:05       ` Sagi Grimberg
  1 sibling, 2 replies; 18+ messages in thread
From: shahar.salzman @ 2017-05-25 11:06 UTC (permalink / raw)


OK, so the indefinite retries are due to 
nvme/host/rdma.c:nvme_rdma_queue_rq returning BLK_MQ_RQ_QUEUE_BUSY when 
the queue is not ready. Wouldn't it be better to return 
BLK_MQ_RQ_QUEUE_ERROR so that the application can handle it? Or 
alternatively return an error after several retries (per queue or something).
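
To illustrate the "error after several retries" idea, an untested,
kernel-style fragment could look something like this (the busy_retries
counter and the NVME_RDMA_MAX_BUSY_RETRIES limit are invented for the
sketch; they do not exist in the driver):

/*
 * Illustrative fragment only, against the 4.9-era nvme-rdma driver:
 * bound the number of BUSY requeues while a queue is not LIVE instead
 * of retrying forever.
 */
#define NVME_RDMA_MAX_BUSY_RETRIES	64

static int nvme_rdma_check_queue_ready(struct nvme_rdma_queue *queue,
		struct request *rq)
{
	if (nvme_rdma_queue_is_ready(queue, rq)) {
		/* Queue came back to life: reset the retry budget. */
		atomic_set(&queue->busy_retries, 0);
		return BLK_MQ_RQ_QUEUE_OK;
	}

	/* Still reconnecting: let blk-mq requeue a bounded number of
	 * times, then fail the request so the caller sees an error. */
	if (atomic_inc_return(&queue->busy_retries) <= NVME_RDMA_MAX_BUSY_RETRIES)
		return BLK_MQ_RQ_QUEUE_BUSY;

	return BLK_MQ_RQ_QUEUE_ERROR;
}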

I tested a patch (on top of Sagi's) converting the return value to 
QUEUE_ERROR, and both dd and multipath -ll return instantly.


On 05/23/2017 08:31 PM, shahar.salzman wrote:
> I used the patch with my Mellanox OFED (mlnx-nvme-rdma), it patrially 
> helps but the overall behaviour is still not good...
>
> If I do a dd (direct_IO) or multipath -ll during the path disconnect, 
> I get indefinite retries with nearly zero time between retries, until 
> the path is reconnected, and then dd is completed successfully, paths 
> come back to life, returning OK status to the process which completes 
> successfully.I would expect IO to fail after a certain timeout, and 
> that the retries should be paced somewhat
>
> If I do an nvme list (blk_execute_rq), then the process stays in D 
> state (I still see the retries), the block layer (at leasy for this 
> device) seems to be stuck, and I have to power cycle the server to get 
> it to work...
>
> Just a reminder, I am using Mellanox OFED 4.0, and kernel 4.9.6, the 
> plan here is to upgrade to the latest stable kernel but not sure when 
> we'll get to it.
>
>
> On 05/18/2017 08:52 AM, shahar.salzman wrote:
>> I have seen this too, but was internally investigating it as I am 
>> running 4.9.6 and not upstream. I will be able to check this patch on 
>> Sunday/Monday.
>>
>> On my setup the IOs never complete even after the path is 
>> reconnected, and the process remains stuck in D state, and the path 
>> unusable until I power cycle the server.
>>
>> Some more information, hope it helps:
>>
>> Dumping the stuck process stack (multipath -ll):
>>
>> [root at kblock01-knode02 ~]# cat /proc/9715/stack
>> [<ffffffff94354727>] blk_execute_rq+0x97/0x110
>> [<ffffffffc0a72f8a>] __nvme_submit_user_cmd+0xca/0x300 [nvme_core]
>> [<ffffffffc0a73369>] nvme_submit_user_cmd+0x29/0x30 [nvme_core]
>> [<ffffffffc0a734bd>] nvme_user_cmd+0x14d/0x180 [nvme_core]
>> [<ffffffffc0a7376c>] nvme_ioctl+0x7c/0xa0 [nvme_core]
>> [<ffffffff9435d548>] __blkdev_driver_ioctl+0x28/0x30
>> [<ffffffff9435dce1>] blkdev_ioctl+0x131/0x8f0
>> [<ffffffff9428993c>] block_ioctl+0x3c/0x40
>> [<ffffffff94260ae8>] vfs_ioctl+0x18/0x30
>> [<ffffffff94261201>] do_vfs_ioctl+0x161/0x600
>> [<ffffffff94261732>] SyS_ioctl+0x92/0xa0
>> [<ffffffff9479cc77>] entry_SYSCALL_64_fastpath+0x1a/0xa9
>> [<ffffffffffffffff>] 0xffffffffffffffff
>>
>> Looking at the dissasembly:
>>
>> 0000000000000170 <blk_execute_rq>:
>>
>> ...
>>
>>  1fa:   e8 00 00 00 00          callq  1ff <blk_execute_rq+0x8f>
>>         /* Prevent hang_check timer from firing at us during very 
>> long I/O */
>>         hang_check = sysctl_hung_task_timeout_secs;
>>         if (hang_check)
>>                 while (!wait_for_completion_io_timeout(&wait, 
>> hang_check * (HZ/2)));
>>         else
>>                 wait_for_completion_io(&wait);
>>  1ff:   4c 89 e7                mov    %r12,%rdi
>>  202:   e8 00 00 00 00          callq  207 <blk_execute_rq+0x97>
>>
>>         if (rq->errors)
>>  207:   83 bb 04 01 00 00 01    cmpl   $0x1,0x104(%rbx)
>>
>> I ran btrace, and it seems that the IO running before the path 
>> disconnect is properly cleaned, and only the IO running after gets 
>> stuck (in the btrace example bellow I run both multipath -ll, and a dd):
>>
>> 259,2   15    27548     4.790727567  8527  I  RS 764837376 + 8 [fio]
>> 259,2   15    27549     4.790728103  8527  D  RS 764837376 + 8 [fio]
>> 259,2   15    27550     4.791244566  8527  A   R 738007008 + 8 <- 
>> (253,0) 738007008
>> 259,2   15    27551     4.791245136  8527  I  RS 738007008 + 8 [fio]
>> 259,2   15    27552     4.791245803  8527  D  RS 738007008 + 8 [fio]
>> <<<<< From this stage there are no more fio Queue requests, only 
>> completions
>> 259,2    8     9118     4.758116423  8062  C  RS 2294539440 + 8 [0]
>> 259,2    8     9119     4.758559461  8062  C  RS 160307464 + 8 [0]
>> 259,2    8     9120     4.759009990     0  C  RS 2330623512 + 8 [0]
>> 259,2    8     9121     4.759457400     0  C  RS 2657710096 + 8 [0]
>> 259,2    8     9122     4.759928113     0  C  RS 2988256176 + 8 [0]
>> 259,2    8     9123     4.760388009     0  C  RS 789190936 + 8 [0]
>> 259,2    8     9124     4.760835889     0  C  RS 2915035352 + 8 [0]
>> 259,2    8     9125     4.761304282     0  C  RS 2458313624 + 8 [0]
>> 259,2    8     9126     4.761779694     0  C  RS 2280411456 + 8 [0]
>> 259,2    8     9127     4.762286741     0  C  RS 2082207376 + 8 [0]
>> 259,2    8     9128     4.762783722     0  C  RS 2874601688 + 8 [0]
>> ....
>> 259,2    8     9173     4.785798485     0  C  RS 828737568 + 8 [0]
>> 259,2    8     9174     4.786301391     0  C  RS 229168888 + 8 [0]
>> 259,2    8     9175     4.786769105     0  C  RS 893604096 + 8 [0]
>> 259,2    8     9176     4.787270594     0  C  RS 2308778728 + 8 [0]
>> 259,2    8     9177     4.787797448     0  C  RS 2190029912 + 8 [0]
>> 259,2    8     9178     4.788298956     0  C  RS 2652012944 + 8 [0]
>> 259,2    8     9179     4.788803759     0  C  RS 2205242008 + 8 [0]
>> 259,2    8     9180     4.789309423     0  C  RS 1948528416 + 8 [0]
>> 259,2    8     9181     4.789831240     0  C  RS 2003421288 + 8 [0]
>> 259,2    8     9182     4.790343528     0  C  RS 2464964728 + 8 [0]
>> 259,2    8     9183     4.790854684     0  C  RS 764837376 + 8 [0]
>> <<<<<< This is probably where the port had been disconnected
>> 259,2   21        7     9.413088410  8574  Q   R 0 + 8 [multipathd]
>> 259,2   21        8     9.413089387  8574  G   R 0 + 8 [multipathd]
>> 259,2   21        9     9.413095030  8574  U   N [multipathd] 1
>> 259,2   21       10     9.413095367  8574  I  RS 0 + 8 [multipathd]
>> 259,2   21       11     9.413096534  8574  D  RS 0 + 8 [multipathd]
>> 259,2   10        1    14.381330190  2016  C  RS 738007008 + 8 [7]
>> 259,2   21       12    14.381394890     0  R  RS 0 + 8 [7]
>> 259,2    8     9184    80.500537429  8657  Q   R 0 + 8 [dd]
>> 259,2    8     9185    80.500540122  8657  G   R 0 + 8 [dd]
>> 259,2    8     9186    80.500545289  8657  U   N [dd] 1
>> 259,2    8     9187    80.500545792  8657  I  RS 0 + 8 [dd]
>> <<<<<< At this stage, I reconnected the path:
>> 259,2    4        1  8381.791090134 11127  Q   R 0 + 8 [dd]
>> 259,2    4        2  8381.791093611 11127  G   R 0 + 8 [dd]
>> 259,2    4        3  8381.791098288 11127  U   N [dd] 1
>> 259,2    4        4  8381.791098791 11127  I  RS 0 + 8 [dd]
>>
>>
>> As I wrote above, I will check the patch on Sunday/Monday.
>>
>> On 05/17/2017 08:28 PM, Sagi Grimberg wrote:
>>> Hi Alex,
>>>
>>>> I am trying to test failure scenarios of NVMeoF + multipath. I bring
>>>> one of ports down and expect to see failed paths using "multipath
>>>> -ll". Instead I see that "multipath -ll" get stuck.
>>>>
>>>> reproduce:
>>>> 1. Connected to NVMeoF device through 2 ports.
>>>> 2. Bind them with multipath.
>>>> 3. Bring one port down (ifconfig eth3 down)
>>>> 4. Execute "multipath -ll" command and it will get stuck.
>>>> From strace I see that multipath is stuck in io_destroy() during
>>>> release of resources. As I understand io_destroy is stuck because of
>>>> io_cancel() that failed. And io_cancel() failed because of port that
>>>> was disabled in step 3.
>>>
>>> Hmm, it looks like we do take care of failing fast pending IO, but once
>>> we schedule periodic reconnects the request queues are already stopped
>>> and new incoming requests may block until we successfully reconnect.
>>>
>>> I don't have too much time for it at the moment, but here is an 
>>> untested
>>> patch for you to try out:
>>>
>>> -- 
>>> [PATCH] nvme-rdma: restart queues after we at error recovery to fast
>>>  fail incoming io
>>>
>>> When we encounter an transport/controller errors, error recovery
>>> kicks in which performs:
>>> 1. stops io/admin queues
>>> 2. moves transport queues out of LIVE state
>>> 3. fast fail pending io
>>> 4. schedule periodic reconnects.
>>>
>>> But we also need to fast fail incoming IO taht enters after we
>>> already scheduled. Given that our queue is not LIVE anymore, simply
>>> restart the request queues to fail in .queue_rq
>>>
>>> Signed-off-by: Sagi Grimberg <sagi at grimberg.me>
>>> ---
>>>  drivers/nvme/host/rdma.c | 20 +++++++++++---------
>>>  1 file changed, 11 insertions(+), 9 deletions(-)
>>>
>>> diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
>>> index dd1c6deef82f..a0aa2bfb91ee 100644
>>> --- a/drivers/nvme/host/rdma.c
>>> +++ b/drivers/nvme/host/rdma.c
>>> @@ -753,28 +753,26 @@ static void 
>>> nvme_rdma_reconnect_ctrl_work(struct work_struct *work)
>>>         if (ret)
>>>                 goto requeue;
>>>
>>> -       blk_mq_start_stopped_hw_queues(ctrl->ctrl.admin_q, true);
>>> -
>>>         ret = nvmf_connect_admin_queue(&ctrl->ctrl);
>>>         if (ret)
>>> -               goto stop_admin_q;
>>> +               goto requeue;
>>>
>>>         set_bit(NVME_RDMA_Q_LIVE, &ctrl->queues[0].flags);
>>>
>>>         ret = nvme_enable_ctrl(&ctrl->ctrl, ctrl->cap);
>>>         if (ret)
>>> -               goto stop_admin_q;
>>> +               goto requeue;
>>>
>>>         nvme_start_keep_alive(&ctrl->ctrl);
>>>
>>>         if (ctrl->queue_count > 1) {
>>>                 ret = nvme_rdma_init_io_queues(ctrl);
>>>                 if (ret)
>>> -                       goto stop_admin_q;
>>> +                       goto requeue;
>>>
>>>                 ret = nvme_rdma_connect_io_queues(ctrl);
>>>                 if (ret)
>>> -                       goto stop_admin_q;
>>> +                       goto requeue;
>>>         }
>>>
>>>         changed = nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_LIVE);
>>> @@ -782,7 +780,6 @@ static void nvme_rdma_reconnect_ctrl_work(struct 
>>> work_struct *work)
>>>         ctrl->ctrl.opts->nr_reconnects = 0;
>>>
>>>         if (ctrl->queue_count > 1) {
>>> -               nvme_start_queues(&ctrl->ctrl);
>>>                 nvme_queue_scan(&ctrl->ctrl);
>>>                 nvme_queue_async_events(&ctrl->ctrl);
>>>         }
>>> @@ -791,8 +788,6 @@ static void nvme_rdma_reconnect_ctrl_work(struct 
>>> work_struct *work)
>>>
>>>         return;
>>>
>>> -stop_admin_q:
>>> -       blk_mq_stop_hw_queues(ctrl->ctrl.admin_q);
>>>  requeue:
>>>         dev_info(ctrl->ctrl.device, "Failed reconnect attempt %d\n",
>>>                         ctrl->ctrl.opts->nr_reconnects);
>>> @@ -823,6 +818,13 @@ static void 
>>> nvme_rdma_error_recovery_work(struct work_struct *work)
>>>         blk_mq_tagset_busy_iter(&ctrl->admin_tag_set,
>>>                                 nvme_cancel_request, &ctrl->ctrl);
>>>
>>> +       /*
>>> +        * queues are not a live anymore, so restart the queues to 
>>> fail fast
>>> +        * new IO
>>> +        */
>>> +       blk_mq_start_stopped_hw_queues(ctrl->ctrl.admin_q, true);
>>> +       nvme_start_queues(&ctrl->ctrl);
>>> +
>>>         nvme_rdma_reconnect_or_remove(ctrl);
>>>  }
>>>
>>> -- 
>>>
>>> _______________________________________________
>>> Linux-nvme mailing list
>>> Linux-nvme at lists.infradead.org
>>> http://lists.infradead.org/mailman/listinfo/linux-nvme
>>
>


* NVMeoF: multipath stuck after bringing one ethernet port down
  2017-05-25 11:06       ` shahar.salzman
@ 2017-05-25 12:27         ` shahar.salzman
  2017-05-25 13:08           ` shahar.salzman
  2017-05-30 12:14           ` Sagi Grimberg
  2017-05-30 12:11         ` Sagi Grimberg
  1 sibling, 2 replies; 18+ messages in thread
From: shahar.salzman @ 2017-05-25 12:27 UTC (permalink / raw)


Returning QUEUE_ERROR also solved the "nvme list" issue reported, 
but the paths were not coming back after I reconnected the ports.

I played with the keep alive thread so that it didn't exit if it got an 
error, and that was resolved as well.


On 05/25/2017 02:06 PM, shahar.salzman wrote:
> OK, so the indefinite retries are due to the 
> nvme/host/rdma.c:nvme_rdma_queue_rq returning BLK_MQ_RQ_QUEUE_BUSY 
> when the queue is not ready, wouldn't it be better to return 
> BLK_MQ_RQ_QUEUE_ERROR so that the application may handle it? Or 
> alternatively return an error after several retries (per Q or something).
>
> I tested a patch (on top of Sagi's) converting the return value to 
> QUEUE_ERROR, and both dd, and multipath -ll return instantly.
>
>
> On 05/23/2017 08:31 PM, shahar.salzman wrote:
>> I used the patch with my Mellanox OFED (mlnx-nvme-rdma), it patrially 
>> helps but the overall behaviour is still not good...
>>
>> If I do a dd (direct_IO) or multipath -ll during the path disconnect, 
>> I get indefinite retries with nearly zero time between retries, until 
>> the path is reconnected, and then dd is completed successfully, paths 
>> come back to life, returning OK status to the process which completes 
>> successfully.I would expect IO to fail after a certain timeout, and 
>> that the retries should be paced somewhat
>>
>> If I do an nvme list (blk_execute_rq), then the process stays in D 
>> state (I still see the retries), the block layer (at leasy for this 
>> device) seems to be stuck, and I have to power cycle the server to 
>> get it to work...
>>
>> Just a reminder, I am using Mellanox OFED 4.0, and kernel 4.9.6, the 
>> plan here is to upgrade to the latest stable kernel but not sure when 
>> we'll get to it.
>>
>>
>> On 05/18/2017 08:52 AM, shahar.salzman wrote:
>>> I have seen this too, but was internally investigating it as I am 
>>> running 4.9.6 and not upstream. I will be able to check this patch 
>>> on Sunday/Monday.
>>>
>>> On my setup the IOs never complete even after the path is 
>>> reconnected, and the process remains stuck in D state, and the path 
>>> unusable until I power cycle the server.
>>>
>>> Some more information, hope it helps:
>>>
>>> Dumping the stuck process stack (multipath -ll):
>>>
>>> [root at kblock01-knode02 ~]# cat /proc/9715/stack
>>> [<ffffffff94354727>] blk_execute_rq+0x97/0x110
>>> [<ffffffffc0a72f8a>] __nvme_submit_user_cmd+0xca/0x300 [nvme_core]
>>> [<ffffffffc0a73369>] nvme_submit_user_cmd+0x29/0x30 [nvme_core]
>>> [<ffffffffc0a734bd>] nvme_user_cmd+0x14d/0x180 [nvme_core]
>>> [<ffffffffc0a7376c>] nvme_ioctl+0x7c/0xa0 [nvme_core]
>>> [<ffffffff9435d548>] __blkdev_driver_ioctl+0x28/0x30
>>> [<ffffffff9435dce1>] blkdev_ioctl+0x131/0x8f0
>>> [<ffffffff9428993c>] block_ioctl+0x3c/0x40
>>> [<ffffffff94260ae8>] vfs_ioctl+0x18/0x30
>>> [<ffffffff94261201>] do_vfs_ioctl+0x161/0x600
>>> [<ffffffff94261732>] SyS_ioctl+0x92/0xa0
>>> [<ffffffff9479cc77>] entry_SYSCALL_64_fastpath+0x1a/0xa9
>>> [<ffffffffffffffff>] 0xffffffffffffffff
>>>
>>> Looking at the dissasembly:
>>>
>>> 0000000000000170 <blk_execute_rq>:
>>>
>>> ...
>>>
>>>  1fa:   e8 00 00 00 00          callq  1ff <blk_execute_rq+0x8f>
>>>         /* Prevent hang_check timer from firing at us during very 
>>> long I/O */
>>>         hang_check = sysctl_hung_task_timeout_secs;
>>>         if (hang_check)
>>>                 while (!wait_for_completion_io_timeout(&wait, 
>>> hang_check * (HZ/2)));
>>>         else
>>>                 wait_for_completion_io(&wait);
>>>  1ff:   4c 89 e7                mov    %r12,%rdi
>>>  202:   e8 00 00 00 00          callq  207 <blk_execute_rq+0x97>
>>>
>>>         if (rq->errors)
>>>  207:   83 bb 04 01 00 00 01    cmpl   $0x1,0x104(%rbx)
>>>
>>> I ran btrace, and it seems that the IO running before the path 
>>> disconnect is properly cleaned, and only the IO running after gets 
>>> stuck (in the btrace example bellow I run both multipath -ll, and a 
>>> dd):
>>>
>>> 259,2   15    27548     4.790727567  8527  I  RS 764837376 + 8 [fio]
>>> 259,2   15    27549     4.790728103  8527  D  RS 764837376 + 8 [fio]
>>> 259,2   15    27550     4.791244566  8527  A   R 738007008 + 8 <- 
>>> (253,0) 738007008
>>> 259,2   15    27551     4.791245136  8527  I  RS 738007008 + 8 [fio]
>>> 259,2   15    27552     4.791245803  8527  D  RS 738007008 + 8 [fio]
>>> <<<<< From this stage there are no more fio Queue requests, only 
>>> completions
>>> 259,2    8     9118     4.758116423  8062  C  RS 2294539440 + 8 [0]
>>> 259,2    8     9119     4.758559461  8062  C  RS 160307464 + 8 [0]
>>> 259,2    8     9120     4.759009990     0  C  RS 2330623512 + 8 [0]
>>> 259,2    8     9121     4.759457400     0  C  RS 2657710096 + 8 [0]
>>> 259,2    8     9122     4.759928113     0  C  RS 2988256176 + 8 [0]
>>> 259,2    8     9123     4.760388009     0  C  RS 789190936 + 8 [0]
>>> 259,2    8     9124     4.760835889     0  C  RS 2915035352 + 8 [0]
>>> 259,2    8     9125     4.761304282     0  C  RS 2458313624 + 8 [0]
>>> 259,2    8     9126     4.761779694     0  C  RS 2280411456 + 8 [0]
>>> 259,2    8     9127     4.762286741     0  C  RS 2082207376 + 8 [0]
>>> 259,2    8     9128     4.762783722     0  C  RS 2874601688 + 8 [0]
>>> ....
>>> 259,2    8     9173     4.785798485     0  C  RS 828737568 + 8 [0]
>>> 259,2    8     9174     4.786301391     0  C  RS 229168888 + 8 [0]
>>> 259,2    8     9175     4.786769105     0  C  RS 893604096 + 8 [0]
>>> 259,2    8     9176     4.787270594     0  C  RS 2308778728 + 8 [0]
>>> 259,2    8     9177     4.787797448     0  C  RS 2190029912 + 8 [0]
>>> 259,2    8     9178     4.788298956     0  C  RS 2652012944 + 8 [0]
>>> 259,2    8     9179     4.788803759     0  C  RS 2205242008 + 8 [0]
>>> 259,2    8     9180     4.789309423     0  C  RS 1948528416 + 8 [0]
>>> 259,2    8     9181     4.789831240     0  C  RS 2003421288 + 8 [0]
>>> 259,2    8     9182     4.790343528     0  C  RS 2464964728 + 8 [0]
>>> 259,2    8     9183     4.790854684     0  C  RS 764837376 + 8 [0]
>>> <<<<<< This is probably where the port had been disconnected
>>> 259,2   21        7     9.413088410  8574  Q   R 0 + 8 [multipathd]
>>> 259,2   21        8     9.413089387  8574  G   R 0 + 8 [multipathd]
>>> 259,2   21        9     9.413095030  8574  U   N [multipathd] 1
>>> 259,2   21       10     9.413095367  8574  I  RS 0 + 8 [multipathd]
>>> 259,2   21       11     9.413096534  8574  D  RS 0 + 8 [multipathd]
>>> 259,2   10        1    14.381330190  2016  C  RS 738007008 + 8 [7]
>>> 259,2   21       12    14.381394890     0  R  RS 0 + 8 [7]
>>> 259,2    8     9184    80.500537429  8657  Q   R 0 + 8 [dd]
>>> 259,2    8     9185    80.500540122  8657  G   R 0 + 8 [dd]
>>> 259,2    8     9186    80.500545289  8657  U   N [dd] 1
>>> 259,2    8     9187    80.500545792  8657  I  RS 0 + 8 [dd]
>>> <<<<<< At this stage, I reconnected the path:
>>> 259,2    4        1  8381.791090134 11127  Q   R 0 + 8 [dd]
>>> 259,2    4        2  8381.791093611 11127  G   R 0 + 8 [dd]
>>> 259,2    4        3  8381.791098288 11127  U   N [dd] 1
>>> 259,2    4        4  8381.791098791 11127  I  RS 0 + 8 [dd]
>>>
>>>
>>> As I wrote above, I will check the patch on Sunday/Monday.
>>>
>>> On 05/17/2017 08:28 PM, Sagi Grimberg wrote:
>>>> Hi Alex,
>>>>
>>>>> I am trying to test failure scenarios of NVMeoF + multipath. I bring
>>>>> one of ports down and expect to see failed paths using "multipath
>>>>> -ll". Instead I see that "multipath -ll" get stuck.
>>>>>
>>>>> reproduce:
>>>>> 1. Connected to NVMeoF device through 2 ports.
>>>>> 2. Bind them with multipath.
>>>>> 3. Bring one port down (ifconfig eth3 down)
>>>>> 4. Execute "multipath -ll" command and it will get stuck.
>>>>> From strace I see that multipath is stuck in io_destroy() during
>>>>> release of resources. As I understand io_destroy is stuck because of
>>>>> io_cancel() that failed. And io_cancel() failed because of port that
>>>>> was disabled in step 3.
>>>>
>>>> Hmm, it looks like we do take care of failing fast pending IO, but 
>>>> once
>>>> we schedule periodic reconnects the request queues are already stopped
>>>> and new incoming requests may block until we successfully reconnect.
>>>>
>>>> I don't have too much time for it at the moment, but here is an 
>>>> untested
>>>> patch for you to try out:
>>>>
>>>> -- 
>>>> [PATCH] nvme-rdma: restart queues after we at error recovery to fast
>>>>  fail incoming io
>>>>
>>>> When we encounter an transport/controller errors, error recovery
>>>> kicks in which performs:
>>>> 1. stops io/admin queues
>>>> 2. moves transport queues out of LIVE state
>>>> 3. fast fail pending io
>>>> 4. schedule periodic reconnects.
>>>>
>>>> But we also need to fast fail incoming IO taht enters after we
>>>> already scheduled. Given that our queue is not LIVE anymore, simply
>>>> restart the request queues to fail in .queue_rq
>>>>
>>>> Signed-off-by: Sagi Grimberg <sagi at grimberg.me>
>>>> ---
>>>>  drivers/nvme/host/rdma.c | 20 +++++++++++---------
>>>>  1 file changed, 11 insertions(+), 9 deletions(-)
>>>>
>>>> diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
>>>> index dd1c6deef82f..a0aa2bfb91ee 100644
>>>> --- a/drivers/nvme/host/rdma.c
>>>> +++ b/drivers/nvme/host/rdma.c
>>>> @@ -753,28 +753,26 @@ static void 
>>>> nvme_rdma_reconnect_ctrl_work(struct work_struct *work)
>>>>         if (ret)
>>>>                 goto requeue;
>>>>
>>>> - blk_mq_start_stopped_hw_queues(ctrl->ctrl.admin_q, true);
>>>> -
>>>>         ret = nvmf_connect_admin_queue(&ctrl->ctrl);
>>>>         if (ret)
>>>> -               goto stop_admin_q;
>>>> +               goto requeue;
>>>>
>>>>         set_bit(NVME_RDMA_Q_LIVE, &ctrl->queues[0].flags);
>>>>
>>>>         ret = nvme_enable_ctrl(&ctrl->ctrl, ctrl->cap);
>>>>         if (ret)
>>>> -               goto stop_admin_q;
>>>> +               goto requeue;
>>>>
>>>>         nvme_start_keep_alive(&ctrl->ctrl);
>>>>
>>>>         if (ctrl->queue_count > 1) {
>>>>                 ret = nvme_rdma_init_io_queues(ctrl);
>>>>                 if (ret)
>>>> -                       goto stop_admin_q;
>>>> +                       goto requeue;
>>>>
>>>>                 ret = nvme_rdma_connect_io_queues(ctrl);
>>>>                 if (ret)
>>>> -                       goto stop_admin_q;
>>>> +                       goto requeue;
>>>>         }
>>>>
>>>>         changed = nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_LIVE);
>>>> @@ -782,7 +780,6 @@ static void 
>>>> nvme_rdma_reconnect_ctrl_work(struct work_struct *work)
>>>>         ctrl->ctrl.opts->nr_reconnects = 0;
>>>>
>>>>         if (ctrl->queue_count > 1) {
>>>> -               nvme_start_queues(&ctrl->ctrl);
>>>>                 nvme_queue_scan(&ctrl->ctrl);
>>>>                 nvme_queue_async_events(&ctrl->ctrl);
>>>>         }
>>>> @@ -791,8 +788,6 @@ static void 
>>>> nvme_rdma_reconnect_ctrl_work(struct work_struct *work)
>>>>
>>>>         return;
>>>>
>>>> -stop_admin_q:
>>>> -       blk_mq_stop_hw_queues(ctrl->ctrl.admin_q);
>>>>  requeue:
>>>>         dev_info(ctrl->ctrl.device, "Failed reconnect attempt %d\n",
>>>> ctrl->ctrl.opts->nr_reconnects);
>>>> @@ -823,6 +818,13 @@ static void 
>>>> nvme_rdma_error_recovery_work(struct work_struct *work)
>>>>         blk_mq_tagset_busy_iter(&ctrl->admin_tag_set,
>>>>                                 nvme_cancel_request, &ctrl->ctrl);
>>>>
>>>> +       /*
>>>> +        * queues are not a live anymore, so restart the queues to 
>>>> fail fast
>>>> +        * new IO
>>>> +        */
>>>> + blk_mq_start_stopped_hw_queues(ctrl->ctrl.admin_q, true);
>>>> +       nvme_start_queues(&ctrl->ctrl);
>>>> +
>>>>         nvme_rdma_reconnect_or_remove(ctrl);
>>>>  }
>>>>
>>>> -- 
>>>>
>>>> _______________________________________________
>>>> Linux-nvme mailing list
>>>> Linux-nvme at lists.infradead.org
>>>> http://lists.infradead.org/mailman/listinfo/linux-nvme
>>>
>>
>


* NVMeoF: multipath stuck after bringing one ethernet port down
  2017-05-25 12:27         ` shahar.salzman
@ 2017-05-25 13:08           ` shahar.salzman
  2017-05-30 12:14           ` Sagi Grimberg
  1 sibling, 0 replies; 18+ messages in thread
From: shahar.salzman @ 2017-05-25 13:08 UTC (permalink / raw)


This is the patch I applied (Sagi's + my additions), for the 
mlnx-nvme-rdma source RPM:

--- mlnx-nvme-rdma-4.0/source/rdma.c.orig       2017-01-29 22:26:53.000000000 +0200
+++ mlnx-nvme-rdma-4.0/source/rdma.c    2017-05-25 11:12:00.703099000 +0300
@@ -715,35 +715,32 @@
         if (ret)
                 goto requeue;

-       blk_mq_start_stopped_hw_queues(ctrl->ctrl.admin_q, true);
-
         ret = nvmf_connect_admin_queue(&ctrl->ctrl);
         if (ret)
-               goto stop_admin_q;
+               goto requeue;

         set_bit(NVME_RDMA_Q_LIVE, &ctrl->queues[0].flags);

         ret = nvme_enable_ctrl(&ctrl->ctrl, ctrl->cap);
         if (ret)
-               goto stop_admin_q;
+               goto requeue;

         nvme_start_keep_alive(&ctrl->ctrl);

         if (ctrl->queue_count > 1) {
                 ret = nvme_rdma_init_io_queues(ctrl);
                 if (ret)
-                       goto stop_admin_q;
+                       goto requeue;

                 ret = nvme_rdma_connect_io_queues(ctrl);
                 if (ret)
-                       goto stop_admin_q;
+                       goto requeue;
         }

         changed = nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_LIVE);
         WARN_ON_ONCE(!changed);

         if (ctrl->queue_count > 1) {
-               nvme_start_queues(&ctrl->ctrl);
                 nvme_queue_scan(&ctrl->ctrl);
                 nvme_queue_async_events(&ctrl->ctrl);
         }
@@ -752,8 +749,6 @@

         return;

-stop_admin_q:
-       blk_mq_stop_hw_queues(ctrl->ctrl.admin_q);
  requeue:
         /* Make sure we are not resetting/deleting */
         if (ctrl->ctrl.state == NVME_CTRL_RECONNECTING) {
@@ -788,6 +783,13 @@
         blk_mq_tagset_busy_iter(&ctrl->admin_tag_set,
                                 nvme_cancel_request, &ctrl->ctrl);

+       /*
+        * queues are not a live anymore, so restart the queues to fail fast
+        * new IO
+        */
+       blk_mq_start_stopped_hw_queues(ctrl->ctrl.admin_q, true);
+       nvme_start_queues(&ctrl->ctrl);
+
         dev_info(ctrl->ctrl.device, "reconnecting in %d seconds\n",
                 ctrl->reconnect_delay);

@@ -1426,7 +1428,7 @@
         WARN_ON_ONCE(rq->tag < 0);

         if (!nvme_rdma_queue_is_ready(queue, rq))
-               return BLK_MQ_RQ_QUEUE_BUSY;
+               return BLK_MQ_RQ_QUEUE_ERROR;

         dev = queue->device->dev;
         ib_dma_sync_single_for_cpu(dev, sqe->dma,
@@ -2031,6 +2033,8 @@
  {
         int ret;

+       pr_info("Loading nvme-rdma with Sagi G. patch to support "
+               "multipath\n");
         nvme_rdma_wq = create_workqueue("nvme_rdma_wq");
         if (!nvme_rdma_wq)
                 return -ENOMEM;
--- mlnx-nvme-rdma-4.0/source/core.c.orig       2017-05-25 14:50:22.941022000 +0300
+++ mlnx-nvme-rdma-4.0/source/core.c    2017-05-25 14:50:30.563588000 +0300
@@ -488,7 +488,6 @@
         if (error) {
                 dev_err(ctrl->device,
                         "failed nvme_keep_alive_end_io error=%d\n", error);
-               return;
         }

         schedule_delayed_work(&ctrl->ka_work, ctrl->kato * HZ);


On 05/25/2017 03:27 PM, shahar.salzman wrote:
> Returning the QUEUE_ERROR also solved the "nvme list" issue reported, 
> but the paths were not coming back after I reconnected the ports.
>
> I played with the keep alive thread so that it didn't exit if it got 
> an error, and that was resolved as well.
>
>
> On 05/25/2017 02:06 PM, shahar.salzman wrote:
>> OK, so the indefinite retries are due to the 
>> nvme/host/rdma.c:nvme_rdma_queue_rq returning BLK_MQ_RQ_QUEUE_BUSY 
>> when the queue is not ready, wouldn't it be better to return 
>> BLK_MQ_RQ_QUEUE_ERROR so that the application may handle it? Or 
>> alternatively return an error after several retries (per Q or 
>> something).
>>
>> I tested a patch (on top of Sagi's) converting the return value to 
>> QUEUE_ERROR, and both dd and multipath -ll return instantly.
>>
>>
>> On 05/23/2017 08:31 PM, shahar.salzman wrote:
>>> I used the patch with my Mellanox OFED (mlnx-nvme-rdma), it 
>>> partially helps but the overall behaviour is still not good...
>>>
>>> If I do a dd (direct_IO) or multipath -ll during the path 
>>> disconnect, I get indefinite retries with nearly zero time between 
>>> retries, until the path is reconnected, and then dd is completed 
>>> successfully, paths come back to life, returning OK status to the 
>>> process, which completes successfully. I would expect IO to fail after 
>>> a certain timeout, and the retries to be paced somewhat.
>>>
>>> If I do an nvme list (blk_execute_rq), then the process stays in D 
>>> state (I still see the retries), the block layer (at least for this 
>>> device) seems to be stuck, and I have to power cycle the server to 
>>> get it to work...
>>>
>>> Just a reminder, I am using Mellanox OFED 4.0, and kernel 4.9.6, the 
>>> plan here is to upgrade to the latest stable kernel but not sure 
>>> when we'll get to it.
>>>
>>>
>>> On 05/18/2017 08:52 AM, shahar.salzman wrote:
>>>> I have seen this too, but was internally investigating it as I am 
>>>> running 4.9.6 and not upstream. I will be able to check this patch 
>>>> on Sunday/Monday.
>>>>
>>>> On my setup the IOs never complete even after the path is 
>>>> reconnected, the process remains stuck in D state, and the path remains 
>>>> unusable until I power cycle the server.
>>>>
>>>> Some more information, hope it helps:
>>>>
>>>> Dumping the stuck process stack (multipath -ll):
>>>>
>>>> [root at kblock01-knode02 ~]# cat /proc/9715/stack
>>>> [<ffffffff94354727>] blk_execute_rq+0x97/0x110
>>>> [<ffffffffc0a72f8a>] __nvme_submit_user_cmd+0xca/0x300 [nvme_core]
>>>> [<ffffffffc0a73369>] nvme_submit_user_cmd+0x29/0x30 [nvme_core]
>>>> [<ffffffffc0a734bd>] nvme_user_cmd+0x14d/0x180 [nvme_core]
>>>> [<ffffffffc0a7376c>] nvme_ioctl+0x7c/0xa0 [nvme_core]
>>>> [<ffffffff9435d548>] __blkdev_driver_ioctl+0x28/0x30
>>>> [<ffffffff9435dce1>] blkdev_ioctl+0x131/0x8f0
>>>> [<ffffffff9428993c>] block_ioctl+0x3c/0x40
>>>> [<ffffffff94260ae8>] vfs_ioctl+0x18/0x30
>>>> [<ffffffff94261201>] do_vfs_ioctl+0x161/0x600
>>>> [<ffffffff94261732>] SyS_ioctl+0x92/0xa0
>>>> [<ffffffff9479cc77>] entry_SYSCALL_64_fastpath+0x1a/0xa9
>>>> [<ffffffffffffffff>] 0xffffffffffffffff
>>>>
>>>> Looking at the dissasembly:
>>>>
>>>> 0000000000000170 <blk_execute_rq>:
>>>>
>>>> ...
>>>>
>>>>  1fa:   e8 00 00 00 00          callq  1ff <blk_execute_rq+0x8f>
>>>>         /* Prevent hang_check timer from firing at us during very 
>>>> long I/O */
>>>>         hang_check = sysctl_hung_task_timeout_secs;
>>>>         if (hang_check)
>>>>                 while (!wait_for_completion_io_timeout(&wait, 
>>>> hang_check * (HZ/2)));
>>>>         else
>>>>                 wait_for_completion_io(&wait);
>>>>  1ff:   4c 89 e7                mov    %r12,%rdi
>>>>  202:   e8 00 00 00 00          callq  207 <blk_execute_rq+0x97>
>>>>
>>>>         if (rq->errors)
>>>>  207:   83 bb 04 01 00 00 01    cmpl   $0x1,0x104(%rbx)
>>>>
>>>> I ran btrace, and it seems that the IO running before the path 
>>>> disconnect is properly cleaned, and only the IO running after gets 
>>>> stuck (in the btrace example below I run both multipath -ll and a 
>>>> dd):
>>>>
>>>> 259,2   15    27548     4.790727567  8527  I  RS 764837376 + 8 [fio]
>>>> 259,2   15    27549     4.790728103  8527  D  RS 764837376 + 8 [fio]
>>>> 259,2   15    27550     4.791244566  8527  A   R 738007008 + 8 <- 
>>>> (253,0) 738007008
>>>> 259,2   15    27551     4.791245136  8527  I  RS 738007008 + 8 [fio]
>>>> 259,2   15    27552     4.791245803  8527  D  RS 738007008 + 8 [fio]
>>>> <<<<< From this stage there are no more fio Queue requests, only 
>>>> completions
>>>> 259,2    8     9118     4.758116423  8062  C  RS 2294539440 + 8 [0]
>>>> 259,2    8     9119     4.758559461  8062  C  RS 160307464 + 8 [0]
>>>> 259,2    8     9120     4.759009990     0  C  RS 2330623512 + 8 [0]
>>>> 259,2    8     9121     4.759457400     0  C  RS 2657710096 + 8 [0]
>>>> 259,2    8     9122     4.759928113     0  C  RS 2988256176 + 8 [0]
>>>> 259,2    8     9123     4.760388009     0  C  RS 789190936 + 8 [0]
>>>> 259,2    8     9124     4.760835889     0  C  RS 2915035352 + 8 [0]
>>>> 259,2    8     9125     4.761304282     0  C  RS 2458313624 + 8 [0]
>>>> 259,2    8     9126     4.761779694     0  C  RS 2280411456 + 8 [0]
>>>> 259,2    8     9127     4.762286741     0  C  RS 2082207376 + 8 [0]
>>>> 259,2    8     9128     4.762783722     0  C  RS 2874601688 + 8 [0]
>>>> ....
>>>> 259,2    8     9173     4.785798485     0  C  RS 828737568 + 8 [0]
>>>> 259,2    8     9174     4.786301391     0  C  RS 229168888 + 8 [0]
>>>> 259,2    8     9175     4.786769105     0  C  RS 893604096 + 8 [0]
>>>> 259,2    8     9176     4.787270594     0  C  RS 2308778728 + 8 [0]
>>>> 259,2    8     9177     4.787797448     0  C  RS 2190029912 + 8 [0]
>>>> 259,2    8     9178     4.788298956     0  C  RS 2652012944 + 8 [0]
>>>> 259,2    8     9179     4.788803759     0  C  RS 2205242008 + 8 [0]
>>>> 259,2    8     9180     4.789309423     0  C  RS 1948528416 + 8 [0]
>>>> 259,2    8     9181     4.789831240     0  C  RS 2003421288 + 8 [0]
>>>> 259,2    8     9182     4.790343528     0  C  RS 2464964728 + 8 [0]
>>>> 259,2    8     9183     4.790854684     0  C  RS 764837376 + 8 [0]
>>>> <<<<<< This is probably where the port had been disconnected
>>>> 259,2   21        7     9.413088410  8574  Q   R 0 + 8 [multipathd]
>>>> 259,2   21        8     9.413089387  8574  G   R 0 + 8 [multipathd]
>>>> 259,2   21        9     9.413095030  8574  U   N [multipathd] 1
>>>> 259,2   21       10     9.413095367  8574  I  RS 0 + 8 [multipathd]
>>>> 259,2   21       11     9.413096534  8574  D  RS 0 + 8 [multipathd]
>>>> 259,2   10        1    14.381330190  2016  C  RS 738007008 + 8 [7]
>>>> 259,2   21       12    14.381394890     0  R  RS 0 + 8 [7]
>>>> 259,2    8     9184    80.500537429  8657  Q   R 0 + 8 [dd]
>>>> 259,2    8     9185    80.500540122  8657  G   R 0 + 8 [dd]
>>>> 259,2    8     9186    80.500545289  8657  U   N [dd] 1
>>>> 259,2    8     9187    80.500545792  8657  I  RS 0 + 8 [dd]
>>>> <<<<<< At this stage, I reconnected the path:
>>>> 259,2    4        1  8381.791090134 11127  Q   R 0 + 8 [dd]
>>>> 259,2    4        2  8381.791093611 11127  G   R 0 + 8 [dd]
>>>> 259,2    4        3  8381.791098288 11127  U   N [dd] 1
>>>> 259,2    4        4  8381.791098791 11127  I  RS 0 + 8 [dd]
>>>>
>>>>
>>>> As I wrote above, I will check the patch on Sunday/Monday.
>>>>
>>>> On 05/17/2017 08:28 PM, Sagi Grimberg wrote:
>>>>> Hi Alex,
>>>>>
>>>>>> I am trying to test failure scenarios of NVMeoF + multipath. I bring
>>>>>> one of ports down and expect to see failed paths using "multipath
>>>>>> -ll". Instead I see that "multipath -ll" get stuck.
>>>>>>
>>>>>> reproduce:
>>>>>> 1. Connected to NVMeoF device through 2 ports.
>>>>>> 2. Bind them with multipath.
>>>>>> 3. Bring one port down (ifconfig eth3 down)
>>>>>> 4. Execute "multipath -ll" command and it will get stuck.
>>>>>> From strace I see that multipath is stuck in io_destroy() during
>>>>>> release of resources. As I understand io_destroy is stuck because of
>>>>>> io_cancel() that failed. And io_cancel() failed because of port that
>>>>>> was disabled in step 3.
>>>>>
>>>>> Hmm, it looks like we do take care of failing fast pending IO, but 
>>>>> once
>>>>> we schedule periodic reconnects the request queues are already 
>>>>> stopped
>>>>> and new incoming requests may block until we successfully reconnect.
>>>>>
>>>>> I don't have too much time for it at the moment, but here is an 
>>>>> untested
>>>>> patch for you to try out:
>>>>>
>>>>> -- 
>>>>> [PATCH] nvme-rdma: restart queues at error recovery to fast
>>>>>  fail incoming io
>>>>>
>>>>> When we encounter transport/controller errors, error recovery
>>>>> kicks in which performs:
>>>>> 1. stops io/admin queues
>>>>> 2. moves transport queues out of LIVE state
>>>>> 3. fast fail pending io
>>>>> 4. schedule periodic reconnects.
>>>>>
>>>>> But we also need to fast fail incoming IO that enters after we
>>>>> already scheduled. Given that our queue is not LIVE anymore, simply
>>>>> restart the request queues to fail in .queue_rq
>>>>>
>>>>> Signed-off-by: Sagi Grimberg <sagi at grimberg.me>
>>>>> ---
>>>>>  drivers/nvme/host/rdma.c | 20 +++++++++++---------
>>>>>  1 file changed, 11 insertions(+), 9 deletions(-)
>>>>>
>>>>> diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
>>>>> index dd1c6deef82f..a0aa2bfb91ee 100644
>>>>> --- a/drivers/nvme/host/rdma.c
>>>>> +++ b/drivers/nvme/host/rdma.c
>>>>> @@ -753,28 +753,26 @@ static void 
>>>>> nvme_rdma_reconnect_ctrl_work(struct work_struct *work)
>>>>>         if (ret)
>>>>>                 goto requeue;
>>>>>
>>>>> - blk_mq_start_stopped_hw_queues(ctrl->ctrl.admin_q, true);
>>>>> -
>>>>>         ret = nvmf_connect_admin_queue(&ctrl->ctrl);
>>>>>         if (ret)
>>>>> -               goto stop_admin_q;
>>>>> +               goto requeue;
>>>>>
>>>>>         set_bit(NVME_RDMA_Q_LIVE, &ctrl->queues[0].flags);
>>>>>
>>>>>         ret = nvme_enable_ctrl(&ctrl->ctrl, ctrl->cap);
>>>>>         if (ret)
>>>>> -               goto stop_admin_q;
>>>>> +               goto requeue;
>>>>>
>>>>>         nvme_start_keep_alive(&ctrl->ctrl);
>>>>>
>>>>>         if (ctrl->queue_count > 1) {
>>>>>                 ret = nvme_rdma_init_io_queues(ctrl);
>>>>>                 if (ret)
>>>>> -                       goto stop_admin_q;
>>>>> +                       goto requeue;
>>>>>
>>>>>                 ret = nvme_rdma_connect_io_queues(ctrl);
>>>>>                 if (ret)
>>>>> -                       goto stop_admin_q;
>>>>> +                       goto requeue;
>>>>>         }
>>>>>
>>>>>         changed = nvme_change_ctrl_state(&ctrl->ctrl, 
>>>>> NVME_CTRL_LIVE);
>>>>> @@ -782,7 +780,6 @@ static void 
>>>>> nvme_rdma_reconnect_ctrl_work(struct work_struct *work)
>>>>>         ctrl->ctrl.opts->nr_reconnects = 0;
>>>>>
>>>>>         if (ctrl->queue_count > 1) {
>>>>> -               nvme_start_queues(&ctrl->ctrl);
>>>>>                 nvme_queue_scan(&ctrl->ctrl);
>>>>> nvme_queue_async_events(&ctrl->ctrl);
>>>>>         }
>>>>> @@ -791,8 +788,6 @@ static void 
>>>>> nvme_rdma_reconnect_ctrl_work(struct work_struct *work)
>>>>>
>>>>>         return;
>>>>>
>>>>> -stop_admin_q:
>>>>> -       blk_mq_stop_hw_queues(ctrl->ctrl.admin_q);
>>>>>  requeue:
>>>>>         dev_info(ctrl->ctrl.device, "Failed reconnect attempt %d\n",
>>>>> ctrl->ctrl.opts->nr_reconnects);
>>>>> @@ -823,6 +818,13 @@ static void 
>>>>> nvme_rdma_error_recovery_work(struct work_struct *work)
>>>>> blk_mq_tagset_busy_iter(&ctrl->admin_tag_set,
>>>>>                                 nvme_cancel_request, &ctrl->ctrl);
>>>>>
>>>>> +       /*
>>>>> +        * queues are not alive anymore, so restart the queues to 
>>>>> fail fast
>>>>> +        * new IO
>>>>> +        */
>>>>> + blk_mq_start_stopped_hw_queues(ctrl->ctrl.admin_q, true);
>>>>> +       nvme_start_queues(&ctrl->ctrl);
>>>>> +
>>>>>         nvme_rdma_reconnect_or_remove(ctrl);
>>>>>  }
>>>>>
>>>>> -- 
>>>>>
>>>>> _______________________________________________
>>>>> Linux-nvme mailing list
>>>>> Linux-nvme at lists.infradead.org
>>>>> http://lists.infradead.org/mailman/listinfo/linux-nvme
>>>>
>>>
>>
>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* NVMeoF: multipath stuck after bringing one ethernet port down
  2017-05-23 17:31     ` shahar.salzman
  2017-05-25 11:06       ` shahar.salzman
@ 2017-05-30 12:05       ` Sagi Grimberg
  2017-05-30 13:37         ` Max Gurtovoy
  1 sibling, 1 reply; 18+ messages in thread
From: Sagi Grimberg @ 2017-05-30 12:05 UTC (permalink / raw)



> I used the patch with my Mellanox OFED (mlnx-nvme-rdma), it partially
> helps but the overall behaviour is still not good...

Can you test with upstream kernel code? Otherwise I suggest approaching
Mellanox support.

We can't support non-upstream code really...
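
As a quick sanity check of which nvme_rdma module is actually loaded
(in-tree vs. an out-of-tree replacement), something along these lines is
usually enough -- exact paths vary by distro, so treat them as illustrative:

modinfo nvme_rdma | grep -iE 'filename|version'
uname -r
# an in-tree build typically lives under /lib/modules/$(uname -r)/kernel/drivers/nvme/host/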

^ permalink raw reply	[flat|nested] 18+ messages in thread

* NVMeoF: multipath stuck after bringing one ethernet port down
  2017-05-25 11:06       ` shahar.salzman
  2017-05-25 12:27         ` shahar.salzman
@ 2017-05-30 12:11         ` Sagi Grimberg
  1 sibling, 0 replies; 18+ messages in thread
From: Sagi Grimberg @ 2017-05-30 12:11 UTC (permalink / raw)



> OK, so the indefinite retries are due to the
> nvme/host/rdma.c:nvme_rdma_queue_rq returning BLK_MQ_RQ_QUEUE_BUSY when
> the queue is not ready, wouldn't it be better to return
> BLK_MQ_RQ_QUEUE_ERROR so that the application may handle it? Or
> alternatively return an error after several retries (per Q or something).
>
> I tested a patch (on top of Sagi's) converting the return value to
> QUEUE_ERROR, and both dd and multipath -ll return instantly.

We shouldn't return QUEUE_ERROR unconditionally, probably only if
the controller is in state RECONNECTING (RESETTING is a quick phase
and DELETING is going to kill the request anyway).

Does this keep the existing behavior:
--
diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 28bd255c144d..f0d700220ec1 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -1433,7 +1433,7 @@ nvme_rdma_timeout(struct request *rq, bool reserved)
  /*
   * We cannot accept any other command until the Connect command has 
completed.
   */
-static inline bool nvme_rdma_queue_is_ready(struct nvme_rdma_queue *queue,
+static inline int nvme_rdma_queue_is_ready(struct nvme_rdma_queue *queue,
                 struct request *rq)
  {
         if (unlikely(!test_bit(NVME_RDMA_Q_LIVE, &queue->flags))) {
@@ -1441,11 +1441,15 @@ static inline bool 
nvme_rdma_queue_is_ready(struct nvme_rdma_queue *queue,

                 if (!blk_rq_is_passthrough(rq) ||
                     cmd->common.opcode != nvme_fabrics_command ||
-                   cmd->fabrics.fctype != nvme_fabrics_type_connect)
-                       return false;
+                   cmd->fabrics.fctype != nvme_fabrics_type_connect) {
+                       if (queue->ctrl->ctrl.state == NVME_CTRL_RECONNECTING)
+                               return -EIO;
+                       else
+                               return -EAGAIN;
+               }
         }

-       return true;
+       return 0;
  }

  static int nvme_rdma_queue_rq(struct blk_mq_hw_ctx *hctx,
@@ -1463,8 +1467,9 @@ static int nvme_rdma_queue_rq(struct blk_mq_hw_ctx 
*hctx,

         WARN_ON_ONCE(rq->tag < 0);

-       if (!nvme_rdma_queue_is_ready(queue, rq))
-               return BLK_MQ_RQ_QUEUE_BUSY;
+       ret = nvme_rdma_queue_is_ready(queue, rq);
+       if (unlikely(ret))
+               goto err;

         dev = queue->device->dev;
         ib_dma_sync_single_for_cpu(dev, sqe->dma,
--

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* NVMeoF: multipath stuck after bringing one ethernet port down
  2017-05-25 12:27         ` shahar.salzman
  2017-05-25 13:08           ` shahar.salzman
@ 2017-05-30 12:14           ` Sagi Grimberg
  1 sibling, 0 replies; 18+ messages in thread
From: Sagi Grimberg @ 2017-05-30 12:14 UTC (permalink / raw)



> I played with the keep alive thread so that it didn't exit if it got an
> error, and that was resolved as well.

Why is that needed? There is no reason to schedule keep-alives if they
fail.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* NVMeoF: multipath stuck after bringing one ethernet port down
  2017-05-30 12:05       ` Sagi Grimberg
@ 2017-05-30 13:37         ` Max Gurtovoy
  2017-05-30 14:17           ` Sagi Grimberg
  0 siblings, 1 reply; 18+ messages in thread
From: Max Gurtovoy @ 2017-05-30 13:37 UTC (permalink / raw)




On 5/30/2017 3:05 PM, Sagi Grimberg wrote:
>
>> I used the patch with my Mellanox OFED (mlnx-nvme-rdma), it partially
>> helps but the overall behaviour is still not good...
>
> Can you test with upstream kernel code? Otherwise I suggest approaching
> Mellanox support.
>
> We can't support non-upstream code really...

Hi guys,
this is a known issue in upstream code and Mellanox OFED code as well.
I agree with Sagi's approach for future issues using our package.
For this one, we will test the proposed fixes and update you regarding 
the results.

Max.


>
> _______________________________________________
> Linux-nvme mailing list
> Linux-nvme at lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 18+ messages in thread

* NVMeoF: multipath stuck after bringing one ethernet port down
  2017-05-30 13:37         ` Max Gurtovoy
@ 2017-05-30 14:17           ` Sagi Grimberg
  2017-06-05  7:11             ` shahar.salzman
  2017-06-05  8:40             ` Christoph Hellwig
  0 siblings, 2 replies; 18+ messages in thread
From: Sagi Grimberg @ 2017-05-30 14:17 UTC (permalink / raw)



> Hi guys,
> this is a known issue in upstream code and Mellanox OFED code as well.
> I agree with Sagi's approach for future issues using our package.
> For this one, we will test the proposed fixes and update you regarding
> the results.

You can try:

--
[PATCH] nvme-rdma: fast fail incoming requests while we reconnect

When we encounter transport/controller errors, error recovery
kicks in which performs:
1. stops io/admin queues
2. moves transport queues out of LIVE state
3. fast fail pending io
4. schedule periodic reconnects.

But we also need to fast fail incoming IO that enters after we
already scheduled. Given that our queue is not LIVE anymore, simply
restart the request queues to fail in .queue_rq

Signed-off-by: Sagi Grimberg <sagi at grimberg.me>
---
  drivers/nvme/host/rdma.c | 37 ++++++++++++++++++++++---------------
  1 file changed, 22 insertions(+), 15 deletions(-)

diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 28bd255c144d..ce8f1e992e64 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -753,28 +753,26 @@ static void nvme_rdma_reconnect_ctrl_work(struct 
work_struct *work)
         if (ret)
                 goto requeue;

-       blk_mq_start_stopped_hw_queues(ctrl->ctrl.admin_q, true);
-
         ret = nvmf_connect_admin_queue(&ctrl->ctrl);
         if (ret)
-               goto stop_admin_q;
+               goto requeue;

         set_bit(NVME_RDMA_Q_LIVE, &ctrl->queues[0].flags);

         ret = nvme_enable_ctrl(&ctrl->ctrl, ctrl->cap);
         if (ret)
-               goto stop_admin_q;
+               goto requeue;

         nvme_start_keep_alive(&ctrl->ctrl);

         if (ctrl->queue_count > 1) {
                 ret = nvme_rdma_init_io_queues(ctrl);
                 if (ret)
-                       goto stop_admin_q;
+                       goto requeue;

                 ret = nvme_rdma_connect_io_queues(ctrl);
                 if (ret)
-                       goto stop_admin_q;
+                       goto requeue;
         }

         changed = nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_LIVE);
@@ -782,7 +780,6 @@ static void nvme_rdma_reconnect_ctrl_work(struct 
work_struct *work)
         ctrl->ctrl.opts->nr_reconnects = 0;

         if (ctrl->queue_count > 1) {
-               nvme_start_queues(&ctrl->ctrl);
                 nvme_queue_scan(&ctrl->ctrl);
                 nvme_queue_async_events(&ctrl->ctrl);
         }
@@ -791,8 +788,6 @@ static void nvme_rdma_reconnect_ctrl_work(struct 
work_struct *work)

         return;

-stop_admin_q:
-       blk_mq_stop_hw_queues(ctrl->ctrl.admin_q);
  requeue:
         dev_info(ctrl->ctrl.device, "Failed reconnect attempt %d\n",
                         ctrl->ctrl.opts->nr_reconnects);
@@ -823,6 +818,13 @@ static void nvme_rdma_error_recovery_work(struct 
work_struct *work)
         blk_mq_tagset_busy_iter(&ctrl->admin_tag_set,
                                 nvme_cancel_request, &ctrl->ctrl);

+       /*
+        * queues are not alive anymore, so restart the queues to fail fast
+        * new IO
+        */
+       blk_mq_start_stopped_hw_queues(ctrl->ctrl.admin_q, true);
+       nvme_start_queues(&ctrl->ctrl);
+
         nvme_rdma_reconnect_or_remove(ctrl);
  }

@@ -1433,7 +1435,7 @@ nvme_rdma_timeout(struct request *rq, bool reserved)
  /*
   * We cannot accept any other command until the Connect command has 
completed.
   */
-static inline bool nvme_rdma_queue_is_ready(struct nvme_rdma_queue *queue,
+static inline int nvme_rdma_queue_is_ready(struct nvme_rdma_queue *queue,
                 struct request *rq)
  {
         if (unlikely(!test_bit(NVME_RDMA_Q_LIVE, &queue->flags))) {
@@ -1441,11 +1443,15 @@ static inline bool 
nvme_rdma_queue_is_ready(struct nvme_rdma_queue *queue,

                 if (!blk_rq_is_passthrough(rq) ||
                     cmd->common.opcode != nvme_fabrics_command ||
-                   cmd->fabrics.fctype != nvme_fabrics_type_connect)
-                       return false;
+                   cmd->fabrics.fctype != nvme_fabrics_type_connect) {
+                       if (queue->ctrl->ctrl.state == NVME_CTRL_RECONNECTING)
+                               return -EIO;
+                       else
+                               return -EAGAIN;
+               }
         }

-       return true;
+       return 0;
  }

  static int nvme_rdma_queue_rq(struct blk_mq_hw_ctx *hctx,
@@ -1463,8 +1469,9 @@ static int nvme_rdma_queue_rq(struct blk_mq_hw_ctx 
*hctx,

         WARN_ON_ONCE(rq->tag < 0);

-       if (!nvme_rdma_queue_is_ready(queue, rq))
-               return BLK_MQ_RQ_QUEUE_BUSY;
+       ret = nvme_rdma_queue_is_ready(queue, rq);
+       if (unlikely(ret))
+               goto err;

         dev = queue->device->dev;
         ib_dma_sync_single_for_cpu(dev, sqe->dma,
--

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* NVMeoF: multipath stuck after bringing one ethernet port down
  2017-05-30 14:17           ` Sagi Grimberg
@ 2017-06-05  7:11             ` shahar.salzman
  2017-06-05  8:14               ` Sagi Grimberg
  2017-06-05  8:40             ` Christoph Hellwig
  1 sibling, 1 reply; 18+ messages in thread
From: shahar.salzman @ 2017-06-05  7:11 UTC (permalink / raw)


I tested the patch, works great.

Both IO (dd), "multipath -ll", and "nvme list" return instantaneously 
with IO error, multipath is reinstated as soon as the path is reconnected.
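
For reference, the check I ran is roughly the following (the interface name
comes from the original reproduction steps; the multipath device name is just
an example and will differ on other setups):

ifconfig eth3 down                                        # drop one path
dd if=/dev/mapper/mpatha of=/dev/null bs=4k count=1 iflag=direct
multipath -ll                                             # reports the failed path right away
nvme list
ifconfig eth3 up                                          # path is reinstated after reconnect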

Sagi, thanks for the fix!


On 05/30/2017 05:17 PM, Sagi Grimberg wrote:
>
>> Hi guys,
>> this is a known issue in upstream code and Mellanox OFED code as well.
>> I agree with Sagi's approach for future issues using our package.
>> For this one, we will test the proposed fixes and update you regarding
>> the results.
>
> You can try:
>
> -- 
> [PATCH] nvme-rdma: fast fail incoming requests while we reconnect
>
> When we encounter transport/controller errors, error recovery
> kicks in which performs:
> 1. stops io/admin queues
> 2. moves transport queues out of LIVE state
> 3. fast fail pending io
> 4. schedule periodic reconnects.
>
> But we also need to fast fail incoming IO that enters after we
> already scheduled. Given that our queue is not LIVE anymore, simply
> restart the request queues to fail in .queue_rq
>
> Signed-off-by: Sagi Grimberg <sagi at grimberg.me>
> ---
>  drivers/nvme/host/rdma.c | 37 ++++++++++++++++++++++---------------
>  1 file changed, 22 insertions(+), 15 deletions(-)
>
> diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
> index 28bd255c144d..ce8f1e992e64 100644
> --- a/drivers/nvme/host/rdma.c
> +++ b/drivers/nvme/host/rdma.c
> @@ -753,28 +753,26 @@ static void nvme_rdma_reconnect_ctrl_work(struct 
> work_struct *work)
>         if (ret)
>                 goto requeue;
>
> -       blk_mq_start_stopped_hw_queues(ctrl->ctrl.admin_q, true);
> -
>         ret = nvmf_connect_admin_queue(&ctrl->ctrl);
>         if (ret)
> -               goto stop_admin_q;
> +               goto requeue;
>
>         set_bit(NVME_RDMA_Q_LIVE, &ctrl->queues[0].flags);
>
>         ret = nvme_enable_ctrl(&ctrl->ctrl, ctrl->cap);
>         if (ret)
> -               goto stop_admin_q;
> +               goto requeue;
>
>         nvme_start_keep_alive(&ctrl->ctrl);
>
>         if (ctrl->queue_count > 1) {
>                 ret = nvme_rdma_init_io_queues(ctrl);
>                 if (ret)
> -                       goto stop_admin_q;
> +                       goto requeue;
>
>                 ret = nvme_rdma_connect_io_queues(ctrl);
>                 if (ret)
> -                       goto stop_admin_q;
> +                       goto requeue;
>         }
>
>         changed = nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_LIVE);
> @@ -782,7 +780,6 @@ static void nvme_rdma_reconnect_ctrl_work(struct 
> work_struct *work)
>         ctrl->ctrl.opts->nr_reconnects = 0;
>
>         if (ctrl->queue_count > 1) {
> -               nvme_start_queues(&ctrl->ctrl);
>                 nvme_queue_scan(&ctrl->ctrl);
>                 nvme_queue_async_events(&ctrl->ctrl);
>         }
> @@ -791,8 +788,6 @@ static void nvme_rdma_reconnect_ctrl_work(struct 
> work_struct *work)
>
>         return;
>
> -stop_admin_q:
> -       blk_mq_stop_hw_queues(ctrl->ctrl.admin_q);
>  requeue:
>         dev_info(ctrl->ctrl.device, "Failed reconnect attempt %d\n",
>                         ctrl->ctrl.opts->nr_reconnects);
> @@ -823,6 +818,13 @@ static void nvme_rdma_error_recovery_work(struct 
> work_struct *work)
>         blk_mq_tagset_busy_iter(&ctrl->admin_tag_set,
>                                 nvme_cancel_request, &ctrl->ctrl);
>
> +       /*
> +        * queues are not alive anymore, so restart the queues to 
> fail fast
> +        * new IO
> +        */
> +       blk_mq_start_stopped_hw_queues(ctrl->ctrl.admin_q, true);
> +       nvme_start_queues(&ctrl->ctrl);
> +
>         nvme_rdma_reconnect_or_remove(ctrl);
>  }
>
> @@ -1433,7 +1435,7 @@ nvme_rdma_timeout(struct request *rq, bool 
> reserved)
>  /*
>   * We cannot accept any other command until the Connect command has 
> completed.
>   */
> -static inline bool nvme_rdma_queue_is_ready(struct nvme_rdma_queue 
> *queue,
> +static inline int nvme_rdma_queue_is_ready(struct nvme_rdma_queue 
> *queue,
>                 struct request *rq)
>  {
>         if (unlikely(!test_bit(NVME_RDMA_Q_LIVE, &queue->flags))) {
> @@ -1441,11 +1443,15 @@ static inline bool 
> nvme_rdma_queue_is_ready(struct nvme_rdma_queue *queue,
>
>                 if (!blk_rq_is_passthrough(rq) ||
>                     cmd->common.opcode != nvme_fabrics_command ||
> -                   cmd->fabrics.fctype != nvme_fabrics_type_connect)
> -                       return false;
> +                   cmd->fabrics.fctype != nvme_fabrics_type_connect) {
> +                       if (queue->ctrl->ctrl.state == NVME_CTRL_RECONNECTING)
> +                               return -EIO;
> +                       else
> +                               return -EAGAIN;
> +               }
>         }
>
> -       return true;
> +       return 0;
>  }
>
>  static int nvme_rdma_queue_rq(struct blk_mq_hw_ctx *hctx,
> @@ -1463,8 +1469,9 @@ static int nvme_rdma_queue_rq(struct 
> blk_mq_hw_ctx *hctx,
>
>         WARN_ON_ONCE(rq->tag < 0);
>
> -       if (!nvme_rdma_queue_is_ready(queue, rq))
> -               return BLK_MQ_RQ_QUEUE_BUSY;
> +       ret = nvme_rdma_queue_is_ready(queue, rq);
> +       if (unlikely(ret))
> +               goto err;
>
>         dev = queue->device->dev;
>         ib_dma_sync_single_for_cpu(dev, sqe->dma,
> -- 

^ permalink raw reply	[flat|nested] 18+ messages in thread

* NVMeoF: multipath stuck after bringing one ethernet port down
  2017-06-05  7:11             ` shahar.salzman
@ 2017-06-05  8:14               ` Sagi Grimberg
  0 siblings, 0 replies; 18+ messages in thread
From: Sagi Grimberg @ 2017-06-05  8:14 UTC (permalink / raw)


> I tested the patch, works great.
> 
> Both IO (dd), "multipath -ll", and "nvme list" return instantaneously 
> with IO error, multipath is reinstated as soon as the path is reconnected.

Thanks Shahar,

Alex, does the patch work for you as well?

Christoph, any feedback on this patch? If not, I'll submit a formal
patch soon for 4.12-rc.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* NVMeoF: multipath stuck after bringing one ethernet port down
  2017-05-30 14:17           ` Sagi Grimberg
  2017-06-05  7:11             ` shahar.salzman
@ 2017-06-05  8:40             ` Christoph Hellwig
  2017-06-05  8:53               ` Sagi Grimberg
  1 sibling, 1 reply; 18+ messages in thread
From: Christoph Hellwig @ 2017-06-05  8:40 UTC (permalink / raw)


On Tue, May 30, 2017@05:17:40PM +0300, Sagi Grimberg wrote:
> [PATCH] nvme-rdma: fast fail incoming requests while we reconnect
>
> When we encounter transport/controller errors, error recovery
> kicks in which performs:
> 1. stops io/admin queues
> 2. moves transport queues out of LIVE state
> 3. fast fail pending io
> 4. schedule periodic reconnects.
>
> But we also need to fast fail incoming IO that enters after we
> already scheduled. Given that our queue is not LIVE anymore, simply
> restart the request queues to fail in .queue_rq

But we shouldn't _fail_ I/O just because we're reconnecting, we
need to be able to retry it once reconnected.

> +                   cmd->fabrics.fctype != nvme_fabrics_type_connect) {
> +                       if (queue->ctrl->ctrl.state == NVME_CTRL_RECONNECTING)
> +                               return -EIO;
> +                       else
> +                               return -EAGAIN;
> +               }

So this looks somewhat bogus to me, while the rest looks ok.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* NVMeoF: multipath stuck after bringing one ethernet port down
  2017-06-05  8:40             ` Christoph Hellwig
@ 2017-06-05  8:53               ` Sagi Grimberg
  2017-06-05 15:07                 ` Christoph Hellwig
  0 siblings, 1 reply; 18+ messages in thread
From: Sagi Grimberg @ 2017-06-05  8:53 UTC (permalink / raw)



>> [PATCH] nvme-rdma: fast fail incoming requests while we reconnect
>>
>> When we encounter transport/controller errors, error recovery
>> kicks in which performs:
>> 1. stops io/admin queues
>> 2. moves transport queues out of LIVE state
>> 3. fast fail pending io
>> 4. schedule periodic reconnects.
>>
>> But we also need to fast fail incoming IO that enters after we
>> already scheduled. Given that our queue is not LIVE anymore, simply
>> restart the request queues to fail in .queue_rq
> 
> But we shouldn't _fail_ I/O just because we're reconnecting, we
> need to be able to retry it once reconnected.

I'm not sure: the point is to fail fast so that dm or the user can
fail over traffic. Besides, we iterate and cancel all inflight IO; this
attempts to give the same treatment to IO that arrives later...

In several scsi transports, we have the concept of fast_io_fail_tmo
which we could add to nvme, but from my experience, people usually set
it to a minimum to achieve fast failover (usually smaller than the very
first reconnect attempt).

We have the nvme_max_retries modparam, so we could simply fail fast until
we hit that limit, but I suspect it'll expire very fast.
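
For reference, the existing knobs I'm referring to look roughly like this
(the sysfs paths are illustrative; rport/session IDs depend on the setup):

echo 5 > /sys/class/fc_remote_ports/rport-0:0-1/fast_io_fail_tmo   # FC fast IO failure
cat /sys/class/iscsi_session/session1/recovery_tmo                 # iSCSI analogue
cat /sys/module/nvme_core/parameters/max_retries                   # nvme retry budget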

> 
>> +                   cmd->fabrics.fctype != nvme_fabrics_type_connect) {
>> +                       if (queue->ctrl->ctrl.state == NVME_CTRL_RECONNECTING)
>> +                               return -EIO;
>> +                       else
>> +                               return -EAGAIN;
>> +               }
> 
> So this looks somewhat bogus to me, while the rest looks ok.

The point here is that RECONNECTING is a ctrl state that has a
potential to linger for a long time (unlike RESETTING or DELETING),
so we don't want to trigger requeue right away.

I'm open to other ideas. I just want to prevent triggering a redundant
loop of queue_rq -> fail with BUSY -> queue_rq -> fail with BUSY ...

Thoughts?

^ permalink raw reply	[flat|nested] 18+ messages in thread

* NVMeoF: multipath stuck after bringing one ethernet port down
  2017-06-05  8:53               ` Sagi Grimberg
@ 2017-06-05 15:07                 ` Christoph Hellwig
  2017-06-05 17:23                   ` Sagi Grimberg
  0 siblings, 1 reply; 18+ messages in thread
From: Christoph Hellwig @ 2017-06-05 15:07 UTC (permalink / raw)


On Mon, Jun 05, 2017@11:53:58AM +0300, Sagi Grimberg wrote:
>> So this looks somewhat bogus to me, while the rest looks ok.
>
> The point here is that RECONNECTING is a ctrl state that has a
> potential to linger for a long time (unlike RESETTING or DELETING),
> so we don't want to trigger requeue right away.
>
> I'm open to other ideas. I just want to prevent triggering a redundant
> loop of queue_rq -> fail with BUSY -> queue_rq -> fail with BUSY ...
>
> Thoughts?

Let's get this patch in, then sort out a common strategy for the
dev_loss_tmo for all drivers, as FC is already doing some work in
that area.
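
(For comparison, FC exposes dev_loss_tmo per remote port -- once it expires
the rport is removed entirely; the path below is illustrative:)

cat /sys/class/fc_remote_ports/rport-0:0-1/dev_loss_tmo
echo 60 > /sys/class/fc_remote_ports/rport-0:0-1/dev_loss_tmo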

^ permalink raw reply	[flat|nested] 18+ messages in thread

* NVMeoF: multipath stuck after bringing one ethernet port down
  2017-06-05 15:07                 ` Christoph Hellwig
@ 2017-06-05 17:23                   ` Sagi Grimberg
  0 siblings, 0 replies; 18+ messages in thread
From: Sagi Grimberg @ 2017-06-05 17:23 UTC (permalink / raw)



>>> So this looks somewhat bogus to me, while the rest looks ok.
>>
>> The point here is that RECONNECTING is a ctrl state that has a
>> potential to linger for a long time (unlike RESETTING or DELETING),
>> so we don't want to trigger requeue right away.
>>
>> I'm open to other ideas. I just want to prevent triggering a redundant
>> loop of queue_rq -> fail with BUSY -> queue_rq -> fail with BUSY ...
>>
>> Thoughts?
> 
> Let's get this patch in, then sort out a common strategy for the
> dev_loss_tmo for all drivers, as FC is already doing some work in
> that area.

It's not dev_loss_tmo (timeout to give up on reconnect attempts),
but yea, I agree. I'll send a patch.

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2017-06-05 17:23 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-05-17 16:08 NVMeoF: multipath stuck after bringing one ethernet port down Alex Turin
2017-05-17 17:28 ` Sagi Grimberg
2017-05-18  5:52   ` shahar.salzman
2017-05-23 17:31     ` shahar.salzman
2017-05-25 11:06       ` shahar.salzman
2017-05-25 12:27         ` shahar.salzman
2017-05-25 13:08           ` shahar.salzman
2017-05-30 12:14           ` Sagi Grimberg
2017-05-30 12:11         ` Sagi Grimberg
2017-05-30 12:05       ` Sagi Grimberg
2017-05-30 13:37         ` Max Gurtovoy
2017-05-30 14:17           ` Sagi Grimberg
2017-06-05  7:11             ` shahar.salzman
2017-06-05  8:14               ` Sagi Grimberg
2017-06-05  8:40             ` Christoph Hellwig
2017-06-05  8:53               ` Sagi Grimberg
2017-06-05 15:07                 ` Christoph Hellwig
2017-06-05 17:23                   ` Sagi Grimberg
