linux-kernel.vger.kernel.org archive mirror
* [PATCH] nvme-rdma: complete requests from ->timeout
@ 2018-11-29 23:59 Jaesoo Lee
  2018-11-30  1:30 ` Sagi Grimberg
  0 siblings, 1 reply; 13+ messages in thread
From: Jaesoo Lee @ 2018-11-29 23:59 UTC (permalink / raw)
  To: keith.busch, axboe, hch, sagi
  Cc: linux-nvme, linux-kernel, psajeepa, roland, ashishk, jalee

After f6e7d48 (block: remove BLK_EH_HANDLED), the low-level device driver
is responsible for completing the timed-out request, and a series of changes
were submitted so that the various LLDDs subsequently complete requests from
->timeout themselves. However, adding the completion code to the NVMe driver
was skipped because of the assumption made in the NVMe LLDD change below.
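
To illustrate that contract with a minimal, hypothetical LLDD sketch (the
foo_* names are invented and this is not code from any in-tree driver), a
driver returning BLK_EH_DONE must arrange the completion itself:

 static enum blk_eh_timer_return foo_timeout(struct request *rq, bool reserved)
 {
 	if (foo_abort_cmd(rq)) {
 		/* the driver owns the completion; blk-mq will not do it */
 		blk_mq_complete_request(rq);
 		return BLK_EH_DONE;
 	}
 	/* otherwise ask the block layer to re-arm the timer */
 	return BLK_EH_RESET_TIMER;
 }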

Commit message of db8c48e (nvme: return BLK_EH_DONE from ->timeout):
 NVMe always completes the request before returning from ->timeout, either
 by polling for it, or by disabling the controller.   Return BLK_EH_DONE so
 that the block layer doesn't even try to complete it again.

This does not hold, at least for the NVMe RDMA host driver. An example
scenario is when the RDMA connection is gone while the controller is being
deleted. In this case, the nvmf_reg_write32() issued by delete_work to send
the shutdown admin command can hang forever if the command is not completed
by the timeout handler.

Stack trace of hang looks like:
 kworker/u66:2   D    0 21098      2 0x80000080
 Workqueue: nvme-delete-wq nvme_delete_ctrl_work
 Call Trace:
 __schedule+0x2ab/0x880
 schedule+0x36/0x80
 schedule_timeout+0x161/0x300
 ? __next_timer_interrupt+0xe0/0xe0
 io_schedule_timeout+0x1e/0x50
 wait_for_completion_io_timeout+0x130/0x1a0
 ? wake_up_q+0x80/0x80
 blk_execute_rq+0x6e/0xa0
 __nvme_submit_sync_cmd+0x6e/0xe0
 nvmf_reg_write32+0x6c/0xc0 [nvme_fabrics]
 nvme_shutdown_ctrl+0x56/0x110
 nvme_rdma_shutdown_ctrl+0xf8/0x100 [nvme_rdma]
 nvme_rdma_delete_ctrl+0x1a/0x20 [nvme_rdma]
 nvme_delete_ctrl_work+0x66/0x90
 process_one_work+0x179/0x390
 worker_thread+0x1da/0x3e0
 kthread+0x105/0x140
 ? max_active_store+0x80/0x80
 ? kthread_bind+0x20/0x20
 ret_from_fork+0x35/0x40

Signed-off-by: Jaesoo Lee <jalee@purestorage.com>
---
 drivers/nvme/host/rdma.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index d181caf..25319b7 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -1688,6 +1688,7 @@ static int nvme_rdma_cm_handler(struct rdma_cm_id *cm_id,
 	/* fail with DNR on cmd timeout */
 	nvme_req(rq)->status = NVME_SC_ABORT_REQ | NVME_SC_DNR;
 
+	blk_mq_complete_request(rq);
 	return BLK_EH_DONE;
 }
 
-- 
1.9.1



* Re: [PATCH] nvme-rdma: complete requests from ->timeout
  2018-11-29 23:59 [PATCH] nvme-rdma: complete requests from ->timeout Jaesoo Lee
@ 2018-11-30  1:30 ` Sagi Grimberg
  2018-11-30  1:54   ` Jaesoo Lee
  2018-12-09 14:22   ` Nitzan Carmi
  0 siblings, 2 replies; 13+ messages in thread
From: Sagi Grimberg @ 2018-11-30  1:30 UTC (permalink / raw)
  To: Jaesoo Lee, keith.busch, axboe, hch
  Cc: linux-nvme, linux-kernel, psajeepa, roland, ashishk


> This does not hold at least for NVMe RDMA host driver. An example scenario
> is when the RDMA connection is gone while the controller is being deleted.
> In this case, the nvmf_reg_write32() for sending shutdown admin command by
> the delete_work could be hung forever if the command is not completed by
> the timeout handler.

If the queue is gone, this means that the queue has already been flushed and
any commands that were inflight have completed with a flush error
completion...

Can you describe the scenario that caused this hang? When did the
queue become "gone", and when did the shutdown command execute?


* Re: [PATCH] nvme-rdma: complete requests from ->timeout
  2018-11-30  1:30 ` Sagi Grimberg
@ 2018-11-30  1:54   ` Jaesoo Lee
  2018-12-07  0:18     ` Jaesoo Lee
  2018-12-09 14:22   ` Nitzan Carmi
  1 sibling, 1 reply; 13+ messages in thread
From: Jaesoo Lee @ 2018-11-30  1:54 UTC (permalink / raw)
  To: sagi
  Cc: keith.busch, axboe, hch, linux-nvme, linux-kernel,
	Prabhath Sajeepa, Roland Dreier, Ashish Karkare

Not the queue, but the RDMA connections.

Let me describe the scenario.

1. Connect to an nvme-rdma target with 500 namespaces
: this makes nvme_remove_namespaces() take a long time to complete and
opens the window that is vulnerable to this bug
2. The host takes the code path below for nvme_delete_ctrl_work and sends
a normal shutdown in nvme_shutdown_ctrl(); a sketch of the blocked step
follows this list
- nvme_stop_ctrl
  - nvme_stop_keep_alive --> keep-alive stopped
- nvme_remove_namespaces --> takes too long, over 10~15s
- nvme_rdma_shutdown_ctrl
  - nvme_rdma_teardown_io_queues
  - nvme_shutdown_ctrl
    - nvmf_reg_write32
      - __nvme_submit_sync_cmd --> nvme_delete_ctrl_work blocked here
  - nvme_rdma_teardown_admin_queue
- nvme_uninit_ctrl
- nvme_put_ctrl
3. The RDMA connection is disconnected by the nvme-rdma target
: in our case, this is triggered by the target-side timeout mechanism
: I did not try it, but I think this could also happen if we lost the RoCE link
4. The shutdown notification command times out and the work gets stuck,
leaving the controller in the NVME_CTRL_DELETING state
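
To make the blocked step concrete, here is a simplified sketch of the
synchronous path (paraphrased, field names approximate; not a verbatim copy
of the kernel code). The shutdown is a fabrics Property Set of the CC
register issued as a blocking sync command, so delete_work cannot make
progress until someone completes that request:

 /* roughly what nvmf_reg_write32() does for the CC shutdown write */
 static int reg_write32_sketch(struct nvme_ctrl *ctrl, u32 off, u32 val)
 {
 	struct nvme_command cmd = { };

 	cmd.prop_set.opcode = nvme_fabrics_command;
 	cmd.prop_set.fctype = nvme_fabrics_type_property_set;
 	cmd.prop_set.offset = cpu_to_le32(off);
 	cmd.prop_set.value = cpu_to_le64(val);

 	/*
 	 * Blocks in blk_execute_rq() until the request is completed; with
 	 * the RDMA connection gone and no error recovery running, nobody
 	 * ever completes it.
 	 */
 	return __nvme_submit_sync_cmd(ctrl->admin_q, &cmd, NULL, NULL, 0, 0,
 				      NVME_QID_ANY, 0, 0);
 }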

Thanks,

Jaesoo Lee.


On Thu, Nov 29, 2018 at 5:30 PM Sagi Grimberg <sagi@grimberg.me> wrote:
>
>
> > This does not hold at least for NVMe RDMA host driver. An example scenario
> > is when the RDMA connection is gone while the controller is being deleted.
> > In this case, the nvmf_reg_write32() for sending shutdown admin command by
> > the delete_work could be hung forever if the command is not completed by
> > the timeout handler.
>
> If the queue is gone, this means that the queue has already flushed and
> any commands that were inflight has completed with a flush error
> completion...
>
> Can you describe the scenario that caused this hang? When has the
> queue became "gone" and when did the shutdown command execute?


* Re: [PATCH] nvme-rdma: complete requests from ->timeout
  2018-11-30  1:54   ` Jaesoo Lee
@ 2018-12-07  0:18     ` Jaesoo Lee
  2018-12-07 20:05       ` Sagi Grimberg
  0 siblings, 1 reply; 13+ messages in thread
From: Jaesoo Lee @ 2018-12-07  0:18 UTC (permalink / raw)
  To: sagi
  Cc: keith.busch, axboe, hch, linux-nvme, linux-kernel,
	Prabhath Sajeepa, Roland Dreier, Ashish Karkare

Could you please take a look at this bug and review the code?

We are seeing more instances of this bug and found that reconnect_work
could hang as well, as can be seen from the stack trace below.

 Workqueue: nvme-wq nvme_rdma_reconnect_ctrl_work [nvme_rdma]
 Call Trace:
 __schedule+0x2ab/0x880
 schedule+0x36/0x80
 schedule_timeout+0x161/0x300
 ? __next_timer_interrupt+0xe0/0xe0
 io_schedule_timeout+0x1e/0x50
 wait_for_completion_io_timeout+0x130/0x1a0
 ? wake_up_q+0x80/0x80
 blk_execute_rq+0x6e/0xa0
 __nvme_submit_sync_cmd+0x6e/0xe0
 nvmf_connect_admin_queue+0x128/0x190 [nvme_fabrics]
 ? wait_for_completion_interruptible_timeout+0x157/0x1b0
 nvme_rdma_start_queue+0x5e/0x90 [nvme_rdma]
 nvme_rdma_setup_ctrl+0x1b4/0x730 [nvme_rdma]
 nvme_rdma_reconnect_ctrl_work+0x27/0x70 [nvme_rdma]
 process_one_work+0x179/0x390
 worker_thread+0x4f/0x3e0
 kthread+0x105/0x140
 ? max_active_store+0x80/0x80
 ? kthread_bind+0x20/0x20

This bug is reproduced for testing by setting the MTU of the RoCE
interface to 568 while running I/O traffic.

Thanks,

Jaesoo Lee.

On Thu, Nov 29, 2018 at 5:54 PM Jaesoo Lee <jalee@purestorage.com> wrote:
>
> Not the queue, but the RDMA connections.
>
> Let me describe the scenario.
>
> 1. connected nvme-rdma target with 500 namespaces
> : this will make the nvme_remove_namespaces() took a long time to
> complete and open the window vulnerable to this bug
> 2. host will take below code path for nvme_delete_ctrl_work and send
> normal shutdown in nvme_shutdown_ctrl()
> - nvme_stop_ctrl
>   - nvme_stop_keep_alive --> stopped keep alive
> - nvme_remove_namespaces --> took too long time, over 10~15s
> - nvme_rdma_shutdown_ctrl
>   - nvme_rdma_teardown_io_queues
>   - nvme_shutdown_ctrl
>     - nvmf_reg_write32
>       -__nvme_submit_sync_cmd --> nvme_delete_ctrl_work blocked here
>   - nvme_rdma_teardown_admin_queue
> - nvme_uninit_ctrl
> - nvme_put_ctrl
> 3. the rdma connection is disconnected by the nvme-rdma target
> : in our case, this is triggered by the target side timeout mechanism
> : I did not try, but I think this could happen if we lost the RoCE link, too.
> 4. the shutdown notification command timed out and the work stuck
> while leaving the controller in NVME_CTRL_DELETING state
>
> Thanks,
>
> Jaesoo Lee.
>
>
> On Thu, Nov 29, 2018 at 5:30 PM Sagi Grimberg <sagi@grimberg.me> wrote:
> >
> >
> > > This does not hold at least for NVMe RDMA host driver. An example scenario
> > > is when the RDMA connection is gone while the controller is being deleted.
> > > In this case, the nvmf_reg_write32() for sending shutdown admin command by
> > > the delete_work could be hung forever if the command is not completed by
> > > the timeout handler.
> >
> > If the queue is gone, this means that the queue has already flushed and
> > any commands that were inflight has completed with a flush error
> > completion...
> >
> > Can you describe the scenario that caused this hang? When has the
> > queue became "gone" and when did the shutdown command execute?


* Re: [PATCH] nvme-rdma: complete requests from ->timeout
  2018-12-07  0:18     ` Jaesoo Lee
@ 2018-12-07 20:05       ` Sagi Grimberg
  2018-12-08  2:02         ` Keith Busch
  0 siblings, 1 reply; 13+ messages in thread
From: Sagi Grimberg @ 2018-12-07 20:05 UTC (permalink / raw)
  To: Jaesoo Lee
  Cc: keith.busch, axboe, hch, linux-nvme, linux-kernel,
	Prabhath Sajeepa, Roland Dreier, Ashish Karkare


> Could you please take a look at this bug and code review?
> 
> We are seeing more instances of this bug and found that reconnect_work
> could hang as well, as can be seen from below stacktrace.
> 
>   Workqueue: nvme-wq nvme_rdma_reconnect_ctrl_work [nvme_rdma]
>   Call Trace:
>   __schedule+0x2ab/0x880
>   schedule+0x36/0x80
>   schedule_timeout+0x161/0x300
>   ? __next_timer_interrupt+0xe0/0xe0
>   io_schedule_timeout+0x1e/0x50
>   wait_for_completion_io_timeout+0x130/0x1a0
>   ? wake_up_q+0x80/0x80
>   blk_execute_rq+0x6e/0xa0
>   __nvme_submit_sync_cmd+0x6e/0xe0
>   nvmf_connect_admin_queue+0x128/0x190 [nvme_fabrics]
>   ? wait_for_completion_interruptible_timeout+0x157/0x1b0
>   nvme_rdma_start_queue+0x5e/0x90 [nvme_rdma]
>   nvme_rdma_setup_ctrl+0x1b4/0x730 [nvme_rdma]
>   nvme_rdma_reconnect_ctrl_work+0x27/0x70 [nvme_rdma]
>   process_one_work+0x179/0x390
>   worker_thread+0x4f/0x3e0
>   kthread+0x105/0x140
>   ? max_active_store+0x80/0x80
>   ? kthread_bind+0x20/0x20
> 
> This bug is produced by setting MTU of RoCE interface to '568' for
> test while running I/O traffics.

I think that with the latest changes from Keith we can no longer rely
on blk-mq to barrier racing completions. We will probably need
to barrier ourselves in nvme-rdma...
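
One possible shape of such a barrier, as a rough and untested sketch only
(it assumes struct nvme_rdma_request grows a new 'unsigned long flags'
member and an invented NVME_RDMA_REQ_COMPLETED bit), would be to let
whichever path gets there first claim the completion:

 #define NVME_RDMA_REQ_COMPLETED	0	/* invented flag bit */

 static bool nvme_rdma_claim_completion(struct nvme_rdma_request *req)
 {
 	/* true only for the first caller; the loser must not end the rq */
 	return !test_and_set_bit(NVME_RDMA_REQ_COMPLETED, &req->flags);
 }

 static enum blk_eh_timer_return
 nvme_rdma_timeout_sketch(struct request *rq, bool reserved)
 {
 	struct nvme_rdma_request *req = blk_mq_rq_to_pdu(rq);

 	if (nvme_rdma_claim_completion(req)) {
 		nvme_req(rq)->status = NVME_SC_ABORT_REQ | NVME_SC_DNR;
 		blk_mq_complete_request(rq);
 	}
 	return BLK_EH_DONE;
 }

The normal completion path would have to make the same claim before calling
nvme_end_request(), otherwise the race remains.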


* Re: [PATCH] nvme-rdma: complete requests from ->timeout
  2018-12-07 20:05       ` Sagi Grimberg
@ 2018-12-08  2:02         ` Keith Busch
  2018-12-08  6:28           ` Jaesoo Lee
  0 siblings, 1 reply; 13+ messages in thread
From: Keith Busch @ 2018-12-08  2:02 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Jaesoo Lee, axboe, hch, linux-nvme, linux-kernel,
	Prabhath Sajeepa, Roland Dreier, Ashish Karkare

On Fri, Dec 07, 2018 at 12:05:37PM -0800, Sagi Grimberg wrote:
> 
> > Could you please take a look at this bug and code review?
> > 
> > We are seeing more instances of this bug and found that reconnect_work
> > could hang as well, as can be seen from below stacktrace.
> > 
> >   Workqueue: nvme-wq nvme_rdma_reconnect_ctrl_work [nvme_rdma]
> >   Call Trace:
> >   __schedule+0x2ab/0x880
> >   schedule+0x36/0x80
> >   schedule_timeout+0x161/0x300
> >   ? __next_timer_interrupt+0xe0/0xe0
> >   io_schedule_timeout+0x1e/0x50
> >   wait_for_completion_io_timeout+0x130/0x1a0
> >   ? wake_up_q+0x80/0x80
> >   blk_execute_rq+0x6e/0xa0
> >   __nvme_submit_sync_cmd+0x6e/0xe0
> >   nvmf_connect_admin_queue+0x128/0x190 [nvme_fabrics]
> >   ? wait_for_completion_interruptible_timeout+0x157/0x1b0
> >   nvme_rdma_start_queue+0x5e/0x90 [nvme_rdma]
> >   nvme_rdma_setup_ctrl+0x1b4/0x730 [nvme_rdma]
> >   nvme_rdma_reconnect_ctrl_work+0x27/0x70 [nvme_rdma]
> >   process_one_work+0x179/0x390
> >   worker_thread+0x4f/0x3e0
> >   kthread+0x105/0x140
> >   ? max_active_store+0x80/0x80
> >   ? kthread_bind+0x20/0x20
> > 
> > This bug is produced by setting MTU of RoCE interface to '568' for
> > test while running I/O traffics.
> 
> I think that with the latest changes from Keith we can no longer rely
> on blk-mq to barrier racing completions. We will probably need
> to barrier ourselves in nvme-rdma...

You really need to do that anyway. If you were relying on blk-mq to save
you from a double completion, where the nvme driver ends a request while
the lower half can still complete the same one, the only thing preventing
data corruption is the probability that the request wasn't reallocated for
a new command.
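
As a hypothetical timeline of that hazard (illustration only, not observed
driver code):

 /*
  *   CPU0: ->timeout                      CPU1: late RDMA completion
  *   ----------------------------         ----------------------------
  *   blk_mq_complete_request(rq);
  *   rq finishes, its tag is reused
  *   for a brand new command
  *                                         nvme_end_request(rq, ...);
  *                                         completes the *new* command
  *                                         with stale status/data
  */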


* Re: [PATCH] nvme-rdma: complete requests from ->timeout
  2018-12-08  2:02         ` Keith Busch
@ 2018-12-08  6:28           ` Jaesoo Lee
  0 siblings, 0 replies; 13+ messages in thread
From: Jaesoo Lee @ 2018-12-08  6:28 UTC (permalink / raw)
  To: keith.busch
  Cc: sagi, axboe, hch, linux-nvme, linux-kernel, Prabhath Sajeepa,
	Roland Dreier, Ashish Karkare

Now I see that my patch is not safe and can cause double completions.
However, I am having a hard time finding a good solution to barrier the
racing completions.

Could you suggest where the fix should go and what it should look like?
We can provide more details on reproducing this issue if that helps.

On Fri, Dec 7, 2018 at 6:04 PM Keith Busch <keith.busch@intel.com> wrote:
>
> On Fri, Dec 07, 2018 at 12:05:37PM -0800, Sagi Grimberg wrote:
> >
> > > Could you please take a look at this bug and code review?
> > >
> > > We are seeing more instances of this bug and found that reconnect_work
> > > could hang as well, as can be seen from below stacktrace.
> > >
> > >   Workqueue: nvme-wq nvme_rdma_reconnect_ctrl_work [nvme_rdma]
> > >   Call Trace:
> > >   __schedule+0x2ab/0x880
> > >   schedule+0x36/0x80
> > >   schedule_timeout+0x161/0x300
> > >   ? __next_timer_interrupt+0xe0/0xe0
> > >   io_schedule_timeout+0x1e/0x50
> > >   wait_for_completion_io_timeout+0x130/0x1a0
> > >   ? wake_up_q+0x80/0x80
> > >   blk_execute_rq+0x6e/0xa0
> > >   __nvme_submit_sync_cmd+0x6e/0xe0
> > >   nvmf_connect_admin_queue+0x128/0x190 [nvme_fabrics]
> > >   ? wait_for_completion_interruptible_timeout+0x157/0x1b0
> > >   nvme_rdma_start_queue+0x5e/0x90 [nvme_rdma]
> > >   nvme_rdma_setup_ctrl+0x1b4/0x730 [nvme_rdma]
> > >   nvme_rdma_reconnect_ctrl_work+0x27/0x70 [nvme_rdma]
> > >   process_one_work+0x179/0x390
> > >   worker_thread+0x4f/0x3e0
> > >   kthread+0x105/0x140
> > >   ? max_active_store+0x80/0x80
> > >   ? kthread_bind+0x20/0x20
> > >
> > > This bug is produced by setting MTU of RoCE interface to '568' for
> > > test while running I/O traffics.
> >
> > I think that with the latest changes from Keith we can no longer rely
> > on blk-mq to barrier racing completions. We will probably need
> > to barrier ourselves in nvme-rdma...
>
> You really need to do that anyway. If you were relying on blk-mq to save
> you from double completions by ending a request in the nvme driver while
> the lower half can still complete the same one, the only thing preventing
> data corruption is the probability the request wasn't reallocated for a
> new command.


* Re: [PATCH] nvme-rdma: complete requests from ->timeout
  2018-11-30  1:30 ` Sagi Grimberg
  2018-11-30  1:54   ` Jaesoo Lee
@ 2018-12-09 14:22   ` Nitzan Carmi
  2018-12-10 23:40     ` Jaesoo Lee
  1 sibling, 1 reply; 13+ messages in thread
From: Nitzan Carmi @ 2018-12-09 14:22 UTC (permalink / raw)
  To: Sagi Grimberg, Jaesoo Lee, keith.busch, axboe, hch
  Cc: roland, psajeepa, ashishk, linux-kernel, linux-nvme

[-- Attachment #1: Type: text/plain, Size: 1305 bytes --]

Hi,
We encountered a similar issue.
I think that the problem is that error_recovery might not even be
queued if we're in the DELETING state (or the CONNECTING state, for that
matter), because we cannot move from those states to RESETTING.
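
For context, the relevant pre-patch helper (see also the context lines in
the attached patch below) bails out silently when that state transition is
rejected, so nobody ever completes the hung command:

 static void nvme_rdma_error_recovery(struct nvme_rdma_ctrl *ctrl)
 {
 	/* rejected while the controller is DELETING or CONNECTING ... */
 	if (!nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_RESETTING))
 		return;	/* ... so no error recovery work is queued */

 	queue_work(nvme_wq, &ctrl->err_work);
 }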

We prepared some patches which handle completions in case such a scenario
happens (which, in fact, might happen in numerous error flows).

Do they solve your problem?
Nitzan.


On 30/11/2018 03:30, Sagi Grimberg wrote:
> 
>> This does not hold at least for NVMe RDMA host driver. An example 
>> scenario
>> is when the RDMA connection is gone while the controller is being 
>> deleted.
>> In this case, the nvmf_reg_write32() for sending shutdown admin 
>> command by
>> the delete_work could be hung forever if the command is not completed by
>> the timeout handler.
> 
> If the queue is gone, this means that the queue has already flushed and
> any commands that were inflight has completed with a flush error
> completion...
> 
> Can you describe the scenario that caused this hang? When has the
> queue became "gone" and when did the shutdown command execute?
> 
> _______________________________________________
> Linux-nvme mailing list
> Linux-nvme@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-nvme

[-- Attachment #2: 0001-nvme-Introduce-nvme_is_aen_req-function.patch --]
[-- Type: text/plain, Size: 2333 bytes --]

From b71e75169d6cf2956cc0930f94c668282ed65596 Mon Sep 17 00:00:00 2001
From: Israel Rukshin <israelr@mellanox.com>
Date: Sun, 11 Nov 2018 15:35:10 +0000
Subject: [PATCH 1/2] nvme: Introduce nvme_is_aen_req function

Signed-off-by: Israel Rukshin <israelr@mellanox.com>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
---
 drivers/nvme/host/nvme.h | 5 +++++
 drivers/nvme/host/pci.c  | 3 +--
 drivers/nvme/host/rdma.c | 4 ++--
 3 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 081cbdcce880..8330f41991ad 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -407,6 +407,11 @@ static inline void nvme_put_ctrl(struct nvme_ctrl *ctrl)
 	put_device(ctrl->device);
 }
 
+static inline bool nvme_is_aen_req(u16 qid, __u16 command_id)
+{
+	return (!qid && command_id >= NVME_AQ_BLK_MQ_DEPTH);
+}
+
 void nvme_complete_rq(struct request *req);
 void nvme_cancel_request(struct request *req, void *data, bool reserved);
 bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl,
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index c33bb201b884..04af49f2643f 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -891,8 +891,7 @@ static inline void nvme_handle_cqe(struct nvme_queue *nvmeq, u16 idx)
 	 * aborts.  We don't even bother to allocate a struct request
 	 * for them but rather special case them here.
 	 */
-	if (unlikely(nvmeq->qid == 0 &&
-			cqe->command_id >= NVME_AQ_BLK_MQ_DEPTH)) {
+	if (unlikely(nvme_is_aen_req(nvmeq->qid, cqe->command_id))) {
 		nvme_complete_async_event(&nvmeq->dev->ctrl,
 				cqe->status, &cqe->result);
 		return;
diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index ab6ec7295bf9..a9eed697505e 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -1481,8 +1481,8 @@ static int __nvme_rdma_recv_done(struct ib_cq *cq, struct ib_wc *wc, int tag)
 	 * aborts.  We don't even bother to allocate a struct request
 	 * for them but rather special case them here.
 	 */
-	if (unlikely(nvme_rdma_queue_idx(queue) == 0 &&
-			cqe->command_id >= NVME_AQ_BLK_MQ_DEPTH))
+	if (unlikely(nvme_is_aen_req(nvme_rdma_queue_idx(queue),
+				     cqe->command_id)))
 		nvme_complete_async_event(&queue->ctrl->ctrl, cqe->status,
 				&cqe->result);
 	else
-- 
2.14.3


[-- Attachment #3: 0002-nvme-rdma-Handle-completions-if-error_recovery-fails.patch --]
[-- Type: text/plain, Size: 7166 bytes --]

From 6ede55b82e7ee1b43b380ebdf48c3e1170e7f303 Mon Sep 17 00:00:00 2001
From: Nitzan Carmi <nitzanc@mellanox.com>
Date: Sun, 26 Aug 2018 15:31:55 +0000
Subject: [PATCH 2/2] nvme-rdma: Handle completions if error_recovery fails

Error recovery might not be executed, in case ctrl.state
cannot be moved to RESETTING state. This might happen in:
 - state CONNECTING - during nvmf_connect command failure.
 - state DELETING - during nvmf_set_features command failure
                    (in nvme_shutdown_ctrl).

We need to complete these hanging requests with DNR error.

Signed-off-by: Nitzan Carmi <nitzanc@mellanox.com>
Signed-off-by: Israel Rukshin <israelr@mellanox.com>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
---
 drivers/nvme/host/rdma.c | 101 +++++++++++++++++++++++++++++++++++------------
 1 file changed, 76 insertions(+), 25 deletions(-)

diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index a9eed697505e..430dc5e6c77e 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -1055,32 +1055,52 @@ static void nvme_rdma_error_recovery_work(struct work_struct *work)
 	nvme_rdma_reconnect_or_remove(ctrl);
 }
 
-static void nvme_rdma_error_recovery(struct nvme_rdma_ctrl *ctrl)
+static int nvme_rdma_error_recovery(struct nvme_rdma_ctrl *ctrl)
 {
-	if (!nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_RESETTING))
-		return;
+	if (!nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_RESETTING)) {
+		if (ctrl->ctrl.state == NVME_CTRL_RESETTING)
+			return 0;
+		else
+			return -EAGAIN;
+	}
 
 	queue_work(nvme_wq, &ctrl->err_work);
+	return 0;
 }
 
-static void nvme_rdma_wr_error(struct ib_cq *cq, struct ib_wc *wc,
+static int nvme_rdma_wr_error(struct nvme_rdma_ctrl *ctrl, struct ib_wc *wc,
 		const char *op)
 {
-	struct nvme_rdma_queue *queue = cq->cq_context;
-	struct nvme_rdma_ctrl *ctrl = queue->ctrl;
-
 	if (ctrl->ctrl.state == NVME_CTRL_LIVE)
 		dev_info(ctrl->ctrl.device,
 			     "%s for CQE 0x%p failed with status %s (%d)\n",
 			     op, wc->wr_cqe,
 			     ib_wc_status_msg(wc->status), wc->status);
-	nvme_rdma_error_recovery(ctrl);
+	return nvme_rdma_error_recovery(ctrl);
+}
+
+static void nvme_rdma_req_error(struct nvme_rdma_request *req, struct ib_wc *wc,
+		const char *op)
+{
+	struct nvme_rdma_queue *queue = req->queue;
+	struct nvme_rdma_ctrl *ctrl = queue->ctrl;
+
+	if (nvme_rdma_wr_error(ctrl, wc, op) &&
+	    wc->status != IB_WC_WR_FLUSH_ERR) {
+		req->status = cpu_to_le16((NVME_SC_ABORT_REQ |
+					   NVME_SC_DNR) << 1);
+		nvme_end_request(blk_mq_rq_from_pdu(req), req->status,
+				 req->result);
+	}
 }
 
 static void nvme_rdma_memreg_done(struct ib_cq *cq, struct ib_wc *wc)
 {
+	struct nvme_rdma_request *req =
+		container_of(wc->wr_cqe, struct nvme_rdma_request, reg_cqe);
+
 	if (unlikely(wc->status != IB_WC_SUCCESS))
-		nvme_rdma_wr_error(cq, wc, "MEMREG");
+		nvme_rdma_req_error(req, wc, "MEMREG");
 }
 
 static void nvme_rdma_inv_rkey_done(struct ib_cq *cq, struct ib_wc *wc)
@@ -1090,7 +1110,7 @@ static void nvme_rdma_inv_rkey_done(struct ib_cq *cq, struct ib_wc *wc)
 	struct request *rq = blk_mq_rq_from_pdu(req);
 
 	if (unlikely(wc->status != IB_WC_SUCCESS)) {
-		nvme_rdma_wr_error(cq, wc, "LOCAL_INV");
+		nvme_rdma_req_error(req, wc, "LOCAL_INV");
 		return;
 	}
 
@@ -1304,7 +1324,7 @@ static void nvme_rdma_send_done(struct ib_cq *cq, struct ib_wc *wc)
 	struct request *rq = blk_mq_rq_from_pdu(req);
 
 	if (unlikely(wc->status != IB_WC_SUCCESS)) {
-		nvme_rdma_wr_error(cq, wc, "SEND");
+		nvme_rdma_req_error(req, wc, "SEND");
 		return;
 	}
 
@@ -1380,8 +1400,10 @@ static struct blk_mq_tags *nvme_rdma_tagset(struct nvme_rdma_queue *queue)
 
 static void nvme_rdma_async_done(struct ib_cq *cq, struct ib_wc *wc)
 {
+	struct nvme_rdma_queue *queue = cq->cq_context;
+
 	if (unlikely(wc->status != IB_WC_SUCCESS))
-		nvme_rdma_wr_error(cq, wc, "ASYNC");
+		nvme_rdma_wr_error(queue->ctrl, wc, "ASYNC");
 }
 
 static void nvme_rdma_submit_async_event(struct nvme_ctrl *arg)
@@ -1411,26 +1433,39 @@ static void nvme_rdma_submit_async_event(struct nvme_ctrl *arg)
 	WARN_ON_ONCE(ret);
 }
 
-static int nvme_rdma_process_nvme_rsp(struct nvme_rdma_queue *queue,
-		struct nvme_completion *cqe, struct ib_wc *wc, int tag)
+static struct nvme_rdma_request *nvme_rdma_comp_to_req(
+		struct nvme_rdma_queue *queue, struct nvme_completion *cqe)
 {
 	struct request *rq;
 	struct nvme_rdma_request *req;
-	int ret = 0;
 
 	rq = blk_mq_tag_to_rq(nvme_rdma_tagset(queue), cqe->command_id);
 	if (!rq) {
 		dev_err(queue->ctrl->ctrl.device,
 			"tag 0x%x on QP %#x not found\n",
 			cqe->command_id, queue->qp->qp_num);
-		nvme_rdma_error_recovery(queue->ctrl);
-		return ret;
+		return NULL;
 	}
 	req = blk_mq_rq_to_pdu(rq);
-
 	req->status = cqe->status;
 	req->result = cqe->result;
 
+	return req;
+}
+
+static int nvme_rdma_process_nvme_rsp(struct nvme_rdma_queue *queue,
+		struct nvme_completion *cqe, struct ib_wc *wc, int tag)
+{
+	struct request *rq;
+	struct nvme_rdma_request *req;
+	int ret = 0;
+
+	req = nvme_rdma_comp_to_req(queue, cqe);
+	if (!req) {
+		nvme_rdma_error_recovery(queue->ctrl);
+		return ret;
+	}
+
 	if (wc->wc_flags & IB_WC_WITH_INVALIDATE) {
 		if (unlikely(wc->ex.invalidate_rkey != req->mr->rkey)) {
 			dev_err(queue->ctrl->ctrl.device,
@@ -1451,6 +1486,7 @@ static int nvme_rdma_process_nvme_rsp(struct nvme_rdma_queue *queue,
 	}
 
 	if (refcount_dec_and_test(&req->ref)) {
+		rq = blk_mq_rq_from_pdu(req);
 		if (rq->tag == tag)
 			ret = 1;
 		nvme_end_request(rq, req->status, req->result);
@@ -1466,11 +1502,21 @@ static int __nvme_rdma_recv_done(struct ib_cq *cq, struct ib_wc *wc, int tag)
 	struct nvme_rdma_queue *queue = cq->cq_context;
 	struct ib_device *ibdev = queue->device->dev;
 	struct nvme_completion *cqe = qe->data;
+	struct nvme_rdma_request *req;
 	const size_t len = sizeof(struct nvme_completion);
 	int ret = 0;
 
 	if (unlikely(wc->status != IB_WC_SUCCESS)) {
-		nvme_rdma_wr_error(cq, wc, "RECV");
+		if (!nvme_is_aen_req(nvme_rdma_queue_idx(queue),
+				     cqe->command_id) &&
+		    wc->status != IB_WC_WR_FLUSH_ERR) {
+			req = nvme_rdma_comp_to_req(queue, cqe);
+			if (req) {
+				nvme_rdma_req_error(req, wc, "RECV");
+				return 0;
+			}
+		}
+		nvme_rdma_wr_error(queue->ctrl, wc, "RECV");
 		return 0;
 	}
 
@@ -1679,16 +1725,21 @@ static enum blk_eh_timer_return
 nvme_rdma_timeout(struct request *rq, bool reserved)
 {
 	struct nvme_rdma_request *req = blk_mq_rq_to_pdu(rq);
+	struct nvme_rdma_ctrl *ctrl = req->queue->ctrl;
 
-	dev_warn(req->queue->ctrl->ctrl.device,
+	dev_warn(ctrl->ctrl.device,
 		 "I/O %d QID %d timeout, reset controller\n",
 		 rq->tag, nvme_rdma_queue_idx(req->queue));
 
-	/* queue error recovery */
-	nvme_rdma_error_recovery(req->queue->ctrl);
-
-	/* fail with DNR on cmd timeout */
-	nvme_req(rq)->status = NVME_SC_ABORT_REQ | NVME_SC_DNR;
+	/*
+	 * if error recovery succeed it will end the request,
+	 * otherwise we have to manually end it
+	 */
+	if (nvme_rdma_error_recovery(ctrl)) {
+		req->status = cpu_to_le16((NVME_SC_ABORT_REQ |
+					   NVME_SC_DNR) << 1);
+		nvme_end_request(rq, req->status, req->result);
+	}
 
 	return BLK_EH_DONE;
 }
-- 
2.14.3



* Re: [PATCH] nvme-rdma: complete requests from ->timeout
  2018-12-09 14:22   ` Nitzan Carmi
@ 2018-12-10 23:40     ` Jaesoo Lee
  2018-12-11  9:14       ` Nitzan Carmi
  0 siblings, 1 reply; 13+ messages in thread
From: Jaesoo Lee @ 2018-12-10 23:40 UTC (permalink / raw)
  To: nitzanc
  Cc: sagi, keith.busch, axboe, hch, Roland Dreier, Prabhath Sajeepa,
	Ashish Karkare, linux-kernel, linux-nvme

It seems that your patch addresses the same bug. I will see if it works
for our failure scenarios.

Why don't you send it upstream?

On Sun, Dec 9, 2018 at 6:22 AM Nitzan Carmi <nitzanc@mellanox.com> wrote:
>
> Hi,
> We encountered similar issue.
> I think that the problem is that error_recovery might not even be
> queued, in case we're in DELETING state (or CONNECTING state, for that
> matter), because we cannot move from those states to RESETTING.
>
> We prepared some patches which handle completions in case such scenario
> happens (which, in fact, might happen in numerous error flows).
>
> Does it solve your problem?
> Nitzan.
>
>
> On 30/11/2018 03:30, Sagi Grimberg wrote:
> >
> >> This does not hold at least for NVMe RDMA host driver. An example
> >> scenario
> >> is when the RDMA connection is gone while the controller is being
> >> deleted.
> >> In this case, the nvmf_reg_write32() for sending shutdown admin
> >> command by
> >> the delete_work could be hung forever if the command is not completed by
> >> the timeout handler.
> >
> > If the queue is gone, this means that the queue has already flushed and
> > any commands that were inflight has completed with a flush error
> > completion...
> >
> > Can you describe the scenario that caused this hang? When has the
> > queue became "gone" and when did the shutdown command execute?
> >
> > _______________________________________________
> > Linux-nvme mailing list
> > Linux-nvme@lists.infradead.org
> > http://lists.infradead.org/mailman/listinfo/linux-nvme


* Re: [PATCH] nvme-rdma: complete requests from ->timeout
  2018-12-10 23:40     ` Jaesoo Lee
@ 2018-12-11  9:14       ` Nitzan Carmi
  2018-12-11 23:16         ` Jaesoo Lee
  0 siblings, 1 reply; 13+ messages in thread
From: Nitzan Carmi @ 2018-12-11  9:14 UTC (permalink / raw)
  To: Jaesoo Lee
  Cc: sagi, keith.busch, axboe, hch, Roland Dreier, Prabhath Sajeepa,
	Ashish Karkare, linux-kernel, linux-nvme

I was just in the middle of sending this upstream when I saw your mail,
and I also thought that it addresses the same bug, although I see a
slightly different call trace than yours.

I would be happy if you could verify that this patch works for you too,
so that we can push it upstream.

On 11/12/2018 01:40, Jaesoo Lee wrote:
> It seems that your patch is addressing the same bug. I will see if
> that works for our failure scenarios.
> 
> Why don't you make it upstream?
> 
> On Sun, Dec 9, 2018 at 6:22 AM Nitzan Carmi <nitzanc@mellanox.com> wrote:
>>
>> Hi,
>> We encountered similar issue.
>> I think that the problem is that error_recovery might not even be
>> queued, in case we're in DELETING state (or CONNECTING state, for that
>> matter), because we cannot move from those states to RESETTING.
>>
>> We prepared some patches which handle completions in case such scenario
>> happens (which, in fact, might happen in numerous error flows).
>>
>> Does it solve your problem?
>> Nitzan.
>>
>>
>> On 30/11/2018 03:30, Sagi Grimberg wrote:
>>>
>>>> This does not hold at least for NVMe RDMA host driver. An example
>>>> scenario
>>>> is when the RDMA connection is gone while the controller is being
>>>> deleted.
>>>> In this case, the nvmf_reg_write32() for sending shutdown admin
>>>> command by
>>>> the delete_work could be hung forever if the command is not completed by
>>>> the timeout handler.
>>>
>>> If the queue is gone, this means that the queue has already flushed and
>>> any commands that were inflight has completed with a flush error
>>> completion...
>>>
>>> Can you describe the scenario that caused this hang? When has the
>>> queue became "gone" and when did the shutdown command execute?
>>>
>>> _______________________________________________
>>> Linux-nvme mailing list
>>> Linux-nvme@lists.infradead.org
>>> http://lists.infradead.org/mailman/listinfo/linux-nvme


* Re: [PATCH] nvme-rdma: complete requests from ->timeout
  2018-12-11  9:14       ` Nitzan Carmi
@ 2018-12-11 23:16         ` Jaesoo Lee
  2018-12-11 23:38           ` Sagi Grimberg
  0 siblings, 1 reply; 13+ messages in thread
From: Jaesoo Lee @ 2018-12-11 23:16 UTC (permalink / raw)
  To: nitzanc
  Cc: sagi, keith.busch, axboe, hch, Roland Dreier, Prabhath Sajeepa,
	Ashish Karkare, linux-kernel, linux-nvme

I cannot reproduce the bug with the patch; in my failure scenarios, it
seems that completing the request on errors in nvme_rdma_send_done
unblocks __nvme_submit_sync_cmd. Also, I think this is safe from double
completions.

However, it seems that the nvme_rdma_timeout code is still not free from
the double-completion problem. So it looks promising to me if you could
separate out the nvme_rdma_wr_error handling code as a new patch.
On Tue, Dec 11, 2018 at 1:14 AM Nitzan Carmi <nitzanc@mellanox.com> wrote:
>
> I was just in the middle of sending this to upstream when I saw your
> mail, and thought too that it addresses the same bug, although I see a
> little different call trace than yours.
>
> I would be happy if you can verify that this patch works for you too,
> and we can push it to upstream.
>
> On 11/12/2018 01:40, Jaesoo Lee wrote:
> > It seems that your patch is addressing the same bug. I will see if
> > that works for our failure scenarios.
> >
> > Why don't you make it upstream?
> >
> > On Sun, Dec 9, 2018 at 6:22 AM Nitzan Carmi <nitzanc@mellanox.com> wrote:
> >>
> >> Hi,
> >> We encountered similar issue.
> >> I think that the problem is that error_recovery might not even be
> >> queued, in case we're in DELETING state (or CONNECTING state, for that
> >> matter), because we cannot move from those states to RESETTING.
> >>
> >> We prepared some patches which handle completions in case such scenario
> >> happens (which, in fact, might happen in numerous error flows).
> >>
> >> Does it solve your problem?
> >> Nitzan.
> >>
> >>
> >> On 30/11/2018 03:30, Sagi Grimberg wrote:
> >>>
> >>>> This does not hold at least for NVMe RDMA host driver. An example
> >>>> scenario
> >>>> is when the RDMA connection is gone while the controller is being
> >>>> deleted.
> >>>> In this case, the nvmf_reg_write32() for sending shutdown admin
> >>>> command by
> >>>> the delete_work could be hung forever if the command is not completed by
> >>>> the timeout handler.
> >>>
> >>> If the queue is gone, this means that the queue has already flushed and
> >>> any commands that were inflight has completed with a flush error
> >>> completion...
> >>>
> >>> Can you describe the scenario that caused this hang? When has the
> >>> queue became "gone" and when did the shutdown command execute?
> >>>
> >>> _______________________________________________
> >>> Linux-nvme mailing list
> >>> Linux-nvme@lists.infradead.org
> >>> http://lists.infradead.org/mailman/listinfo/linux-nvme


* Re: [PATCH] nvme-rdma: complete requests from ->timeout
  2018-12-11 23:16         ` Jaesoo Lee
@ 2018-12-11 23:38           ` Sagi Grimberg
  2018-12-12  1:31             ` Jaesoo Lee
  0 siblings, 1 reply; 13+ messages in thread
From: Sagi Grimberg @ 2018-12-11 23:38 UTC (permalink / raw)
  To: Jaesoo Lee, nitzanc
  Cc: keith.busch, axboe, hch, Roland Dreier, Prabhath Sajeepa,
	Ashish Karkare, linux-kernel, linux-nvme

> I cannot reproduce the bug with the patch; in my failure scenarios, it
> seems that completing the request on errors in nvme_rdma_send_done
> makes __nvme_submit_sync_cmd to be unblocked. Also, I think this is
> safe from the double completions.
> 
> However, it seems that nvme_rdma_timeout code is still not free from
> the double completion problem. So, it looks promising to me if you
> could separate out the nvme_rdma_wr_error handling code as a new
> patch.

Guys, can you please send proper patches so we can review properly?


* Re: [PATCH] nvme-rdma: complete requests from ->timeout
  2018-12-11 23:38           ` Sagi Grimberg
@ 2018-12-12  1:31             ` Jaesoo Lee
  0 siblings, 0 replies; 13+ messages in thread
From: Jaesoo Lee @ 2018-12-12  1:31 UTC (permalink / raw)
  To: sagi
  Cc: nitzanc, keith.busch, axboe, hch, Roland Dreier,
	Prabhath Sajeepa, Ashish Karkare, linux-kernel, linux-nvme

Please drop this patch. However, I would be happy if this bug could be
fixed as soon as possible.

Nitzan, would you mind sending your patch for review?
On Tue, Dec 11, 2018 at 3:39 PM Sagi Grimberg <sagi@grimberg.me> wrote:
>
> > I cannot reproduce the bug with the patch; in my failure scenarios, it
> > seems that completing the request on errors in nvme_rdma_send_done
> > makes __nvme_submit_sync_cmd to be unblocked. Also, I think this is
> > safe from the double completions.
> >
> > However, it seems that nvme_rdma_timeout code is still not free from
> > the double completion problem. So, it looks promising to me if you
> > could separate out the nvme_rdma_wr_error handling code as a new
> > patch.
>
> Guys, can you please send proper patches so we can review properly?


Thread overview: 13 messages
2018-11-29 23:59 [PATCH] nvme-rdma: complete requests from ->timeout Jaesoo Lee
2018-11-30  1:30 ` Sagi Grimberg
2018-11-30  1:54   ` Jaesoo Lee
2018-12-07  0:18     ` Jaesoo Lee
2018-12-07 20:05       ` Sagi Grimberg
2018-12-08  2:02         ` Keith Busch
2018-12-08  6:28           ` Jaesoo Lee
2018-12-09 14:22   ` Nitzan Carmi
2018-12-10 23:40     ` Jaesoo Lee
2018-12-11  9:14       ` Nitzan Carmi
2018-12-11 23:16         ` Jaesoo Lee
2018-12-11 23:38           ` Sagi Grimberg
2018-12-12  1:31             ` Jaesoo Lee
