* [PATCH] nvme-rdma: fix timeout handler
@ 2018-12-13 21:29 Sagi Grimberg
  2018-12-13 21:35 ` Sagi Grimberg
                   ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread
From: Sagi Grimberg @ 2018-12-13 21:29 UTC (permalink / raw)


Currently, we have several problems with the timeout
handler:
1. If we time out during the controller establishment flow, we will
hang because we don't execute the error recovery (and we shouldn't,
because the create_ctrl flow needs to fail and clean up on its own).
2. We might also hang if we get a disconnect on a queue while the
controller is already deleting. This racy flow can cause the
controller disable/shutdown admin command to hang.

We cannot complete a timed out request from the timeout handler
without mutual exclusion from the teardown flow (e.g.
nvme_rdma_error_recovery_work). So we serialize it in the
timeout handler and stop the queue to guarantee that no
one races with us to complete the request.

Reported-by: Jaesoo Lee <jalee at purestorage.com>
Signed-off-by: Sagi Grimberg <sagi at grimberg.me>
---
 drivers/nvme/host/rdma.c | 33 +++++++++++++++++++++++----------
 1 file changed, 23 insertions(+), 10 deletions(-)

diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 80b3113b45fb..af8f4cedabc1 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -1080,12 +1080,13 @@ static void nvme_rdma_error_recovery_work(struct work_struct *work)
 	nvme_rdma_reconnect_or_remove(ctrl);
 }
 
-static void nvme_rdma_error_recovery(struct nvme_rdma_ctrl *ctrl)
+static int nvme_rdma_error_recovery(struct nvme_rdma_ctrl *ctrl)
 {
 	if (!nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_RESETTING))
-		return;
+		return -EBUSY;
 
 	queue_work(nvme_wq, &ctrl->err_work);
+	return 0;
 }
 
 static void nvme_rdma_wr_error(struct ib_cq *cq, struct ib_wc *wc,
@@ -1693,18 +1694,30 @@ static enum blk_eh_timer_return
 nvme_rdma_timeout(struct request *rq, bool reserved)
 {
 	struct nvme_rdma_request *req = blk_mq_rq_to_pdu(rq);
+	struct nvme_rdma_queue *queue = req->queue;
+	struct nvme_rdma_ctrl *ctrl = queue->ctrl;
+
+	dev_warn(ctrl->ctrl.device, "I/O %d QID %d timeout\n",
+		 rq->tag, nvme_rdma_queue_idx(queue));
 
-	dev_warn(req->queue->ctrl->ctrl.device,
-		 "I/O %d QID %d timeout, reset controller\n",
-		 rq->tag, nvme_rdma_queue_idx(req->queue));
+	if (nvme_rdma_error_recovery(req->queue->ctrl)) {
+		union nvme_result res = {};
 
-	/* queue error recovery */
-	nvme_rdma_error_recovery(req->queue->ctrl);
+		nvme_rdma_stop_queue(queue);
+		flush_work(&ctrl->err_work);
 
-	/* fail with DNR on cmd timeout */
-	nvme_req(rq)->status = NVME_SC_ABORT_REQ | NVME_SC_DNR;
+		/*
+		 * now no-one is competing with us, safely check if the
+		 * request was completed and fail it if not.
+		 */
+		if (READ_ONCE(rq->state) != MQ_RQ_COMPLETE) {
+			nvme_req(rq)->flags |= NVME_REQ_CANCELLED;
+			nvme_end_request(rq, NVME_SC_ABORT_REQ, res);
+			return BLK_EH_DONE;
+		}
+	}
 
-	return BLK_EH_DONE;
+	return BLK_EH_RESET_TIMER;
 }
 
 static blk_status_t nvme_rdma_queue_rq(struct blk_mq_hw_ctx *hctx,
-- 
2.17.1


* [PATCH] nvme-rdma: fix timeout handler
  2018-12-13 21:29 [PATCH] nvme-rdma: fix timeout handler Sagi Grimberg
@ 2018-12-13 21:35 ` Sagi Grimberg
  2018-12-15  1:41   ` Jaesoo Lee
  2018-12-16 11:12 ` Israel Rukshin
  2018-12-16 16:32 ` Christoph Hellwig
  2 siblings, 1 reply; 13+ messages in thread
From: Sagi Grimberg @ 2018-12-13 21:35 UTC (permalink / raw)


This was meant to be an RFC, not yet for inclusion. I mostly
wanted Jaesoo and Nitzan to provide their feedback...


* [PATCH] nvme-rdma: fix timeout handler
  2018-12-13 21:35 ` Sagi Grimberg
@ 2018-12-15  1:41   ` Jaesoo Lee
  2018-12-15  2:13     ` Sagi Grimberg
  0 siblings, 1 reply; 13+ messages in thread
From: Jaesoo Lee @ 2018-12-15  1:41 UTC (permalink / raw)


I have tested the code (backported to v4.19.9) and verified it works great for the scenarios I had originally reported. Also the code looks good to me.

Thanks.

On Thu, Dec 13, 2018 at 01:35:31PM -0800, Sagi Grimberg wrote:
> This was meant to be an RFC, not yet for inclusion. I mostly
> wanted Jaesoo and Nitzan to provide their feedback...


* [PATCH] nvme-rdma: fix timeout handler
  2018-12-15  1:41   ` Jaesoo Lee
@ 2018-12-15  2:13     ` Sagi Grimberg
  0 siblings, 0 replies; 13+ messages in thread
From: Sagi Grimberg @ 2018-12-15  2:13 UTC (permalink / raw)



> I have tested the code (backported to v4.19.9) and verified it works great for the scenarios I had originally reported. Also the code looks good to me.

Thanks Jaesoo,

Now I'd like to hear from Christoph and Keith: are we ok with the
proposed approach of poking into the request state?


* [PATCH] nvme-rdma: fix timeout handler
  2018-12-13 21:29 [PATCH] nvme-rdma: fix timeout handler Sagi Grimberg
  2018-12-13 21:35 ` Sagi Grimberg
@ 2018-12-16 11:12 ` Israel Rukshin
  2018-12-16 16:32 ` Christoph Hellwig
  2 siblings, 0 replies; 13+ messages in thread
From: Israel Rukshin @ 2018-12-16 11:12 UTC (permalink / raw)


Hi Sagi,

On 12/13/2018 11:29 PM, Sagi Grimberg wrote:
> Currently, we have several problems with the timeout
> handler:
> 1. If we timeout on the controller establishment flow, we will
> hang because we don't execute the error recovery (and we shouldn't
> because the create_ctrl flow needs to fail and cleanup on its own)
> 2. We might also hang if we get a disconnet on a queue while the
> controller is already deleting. This racy flow can cause the
> controller disable/shutdown admin command to hang.
>
> We cannot complete a timed out request from the timeout handler
> without mutual exclusion from the teardown flow (e.g.
> nvme_rdma_error_recovery_work). So we serialize it in the
> timeout handler and stop the queue to guarantee that no
> one races with us from completing the request.
>
> Reported-by: Jaesoo Lee <jalee at purestorage.com>
> Signed-off-by: Sagi Grimberg <sagi at grimberg.me>
> ---
>   drivers/nvme/host/rdma.c | 33 +++++++++++++++++++++++----------
>   1 file changed, 23 insertions(+), 10 deletions(-)
>
> diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
> index 80b3113b45fb..af8f4cedabc1 100644
> --- a/drivers/nvme/host/rdma.c
> +++ b/drivers/nvme/host/rdma.c
> @@ -1080,12 +1080,13 @@ static void nvme_rdma_error_recovery_work(struct work_struct *work)
>   	nvme_rdma_reconnect_or_remove(ctrl);
>   }
>   
> -static void nvme_rdma_error_recovery(struct nvme_rdma_ctrl *ctrl)
> +static int nvme_rdma_error_recovery(struct nvme_rdma_ctrl *ctrl)
>   {
>   	if (!nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_RESETTING))
> -		return;
> +		return -EBUSY;
>   
>   	queue_work(nvme_wq, &ctrl->err_work);
> +	return 0;
>   }
>   
>   static void nvme_rdma_wr_error(struct ib_cq *cq, struct ib_wc *wc,
> @@ -1693,18 +1694,30 @@ static enum blk_eh_timer_return
>   nvme_rdma_timeout(struct request *rq, bool reserved)
>   {
>   	struct nvme_rdma_request *req = blk_mq_rq_to_pdu(rq);
> +	struct nvme_rdma_queue *queue = req->queue;
> +	struct nvme_rdma_ctrl *ctrl = queue->ctrl;
> +
> +	dev_warn(ctrl->ctrl.device, "I/O %d QID %d timeout\n",
> +		 rq->tag, nvme_rdma_queue_idx(queue));
>   
> -	dev_warn(req->queue->ctrl->ctrl.device,
> -		 "I/O %d QID %d timeout, reset controller\n",
> -		 rq->tag, nvme_rdma_queue_idx(req->queue));
> +	if (nvme_rdma_error_recovery(req->queue->ctrl)) {
> +		union nvme_result res = {};
>   
> -	/* queue error recovery */
> -	nvme_rdma_error_recovery(req->queue->ctrl);
> +		nvme_rdma_stop_queue(queue);
> +		flush_work(&ctrl->err_work);
>   
> -	/* fail with DNR on cmd timeout */
> -	nvme_req(rq)->status = NVME_SC_ABORT_REQ | NVME_SC_DNR;
> +		/*
> +		 * now no-one is competing with us, safely check if the
I guess you wanted to write "check if the request was ..."
> +		 * was completed and fail it if not.
> +		 */
> +		if (READ_ONCE(rq->state) != MQ_RQ_COMPLETE) {
> +			nvme_req(rq)->flags |= NVME_REQ_CANCELLED;
> +			nvme_end_request(rq, NVME_SC_ABORT_REQ, res);
You need to pass cpu_to_le16(NVME_SC_ABORT_REQ << 1) like fc does.
> +			return BLK_EH_DONE;
> +		}
> +	}
>   
> -	return BLK_EH_DONE;
> +	return BLK_EH_RESET_TIMER;
>   }
>   
>   static blk_status_t nvme_rdma_queue_rq(struct blk_mq_hw_ctx *hctx,

Regards,
Israel


* [PATCH] nvme-rdma: fix timeout handler
  2018-12-13 21:29 [PATCH] nvme-rdma: fix timeout handler Sagi Grimberg
  2018-12-13 21:35 ` Sagi Grimberg
  2018-12-16 11:12 ` Israel Rukshin
@ 2018-12-16 16:32 ` Christoph Hellwig
  2018-12-17 21:58   ` Sagi Grimberg
  2 siblings, 1 reply; 13+ messages in thread
From: Christoph Hellwig @ 2018-12-16 16:32 UTC (permalink / raw)


On Thu, Dec 13, 2018 at 01:29:17PM -0800, Sagi Grimberg wrote:
> Currently, we have several problems with the timeout
> handler:
> 1. If we timeout on the controller establishment flow, we will
> hang because we don't execute the error recovery (and we shouldn't
> because the create_ctrl flow needs to fail and cleanup on its own)
> 2. We might also hang if we get a disconnet on a queue while the
> controller is already deleting. This racy flow can cause the
> controller disable/shutdown admin command to hang.
> 
> We cannot complete a timed out request from the timeout handler
> without mutual exclusion from the teardown flow (e.g.
> nvme_rdma_error_recovery_work). So we serialize it in the
> timeout handler and stop the queue to guarantee that no
> one races with us from completing the request.
> 
> Reported-by: Jaesoo Lee <jalee at purestorage.com>
> Signed-off-by: Sagi Grimberg <sagi at grimberg.me>
> ---
>  drivers/nvme/host/rdma.c | 33 +++++++++++++++++++++++----------
>  1 file changed, 23 insertions(+), 10 deletions(-)
> 
> diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
> index 80b3113b45fb..af8f4cedabc1 100644
> --- a/drivers/nvme/host/rdma.c
> +++ b/drivers/nvme/host/rdma.c
> @@ -1080,12 +1080,13 @@ static void nvme_rdma_error_recovery_work(struct work_struct *work)
>  	nvme_rdma_reconnect_or_remove(ctrl);
>  }
>  
> -static void nvme_rdma_error_recovery(struct nvme_rdma_ctrl *ctrl)
> +static int nvme_rdma_error_recovery(struct nvme_rdma_ctrl *ctrl)
>  {
>  	if (!nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_RESETTING))
> -		return;
> +		return -EBUSY;
>  
>  	queue_work(nvme_wq, &ctrl->err_work);
> +	return 0;
>  }
>  
>  static void nvme_rdma_wr_error(struct ib_cq *cq, struct ib_wc *wc,
> @@ -1693,18 +1694,30 @@ static enum blk_eh_timer_return
>  nvme_rdma_timeout(struct request *rq, bool reserved)
>  {
>  	struct nvme_rdma_request *req = blk_mq_rq_to_pdu(rq);
> +	struct nvme_rdma_queue *queue = req->queue;
> +	struct nvme_rdma_ctrl *ctrl = queue->ctrl;
> +
> +	dev_warn(ctrl->ctrl.device, "I/O %d QID %d timeout\n",
> +		 rq->tag, nvme_rdma_queue_idx(queue));
>  
> +	if (nvme_rdma_error_recovery(req->queue->ctrl)) {
> +		union nvme_result res = {};
>  
> +		nvme_rdma_stop_queue(queue);
> +		flush_work(&ctrl->err_work);
>  
> +		/*
> +		 * now no-one is competing with us, safely check if the
> +		 * was completed and fail it if not.
> +		 */
> +		if (READ_ONCE(rq->state) != MQ_RQ_COMPLETE) {
> +			nvme_req(rq)->flags |= NVME_REQ_CANCELLED;
> +			nvme_end_request(rq, NVME_SC_ABORT_REQ, res);
> +			return BLK_EH_DONE;
> +		}

So why did nvme_rdma_error_recovery not complete this request?
I think that is where our problem is, and we are just papering over it.


* [PATCH] nvme-rdma: fix timeout handler
  2018-12-16 16:32 ` Christoph Hellwig
@ 2018-12-17 21:58   ` Sagi Grimberg
  2018-12-18 16:55     ` James Smart
  0 siblings, 1 reply; 13+ messages in thread
From: Sagi Grimberg @ 2018-12-17 21:58 UTC (permalink / raw)



> So why did nvme_rdma_error_recovery not complete this request?
> I think that is where our problem is, and we are just papering over it.

nvme_rdma_error_recovery, in essence, kicks in when the controller is
LIVE and only then sees a failure. What it does is tear down the
controller and schedule the periodic reconnect work.

When we fail the connection establishment, we don't want to do that,
but simply fail and let create_ctrl clean up and fail.

Technically, nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_RESETTING) is
failing because we are not LIVE, but rather CONNECTING (as we are still
in the create_ctrl phase).
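
Put differently, the gate is just the controller state machine. Here is a
hand-written sketch of the check the .timeout path runs into (the helper
name below is made up, and the exact set of allowed transitions may differ
by kernel version):
--
/*
 * Illustrative sketch only, not verbatim nvme_change_ctrl_state():
 * RESETTING is (roughly) only reachable from NEW or LIVE, so a
 * controller that is still CONNECTING inside create_ctrl cannot be
 * kicked into error recovery from the timeout handler.
 */
static bool sketch_can_enter_resetting(enum nvme_ctrl_state old_state)
{
	switch (old_state) {
	case NVME_CTRL_NEW:
	case NVME_CTRL_LIVE:
		return true;	/* nvme_change_ctrl_state() would succeed */
	case NVME_CTRL_CONNECTING:
	default:
		return false;	/* nvme_rdma_error_recovery() returns -EBUSY */
	}
}
--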


* [PATCH] nvme-rdma: fix timeout handler
  2018-12-17 21:58   ` Sagi Grimberg
@ 2018-12-18 16:55     ` James Smart
  2018-12-18 17:07       ` Christoph Hellwig
  0 siblings, 1 reply; 13+ messages in thread
From: James Smart @ 2018-12-18 16:55 UTC (permalink / raw)




On 12/17/2018 1:58 PM, Sagi Grimberg wrote:
>
>> So why did nvme_rdma_error_recovery not complete this request?
>> I think that is where our problem is, and we are just papering over it.
>
> nvme_rdma_error_recovery in essence, kicks in when the controller is
> LIVE and only then sees a failure. What it does is teardown the
> controller and schedule a periodic reconnect work..
>
> When we fail the connection establishment, we don't want to do that,
> but simply fail and let create_ctrl cleanup and fail.
>
> Technically, nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_RESETTING) is
> failing because we are not LIVE, but rather CONNECTING (as we are still
> in the create_ctrl time).

which is why I believe the transports need at least a routine that runs 
down the queues, terminates anything pending (error recovery may be an 
io timeout), and marks the queues down - regardless of the state of the 
controller. That run down will terminate the connection admin commands 
that may be in error, allowing the fail path Sagi points out to kick in. 
This same path needs to occur at the start of the reset and delete as well.

-- james


* [PATCH] nvme-rdma: fix timeout handler
  2018-12-18 16:55     ` James Smart
@ 2018-12-18 17:07       ` Christoph Hellwig
  2018-12-18 19:00         ` James Smart
  2018-12-20 21:42         ` Sagi Grimberg
  0 siblings, 2 replies; 13+ messages in thread
From: Christoph Hellwig @ 2018-12-18 17:07 UTC (permalink / raw)


On Tue, Dec 18, 2018 at 08:55:46AM -0800, James Smart wrote:
>
>
> On 12/17/2018 1:58 PM, Sagi Grimberg wrote:
>>
>>> So why did nvme_rdma_error_recovery not complete this request?
>>> I think that is where our problem is, and we are just papering over it.
>>
>> nvme_rdma_error_recovery in essence, kicks in when the controller is
>> LIVE and only then sees a failure. What it does is teardown the
>> controller and schedule a periodic reconnect work..
>>
>> When we fail the connection establishment, we don't want to do that,
>> but simply fail and let create_ctrl cleanup and fail.
>>
>> Technically, nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_RESETTING) is
>> failing because we are not LIVE, but rather CONNECTING (as we are still
>> in the create_ctrl time).
>
> which is why I believe the transports need at least a routine that runs 
> down the queues, terminates anything pending (error recovery may be an io 
> timeout), and marks the queues down - regardless of the state of the 
> controller. That run down will terminate the connection admin commands 
> that may be in error, allowing the fail path Sagi points out to kick in. 
> This same path needs to occur at the start of the reset and delete as well.

That was my impression as well.  Or in fact just allow a transition
from connecting to reset (or a special connect terminate state that
is very similar except for not recovering)


* [PATCH] nvme-rdma: fix timeout handler
  2018-12-18 17:07       ` Christoph Hellwig
@ 2018-12-18 19:00         ` James Smart
  2018-12-20 21:58           ` Sagi Grimberg
  2018-12-20 21:42         ` Sagi Grimberg
  1 sibling, 1 reply; 13+ messages in thread
From: James Smart @ 2018-12-18 19:00 UTC (permalink / raw)




On 12/18/2018 9:07 AM, Christoph Hellwig wrote:
On Tue, Dec 18, 2018 at 08:55:46AM -0800, James Smart wrote:
>>
>> On 12/17/2018 1:58 PM, Sagi Grimberg wrote:
>>>> So why did nvme_rdma_error_recovery not complete this request?
>>>> I think that is where our problem is, and we are just papering over it.
>>> nvme_rdma_error_recovery in essence, kicks in when the controller is
>>> LIVE and only then sees a failure. What it does is teardown the
>>> controller and schedule a periodic reconnect work..
>>>
>>> When we fail the connection establishment, we don't want to do that,
>>> but simply fail and let create_ctrl cleanup and fail.
>>>
>>> Technically, nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_RESETTING) is
>>> failing because we are not LIVE, but rather CONNECTING (as we are still
>>> in the create_ctrl time).
>> which is why I believe the transports need at least a routine that runs
>> down the queues, terminates anything pending (error recovery may be an io
>> timeout), and marks the queues down - regardless of the state of the
>> controller. That run down will terminate the connection admin commands
>> that may be in error, allowing the fail path Sagi points out to kick in.
>> This same path needs to occur at the start of the reset and delete as well.
> That was my impression as well.  Or in fact just allow a transition
> from connecting to reset (or a special connect terminate state that
> is very similar except for not recovering)

:) I had proposed that originally, but it was rejected. I still think 
that state transition is valid, but the pci guys disagreed.

I don't think it needs a new state. I think a transport callback that 
can be called from the common layers is sufficient. rdma already has it 
in nvme_rdma_teardown_io_queues() and nvme_rdma_teardown_admin_queue(). 
And fc has it in __nvme_fc_terminate_io().
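
To make the shape concrete, here is a purely hypothetical sketch (the
terminate_io hook itself does not exist today; only the two rdma teardown
helpers it calls do):
--
/* hypothetical transport hook, for illustration only */
static void nvme_rdma_terminate_io(struct nvme_ctrl *nctrl)
{
	struct nvme_rdma_ctrl *ctrl = to_rdma_ctrl(nctrl);

	/*
	 * quiesce the queues and fail everything in flight,
	 * regardless of the current controller state
	 */
	nvme_rdma_teardown_io_queues(ctrl, false);
	nvme_rdma_teardown_admin_queue(ctrl, false);
}
--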

-- james


* [PATCH] nvme-rdma: fix timeout handler
  2018-12-18 17:07       ` Christoph Hellwig
  2018-12-18 19:00         ` James Smart
@ 2018-12-20 21:42         ` Sagi Grimberg
  2019-01-01  8:49           ` Sagi Grimberg
  1 sibling, 1 reply; 13+ messages in thread
From: Sagi Grimberg @ 2018-12-20 21:42 UTC (permalink / raw)



>> which is why I believe the transports need at least a routine that runs
>> down the queues, terminates anything pending (error recovery may be an io
>> timeout), and marks the queues down - regardless of the state of the
>> controller. That run down will terminate the connection admin commands
>> that may be in error, allowing the fail path Sagi points out to kick in.
>> This same path needs to occur at the start of the reset and delete as well.
> 
> That was my impression as well.  Or in fact just allow a transition
> from connecting to reset (or a special connect terminate state that
> is very similar except for not recovering)

I don't think this is a good idea. Introducing a different state for
connect failure is a bit cumbersome for the state machine and not
trivial to detect from the .timeout handler.

Moreover, it's not the only case where we have an issue. In RDMA (and
TCP) at least, we have also the issue that Jaesoo pointed out.

The scenario goes as follows:
1. the host triggers a delete sequence
2. the host disconnects the I/O queues
3. the host issues prop_set (nvme_shutdown_ctrl)
4. the target disconnects the admin queue // races with (3)
5. (3) hangs because no one can complete the admin commands once
(4) lands after we have already passed step (3).

So the flows are slightly less trivial.
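
The reason (3) can hang at all is that on fabrics the register accessors
are themselves admin commands; here is a simplified sketch of the shutdown
wait (illustrative helper name, not the exact core code):
--
/*
 * Simplified sketch, not verbatim nvme_shutdown_ctrl(): on fabrics,
 * ctrl->ops->reg_read32() issues a synchronous Property Get on the
 * admin queue, so once the target drops that queue the read never
 * completes and this loop never gets to check SHST.
 */
static int sketch_wait_shutdown_complete(struct nvme_ctrl *ctrl)
{
	u32 csts;

	for (;;) {
		if (ctrl->ops->reg_read32(ctrl, NVME_REG_CSTS, &csts))
			return -EIO;
		if ((csts & NVME_CSTS_SHST_MASK) == NVME_CSTS_SHST_CMPLT)
			return 0;
		msleep(100);
	}
}
--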

Moreover, the behavior proposed is not any different than
what we do in pci which disables the device (and terminates
IO) directly from the timeout handler:
--
         /*
          * Shutdown immediately if controller times out while starting. The
          * reset work will see the pci device disabled when it gets the forced
          * cancellation error. All outstanding requests are completed on
          * shutdown, so we return BLK_EH_DONE.
          */
         switch (dev->ctrl.state) {
         case NVME_CTRL_CONNECTING:
         case NVME_CTRL_RESETTING:
                 dev_warn_ratelimited(dev->ctrl.device,
                          "I/O %d QID %d timeout, disable controller\n",
                          req->tag, nvmeq->qid);
                 nvme_dev_disable(dev, false);
                 nvme_req(req)->flags |= NVME_REQ_CANCELLED;
                 return BLK_EH_DONE;
         default:
                 break;
         }
--

nvme_dev_disable eventually calls blk_mq_tagset_busy_iter() on all
inflight requests, so it effectively does the same thing.

Would you prefer if we do the below instead (which matches pci exactly)?
--
static enum blk_eh_timer_return
nvme_rdma_timeout(struct request *rq, bool reserved)
{
         struct nvme_rdma_request *req = blk_mq_rq_to_pdu(rq);
         struct nvme_rdma_queue *queue = req->queue;
         struct nvme_rdma_ctrl *ctrl = queue->ctrl;

         dev_warn(ctrl->ctrl.device, "I/O %d QID %d timeout\n",
                  rq->tag, nvme_rdma_queue_idx(queue));

         if (nvme_rdma_error_recovery(ctrl)) {
                 /*
                  * Shutdown immediately if controller times out
                  * while starting. All outstanding requests are
                  * completed on shutdown, so we return BLK_EH_DONE.
                  */
                 flush_work(&ctrl->err_work);
                 nvme_rdma_stop_queue(queue);
                 nvme_rdma_shutdown_ctrl(queue->ctrl, false);
                 return BLK_EH_DONE;
         }

         return BLK_EH_RESET_TIMER;
}
--


* [PATCH] nvme-rdma: fix timeout handler
  2018-12-18 19:00         ` James Smart
@ 2018-12-20 21:58           ` Sagi Grimberg
  0 siblings, 0 replies; 13+ messages in thread
From: Sagi Grimberg @ 2018-12-20 21:58 UTC (permalink / raw)



>> That was my impression as well. Or in fact just allow a transition
>> from connecting to reset (or a special connect terminate state that
>> is very similar except for not recovering)
> 
> :) I had proposed that originally, but it was rejected. I still think 
> that state transition is valid, but the pci guys disagreed.
> 
> I don't think it needs a new state. I think a transport callback that 
> can be called from the common layers is sufficient. rdma already has it 
> in nvme_rdma_teardown_io_queues() and nvme_rdma_teardown_admin_queue(). 
> And fc has it in __nvme_fc_terminate_io().

I'm not entirely clear on where this callout would be invoked from. Note
that pci shares this common flow.


* [PATCH] nvme-rdma: fix timeout handler
  2018-12-20 21:42         ` Sagi Grimberg
@ 2019-01-01  8:49           ` Sagi Grimberg
  0 siblings, 0 replies; 13+ messages in thread
From: Sagi Grimberg @ 2019-01-01  8:49 UTC (permalink / raw)


Ping?

>>> which is why I believe the transports need at least a routine that runs
>>> down the queues, terminates anything pending (error recovery may be 
>>> an io
>>> timeout), and marks the queues down - regardless of the state of the
>>> controller. That run down will terminate the connection admin commands
>>> that may be in error, allowing the fail path Sagi points out to kick in.
>>> This same path needs to occur at the start of the reset and delete as 
>>> well.
>>
>> That was my impression as well. Or in fact just allow a transition
>> from connecting to reset (or a special connect terminate state that
>> is very similar except for not recovering)
> 
> I don't think this is a good idea. Introducing a different state for
> connect failure is a bit cumbersome for the state machine and not
> trivial to detect from the .timeout handler.
> 
> Moreover, its not the only case where we have an issue. In RDMA (and
> TCP) at least, we have also the issue that Jaesoo pointed out.
> 
> The scenario goes as follows:
> 1. the host triggers a delete sequence
> 2. host disconnect I/O queues
> 3. host issues prop_set (nvme_shutdown_ctrl)
> 4. target disconnects the admin queue // race with (3)
> 5. (3) hangs because no one can complete the admin commands
> it comes after we pass step (3).
> 
> So the flows are slightly less trivial.
> 
> Moreover, the behavior proposed is not any different than
> what we do in pci which disables the device (and terminates
> IO) directly from the timeout handler:
> -- 
>          /*
>           * Shutdown immediately if controller times out while starting. The
>           * reset work will see the pci device disabled when it gets the forced
>           * cancellation error. All outstanding requests are completed on
>           * shutdown, so we return BLK_EH_DONE.
>           */
>          switch (dev->ctrl.state) {
>          case NVME_CTRL_CONNECTING:
>          case NVME_CTRL_RESETTING:
>                  dev_warn_ratelimited(dev->ctrl.device,
>                           "I/O %d QID %d timeout, disable controller\n",
>                           req->tag, nvmeq->qid);
>                  nvme_dev_disable(dev, false);
>                  nvme_req(req)->flags |= NVME_REQ_CANCELLED;
>                  return BLK_EH_DONE;
>          default:
>                  break;
>          }
> -- 
> 
> nvme_dev_disable eventually calls blk_mq_tagset_busy_iter() on all
> inflight requests, so it effectively does the same thing.
> 
> Would you prefer if we do the below instead (which match pci exactly)?
> -- 
> static enum blk_eh_timer_return
> nvme_rdma_timeout(struct request *rq, bool reserved)
> {
>          struct nvme_rdma_request *req = blk_mq_rq_to_pdu(rq);
>          struct nvme_rdma_queue *queue = req->queue;
>          struct nvme_rdma_ctrl *ctrl = queue->ctrl;
> 
>          dev_warn(ctrl->ctrl.device, "I/O %d QID %d timeout\n",
>                   rq->tag, nvme_rdma_queue_idx(queue));
> 
>          if (nvme_rdma_error_recovery(ctrl)) {
>                  /*
>                   * Shutdown immediately if controller times out
>                   * while starting. All outstanding requests are
>                   * completed on shutdown, so we return BLK_EH_DONE.
>                   */
>                  flush_work(&ctrl->err_work);
>                  nvme_rdma_stop_queue(queue);
>                  nvme_rdma_shutdown_ctrl(queue->ctrl, false);
>                  return BLK_EH_DONE;
>          }
> 
>          return BLK_EH_RESET_TIMER;
> }
> -- 


