* [PATCH v2] nvme_fc: retry initial controller connections 3 times
@ 2017-10-09 23:39 James Smart
  2017-10-11 11:09 ` Sagi Grimberg
  2017-10-16 12:50 ` Christoph Hellwig
  0 siblings, 2 replies; 4+ messages in thread
From: James Smart @ 2017-10-09 23:39 UTC (permalink / raw)


Currently, if a frame is lost or a command fails as part of the
initial association create for a new controller, the new controller
connection request immediately fails.

Add an immediate retry loop of up to three attempts before giving up.

Signed-off-by: James Smart <james.smart@broadcom.com>

---
v2:
  Added comment explaining why the retries are done. The comment
  should sufficiently explain the background.
  Note: rdma, in most cases, would likewise fail on the first connect
  attempt if initial connections or commands failed.
---
 drivers/nvme/host/fc.c | 32 ++++++++++++++++++++++++++++++--
 1 file changed, 30 insertions(+), 2 deletions(-)

diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c
index 1075488a81d6..18e036c6833b 100644
--- a/drivers/nvme/host/fc.c
+++ b/drivers/nvme/host/fc.c
@@ -2790,7 +2790,7 @@ nvme_fc_init_ctrl(struct device *dev, struct nvmf_ctrl_options *opts,
 {
 	struct nvme_fc_ctrl *ctrl;
 	unsigned long flags;
-	int ret, idx;
+	int ret, idx, retry;
 
 	if (!(rport->remoteport.port_role &
 	    (FC_PORT_ROLE_NVME_DISCOVERY | FC_PORT_ROLE_NVME_TARGET))) {
@@ -2882,9 +2882,37 @@ nvme_fc_init_ctrl(struct device *dev, struct nvmf_ctrl_options *opts,
 	list_add_tail(&ctrl->ctrl_list, &rport->ctrl_list);
 	spin_unlock_irqrestore(&rport->lock, flags);
 
-	ret = nvme_fc_create_association(ctrl);
+	/*
+	 * It's possible that transactions used to create the association
+	 * may fail. Examples: CreateAssociation LS or CreateIOConnection
+	 * LS gets dropped/corrupted/fails; or a frame gets dropped or a
+	 * command times out for one of the actions to init the controller
+	 * (Connect, Get/Set_Property, Set_Features, etc). Many of these
+	 * transport errors (frame drop, LS failure) inherently must kill
+	 * the association. The transport is coded so that any command used
+	 * to create the association (prior to a LIVE state transition
+	 * while NEW or RECONNECTING) will fail if it completes in error or
+	 * times out.
+	 *
+	 * As such: as the connect request was most likely due to a
+	 * udev event that discovered the remote port, meaning there is
+	 * not an admin or script there to restart if the connect
+	 * request fails, retry the initial connection creation up to
+	 * three times before giving up and declaring failure.
+	 */
+	for (retry = 0; retry < 3; retry++) {
+		ret = nvme_fc_create_association(ctrl);
+		if (!ret)
+			break;
+	}
+
 	if (ret) {
+		/* connect failed after retries - fail out */
+		dev_err(ctrl->ctrl.device,
+			"NVME-FC{%d}: Connect retry failed\n", ctrl->cnum);
+
 		ctrl->ctrl.opts = NULL;
+
 		/* initiate nvme ctrl ref counting teardown */
 		nvme_uninit_ctrl(&ctrl->ctrl);
 		nvme_put_ctrl(&ctrl->ctrl);
-- 
2.13.1


* [PATCH v2] nvme_fc: retry initial controller connections 3 times
  2017-10-09 23:39 [PATCH v2] nvme_fc: retry initial controller connections 3 times James Smart
@ 2017-10-11 11:09 ` Sagi Grimberg
  2017-10-11 15:07   ` James Smart
  2017-10-16 12:50 ` Christoph Hellwig
  1 sibling, 1 reply; 4+ messages in thread
From: Sagi Grimberg @ 2017-10-11 11:09 UTC (permalink / raw)



> diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c
> index 1075488a81d6..18e036c6833b 100644
> --- a/drivers/nvme/host/fc.c
> +++ b/drivers/nvme/host/fc.c
> @@ -2790,7 +2790,7 @@ nvme_fc_init_ctrl(struct device *dev, struct nvmf_ctrl_options *opts,
>   {
>   	struct nvme_fc_ctrl *ctrl;
>   	unsigned long flags;
> -	int ret, idx;
> +	int ret, idx, retry;
>   
>   	if (!(rport->remoteport.port_role &
>   	    (FC_PORT_ROLE_NVME_DISCOVERY | FC_PORT_ROLE_NVME_TARGET))) {
> @@ -2882,9 +2882,37 @@ nvme_fc_init_ctrl(struct device *dev, struct nvmf_ctrl_options *opts,
>   	list_add_tail(&ctrl->ctrl_list, &rport->ctrl_list);
>   	spin_unlock_irqrestore(&rport->lock, flags);
>   
> -	ret = nvme_fc_create_association(ctrl);
> +	/*
> +	 * It's possible that transactions used to create the association
> +	 * may fail. Examples: CreateAssociation LS or CreateIOConnection
> +	 * LS gets dropped/corrupted/fails; or a frame gets dropped or a
> +	 * command times out for one of the actions to init the controller
> +	 * (Connect, Get/Set_Property, Set_Features, etc). Many of these
> +	 * transport errors (frame drop, LS failure) inherently must kill
> +	 * the association. The transport is coded so that any command used
> +	 * to create the association (prior to a LIVE state transition
> +	 * while NEW or RECONNECTING) will fail if it completes in error or
> +	 * times out.
> +	 *
> +	 * As such: as the connect request was most likely due to a
> +	 * udev event that discovered the remote port, meaning there is
> +	 * not an admin or script there to restart if the connect
> +	 * request fails, retry the initial connection creation up to
> +	 * three times before giving up and declaring failure.
> +	 */
> +	for (retry = 0; retry < 3; retry++) {
> +		ret = nvme_fc_create_association(ctrl);
> +		if (!ret)
> +			break;
> +	}
> +
>   	if (ret) {
> +		/* connect failed after retries - fail out */
> +		dev_err(ctrl->ctrl.device,
> +			"NVME-FC{%d}: Connect retry failed\n", ctrl->cnum);
> +
>   		ctrl->ctrl.opts = NULL;
> +
>   		/* initiate nvme ctrl ref counting teardown */
>   		nvme_uninit_ctrl(&ctrl->ctrl);
>   		nvme_put_ctrl(&ctrl->ctrl);
> 

It's not directly related to this patch, but the error handling here
seems non-trivial; why isn't this going to the normal out_* error
targets below in the function?


* [PATCH v2] nvme_fc: retry initial controller connections 3 times
  2017-10-11 11:09 ` Sagi Grimberg
@ 2017-10-11 15:07   ` James Smart
  0 siblings, 0 replies; 4+ messages in thread
From: James Smart @ 2017-10-11 15:07 UTC (permalink / raw)


On 10/11/2017 4:09 AM, Sagi Grimberg wrote:
> It's not directly related to this patch, but the error handling here
> seems non-trivial; why isn't this going to the normal out_* error
> targets below in the function?

Because the original flow, which did teardown in the calling sequence
like rdma does, had a lot of concurrency issues while debugging and
didn't fit well with a more generic
create_association/delete_association back end. The issue is how many
things are real hw contexts in FC that have to be flushed with the
lldd before delete can really occur. Those paths are all solved
already for the LIVE->DELETING flows using ref-counting. So, I perform
the calling-sequence teardown until the nvme controller is init'd, at
which point I defer to the normal refcounting delete flows. There's a
comment that highlights that.

-- james


* [PATCH v2] nvme_fc: retry initial controller connections 3 times
  2017-10-09 23:39 [PATCH v2] nvme_fc: retry initial controller connections 3 times James Smart
  2017-10-11 11:09 ` Sagi Grimberg
@ 2017-10-16 12:50 ` Christoph Hellwig
  1 sibling, 0 replies; 4+ messages in thread
From: Christoph Hellwig @ 2017-10-16 12:50 UTC (permalink / raw)


Thanks,

I will apply this to the nvme-4.14 tree.


