From mboxrd@z Thu Jan  1 00:00:00 1970
From: james.smart@broadcom.com (James Smart)
Date: Fri, 17 May 2019 11:52:56 -0700
Subject: [PATCH 3/3] nvme-fc: fail reconnect if state change fails
In-Reply-To: <69ca35ef-1c40-03bd-b116-09dc3eb655aa@suse.de>
References: <20190517064254.95561-1-hare@suse.de>
 <20190517064254.95561-4-hare@suse.de>
 <12b86c1a-9e83-f3b7-d72b-20eef528ab06@broadcom.com>
 <69ca35ef-1c40-03bd-b116-09dc3eb655aa@suse.de>
Message-ID: <04b95971-5ea3-a2a7-92f9-841f585eb34a@broadcom.com>

On 5/17/2019 11:34 AM, Hannes Reinecke wrote:
> On 5/17/19 6:10 PM, James Smart wrote:
>> On 5/16/2019 11:42 PM, Hannes Reinecke wrote:
>>> diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c
>>> index e5c81ba2b7a1..a8ef62512d9a 100644
>>> --- a/drivers/nvme/host/fc.c
>>> +++ b/drivers/nvme/host/fc.c
>>> @@ -2726,6 +2734,13 @@ nvme_fc_create_association(struct nvme_fc_ctrl *ctrl)
>>>  	}
>>>  	changed = nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_LIVE);
>>> +	if (!changed && ctrl->ctrl.state != NVME_CTRL_DELETING) {
>>> +		dev_err(ctrl->ctrl.device,
>>> +			"NVME-FC{%d}: error_recovery: Couldn't change "
>>> +			"state from %s to LIVE\n", ctrl->cnum,
>>> +			nvme_ctrl_state_name(&ctrl->ctrl));
>>> +		return -EAGAIN;
>>> +	}
>>>  	ctrl->ctrl.nr_reconnects = 0;
>>
>> Sorry, I should have spotted this last time.  We shouldn't be
>> exiting in the bad case - we should be setting ret = -EAGAIN and then
>> doing a goto out_term_aen_ops, so that whatever was built up for the
>> association is torn down.
>>
> I _actually_ did do that on purpose; I did ask myself the very same question.
> The reasoning was that this is the final step, and reverting it would
> amount to a full reset. Hence we could just return an error and let
> the reset handle things, especially as we do not clear the reconnect
> counter here.
> Am I wrong?
>

Yes, and partly no.  No, in that tearing down what you built up for the
association would be similar to the action of a reset. But the problem is
that the association-create path expects that "teardown" to have already
happened before it runs. In this new case, data structures are left built
up, queues are live, flags are live, yet the controller state is in limbo.
After the failure return, all that happens is the transport schedules a
connect timeout. When the timeout occurs, it will attempt to rebuild
everything that is already in place. There will be no attempt to terminate
the prior association you left live, and I have no idea what some of the
reallocation paths will do when they start building on top of things that
are already live. At best they just overlay and we have leaks - I don't
know.

You've created a very new and unique path by returning.  Don't do that.

-- 
james
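
[For reference, a minimal sketch of the alternative being suggested above
(fail through the existing teardown path rather than returning directly),
assuming the out_term_aen_ops label and the local ret variable already
present in nvme_fc_create_association, and the nvme_ctrl_state_name()
helper added earlier in this series; untested:]

	changed = nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_LIVE);
	if (!changed && ctrl->ctrl.state != NVME_CTRL_DELETING) {
		dev_err(ctrl->ctrl.device,
			"NVME-FC{%d}: error_recovery: Couldn't change "
			"state from %s to LIVE\n", ctrl->cnum,
			nvme_ctrl_state_name(&ctrl->ctrl));
		/* tear down what this association built up (queues,
		 * AEN ops, flags) before reporting the failure, rather
		 * than returning with the association left live */
		ret = -EAGAIN;
		goto out_term_aen_ops;
	}

	ctrl->ctrl.nr_reconnects = 0;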