* [PATCH] nvme-fabrics: allow to queue requests for live queues @ 2020-07-28 5:35 Sagi Grimberg 2020-07-28 6:44 ` Chao Leng ` (2 more replies) 0 siblings, 3 replies; 14+ messages in thread From: Sagi Grimberg @ 2020-07-28 5:35 UTC (permalink / raw) To: linux-nvme, Christoph Hellwig, Keith Busch; +Cc: James Smart Right now we are failing requests based on the controller state (which is checked inline in nvmf_check_ready); however, we should definitely accept requests if the queue is live. When entering controller reset, we transition the controller into NVME_CTRL_RESETTING, and then return BLK_STS_RESOURCE for non-mpath requests (which have blk_noretry_request set). This is also the case for NVME_REQ_USERCMD for some reason, but there shouldn't be any reason for us to reject this I/O in a controller reset. In a non-mpath setup, this means that the requests will simply be requeued over and over forever, not allowing the q_usage_counter to drop its final reference, causing controller reset to hang if running concurrently with heavy I/O. While we are at it, remove the redundant NVME_CTRL_NEW case, which should never see any I/O as it must first transition to NVME_CTRL_CONNECTING. Fixes: 35897b920c8a ("nvme-fabrics: fix and refine state checks in __nvmf_check_ready") Signed-off-by: Sagi Grimberg <sagi@grimberg.me> --- drivers/nvme/host/fabrics.c | 10 +--------- 1 file changed, 1 insertion(+), 9 deletions(-) diff --git a/drivers/nvme/host/fabrics.c b/drivers/nvme/host/fabrics.c index 4ec4829d6233..2e7838f42e36 100644 --- a/drivers/nvme/host/fabrics.c +++ b/drivers/nvme/host/fabrics.c @@ -564,21 +564,13 @@ bool __nvmf_check_ready(struct nvme_ctrl *ctrl, struct request *rq, { struct nvme_request *req = nvme_req(rq); - /* - * If we are in some state of setup or teardown only allow - * internally generated commands. 
- */ - if (!blk_rq_is_passthrough(rq) || (req->flags & NVME_REQ_USERCMD)) - return false; - /* * Only allow commands on a live queue, except for the connect command, * which is require to set the queue live in the appropinquate states. */ switch (ctrl->state) { - case NVME_CTRL_NEW: case NVME_CTRL_CONNECTING: - if (nvme_is_fabrics(req->cmd) && + if (blk_rq_is_passthrough(rq) && nvme_is_fabrics(req->cmd) && req->cmd->fabrics.fctype == nvme_fabrics_type_connect) return true; break; -- 2.25.1 _______________________________________________ Linux-nvme mailing list Linux-nvme@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme ^ permalink raw reply related [flat|nested] 14+ messages in thread
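For reference, the post-patch check can be modelled in plain C. This is an illustrative userspace sketch, not the kernel code: the enum values and `model_` names are invented here, and `queue_live` stands in for the transport's per-queue LIVE bit that `nvmf_check_ready()` passes down. It shows the thesis of the patch: liveness of the queue, not controller state, decides whether a regular request may pass, and only the fabrics connect command is admitted on a not-yet-live queue while connecting.

```c
#include <assert.h>
#include <stdbool.h>

/* illustrative model of __nvmf_check_ready() after this patch */
enum ctrl_state {
	CTRL_LIVE,
	CTRL_RESETTING,
	CTRL_CONNECTING,
	CTRL_DELETING,
};

struct model_rq {
	bool passthrough;	/* models blk_rq_is_passthrough(rq) */
	bool fabrics_connect;	/* models a fabrics cmd with fctype == connect */
};

static bool model_check_ready(enum ctrl_state state,
			      const struct model_rq *rq, bool queue_live)
{
	switch (state) {
	case CTRL_CONNECTING:
		/* the passthrough check guards the (modelled) req->cmd access */
		if (rq->passthrough && rq->fabrics_connect)
			return true;
		break;
	default:
		break;
	}
	/* every other command is judged purely by queue liveness */
	return queue_live;
}
```

With this model, regular I/O on a live queue passes even while the controller is RESETTING, which is exactly what lets the stuck requests complete and the queue freeze make progress.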
* Re: [PATCH] nvme-fabrics: allow to queue requests for live queues 2020-07-28 5:35 [PATCH] nvme-fabrics: allow to queue requests for live queues Sagi Grimberg @ 2020-07-28 6:44 ` Chao Leng 2020-07-28 6:49 ` Sagi Grimberg 2020-07-28 10:50 ` Christoph Hellwig 2020-07-29 5:45 ` Christoph Hellwig 2 siblings, 1 reply; 14+ messages in thread From: Chao Leng @ 2020-07-28 6:44 UTC (permalink / raw) To: Sagi Grimberg, linux-nvme, Christoph Hellwig, Keith Busch; +Cc: James Smart On 2020/7/28 13:35, Sagi Grimberg wrote: > Right now we are failing requests based on the controller > state (which is checked inline in nvmf_check_ready) however > we should definitely accept requests if the queue is live. > > When entering controller reset, we transition the controller > into NVME_CTRL_RESETTING, and then return BLK_STS_RESOURCE for > non-mpath requests (have blk_noretry_request set). > > This is also the case for NVME_REQ_USER for some reason, but > there shouldn't be any reason for us to reject this I/O in a > controller reset. > > In a non-mpath setup, this means that the requests will simply > be requeued over and over forever not allowing the q_usage_counter > to drop its final reference, causing controller reset to hang > if running concurrently with heavy I/O. > > While we are at it, remove the redundant NVME_CTRL_NEW case, which > should never see any I/O as it must first transition to > NVME_CTRL_CONNECTING. 
> > Fixes: 35897b920c8a ("nvme-fabrics: fix and refine state checks in __nvmf_check_ready") > Signed-off-by: Sagi Grimberg <sagi@grimberg.me> > --- > drivers/nvme/host/fabrics.c | 10 +--------- > 1 file changed, 1 insertion(+), 9 deletions(-) > > diff --git a/drivers/nvme/host/fabrics.c b/drivers/nvme/host/fabrics.c > index 4ec4829d6233..2e7838f42e36 100644 > --- a/drivers/nvme/host/fabrics.c > +++ b/drivers/nvme/host/fabrics.c > @@ -564,21 +564,13 @@ bool __nvmf_check_ready(struct nvme_ctrl *ctrl, struct request *rq, > { > struct nvme_request *req = nvme_req(rq); > > - /* > - * If we are in some state of setup or teardown only allow > - * internally generated commands. > - */ > - if (!blk_rq_is_passthrough(rq) || (req->flags & NVME_REQ_USERCMD)) The "if (!blk_rq_is_passthrough(rq))" check should not be deleted. If we delete it, normal io will be sent to the target, and the target cannot handle that io if the controller is not NVME_CTRL_LIVE. > - return false; > - > /* > * Only allow commands on a live queue, except for the connect command, > * which is require to set the queue live in the appropinquate states. > */ > switch (ctrl->state) { > - case NVME_CTRL_NEW: > case NVME_CTRL_CONNECTING: > - if (nvme_is_fabrics(req->cmd) && > + if (blk_rq_is_passthrough(rq) && nvme_is_fabrics(req->cmd) && If nvme_is_fabrics(req->cmd) is true, blk_rq_is_passthrough(rq) must be true. Why do we need to add the blk_rq_is_passthrough(rq) check? > req->cmd->fabrics.fctype == nvme_fabrics_type_connect) > return true; > break; >
* Re: [PATCH] nvme-fabrics: allow to queue requests for live queues 2020-07-28 6:44 ` Chao Leng @ 2020-07-28 6:49 ` Sagi Grimberg 2020-07-28 17:11 ` James Smart 0 siblings, 1 reply; 14+ messages in thread From: Sagi Grimberg @ 2020-07-28 6:49 UTC (permalink / raw) To: Chao Leng, linux-nvme, Christoph Hellwig, Keith Busch; +Cc: James Smart >> diff --git a/drivers/nvme/host/fabrics.c b/drivers/nvme/host/fabrics.c >> index 4ec4829d6233..2e7838f42e36 100644 >> --- a/drivers/nvme/host/fabrics.c >> +++ b/drivers/nvme/host/fabrics.c >> @@ -564,21 +564,13 @@ bool __nvmf_check_ready(struct nvme_ctrl *ctrl, >> struct request *rq, >> { >> struct nvme_request *req = nvme_req(rq); >> - /* >> - * If we are in some state of setup or teardown only allow >> - * internally generated commands. >> - */ >> - if (!blk_rq_is_passthrough(rq) || (req->flags & NVME_REQ_USERCMD)) > "if (!blk_rq_is_passthrough(rq))" should not delete. Because if we delete, > the normal io will be send to target, the target can not treat the io > if the queue is not NVME_CTRL_LIVE. Sure it does, the only reason for us to deny this I/O is if the queue is not live. The controller state should only _advise_ us if we need to look at the queue state. >> - return false; >> - >> /* >> * Only allow commands on a live queue, except for the connect >> command, >> * which is require to set the queue live in the appropinquate >> states. >> */ >> switch (ctrl->state) { >> - case NVME_CTRL_NEW: >> case NVME_CTRL_CONNECTING: >> - if (nvme_is_fabrics(req->cmd) && >> + if (blk_rq_is_passthrough(rq) && nvme_is_fabrics(req->cmd) && > If nvme_is_fabrics(req->cmd) is true, blk_rq_is_passthrough(rq) must > be ture. why need add check blk_rq_is_passthrough(rq)? req->cmd is only valid for passthru commands, hence we need to check for that before we reference req->cmd. 
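The evaluation-order point made here can be illustrated with a small sketch. All names (`model_req`, `MODEL_FCTYPE_CONNECT`) are hypothetical, invented for this example; only the short-circuit pattern mirrors the patch: because C's `&&` evaluates left to right and short-circuits, the passthrough check guards the `cmd` dereference, which would be invalid (here, NULL) for a normal read/write.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* illustrative model: cmd is only populated for passthrough requests */
struct model_cmd { int fctype; };
struct model_req {
	bool passthrough;		/* models blk_rq_is_passthrough(rq) */
	const struct model_cmd *cmd;	/* NULL for normal read/write I/O */
};

#define MODEL_FCTYPE_CONNECT 0x01	/* illustrative value */

static bool model_is_connect(const struct model_req *req)
{
	/* the passthrough check must come first: it prevents the
	 * req->cmd dereference for non-passthrough requests */
	return req->passthrough &&
	       req->cmd->fctype == MODEL_FCTYPE_CONNECT;
}
```

Reversing the order of the two operands would dereference `cmd` for a normal read/write, which is exactly the bug the added `blk_rq_is_passthrough(rq)` check avoids.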
* Re: [PATCH] nvme-fabrics: allow to queue requests for live queues 2020-07-28 6:49 ` Sagi Grimberg @ 2020-07-28 17:11 ` James Smart 2020-07-28 17:50 ` Sagi Grimberg 0 siblings, 1 reply; 14+ messages in thread From: James Smart @ 2020-07-28 17:11 UTC (permalink / raw) To: Sagi Grimberg, Chao Leng, linux-nvme, Christoph Hellwig, Keith Busch On 7/27/2020 11:49 PM, Sagi Grimberg wrote: > >>> diff --git a/drivers/nvme/host/fabrics.c b/drivers/nvme/host/fabrics.c >>> index 4ec4829d6233..2e7838f42e36 100644 >>> --- a/drivers/nvme/host/fabrics.c >>> +++ b/drivers/nvme/host/fabrics.c >>> @@ -564,21 +564,13 @@ bool __nvmf_check_ready(struct nvme_ctrl >>> *ctrl, struct request *rq, >>> { >>> struct nvme_request *req = nvme_req(rq); >>> - /* >>> - * If we are in some state of setup or teardown only allow >>> - * internally generated commands. >>> - */ >>> - if (!blk_rq_is_passthrough(rq) || (req->flags & NVME_REQ_USERCMD)) >> "if (!blk_rq_is_passthrough(rq))" should not delete. Because if we >> delete, >> the normal io will be send to target, the target can not treat the io >> if the queue is not NVME_CTRL_LIVE. > > Sure it does, the only reason for us to deny this I/O, is if the queue > is not live. The controller state should only _advise_ us if we need to > look at the queue state. I disagree strongly with removing the check on NVME_REQ_USERCMD. We've seen cli ioctls going to the admin queue while we're in the middle of doing controller initialization and it has hosed the controller state in some cases. We've seen commands issued before the controller is in the proper state. The admin queue may be live - but we don't necessarily want other io sneaking in. As for the blk_rq_is_passthrough check - I guess I can see it being based on the queue state, and the check looks ok (we should never see !blk_rq_is_passthrough on the admin q). But... - I don't know why it was that important to change it. 
On the connecting path, all you're doing is letting io start flowing before all the queues have been created. Did you really need to start that much sooner? - But on the resetting path, or deleting cases, you've added a condition now where the controller state was changed, but there was a delay before the transport marked the queue live. It's common practice in the transports to change state then schedule a work element to perform the actual state change. Why would you want io to continue to flow during that window? This may bring out other problems we've avoided in the past. -- james ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH] nvme-fabrics: allow to queue requests for live queues 2020-07-28 17:11 ` James Smart @ 2020-07-28 17:50 ` Sagi Grimberg 2020-07-28 20:11 ` James Smart 0 siblings, 1 reply; 14+ messages in thread From: Sagi Grimberg @ 2020-07-28 17:50 UTC (permalink / raw) To: James Smart, Chao Leng, linux-nvme, Christoph Hellwig, Keith Busch >>>> diff --git a/drivers/nvme/host/fabrics.c b/drivers/nvme/host/fabrics.c >>>> index 4ec4829d6233..2e7838f42e36 100644 >>>> --- a/drivers/nvme/host/fabrics.c >>>> +++ b/drivers/nvme/host/fabrics.c >>>> @@ -564,21 +564,13 @@ bool __nvmf_check_ready(struct nvme_ctrl >>>> *ctrl, struct request *rq, >>>> { >>>> struct nvme_request *req = nvme_req(rq); >>>> - /* >>>> - * If we are in some state of setup or teardown only allow >>>> - * internally generated commands. >>>> - */ >>>> - if (!blk_rq_is_passthrough(rq) || (req->flags & NVME_REQ_USERCMD)) >>> "if (!blk_rq_is_passthrough(rq))" should not delete. Because if we >>> delete, >>> the normal io will be send to target, the target can not treat the io >>> if the queue is not NVME_CTRL_LIVE. >> >> Sure it does, the only reason for us to deny this I/O, is if the queue >> is not live. The controller state should only _advise_ us if we need to >> look at the queue state. > > I disagree strongly with removing the check on NVME_REQ_USERCMD. We've > seen cli ioctls going to the admin queue while we're in the middle of > doing controller initialization and it's has hosed the controller state > in some cases. We've seen commands issued before the controller is in > the proper state. The admin queue may be live - but we don't > necessarily want other io sneaking in. Can you please give an example? NVME_REQ_USERCMD should not be any different from any other type of I/O. Also, do note that pci does allow queueing any type of command based on the queue state only. fabrics should be slightly different because we have the CONNECTING state where we want to let only the connect command be issued. 
> As for the blk_rq_is_passthrough check - I guess I can see it being > based on the queue state, and the check looks ok (we should never see > !blk_rq_is_passthrough on the admin q). > But... > - I don't know why it was that important to change it. On the connecting > path, all you're doing is letting io start flowing before all the queues > have been created. Did you really need to start that much sooner ? The issue is that a controller in the RESETTING state will have requests that are being issued, and if we don't let them pass through, they will hang around forever being requeued, preventing queue freeze from completing. See bug report from Krishnamraju Eraparaju. > > - But on the resetting path, or deleting cases, you've added a condition > now where the controller state was changed, but there was a delay before > the transport marked the queue live. It's common practice in the > transports to change state then schedule a work element to perform the > actual state change. Why would you want io to continue to flow during > that window ? This may bring out other problems we've avoided in the > past. What are you referring to? The change here? The controller reset must allow requests that came in before we started queue freeze to pass, otherwise freeze will never complete. ^ permalink raw reply [flat|nested] 14+ messages in thread
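The hang described here can be sketched with a toy model of the queue usage counter. The names are illustrative, not the actual blk-mq implementation: a freeze waits for the usage count to drop to zero, but a request that keeps getting BLK_STS_RESOURCE never drops its reference, so the freeze can never complete.

```c
#include <assert.h>
#include <stdbool.h>

/* toy model of blk-mq queue freeze vs. a requeue loop; illustrative only */
struct toy_queue { int q_usage_counter; };

static void toy_submit(struct toy_queue *q)   { q->q_usage_counter++; }
static void toy_complete(struct toy_queue *q) { q->q_usage_counter--; }

/* the BLK_STS_RESOURCE path: the request goes back on the requeue list
 * but keeps its queue reference, so the counter never drops */
static void toy_requeue(struct toy_queue *q)  { (void)q; }

/* models blk_mq_freeze_queue_wait(): done only at zero usage */
static bool toy_freeze_would_complete(const struct toy_queue *q)
{
	return q->q_usage_counter == 0;
}
```

In this model, a single request requeued "over and over" pins the counter at one, and only letting it actually complete unblocks the freeze, which is the concurrent-reset hang described in the patch description.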
* Re: [PATCH] nvme-fabrics: allow to queue requests for live queues 2020-07-28 17:50 ` Sagi Grimberg @ 2020-07-28 20:11 ` James Smart 2020-07-28 20:38 ` Sagi Grimberg 0 siblings, 1 reply; 14+ messages in thread From: James Smart @ 2020-07-28 20:11 UTC (permalink / raw) To: Sagi Grimberg, Chao Leng, linux-nvme, Christoph Hellwig, Keith Busch On 7/28/2020 10:50 AM, Sagi Grimberg wrote: > >>>>> diff --git a/drivers/nvme/host/fabrics.c >>>>> b/drivers/nvme/host/fabrics.c >>>>> index 4ec4829d6233..2e7838f42e36 100644 >>>>> --- a/drivers/nvme/host/fabrics.c >>>>> +++ b/drivers/nvme/host/fabrics.c >>>>> @@ -564,21 +564,13 @@ bool __nvmf_check_ready(struct nvme_ctrl >>>>> *ctrl, struct request *rq, >>>>> { >>>>> struct nvme_request *req = nvme_req(rq); >>>>> - /* >>>>> - * If we are in some state of setup or teardown only allow >>>>> - * internally generated commands. >>>>> - */ >>>>> - if (!blk_rq_is_passthrough(rq) || (req->flags & >>>>> NVME_REQ_USERCMD)) >>>> "if (!blk_rq_is_passthrough(rq))" should not delete. Because if we >>>> delete, >>>> the normal io will be send to target, the target can not treat the io >>>> if the queue is not NVME_CTRL_LIVE. >>> >>> Sure it does, the only reason for us to deny this I/O, is if the queue >>> is not live. The controller state should only _advise_ us if we need to >>> look at the queue state. >> >> I disagree strongly with removing the check on NVME_REQ_USERCMD. >> We've seen cli ioctls going to the admin queue while we're in the >> middle of doing controller initialization and it's has hosed the >> controller state in some cases. We've seen commands issued before the >> controller is in the proper state. The admin queue may be live - but >> we don't necessarily want other io sneaking in. > > Can you please give an example? NVME_REQ_USERCMD should not be any > different from any other type of I/O. Keep in mind - in order to send any command, Fabric connect included, the admin queue must be unquiesced. 
Here's what we have today: At the time of the reset or immediately during the reset, an ioctl is done - it can (and has been seen to) be queued on the request queue as there are no state checks. When this io gets scheduled, as of the resetting or even the connecting state - it's currently failing the NVME_REQ_USERCMD check, which fails back into nvmf_fail_nonready_command, which fails back with BLK_STS_RESOURCE, which reschedules it. This rescheduling reoccurs, until it is tried again with the controller state changed (usually LIVE) which then passes the admin-queue live check. It's good to go. With your patch: The ioctl will at least be requeued with BLK_STS_RESOURCE until after the Fabric Connect is completed (Connect is explicitly allowed by the state-specific check) as hopefully queue->live is false. It may actually be issued if we're in this resetting state only to be terminated afterward. All of the transports set the queue live flag as of Fabric Connect complete. If the ioctl is retried as of that point - it may be immediately sent on the admin queue - before or in-between any of the commands issued by nvme_enable_ctrl() - including before controller enabled, nvme_init_identify(), and before nvme_start_ctrl() has been called. Hopefully none of these commands try to set/change configuration. They simply shouldn't be allowed until after the controller has transitioned to LIVE. There's no need to allow them in-between. > > Also, do note that pci does allow to queue any type of command based > on the queue state only. fabrics should be slightly different because > we have the CONNECTING state where we want to let the connect command > only to be issued. > That doesn't make me feel any better. PCI bit-bangs register writes, the odds they can fail/be discarded on the "fabric" are near zero. Admin Commands - both sending/processing/responding are much much faster than on a fabric, where the cmd can be dropped/delayed by fabric routing. 
Odds of having core-generated commands fail while reconnecting/reinitializing are much, much lower. All said - fabrics have a much higher command and inter-command latency than pcie does - so pcie is going to have a harder time seeing these cases. > >> As for the blk_rq_is_passthrough check - I guess I can see it being >> based on the queue state, and the check looks ok (we should never >> see !blk_rq_is_passthrough on the admin q). >> But... >> - I don't know why it was that important to change it. On the >> connecting path, all you're doing is letting io start flowing before >> all the queues have been created. Did you really need to start that >> much sooner ? >> >> The issue is that controller in RESETTING state will have requests that >> are being issued, and if we don't let it pass through, it will hang >> around forever being requeued preventing queue freeze to complete. Is it really "forever" or is it until controller loss timeout? Granted, 10 minutes does seem like forever. > > See bug report from Krishnamraju Eraparaju. I'll go look. > >> >> - But on the resetting path, or deleting cases, you've added a >> condition now where the controller state was changed, but there was a >> delay before the transport marked the queue live. It's common >> practice in the transports to change state then schedule a work >> element to perform the actual state change. Why would you want io to >> continue to flow during that window ? This may bring out other >> problems we've avoided in the past. > > What are you referring to? the change here? the controller reset must > allow requests that came in before we started queue freeze to pass, > otherwise freeze will never complete. I'm referring to the io's that now pass through with the patch as the controller state is no longer blocking them (e.g. the removal of the !blk_rq_is_passthrough and NVME_REQ_USERCMD checks). 
They will pass the ready checks and go out on the wire, creating more active io that has to be deleted when the reset finally quiesces and terminates the active io list. For FC, we have real link traffic for any io that gets started, so it raises the demands. Should be fine, but it is creating more load at a point where it could have been blocked. -- james ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH] nvme-fabrics: allow to queue requests for live queues 2020-07-28 20:11 ` James Smart @ 2020-07-28 20:38 ` Sagi Grimberg 2020-07-28 22:47 ` James Smart 0 siblings, 1 reply; 14+ messages in thread From: Sagi Grimberg @ 2020-07-28 20:38 UTC (permalink / raw) To: James Smart, Chao Leng, linux-nvme, Christoph Hellwig, Keith Busch >> Can you please give an example? NVME_REQ_USERCMD should not be any >> different from any other type of I/O. > > Keep in mind - in order to send any command, Fabric connect included, > the admin queue must be unquiesced. > > Here's what we have today: > At the time of the reset or immediately during the reset, an ioctl is > done - it can (and has been seen) to be queued on the request queue as > there are not state checks. When this io gets scheduled, as of the > resetting or even the connecting state - it's currently failing the > NVME_REQ_USERCMD check, which fails back into > nvmf_fail_nonready_command, which fails back with BLK_STS_RESOURCE, > which reschedules it. This rescheduling reoccurs, until the its tried > again with the controller state changed (usually LIVE) which then passes > the admin-queue live check. It's good to go. > > With your patch: > The ioctl will at least be requeued with BLK_STS_RESOURCE until after > the Fabric Connect is completed (Connect is explicitly allowed by the > state-specific check) as hopefully queue->live is false. It may actually > be issued if we're in this resetting state only to be terminated > afterward. All of the transports set the queue live flag as of Fabric > Connect complete. If the ioctl is retried as of that point - it may be > immediately sent on the admin queue - before or in-between any of the > commands issued by nvme_enable_ctrl() - including before controller > enabled, nvme_init_identify(), and before nvme_start_ctrl() has been > called. Hopefully none of these commands try to set/change > configuration. 
> They simply shouldn't be allowed until after the > controller has transitioned to LIVE. There's no need to allow them > in-between. What you are describing is correct; that is why we have connect_q to save us from it for I/O queues. The problem I have with this code is that checking for NVME_REQ_USERCMD is a big hammer that you land on what can be normal I/O (e.g. nvme read/write/write-zeros), and we must not rely on this indication to prevent what you are describing. It may be (just maybe) OK to check NVME_REQ_USERCMD only for the admin queue, but we are really circling over the fact that we cannot reliably make the admin connect go off first, because we have to unquiesce it before we issue the admin connect, and there might be a stray command going into execution before we submit the admin connect. >> Also, do note that pci does allow to queue any type of command based >> on the queue state only. fabrics should be slightly different because >> we have the CONNECTING state where we want to let the connect command >> only to be issued. >> > > That doesn't make me feel any better. PCI bit-bangs register writes, > the odds they can fail/be discarded on the "fabric" is near zero. Admin > Commands - both sending/processing/responding is much much faster than > on a fabric, where the cmd can be dropped/delayed by fabric routing. > Odds of having core-generated commands fail while > reconnecting/reinitializing are much much lower. All says - fabrics > have a much higher command and inter-command latency than pcie does - so > pcie is going to have a harder time seeing these cases. Timing here is not the issue; we should take a step back and understand that we fail any user command for a very specific reason. We should solve the root cause and not paper around it. >>> As for the blk_rq_is_passthrough check - I guess I can see it being >>> based on the queue state, and the check looks ok (we should never >>> see !blk_rq_is_passthrough on the admin q). >>> But... 
>>> - I don't know why it was that important to change it. On the >>> connecting path, all you're doing is letting io start flowing >>> before all the queues have been created. Did you really need to >>> start that much sooner ? >> >> The issue is that controller in RESETTING state will have requests that >> are being issued, and if we don't let it pass through, it will hang >> around forever being requeued preventing queue freeze to complete. > > is it really "forever" or is it until controller loss timeout ? granted > 10minutes does seem like forever ctrl_loss_tmo can be 1000000 years; it cannot rescue us. >>> - But on the resetting path, or deleting cases, you've added a >>> condition now where the controller state was changed, but there was a >>> delay before the transport marked the queue live. It's common >>> practice in the transports to change state then schedule a work >>> element to perform the actual state change. Why would you want io to >>> continue to flow during that window ? This may bring out other >>> problems we've avoided in the past. >> >> What are you referring to? the change here? the controller reset must >> allow requests that came in before we started queue freeze to pass, >> otherwise freeze will never complete. > > I'm referring to the io's that now pass through with the patch as the > controller state is no longer blocking them (e.g. the removal of the > !blk_rq_is_passthrough and NVME_REQ_USERCMD checks). First of all, you have a bug hiding there if blk_mq_update_nr_hw_queues actually ever does anything, because it won't be able to reliably freeze the queue (because the !blk_rq_is_passthrough keeps the commands endlessly rescheduling themselves, never completing, hence the queue cannot be frozen). This is what started this: because today we support multiple queue maps, it always ends up freezing the queue, so this can happen much more frequently. 
> They will pass the > ready checks and go out on the wire, creating more active io that has to > be deleted when the reset finally quiesces and terminates the active io > list. For FC, we have to real link traffic for any io that gets started, > so it raises the demands. Should be fine, but it is creating more load > at a point where it could have been blocked. As I said, freezing the queue is impossible when we let these commands hang in a limbo state, and queue freeze may occur in a lot of places, so we must either let them pass, or fence them when starting the freeze. ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH] nvme-fabrics: allow to queue requests for live queues 2020-07-28 20:38 ` Sagi Grimberg @ 2020-07-28 22:47 ` James Smart 2020-07-28 23:39 ` Sagi Grimberg 0 siblings, 1 reply; 14+ messages in thread From: James Smart @ 2020-07-28 22:47 UTC (permalink / raw) To: Sagi Grimberg, Chao Leng, linux-nvme, Christoph Hellwig, Keith Busch On 7/28/2020 1:38 PM, Sagi Grimberg wrote: > >>> Can you please give an example? NVME_REQ_USERCMD should not be any >>> different from any other type of I/O. >> ... > > The problem I have with this code is that checking for NVME_REQ_USERCMD > is a big hammer that you land on what can be normal I/O > (e.g. nvme read/write/write-zeros), and we must not rely on this > indication to prevent what you are describing. > > It maybe (just maybe) OK to check NVME_REQ_USERCMD only for the admin > queue, but we are really circling over the fact that we cannot reliably > send admin connect before to go off first, because we have to unquiesce > it before we issue the admin connect, and there might be a stray command > going into execution before we submit the admin connect. Agree. Didn't you suggest something separate to isolate the connect commands/init commands? Can we use a different tag set for "initialization" commands? We can use different queue_rq routines with different check ready checks. I don't like the overhead, but it sure makes the logic more straightforward. >>>> As for the blk_rq_is_passthrough check - I guess I can see it being >>>> based on the queue state, and the check looks ok (we should never >>>> see !blk_rq_is_passthrough on the admin q). >>>> But... >>>> - I don't know why it was that important to change it. On the >>>> connecting path, all you're doing is letting io start flowing >>>> before all the queues have been created. 
>>> >>> The issue is that controller in RESETTING state will have requests that >>> are being issued, and if we don't let it pass through, it will hang >>> around forever being requeued preventing queue freeze to complete. >> Isn't this the real problem: we require the request to complete in order to get off the request queue in order to freeze. But to complete - it has to finish with some status. This is ok if running multipath, as mpath queues and retries - but what if mpath is not running? The issuer of the io will get the error? Am I understanding it? If so, I don't think a simple check reorg such as proposed will work as ultimately we can't return an error to something that doesn't know how to requeue/retry. Should we always be putting even non-mp devices under mp, thus mp catches the errors rather than the blk queue, and can have a short delay while the controller reconnects? At least normal io can be handled this way and we'd have to figure out the ioctl path. But it should still allow the core io, such as scanning, to get the error and fail out, to be picked up after the reconnect. -- james ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH] nvme-fabrics: allow to queue requests for live queues 2020-07-28 22:47 ` James Smart @ 2020-07-28 23:39 ` Sagi Grimberg 0 siblings, 0 replies; 14+ messages in thread From: Sagi Grimberg @ 2020-07-28 23:39 UTC (permalink / raw) To: James Smart, Chao Leng, linux-nvme, Christoph Hellwig, Keith Busch >>>> Can you please give an example? NVME_REQ_USERCMD should not be any >>>> different from any other type of I/O. >>> ... >> >> The problem I have with this code is that checking for NVME_REQ_USERCMD >> is a big hammer that you land on what can be normal I/O >> (e.g. nvme read/write/write-zeros), and we must not rely on this >> indication to prevent what you are describing. >> >> It maybe (just maybe) OK to check NVME_REQ_USERCMD only for the admin >> queue, but we are really circling over the fact that we cannot reliably >> send admin connect before to go off first, because we have to unquiesce >> it before we issue the admin connect, and there might be a stray command >> going into execution before we submit the admin connect. > > agree. > > Didn't you suggest something separate to isolate the connect > commands/init commands ? Can we use a different tag set for > "initialization" commands? We can use different queue_rq routines with > different check ready checks. I don't like the overhead, but sure makes > the logic more straightforward. I think I've added this now that I'm looking back: e7832cb48a65 ("nvme: make fabrics command run on a separate request queue"). But it still doesn't solve the user commands you are referring to. >>>>> As for the blk_rq_is_passthrough check - I guess I can see it being >>>>> based on the queue state, and the check looks ok (we should never >>>>> see !blk_rq_is_passthrough on the admin q). >>>>> But... >>>>> - I don't know why it was that important to change it. On the >>>>> connecting path, all you're doing is letting io start flowing >>>>> before all the queues have been created. 
Did you really need to >>>>> start that much sooner ? >>>> >>>> The issue is that controller in RESETTING state will have requests that >>>> are being issued, and if we don't let it pass through, it will hang >>>> around forever being requeued preventing queue freeze to complete. >>> > > Isn't this the real problem: we require the request to complete in > order to get off the request queue in order to freeze. But to complete - > it has to finish with some status. This is ok if running multipath, as > mpath queues and retries - but what mpath is not running ? The issuer of > the io will get the error ? We already fail noretry errors when we cancel (or abort) anything that was submitted before, so there is nothing new here. non-mpath does not set FAILFAST so it gets requeued, user commands do, so they are failed, this is not different than before. > Should we always be putting even non-mp devices under mp, thus mp > catches the errors rather than the blk queue, and can have a short delay > while the controller reconnects ? At least normal io can be handled > this way and we'd have to figure out the ioctl path. But it should > still allow the core io, such as scanning, to get the error and fail > out, to be picked up after the reconnect. non-mpath still gets requeued because it doesn't set any sort of FAILFAST flag. or _______________________________________________ Linux-nvme mailing list Linux-nvme@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH] nvme-fabrics: allow to queue requests for live queues
  2020-07-28  5:35 [PATCH] nvme-fabrics: allow to queue requests for live queues Sagi Grimberg
  2020-07-28  6:44 ` Chao Leng
@ 2020-07-28 10:50 ` Christoph Hellwig
  2020-07-28 16:50 ` James Smart
  2020-07-29  5:45 ` Christoph Hellwig
  2 siblings, 1 reply; 14+ messages in thread

From: Christoph Hellwig @ 2020-07-28 10:50 UTC (permalink / raw)
To: Sagi Grimberg; +Cc: Keith Busch, Christoph Hellwig, linux-nvme, James Smart

Applied to nvme-5.9, thanks.
* Re: [PATCH] nvme-fabrics: allow to queue requests for live queues
  2020-07-28 10:50 ` Christoph Hellwig
@ 2020-07-28 16:50 ` James Smart
  0 siblings, 0 replies; 14+ messages in thread

From: James Smart @ 2020-07-28 16:50 UTC (permalink / raw)
To: Christoph Hellwig, Sagi Grimberg; +Cc: Keith Busch, linux-nvme

On 7/28/2020 3:50 AM, Christoph Hellwig wrote:
> Applied to nvme-5.9, thanks.

No - I disagree with this patch. I'll reply in the other thread.
Where is the delay I see with other patches before they are applied?

-- james
* Re: [PATCH] nvme-fabrics: allow to queue requests for live queues
  2020-07-28  5:35 [PATCH] nvme-fabrics: allow to queue requests for live queues Sagi Grimberg
  2020-07-28  6:44 ` Chao Leng
  2020-07-28 10:50 ` Christoph Hellwig
@ 2020-07-29  5:45 ` Christoph Hellwig
  2020-07-29  5:53 ` Sagi Grimberg
  2 siblings, 1 reply; 14+ messages in thread

From: Christoph Hellwig @ 2020-07-29 5:45 UTC (permalink / raw)
To: Sagi Grimberg; +Cc: Keith Busch, Christoph Hellwig, linux-nvme, James Smart

Dropped from nvme-5.9 to wait for the discussion to finish, as I want
to send a PR to Jens today.

On Mon, Jul 27, 2020 at 10:35:23PM -0700, Sagi Grimberg wrote:
> Right now we are failing requests based on the controller
> state (which is checked inline in nvmf_check_ready); however,
> we should definitely accept requests if the queue is live.
>
> When entering controller reset, we transition the controller
> into NVME_CTRL_RESETTING, and then return BLK_STS_RESOURCE for
> non-mpath requests (which have blk_noretry_request set).
>
> This is also the case for NVME_REQ_USERCMD for some reason, but
> there shouldn't be any reason for us to reject this I/O in a
> controller reset.
>
> In a non-mpath setup, this means that the requests will simply
> be requeued over and over forever, not allowing the q_usage_counter
> to drop its final reference, causing controller reset to hang
> if running concurrently with heavy I/O.
>
> While we are at it, remove the redundant NVME_CTRL_NEW case, which
> should never see any I/O as it must first transition to
> NVME_CTRL_CONNECTING.
>
> Fixes: 35897b920c8a ("nvme-fabrics: fix and refine state checks in __nvmf_check_ready")
> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
> ---
>  drivers/nvme/host/fabrics.c | 10 +---------
>  1 file changed, 1 insertion(+), 9 deletions(-)
>
> diff --git a/drivers/nvme/host/fabrics.c b/drivers/nvme/host/fabrics.c
> index 4ec4829d6233..2e7838f42e36 100644
> --- a/drivers/nvme/host/fabrics.c
> +++ b/drivers/nvme/host/fabrics.c
> @@ -564,21 +564,13 @@ bool __nvmf_check_ready(struct nvme_ctrl *ctrl, struct request *rq,
>  {
>  	struct nvme_request *req = nvme_req(rq);
>
> -	/*
> -	 * If we are in some state of setup or teardown only allow
> -	 * internally generated commands.
> -	 */
> -	if (!blk_rq_is_passthrough(rq) || (req->flags & NVME_REQ_USERCMD))
> -		return false;
> -
>  	/*
>  	 * Only allow commands on a live queue, except for the connect command,
>  	 * which is require to set the queue live in the appropinquate states.
>  	 */
>  	switch (ctrl->state) {
> -	case NVME_CTRL_NEW:
>  	case NVME_CTRL_CONNECTING:
> -		if (nvme_is_fabrics(req->cmd) &&
> +		if (blk_rq_is_passthrough(rq) && nvme_is_fabrics(req->cmd) &&
>  		    req->cmd->fabrics.fctype == nvme_fabrics_type_connect)
>  			return true;
>  		break;
> --
> 2.25.1

---end quoted text---
* Re: [PATCH] nvme-fabrics: allow to queue requests for live queues
  2020-07-29  5:45 ` Christoph Hellwig
@ 2020-07-29  5:53 ` Sagi Grimberg
  2020-07-29  6:05 ` Christoph Hellwig
  0 siblings, 1 reply; 14+ messages in thread

From: Sagi Grimberg @ 2020-07-29 5:53 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Keith Busch, James Smart, linux-nvme

> Dropped from nvme-5.9 to wait for the discussion to finish, as I want
> to send a PR to Jens today.

OK, this is a bug fix, so we can send it later for sure.

I can also send a patch that doesn't touch the USERCMDs that James is
concerned about, and we will fix that later...
* Re: [PATCH] nvme-fabrics: allow to queue requests for live queues
  2020-07-29  5:53 ` Sagi Grimberg
@ 2020-07-29  6:05 ` Christoph Hellwig
  0 siblings, 0 replies; 14+ messages in thread

From: Christoph Hellwig @ 2020-07-29 6:05 UTC (permalink / raw)
To: Sagi Grimberg; +Cc: Keith Busch, Christoph Hellwig, linux-nvme, James Smart

On Tue, Jul 28, 2020 at 10:53:41PM -0700, Sagi Grimberg wrote:
>> Dropped from nvme-5.9 to wait for the discussion to finish, as I want
>> to send a PR to Jens today.
>
> OK, this is a bug fix, so we can send it later for sure.
>
> I can also send a patch that doesn't touch the USERCMDs that James is
> concerned about, and we will fix that later...

I need to catch up on the actual discussion. I'm heading off on
vacation today and just need to get everything in order first.
End of thread [~2020-07-29  6:05 UTC | newest]

Thread overview: 14+ messages
2020-07-28  5:35 [PATCH] nvme-fabrics: allow to queue requests for live queues Sagi Grimberg
2020-07-28  6:44 ` Chao Leng
2020-07-28  6:49 ` Sagi Grimberg
2020-07-28 17:11 ` James Smart
2020-07-28 17:50 ` Sagi Grimberg
2020-07-28 20:11 ` James Smart
2020-07-28 20:38 ` Sagi Grimberg
2020-07-28 22:47 ` James Smart
2020-07-28 23:39 ` Sagi Grimberg
2020-07-28 10:50 ` Christoph Hellwig
2020-07-28 16:50 ` James Smart
2020-07-29  5:45 ` Christoph Hellwig
2020-07-29  5:53 ` Sagi Grimberg
2020-07-29  6:05 ` Christoph Hellwig