* [PATCH rfc 0/2] nvme controller reset and namespace scan work race conditions @ 2019-07-29 23:31 Sagi Grimberg 2019-07-29 23:32 ` [PATCH rfc 1/2] nvme: don't remove namespace if revalidate failed because of controller reset Sagi Grimberg 2019-07-29 23:32 ` [PATCH rfc 2/2] nvme: fix possible use-after-free condition when controller reset is racing namespace scanning Sagi Grimberg 0 siblings, 2 replies; 42+ messages in thread From: Sagi Grimberg @ 2019-07-29 23:31 UTC (permalink / raw) Hey Hannes, Here are two patches that, to my understanding, address the issues you describe in your patchset "nvme: flush rescan worker before resetting" and in your report of "spurious I/O errors during failover". Patch #1 avoids removing a namespace if the revalidation I/O failed because of a racing controller reset (or removal). This should fix any spurious I/O failures during path failover. Patch #2 avoids a use-after-free condition in the case where the scan work ends up actually removing a namespace but is racing with a controller reset (that is scheduled after the removal) and is accessing the request queue (nvme_stop_queues) after it was destroyed. The one trace that you have mentioned in our discussions that is still not clear to me is: -- [67088.344034] WARNING: CPU: 4 PID: 25020 at ../lib/percpu-refcount.c:334 percpu_ref_kill_and_confirm+0x7a/0xa0 [...] [67088.344106] Call Trace: [67088.344112] blk_freeze_queue_start+0x2a/0x40 [67088.344114] blk_freeze_queue+0xe/0x40 [67088.344118] nvme_update_disk_info+0x36/0x260 [nvme_core] [67088.344122] __nvme_revalidate_disk+0xca/0xf0 [nvme_core] [67088.344125] nvme_revalidate_disk+0xa6/0x120 [nvme_core] [67088.344127] ? blk_mq_get_tag+0xa3/0x220 [67088.344130] revalidate_disk+0x23/0xc0 [67088.344133] nvme_validate_ns+0x43/0x830 [nvme_core] [67088.344137] ? wake_up_q+0x70/0x70 [67088.344139] ? blk_mq_free_request+0x12a/0x160 [67088.344142] ? 
__nvme_submit_sync_cmd+0x73/0xe0 [nvme_core] [67088.344145] nvme_scan_work+0x2b3/0x350 [nvme_core] [67088.344149] process_one_work+0x1da/0x400 [67088.344150] worker_thread+0x2b/0x3f0 [67088.344152] ? process_one_work+0x400/0x400 -- This indicates that we are revalidating a namespace that was already removed. Given that the only namespace removal outside of the scan_work is in nvme_remove_namespaces(), which flushes the scan_work before it actually removes the namespaces, I'm still lost as to how this can happen. Can you please apply the following two patches and report if they address the issues you are seeing? And if not, can you please report a call trace of the hung threads? And, given that you are in a multipath environment, can you apply these on top of: "nvme: fix controller removal race with scan work"? Thanks. Sagi Grimberg (2): nvme: don't remove namespace if revalidate failed because of controller reset nvme: fix possible use-after-free condition when controller reset is racing namespace scanning drivers/nvme/host/core.c | 13 +++++++------ 1 file changed, 7 insertions(+), 6 deletions(-) -- 2.17.1 ^ permalink raw reply [flat|nested] 42+ messages in thread
* [PATCH rfc 1/2] nvme: don't remove namespace if revalidate failed because of controller reset 2019-07-29 23:31 [PATCH rfc 0/2] nvme controller reset and namespace scan work race conditions Sagi Grimberg @ 2019-07-29 23:32 ` Sagi Grimberg 2019-07-30 0:59 ` Keith Busch ` (2 more replies) 2019-07-29 23:32 ` [PATCH rfc 2/2] nvme: fix possible use-after-free condition when controller reset is racing namespace scanning Sagi Grimberg 1 sibling, 3 replies; 42+ messages in thread From: Sagi Grimberg @ 2019-07-29 23:32 UTC (permalink / raw) If a controller reset is racing with a namespace revalidation, the revalidation I/O will surely fail, but we should not remove the namespace, as we will execute the I/O when the controller is LIVE again. Specifically check if the controller is LIVE, because RESETTING/CONNECTING are transient and DELETING/DEAD will eventually remove the namespace in the removal code path. This fixes spurious I/O errors in path failover conditions where the controller reset is racing with the namespace scan work. Reported-by: Hannes Reinecke <hare at suse.de> Signed-off-by: Sagi Grimberg <sagi at grimberg.me> --- drivers/nvme/host/core.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c index fa31da0762b9..5f6970e7ba73 100644 --- a/drivers/nvme/host/core.c +++ b/drivers/nvme/host/core.c @@ -3428,7 +3428,8 @@ static void nvme_validate_ns(struct nvme_ctrl *ctrl, unsigned nsid) ns = nvme_find_get_ns(ctrl, nsid); if (ns) { - if (ns->disk && revalidate_disk(ns->disk)) + if (ns->disk && revalidate_disk(ns->disk) && + ctrl->state != NVME_CTRL_LIVE) nvme_ns_remove(ns); nvme_put_ns(ns); } else -- 2.17.1 ^ permalink raw reply related [flat|nested] 42+ messages in thread
* [PATCH rfc 1/2] nvme: don't remove namespace if revalidate failed because of controller reset 2019-07-29 23:32 ` [PATCH rfc 1/2] nvme: don't remove namespace if revalidate failed because of controller reset Sagi Grimberg @ 2019-07-30 0:59 ` Keith Busch 2019-07-30 1:04 ` Sagi Grimberg 2019-07-30 1:04 ` Ming Lei 2019-07-31 12:18 ` Hannes Reinecke 2 siblings, 1 reply; 42+ messages in thread From: Keith Busch @ 2019-07-30 0:59 UTC (permalink / raw) On Mon, Jul 29, 2019@5:32 PM Sagi Grimberg <sagi@grimberg.me> wrote: > If a controller reset is racing with a namespace revalidation, the > revalidation I/O will surely fail, but we should not remove the > namespace, as we will execute the I/O when the controller is LIVE again. > Specifically check if the controller is LIVE, because > RESETTING/CONNECTING are transient and DELETING/DEAD will eventually > remove the namespace in the removal code path. > > This fixes spurious I/O errors in path failover conditions where the > controller reset is racing with the namespace scan work. > > Reported-by: Hannes Reinecke <hare at suse.de> > Signed-off-by: Sagi Grimberg <sagi at grimberg.me> > --- > drivers/nvme/host/core.c | 3 ++- > 1 file changed, 2 insertions(+), 1 deletion(-) > > diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c > index fa31da0762b9..5f6970e7ba73 100644 > --- a/drivers/nvme/host/core.c > +++ b/drivers/nvme/host/core.c > @@ -3428,7 +3428,8 @@ static void nvme_validate_ns(struct nvme_ctrl *ctrl, unsigned nsid) > > ns = nvme_find_get_ns(ctrl, nsid); > if (ns) { > - if (ns->disk && revalidate_disk(ns->disk)) > + if (ns->disk && revalidate_disk(ns->disk) && > + ctrl->state != NVME_CTRL_LIVE) > nvme_ns_remove(ns); That should be '== NVME_CTRL_LIVE', right? You don't want to remove it for the resetting state, and the removing states take care of removal directly. On the RESETTING state, shouldn't we avoid calling revalidate_disk() in the first place? 
An intermittent failure during reset could cause the disk to temporarily go to zero capacity, which may cause some filesystem issues that we could avoid if we know we're about to rescan this namespace. > nvme_put_ns(ns); > } else ^ permalink raw reply [flat|nested] 42+ messages in thread
* [PATCH rfc 1/2] nvme: don't remove namespace if revalidate failed because of controller reset 2019-07-30 0:59 ` Keith Busch @ 2019-07-30 1:04 ` Sagi Grimberg 0 siblings, 0 replies; 42+ messages in thread From: Sagi Grimberg @ 2019-07-30 1:04 UTC (permalink / raw) >> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c >> index fa31da0762b9..5f6970e7ba73 100644 >> --- a/drivers/nvme/host/core.c >> +++ b/drivers/nvme/host/core.c >> @@ -3428,7 +3428,8 @@ static void nvme_validate_ns(struct nvme_ctrl *ctrl, unsigned nsid) >> >> ns = nvme_find_get_ns(ctrl, nsid); >> if (ns) { >> - if (ns->disk && revalidate_disk(ns->disk)) >> + if (ns->disk && revalidate_disk(ns->disk) && >> + ctrl->state != NVME_CTRL_LIVE) >> nvme_ns_remove(ns); > > That should be '== NVME_CTRL_LIVE', right? You don't want to remove it > for the resetting state, and the removing states take care of removal > directly. rrr... you're right... > On the RESETTING state, shouldn't we avoid calling revalidate_disk() > in the first place? An intermittent failure during reset could cause > the disk to temporarily go to zero capacity, which may cause some > filesystem issues that we could avoid if we know we're about to rescan > this namespace. Yea, this should do the trick I guess: -- diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c index fa31da0762b9..d01976c93160 100644 --- a/drivers/nvme/host/core.c +++ b/drivers/nvme/host/core.c @@ -3428,7 +3428,8 @@ static void nvme_validate_ns(struct nvme_ctrl *ctrl, unsigned nsid) ns = nvme_find_get_ns(ctrl, nsid); if (ns) { - if (ns->disk && revalidate_disk(ns->disk)) + if (ns->disk && ctrl->state == NVME_CTRL_LIVE && + revalidate_disk(ns->disk)) nvme_ns_remove(ns); nvme_put_ns(ns); } else -- ^ permalink raw reply related [flat|nested] 42+ messages in thread
* [PATCH rfc 1/2] nvme: don't remove namespace if revalidate failed because of controller reset 2019-07-29 23:32 ` [PATCH rfc 1/2] nvme: don't remove namespace if revalidate failed because of controller reset Sagi Grimberg 2019-07-30 0:59 ` Keith Busch @ 2019-07-30 1:04 ` Ming Lei 2019-07-30 1:06 ` Sagi Grimberg 2019-07-31 12:18 ` Hannes Reinecke 2 siblings, 1 reply; 42+ messages in thread From: Ming Lei @ 2019-07-30 1:04 UTC (permalink / raw) On Tue, Jul 30, 2019@7:32 AM Sagi Grimberg <sagi@grimberg.me> wrote: > > If a controller reset is racing with a namespace revalidation, the > revalidation I/O will surely fail, but we should not remove the No, there is sync IO request in revalidation, and the sync IO can't be completed during resetting. Thanks, Ming Lei ^ permalink raw reply [flat|nested] 42+ messages in thread
* [PATCH rfc 1/2] nvme: don't remove namespace if revalidate failed because of controller reset 2019-07-30 1:04 ` Ming Lei @ 2019-07-30 1:06 ` Sagi Grimberg 2019-07-30 1:10 ` Ming Lei 0 siblings, 1 reply; 42+ messages in thread From: Sagi Grimberg @ 2019-07-30 1:06 UTC (permalink / raw) >> If a controller reset is racing with a namespace revalidation, the >> revalidation I/O will surely fail, but we should not remove the > > No, there is sync IO request in revalidation, and the sync IO can't be completed > during resetting. Why? of course it can. The controller reset is responsible to quiesce + abort + unquiesce both I/O and admin commands. ^ permalink raw reply [flat|nested] 42+ messages in thread
* [PATCH rfc 1/2] nvme: don't remove namespace if revalidate failed because of controller reset 2019-07-30 1:06 ` Sagi Grimberg @ 2019-07-30 1:10 ` Ming Lei 2019-07-30 1:19 ` Sagi Grimberg 0 siblings, 1 reply; 42+ messages in thread From: Ming Lei @ 2019-07-30 1:10 UTC (permalink / raw) On Tue, Jul 30, 2019@9:06 AM Sagi Grimberg <sagi@grimberg.me> wrote: > > > >> If a controller reset is racing with a namespace revalidation, the > >> revalidation I/O will surely fail, but we should not remove the > > > > No, there is sync IO request in revalidation, and the sync IO can't be completed > > during resetting. > > Why? of course it can. The controller reset is responsible to > quiesce + abort + unquiesce both I/O and admin commands. Abort simply cancels the in-flight requests, which will be retried & re-queued into blk-mq queues, and will be dispatched again after reset is done. That is why the revalidation IO won't be failed. Thanks, Ming Lei ^ permalink raw reply [flat|nested] 42+ messages in thread
* [PATCH rfc 1/2] nvme: don't remove namespace if revalidate failed because of controller reset 2019-07-30 1:10 ` Ming Lei @ 2019-07-30 1:19 ` Sagi Grimberg 2019-07-30 1:30 ` Ming Lei 0 siblings, 1 reply; 42+ messages in thread From: Sagi Grimberg @ 2019-07-30 1:19 UTC (permalink / raw) >>>> If a controller reset is racing with a namespace revalidation, the >>>> revalidation I/O will surely fail, but we should not remove the >>> >>> No, there is sync IO request in revalidation, and the sync IO can't be completed >>> during resetting. >> >> Why? of course it can. The controller reset is responsible to >> quiesce + abort + unquiesce both I/O and admin commands. > > Abort simply cancels the in-flight requests, which will be retried & > re-queued into > blk-mq queues, and will be dispatched again after reset is done. That is why the > revalidation IO won't be failed. These I/Os are admin which will not be retried (because we mark it with REQ_FAILFAST_DRIVER). As for normal I/O that meant to be retried, it still must either failfast or failover to another path. (see the patch I sent Logan for an issue in this area: [PATCH v3] nvme: fix controller removal race with scan work) ^ permalink raw reply [flat|nested] 42+ messages in thread
* [PATCH rfc 1/2] nvme: don't remove namespace if revalidate failed because of controller reset 2019-07-30 1:19 ` Sagi Grimberg @ 2019-07-30 1:30 ` Ming Lei 2019-07-30 1:40 ` Sagi Grimberg 0 siblings, 1 reply; 42+ messages in thread From: Ming Lei @ 2019-07-30 1:30 UTC (permalink / raw) On Tue, Jul 30, 2019@9:19 AM Sagi Grimberg <sagi@grimberg.me> wrote: > > > >>>> If a controller reset is racing with a namespace revalidation, the > >>>> revalidation I/O will surely fail, but we should not remove the > >>> > >>> No, there is sync IO request in revalidation, and the sync IO can't be completed > >>> during resetting. > >> > >> Why? of course it can. The controller reset is responsible to > >> quiesce + abort + unquiesce both I/O and admin commands. > > > > Abort simply cancels the in-flight requests, which will be retried & > > re-queued into > > blk-mq queues, and will be dispatched again after reset is done. That is why the > > revalidation IO won't be failed. > > These I/Os are admin which will not be retried (because we mark it with > REQ_FAILFAST_DRIVER). Yeah, this way is fine since data loss won't be caused. > As for normal I/O that meant to be retried, it > still must either failfast or failover to another path. Why? If the IO is failed, dirty pages may be lost, and it is one generic issue in either multipath or not, and I don't think the normal IO should be failed at least in case of non-multipath. > > (see the patch I sent Logan for an issue in this area: > [PATCH v3] nvme: fix controller removal race with scan work) The above patch is only for multipath. thanks, Ming Lei ^ permalink raw reply [flat|nested] 42+ messages in thread
* [PATCH rfc 1/2] nvme: don't remove namespace if revalidate failed because of controller reset 2019-07-30 1:30 ` Ming Lei @ 2019-07-30 1:40 ` Sagi Grimberg 2019-07-30 2:09 ` Ming Lei 0 siblings, 1 reply; 42+ messages in thread From: Sagi Grimberg @ 2019-07-30 1:40 UTC (permalink / raw) On 7/29/19 6:30 PM, Ming Lei wrote: > On Tue, Jul 30, 2019@9:19 AM Sagi Grimberg <sagi@grimberg.me> wrote: >> >> >>>>>> If a controller reset is racing with a namespace revalidation, the >>>>>> revalidation I/O will surely fail, but we should not remove the >>>>> >>>>> No, there is sync IO request in revalidation, and the sync IO can't be completed >>>>> during resetting. >>>> >>>> Why? of course it can. The controller reset is responsible to >>>> quiesce + abort + unquiesce both I/O and admin commands. >>> >>> Abort simply cancels the in-flight requests, which will be retried & >>> re-queued into >>> blk-mq queues, and will be dispatched again after reset is done. That is why the >>> revalidation IO won't be failed. >> >> These I/Os are admin which will not be retried (because we mark it with >> REQ_FAILFAST_DRIVER). > > Yeah, this way is fine since data loss won't be caused. > >> As for normal I/O that meant to be retried, it >> still must either failfast or failover to another path. > > Why? If the IO is failed, dirty pages may be lost, and it is one generic > issue in either multipath or not, and I don't think the normal IO should be > failed at least in case of non-multipath. This was meant only when the namespace is going away. Yea, if the namespace is not going away as described in this patch, then yes, it will block until reset is completed. >> (see the patch I sent Logan for an issue in this area: >> [PATCH v3] nvme: fix controller removal race with scan work) > > The above patch is only for multipath. Yes, and again, addresses the case that the namespace is going away. So I think we are in agreement? 
I only need to change the commit message from: "the revalidation I/O" to "the admin I/O" ? ^ permalink raw reply [flat|nested] 42+ messages in thread
* [PATCH rfc 1/2] nvme: don't remove namespace if revalidate failed because of controller reset 2019-07-30 1:40 ` Sagi Grimberg @ 2019-07-30 2:09 ` Ming Lei 2019-07-30 17:12 ` Sagi Grimberg 0 siblings, 1 reply; 42+ messages in thread From: Ming Lei @ 2019-07-30 2:09 UTC (permalink / raw) On Tue, Jul 30, 2019@9:40 AM Sagi Grimberg <sagi@grimberg.me> wrote: > > > > On 7/29/19 6:30 PM, Ming Lei wrote: > > On Tue, Jul 30, 2019@9:19 AM Sagi Grimberg <sagi@grimberg.me> wrote: > >> > >> > >>>>>> If a controller reset is racing with a namespace revalidation, the > >>>>>> revalidation I/O will surely fail, but we should not remove the > >>>>> > >>>>> No, there is sync IO request in revalidation, and the sync IO can't be completed > >>>>> during resetting. > >>>> > >>>> Why? of course it can. The controller reset is responsible to > >>>> quiesce + abort + unquiesce both I/O and admin commands. > >>> > >>> Abort simply cancels the in-flight requests, which will be retried & > >>> re-queued into > >>> blk-mq queues, and will be dispatched again after reset is done. That is why the > >>> revalidation IO won't be failed. > >> > >> These I/Os are admin which will not be retried (because we mark it with > >> REQ_FAILFAST_DRIVER). > > > > Yeah, this way is fine since data loss won't be caused. > > > >> As for normal I/O that meant to be retried, it > >> still must either failfast or failover to another path. > > > > Why? If the IO is failed, dirty pages may be lost, and it is one generic > > issue in either multipath or not, and I don't think the normal IO should be > > failed at least in case of non-multipath. > > This was meant only when the namespace is going away. > Yea, if the namespace is not going away as described in this patch, then > yes, it will block until reset is completed. > > >> (see the patch I sent Logan for an issue in this area: > >> [PATCH v3] nvme: fix controller removal race with scan work) > > > > The above patch is only for multipath. 
> > Yes, and again, addresses the case that the namespace is going away. > > So I think we are in agreement? I only need to change the commit > message from: "the revalidation I/O" to "the admin I/O" ? That wording of 'admin I/O' isn't related to the patch or issue. > Yea, this should do the trick I guess: > -- > diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c > index fa31da0762b9..d01976c93160 100644 > --- a/drivers/nvme/host/core.c > +++ b/drivers/nvme/host/core.c > @@ -3428,7 +3428,8 @@ static void nvme_validate_ns(struct nvme_ctrl > *ctrl, unsigned nsid) > > ns = nvme_find_get_ns(ctrl, nsid); > if (ns) { > - if (ns->disk && revalidate_disk(ns->disk)) > + if (ns->disk && ctrl->state == NVME_CTRL_LIVE && > + revalidate_disk(ns->disk)) > nvme_ns_remove(ns); > nvme_put_ns(ns); > } else If RESET is triggered just inside revalidate_disk(), and not done after revalidate_disk() returns, there is still a race between reset and scan work. For example, nvme_ns_remove() may hang forever, since the ns may be quiesced in RESET, and can't be un-quiesced any more. Thanks, Ming Lei ^ permalink raw reply [flat|nested] 42+ messages in thread
* [PATCH rfc 1/2] nvme: don't remove namespace if revalidate failed because of controller reset 2019-07-30 2:09 ` Ming Lei @ 2019-07-30 17:12 ` Sagi Grimberg 2019-07-30 17:30 ` Keith Busch 2019-07-31 6:58 ` Hannes Reinecke 0 siblings, 2 replies; 42+ messages in thread From: Sagi Grimberg @ 2019-07-30 17:12 UTC (permalink / raw) >> Yes, and again, addresses the case that the namespace is going away. >> >> So I think we are in agreement? I only need to change the commit >> message from: "the revalidation I/O" to "the admin I/O" ? > > That wording of 'admin I/O' isn't related to the patch or issue. But it is, the original issue was due to the fact that nvme_revalidate_disk() issues I/Os such as nvme_identify_ns() or nvme_identify_ns_descs(). This was the original issue. >> Yea, this should do the trick I guess: >> -- >> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c >> index fa31da0762b9..d01976c93160 100644 >> --- a/drivers/nvme/host/core.c >> +++ b/drivers/nvme/host/core.c >> @@ -3428,7 +3428,8 @@ static void nvme_validate_ns(struct nvme_ctrl >> *ctrl, unsigned nsid) >> >> ns = nvme_find_get_ns(ctrl, nsid); >> if (ns) { >> - if (ns->disk && revalidate_disk(ns->disk)) >> + if (ns->disk && ctrl->state == NVME_CTRL_LIVE && >> + revalidate_disk(ns->disk)) >> nvme_ns_remove(ns); >> nvme_put_ns(ns); >> } else > > If RESET is triggered just inside revalidate_disk(), and not done after > revalidate_disk() returns, there is still a race between reset and scan work. > You are correct, this was why I had the ctrl->state check after revalidate_disk so if it failed because we are in a reset we should not remove the namespace. We need a reliable way to NOT remove the namespace if revalidate_disk failed because the controller is resetting and we don't have a channel to the controller at this very moment... 
Keith, As for the failure during reset scenario, this is happening only when the namespace is about to go away or something is seriously wrong, right (looking at where nvme_kill_queues is called)? Do you still think we should avoid calling the revalidate_disk if the controller is resetting? ^ permalink raw reply [flat|nested] 42+ messages in thread
* [PATCH rfc 1/2] nvme: don't remove namespace if revalidate failed because of controller reset 2019-07-30 17:12 ` Sagi Grimberg @ 2019-07-30 17:30 ` Keith Busch 2019-07-30 18:15 ` Sagi Grimberg 2019-07-31 7:01 ` Hannes Reinecke 1 sibling, 2 replies; 42+ messages in thread From: Keith Busch @ 2019-07-30 17:30 UTC (permalink / raw) On Tue, Jul 30, 2019@10:12:42AM -0700, Sagi Grimberg wrote: > > > > Yes, and again, addresses the case that the namespace is going away. > > > > > > So I think we are in agreement? I only need to change the commit > > > message from: "the revalidation I/O" to "the admin I/O" ? > > > > That wording of 'admin I/O' isn't related to the patch or issue. > > But it is, the original issue was due to the fact that > nvme_revalidate_disk() issues I/Os such as nvme_identify_ns() or > nvme_identify_ns_descs(). This was the original issue. > > > > Yea, this should do the trick I guess: > > > -- > > > diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c > > > index fa31da0762b9..d01976c93160 100644 > > > --- a/drivers/nvme/host/core.c > > > +++ b/drivers/nvme/host/core.c > > > @@ -3428,7 +3428,8 @@ static void nvme_validate_ns(struct nvme_ctrl > > > *ctrl, unsigned nsid) > > > > > > ns = nvme_find_get_ns(ctrl, nsid); > > > if (ns) { > > > - if (ns->disk && revalidate_disk(ns->disk)) > > > + if (ns->disk && ctrl->state == NVME_CTRL_LIVE && > > > + revalidate_disk(ns->disk)) > > > nvme_ns_remove(ns); > > > nvme_put_ns(ns); > > > } else > > > > If RESET is triggered just inside revalidate_disk(), and not done after > > revalidate_disk() returns, there is still a race between reset and scan work. > > > > You are correct, this was why I had the ctrl->state check after > revalidate_disk so if it failed because we are in a reset we should > not remove the namespace. 
> > We need a reliable way to NOT remove the namespace if revalidate_disk > failed because the controller is resetting and we don't have a channel > to the controller at this very moment... > > Keith, > > As for the failure during reset scenario, this is happening only when > the namespace is about to go away or something is seriously wrong, right > (looking at where nvme_kill_queues is called)? > > Do you still think we should avoid calling the revalidate_disk if the > controller is resetting? I was considering if a reset happens to trigger when nvme's revalidate_disk tries to read identify namespace. It's possible that the command gets aborted, and we don't retry admin commands, so we'd return -ENODEV and nvme_validate_ns() removes an otherwise healthy namespace. I'm not too concerned about this corner case actually occurring in practice, though. ^ permalink raw reply [flat|nested] 42+ messages in thread
* [PATCH rfc 1/2] nvme: don't remove namespace if revalidate failed because of controller reset 2019-07-30 17:30 ` Keith Busch @ 2019-07-30 18:15 ` Sagi Grimberg 2019-07-31 7:13 ` Hannes Reinecke 1 sibling, 1 reply; 42+ messages in thread From: Sagi Grimberg @ 2019-07-30 18:15 UTC (permalink / raw) >> Keith, >> >> As for the failure during reset scenario, this is happening only when >> the namespace is about to go away or something is seriously wrong, right >> (looking at where nvme_kill_queues is called)? >> >> Do you still think we should avoid calling the revalidate_disk if the >> controller is resetting? > > I was considering if a reset happens to trigger when nvme's > revalidate_disk tries to read identify namespace. It's possible that > the command gets aborted, and we don't retry admin commands, so we'd return > -ENODEV and nvme_validate_ns() removes an otherwise healthy namespace. But if the ctrl->state condition is after the revalidate_disk call it won't remove the namespace. -- if (ns->disk && revalidate_disk(ns->disk) && ctrl->state == NVME_CTRL_LIVE) nvme_ns_remove(ns); -- > > I'm not too concerned about this corner case actually occurring in > practice, though. Say, why are we removing the namespace if the revalidate failed? I don't see an equivalent behavior in scsi or others in drivers/block.. ^ permalink raw reply [flat|nested] 42+ messages in thread
* [PATCH rfc 1/2] nvme: don't remove namespace if revalidate failed because of controller reset 2019-07-30 18:15 ` Sagi Grimberg @ 2019-07-31 7:13 ` Hannes Reinecke 2019-07-31 18:08 ` Sagi Grimberg 0 siblings, 1 reply; 42+ messages in thread From: Hannes Reinecke @ 2019-07-31 7:13 UTC (permalink / raw) On 7/30/19 8:15 PM, Sagi Grimberg wrote: > >>> Keith, >>> >>> As for the failure during reset scenario, this is happening only when >>> the namespace is about to go away or something is seriously wrong, right >>> (looking at where nvme_kill_queues is called)? >>> >>> Do you still think we should avoid calling the revalidate_disk if the >>> controller is resetting? >> >> I was considering if a reset happens to trigger when nvme's >> revalidate_disk tries to read identify namespace. It's possible that >> the command gets aborted, and we don't retry admin commands, so we'd return >> -ENODEV and nvme_validate_ns() removes an otherwise healthy namespace. > > But if the ctrl->state condition is after the revalidate_disk call it > won't remove the namespace. > -- > if (ns->disk && revalidate_disk(ns->disk) && > ctrl->state == NVME_CTRL_LIVE) > nvme_ns_remove(ns); > -- > >> >> I'm not too concerned about this corner case actually occurring in >> practice, though. > > Say, why are we removing the namespace if the revalidate failed? > I don't see an equivalent behavior in scsi or others in drivers/block.. That's probably due to the convoluted logic in nvme_scan_ns_list(), which scans all namespaces in the list and removes those that fail. That seems sensible, but it really shouldn't be called during reset, as we cannot get any sensible results for any of those calls. This was the reason for my original patch to flush the scanning thread before entering reset. Maybe we should revive it... Cheers, Hannes -- Dr. Hannes Reinecke Teamlead Storage & Networking hare at suse.de +49 911 74053 688 SUSE LINUX GmbH, Maxfeldstr. 
5, 90409 Nürnberg GF: Felix Imendörffer, Mary Higgins, Sri Rasiah HRB 21284 (AG Nürnberg) ^ permalink raw reply [flat|nested] 42+ messages in thread
* [PATCH rfc 1/2] nvme: don't remove namespace if revalidate failed because of controller reset 2019-07-31 7:13 ` Hannes Reinecke @ 2019-07-31 18:08 ` Sagi Grimberg 0 siblings, 0 replies; 42+ messages in thread From: Sagi Grimberg @ 2019-07-31 18:08 UTC (permalink / raw) >>>> Keith, >>>> >>>> As for the failure during reset scenario, this is happening only when >>>> the namespace is about to go away or something is seriously wrong, right >>>> (looking at where nvme_kill_queues is called)? >>>> >>>> Do you still think we should avoid calling the revalidate_disk if the >>>> controller is resetting? >>> >>> I was considering if a reset happens to trigger when nvme's >>> revalidate_disk tries to read identify namespace. It's possible that >>> the command gets aborted, and we don't retry admin commands, so we'd return >>> -ENODEV and nvme_validate_ns() removes an otherwise healthy namespace. >> >> But if the ctrl->state condition is after the revalidate_disk call it >> won't remove the namespace. >> -- >> if (ns->disk && revalidate_disk(ns->disk) && >> ctrl->state == NVME_CTRL_LIVE) >> nvme_ns_remove(ns); >> -- >> >>> >>> I'm not too concerned about this corner case actually occurring in >>> practice, though. >> >> Say, why are we removing the namespace if the revalidate failed? >> I don't see an equivalent behavior in scsi or others in drivers/block.. > > That's probably due to the convoluted logic in nvme_scan_ns_list(), > which scans all namespaces in the list and removes those that fail. > That seems sensible, but it really shouldn't be called during reset, > as we cannot get any sensible results for any of those calls. > > This was the reason for my original patch to flush the scanning thread > before entering reset. Maybe we should revive it... I stand firm that controller resets should not attempt to flush the scan_work with some hooks to try to make it bail out early. I just don't think it's the correct solution. 
Note that the scan flush existed in the reset flow before (nvme_stop_ctrl) but moved away because it caused issues before and probably will again if it's restored. When the controller resets, I/Os fail (fast-fail or block); the model is fine in my mind, but it needs some hashing out to be more aware of that. ^ permalink raw reply [flat|nested] 42+ messages in thread
* [PATCH rfc 1/2] nvme: don't remove namespace if revalidate failed because of controller reset 2019-07-30 17:30 ` Keith Busch 2019-07-30 18:15 ` Sagi Grimberg @ 2019-07-31 7:01 ` Hannes Reinecke 2019-07-31 14:16 ` Keith Busch 2019-07-31 18:03 ` Sagi Grimberg 1 sibling, 2 replies; 42+ messages in thread From: Hannes Reinecke @ 2019-07-31 7:01 UTC (permalink / raw) On 7/30/19 7:30 PM, Keith Busch wrote: > On Tue, Jul 30, 2019@10:12:42AM -0700, Sagi Grimberg wrote: >> >>>> Yes, and again, addresses the case that the namespace is going away. >>>> >>>> So I think we are in agreement? I only need to change the commit >>>> message from: "the revalidation I/O" to "the admin I/O" ? >>> >>> That wording of 'admin I/O' isn't related to the patch or issue. >> >> But it is, the original issue was due to the fact that >> nvme_revalidate_disk() issues I/Os such as nvme_identify_ns() or >> nvme_identify_ns_descs(). This was the original issue. >> >>>> Yea, this should do the trick I guess: >>>> -- >>>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c >>>> index fa31da0762b9..d01976c93160 100644 >>>> --- a/drivers/nvme/host/core.c >>>> +++ b/drivers/nvme/host/core.c >>>> @@ -3428,7 +3428,8 @@ static void nvme_validate_ns(struct nvme_ctrl >>>> *ctrl, unsigned nsid) >>>> >>>> ns = nvme_find_get_ns(ctrl, nsid); >>>> if (ns) { >>>> - if (ns->disk && revalidate_disk(ns->disk)) >>>> + if (ns->disk && ctrl->state == NVME_CTRL_LIVE && >>>> + revalidate_disk(ns->disk)) >>>> nvme_ns_remove(ns); >>>> nvme_put_ns(ns); >>>> } else >>> >>> If RESET is triggered just inside revalidate_disk(), and not done after >>> revalidate_disk() returns, there is still a race between reset and scan work. >>> >> >> You are correct, this was why I had the ctrl->state check after >> revalidate_disk so if it failed because we are in a reset we should >> not remove the namespace. 
>> >> We need a reliable way to NOT remove the namespace if revalidate_disk >> failed because the controller is resetting and we don't have a channel >> to the controller at this very moment... >> >> Keith, >> >> As for the failure during reset scenario, this is happening only when >> the namespace is about to go away or something is seriously wrong right >> (looking from where nvme_kill_queues is called). >> >> Do you still think we should avoid calling the revalidate_disk if the >> controller is resetting? > > I was considering if a reset happens to trigger when nvme's > revalidate_disk tries to read identify namespace. It's possible that > command gets aborted, and we don't retry admin commands, so we'd return > -ENODEV and nvme_validate_ns() removes an otherwise healthy namespace. > > I'm not too concerned about this corner case actually occurring in > practice, though. > ... discarding those poor folks having to hunt down this very same issue for several months now ... Yes, it _does_ occur. Not on PCI, mind, but definitely for FC. RDMA might have a slightly better chance of not hitting it, but even there we have seen it. Cheers, Hannes -- Dr. Hannes Reinecke Teamlead Storage & Networking hare at suse.de +49 911 74053 688 SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg GF: Felix Imendörffer, Mary Higgins, Sri Rasiah HRB 21284 (AG Nürnberg) ^ permalink raw reply [flat|nested] 42+ messages in thread
* [PATCH rfc 1/2] nvme: don't remove namespace if revalidate failed because of controller reset 2019-07-31 7:01 ` Hannes Reinecke @ 2019-07-31 14:16 ` Keith Busch 2019-07-31 18:03 ` Sagi Grimberg 1 sibling, 0 replies; 42+ messages in thread From: Keith Busch @ 2019-07-31 14:16 UTC (permalink / raw) On Wed, Jul 31, 2019@12:01:58AM -0700, Hannes Reinecke wrote: > > I'm not too concerned about this corner case actually occurring in > > practice, though. > > > ... discarding those poor folks having to hunt down this very same issue > for several months now ... > > Yes, it _does_ occur. > Not on PCI, mind, but definitely for FC. RDMA might have a slightly > better chance of not hitting it, but even there we have seen it. Fair enough! I'll re-read your patches with this better understanding of the context. :) ^ permalink raw reply [flat|nested] 42+ messages in thread
* [PATCH rfc 1/2] nvme: don't remove namespace if revalidate failed because of controller reset 2019-07-31 7:01 ` Hannes Reinecke 2019-07-31 14:16 ` Keith Busch @ 2019-07-31 18:03 ` Sagi Grimberg 2019-07-31 19:32 ` Keith Busch 1 sibling, 1 reply; 42+ messages in thread From: Sagi Grimberg @ 2019-07-31 18:03 UTC (permalink / raw) >> I was considering if a reset happens to trigger when nvme's >> revalidate_disk tries to read identify namespace. It's possible that >> command gets aborted, and we don't retry admin commands, so we'd return >> -ENODEV and nvme_validate_ns() removes an otherwise healthy namespace. >> >> I'm not too concerned about this corner case actually occurring in >> practice, though. >> > ... discarding those poor folks having to hunt down this very same issue > for several months now ... > > Yes, it _does_ occur. > Not on PCI, mind, but definitely for FC. RDMA might have a slightly > better chance of not hitting it, but even there we have seen it. And this patch prevents the namespace from being removed, which _is_ the wrong behavior we need to prevent. As I said, we should simply not remove that namespace, rather than trying to synchronize the removal with everything else... I think I asked this before but was not answered: why are we removing the namespace at all? Do others do the same thing (remove the disk if revalidation fails)? ^ permalink raw reply [flat|nested] 42+ messages in thread
* [PATCH rfc 1/2] nvme: don't remove namespace if revalidate failed because of controller reset 2019-07-31 18:03 ` Sagi Grimberg @ 2019-07-31 19:32 ` Keith Busch 2019-07-31 20:08 ` Sagi Grimberg 0 siblings, 1 reply; 42+ messages in thread From: Keith Busch @ 2019-07-31 19:32 UTC (permalink / raw) On Wed, Jul 31, 2019@11:03:41AM -0700, Sagi Grimberg wrote: > > I think I asked this but was not answered, why are we removing > the namespace at all? do others do the same thing (remove the > disk if revalidation fails)? If a namespace no longer exists, what do you want to do with it instead of removing it? ^ permalink raw reply [flat|nested] 42+ messages in thread
* [PATCH rfc 1/2] nvme: don't remove namespace if revalidate failed because of controller reset 2019-07-31 19:32 ` Keith Busch @ 2019-07-31 20:08 ` Sagi Grimberg 2019-07-31 20:16 ` Keith Busch 0 siblings, 1 reply; 42+ messages in thread From: Sagi Grimberg @ 2019-07-31 20:08 UTC (permalink / raw) >> I think I asked this but was not answered, why are we removing >> the namespace at all? do others do the same thing (remove the >> disk if revalidation fails)? > > If a namespace no longer exists, Why does it no longer exist? It failed revalidate.. > what do you want to do with it instead of removing it? Well, I don't see anyone else even checking the return status of revalidate_disk.. Perhaps Hannes can share more on how scsi handles this? ^ permalink raw reply [flat|nested] 42+ messages in thread
* [PATCH rfc 1/2] nvme: don't remove namespace if revalidate failed because of controller reset 2019-07-31 20:08 ` Sagi Grimberg @ 2019-07-31 20:16 ` Keith Busch 2019-07-31 20:45 ` Sagi Grimberg 0 siblings, 1 reply; 42+ messages in thread From: Keith Busch @ 2019-07-31 20:16 UTC (permalink / raw) On Wed, Jul 31, 2019@01:08:05PM -0700, Sagi Grimberg wrote: > > > > I think I asked this but was not answered, why are we removing > > > the namespace at all? do others do the same thing (remove the > > > disk if revalidation fails)? > > > > If a namespace no longer exists, > > Why is it no longer exists? it failed revalidate.. One way it fails to validate is if it doesn't exist, i.e., the controller returned an error when attempting to identify it. The other way it may fail to revalidate is if its identify has changed since we last discovered it, so removal is better than data corruption. Either scenario could happen from administrative namespace management commands to any controller in the subsystem. > > what do you want to do with it instead of removing it? > > Well, I don't see anyone else even checking the return > status of revalidate_disk.. Perhaps Hannes can share more > on how scsi handles this? Other drivers may have a different path to this, but I'd imagine they handle a LUN deletion somehow. ^ permalink raw reply [flat|nested] 42+ messages in thread
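Keith's two removal cases (identify fails because the namespace is gone, or its identity changed since discovery) can be sketched as a toy model. This is illustrative userspace C, not the kernel code; `struct ns_ids` and `should_remove()` are stand-ins for the NGUID/EUI-64/UUID tuple and the decision `nvme_validate_ns()` effectively makes:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for the namespace identifiers the driver caches at discovery. */
struct ns_ids { unsigned char nguid[16]; };

static bool ids_equal(const struct ns_ids *a, const struct ns_ids *b)
{
    return memcmp(a->nguid, b->nguid, sizeof(a->nguid)) == 0;
}

/* Returns true when the namespace should be removed:
 * - fresh == NULL models a failed identify (namespace no longer exists)
 * - a changed identity means keeping the old disk risks data corruption */
static bool should_remove(const struct ns_ids *cached,
                          const struct ns_ids *fresh)
{
    if (!fresh)
        return true;                  /* namespace gone */
    return !ids_equal(cached, fresh); /* identity changed: remove, don't corrupt */
}
```

Note this toy model cannot tell a "really gone" namespace from an identify that was aborted by a racing reset, which is exactly the ambiguity the thread is about.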
* [PATCH rfc 1/2] nvme: don't remove namespace if revalidate failed because of controller reset 2019-07-31 20:16 ` Keith Busch @ 2019-07-31 20:45 ` Sagi Grimberg 2019-07-31 20:58 ` Keith Busch 0 siblings, 1 reply; 42+ messages in thread From: Sagi Grimberg @ 2019-07-31 20:45 UTC (permalink / raw) >>>> I think I asked this but was not answered, why are we removing >>>> the namespace at all? do others do the same thing (remove the >>>> disk if revalidation fails)? >>> >>> If a namespace no longer exists, >> >> Why does it no longer exist? It failed revalidate.. > > One way it fails to validate is if it doesn't exist, i.e., the > controller returned an error when attempting to identify it. > > The other way it may fail to revalidate is if its identify has changed > since we last discovered it, so removal is better than data corruption. Well, perhaps we can mark failures resulting from reset with a transport error. For example, nvme_cancel_request is setting NVME_SC_ABORT_REQ; perhaps we can modify nvme_error_status to set that into BLK_STS_TRANSPORT and check for that as the return code for revalidate_disk? Thoughts? > Either scenario could happen from administrative namespace management > commands to any controller in the subsystem. > >>> what do you want to do with it instead of removing it? >> >> Well, I don't see anyone else even checking the return >> status of revalidate_disk.. Perhaps Hannes can share more >> on how scsi handles this? > Other drivers may have a different path to this, but I'd imagine they > handle a LUN deletion somehow. Well, it's not coming from a revalidate_disk failure, as the return status is not checked... ^ permalink raw reply [flat|nested] 42+ messages in thread
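Sagi's suggestion above can be sketched as a toy model: map the cancellation status to a transport error, then remove only on a genuine error. This is illustrative C, not the kernel's `nvme_error_status()`; the enums here merely stand in for the real NVMe status codes and `blk_status_t` values:

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-ins for the real NVMe status codes and blk_status_t values. */
enum nvme_status { NVME_SC_SUCCESS, NVME_SC_ABORT_REQ, NVME_SC_INVALID_NS };
enum blk_status  { BLK_STS_OK, BLK_STS_TRANSPORT, BLK_STS_IOERR };

/* Hypothetical mapping along the lines suggested: a command cancelled by
 * a racing reset (ABORT_REQ) becomes a transport error, not an I/O error. */
static enum blk_status nvme_error_status(enum nvme_status sc)
{
    switch (sc) {
    case NVME_SC_SUCCESS:   return BLK_STS_OK;
    case NVME_SC_ABORT_REQ: return BLK_STS_TRANSPORT; /* cancelled by reset */
    default:                return BLK_STS_IOERR;     /* genuine failure */
    }
}

/* Remove the namespace only on a genuine error; a transport error means
 * the revalidation itself never got a real answer from the controller. */
static bool should_remove_ns(enum nvme_status sc)
{
    return nvme_error_status(sc) == BLK_STS_IOERR;
}
```

The design point is that the decision is driven by *why* the revalidation failed, not by sampling the controller state, which can race.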
* [PATCH rfc 1/2] nvme: don't remove namespace if revalidate failed because of controller reset 2019-07-31 20:45 ` Sagi Grimberg @ 2019-07-31 20:58 ` Keith Busch 2019-07-31 21:14 ` Sagi Grimberg 0 siblings, 1 reply; 42+ messages in thread From: Keith Busch @ 2019-07-31 20:58 UTC (permalink / raw) On Wed, Jul 31, 2019@01:45:12PM -0700, Sagi Grimberg wrote: > > > > > > I think I asked this but was not answered, why are we removing > > > > > the namespace at all? do others do the same thing (remove the > > > > > disk if revalidation fails)? > > > > > > > > If a namespace no longer exists, > > > > > > Why is it no longer exists? it failed revalidate.. > > > > One way it fails to validate is if it doesn't exist, i.e., the > > controller returned an error when attempting to identify it. > > > > The other way it may fail to revalidate is if its identify has changed > > since we last discovered it, so removal is better than data corruption. > > Well, perhaps we can mark failures resulting from reset with a transport > error. > > For example, nvme_cancel_request is setting:NVME_SC_ABORT_REQ, perhaps > we can modify nvme_error_status to set that into BLK_STS_TRANSPORT and > check for that as the return code for revalidate_disk? > > Thoughts? Would it be sufficient to let these admin commands requeue? Instead of flushing the scan work, we can let it block for IO on a reset, and the IO will resume when the reset completes. ^ permalink raw reply [flat|nested] 42+ messages in thread
* [PATCH rfc 1/2] nvme: don't remove namespace if revalidate failed because of controller reset 2019-07-31 20:58 ` Keith Busch @ 2019-07-31 21:14 ` Sagi Grimberg 2019-07-31 21:54 ` Keith Busch 0 siblings, 1 reply; 42+ messages in thread From: Sagi Grimberg @ 2019-07-31 21:14 UTC (permalink / raw) >>> The other way it may fail to revalidate is if its identify has changed >>> since we last discovered it, so removal is better than data corruption. >> >> Well, perhaps we can mark failures resulting from reset with a transport >> error. >> >> For example, nvme_cancel_request is setting:NVME_SC_ABORT_REQ, perhaps >> we can modify nvme_error_status to set that into BLK_STS_TRANSPORT and >> check for that as the return code for revalidate_disk? >> >> Thoughts? > > Would it be sufficient to let these admin commands requeue? Instead of > flushing the scan work, we can let it block for IO on a reset, and the > IO will resume when the reset completes. Well, I don't think we should do that. Unlike I/O commands, which can failover to a different path, these admin commands are bound to the specific controller. In case it takes minutes/hours/days for the controller to restore normal operation, it will be unpleasant to say the least to have admin operations get stuck for so long. ^ permalink raw reply [flat|nested] 42+ messages in thread
* [PATCH rfc 1/2] nvme: don't remove namespace if revalidate failed because of controller reset 2019-07-31 21:14 ` Sagi Grimberg @ 2019-07-31 21:54 ` Keith Busch 2019-08-01 1:13 ` Sagi Grimberg 0 siblings, 1 reply; 42+ messages in thread From: Keith Busch @ 2019-07-31 21:54 UTC (permalink / raw) On Wed, Jul 31, 2019@02:14:25PM -0700, Sagi Grimberg wrote: > > > > > The other way it may fail to revalidate is if its identify has changed > > > > since we last discovered it, so removal is better than data corruption. > > > > > > Well, perhaps we can mark failures resulting from reset with a transport > > > error. > > > > > > For example, nvme_cancel_request is setting:NVME_SC_ABORT_REQ, perhaps > > > we can modify nvme_error_status to set that into BLK_STS_TRANSPORT and > > > check for that as the return code for revalidate_disk? > > > > > > Thoughts? > > > > Would it be sufficient to let these admin commands requeue? Instead of > > flushing the scan work, we can let it block for IO on a reset, and the > > IO will resume when the reset completes. > > Well, I don't think we should do that. Unlike I/O commands, which can > failover to a different path, these admin commands are bound to the > specific controller. In case it takes minutes/hours/days for the > controller to restore normal operation, it will be unpleasant to say > the least to have admin operations get stuck for so long. Unpleasant for who? The scan_work is the only thing waiting for these commands, no one else should care because you can't run IO if you're stuck in very long reset anyway. I think the main point is that we don't want to take a delete action on a transient condition, but sprinkling NVME_CTRL_LIVE checks is open to many other races. ^ permalink raw reply [flat|nested] 42+ messages in thread
* [PATCH rfc 1/2] nvme: don't remove namespace if revalidate failed because of controller reset 2019-07-31 21:54 ` Keith Busch @ 2019-08-01 1:13 ` Sagi Grimberg 2019-08-01 14:33 ` Keith Busch 0 siblings, 1 reply; 42+ messages in thread From: Sagi Grimberg @ 2019-08-01 1:13 UTC (permalink / raw) >> Well, I don't think we should do that. Unlike I/O commands, which can >> failover to a different path, these admin commands are bound to the >> specific controller. In case it takes minutes/hours/days for the >> controller to restore normal operation, it will be unpleasant to say >> the least to have admin operations get stuck for so long. > > Unpleasant for who? The scan_work is the only thing waiting for these > commands, no one else should care because you can't run IO if you're > stuck in very long reset anyway. The hung task detector would care, as would a user attempting to issue a passthru command, and the rest of the system, which has one of its kworkers sacrificed for a significant amount of time... > I think the main point is that we don't want to take a delete action on > a transient condition, but sprinkling NVME_CTRL_LIVE checks is open to > many other races. Hence I suggested the transport error code... ^ permalink raw reply [flat|nested] 42+ messages in thread
* [PATCH rfc 1/2] nvme: don't remove namespace if revalidate failed because of controller reset 2019-08-01 1:13 ` Sagi Grimberg @ 2019-08-01 14:33 ` Keith Busch 2019-08-01 18:52 ` Sagi Grimberg 0 siblings, 1 reply; 42+ messages in thread From: Keith Busch @ 2019-08-01 14:33 UTC (permalink / raw) On Wed, Jul 31, 2019@06:13:06PM -0700, Sagi Grimberg wrote: > > >> Well, I don't think we should do that. Unlike I/O commands, which can > >> failover to a different path, these admin commands are bound to the > >> specific controller. In case it takes minutes/hours/days for the > >> controller to restore normal operation, it will be unpleasant to say > >> the least to have admin operations get stuck for so long. > > > > Unpleasant for who? The scan_work is the only thing waiting for these > > commands, no one else should care because you can't run IO if you're > > stuck in very long reset anyway. > > The hung task detector would care, and a user who will attempt to issue > a passthru command, and the rest of the system that have one of the > kworkers sacrificed for a significant amount of time... blk_execute_rq already defeats hung task detection for stalled IO. My point, though, was passthru doesn't care about scan_work. A submitted passthru command is blocked for reset, so blocking scan_work doesn't make that situation any better or worse. > > I think the main point is that we don't want to take a delete action on > > a transient condition, but sprinkling NVME_CTRL_LIVE checks is open to > > many other races. > > Hence I suggested the transport error code... That should work too. ^ permalink raw reply [flat|nested] 42+ messages in thread
* [PATCH rfc 1/2] nvme: don't remove namespace if revalidate failed because of controller reset 2019-08-01 14:33 ` Keith Busch @ 2019-08-01 18:52 ` Sagi Grimberg 0 siblings, 0 replies; 42+ messages in thread From: Sagi Grimberg @ 2019-08-01 18:52 UTC (permalink / raw) >>>> Well, I don't think we should do that. Unlike I/O commands, which can >>>> failover to a different path, these admin commands are bound to the >>>> specific controller. In case it takes minutes/hours/days for the >>>> controller to restore normal operation, it will be unpleasant to say >>>> the least to have admin operations get stuck for so long. >>> >>> Unpleasant for who? The scan_work is the only thing waiting for these >>> commands, no one else should care because you can't run IO if you're >>> stuck in very long reset anyway. >> >> The hung task detector would care, and a user who will attempt to issue >> a passthru command, and the rest of the system that have one of the >> kworkers sacrificed for a significant amount of time... > > blk_execute_rq already defeats hung task detection for stalled IO. > > My point, though, was passthru doesn't care about scan_work. A submitted > passthru command is blocked for reset, Not in fabrics drivers (unless I'm missing something that changed). > so blocking scan_work doesn't make that situation any better or worse. I think that when we talk about reset in fabrics, we have in mind a long process (mainly because network port failures are a lot more frequent and span some amount of time). This is why fabric drivers, when they get a transport error, go into the reset flow and quiesce+terminate+unquiesce to fast-fail admin commands. ^ permalink raw reply [flat|nested] 42+ messages in thread
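The quiesce+terminate+unquiesce teardown Sagi describes can be sketched as a toy queue model. This is illustrative C, not the real block-layer API; the names and the fixed-depth array are stand-ins for the driver's tagset iteration and `nvme_cancel_request()`:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define QDEPTH 4
enum req_state { REQ_FREE, REQ_INFLIGHT, REQ_ABORTED };

struct queue {
    bool quiesced;
    enum req_state req[QDEPTH];
};

static void quiesce(struct queue *q)   { q->quiesced = true; }
static void unquiesce(struct queue *q) { q->quiesced = false; }

/* nvme_cancel_request() analogue: complete every in-flight command with
 * an abort status instead of leaving it stuck for the reconnect window. */
static void terminate_inflight(struct queue *q)
{
    for (size_t i = 0; i < QDEPTH; i++)
        if (q->req[i] == REQ_INFLIGHT)
            q->req[i] = REQ_ABORTED;
}

/* On a transport error: stop new submissions, fail what is in flight,
 * then resume so new commands fast-fail rather than hang. */
static void fabrics_error_recovery(struct queue *q)
{
    quiesce(q);
    terminate_inflight(q);
    unquiesce(q);
}
```

The point of the ordering is that nothing new can enter the queue while in-flight commands are being cancelled, so no command is missed.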
* [PATCH rfc 1/2] nvme: don't remove namespace if revalidate failed because of controller reset 2019-07-30 17:12 ` Sagi Grimberg 2019-07-30 17:30 ` Keith Busch @ 2019-07-31 6:58 ` Hannes Reinecke 2019-07-31 18:11 ` Sagi Grimberg 1 sibling, 1 reply; 42+ messages in thread From: Hannes Reinecke @ 2019-07-31 6:58 UTC (permalink / raw) On 7/30/19 7:12 PM, Sagi Grimberg wrote: > >>> Yes, and again, addresses the case that the namespace is going away. >>> >>> So I think we are in agreement? I only need to change the commit >>> message from: "the revalidation I/O" to "the admin I/O" ? >> >> That wording of 'admin I/O' isn't related to the patch or issue. > > But it is, the original issue was due to the fact that > nvme_revalidate_disk() issues admin I/Os such as nvme_identify_ns() or > nvme_identify_ns_descs(). This was the original issue. > >>> Yea, this should do the trick I guess: >>> -- >>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c >>> index fa31da0762b9..d01976c93160 100644 >>> --- a/drivers/nvme/host/core.c >>> +++ b/drivers/nvme/host/core.c >>> @@ -3428,7 +3428,8 @@ static void nvme_validate_ns(struct nvme_ctrl >>> *ctrl, unsigned nsid) >>> >>> ns = nvme_find_get_ns(ctrl, nsid); >>> if (ns) { >>> - if (ns->disk && revalidate_disk(ns->disk)) >>> + if (ns->disk && ctrl->state == NVME_CTRL_LIVE && >>> + revalidate_disk(ns->disk)) >>> nvme_ns_remove(ns); >>> nvme_put_ns(ns); >>> } else >> >> If RESET is triggered just inside revalidate_disk(), and not done after >> revalidate_disk() returns, there is still a race between reset and scan >> work. >> > > You are correct, this was why I had the ctrl->state check after > revalidate_disk so if it failed because we are in a reset we should > not remove the namespace. 
> > We need a reliable way to NOT remove the namespace if revalidate_disk > failed because the controller is resetting and we don't have a channel > to the controller at this very moment... > A similar thing is true for the 'ns_head' structure; I could easily envision a scenario where all paths are dropped (due to revalidate failures or whatnot) with all controllers in reset. But as all paths are dropped, struct ns_head doesn't have any reference to any namespaces (and consequently, to any controller), and will be failing all I/O before itself being removed. It will be re-created after resetting completes and all namespaces are re-attached, for sure, but it still means that we're getting I/O errors where we really shouldn't. One is tempted to say 'queue_if_no_path' here ... Cheers, Hannes ^ permalink raw reply [flat|nested] 42+ messages in thread
* [PATCH rfc 1/2] nvme: don't remove namespace if revalidate failed because of controller reset 2019-07-31 6:58 ` Hannes Reinecke @ 2019-07-31 18:11 ` Sagi Grimberg 2019-07-31 20:02 ` Hannes Reinecke 0 siblings, 1 reply; 42+ messages in thread From: Sagi Grimberg @ 2019-07-31 18:11 UTC (permalink / raw) >> You are correct, this was why I had the ctrl->state check after >> revalidate_disk so if it failed because we are in a reset we should >> not remove the namespace. >> >> We need a reliable way to NOT remove the namespace if revalidate_disk >> failed because the controller is resetting and we don't have a channel >> to the controller at this very moment... >> > A similar thing is true for the 'ns_head' structure; I could easily > envision a scenario where all paths are dropped (due to revalidate > failures or whatnot) with all controllers in reset. Hannes, paths should not be dropped in controller resets, only when the controller is going away. It's as simple as that. > But as all paths are dropped struct ns_head doesn't have any reference > to any namespaces (and consequently, to any controller), and will be > failing all I/O before itself being removed. > It will be re-created after resetting completes and all namespaces are > re-attached, for sure, but it still means that we're getting I/O errors > where we really shouldn't. Paths should only be removed when the controller is going away, not when it is temporarily disconnected. ^ permalink raw reply [flat|nested] 42+ messages in thread
* [PATCH rfc 1/2] nvme: don't remove namespace if revalidate failed because of controller reset 2019-07-31 18:11 ` Sagi Grimberg @ 2019-07-31 20:02 ` Hannes Reinecke 2019-07-31 20:16 ` Sagi Grimberg 0 siblings, 1 reply; 42+ messages in thread From: Hannes Reinecke @ 2019-07-31 20:02 UTC (permalink / raw) On 7/31/19 8:11 PM, Sagi Grimberg wrote: > >>> You are correct, this was why I had the ctrl->state check after >>> revalidate_disk so if it failed because we are in a reset we should >>> not remove the namespace. >>> >>> We need a reliable way to NOT remove the namespace if revalidate_disk >>> failed because the controller is resetting and we don't have a channel >>> to the controller at this very moment... >>> >> A similar thing is true for the 'ns_head' structure; I could easily >> envision a scenario where all paths are dropped (due to revalidate >> failures or whatnot) with all controllers in reset. > > Hannes, paths should not be dropped in controller resets, only when > the controller is going away. It's as simple as that. > Fully agree. But so far we don't seem to be there yet. >> But as all paths are dropped struct ns_head doesn't have any reference >> to any namespaces (and consequently, to any controller), and will be >> failing all I/O before itself being removed. >> It will be re-created after resetting completes and all namespaces are >> re-attached, for sure, but it still means that we're getting I/O errors >> where we really shouldn't. > > Paths should only be removed when the controller is going away, not when > it is temporarily disconnected. I'm fully in line with that. However, it looks as if this isn't the case currently :-) And whenever I try to come up with a solution here I always come back to the thought: It would've been easier if we could just flush scan_work ... And my main worry here is: surely we'll manage to hash out the situation eventually, and fix up all code paths to avoid dropping paths during reset. 
But the solution will be very fragile, as any innocent change there might break things again. So I'm not convinced that asynchronous reset and rescan is a feature which should be kept. (And honestly, we're working towards a solution where scanning during reset will become a no-op ...) But anyway, the prime objective is to get this mess fixed :-) Cheers, Hannes ^ permalink raw reply [flat|nested] 42+ messages in thread
* [PATCH rfc 1/2] nvme: don't remove namespace if revalidate failed because of controller reset 2019-07-31 20:02 ` Hannes Reinecke @ 2019-07-31 20:16 ` Sagi Grimberg 0 siblings, 0 replies; 42+ messages in thread From: Sagi Grimberg @ 2019-07-31 20:16 UTC (permalink / raw) > Fully agree. But so far we don't seem to be there yet. Right. >>> But as all paths are dropped struct ns_head doesn't have any reference >>> to any namespaces (and consequently, to any controller), and will be >>> failing all I/O before itself being removed. >>> It will be re-created after resetting completes and all namespaces are >>> re-attached, for sure, but it still means that we're getting I/O errors >>> where we really shouldn't. >> >> Paths should only be removed when the controller is going away, not when >> it is temporarily disconnected. > > I'm fully in line with that. > However, it looks as if this isn't the case currently :-) > And whenever I try to come up with a solution here I always come back to > the thought: It would've been easier if we just could flush scan_work ... > > And my main worry here is: surely we'll manage to hash out the situation > eventually, and fixup all code paths to avoid dropping paths during reset. > But the solution will be very fragile, as any innocent change there > might break things again. So I'm not convinced that asynchronous reset > and rescan is a feature which should be kept. > (And honestly, we're working towards a solution where scanning during > reset will become a no-op ...) But flushing the scan_work is even more fragile.. the scan_work is a non-trivial work that may have dependencies with other processes. For example, passthru commands with effects take the scan lock during their lifetime, which means that they may be blocked even before they check the ctrl state (which is, again, unreliable with respect to resets). So do resets flush all outstanding passthru commands now? 
So I think that flushing the scan_work might bring us new issues because of this dependency. That is why I think we should address the root cause, which is that we're wrongly removing the namespace, not the fact that we failed to scan it. ^ permalink raw reply [flat|nested] 42+ messages in thread
* [PATCH rfc 1/2] nvme: don't remove namespace if revalidate failed because of controller reset 2019-07-29 23:32 ` [PATCH rfc 1/2] nvme: don't remove namespace if revalidate failed because of controller reset Sagi Grimberg 2019-07-30 0:59 ` Keith Busch 2019-07-30 1:04 ` Ming Lei @ 2019-07-31 12:18 ` Hannes Reinecke 2019-07-31 18:16 ` Sagi Grimberg 2 siblings, 1 reply; 42+ messages in thread From: Hannes Reinecke @ 2019-07-31 12:18 UTC (permalink / raw) On 7/30/19 1:32 AM, Sagi Grimberg wrote: > If a controller reset is racing with a namespace revalidation, the > revalidation I/O will surely fail, but we should not remove the > namespace as we will execute the I/O when the controller is LIVE again. > Specifically check if the controller is LIVE because > RESETTING/CONNECTING are transient and DELETING/DEAD will eventually > remove the namespace in the removal code path. > > This fixes spurious I/O errors in path failover conditions where the > controller reset is racing with the namespace scan work. > > Reported-by: Hannes Reinecke <hare at suse.de> > Signed-off-by: Sagi Grimberg <sagi at grimberg.me> > --- > drivers/nvme/host/core.c | 3 ++- > 1 file changed, 2 insertions(+), 1 deletion(-) > > diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c > index fa31da0762b9..5f6970e7ba73 100644 > --- a/drivers/nvme/host/core.c > +++ b/drivers/nvme/host/core.c > @@ -3428,7 +3428,8 @@ static void nvme_validate_ns(struct nvme_ctrl *ctrl, unsigned nsid) > > ns = nvme_find_get_ns(ctrl, nsid); > if (ns) { > - if (ns->disk && revalidate_disk(ns->disk)) > + if (ns->disk && revalidate_disk(ns->disk) && > + ctrl->state == NVME_CTRL_LIVE) > nvme_ns_remove(ns); > nvme_put_ns(ns); > } else > The thing I'm worried about here is concurrency (this was also what caused my earlier attempts to be rejected). Thing is, the controller state can be set asynchronously. Which in itself is protected by ctrl->lock, but here we're not taking the lock at all. 
Hence we might be seeing a stale value, causing us to make the wrong decision. I have a patchset checking 'ctrl->state' under the lock; I will be posting it later. And another thing: what's the point in calling 'revalidate_disk()' if the controller is not live? At best it won't do anything; at worst we'll stall if we have to do more than one I/O (the reset cycle will abort _one_ I/O, but if we have to submit more than one we'll be stuck, too). Cheers, Hannes ^ permalink raw reply [flat|nested] 42+ messages in thread
* [PATCH rfc 1/2] nvme: don't remove namespace if revalidate failed because of controller reset 2019-07-31 12:18 ` Hannes Reinecke @ 2019-07-31 18:16 ` Sagi Grimberg 2019-07-31 20:04 ` Hannes Reinecke 0 siblings, 1 reply; 42+ messages in thread From: Sagi Grimberg @ 2019-07-31 18:16 UTC (permalink / raw) >> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c >> index fa31da0762b9..5f6970e7ba73 100644 >> --- a/drivers/nvme/host/core.c >> +++ b/drivers/nvme/host/core.c >> @@ -3428,7 +3428,8 @@ static void nvme_validate_ns(struct nvme_ctrl *ctrl, unsigned nsid) >> >> ns = nvme_find_get_ns(ctrl, nsid); >> if (ns) { >> - if (ns->disk && revalidate_disk(ns->disk)) >> + if (ns->disk && revalidate_disk(ns->disk) && >> + ctrl->state == NVME_CTRL_LIVE) >> nvme_ns_remove(ns); >> nvme_put_ns(ns); >> } else >> > The thing I'm worried about here is concurrency (this was also what > caused my earlier attempts to be rejected). > Thing is, the controller state can be set asynchronously. Which in > itself is protected by ctrl->lock, but here we're not taking the lock at > all. Hence we might be seeing a stale value, causing us to make the > wrong decision. This is why the state check is _after_ the failure: if we failed to do I/O because of a reset, it will be reflected in the state. > I have a patchset checking 'ctrl->state' under lock; will be posting it > later. Unless the lock wraps your actions according to the state, the state is still advisory. > And another thing: where's the point in 'revalidate_disk()' to be called > if the controller is not live? At best it won't do anything; at worst > we'll stall if we have to do more than one I/O (the reset cycle will > abort _one_ I/O, but if we have to submit more than one we'll be stuck, > too). Exactly because of the race that you are indicating. We can't reliably know if the controller is resetting (or will be starting a reset soon) so we go ahead with the revalidate knowing that if the controller is resetting it will fail. 
That is why the state check is after (this tells us if the revalidation failed because of a reset or something else). ^ permalink raw reply [flat|nested] 42+ messages in thread
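The check-before vs. check-after argument can be sketched as a toy race model. This is illustrative C, not the kernel code: a "reset" flips the state while the revalidation I/O is in flight, and the two helpers model the two places the state could be sampled:

```c
#include <assert.h>
#include <stdbool.h>

enum ctrl_state { CTRL_LIVE, CTRL_RESETTING };
struct ctrl { enum ctrl_state state; };

/* Toy revalidate: when reset_races is set, a reset starts while the
 * admin I/O is in flight, flipping the state and failing the command. */
static bool revalidate_disk(struct ctrl *c, bool reset_races)
{
    if (reset_races) {
        c->state = CTRL_RESETTING; /* reset begins mid-revalidation */
        return false;              /* admin command gets cancelled */
    }
    return true;
}

/* Check-before (racy): the state is still LIVE when sampled, so a
 * failure caused by the racing reset would still remove the namespace. */
static bool remove_check_before(struct ctrl *c, bool reset_races)
{
    bool live = (c->state == CTRL_LIVE);
    bool failed = !revalidate_disk(c, reset_races);
    return live && failed;
}

/* Check-after (the patch's ordering): by the time we look, the reset is
 * reflected in the state, so the namespace is kept. */
static bool remove_check_after(struct ctrl *c, bool reset_races)
{
    bool failed = !revalidate_disk(c, reset_races);
    return failed && c->state == CTRL_LIVE;
}
```

In this model the check-before variant wrongly removes a namespace when the reset races in, while the check-after variant keeps it, which is the point Sagi is making.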
* [PATCH rfc 1/2] nvme: don't remove namespace if revalidate failed because of controller reset 2019-07-31 18:16 ` Sagi Grimberg @ 2019-07-31 20:04 ` Hannes Reinecke 2019-07-31 20:37 ` Sagi Grimberg 0 siblings, 1 reply; 42+ messages in thread From: Hannes Reinecke @ 2019-07-31 20:04 UTC (permalink / raw) On 7/31/19 8:16 PM, Sagi Grimberg wrote: > >>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c >>> index fa31da0762b9..5f6970e7ba73 100644 >>> --- a/drivers/nvme/host/core.c >>> +++ b/drivers/nvme/host/core.c >>> @@ -3428,7 +3428,8 @@ static void nvme_validate_ns(struct nvme_ctrl *ctrl, unsigned nsid) >>> ns = nvme_find_get_ns(ctrl, nsid); >>> if (ns) { >>> - if (ns->disk && revalidate_disk(ns->disk)) >>> + if (ns->disk && revalidate_disk(ns->disk) && >>> + ctrl->state == NVME_CTRL_LIVE) >>> nvme_ns_remove(ns); >>> nvme_put_ns(ns); >>> } else >>> >> The thing I'm worried about here is concurrency (this was also what >> caused my earlier attempts to be rejected). >> Thing is, the controller state can be set asynchronously. Which in >> itself is protected by ctrl->lock, but here we're not taking the lock at >> all. Hence we might be seeing a stale value, causing us to make the >> wrong decision. > > This is why the state check is _after_ the failure, if we failed to do > I/O because of a reset, it will be reflected in the state. > >> I have a patchset checking 'ctrl->state' under lock; will be posting it >> later. > > Unless the lock wraps your actions according to the state, the state is still > advisory. > >> And another thing: where's the point in 'revalidate_disk()' to be called >> if the controller is not live? At best it won't do anything; at worst >> we'll stall if we have to do more than one I/O (the reset cycle will >> abort _one_ I/O, but if we have to submit more than one we'll be stuck, >> too). > > Exactly because of the race that you are indicating. 
> We can't reliably know if the controller is resetting (or will be
> starting a reset soon) so we go ahead with the revalidate knowing that
> if the controller is resetting it will fail. That is why the state
> check is after (this tells us if the revalidation failed because of a
> reset or something else).

Okay, fair enough.

Reviewed-by: Hannes Reinecke <hare at suse.com>

Cheers,

Hannes
--
Dr. Hannes Reinecke                Teamlead Storage & Networking
hare at suse.de                       +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
HRB 21284 (AG Nürnberg)
* [PATCH rfc 1/2] nvme: don't remove namespace if revalidate failed because of controller reset
  2019-07-31 20:04         ` Hannes Reinecke
@ 2019-07-31 20:37           ` Sagi Grimberg
  0 siblings, 0 replies; 42+ messages in thread
From: Sagi Grimberg @ 2019-07-31 20:37 UTC (permalink / raw)

>> Exactly because of the race that you are indicating. We can't reliably
>> know if the controller is resetting (or will be starting a reset soon)
>> so we go ahead with the revalidate knowing that if the controller is
>> resetting it will fail. That is why the state check is after (this tells
>> us if the revalidation failed because of a reset or something else).
>
> Okay, fair enough.
>
> Reviewed-by: Hannes Reinecke <hare at suse.com>

Hannes,

Can we understand if this addresses the issues you have reported?
From the cover-letter:

Hey Hannes,

Here are two patches that, to my understanding, address the issues you
describe in your patchset: "nvme: flush rescan worker before resetting"
and your report of "spurious I/O errors during failover".

[...]

Can you please apply the following two patches and report if they
address the issues you are seeing? And if not, can you please report a
call trace of the hung threads?

And, given that you are in a multipath environment, can you apply these
on top of: "nvme: fix controller removal race with scan work"?
* [PATCH rfc 2/2] nvme: fix possible use-after-free condition when controller reset is racing namespace scanning
  2019-07-29 23:31 [PATCH rfc 0/2] nvme controller reset and namespace scan work race conditions Sagi Grimberg
  2019-07-29 23:32 ` [PATCH rfc 1/2] nvme: don't remove namespace if revalidate failed because of controller reset Sagi Grimberg
@ 2019-07-29 23:32 ` Sagi Grimberg
  2019-07-31 12:23   ` Hannes Reinecke
  1 sibling, 1 reply; 42+ messages in thread
From: Sagi Grimberg @ 2019-07-29 23:32 UTC (permalink / raw)

Remove the ns from the ctrl namespaces list before destroying the
request queue, as the reset work will unquiesce all the controller
namespaces, while an online scan might remove invalid namespaces
resulting from a concurrent AEN.

While we're at it, move nvme_mpath_check_last_path right after the
other multipath related cleanups.

Reported-by: Hannes Reinecke <hare at suse.de>
Signed-off-by: Sagi Grimberg <sagi at grimberg.me>
---
 drivers/nvme/host/core.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 5f6970e7ba73..9f8f8f5feeae 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -3406,6 +3406,11 @@ static void nvme_ns_remove(struct nvme_ns *ns)
 	synchronize_rcu(); /* guarantee not available in head->list */
 	nvme_mpath_clear_current_path(ns);
 	synchronize_srcu(&ns->head->srcu); /* wait for concurrent submissions */
+	nvme_mpath_check_last_path(ns);
+
+	down_write(&ns->ctrl->namespaces_rwsem);
+	list_del_init(&ns->list);
+	up_write(&ns->ctrl->namespaces_rwsem);

 	if (ns->disk && ns->disk->flags & GENHD_FL_UP) {
 		del_gendisk(ns->disk);
@@ -3414,11 +3419,6 @@ static void nvme_ns_remove(struct nvme_ns *ns)
 		blk_integrity_unregister(ns->disk);
 	}

-	down_write(&ns->ctrl->namespaces_rwsem);
-	list_del_init(&ns->list);
-	up_write(&ns->ctrl->namespaces_rwsem);
-
-	nvme_mpath_check_last_path(ns);
 	nvme_put_ns(ns);
 }
--
2.17.1
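For context, the reset-side walker this patch is protecting looks roughly like the following sketch (close to, but not guaranteed to match byte-for-byte, the upstream nvme_stop_queues() of that era). It iterates ctrl->namespaces under namespaces_rwsem and touches each ns->queue; if nvme_ns_remove() only delisted the namespace *after* tearing down the request queue, this walk could dereference a dead queue, which is the use-after-free window the reordering closes:

```c
/* Sketch of the reset-side iteration over controller namespaces.
 * Every entry still on ctrl->namespaces must have a valid request
 * queue; delisting the ns before queue teardown preserves that
 * invariant for this walk. */
void nvme_stop_queues(struct nvme_ctrl *ctrl)
{
	struct nvme_ns *ns;

	down_read(&ctrl->namespaces_rwsem);
	list_for_each_entry(ns, &ctrl->namespaces, list)
		blk_mq_quiesce_queue(ns->queue);
	up_read(&ctrl->namespaces_rwsem);
}
```

Since list_del_init() in the patched nvme_ns_remove() happens under the write side of the same rwsem, the walker either sees the namespace with its queue intact, or not at all.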
* [PATCH rfc 2/2] nvme: fix possible use-after-free condition when controller reset is racing namespace scanning
  2019-07-29 23:32 ` [PATCH rfc 2/2] nvme: fix possible use-after-free condition when controller reset is racing namespace scanning Sagi Grimberg
@ 2019-07-31 12:23   ` Hannes Reinecke
  2019-07-31 18:21     ` Sagi Grimberg
  0 siblings, 1 reply; 42+ messages in thread
From: Hannes Reinecke @ 2019-07-31 12:23 UTC (permalink / raw)

On 7/30/19 1:32 AM, Sagi Grimberg wrote:
> Remove the ns from the ctrl namespaces list before destroying the
> request queue, as the reset work will unquiesce all the controller
> namespaces, while an online scan might remove invalid namespaces
> resulting from a concurrent AEN.
>
> While we're at it, move nvme_mpath_check_last_path right after the
> other multipath related cleanups.
>
> Reported-by: Hannes Reinecke <hare at suse.de>
> Signed-off-by: Sagi Grimberg <sagi at grimberg.me>
> ---
>  drivers/nvme/host/core.c | 10 +++++-----
>  1 file changed, 5 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 5f6970e7ba73..9f8f8f5feeae 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -3406,6 +3406,11 @@ static void nvme_ns_remove(struct nvme_ns *ns)
>  	synchronize_rcu(); /* guarantee not available in head->list */
>  	nvme_mpath_clear_current_path(ns);
>  	synchronize_srcu(&ns->head->srcu); /* wait for concurrent submissions */
> +	nvme_mpath_check_last_path(ns);
> +
> +	down_write(&ns->ctrl->namespaces_rwsem);
> +	list_del_init(&ns->list);
> +	up_write(&ns->ctrl->namespaces_rwsem);
>
>  	if (ns->disk && ns->disk->flags & GENHD_FL_UP) {
>  		del_gendisk(ns->disk);
> @@ -3414,11 +3419,6 @@ static void nvme_ns_remove(struct nvme_ns *ns)
>  		blk_integrity_unregister(ns->disk);
>  	}
>
> -	down_write(&ns->ctrl->namespaces_rwsem);
> -	list_del_init(&ns->list);
> -	up_write(&ns->ctrl->namespaces_rwsem);
> -
> -	nvme_mpath_check_last_path(ns);
>  	nvme_put_ns(ns);
>  }
>
How does this one play with nvme_stop_queues()/nvme_start_queues() we're
doing during reset?
IE what happens if this is called after nvme_stop_queues(), but before
nvme_start_queues()?
We'll end up having a stopped queue when deleting the disk; from what
I've seen blk_cleanup_queue() will just freeze the queue and wait for
I/O to complete, which it never will, as the queue is stopped ...

Cheers,

Hannes
--
Dr. Hannes Reinecke                Teamlead Storage & Networking
hare at suse.de                       +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
HRB 21284 (AG Nürnberg)
* [PATCH rfc 2/2] nvme: fix possible use-after-free condition when controller reset is racing namespace scanning
  2019-07-31 12:23   ` Hannes Reinecke
@ 2019-07-31 18:21     ` Sagi Grimberg
  2019-08-01  7:24       ` Hannes Reinecke
  0 siblings, 1 reply; 42+ messages in thread
From: Sagi Grimberg @ 2019-07-31 18:21 UTC (permalink / raw)

>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>> index 5f6970e7ba73..9f8f8f5feeae 100644
>> --- a/drivers/nvme/host/core.c
>> +++ b/drivers/nvme/host/core.c
>> @@ -3406,6 +3406,11 @@ static void nvme_ns_remove(struct nvme_ns *ns)
>>  	synchronize_rcu(); /* guarantee not available in head->list */
>>  	nvme_mpath_clear_current_path(ns);
>>  	synchronize_srcu(&ns->head->srcu); /* wait for concurrent submissions */
>> +	nvme_mpath_check_last_path(ns);
>> +
>> +	down_write(&ns->ctrl->namespaces_rwsem);
>> +	list_del_init(&ns->list);
>> +	up_write(&ns->ctrl->namespaces_rwsem);
>>
>>  	if (ns->disk && ns->disk->flags & GENHD_FL_UP) {
>>  		del_gendisk(ns->disk);
>> @@ -3414,11 +3419,6 @@ static void nvme_ns_remove(struct nvme_ns *ns)
>>  		blk_integrity_unregister(ns->disk);
>>  	}
>>
>> -	down_write(&ns->ctrl->namespaces_rwsem);
>> -	list_del_init(&ns->list);
>> -	up_write(&ns->ctrl->namespaces_rwsem);
>> -
>> -	nvme_mpath_check_last_path(ns);
>>  	nvme_put_ns(ns);
>>  }
>>
> How does this one play with nvme_stop_queues()/nvme_start_queues() we're
> doing during reset?
> IE what happens if this is called after nvme_stop_queues(), but before
> nvme_start_queues()?
> We'll end up having a stopped queue when deleting the disk; from what
> I've seen blk_cleanup_queue() will just freeze the queue and wait for
> I/O to complete, which it never will, as the queue is stopped ...

nvme_stop/start_queues are quiescing the queues, not freezing them;
quiescing does not block requests in blk_queue_enter.
So I don't think that blk_cleanup_queue will hang for quiesced queues.
* [PATCH rfc 2/2] nvme: fix possible use-after-free condition when controller reset is racing namespace scanning
  2019-07-31 18:21     ` Sagi Grimberg
@ 2019-08-01  7:24       ` Hannes Reinecke
  2019-08-01 18:46         ` Sagi Grimberg
  0 siblings, 1 reply; 42+ messages in thread
From: Hannes Reinecke @ 2019-08-01 7:24 UTC (permalink / raw)

On 7/31/19 8:21 PM, Sagi Grimberg wrote:
>
>>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>>> index 5f6970e7ba73..9f8f8f5feeae 100644
>>> --- a/drivers/nvme/host/core.c
>>> +++ b/drivers/nvme/host/core.c
>>> @@ -3406,6 +3406,11 @@ static void nvme_ns_remove(struct nvme_ns *ns)
>>>  	synchronize_rcu(); /* guarantee not available in head->list */
>>>  	nvme_mpath_clear_current_path(ns);
>>>  	synchronize_srcu(&ns->head->srcu); /* wait for concurrent submissions */
>>> +	nvme_mpath_check_last_path(ns);
>>> +
>>> +	down_write(&ns->ctrl->namespaces_rwsem);
>>> +	list_del_init(&ns->list);
>>> +	up_write(&ns->ctrl->namespaces_rwsem);
>>>
>>>  	if (ns->disk && ns->disk->flags & GENHD_FL_UP) {
>>>  		del_gendisk(ns->disk);
>>> @@ -3414,11 +3419,6 @@ static void nvme_ns_remove(struct nvme_ns *ns)
>>>  		blk_integrity_unregister(ns->disk);
>>>  	}
>>>
>>> -	down_write(&ns->ctrl->namespaces_rwsem);
>>> -	list_del_init(&ns->list);
>>> -	up_write(&ns->ctrl->namespaces_rwsem);
>>> -
>>> -	nvme_mpath_check_last_path(ns);
>>>  	nvme_put_ns(ns);
>>>  }
>>>
>> How does this one play with nvme_stop_queues()/nvme_start_queues() we're
>> doing during reset?
>> IE what happens if this is called after nvme_stop_queues(), but before
>> nvme_start_queues()?
>> We'll end up having a stopped queue when deleting the disk; from what
>> I've seen blk_cleanup_queue() will just freeze the queue and wait for
>> I/O to complete, which it never will, as the queue is stopped ...
>
> nvme_stop/start_queues are quiescing the queues, not freezing them;
> quiescing does not block requests in blk_queue_enter.
> So I don't think that blk_cleanup_queue will hang for quiesced queues.

My worry here is for quiesced queues; they are not unquiesced when
blk_cleanup_queue() is running. So I'm not sure if we're able to flush
all commands then.

But be it as it may, I've sent a patchset for checking the NS_REMOVING
flag when traversing the namespace list. That should take care of this
issue, and should render this patch obsolete.

Cheers,

Hannes
--
Dr. Hannes Reinecke                Teamlead Storage & Networking
hare at suse.de                       +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
HRB 21284 (AG Nürnberg)
* [PATCH rfc 2/2] nvme: fix possible use-after-free condition when controller reset is racing namespace scanning
  2019-08-01  7:24       ` Hannes Reinecke
@ 2019-08-01 18:46         ` Sagi Grimberg
  0 siblings, 0 replies; 42+ messages in thread
From: Sagi Grimberg @ 2019-08-01 18:46 UTC (permalink / raw)

>> nvme_stop/start_queues are quiescing the queues, not freezing them;
>> quiescing does not block requests in blk_queue_enter.
>> So I don't think that blk_cleanup_queue will hang for quiesced queues.
>
> My worry here is for quiesced queues; they are not unquiesced when
> blk_cleanup_queue() is running. So I'm not sure if we're able to flush
> all commands then.

blk_cleanup_queue does not block on quiesced queues, it blocks on frozen
queues.

> But be it as it may, I've sent a patchset for checking the NS_REMOVING
> flag when traversing the namespace list. That should take care of this
> issue, and should render this patch obsolete.

I've shared feedback on that one; I need to understand what the crash
you are talking about is, and why promoting the list removal in
nvme_ns_remove() doesn't address it.
end of thread, other threads:[~2019-08-01 18:52 UTC | newest]

Thread overview: 42+ messages
2019-07-29 23:31 [PATCH rfc 0/2] nvme controller reset and namespace scan work race conditions Sagi Grimberg
2019-07-29 23:32 ` [PATCH rfc 1/2] nvme: don't remove namespace if revalidate failed because of controller reset Sagi Grimberg
2019-07-30  0:59   ` Keith Busch
2019-07-30  1:04     ` Sagi Grimberg
2019-07-30  1:04   ` Ming Lei
2019-07-30  1:06     ` Sagi Grimberg
2019-07-30  1:10       ` Ming Lei
2019-07-30  1:19         ` Sagi Grimberg
2019-07-30  1:30           ` Ming Lei
2019-07-30  1:40             ` Sagi Grimberg
2019-07-30  2:09               ` Ming Lei
2019-07-30 17:12                 ` Sagi Grimberg
2019-07-30 17:30                   ` Keith Busch
2019-07-30 18:15                     ` Sagi Grimberg
2019-07-31  7:13                       ` Hannes Reinecke
2019-07-31 18:08                         ` Sagi Grimberg
2019-07-31  7:01                 ` Hannes Reinecke
2019-07-31 14:16                   ` Keith Busch
2019-07-31 18:03                     ` Sagi Grimberg
2019-07-31 19:32                       ` Keith Busch
2019-07-31 20:08                         ` Sagi Grimberg
2019-07-31 20:16                           ` Keith Busch
2019-07-31 20:45                             ` Sagi Grimberg
2019-07-31 20:58                               ` Keith Busch
2019-07-31 21:14                                 ` Sagi Grimberg
2019-07-31 21:54                                   ` Keith Busch
2019-08-01  1:13                                     ` Sagi Grimberg
2019-08-01 14:33                                       ` Keith Busch
2019-08-01 18:52                                         ` Sagi Grimberg
2019-07-31  6:58   ` Hannes Reinecke
2019-07-31 18:11     ` Sagi Grimberg
2019-07-31 20:02       ` Hannes Reinecke
2019-07-31 20:16         ` Sagi Grimberg
2019-07-31 12:18   ` Hannes Reinecke
2019-07-31 18:16     ` Sagi Grimberg
2019-07-31 20:04       ` Hannes Reinecke
2019-07-31 20:37         ` Sagi Grimberg
2019-07-29 23:32 ` [PATCH rfc 2/2] nvme: fix possible use-after-free condition when controller reset is racing namespace scanning Sagi Grimberg
2019-07-31 12:23   ` Hannes Reinecke
2019-07-31 18:21     ` Sagi Grimberg
2019-08-01  7:24       ` Hannes Reinecke
2019-08-01 18:46         ` Sagi Grimberg