* [PATCH rfc 0/2] nvme controller reset and namespace scan work race conditions
@ 2019-07-29 23:31 Sagi Grimberg
  2019-07-29 23:32 ` [PATCH rfc 1/2] nvme: don't remove namespace if revalidate failed because of controller reset Sagi Grimberg
  2019-07-29 23:32 ` [PATCH rfc 2/2] nvme: fix possible use-after-free condition when controller reset is racing namespace scanning Sagi Grimberg
  0 siblings, 2 replies; 42+ messages in thread
From: Sagi Grimberg @ 2019-07-29 23:31 UTC (permalink / raw)


Hey Hannes,

Here are two patches that, to my understanding, address the issues you
describe in your patchset "nvme: flush rescan worker before resetting" and
in your report of "spurious I/O errors during failover".

Patch #1 avoids removing a namespace if the revalidation I/O failed because
of a racing controller reset (or removal). This should fix any spurious I/O
failures during path failover.

Patch #2 avoids a use-after-free in the case where the scan work ends up
actually removing a namespace while racing with a controller reset (that is
scheduled after the removal) which accesses the request queue (via
nvme_stop_queues) after it was destroyed.
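
For clarity, a rough timeline of the race that patch #2 closes (my reading
of the reports, not a verbatim trace):
--
  scan_work                           reset_work
  ---------                           ----------
  nvme_ns_remove(ns)
    del_gendisk()/blk_cleanup_queue()
                                      nvme_stop_queues(ctrl)
                                        walks ctrl->namespaces and
                                        touches ns->queue   <-- use-after-free
    list_del_init(&ns->list)
--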

The one trace that you have mentioned in our discussions that is still
not clear to me is:
--
[67088.344034] WARNING: CPU: 4 PID: 25020 at
../lib/percpu-refcount.c:334 percpu_ref_kill_and_confirm+0x7a/0xa0
[...]
[67088.344106] Call Trace:
[67088.344112]  blk_freeze_queue_start+0x2a/0x40
[67088.344114]  blk_freeze_queue+0xe/0x40
[67088.344118]  nvme_update_disk_info+0x36/0x260 [nvme_core]
[67088.344122]  __nvme_revalidate_disk+0xca/0xf0 [nvme_core]
[67088.344125]  nvme_revalidate_disk+0xa6/0x120 [nvme_core]
[67088.344127]  ? blk_mq_get_tag+0xa3/0x220
[67088.344130]  revalidate_disk+0x23/0xc0
[67088.344133]  nvme_validate_ns+0x43/0x830 [nvme_core]
[67088.344137]  ? wake_up_q+0x70/0x70
[67088.344139]  ? blk_mq_free_request+0x12a/0x160
[67088.344142]  ? __nvme_submit_sync_cmd+0x73/0xe0 [nvme_core]
[67088.344145]  nvme_scan_work+0x2b3/0x350 [nvme_core]
[67088.344149]  process_one_work+0x1da/0x400
[67088.344150]  worker_thread+0x2b/0x3f0
[67088.344152]  ? process_one_work+0x400/0x400
--

This indicates that we are revalidating a namespace that was already
removed. Given that the only namespace removal outside of the scan_work is
in nvme_remove_namespaces(), which flushes the scan_work before it actually
removes the namespaces, I'm still at a loss as to how this can happen.
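
For reference, the ordering I am relying on above (an approximation of
nvme_remove_namespaces() from memory; the exact body may differ):
--
void nvme_remove_namespaces(struct nvme_ctrl *ctrl)
{
	struct nvme_ns *ns, *next;
	LIST_HEAD(ns_list);

	/* prevent racing with ns scanning */
	flush_work(&ctrl->scan_work);

	down_write(&ctrl->namespaces_rwsem);
	list_splice_init(&ctrl->namespaces, &ns_list);
	up_write(&ctrl->namespaces_rwsem);

	list_for_each_entry_safe(ns, next, &ns_list, list)
		nvme_ns_remove(ns);
}
--
So by the time this path removes any namespace, the scan_work should have
already completed.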

Can you please apply the following two patches and report if
they address the issues you are seeing? And if not, can you please
report a call trace of the hung threads?

And, given that you are in a multipath environment, can you apply
these on top of: "nvme: fix controller removal race with scan work"?

Thanks.

Sagi Grimberg (2):
  nvme: don't remove namespace if revalidate failed because of
    controller reset
  nvme: fix possible use-after-free condition when controller reset is
    racing namespace scanning

 drivers/nvme/host/core.c | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

-- 
2.17.1

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH rfc 1/2] nvme: don't remove namespace if revalidate failed because of controller reset
  2019-07-29 23:31 [PATCH rfc 0/2] nvme controller reset and namespace scan work race conditions Sagi Grimberg
@ 2019-07-29 23:32 ` Sagi Grimberg
  2019-07-30  0:59   ` Keith Busch
                     ` (2 more replies)
  2019-07-29 23:32 ` [PATCH rfc 2/2] nvme: fix possible use-after-free condition when controller reset is racing namespace scanning Sagi Grimberg
  1 sibling, 3 replies; 42+ messages in thread
From: Sagi Grimberg @ 2019-07-29 23:32 UTC (permalink / raw)


If a controller reset is racing with a namespace revalidation, the
revalidation I/O will surely fail, but we should not remove the
namespace as we will execute the I/O when the controller is LIVE again.
Specifically check that the controller is LIVE, because
RESETTING/CONNECTING are transient states and DELETING/DEAD will
eventually remove the namespace in the removal code path.

This fixes spurious I/O errors in path failover conditions where the
controller reset is racing with the namespace scan work.

Reported-by: Hannes Reinecke  <hare at suse.de>
Signed-off-by: Sagi Grimberg <sagi at grimberg.me>
---
 drivers/nvme/host/core.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index fa31da0762b9..5f6970e7ba73 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -3428,7 +3428,8 @@ static void nvme_validate_ns(struct nvme_ctrl *ctrl, unsigned nsid)
 
 	ns = nvme_find_get_ns(ctrl, nsid);
 	if (ns) {
-		if (ns->disk && revalidate_disk(ns->disk))
+		if (ns->disk && revalidate_disk(ns->disk) &&
+		    ctrl->state != NVME_CTRL_LIVE)
 			nvme_ns_remove(ns);
 		nvme_put_ns(ns);
 	} else
-- 
2.17.1

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH rfc 2/2] nvme: fix possible use-after-free condition when controller reset is racing namespace scanning
  2019-07-29 23:31 [PATCH rfc 0/2] nvme controller reset and namespace scan work race conditions Sagi Grimberg
  2019-07-29 23:32 ` [PATCH rfc 1/2] nvme: don't remove namespace if revalidate failed because of controller reset Sagi Grimberg
@ 2019-07-29 23:32 ` Sagi Grimberg
  2019-07-31 12:23   ` Hannes Reinecke
  1 sibling, 1 reply; 42+ messages in thread
From: Sagi Grimberg @ 2019-07-29 23:32 UTC (permalink / raw)


Remove the ns from the ctrl namespaces list before destroying the request
queue, as the reset work will unquiesce the queues of all the controller
namespaces, while an online scan might remove invalid namespaces resulting
from a concurrent AEN.

While we're at it, move nvme_mpath_check_last_path() to sit right next to
the other multipath-related cleanups.

Reported-by: Hannes Reinecke  <hare at suse.de>
Signed-off-by: Sagi Grimberg <sagi at grimberg.me>
---
 drivers/nvme/host/core.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 5f6970e7ba73..9f8f8f5feeae 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -3406,6 +3406,11 @@ static void nvme_ns_remove(struct nvme_ns *ns)
 	synchronize_rcu(); /* guarantee not available in head->list */
 	nvme_mpath_clear_current_path(ns);
 	synchronize_srcu(&ns->head->srcu); /* wait for concurrent submissions */
+	nvme_mpath_check_last_path(ns);
+
+	down_write(&ns->ctrl->namespaces_rwsem);
+	list_del_init(&ns->list);
+	up_write(&ns->ctrl->namespaces_rwsem);
 
 	if (ns->disk && ns->disk->flags & GENHD_FL_UP) {
 		del_gendisk(ns->disk);
@@ -3414,11 +3419,6 @@ static void nvme_ns_remove(struct nvme_ns *ns)
 			blk_integrity_unregister(ns->disk);
 	}
 
-	down_write(&ns->ctrl->namespaces_rwsem);
-	list_del_init(&ns->list);
-	up_write(&ns->ctrl->namespaces_rwsem);
-
-	nvme_mpath_check_last_path(ns);
 	nvme_put_ns(ns);
 }
 
-- 
2.17.1

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH rfc 1/2] nvme: don't remove namespace if revalidate failed because of controller reset
  2019-07-29 23:32 ` [PATCH rfc 1/2] nvme: don't remove namespace if revalidate failed because of controller reset Sagi Grimberg
@ 2019-07-30  0:59   ` Keith Busch
  2019-07-30  1:04     ` Sagi Grimberg
  2019-07-30  1:04   ` Ming Lei
  2019-07-31 12:18   ` Hannes Reinecke
  2 siblings, 1 reply; 42+ messages in thread
From: Keith Busch @ 2019-07-30  0:59 UTC (permalink / raw)


On Mon, Jul 29, 2019 at 5:32 PM Sagi Grimberg <sagi@grimberg.me> wrote:
> If a controller reset is racing with a namespace revalidation, the
> revalidation I/O will surely fail, but we should not remove the
> namespace as we will execute the I/O when the controller is LIVE again.
> Specifically check that the controller is LIVE, because
> RESETTING/CONNECTING are transient states and DELETING/DEAD will
> eventually remove the namespace in the removal code path.
>
> This fixes spurious I/O errors in path failover conditions where the
> controller reset is racing with the namespace scan work.
>
> Reported-by: Hannes Reinecke  <hare at suse.de>
> Signed-off-by: Sagi Grimberg <sagi at grimberg.me>
> ---
>  drivers/nvme/host/core.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index fa31da0762b9..5f6970e7ba73 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -3428,7 +3428,8 @@ static void nvme_validate_ns(struct nvme_ctrl *ctrl, unsigned nsid)
>
>         ns = nvme_find_get_ns(ctrl, nsid);
>         if (ns) {
> -               if (ns->disk && revalidate_disk(ns->disk))
> +               if (ns->disk && revalidate_disk(ns->disk) &&
> +                   ctrl->state != NVME_CTRL_LIVE)
>                         nvme_ns_remove(ns);

That should be '== NVME_CTRL_LIVE', right? You don't want to remove it
for the resetting state, and the removing states take care of removal
directly.

On the RESETTING state, shouldn't we avoid calling revalidate_disk()
in the first place? An intermittent failure during reset could cause
the disk to temporarily go to a zero-capacity, which may cause some
filesystem issues that we could avoid if we know we're about to rescan
this namespace.

>                 nvme_put_ns(ns);
>         } else

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH rfc 1/2] nvme: don't remove namespace if revalidate failed because of controller reset
  2019-07-29 23:32 ` [PATCH rfc 1/2] nvme: don't remove namespace if revalidate failed because of controller reset Sagi Grimberg
  2019-07-30  0:59   ` Keith Busch
@ 2019-07-30  1:04   ` Ming Lei
  2019-07-30  1:06     ` Sagi Grimberg
  2019-07-31 12:18   ` Hannes Reinecke
  2 siblings, 1 reply; 42+ messages in thread
From: Ming Lei @ 2019-07-30  1:04 UTC (permalink / raw)


On Tue, Jul 30, 2019 at 7:32 AM Sagi Grimberg <sagi@grimberg.me> wrote:
>
> If a controller reset is racing with a namespace revalidation, the
> revalidation I/O will surely fail, but we should not remove the

No, there is a sync I/O request in revalidation, and the sync I/O can't be
completed during resetting.

Thanks,
Ming Lei

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH rfc 1/2] nvme: don't remove namespace if revalidate failed because of controller reset
  2019-07-30  0:59   ` Keith Busch
@ 2019-07-30  1:04     ` Sagi Grimberg
  0 siblings, 0 replies; 42+ messages in thread
From: Sagi Grimberg @ 2019-07-30  1:04 UTC (permalink / raw)



>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>> index fa31da0762b9..5f6970e7ba73 100644
>> --- a/drivers/nvme/host/core.c
>> +++ b/drivers/nvme/host/core.c
>> @@ -3428,7 +3428,8 @@ static void nvme_validate_ns(struct nvme_ctrl *ctrl, unsigned nsid)
>>
>>          ns = nvme_find_get_ns(ctrl, nsid);
>>          if (ns) {
>> -               if (ns->disk && revalidate_disk(ns->disk))
>> +               if (ns->disk && revalidate_disk(ns->disk) &&
>> +                   ctrl->state != NVME_CTRL_LIVE)
>>                          nvme_ns_remove(ns);
> 
> That should be '== NVME_CTRL_LIVE', right? You don't want to remove it
> for the resetting state, and the removing states take care of removal
> directly.

rrr... you're right...

> On the RESETTING state, shouldn't we avoid calling revalidate_disk()
> in the first place? An intermittent failure during reset could cause
> the disk to temporarily go to a zero-capacity, which may cause some
> filesystem issues that we could avoid if we know we're about to rescan
> this namespace.

Yea, this should do the trick I guess:
--
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index fa31da0762b9..d01976c93160 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -3428,7 +3428,8 @@ static void nvme_validate_ns(struct nvme_ctrl 
*ctrl, unsigned nsid)

         ns = nvme_find_get_ns(ctrl, nsid);
         if (ns) {
-               if (ns->disk && revalidate_disk(ns->disk))
+               if (ns->disk && ctrl->state == NVME_CTRL_LIVE &&
+                   revalidate_disk(ns->disk))
                         nvme_ns_remove(ns);
                 nvme_put_ns(ns);
         } else
--

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH rfc 1/2] nvme: don't remove namespace if revalidate failed because of controller reset
  2019-07-30  1:04   ` Ming Lei
@ 2019-07-30  1:06     ` Sagi Grimberg
  2019-07-30  1:10       ` Ming Lei
  0 siblings, 1 reply; 42+ messages in thread
From: Sagi Grimberg @ 2019-07-30  1:06 UTC (permalink / raw)



>> If a controller reset is racing with a namespace revalidation, the
>> revalidation I/O will surely fail, but we should not remove the
> 
> No, there is a sync I/O request in revalidation, and the sync I/O can't be
> completed during resetting.

Why? of course it can. The controller reset is responsible to
quiesce + abort + unquiesce both I/O and admin commands.
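
Roughly this sequence, as a sketch (not verbatim from any transport; the
exact call sites vary, and the admin queue gets the same treatment via
its own tagset):
--
nvme_stop_queues(ctrl);			/* quiesce the I/O queues */
blk_mq_tagset_busy_iter(ctrl->tagset,
		nvme_cancel_request, ctrl);	/* abort in-flight requests */
nvme_start_queues(ctrl);		/* unquiesce; requeued requests are
					 * dispatched again or fail fast */
--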

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH rfc 1/2] nvme: don't remove namespace if revalidate failed because of controller reset
  2019-07-30  1:06     ` Sagi Grimberg
@ 2019-07-30  1:10       ` Ming Lei
  2019-07-30  1:19         ` Sagi Grimberg
  0 siblings, 1 reply; 42+ messages in thread
From: Ming Lei @ 2019-07-30  1:10 UTC (permalink / raw)


On Tue, Jul 30, 2019 at 9:06 AM Sagi Grimberg <sagi@grimberg.me> wrote:
>
>
> >> If a controller reset is racing with a namespace revalidation, the
> >> revalidation I/O will surely fail, but we should not remove the
> >
> > No, there is a sync I/O request in revalidation, and the sync I/O can't be
> > completed during resetting.
>
> Why? of course it can. The controller reset is responsible to
> quiesce + abort + unquiesce both I/O and admin commands.

Abort simply cancels the in-flight requests, which will be retried and
re-queued into the blk-mq queues, and will be dispatched again after the
reset is done. That is why the revalidation I/O won't fail.

Thanks,
Ming Lei

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH rfc 1/2] nvme: don't remove namespace if revalidate failed because of controller reset
  2019-07-30  1:10       ` Ming Lei
@ 2019-07-30  1:19         ` Sagi Grimberg
  2019-07-30  1:30           ` Ming Lei
  0 siblings, 1 reply; 42+ messages in thread
From: Sagi Grimberg @ 2019-07-30  1:19 UTC (permalink / raw)



>>>> If a controller reset is racing with a namespace revalidation, the
>>>> revalidation I/O will surely fail, but we should not remove the
>>>
>>> No, there is a sync I/O request in revalidation, and the sync I/O can't be
>>> completed during resetting.
>>
>> Why? of course it can. The controller reset is responsible to
>> quiesce + abort + unquiesce both I/O and admin commands.
> 
> Abort simply cancels the in-flight requests, which will be retried and
> re-queued into the blk-mq queues, and will be dispatched again after the
> reset is done. That is why the revalidation I/O won't fail.

These I/Os are admin commands, which will not be retried (because we mark
them with REQ_FAILFAST_DRIVER). As for normal I/O that is meant to be
retried, it still must either fail fast or fail over to another path.

(see the patch I sent Logan for an issue in this area:
[PATCH v3] nvme: fix controller removal race with scan work)
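
For context, a sketch of how the failfast marking works to my
understanding of this era's core (assumption: the flag is applied where
the sync request is allocated, e.g. in nvme_alloc_request()):
--
	/* Sync/admin requests are marked failfast; blk_noretry_request()
	 * is then true for them, so a cancelled command completes with an
	 * error instead of being retried:
	 */
	req->cmd_flags |= REQ_FAILFAST_DRIVER;
--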

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH rfc 1/2] nvme: don't remove namespace if revalidate failed because of controller reset
  2019-07-30  1:19         ` Sagi Grimberg
@ 2019-07-30  1:30           ` Ming Lei
  2019-07-30  1:40             ` Sagi Grimberg
  0 siblings, 1 reply; 42+ messages in thread
From: Ming Lei @ 2019-07-30  1:30 UTC (permalink / raw)


On Tue, Jul 30, 2019 at 9:19 AM Sagi Grimberg <sagi@grimberg.me> wrote:
>
>
> >>>> If a controller reset is racing with a namespace revalidation, the
> >>>> revalidation I/O will surely fail, but we should not remove the
> >>>
> >>> No, there is a sync I/O request in revalidation, and the sync I/O can't be
> >>> completed during resetting.
> >>
> >> Why? of course it can. The controller reset is responsible to
> >> quiesce + abort + unquiesce both I/O and admin commands.
> >
> > Abort simply cancels the in-flight requests, which will be retried and
> > re-queued into the blk-mq queues, and will be dispatched again after the
> > reset is done. That is why the revalidation I/O won't fail.
>
>> These I/Os are admin commands, which will not be retried (because we mark
>> them with REQ_FAILFAST_DRIVER).

Yeah, this way is fine since data loss won't be caused.

>> As for normal I/O that is meant to be retried, it
>> still must either fail fast or fail over to another path.

Why? If the I/O fails, dirty pages may be lost, and it is a generic
issue whether multipath or not; I don't think the normal I/O should
fail, at least in the non-multipath case.

>
> (see the patch I sent Logan for an issue in this area:
> [PATCH v3] nvme: fix controller removal race with scan work)

The above patch is only for multipath.


thanks,
Ming Lei

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH rfc 1/2] nvme: don't remove namespace if revalidate failed because of controller reset
  2019-07-30  1:30           ` Ming Lei
@ 2019-07-30  1:40             ` Sagi Grimberg
  2019-07-30  2:09               ` Ming Lei
  0 siblings, 1 reply; 42+ messages in thread
From: Sagi Grimberg @ 2019-07-30  1:40 UTC (permalink / raw)




On 7/29/19 6:30 PM, Ming Lei wrote:
> On Tue, Jul 30, 2019 at 9:19 AM Sagi Grimberg <sagi@grimberg.me> wrote:
>>
>>
>>>>>> If a controller reset is racing with a namespace revalidation, the
>>>>>> revalidation I/O will surely fail, but we should not remove the
>>>>>
>>>>> No, there is a sync I/O request in revalidation, and the sync I/O can't be
>>>>> completed during resetting.
>>>>
>>>> Why? of course it can. The controller reset is responsible to
>>>> quiesce + abort + unquiesce both I/O and admin commands.
>>>
>>> Abort simply cancels the in-flight requests, which will be retried and
>>> re-queued into the blk-mq queues, and will be dispatched again after the
>>> reset is done. That is why the revalidation I/O won't fail.
>>
>> These I/Os are admin commands, which will not be retried (because we mark
>> them with REQ_FAILFAST_DRIVER).
> 
> Yeah, this way is fine since data loss won't be caused.
> 
>> As for normal I/O that is meant to be retried, it
>> still must either fail fast or fail over to another path.
> 
> Why? If the I/O fails, dirty pages may be lost, and it is a generic
> issue whether multipath or not; I don't think the normal I/O should
> fail, at least in the non-multipath case.

This was meant only when the namespace is going away.
Yea, if the namespace is not going away as described in this patch, then
yes, it will block until reset is completed.

>> (see the patch I sent Logan for an issue in this area:
>> [PATCH v3] nvme: fix controller removal race with scan work)
> 
> The above patch is only for multipath.

Yes, and again, addresses the case that the namespace is going away.

So I think we are in agreement? I only need to change the commit
message from: "the revalidation I/O" to "the admin I/O" ?

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH rfc 1/2] nvme: don't remove namespace if revalidate failed because of controller reset
  2019-07-30  1:40             ` Sagi Grimberg
@ 2019-07-30  2:09               ` Ming Lei
  2019-07-30 17:12                 ` Sagi Grimberg
  0 siblings, 1 reply; 42+ messages in thread
From: Ming Lei @ 2019-07-30  2:09 UTC (permalink / raw)


On Tue, Jul 30, 2019 at 9:40 AM Sagi Grimberg <sagi@grimberg.me> wrote:
>
>
>
> On 7/29/19 6:30 PM, Ming Lei wrote:
> > On Tue, Jul 30, 2019 at 9:19 AM Sagi Grimberg <sagi@grimberg.me> wrote:
> >>
> >>
> >>>>>> If a controller reset is racing with a namespace revalidation, the
> >>>>>> revalidation I/O will surely fail, but we should not remove the
> >>>>>
> >>>>> No, there is a sync I/O request in revalidation, and the sync I/O can't be
> >>>>> completed during resetting.
> >>>>
> >>>> Why? of course it can. The controller reset is responsible to
> >>>> quiesce + abort + unquiesce both I/O and admin commands.
> >>>
> >>> Abort simply cancels the in-flight requests, which will be retried and
> >>> re-queued into the blk-mq queues, and will be dispatched again after the
> >>> reset is done. That is why the revalidation I/O won't fail.
> >>
> >> These I/Os are admin commands, which will not be retried (because we mark
> >> them with REQ_FAILFAST_DRIVER).
> >
> > Yeah, this way is fine since data loss won't be caused.
> >
> >> As for normal I/O that is meant to be retried, it
> >> still must either fail fast or fail over to another path.
> >
> > Why? If the I/O fails, dirty pages may be lost, and it is a generic
> > issue whether multipath or not; I don't think the normal I/O should
> > fail, at least in the non-multipath case.
>
> This was meant only when the namespace is going away.
> Yea, if the namespace is not going away as described in this patch, then
> yes, it will block until reset is completed.
>
> >> (see the patch I sent Logan for an issue in this area:
> >> [PATCH v3] nvme: fix controller removal race with scan work)
> >
> > The above patch is only for multipath.
>
> Yes, and again, addresses the case that the namespace is going away.
>
> So I think we are in agreement? I only need to change the commit
> message from: "the revalidation I/O" to "the admin I/O" ?

Those words, 'admin I/O', aren't related to the patch or issue.

> Yea, this should do the trick I guess:
> --
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index fa31da0762b9..d01976c93160 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -3428,7 +3428,8 @@ static void nvme_validate_ns(struct nvme_ctrl
> *ctrl, unsigned nsid)
>
>          ns = nvme_find_get_ns(ctrl, nsid);
>          if (ns) {
> -               if (ns->disk && revalidate_disk(ns->disk))
> +               if (ns->disk && ctrl->state == NVME_CTRL_LIVE &&
> +                   revalidate_disk(ns->disk))
>                          nvme_ns_remove(ns);
>                  nvme_put_ns(ns);
>          } else

If RESET is triggered just inside revalidate_disk(), and not done after
revalidate_disk() returns, there is still a race between reset and scan work.

For example, nvme_ns_remove() may hang forever, since the ns may be
quiesced during RESET and can't be unquiesced any more.

Thanks,
Ming Lei

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH rfc 1/2] nvme: don't remove namespace if revalidate failed because of controller reset
  2019-07-30  2:09               ` Ming Lei
@ 2019-07-30 17:12                 ` Sagi Grimberg
  2019-07-30 17:30                   ` Keith Busch
  2019-07-31  6:58                   ` Hannes Reinecke
  0 siblings, 2 replies; 42+ messages in thread
From: Sagi Grimberg @ 2019-07-30 17:12 UTC (permalink / raw)



>> Yes, and again, addresses the case that the namespace is going away.
>>
>> So I think we are in agreement? I only need to change the commit
>> message from: "the revalidation I/O" to "the admin I/O" ?
> 
> Those words, 'admin I/O', aren't related to the patch or issue.

But it is: the original issue was due to the fact that
nvme_revalidate_disk() issues I/Os such as nvme_identify_ns() or
nvme_identify_ns_descs(). This was the original issue.

>> Yea, this should do the trick I guess:
>> --
>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>> index fa31da0762b9..d01976c93160 100644
>> --- a/drivers/nvme/host/core.c
>> +++ b/drivers/nvme/host/core.c
>> @@ -3428,7 +3428,8 @@ static void nvme_validate_ns(struct nvme_ctrl
>> *ctrl, unsigned nsid)
>>
>>           ns = nvme_find_get_ns(ctrl, nsid);
>>           if (ns) {
>> -               if (ns->disk && revalidate_disk(ns->disk))
>> +               if (ns->disk && ctrl->state == NVME_CTRL_LIVE &&
>> +                   revalidate_disk(ns->disk))
>>                           nvme_ns_remove(ns);
>>                   nvme_put_ns(ns);
>>           } else
> 
> If RESET is triggered just inside revalidate_disk(), and not done after
> revalidate_disk() returns, there is still a race between reset and scan work.
> 

You are correct, this was why I had the ctrl->state check after
revalidate_disk so if it failed because we are in a reset we should
not remove the namespace.

We need a reliable way to NOT remove the namespace if revalidate_disk
failed because the controller is resetting and we don't have a channel
to the controller at this very moment...

Keith,

As for the failure during reset scenario, this is happening only when
the namespace is about to go away or something is seriously wrong, right?
(Looking at where nvme_kill_queues is called.)

Do you still think we should avoid calling the revalidate_disk if the
controller is resetting?

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH rfc 1/2] nvme: don't remove namespace if revalidate failed because of controller reset
  2019-07-30 17:12                 ` Sagi Grimberg
@ 2019-07-30 17:30                   ` Keith Busch
  2019-07-30 18:15                     ` Sagi Grimberg
  2019-07-31  7:01                     ` Hannes Reinecke
  2019-07-31  6:58                   ` Hannes Reinecke
  1 sibling, 2 replies; 42+ messages in thread
From: Keith Busch @ 2019-07-30 17:30 UTC (permalink / raw)


On Tue, Jul 30, 2019 at 10:12:42AM -0700, Sagi Grimberg wrote:
> 
> > > Yes, and again, addresses the case that the namespace is going away.
> > > 
> > > So I think we are in agreement? I only need to change the commit
> > > message from: "the revalidation I/O" to "the admin I/O" ?
> > 
> > Those words, 'admin I/O', aren't related to the patch or issue.
> 
> > But it is: the original issue was due to the fact that
> > nvme_revalidate_disk() issues I/Os such as nvme_identify_ns() or
> > nvme_identify_ns_descs(). This was the original issue.
> 
> > > Yea, this should do the trick I guess:
> > > --
> > > diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> > > index fa31da0762b9..d01976c93160 100644
> > > --- a/drivers/nvme/host/core.c
> > > +++ b/drivers/nvme/host/core.c
> > > @@ -3428,7 +3428,8 @@ static void nvme_validate_ns(struct nvme_ctrl
> > > *ctrl, unsigned nsid)
> > > 
> > >           ns = nvme_find_get_ns(ctrl, nsid);
> > >           if (ns) {
> > > -               if (ns->disk && revalidate_disk(ns->disk))
> > > +               if (ns->disk && ctrl->state == NVME_CTRL_LIVE &&
> > > +                   revalidate_disk(ns->disk))
> > >                           nvme_ns_remove(ns);
> > >                   nvme_put_ns(ns);
> > >           } else
> > 
> > If RESET is triggered just inside revalidate_disk(), and not done after
> > revalidate_disk() returns, there is still a race between reset and scan work.
> > 
> 
> You are correct, this was why I had the ctrl->state check after
> revalidate_disk so if it failed because we are in a reset we should
> not remove the namespace.
> 
> We need a reliable way to NOT remove the namespace if revalidate_disk
> failed because the controller is resetting and we don't have a channel
> to the controller at this very moment...
> 
> Keith,
> 
> As for the failure during reset scenario, this is happening only when
> the namespace is about to go away or something is seriously wrong, right?
> (Looking at where nvme_kill_queues is called.)
> 
> Do you still think we should avoid calling the revalidate_disk if the
> controller is resetting?

I was considering if a reset happens to trigger when nvme's
revalidate_disk tries to read identify namespace. It's possible that the
command gets aborted, and we don't retry admin commands, so we'd return
-ENODEV and nvme_validate_ns() removes an otherwise healthy namespace.

I'm not too concerned about this corner case actually occurring in
practice, though.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH rfc 1/2] nvme: don't remove namespace if revalidate failed because of controller reset
  2019-07-30 17:30                   ` Keith Busch
@ 2019-07-30 18:15                     ` Sagi Grimberg
  2019-07-31  7:13                       ` Hannes Reinecke
  2019-07-31  7:01                     ` Hannes Reinecke
  1 sibling, 1 reply; 42+ messages in thread
From: Sagi Grimberg @ 2019-07-30 18:15 UTC (permalink / raw)



>> Keith,
>>
>> As for the failure during reset scenario, this is happening only when
>> the namespace is about to go away or something is seriously wrong, right?
>> (Looking at where nvme_kill_queues is called.)
>>
>> Do you still think we should avoid calling the revalidate_disk if the
>> controller is resetting?
> 
> I was considering if a reset happens to trigger when nvme's
>> revalidate_disk tries to read identify namespace. It's possible that the
> command gets aborted, and we don't retry admin commands, so we'd return
> -ENODEV and nvme_validate_ns() removes an otherwise healthy namespace.

But if the ctrl->state condition is after the revalidate_disk call it
won't remove the namespace.
--
                 if (ns->disk && revalidate_disk(ns->disk) &&
                     ctrl->state == NVME_CTRL_LIVE)
                         nvme_ns_remove(ns);
--

> 
> I'm not too concerned about this corner case actually occurring in
> practice, though.

Say, why are we removing the namespace if the revalidate failed?
I don't see an equivalent behavior in scsi or others in drivers/block...

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH rfc 1/2] nvme: don't remove namespace if revalidate failed because of controller reset
  2019-07-30 17:12                 ` Sagi Grimberg
  2019-07-30 17:30                   ` Keith Busch
@ 2019-07-31  6:58                   ` Hannes Reinecke
  2019-07-31 18:11                     ` Sagi Grimberg
  1 sibling, 1 reply; 42+ messages in thread
From: Hannes Reinecke @ 2019-07-31  6:58 UTC (permalink / raw)


On 7/30/19 7:12 PM, Sagi Grimberg wrote:
> 
>>> Yes, and again, addresses the case that the namespace is going away.
>>>
>>> So I think we are in agreement? I only need to change the commit
>>> message from: "the revalidation I/O" to "the admin I/O" ?
>>
>> Those words, 'admin I/O', aren't related to the patch or issue.
> 
> But it is: the original issue was due to the fact that
> nvme_revalidate_disk() issues I/Os such as nvme_identify_ns() or
> nvme_identify_ns_descs(). This was the original issue.
> 
>>> Yea, this should do the trick I guess:
>>> -- 
>>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>>> index fa31da0762b9..d01976c93160 100644
>>> --- a/drivers/nvme/host/core.c
>>> +++ b/drivers/nvme/host/core.c
>>> @@ -3428,7 +3428,8 @@ static void nvme_validate_ns(struct nvme_ctrl
>>> *ctrl, unsigned nsid)
>>>
>>>           ns = nvme_find_get_ns(ctrl, nsid);
>>>           if (ns) {
>>> -               if (ns->disk && revalidate_disk(ns->disk))
>>> +               if (ns->disk && ctrl->state == NVME_CTRL_LIVE &&
>>> +                   revalidate_disk(ns->disk))
>>>                         nvme_ns_remove(ns);
>>>                 nvme_put_ns(ns);
>>>           } else
>>
>> If RESET is triggered just inside revalidate_disk(), and not done after
>> revalidate_disk() returns, there is still a race between reset and scan
>> work.
>>
> 
> You are correct, this was why I had the ctrl->state check after
> revalidate_disk so if it failed because we are in a reset we should
> not remove the namespace.
> 
> We need a reliable way to NOT remove the namespace if revalidate_disk
> failed because the controller is resetting and we don't have a channel
> to the controller at this very moment...
> 
A similar thing is true for the 'ns_head' structure; I could easily
envision a scenario where all paths are dropped (due to revalidate
failures or whatnot) with all controllers in reset.
But as all paths are dropped struct ns_head doesn't have any reference
to any namespaces (and consequently, to any controller), and will be
failing all I/O before itself being removed.
It will be re-created after resetting completes and all namespaces are
re-attached, for sure, but it still means that we're getting I/O errors
where we really shouldn't.
One is tempted to say 'queue_if_no_path' here ...

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare at suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH rfc 1/2] nvme: don't remove namespace if revalidate failed because of controller reset
  2019-07-30 17:30                   ` Keith Busch
  2019-07-30 18:15                     ` Sagi Grimberg
@ 2019-07-31  7:01                     ` Hannes Reinecke
  2019-07-31 14:16                       ` Keith Busch
  2019-07-31 18:03                       ` Sagi Grimberg
  1 sibling, 2 replies; 42+ messages in thread
From: Hannes Reinecke @ 2019-07-31  7:01 UTC (permalink / raw)


On 7/30/19 7:30 PM, Keith Busch wrote:
> On Tue, Jul 30, 2019 at 10:12:42AM -0700, Sagi Grimberg wrote:
>>
>>>> Yes, and again, addresses the case that the namespace is going away.
>>>>
>>>> So I think we are in agreement? I only need to change the commit
>>>> message from: "the revalidation I/O" to "the admin I/O" ?
>>>
>>> Those words, 'admin I/O', aren't related to the patch or issue.
>>
>> But it is: the original issue was due to the fact that
>> nvme_revalidate_disk() issues I/Os such as nvme_identify_ns() or
>> nvme_identify_ns_descs(). This was the original issue.
>>
>>>> Yea, this should do the trick I guess:
>>>> --
>>>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>>>> index fa31da0762b9..d01976c93160 100644
>>>> --- a/drivers/nvme/host/core.c
>>>> +++ b/drivers/nvme/host/core.c
>>>> @@ -3428,7 +3428,8 @@ static void nvme_validate_ns(struct nvme_ctrl
>>>> *ctrl, unsigned nsid)
>>>>
>>>>           ns = nvme_find_get_ns(ctrl, nsid);
>>>>           if (ns) {
>>>> -               if (ns->disk && revalidate_disk(ns->disk))
>>>> +               if (ns->disk && ctrl->state == NVME_CTRL_LIVE &&
>>>> +                   revalidate_disk(ns->disk))
>>>>                           nvme_ns_remove(ns);
>>>>                   nvme_put_ns(ns);
>>>>           } else
>>>
>>> If RESET is triggered just inside revalidate_disk(), and not done after
>>> revalidate_disk() returns, there is still a race between reset and scan work.
>>>
>>
>> You are correct, this was why I had the ctrl->state check after
>> revalidate_disk so if it failed because we are in a reset we should
>> not remove the namespace.
>>
>> We need a reliable way to NOT remove the namespace if revalidate_disk
>> failed because the controller is resetting and we don't have a channel
>> to the controller at this very moment...
>>
>> Keith,
>>
>> As for the failure during reset scenario, this is happening only when
>> the namespace is about to go away or something is seriously wrong, right?
>> (Looking at where nvme_kill_queues is called.)
>>
>> Do you still think we should avoid calling the revalidate_disk if the
>> controller is resetting?
> 
> I was considering if a reset happens to trigger when nvme's
> revalidate_disk tries to read identify namespace. It's possible that the
> command gets aborted, and we don't retry admin commands, so we'd return
> -ENODEV and nvme_validate_ns() removes an otherwise healthy namespace.
> 
> I'm not too concerned about this corner case actually occurring in
> practice, though.
> 
... disregarding those poor folks having to hunt down this very same issue
for several months now ...

Yes, it _does_ occur.
Not on PCI, mind, but definitely for FC. RDMA might have a slightly
better chance of not hitting it, but even there we have seen it.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare at suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH rfc 1/2] nvme: don't remove namespace if revalidate failed because of controller reset
  2019-07-30 18:15                     ` Sagi Grimberg
@ 2019-07-31  7:13                       ` Hannes Reinecke
  2019-07-31 18:08                         ` Sagi Grimberg
  0 siblings, 1 reply; 42+ messages in thread
From: Hannes Reinecke @ 2019-07-31  7:13 UTC (permalink / raw)


On 7/30/19 8:15 PM, Sagi Grimberg wrote:
> 
>>> Keith,
>>>
>>> As for the failure during reset scenario, this is happening only when
>>> the namespace is about to go away or something is seriously wrong, right?
>>> (Looking at where nvme_kill_queues is called.)
>>>
>>> Do you still think we should avoid calling the revalidate_disk if the
>>> controller is resetting?
>>
>> I was considering if a reset happens to trigger when nvme's
>> revalidate_disk tries to read identify namespace. It's possible that the
>> command gets aborted, and we don't retry admin commands, so we'd return
>> -ENODEV and nvme_validate_ns() removes an otherwise healthy namespace.
> 
> But if the ctrl->state condition is after the revalidate_disk call it
> won't remove the namespace.
> -- 
>                  if (ns->disk && revalidate_disk(ns->disk) &&
>                      ctrl->state == NVME_CTRL_LIVE)
>                          nvme_ns_remove(ns);
> -- 
> 
>>
>> I'm not too concerned about this corner case actually occurring in
>> practice, though.
> 
>> Say, why are we removing the namespace if the revalidate failed?
>> I don't see an equivalent behavior in scsi or others in drivers/block...

That's probably due to the convoluted logic in nvme_scan_ns_list().
That scans all namespaces in the list, and removes those which fail.
Which seems sensible, but really shouldn't be called during reset
as we cannot get any sensible results for any of those calls.

This was the reason for my original patch to flush the scanning thread
before entering reset. Maybe we should revive it...

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare at suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH rfc 1/2] nvme: don't remove namespace if revalidate failed because of controller reset
  2019-07-29 23:32 ` [PATCH rfc 1/2] nvme: don't remove namespace if revalidate failed because of controller reset Sagi Grimberg
  2019-07-30  0:59   ` Keith Busch
  2019-07-30  1:04   ` Ming Lei
@ 2019-07-31 12:18   ` Hannes Reinecke
  2019-07-31 18:16     ` Sagi Grimberg
  2 siblings, 1 reply; 42+ messages in thread
From: Hannes Reinecke @ 2019-07-31 12:18 UTC (permalink / raw)


On 7/30/19 1:32 AM, Sagi Grimberg wrote:
> If a controller reset is racing with a namespace revalidation, the
> revalidation I/O will surely fail, but we should not remove the
> namespace as we will execute the I/O when the controller is LIVE again.
> Specifically check that the controller is LIVE, because
> RESETTING/CONNECTING are transient states and DELETING/DEAD will
> eventually remove the namespace in the removal code path.
> 
> This fixes spurious I/O errors in path failover conditions where the
> controller reset is racing with the namespace scan work.
> 
> Reported-by: Hannes Reinecke  <hare at suse.de>
> Signed-off-by: Sagi Grimberg <sagi at grimberg.me>
> ---
>  drivers/nvme/host/core.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index fa31da0762b9..5f6970e7ba73 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -3428,7 +3428,8 @@ static void nvme_validate_ns(struct nvme_ctrl *ctrl, unsigned nsid)
>  
>  	ns = nvme_find_get_ns(ctrl, nsid);
>  	if (ns) {
> -		if (ns->disk && revalidate_disk(ns->disk))
> +		if (ns->disk && revalidate_disk(ns->disk) &&
> +		    ctrl->state != NVME_CTRL_LIVE)
>  			nvme_ns_remove(ns);
>  		nvme_put_ns(ns);
>  	} else
> 
The thing I'm worried about here is concurrency (this was also what
caused my earlier attempts to be rejected).
Thing is, the controller state can be set asynchronously. Which in
itself is protected by ctrl->lock, but here we're not taking the lock at
all. Hence we might be seeing a stale value, causing us to make the
wrong decision.

I have a patchset checking 'ctrl->state' under lock; will be posting it
later.

And another thing: what is the point in calling 'revalidate_disk()'
if the controller is not live? At best it won't do anything; at worst
we'll stall if we have to do more than one I/O (the reset cycle will
abort _one_ I/O, but if we have to submit more than one we'll be stuck,
too).

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare at suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH rfc 2/2] nvme: fix possible use-after-free condition when controller reset is racing namespace scanning
  2019-07-29 23:32 ` [PATCH rfc 2/2] nvme: fix possible use-after-free condition when controller reset is racing namespace scanning Sagi Grimberg
@ 2019-07-31 12:23   ` Hannes Reinecke
  2019-07-31 18:21     ` Sagi Grimberg
  0 siblings, 1 reply; 42+ messages in thread
From: Hannes Reinecke @ 2019-07-31 12:23 UTC (permalink / raw)


On 7/30/19 1:32 AM, Sagi Grimberg wrote:
> Remove the ns from the ctrl namespaces list before destroying the request
> queue, as the reset work will unquiesce the queues of all the controller
> namespaces, while an online scan might remove invalid namespaces resulting
> from a concurrent AEN.
> 
> While we're at it, move nvme_mpath_check_last_path() to sit right next to
> the other multipath-related cleanups.
> 
> Reported-by: Hannes Reinecke  <hare at suse.de>
> Signed-off-by: Sagi Grimberg <sagi at grimberg.me>
> ---
>  drivers/nvme/host/core.c | 10 +++++-----
>  1 file changed, 5 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 5f6970e7ba73..9f8f8f5feeae 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -3406,6 +3406,11 @@ static void nvme_ns_remove(struct nvme_ns *ns)
>  	synchronize_rcu(); /* guarantee not available in head->list */
>  	nvme_mpath_clear_current_path(ns);
>  	synchronize_srcu(&ns->head->srcu); /* wait for concurrent submissions */
> +	nvme_mpath_check_last_path(ns);
> +
> +	down_write(&ns->ctrl->namespaces_rwsem);
> +	list_del_init(&ns->list);
> +	up_write(&ns->ctrl->namespaces_rwsem);
>  
>  	if (ns->disk && ns->disk->flags & GENHD_FL_UP) {
>  		del_gendisk(ns->disk);
> @@ -3414,11 +3419,6 @@ static void nvme_ns_remove(struct nvme_ns *ns)
>  			blk_integrity_unregister(ns->disk);
>  	}
>  
> -	down_write(&ns->ctrl->namespaces_rwsem);
> -	list_del_init(&ns->list);
> -	up_write(&ns->ctrl->namespaces_rwsem);
> -
> -	nvme_mpath_check_last_path(ns);
>  	nvme_put_ns(ns);
>  }
>  
> 
How does this one play with nvme_stop_queues()/nvme_start_queues() we're
doing during reset?
I.e., what happens if this is called after nvme_stop_queues(), but before
nvme_start_queues()?
We'll end up having a stopped queue when deleting the disk; from what
I've seen blk_cleanup_queue() will just freeze the queue and wait for
I/O to complete, which it never will as the queue is stopped ...

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare at suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH rfc 1/2] nvme: don't remove namespace if revalidate failed because of controller reset
  2019-07-31  7:01                     ` Hannes Reinecke
@ 2019-07-31 14:16                       ` Keith Busch
  2019-07-31 18:03                       ` Sagi Grimberg
  1 sibling, 0 replies; 42+ messages in thread
From: Keith Busch @ 2019-07-31 14:16 UTC (permalink / raw)


On Wed, Jul 31, 2019 at 12:01:58AM -0700, Hannes Reinecke wrote:
> > I'm not too concerned about this corner case actually occurring in
> > practice, though.
> > 
> ... disregarding those poor folks having to hunt down this very same issue
> for several months now ...
> 
> Yes, it _does_ occur.
> Not on PCI, mind, but definitely for FC. RDMA might have a slightly
> better chance of not hitting it, but even there we have seen it.

Fair enough! I'll re-read your patches with this better understanding
of the context. :)

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH rfc 1/2] nvme: don't remove namespace if revalidate failed because of controller reset
  2019-07-31  7:01                     ` Hannes Reinecke
  2019-07-31 14:16                       ` Keith Busch
@ 2019-07-31 18:03                       ` Sagi Grimberg
  2019-07-31 19:32                         ` Keith Busch
  1 sibling, 1 reply; 42+ messages in thread
From: Sagi Grimberg @ 2019-07-31 18:03 UTC (permalink / raw)



>> I was considering if a reset happens to trigger when nvme's
>> revalidate_disk tries to read identify namespace. It's possible that the
>> command gets aborted, and we don't retry admin commands, so we'd return
>> -ENODEV and nvme_validate_ns() removes an otherwise healthy namespace.
>>
>> I'm not too concerned about this corner case actually occurring in
>> practice, though.
>>
> ... disregarding those poor folks having to hunt down this very same issue
> for several months now ...
> 
> Yes, it _does_ occur.
> Not on PCI, mind, but definitely for FC. RDMA might have a slightly
> better chance of not hitting it, but even there we have seen it.

And this patch prevents the namespaces from being removed, which _is_
the wrong behavior we need to prevent. As I said, we should just not
remove that namespace, instead of trying to synchronize the removal
with everything else...

I think I asked this but it was not answered: why are we removing
the namespace at all? Do others do the same thing (remove the
disk if revalidation fails)?

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH rfc 1/2] nvme: don't remove namespace if revalidate failed because of controller reset
  2019-07-31  7:13                       ` Hannes Reinecke
@ 2019-07-31 18:08                         ` Sagi Grimberg
  0 siblings, 0 replies; 42+ messages in thread
From: Sagi Grimberg @ 2019-07-31 18:08 UTC (permalink / raw)



>>>> Keith,
>>>>
>>>> As for the failure during reset scenario, this is happening only when
>>>> the namespace is about to go away or something is seriously wrong right
>>>> (looking from where nvme_kill_queues is called).
>>>>
>>>> Do you still think we should avoid calling the revalidate_disk if the
>>>> controller is resetting?
>>>
>>> I was considering if a reset happens to trigger when nvme's
>>> revalidate_disk tries to read identify namespace. It's possible that the
>>> command gets aborted, and we don't retry admin commands, so we'd return
>>> -ENODEV and nvme_validate_ns() removes an otherwise healthy namespace.
>>
>> But if the ctrl->state condition is after the revalidate_disk call it
>> won't remove the namespace.
>> -- 
>>                 if (ns->disk && revalidate_disk(ns->disk) &&
>>                     ctrl->state == NVME_CTRL_LIVE)
>>                         nvme_ns_remove(ns);
>> -- 
>>
>>>
>>> I'm not too concerned about this corner case actually occurring in
>>> practice, though.
>>
>> Say, why are we removing the namespace if the revalidate failed?
>> I don't see an equivalent behavior in scsi or others in drivers/block...
> 
> That's probably due to the convoluted logic in nvme_scan_ns_list().
> That scans all namespaces in the list, and removes those which fail.
> Which seems sensible, but really shouldn't be called during reset
> as we cannot get any sensible results for any of those calls.
> 
> This was the reason for my original patch to flush the scanning thread
> before entering reset. Maybe we should revive it...

I stand firm that controller resets should not attempt to flush
the scan_work with some hooks to try to make it bail out early. I just
don't think it's the correct solution. Note that the scan flush existed
in the reset flow before (in nvme_stop_ctrl) but was moved away because it
caused issues before and probably will again if it's restored.

When the controller resets, I/Os either fast-fail or block; the model is
fine in my mind, but needs some hashing out to be more aware of that.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH rfc 1/2] nvme: don't remove namespace if revalidate failed because of controller reset
  2019-07-31  6:58                   ` Hannes Reinecke
@ 2019-07-31 18:11                     ` Sagi Grimberg
  2019-07-31 20:02                       ` Hannes Reinecke
  0 siblings, 1 reply; 42+ messages in thread
From: Sagi Grimberg @ 2019-07-31 18:11 UTC (permalink / raw)



>> You are correct, this was why I had the ctrl->state check after
>> revalidate_disk so if it failed because we are in a reset we should
>> not remove the namespace.
>>
>> We need a reliable way to NOT remove the namespace if revalidate_disk
>> failed because the controller is resetting and we don't have a channel
>> to the controller at this very moment...
>>
> A similar thing is true for the 'ns_head' structure; I could easily
> envision a scenario where all paths are dropped (due to revalidate
> failures or whatnot) with all controllers in reset.

Hannes, paths should not be dropped in controller resets, only when
the controller is going away. It's as simple as that.

> But as all paths are dropped struct ns_head doesn't have any reference
> to any namespaces (and consequently, to any controller), and will be
> failing all I/O before itself being removed.
> It will be re-created after resetting completes and all namespaces are
> re-attached, for sure, but it still means that we're getting I/O errors
> where we really shouldn't.

Paths should only be removed when the controller is going away, not when
it is temporarily disconnected.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH rfc 1/2] nvme: don't remove namespace if revalidate failed because of controller reset
  2019-07-31 12:18   ` Hannes Reinecke
@ 2019-07-31 18:16     ` Sagi Grimberg
  2019-07-31 20:04       ` Hannes Reinecke
  0 siblings, 1 reply; 42+ messages in thread
From: Sagi Grimberg @ 2019-07-31 18:16 UTC (permalink / raw)



>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>> index fa31da0762b9..5f6970e7ba73 100644
>> --- a/drivers/nvme/host/core.c
>> +++ b/drivers/nvme/host/core.c
>> @@ -3428,7 +3428,8 @@ static void nvme_validate_ns(struct nvme_ctrl *ctrl, unsigned nsid)
>>   
>>   	ns = nvme_find_get_ns(ctrl, nsid);
>>   	if (ns) {
>> -		if (ns->disk && revalidate_disk(ns->disk))
>> +		if (ns->disk && revalidate_disk(ns->disk) &&
>> +		    ctrl->state != NVME_CTRL_LIVE)
>>   			nvme_ns_remove(ns);
>>   		nvme_put_ns(ns);
>>   	} else
>>
> The thing I'm worried about here is concurrency (this was also what
> caused my earlier attempts to be rejected).
> Thing is, the controller state can be set asynchronously. Which in
> itself is protected by ctrl->lock, but here we're not taking the lock at
> all. Hence we might be seeing a stale value, causing us to makt the
> wrong decision.

This is why the state check is _after_ the failure: if we failed to do
I/O because of a reset, it will be reflected in the state.

> I have a patchset checking 'ctrl->state' under lock; will be posting it
> later.

Unless the lock wraps your actions according to the state, the state is
still advisory.

> And another thing: what is the point in calling 'revalidate_disk()'
> if the controller is not live? At best it won't do anything; at worst
> we'll stall if we have to do more than one I/O (the reset cycle will
> abort _one_ I/O, but if we have to submit more than one we'll be stuck,
> too).

Exactly because of the race that you are indicating. We can't reliably
know if the controller is resetting (or will be starting a reset soon)
so we go ahead with the revalidate knowing that if the controller is
resetting it will fail. That is why the state check is after (this tells
us if the revalidation failed because of a reset or something else).

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH rfc 2/2] nvme: fix possible use-after-free condition when controller reset is racing namespace scanning
  2019-07-31 12:23   ` Hannes Reinecke
@ 2019-07-31 18:21     ` Sagi Grimberg
  2019-08-01  7:24       ` Hannes Reinecke
  0 siblings, 1 reply; 42+ messages in thread
From: Sagi Grimberg @ 2019-07-31 18:21 UTC (permalink / raw)



>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>> index 5f6970e7ba73..9f8f8f5feeae 100644
>> --- a/drivers/nvme/host/core.c
>> +++ b/drivers/nvme/host/core.c
>> @@ -3406,6 +3406,11 @@ static void nvme_ns_remove(struct nvme_ns *ns)
>>   	synchronize_rcu(); /* guarantee not available in head->list */
>>   	nvme_mpath_clear_current_path(ns);
>>   	synchronize_srcu(&ns->head->srcu); /* wait for concurrent submissions */
>> +	nvme_mpath_check_last_path(ns);
>> +
>> +	down_write(&ns->ctrl->namespaces_rwsem);
>> +	list_del_init(&ns->list);
>> +	up_write(&ns->ctrl->namespaces_rwsem);
>>   
>>   	if (ns->disk && ns->disk->flags & GENHD_FL_UP) {
>>   		del_gendisk(ns->disk);
>> @@ -3414,11 +3419,6 @@ static void nvme_ns_remove(struct nvme_ns *ns)
>>   			blk_integrity_unregister(ns->disk);
>>   	}
>>   
>> -	down_write(&ns->ctrl->namespaces_rwsem);
>> -	list_del_init(&ns->list);
>> -	up_write(&ns->ctrl->namespaces_rwsem);
>> -
>> -	nvme_mpath_check_last_path(ns);
>>   	nvme_put_ns(ns);
>>   }
>>   
>>
> How does this one play with nvme_stop_queues()/nvme_start_queues() we're
> doing during reset?
> I.e., what happens if this is called after nvme_stop_queues(), but before
> nvme_start_queues()?
> We'll end up having a stopped queue when deleting the disk; from what
> I've seen blk_cleanup_queue() will just freeze the queue and wait for
> I/O to complete, which it never will as the queue is stopped ...

nvme_stop/start_queues are quiescing the queue, not freezing it;
quiescing does not block requests in blk_queue_enter.
So I don't think that blk_cleanup_queue will hang for quiesced queues.
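
For reference, the distinction between the two mechanisms (block-layer
APIs of this era; a sketch, not nvme code):
--
/* Quiesce: stop dispatching to ->queue_rq(); blk_queue_enter() still
 * succeeds, so new requests are queued but not dispatched.
 */
blk_mq_quiesce_queue(q);
blk_mq_unquiesce_queue(q);

/* Freeze: drain q_usage_counter; blk_queue_enter() blocks new entrants
 * until the queue is unfrozen, and the freeze completes only once
 * requests that already entered the queue have finished.
 */
blk_freeze_queue_start(q);
blk_mq_freeze_queue_wait(q);
blk_mq_unfreeze_queue(q);
--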

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH rfc 1/2] nvme: don't remove namespace if revalidate failed because of controller reset
  2019-07-31 18:03                       ` Sagi Grimberg
@ 2019-07-31 19:32                         ` Keith Busch
  2019-07-31 20:08                           ` Sagi Grimberg
  0 siblings, 1 reply; 42+ messages in thread
From: Keith Busch @ 2019-07-31 19:32 UTC (permalink / raw)


On Wed, Jul 31, 2019 at 11:03:41AM -0700, Sagi Grimberg wrote:
> 
> I think I asked this but it was not answered: why are we removing
> the namespace at all? Do others do the same thing (remove the
> disk if revalidation fails)?

If a namespace no longer exists, what do you want to do with it instead
of removing it?

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH rfc 1/2] nvme: don't remove namespace if revalidate failed because of controller reset
  2019-07-31 18:11                     ` Sagi Grimberg
@ 2019-07-31 20:02                       ` Hannes Reinecke
  2019-07-31 20:16                         ` Sagi Grimberg
  0 siblings, 1 reply; 42+ messages in thread
From: Hannes Reinecke @ 2019-07-31 20:02 UTC (permalink / raw)


On 7/31/19 8:11 PM, Sagi Grimberg wrote:
> 
>>> You are correct, this was why I had the ctrl->state check after
>>> revalidate_disk so if it failed because we are in a reset we should
>>> not remove the namespace.
>>>
>>> We need a reliable way to NOT remove the namespace if revalidate_disk
>>> failed because the controller is resetting and we don't have a channel
>>> to the controller at this very moment...
>>>
>> A similar thing is true for the 'ns_head' structure; I could easily
>> envision a scenario where all paths are dropped (due to revalidate
>> failures or whatnot) with all controllers in reset.
> 
> Hannes, paths should not be dropped in controller resets, only when
> the controller is going away. It's as simple as that.
> 
Fully agree. But so far we don't seem to be there yet.

>> But as all paths are dropped struct ns_head doesn't have any reference
>> to any namespaces (and consequently, to any controller), and will be
>> failing all I/O before itself being removed.
>> It will be re-created after resetting completes and all namespaces are
>> re-attached, for sure, but it still means that we're getting I/O errors
>> where we really shouldn't.
> 
> Paths should only be removed when the controller is going away, not when
> it is temporarily disconnected.

I'm fully in line with that.
However, it looks as if this isn't the case currently :-)
And whenever I try to come up with a solution here I always come back to 
the thought: It would've been easier if we just could flush scan_work ...

And my main worry here is: surely we'll manage to hash out the situation 
eventually, and fixup all code paths to avoid dropping paths during reset.
But the solution will be very fragile, as any innocent change there 
might break things again. So I'm not convinced that asynchronous reset 
and rescan is a feature which should be kept.
(And honestly, we're working towards a solution where scanning during 
reset will become a no-op ...)

But anyway, prime objective is to get this mess fixed :-)

Cheers,

Hannes
-- 
Dr. Hannes Reinecke            Teamlead Storage & Networking
hare at suse.de                              +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH rfc 1/2] nvme: don't remove namespace if revalidate failed because of controller reset
  2019-07-31 18:16     ` Sagi Grimberg
@ 2019-07-31 20:04       ` Hannes Reinecke
  2019-07-31 20:37         ` Sagi Grimberg
  0 siblings, 1 reply; 42+ messages in thread
From: Hannes Reinecke @ 2019-07-31 20:04 UTC (permalink / raw)


On 7/31/19 8:16 PM, Sagi Grimberg wrote:
> 
>>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>>> index fa31da0762b9..5f6970e7ba73 100644
>>> --- a/drivers/nvme/host/core.c
>>> +++ b/drivers/nvme/host/core.c
>>> @@ -3428,7 +3428,8 @@ static void nvme_validate_ns(struct nvme_ctrl *ctrl, unsigned nsid)
>>>  	ns = nvme_find_get_ns(ctrl, nsid);
>>>  	if (ns) {
>>> -		if (ns->disk && revalidate_disk(ns->disk))
>>> +		if (ns->disk && revalidate_disk(ns->disk) &&
>>> +		    ctrl->state == NVME_CTRL_LIVE)
>>>  			nvme_ns_remove(ns);
>>>  		nvme_put_ns(ns);
>>>  	} else
>>>
>> The thing I'm worried about here is concurrency (this was also what
>> caused my earlier attempts to be rejected).
>> Thing is, the controller state can be set asynchronously. Which in
>> itself is protected by ctrl->lock, but here we're not taking the lock at
>> all. Hence we might be seeing a stale value, causing us to makt the
>> wrong decision.
> 
> This is why the state check is _after_ the failure: if we failed to do
> I/O because of a reset, it will be reflected in the state.
> 
>> I have a patchset checking 'ctrl->state' under lock; will be posting it
>> later.
> 
> Unless the lock wraps your actions according to the state, the state
> is still advisory.
> 
>> And another thing: where's the point in 'revalidate_disk()' to be called
>> if the controller is not live? At best it won't do anything; at worst
>> we'll stall if we have to do more than one I/O (the reset cycle will
>> abort _one_ I/O, but if we have to submit more than one we'll be stuck,
>> too).
> 
> Exactly because of the race that you are indicating. We can't reliably
> know if the controller is resetting (or will be starting a reset soon)
> so we go ahead with the revalidate knowing that if the controller is
> resetting it will fail. That is why the state check is after (this tells
> us if the revalidation failed because of a reset or something else).

Okay, fair enough.

Reviewed-by: Hannes Reinecke <hare at suse.com>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke            Teamlead Storage & Networking
hare at suse.de                              +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH rfc 1/2] nvme: don't remove namespace if revalidate failed because of controller reset
  2019-07-31 19:32                         ` Keith Busch
@ 2019-07-31 20:08                           ` Sagi Grimberg
  2019-07-31 20:16                             ` Keith Busch
  0 siblings, 1 reply; 42+ messages in thread
From: Sagi Grimberg @ 2019-07-31 20:08 UTC (permalink / raw)



>> I think I asked this but was not answered, why are we removing
>> the namespace at all? do others do the same thing (remove the
>> disk if revalidation fails)?
> 
> If a namespace no longer exists,

Why does it no longer exist? It failed revalidate..

> what do you want to do with it instead of removing it?

Well, I don't see anyone else even checking the return
status of revalidate_disk.. Perhaps Hannes can share more
on how scsi handles this?

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH rfc 1/2] nvme: don't remove namespace if revalidate failed because of controller reset
  2019-07-31 20:08                           ` Sagi Grimberg
@ 2019-07-31 20:16                             ` Keith Busch
  2019-07-31 20:45                               ` Sagi Grimberg
  0 siblings, 1 reply; 42+ messages in thread
From: Keith Busch @ 2019-07-31 20:16 UTC (permalink / raw)


On Wed, Jul 31, 2019 at 01:08:05PM -0700, Sagi Grimberg wrote:
> 
> > > I think I asked this but was not answered, why are we removing
> > > the namespace at all? do others do the same thing (remove the
> > > disk if revalidation fails)?
> > 
> > If a namespace no longer exists,
> 
> Why does it no longer exist? It failed revalidate..

One way it fails to validate is if it doesn't exist, i.e., the
controller returned an error when attempting to identify it.

The other way it may fail to revalidate is if its identify has changed
since we last discovered it, so removal is better than data corruption.

Either scenario could happen from administrative namespace management
commands to any controller in the subsystem.
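
For reference, both failure modes are visible in nvme_revalidate_disk()
of this era (a trimmed paraphrase, error handling elided):
--
static int nvme_revalidate_disk(struct gendisk *disk)
{
	struct nvme_ns *ns = disk->private_data;
	struct nvme_ctrl *ctrl = ns->ctrl;
	struct nvme_id_ns *id;
	struct nvme_ns_ids ids;
	int ret = 0;

	/* mode 1: identify fails -- ns gone, or the command was canceled */
	id = nvme_identify_ns(ctrl, ns->head->ns_id);
	if (!id)
		return -ENODEV;

	/* mode 2: the namespace behind this nsid changed identity */
	nvme_report_ns_ids(ctrl, ns->head->ns_id, id, &ids);
	if (!nvme_ns_ids_equal(&ns->head->ids, &ids))
		ret = -ENODEV;

	kfree(id);
	return ret;
}
--
Note that both modes collapse into the same -ENODEV, which is part of
why the caller cannot tell a transient reset from a real removal.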
 
> > what do you want to do with it instead of removing it?
> 
> Well, I don't see anyone else even checking the return
> status of revalidate_disk.. Perhaps Hannes can share more
> on how scsi handles this?

Other drivers may have a different path to this, but I'd imagine they
handle a LUN deletion somehow.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH rfc 1/2] nvme: don't remove namespace if revalidate failed because of controller reset
  2019-07-31 20:02                       ` Hannes Reinecke
@ 2019-07-31 20:16                         ` Sagi Grimberg
  0 siblings, 0 replies; 42+ messages in thread
From: Sagi Grimberg @ 2019-07-31 20:16 UTC (permalink / raw)



> Fully agree. But so far we don't seem to be there yet.

Right.

>>> But as all paths are dropped struct ns_head doesn't have any reference
>>> to any namespaces (and consequently, to any controller), and will be
>>> failing all I/O before itself being removed.
>>> It will be re-created after resetting completes and all namespaces are
>>> re-attached, for sure, but it still means that we're getting I/O errors
>>> where we really shouldn't.
>>
>> Paths should only be removed when the controller is going away, not when
>> it is temporarily disconnected.
> 
> I'm fully in line with that.
> However, it looks as if this isn't the case currently :-)
> And whenever I try to come up with a solution here I always come back to 
> the thought: It would've been easier if we just could flush scan_work ...
> 
> And my main worry here is: surely we'll manage to hash out the situation 
> eventually, and fixup all code paths to avoid dropping paths during reset.
> But the solution will be very fragile, as any innocent change there 
> might break things again. So I'm not convinced that asynchronous reset 
> and rescan is a feature which should be kept.
> (And honestly, we're working towards a solution where scanning during 
> reset will become a no-op ...)

But flushing the scan_work is even more fragile.. the scan_work is a
non-trivial work that may have dependencies on other processes.

For example, passthru commands with effects take the scan lock for
their duration, which means the scan work may be blocked even before it
checks the ctrl state (which is, again, unreliable with respect to resets).
So do resets flush all outstanding passthru commands now?
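
This is the dependency I mean, simplified from nvme_passthru_start() of
this era (details elided; the lock is dropped again in
nvme_passthru_end()):
--
static u32 nvme_passthru_start(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
			       u8 opcode)
{
	u32 effects = nvme_command_effects(ctrl, ns, opcode);

	if (effects & (NVME_CMD_EFFECTS_LBCC | NVME_CMD_EFFECTS_CSE_MASK)) {
		mutex_lock(&ctrl->scan_lock);	/* held across the command */
		nvme_start_freeze(ctrl);
		nvme_wait_freeze(ctrl);
	}
	return effects;
}
--
A flush_work(&ctrl->scan_work) from reset context could therefore sit
behind a passthru command that is itself waiting on the reset to make
progress.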

So I think that flushing the scan_work might bring us new issues because
of this dependency. That is why I think we should address the root cause
which is that we're wrongly removing the namespace, not the fact that we
failed to scan it.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH rfc 1/2] nvme: don't remove namespace if revalidate failed because of controller reset
  2019-07-31 20:04       ` Hannes Reinecke
@ 2019-07-31 20:37         ` Sagi Grimberg
  0 siblings, 0 replies; 42+ messages in thread
From: Sagi Grimberg @ 2019-07-31 20:37 UTC (permalink / raw)



>> Exactly because of the race that you are indicating. We can't reliably
>> know if the controller is resetting (or will be starting a reset soon)
>> so we go ahead with the revalidate knowing that if the controller is
>> resetting it will fail. That is why the state check is after (this tells
>> us if the revalidation failed because of a reset or something else).
> 
> Okay, fair enough.
> 
> Reviewed-by: Hannes Reinecke <hare at suse.com>

Hannes,

Can we understand if this addresses the issues you have reported?

From the cover-letter:
Hey Hannes,

Here are two patches that, to my understanding, address the issues you
describe in your patchset "nvme: flush rescan worker before resetting"
and your report of "spurious I/O errors during failover".

[...]

Can you please apply the following two patches and report if
they address the issues you are seeing? And if not, can you please
report a call trace of the hung threads?

And, given that you are in a multipath environment, can you apply
these on top of: "nvme: fix controller removal race with scan work"?

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH rfc 1/2] nvme: don't remove namespace if revalidate failed because of controller reset
  2019-07-31 20:16                             ` Keith Busch
@ 2019-07-31 20:45                               ` Sagi Grimberg
  2019-07-31 20:58                                 ` Keith Busch
  0 siblings, 1 reply; 42+ messages in thread
From: Sagi Grimberg @ 2019-07-31 20:45 UTC (permalink / raw)



>>>> I think I asked this but was not answered, why are we removing
>>>> the namespace at all? do others do the same thing (remove the
>>>> disk if revalidation fails)?
>>>
>>> If a namespace no longer exists,
>>
>> Why does it no longer exist? It failed revalidate..
> 
> One way it fails to validate is if it doesn't exist, i.e., the
> controller returned an error when attempting to identify it.
> 
> The other way it may fail to revalidate is if its identify has changed
> since we last discovered it, so removal is better than data corruption.

Well, perhaps we can mark failures resulting from reset with a transport
error.

For example, nvme_cancel_request is setting NVME_SC_ABORT_REQ; perhaps
we can modify nvme_error_status to map that to BLK_STS_TRANSPORT and
check for that as the return code from revalidate_disk?
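
A minimal sketch of what I mean, assuming the status propagates up
through revalidate_disk (untested):
--
static blk_status_t nvme_error_status(struct request *req)
{
	switch (nvme_req(req)->status & 0x7ff) {
	case NVME_SC_SUCCESS:
		return BLK_STS_OK;
	case NVME_SC_ABORT_REQ:
		/* set by nvme_cancel_request() during reset/teardown */
		return BLK_STS_TRANSPORT;
	/* ... existing cases ... */
	default:
		return BLK_STS_IOERR;
	}
}
--
BLK_STS_TRANSPORT maps to -ENOLINK via blk_status_to_errno(), so the
scan path could skip nvme_ns_remove() on that errno. One wrinkle: the
identify helpers currently swallow the status (they return NULL on any
error), so the status would also need to be plumbed up to the caller.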

Thoughts?

> Either scenario could happen from administrative namespace management
> commands to any controller in the subsystem.
>   
>>> what do you want to do with it instead of removing it?
>>
>> Well, I don't see anyone else even checking the return
>> status of revalidate_disk.. Perhaps Hannes can share more
>> on how scsi handles this?
> 
> Other drivers may have a different path to this, but I'd imagine they
> handle a LUN deletion somehow.

Well, it's not coming from a revalidate_disk failure as the return status
is not checked...

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH rfc 1/2] nvme: don't remove namespace if revalidate failed because of controller reset
  2019-07-31 20:45                               ` Sagi Grimberg
@ 2019-07-31 20:58                                 ` Keith Busch
  2019-07-31 21:14                                   ` Sagi Grimberg
  0 siblings, 1 reply; 42+ messages in thread
From: Keith Busch @ 2019-07-31 20:58 UTC (permalink / raw)


On Wed, Jul 31, 2019 at 01:45:12PM -0700, Sagi Grimberg wrote:
> 
> > > > > I think I asked this but was not answered, why are we removing
> > > > > the namespace at all? do others do the same thing (remove the
> > > > > disk if revalidation fails)?
> > > > 
> > > > If a namespace no longer exists,
> > > 
> > > Why does it no longer exist? It failed revalidate..
> > 
> > One way it fails to validate is if it doesn't exist, i.e., the
> > controller returned an error when attempting to identify it.
> > 
> > The other way it may fail to revalidate is if its identify has changed
> > since we last discovered it, so removal is better than data corruption.
> 
> Well, perhaps we can mark failures resulting from reset with a transport
> error.
> 
> For example, nvme_cancel_request is setting NVME_SC_ABORT_REQ; perhaps
> we can modify nvme_error_status to map that to BLK_STS_TRANSPORT and
> check for that as the return code from revalidate_disk?
> 
> Thoughts?

Would it be sufficient to let these admin commands requeue? Instead of
flushing the scan work, we can let it block for IO on a reset, and the
IO will resume when the reset completes.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH rfc 1/2] nvme: don't remove namespace if revalidate failed because of controller reset
  2019-07-31 20:58                                 ` Keith Busch
@ 2019-07-31 21:14                                   ` Sagi Grimberg
  2019-07-31 21:54                                     ` Keith Busch
  0 siblings, 1 reply; 42+ messages in thread
From: Sagi Grimberg @ 2019-07-31 21:14 UTC (permalink / raw)



>>> The other way it may fail to revalidate is if its identify has changed
>>> since we last discovered it, so removal is better than data corruption.
>>
>> Well, perhaps we can mark failures resulting from reset with a transport
>> error.
>>
>> For example, nvme_cancel_request is setting NVME_SC_ABORT_REQ; perhaps
>> we can modify nvme_error_status to map that to BLK_STS_TRANSPORT and
>> check for that as the return code from revalidate_disk?
>>
>> Thoughts?
> 
> Would it be sufficient to let these admin commands requeue? Instead of
> flushing the scan work, we can let it block for IO on a reset, and the
> IO will resume when the reset completes.

Well, I don't think we should do that. Unlike I/O commands, which can
failover to a different path, these admin commands are bound to the
specific controller. In case it takes minutes/hours/days for the
controller to restore normal operation, it will be unpleasant to say
the least to have admin operations get stuck for so long.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH rfc 1/2] nvme: don't remove namespace if revalidate failed because of controller reset
  2019-07-31 21:14                                   ` Sagi Grimberg
@ 2019-07-31 21:54                                     ` Keith Busch
  2019-08-01  1:13                                       ` Sagi Grimberg
  0 siblings, 1 reply; 42+ messages in thread
From: Keith Busch @ 2019-07-31 21:54 UTC (permalink / raw)


On Wed, Jul 31, 2019 at 02:14:25PM -0700, Sagi Grimberg wrote:
> 
> > > > The other way it may fail to revalidate is if its identify has changed
> > > > since we last discovered it, so removal is better than data corruption.
> > > 
> > > Well, perhaps we can mark failures resulting from reset with a transport
> > > error.
> > > 
> > > For example, nvme_cancel_request is setting NVME_SC_ABORT_REQ; perhaps
> > > we can modify nvme_error_status to map that to BLK_STS_TRANSPORT and
> > > check for that as the return code from revalidate_disk?
> > > 
> > > Thoughts?
> > 
> > Would it be sufficient to let these admin commands requeue? Instead of
> > flushing the scan work, we can let it block for IO on a reset, and the
> > IO will resume when the reset completes.
> 
> Well, I don't think we should do that. Unlike I/O commands, which can
> failover to a different path, these admin commands are bound to the
> specific controller. In case it takes minutes/hours/days for the
> controller to restore normal operation, it will be unpleasant to say
> the least to have admin operations get stuck for so long.

Unpleasant for who? The scan_work is the only thing waiting for these
commands, no one else should care because you can't run IO if you're
stuck in very long reset anyway.

I think the main point is that we don't want to take a delete action on
a transient condition, but sprinkling NVME_CTRL_LIVE checks is open to
many other races.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH rfc 1/2] nvme: don't remove namespace if revalidate failed because of controller reset
  2019-07-31 21:54                                     ` Keith Busch
@ 2019-08-01  1:13                                       ` Sagi Grimberg
  2019-08-01 14:33                                         ` Keith Busch
  0 siblings, 1 reply; 42+ messages in thread
From: Sagi Grimberg @ 2019-08-01  1:13 UTC (permalink / raw)



>> Well, I don't think we should do that. Unlike I/O commands, which can
>> failover to a different path, these admin commands are bound to the
>> specific controller. In case it takes minutes/hours/days for the
>> controller to restore normal operation, it will be unpleasant to say
>> the least to have admin operations get stuck for so long.
> 
> Unpleasant for who? The scan_work is the only thing waiting for these
> commands, no one else should care because you can't run IO if you're
> stuck in very long reset anyway.

The hung task detector would care, and a user who will attempt to issue
a passthru command, and the rest of the system, which has one of its
kworkers sacrificed for a significant amount of time...

> I think the main point is that we don't want to take a delete action on
> a transient condition, but sprinkling NVME_CTRL_LIVE checks is open to
> many other races.

Hence I suggested the transport error code...

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH rfc 2/2] nvme: fix possible use-after-free condition when controller reset is racing namespace scanning
  2019-07-31 18:21     ` Sagi Grimberg
@ 2019-08-01  7:24       ` Hannes Reinecke
  2019-08-01 18:46         ` Sagi Grimberg
  0 siblings, 1 reply; 42+ messages in thread
From: Hannes Reinecke @ 2019-08-01  7:24 UTC (permalink / raw)


On 7/31/19 8:21 PM, Sagi Grimberg wrote:
> 
>>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>>> index 5f6970e7ba73..9f8f8f5feeae 100644
>>> --- a/drivers/nvme/host/core.c
>>> +++ b/drivers/nvme/host/core.c
>>> @@ -3406,6 +3406,11 @@ static void nvme_ns_remove(struct nvme_ns *ns)
>>>  	synchronize_rcu(); /* guarantee not available in head->list */
>>>  	nvme_mpath_clear_current_path(ns);
>>>  	synchronize_srcu(&ns->head->srcu); /* wait for concurrent submissions */
>>> +	nvme_mpath_check_last_path(ns);
>>> +
>>> +	down_write(&ns->ctrl->namespaces_rwsem);
>>> +	list_del_init(&ns->list);
>>> +	up_write(&ns->ctrl->namespaces_rwsem);
>>>  
>>>  	if (ns->disk && ns->disk->flags & GENHD_FL_UP) {
>>>  		del_gendisk(ns->disk);
>>> @@ -3414,11 +3419,6 @@ static void nvme_ns_remove(struct nvme_ns *ns)
>>>  			blk_integrity_unregister(ns->disk);
>>>  	}
>>>  
>>> -	down_write(&ns->ctrl->namespaces_rwsem);
>>> -	list_del_init(&ns->list);
>>> -	up_write(&ns->ctrl->namespaces_rwsem);
>>> -
>>> -	nvme_mpath_check_last_path(ns);
>>>  	nvme_put_ns(ns);
>>>  }
>>>
>> How does this one play with nvme_stop_queues()/nvme_start_queues() we're
>> doing during reset?
>> IE what happens if this is called after nvme_stop_queues(), but before
>> nvme_start_queues()?
>> We'll end up having a stopped queue when deleting the disk; from what
>> I've seen blk_cleanup_queue() will just freeze the queue and wait for
>> I/O to complete, which it never will, as the queue is stopped ...
> 
> nvme_stop/start_queue quiesce and unquiesce the queue; they do not
> freeze it, so requests are not blocked in blk_queue_enter.
> So I don't think that blk_cleanup_queue will hang for quiesced queues.

My worry here is for quiesced queues; they are not unquiesced when
blk_cleanup_queue() is running. So I'm not sure if we're able to flush all
commands then.

But be it as it may, I've sent a patchset for checking the NS_REMOVING
flag when traversing the namespace list. That should take care of this
issue, and should render this patch obsolete.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare at suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH rfc 1/2] nvme: don't remove namespace if revalidate failed because of controller reset
  2019-08-01  1:13                                       ` Sagi Grimberg
@ 2019-08-01 14:33                                         ` Keith Busch
  2019-08-01 18:52                                           ` Sagi Grimberg
  0 siblings, 1 reply; 42+ messages in thread
From: Keith Busch @ 2019-08-01 14:33 UTC (permalink / raw)


On Wed, Jul 31, 2019 at 06:13:06PM -0700, Sagi Grimberg wrote:
> 
> >> Well, I don't think we should do that. Unlike I/O commands, which can
> >> failover to a different path, these admin commands are bound to the
> >> specific controller. In case it takes minutes/hours/days for the
> >> controller to restore normal operation, it will be unpleasant to say
> >> the least to have admin operations get stuck for so long.
> > 
> > Unpleasant for who? The scan_work is the only thing waiting for these
> > commands, no one else should care because you can't run IO if you're
> > stuck in very long reset anyway.
> 
> The hung task detector would care, and a user who will attempt to issue
> a passthru command, and the rest of the system that have one of the
> kworkers sacrificed for a significant amount of time...

blk_execute_rq already defeats hung task detection for stalled IO.

My point, though, was passthru doesn't care about scan_work. A submitted
passthru command is blocked for reset, so blocking scan_work doesn't
make that situation any better or worse.
 
> > I think the main point is that we don't want to take a delete action on
> > a transient condition, but sprinkling NVME_CTRL_LIVE checks is open to
> > many other races.
> 
> Hence I suggested the transport error code...

That should work too.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH rfc 2/2] nvme: fix possible use-after-free condition when controller reset is racing namespace scanning
  2019-08-01  7:24       ` Hannes Reinecke
@ 2019-08-01 18:46         ` Sagi Grimberg
  0 siblings, 0 replies; 42+ messages in thread
From: Sagi Grimberg @ 2019-08-01 18:46 UTC (permalink / raw)



>> nvme_stop/start_queue quiesce and unquiesce the queue; they do not
>> freeze it, so requests are not blocked in blk_queue_enter.
>> So I don't think that blk_cleanup_queue will hang for quiesced queues.
> 
> My worry here is for quiesced queues; they are not unquiesced when
> blk_cleanup_queue() is running. So I'm not sure if we're able to flush all
> commands then.

blk_cleanup_queue does not block on quiesced queues; it blocks on frozen
queues.
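
A sketch of the relevant steps, paraphrased from block/blk-core.c of
this era (teardown details elided); the drain is tied to
q_usage_counter, not to the quiesced state:
--
void blk_cleanup_queue(struct request_queue *q)
{
	/* Mark the queue dying: blk_queue_enter() callers now fail,
	 * and a freeze is started. */
	blk_set_queue_dying(q);

	/* Wait for q_usage_counter to drain to zero. */
	blk_freeze_queue(q);

	/* ... actual teardown ... */
}
--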

> But be it as it may, I've sent a patchset for checking the NS_REMOVING
> flag when traversing the namespace list. That should take care of this
> issue, and should render this patch obsolete.

I've shared feedback on that one; I need to understand the crash that
you are talking about.. and why promoting the list removal in
nvme_ns_remove() doesn't address it.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH rfc 1/2] nvme: don't remove namespace if revalidate failed because of controller reset
  2019-08-01 14:33                                         ` Keith Busch
@ 2019-08-01 18:52                                           ` Sagi Grimberg
  0 siblings, 0 replies; 42+ messages in thread
From: Sagi Grimberg @ 2019-08-01 18:52 UTC (permalink / raw)



>>>> Well, I don't think we should do that. Unlike I/O commands, which can
>>>> failover to a different path, these admin commands are bound to the
>>>> specific controller. In case it takes minutes/hours/days for the
>>>> controller to restore normal operation, it will be unpleasant to say
>>>> the least to have admin operations get stuck for so long.
>>>
>>> Unpleasant for who? The scan_work is the only thing waiting for these
>>> commands, no one else should care because you can't run IO if you're
>>> stuck in very long reset anyway.
>>
>> The hung task detector would care, and a user who will attempt to issue
>> a passthru command, and the rest of the system, which has one of its
>> kworkers sacrificed for a significant amount of time...
> 
> blk_execute_rq already defeats hung task detection for stalled IO.
> 
> My point, though, was passthru doesn't care about scan_work. A submitted
> passthru command is blocked for reset,

Not in fabrics drivers (unless I'm missing something that changed).

> so blocking scan_work doesn't make that situation any better or worse.

I think that when we talk about reset in fabrics, we have in mind a long
process (mainly because of the fact that network port failures are a lot
more frequent and span some amount of time). This is why fabric drivers,
when they get a transport error, go into the reset flow and
quiesce+terminate+unquiesce to fast-fail admin commands.
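
The shape of that sequence, modeled loosely on the nvme-rdma/nvme-tcp
teardown of this era (the function name here is illustrative):
--
static void nvme_fabrics_teardown_io_queues(struct nvme_ctrl *ctrl)
{
	/* quiesce: stop dispatching new requests to the (dead) transport */
	nvme_stop_queues(ctrl);

	/* terminate: cancel everything in flight with NVME_SC_ABORT_REQ */
	blk_mq_tagset_busy_iter(ctrl->tagset, nvme_cancel_request, ctrl);

	/* unquiesce: pending and new requests now fail fast, so multipath
	 * can fail them over instead of leaving them stuck */
	nvme_start_queues(ctrl);
}
--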

^ permalink raw reply	[flat|nested] 42+ messages in thread

end of thread, other threads:[~2019-08-01 18:52 UTC | newest]

Thread overview: 42+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-07-29 23:31 [PATCH rfc 0/2] nvme controller reset and namespace scan work race conditions Sagi Grimberg
2019-07-29 23:32 ` [PATCH rfc 1/2] nvme: don't remove namespace if revalidate failed because of controller reset Sagi Grimberg
2019-07-30  0:59   ` Keith Busch
2019-07-30  1:04     ` Sagi Grimberg
2019-07-30  1:04   ` Ming Lei
2019-07-30  1:06     ` Sagi Grimberg
2019-07-30  1:10       ` Ming Lei
2019-07-30  1:19         ` Sagi Grimberg
2019-07-30  1:30           ` Ming Lei
2019-07-30  1:40             ` Sagi Grimberg
2019-07-30  2:09               ` Ming Lei
2019-07-30 17:12                 ` Sagi Grimberg
2019-07-30 17:30                   ` Keith Busch
2019-07-30 18:15                     ` Sagi Grimberg
2019-07-31  7:13                       ` Hannes Reinecke
2019-07-31 18:08                         ` Sagi Grimberg
2019-07-31  7:01                     ` Hannes Reinecke
2019-07-31 14:16                       ` Keith Busch
2019-07-31 18:03                       ` Sagi Grimberg
2019-07-31 19:32                         ` Keith Busch
2019-07-31 20:08                           ` Sagi Grimberg
2019-07-31 20:16                             ` Keith Busch
2019-07-31 20:45                               ` Sagi Grimberg
2019-07-31 20:58                                 ` Keith Busch
2019-07-31 21:14                                   ` Sagi Grimberg
2019-07-31 21:54                                     ` Keith Busch
2019-08-01  1:13                                       ` Sagi Grimberg
2019-08-01 14:33                                         ` Keith Busch
2019-08-01 18:52                                           ` Sagi Grimberg
2019-07-31  6:58                   ` Hannes Reinecke
2019-07-31 18:11                     ` Sagi Grimberg
2019-07-31 20:02                       ` Hannes Reinecke
2019-07-31 20:16                         ` Sagi Grimberg
2019-07-31 12:18   ` Hannes Reinecke
2019-07-31 18:16     ` Sagi Grimberg
2019-07-31 20:04       ` Hannes Reinecke
2019-07-31 20:37         ` Sagi Grimberg
2019-07-29 23:32 ` [PATCH rfc 2/2] nvme: fix possible use-after-free condition when controller reset is racing namespace scanning Sagi Grimberg
2019-07-31 12:23   ` Hannes Reinecke
2019-07-31 18:21     ` Sagi Grimberg
2019-08-01  7:24       ` Hannes Reinecke
2019-08-01 18:46         ` Sagi Grimberg
