All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v3] nvme: fix identify error status silent ignore
@ 2020-06-26 17:46 Sagi Grimberg
  2020-06-29  6:43 ` Christoph Hellwig
  2020-06-29  8:55 ` Niklas Cassel
  0 siblings, 2 replies; 5+ messages in thread
From: Sagi Grimberg @ 2020-06-26 17:46 UTC (permalink / raw)
  To: linux-nvme, Christoph Hellwig, Keith Busch

Patch 59c7c3caaaf8 intended to only silently ignore
non retry-able errors (DNR bit set) such that we can still
identify misbehaving controllers, and in the other hand
propagate retry-able errors (DNR bit cleared) so we don't
wrongly abandon a namespace just because it happens to be
temporarily inaccessible.

The goal remains the same as the original commit where this
was introduced but unfortunately had the logic backwards.

Fixes: 59c7c3caaaf8 ("nvme: fix possible hang when ns
scanning fails during error recovery")
Reported-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
---
Changes from v2:
- added comment on non-trivial code

Changes from v1:
- remove paranthesis

 drivers/nvme/host/core.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 2afed32d3892..92dc2327bf3a 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1128,9 +1128,15 @@ static int nvme_identify_ns_descs(struct nvme_ctrl *ctrl, unsigned nsid,
 			"Identify Descriptors failed (%d)\n", status);
 		 /*
 		  * Don't treat an error as fatal, as we potentially already
-		  * have a NGUID or EUI-64.
+		  * have a NGUID or EUI-64. If we failed with DNR set, we want
+		  * to silently ignore the error as we can still identify
+		  * the device, but if the status has DNR set, we want
+		  * to propogate the error back specifically for the disk
+		  * revalidation flow to make sure we don't abandon the
+		  * device just because of a temporal retry-able error (such
+		  * as path of transport errors).
 		  */
-		if (status > 0 && !(status & NVME_SC_DNR))
+		if (status > 0 && status & NVME_SC_DNR)
 			status = 0;
 		goto free_data;
 	}
-- 
2.25.1


_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [PATCH v3] nvme: fix identify error status silent ignore
  2020-06-26 17:46 [PATCH v3] nvme: fix identify error status silent ignore Sagi Grimberg
@ 2020-06-29  6:43 ` Christoph Hellwig
  2020-06-29  7:08   ` Sagi Grimberg
  2020-06-29  8:55 ` Niklas Cassel
  1 sibling, 1 reply; 5+ messages in thread
From: Christoph Hellwig @ 2020-06-29  6:43 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: Keith Busch, Christoph Hellwig, linux-nvme

On Fri, Jun 26, 2020 at 10:46:29AM -0700, Sagi Grimberg wrote:
> Patch 59c7c3caaaf8 intended to only silently ignore
> non retry-able errors (DNR bit set) such that we can still
> identify misbehaving controllers, and in the other hand
> propagate retry-able errors (DNR bit cleared) so we don't
> wrongly abandon a namespace just because it happens to be
> temporarily inaccessible.
> 
> The goal remains the same as the original commit where this
> was introduced but unfortunately had the logic backwards.

So how did this work all the time?  I'm really worried that someone
this is actually going to break things.  What do you think about
only queueing it up for 5.9 for now?

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH v3] nvme: fix identify error status silent ignore
  2020-06-29  6:43 ` Christoph Hellwig
@ 2020-06-29  7:08   ` Sagi Grimberg
  2020-06-29 11:49     ` Christoph Hellwig
  0 siblings, 1 reply; 5+ messages in thread
From: Sagi Grimberg @ 2020-06-29  7:08 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Keith Busch, linux-nvme


>> Patch 59c7c3caaaf8 intended to only silently ignore
>> non retry-able errors (DNR bit set) such that we can still
>> identify misbehaving controllers, and in the other hand
>> propagate retry-able errors (DNR bit cleared) so we don't
>> wrongly abandon a namespace just because it happens to be
>> temporarily inaccessible.
>>
>> The goal remains the same as the original commit where this
>> was introduced but unfortunately had the logic backwards.
> 
> So how did this work all the time?  I'm really worried that someone
> this is actually going to break things.  What do you think about
> only queueing it up for 5.9 for now?

The code is broken now, the original patch was a mistake. It didn't
fix the scan/error-recovery race, but also broke the ignore we had
for pci. So I think this should go to 5.8-rc and also be picked up
for stable 5.7.

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH v3] nvme: fix identify error status silent ignore
  2020-06-26 17:46 [PATCH v3] nvme: fix identify error status silent ignore Sagi Grimberg
  2020-06-29  6:43 ` Christoph Hellwig
@ 2020-06-29  8:55 ` Niklas Cassel
  1 sibling, 0 replies; 5+ messages in thread
From: Niklas Cassel @ 2020-06-29  8:55 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: Keith Busch, Christoph Hellwig, linux-nvme

On Fri, Jun 26, 2020 at 10:46:29AM -0700, Sagi Grimberg wrote:
> Patch 59c7c3caaaf8 intended to only silently ignore
> non retry-able errors (DNR bit set) such that we can still
> identify misbehaving controllers, and in the other hand
> propagate retry-able errors (DNR bit cleared) so we don't
> wrongly abandon a namespace just because it happens to be
> temporarily inaccessible.
> 
> The goal remains the same as the original commit where this
> was introduced but unfortunately had the logic backwards.
> 
> Fixes: 59c7c3caaaf8 ("nvme: fix possible hang when ns
> scanning fails during error recovery")
> Reported-by: Keith Busch <kbusch@kernel.org>
> Reviewed-by: Keith Busch <kbusch@kernel.org>
> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
> ---
> Changes from v2:
> - added comment on non-trivial code
> 
> Changes from v1:
> - remove paranthesis
> 
>  drivers/nvme/host/core.c | 10 ++++++++--
>  1 file changed, 8 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 2afed32d3892..92dc2327bf3a 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -1128,9 +1128,15 @@ static int nvme_identify_ns_descs(struct nvme_ctrl *ctrl, unsigned nsid,
>  			"Identify Descriptors failed (%d)\n", status);
>  		 /*
>  		  * Don't treat an error as fatal, as we potentially already
> -		  * have a NGUID or EUI-64.
> +		  * have a NGUID or EUI-64. If we failed with DNR set, we want
> +		  * to silently ignore the error as we can still identify
> +		  * the device, but if the status has DNR set, we want
> +		  * to propogate the error back specifically for the disk

s/propogate/propagate/

Perhaps this minor nit could be fixed up while applying?


Kind regards,
Niklas

> +		  * revalidation flow to make sure we don't abandon the
> +		  * device just because of a temporal retry-able error (such
> +		  * as path of transport errors).
>  		  */
> -		if (status > 0 && !(status & NVME_SC_DNR))
> +		if (status > 0 && status & NVME_SC_DNR)
>  			status = 0;
>  		goto free_data;
>  	}
> -- 
> 2.25.1
> 
> 
> _______________________________________________
> Linux-nvme mailing list
> Linux-nvme@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-nvme
_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH v3] nvme: fix identify error status silent ignore
  2020-06-29  7:08   ` Sagi Grimberg
@ 2020-06-29 11:49     ` Christoph Hellwig
  0 siblings, 0 replies; 5+ messages in thread
From: Christoph Hellwig @ 2020-06-29 11:49 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: Keith Busch, Christoph Hellwig, linux-nvme

Applied to nvme-5.8 with minor fixups.

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2020-06-29 11:49 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-06-26 17:46 [PATCH v3] nvme: fix identify error status silent ignore Sagi Grimberg
2020-06-29  6:43 ` Christoph Hellwig
2020-06-29  7:08   ` Sagi Grimberg
2020-06-29 11:49     ` Christoph Hellwig
2020-06-29  8:55 ` Niklas Cassel

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.