* [PATCH] nvme-core: fix io interrupt when working with dm-multipath
From: Chao Leng @ 2020-07-27  5:58 UTC
To: linux-nvme; +Cc: kbusch, axboe, hch, lengchao, sagi

NVM Express 1.4 defines:
  Command Interrupted: Command processing was interrupted and the
  controller is unable to successfully complete the command. The host
  should retry the command. If this status code is returned, then
  the controller shall clear the Do Not Retry bit to '0' in the Status
  field of the CQE (refer to Figure 124). The controller shall not
  return this status code unless the host has set the Advanced Command
  Retry Enable (ACRE) field to 1h in the Host Behavior Support feature
  (refer to section 5.21.1.22).

According to this definition, NVME_SC_CMD_INTERRUPTED needs to be
retried. The status code should not be translated to BLK_STS_TARGET,
because dm-multipath then returns the error to the application, so the
I/O fails outright whenever the target returns NVME_SC_CMD_INTERRUPTED.
Letting it fall through to the default BLK_STS_IOERR translation instead
allows dm-multipath to fail over to another path and retry the I/O.

Signed-off-by: Chao Leng <lengchao@huawei.com>
---
 drivers/nvme/host/core.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index c2c5bc4fb702..1a1ad5e5212c 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -211,7 +211,6 @@ static blk_status_t nvme_error_status(u16 status)
 	case NVME_SC_CAP_EXCEEDED:
 		return BLK_STS_NOSPC;
 	case NVME_SC_LBA_RANGE:
-	case NVME_SC_CMD_INTERRUPTED:
 	case NVME_SC_NS_NOT_READY:
 		return BLK_STS_TARGET;
 	case NVME_SC_BAD_ATTRIBUTES:
--
2.16.4

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme
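[Editor's note: to make the effect of the one-line removal concrete, here is a small self-contained model of the nvme_error_status() translation after this patch. The constants and enum are simplified stand-ins for illustration, not the kernel's real definitions; only the cases relevant to this discussion are shown.]

```c
#include <assert.h>

/* Simplified stand-ins; the real definitions live in
 * include/linux/nvme.h and include/linux/blk_types.h. */
enum blk_status { BLK_STS_OK, BLK_STS_NOSPC, BLK_STS_TARGET, BLK_STS_IOERR };

#define NVME_SC_CAP_EXCEEDED	0x81
#define NVME_SC_LBA_RANGE	0x80
#define NVME_SC_NS_NOT_READY	0x82
#define NVME_SC_CMD_INTERRUPTED	0x21

/* Model of nvme_error_status() with the NVME_SC_CMD_INTERRUPTED case
 * removed: that status now falls through to the default arm and maps
 * to BLK_STS_IOERR, which dm-multipath treats as retryable. */
static enum blk_status nvme_error_status_model(unsigned short status)
{
	switch (status & 0x7ff) {	/* mask off DNR/More bits */
	case NVME_SC_CAP_EXCEEDED:
		return BLK_STS_NOSPC;
	case NVME_SC_LBA_RANGE:
	case NVME_SC_NS_NOT_READY:
		return BLK_STS_TARGET;
	default:
		return BLK_STS_IOERR;
	}
}
```

With the patch applied, NVME_SC_CMD_INTERRUPTED maps to BLK_STS_IOERR while NVME_SC_LBA_RANGE and NVME_SC_NS_NOT_READY still map to BLK_STS_TARGET.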
* Re: [PATCH] nvme-core: fix io interrupt when working with dm-multipath
From: Christoph Hellwig @ 2020-07-28 11:19 UTC
To: Chao Leng; +Cc: kbusch, axboe, hch, linux-nvme, sagi

On Mon, Jul 27, 2020 at 01:58:18PM +0800, Chao Leng wrote:
> NVM Express 1.4 defines:
>   Command Interrupted: Command processing was interrupted and the
>   controller is unable to successfully complete the command. The host
>   should retry the command. [...]
>
> According to this definition, NVME_SC_CMD_INTERRUPTED needs to be
> retried. The status code should not be translated to BLK_STS_TARGET,
> because dm-multipath then returns the error to the application, so the
> I/O fails outright whenever the target returns NVME_SC_CMD_INTERRUPTED.
> Letting it fall through to the default BLK_STS_IOERR translation
> instead allows dm-multipath to fail over to another path and retry the
> I/O.

IOERR still seems wrong, though.
* Re: [PATCH] nvme-core: fix io interrupt when working with dm-multipath
From: Chao Leng @ 2020-07-29  2:54 UTC
To: Christoph Hellwig; +Cc: kbusch, axboe, sagi, linux-nvme

On 2020/7/28 19:19, Christoph Hellwig wrote:
> On Mon, Jul 27, 2020 at 01:58:18PM +0800, Chao Leng wrote:
>> [...]
>
> IOERR still seems wrong, though.

BLK_STS_TARGET means the target hit a critical error, while
NVME_SC_CMD_INTERRUPTED only means the target wants the I/O retried, so
translating NVME_SC_CMD_INTERRUPTED to BLK_STS_TARGET is not a good
fit. BLK_STS_IOERR may not be ideal either; perhaps we should translate
NVME_SC_CMD_INTERRUPTED to BLK_STS_AGAIN.

We could do it like this:
---
 drivers/nvme/host/core.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index c2c5bc4fb702..359ce471df1d 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -211,7 +211,6 @@ static blk_status_t nvme_error_status(u16 status)
 	case NVME_SC_CAP_EXCEEDED:
 		return BLK_STS_NOSPC;
 	case NVME_SC_LBA_RANGE:
-	case NVME_SC_CMD_INTERRUPTED:
 	case NVME_SC_NS_NOT_READY:
 		return BLK_STS_TARGET;
 	case NVME_SC_BAD_ATTRIBUTES:
@@ -236,6 +235,8 @@ static blk_status_t nvme_error_status(u16 status)
 		return BLK_STS_NEXUS;
 	case NVME_SC_HOST_PATH_ERROR:
 		return BLK_STS_TRANSPORT;
+	case NVME_SC_CMD_INTERRUPTED:
+		return BLK_STS_AGAIN;
 	default:
 		return BLK_STS_IOERR;
 	}
--
2.16.4
* Re: [PATCH] nvme-core: fix io interrupt when working with dm-multipath
From: Christoph Hellwig @ 2020-07-29  5:59 UTC
To: Chao Leng; +Cc: kbusch, axboe, Christoph Hellwig, linux-nvme, sagi

On Wed, Jul 29, 2020 at 10:54:29AM +0800, Chao Leng wrote:
> [...]
> BLK_STS_IOERR may not be ideal either; perhaps we should translate
> NVME_SC_CMD_INTERRUPTED to BLK_STS_AGAIN.
>
> We could do it like this:

BLK_STS_AGAIN is a bad choice as we use it for calls that block when
the callers asked for non-blocking submission.  I'm really not sure
we want to change anything here - the error definition clearly states
it is not a failure but a request to retry later.
* Re: [PATCH] nvme-core: fix io interrupt when working with dm-multipath
From: Chao Leng @ 2020-07-30  1:49 UTC
To: Christoph Hellwig; +Cc: kbusch, axboe, sagi, linux-nvme, John.Meneghini

On 2020/7/29 13:59, Christoph Hellwig wrote:
> On Wed, Jul 29, 2020 at 10:54:29AM +0800, Chao Leng wrote:
>> [...]
>> BLK_STS_IOERR may not be ideal either; perhaps we should translate
>> NVME_SC_CMD_INTERRUPTED to BLK_STS_AGAIN.
>>
>> We could do it like this:
>
> BLK_STS_AGAIN is a bad choice as we use it for calls that block when
> the callers asked for non-blocking submission.  I'm really not sure
> we want to change anything here - the error definition clearly states
> it is not a failure but a request to retry later.

BLK_STS_AGAIN is not a good choice, but neither is BLK_STS_TARGET. I
found the patch that made NVME_SC_CMD_INTERRUPTED translate to
BLK_STS_TARGET:

commit 35038bffa87da282010b91108cadd13238bb5bbd
nvme: Translate more status codes to blk_status_t

    Decode interrupted command and not ready namespace nvme status codes
    to BLK_STS_TARGET. These are not generic IO errors and should use a
    non-path specific error so that it can use the non-failover retry
    path.

    Reported-by: John Meneghini <John.Meneghini@netapp.com>

In the old code, NVME_SC_CMD_INTERRUPTED had to be translated to
BLK_STS_TARGET because nvme multipath checked blk_path_error() and we
wanted the I/O retried on the current path; there was no better choice
then. But nvme multipath has since been improved and no longer depends
on blk_path_error(), so the translation is no longer needed.

According to the description of BLK_STS_TARGET:
	[BLK_STS_TARGET] = { -EREMOTEIO, "critical target" }
it is easily mistaken for a fatal error on the storage. I checked all
the BLK_STS_xx codes, and none of them is a good fit for
NVME_SC_CMD_INTERRUPTED now. In the absence of a suitable code, keeping
the default is a good choice, so I strongly suggest removing the
translation of NVME_SC_CMD_INTERRUPTED to BLK_STS_TARGET.

John, what do you think?
* Re: [PATCH] nvme-core: fix io interrupt when working with dm-multipath
From: Chao Leng @ 2020-08-05  6:40 UTC
Cc: linux-nvme, John.Meneghini

John, what do you think about removing the translation of
NVME_SC_CMD_INTERRUPTED to BLK_STS_TARGET? Thank you.

On 2020/7/30 9:49, Chao Leng wrote:
> [...]
> John, what do you think?
* Re: [PATCH] nvme-core: fix io interrupt when working with dm-multipath
From: Keith Busch @ 2020-08-05 15:29 UTC
To: Chao Leng; +Cc: linux-nvme, John.Meneghini

On Wed, Aug 05, 2020 at 02:40:48PM +0800, Chao Leng wrote:
> John, what do you think about removing the translation of
> NVME_SC_CMD_INTERRUPTED to BLK_STS_TARGET? Thank you.

Perhaps BLK_STS_TARGET should only failfast for REQ_FAILFAST_DEV rather
than REQ_FAILFAST_PATH.
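[Editor's note: a hedged sketch of Keith's suggestion. The names below are purely illustrative stand-ins to model the device-vs-path distinction; the kernel's real failfast flags are REQ_FAILFAST_DEV, REQ_FAILFAST_TRANSPORT and REQ_FAILFAST_DRIVER, with different values, and "REQ_FAILFAST_PATH" here is a hypothetical alias for the path-level bits.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified flags modelling the suggestion. */
#define REQ_FAILFAST_DEV	0x1	/* fail fast on device-side errors */
#define REQ_FAILFAST_PATH	0x2	/* fail fast on path errors (stand-in name) */

enum blk_status { BLK_STS_OK, BLK_STS_TARGET, BLK_STS_TRANSPORT, BLK_STS_IOERR };

/* BLK_STS_TARGET is a device-side condition, so under this idea it
 * would only honor device-level failfast; path-style errors would
 * honor the path-level failfast bits. */
static bool should_failfast(enum blk_status sts, uint32_t cmd_flags)
{
	if (sts == BLK_STS_TARGET)
		return cmd_flags & REQ_FAILFAST_DEV;
	return cmd_flags & REQ_FAILFAST_PATH;
}
```

Under this model, a path-failfast request that gets BLK_STS_TARGET would still be retried, while a device-failfast request would not.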
* Re: [PATCH] nvme-core: fix io interrupt when working with dm-multipath
From: Chao Leng @ 2020-08-06  5:52 UTC
To: Keith Busch; +Cc: linux-nvme, John.Meneghini

On 2020/8/5 23:29, Keith Busch wrote:
> On Wed, Aug 05, 2020 at 02:40:48PM +0800, Chao Leng wrote:
>> [...]
>
> Perhaps BLK_STS_TARGET should only failfast for REQ_FAILFAST_DEV rather
> than REQ_FAILFAST_PATH.

We currently translate NVME_SC_LBA_RANGE, NVME_SC_CMD_INTERRUPTED and
NVME_SC_NS_NOT_READY to BLK_STS_TARGET. NVME_SC_LBA_RANGE should
failfast, because a retry cannot succeed. NVME_SC_NS_NOT_READY may
succeed on retry, but the probability is very low.
NVME_SC_CMD_INTERRUPTED needs a retry, and according to the spec the
retry will succeed. So I recommend adding a local preferential retry
mechanism, like this:

---
 drivers/nvme/host/core.c | 2 +-
 drivers/nvme/host/nvme.h | 8 ++++++++
 2 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 1a1ad5e5212c..3d39528051ad 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -242,7 +242,7 @@ static blk_status_t nvme_error_status(u16 status)
 
 static inline bool nvme_req_needs_retry(struct request *req)
 {
-	if (blk_noretry_request(req))
+	if (!nvme_req_local_retry(req) && blk_noretry_request(req))
 		return false;
 	if (nvme_req(req)->status & NVME_SC_DNR)
 		return false;
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index c0f4226d3299..17064455f2db 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -710,4 +710,12 @@ void nvme_hwmon_init(struct nvme_ctrl *ctrl);
 static inline void nvme_hwmon_init(struct nvme_ctrl *ctrl) { }
 #endif
 
+static inline bool nvme_req_local_retry(struct request *req)
+{
+	if (nvme_req(req)->status == NVME_SC_CMD_INTERRUPTED)
+		return true;
+	return false;
+}
+
+
 #endif /* _NVME_H */
--
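[Editor's note: the retry decision in Chao's proposal can be sketched as a stand-alone model. The struct, accessor and flag values below are simplified illustrations, not the kernel's real definitions; only NVME_SC_DNR and NVME_SC_CMD_INTERRUPTED use their spec values.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-ins; real definitions are in include/linux/blk_types.h
 * and include/linux/nvme.h. */
#define REQ_FAILFAST_MASK	0x1	/* models blk_noretry_request()'s flags */
#define NVME_SC_DNR		0x4000
#define NVME_SC_CMD_INTERRUPTED	0x21

struct req_model {
	uint32_t cmd_flags;		/* failfast bits set by upper layers */
	uint16_t status;		/* NVMe completion status */
	unsigned int retries, max_retries;
};

/* Per the spec, the controller clears DNR for Command Interrupted,
 * so a plain equality check suffices in this model. */
static bool nvme_req_local_retry(const struct req_model *req)
{
	return req->status == NVME_SC_CMD_INTERRUPTED;
}

/* Model of nvme_req_needs_retry() with the proposed change: a failfast
 * request is still retried locally for Command Interrupted, but DNR and
 * the retry budget are still honored. */
static bool nvme_req_needs_retry(const struct req_model *req)
{
	if (!nvme_req_local_retry(req) && (req->cmd_flags & REQ_FAILFAST_MASK))
		return false;
	if (req->status & NVME_SC_DNR)
		return false;
	if (req->retries >= req->max_retries)
		return false;
	return true;
}
```

With this model, a failfast request completing with NVME_SC_CMD_INTERRUPTED is still retried, while a failfast request with any other status, or any status carrying DNR, is not.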
* Re: [PATCH] nvme-core: fix io interrupt when working with dm-multipath
From: Keith Busch @ 2020-08-06 14:26 UTC
To: Chao Leng; +Cc: linux-nvme, John.Meneghini

On Thu, Aug 06, 2020 at 01:52:42PM +0800, Chao Leng wrote:
> NVME_SC_LBA_RANGE should failfast, because a retry cannot succeed.

The DNR bit is how NVMe conveys such things.

> NVME_SC_NS_NOT_READY may succeed on retry, but the probability is very
> low. NVME_SC_CMD_INTERRUPTED needs a retry, and according to the spec
> the retry will succeed.

If the upper layers set a request such that "noretry" is true, that's
the behavior you're going to get. Sprinkling special exceptions around
is not a good idea.
* Re: [PATCH] nvme-core: fix io interrupt when working with dm-multipath
From: Meneghini, John @ 2020-08-06 15:59 UTC
To: Keith Busch, Chao Leng
Cc: Christoph Hellwig, Ewan Milne, Hannes de, linux-nvme, Meneghini, John

Thanks for cc'ing me on this thread. The email from this DL usually goes
into a folder, so I don't watch it too closely. Cc me directly if you
need my attention; that way the email shows up at the top of my inbox.

>> On 8/5/20, 2:40 AM, "Chao Leng" <lengchao@huawei.com> wrote:
>> John, what do you think about removing the translation of
>> NVME_SC_CMD_INTERRUPTED to BLK_STS_TARGET? Thank you.

I think returning BLK_STS_TARGET for NVME_SC_CMD_INTERRUPTED and
NVME_SC_NS_NOT_READY was a potentially non-backwards-compatible change
that Keith made in response to a problem I reported, in patch
35038bffa87da.

https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=35038bffa87da

I don't think it would be a mistake to back out this change. We've had a
number of problems in the distributions with dm-mp and other legacy code
getting confused by it.

>> [...]
>
> BLK_STS_AGAIN is a bad choice as we use it for calls that block when
> the callers asked for non-blocking submission.  I'm really not sure
> we want to change anything here - the error definition clearly states
> it is not a failure but a request to retry later.

Agreed. The historical behavior in nvme-core has been to return
BLK_STS_IOERR for all new or unknown NVMe errors.

On 8/6/20, 10:26 AM, "Keith Busch" <kbusch@kernel.org> wrote:
> On Thu, Aug 06, 2020 at 01:52:42PM +0800, Chao Leng wrote:
>> NVME_SC_LBA_RANGE should failfast, because a retry cannot succeed.
>
> The DNR bit is how NVMe conveys such things.

I think NVME_SC_LBA_RANGE has historically returned BLK_STS_TARGET, so
that shouldn't be changed. I agree that DNR - and now the new ACRE
mechanism - is what should control command retry behavior in NVMe.

> If the upper layers set a request such that "noretry" is true, that's
> the behavior you're going to get. Sprinkling special exceptions around
> is not a good idea.

I agree. I think the problem here is that the current BLK_STS and
FAST_FAIL mechanisms were designed to support legacy protocols like
SCSI. They assume that all retry behavior is controlled by other
components in the stack. NVMe is presenting new protocol features and
semantics which probably can't be effectively supported by those legacy
BLK_STS and FAST_FAIL mechanisms without passing more information up
the stack.

So I think the best approach is to maintain backwards compatibility with
the existing BLK_STS codes and semantics, or to create new errors and
mechanisms - which will require new software to handle those new error
codes all the way up the stack. But don't overload things.

John Meneghini
ONTAP SAN Target Architect
johnm@netapp.com
* Re: [PATCH] nvme-core: fix io interrupt when working with dm-multipath
From: Meneghini, John @ 2020-08-06 16:17 UTC
To: Keith Busch, Chao Leng
Cc: msnitzer, linux-nvme, Christoph Hellwig, Hannes de, Ewan Milne, Meneghini, John

On 8/6/20, 11:59 AM, "Meneghini, John" <John.Meneghini@netapp.com> wrote:

>> Maybe translating to BLK_STS_IOERR is also not suitable; we should
>> translate NVME_SC_CMD_INTERRUPTED to BLK_STS_AGAIN.

I think this depends upon what the error handling is up the stack for
BLK_STS_IOERR. What does DM do with BLK_STS_IOERR?

> BLK_STS_AGAIN is a bad choice as we use it for calls that block when
> the callers asked for non-blocking submission.  I'm really not sure
> we want to change anything here - the error definition clearly states
> it is not a failure but a request to retry later.

So it sounds like you may need a new BLK_STS error. However, even if you
add a new error, that's not going to be enough to communicate the CRDT
or DNR information up the stack.

} blk_errors[] = {
	[BLK_STS_OK]		= { 0,		"" },
	[BLK_STS_NOTSUPP]	= { -EOPNOTSUPP, "operation not supported" },
	[BLK_STS_TIMEOUT]	= { -ETIMEDOUT,	"timeout" },
	[BLK_STS_NOSPC]		= { -ENOSPC,	"critical space allocation" },
	[BLK_STS_TRANSPORT]	= { -ENOLINK,	"recoverable transport" },
	[BLK_STS_TARGET]	= { -EREMOTEIO,	"critical target" },
	[BLK_STS_NEXUS]		= { -EBADE,	"critical nexus" },
	[BLK_STS_MEDIUM]	= { -ENODATA,	"critical medium" },
	[BLK_STS_PROTECTION]	= { -EILSEQ,	"protection" },
	[BLK_STS_RESOURCE]	= { -ENOMEM,	"kernel resource" },
	[BLK_STS_DEV_RESOURCE]	= { -EBUSY,	"device resource" },
	[BLK_STS_AGAIN]		= { -EAGAIN,	"nonblocking retry" },

	/* device mapper special case, should not leak out: */
	[BLK_STS_DM_REQUEUE]	= { -EREMCHG, "dm internal retry" },

	/* everything else not covered above: */
	[BLK_STS_IOERR]		= { -EIO,	"I/O" },
};
* Re: nvme-core: fix io interrupt when working with dm-multipath
From: Mike Snitzer @ 2020-08-06 18:40 UTC
To: Meneghini, John
Cc: Ewan Milne, Christoph Hellwig, linux-nvme, Chao Leng, Keith Busch, Hannes de

On Thu, Aug 06 2020 at 12:17pm -0400, Meneghini, John <John.Meneghini@netapp.com> wrote:

>> Maybe translating to BLK_STS_IOERR is also not suitable; we should
>> translate NVME_SC_CMD_INTERRUPTED to BLK_STS_AGAIN.
>
> I think this depends upon what the error handling is up the stack for
> BLK_STS_IOERR. What does DM do with BLK_STS_IOERR?

DM treats it as retryable.  See blk_path_error().

> So it sounds like you may need a new BLK_STS error. However, even if
> you add a new error, that's not going to be enough to communicate the
> CRDT or DNR information up the stack.
>
> [...]

We've yet to determine how important it is that the target-provided
delay information be honored...

In any case, NVMe translating NVME_SC_CMD_INTERRUPTED to BLK_STS_TARGET
is definitely wrong. That conveys that the error is not retryable (see
blk_path_error()).

Shouldn't NVMe translate NVME_SC_CMD_INTERRUPTED to BLK_STS_RESOURCE or
BLK_STS_DEV_RESOURCE?

DM will retry immediately if BLK_STS_RESOURCE is returned.
DM will delay a fixed 100ms if BLK_STS_DEV_RESOURCE is used.

(Ming said BLK_STS_RESOURCE isn't Linux specific and can be used by
drivers)
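[Editor's note: since the discussion keeps coming back to blk_path_error(), here is a compilable sketch that mirrors its logic at the time of this thread. The enum values are simplified stand-ins; the real blk_status_t definitions live in include/linux/blk_types.h.]

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-ins for blk_status_t values. */
enum blk_status {
	BLK_STS_OK, BLK_STS_NOTSUPP, BLK_STS_NOSPC, BLK_STS_TARGET,
	BLK_STS_NEXUS, BLK_STS_MEDIUM, BLK_STS_PROTECTION,
	BLK_STS_TRANSPORT, BLK_STS_RESOURCE, BLK_STS_DEV_RESOURCE,
	BLK_STS_IOERR,
};

/* Mirrors blk_path_error(): the listed statuses are target-side
 * failures that failing over to another path cannot fix; anything
 * else could be a path failure and may be retried on another path. */
static bool blk_path_error(enum blk_status error)
{
	switch (error) {
	case BLK_STS_NOTSUPP:
	case BLK_STS_NOSPC:
	case BLK_STS_TARGET:
	case BLK_STS_NEXUS:
	case BLK_STS_MEDIUM:
	case BLK_STS_PROTECTION:
		return false;
	default:
		return true;
	}
}
```

This is why translating NVME_SC_CMD_INTERRUPTED to BLK_STS_TARGET makes dm-multipath fail the I/O to the application: blk_path_error(BLK_STS_TARGET) is false, so no path-based retry is attempted.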
* [PATCH] nvme: restore use of blk_path_error() in nvme_complete_rq() 2020-08-06 18:40 ` Mike Snitzer @ 2020-08-06 19:19 ` Mike Snitzer 2020-08-06 22:42 ` Meneghini, John ` (2 more replies) 0 siblings, 3 replies; 50+ messages in thread From: Mike Snitzer @ 2020-08-06 19:19 UTC (permalink / raw) To: Meneghini, John Cc: Ewan Milne, Christoph Hellwig, linux-nvme, Chao Leng, Keith Busch, Hannes Reinecke On Thu, Aug 06 2020 at 2:40pm -0400, Mike Snitzer <snitzer@redhat.com> wrote: > On Thu, Aug 06 2020 at 12:17pm -0400, > Meneghini, John <John.Meneghini@netapp.com> wrote: > > > On 8/6/20, 11:59 AM, "Meneghini, John" <John.Meneghini@netapp.com> wrote: > > > > Maybe translate to > > >> BLK_STS_IOERR is also not suitable, we should translate > > >> NVME_SC_CMD_INTERRUPTED to BLK_STS_AGAIN. > > > > I think this depends upon what the error handling is up the stack for BLK_STS_IOERR. > > > > What does DM do with BLK_STS_IOERR? > > DM treats it as retryable. See blk_path_error(). > > > > BLK_STS_AGAIN is a bad choice as we use it for calls that block when > > > the callers asked for non-blocking submission. I'm really not sure > > > we want to change anything here - the error definition clearly states > > > it is not a failure but a request to retry later. > > > > So it sounds like you may need a new BLK_STS error. However, even if you add > > a new error, that's not going to be enough to communicate the CRDT or DNR > > information up the stack. 
> > > > } blk_errors[] = { > > [BLK_STS_OK] = { 0, "" }, > > [BLK_STS_NOTSUPP] = { -EOPNOTSUPP, "operation not supported" }, > > [BLK_STS_TIMEOUT] = { -ETIMEDOUT, "timeout" }, > > [BLK_STS_NOSPC] = { -ENOSPC, "critical space allocation" }, > > [BLK_STS_TRANSPORT] = { -ENOLINK, "recoverable transport" }, > > [BLK_STS_TARGET] = { -EREMOTEIO, "critical target" }, > > [BLK_STS_NEXUS] = { -EBADE, "critical nexus" }, > > [BLK_STS_MEDIUM] = { -ENODATA, "critical medium" }, > > [BLK_STS_PROTECTION] = { -EILSEQ, "protection" }, > > [BLK_STS_RESOURCE] = { -ENOMEM, "kernel resource" }, > > [BLK_STS_DEV_RESOURCE] = { -EBUSY, "device resource" }, > > [BLK_STS_AGAIN] = { -EAGAIN, "nonblocking retry" }, > > > > /* device mapper special case, should not leak out: */ > > [BLK_STS_DM_REQUEUE] = { -EREMCHG, "dm internal retry" }, > > > > /* everything else not covered above: */ > > [BLK_STS_IOERR] = { -EIO, "I/O" }, > > }; > > > > We've yet to determine how important it is that the target provided > delay information be honored... > > In any case, NVMe translating NVME_SC_CMD_INTERRUPTED to BLK_STS_TARGET > is definitely wrong. That conveys the error is not retryable (see > blk_path_error()). > > Shouldn't NVMe translate NVME_SC_CMD_INTERRUPTED to BLK_STS_RESOURCE or > BLK_STS_DEV_RESOURCE? > > DM will retry immediately if BLK_STS_RESOURCE is returned. > DM will delay a fixed 100ms if BLK_STS_DEV_RESOURCE is used. Ngh, I got that inverted.. BLK_STS_RESOURCE will result in the 100ms delayed retry. BLK_STS_DEV_RESOURCE results in immediate retry. But going back to BLK_STS_IOERR by reverting commit 35038bffa87 would work too. > (Ming said BLK_STS_RESOURCE isn't Linux [block core] specific and can > be used by drivers) Regardless, reading back on this thread, I think there is at least some consensus about reverting commit 35038bffa87 ("nvme: Translate more status codes to blk_status_t") ? 
And on a related note, building on the thread I started here (but haven't heard back from any NVMe maintainers on): https://www.redhat.com/archives/dm-devel/2020-July/msg00051.html I'd also be happy as a pig in shit if this patch were applied: From: Mike Snitzer <snitzer@redhat.com> Date: Thu, 2 Jul 2020 01:43:27 -0400 Subject: [PATCH] nvme: restore use of blk_path_error() in nvme_complete_rq() Commit 764e9332098c0 ("nvme-multipath: do not reset on unknown status") removed NVMe's use of blk_path_error() -- presumably because nvme_failover_req() was modified to return whether a command should be retried or not. By not using blk_path_error() there is serious potential for regression for how upper layers (e.g. DM multipath) respond to NVMe's error conditions. This has played out now due to commit 35038bffa87 ("nvme: Translate more status codes to blk_status_t"). Had NVMe continued to use blk_path_error() it too would not have retried an NVMe command that got NVME_SC_CMD_INTERRUPTED. Fix this potential for NVMe error handling regression, possibly outside NVMe, by restoring NVMe's use of blk_path_error(). 
Fixes: 764e9332098c0 ("nvme-multipath: do not reset on unknown status") Cc: stable@vger.kernel.org Signed-off-by: Mike Snitzer <snitzer@redhat.com> --- drivers/nvme/host/core.c | 9 +++++++-- 1 file changed, 7 insertions(+), 2 deletions(-) diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c index 6585d57112ad..072f629da4d8 100644 --- a/drivers/nvme/host/core.c +++ b/drivers/nvme/host/core.c @@ -290,8 +290,13 @@ void nvme_complete_rq(struct request *req) nvme_req(req)->ctrl->comp_seen = true; if (unlikely(status != BLK_STS_OK && nvme_req_needs_retry(req))) { - if ((req->cmd_flags & REQ_NVME_MPATH) && nvme_failover_req(req)) - return; + if (blk_path_error(status)) { + if (req->cmd_flags & REQ_NVME_MPATH) { + if (nvme_failover_req(req)) + return; + /* fallthru to normal error handling */ + } + } if (!blk_queue_dying(req->q)) { nvme_retry_req(req); -- 2.18.0 _______________________________________________ Linux-nvme mailing list Linux-nvme@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme ^ permalink raw reply related [flat|nested] 50+ messages in thread
* Re: [PATCH] nvme: restore use of blk_path_error() in nvme_complete_rq() 2020-08-06 19:19 ` [PATCH] nvme: restore use of blk_path_error() in nvme_complete_rq() Mike Snitzer @ 2020-08-06 22:42 ` Meneghini, John 2020-08-07 0:07 ` Mike Snitzer 2020-08-07 0:44 ` [PATCH] nvme: restore use of blk_path_error() in nvme_complete_rq() Sagi Grimberg 2020-08-10 12:43 ` Christoph Hellwig 2 siblings, 1 reply; 50+ messages in thread From: Meneghini, John @ 2020-08-06 22:42 UTC (permalink / raw) To: Mike Snitzer Cc: Ewan Milne, Christoph Hellwig, linux-nvme, Chao Leng, Keith Busch, Meneghini, John, Hannes Reinecke On 8/6/20, 3:20 PM, "Mike Snitzer" <snitzer@redhat.com> wrote: From: Mike Snitzer <snitzer@redhat.com> Date: Thu, 2 Jul 2020 01:43:27 -0400 Subject: [PATCH] nvme: restore use of blk_path_error() in nvme_complete_rq() Commit 764e9332098c0 ("nvme-multipath: do not reset on unknown status") removed NVMe's use of blk_path_error() -- presumably because nvme_failover_req() was modified to return whether a command should be retried or not. Yes, that was a major part of this patch. nvme_failover_req() was completely reworked and fixed because it was badly broken. By not using blk_path_error() there is serious potential for regression for how upper layers (e.g. DM multipath) respond to NVMe's error conditions. This has played out now due to commit 35038bffa87 ("nvme: Translate more status codes to blk_status_t"). Had NVMe continued to use blk_path_error() it too would not have retried an NVMe command that got NVME_SC_CMD_INTERRUPTED. I'm not sure others would consider it broken. The idea was to get the blk_path_error() logic out of the core nvme driver completion routine. Fix this potential for NVMe error handling regression, possibly outside NVMe, by restoring NVMe's use of blk_path_error(). The point is: blk_path_error() has nothing to do with NVMe errors. This is dm-multipath logic stuck in the middle of the NVMe error handling code. 
C symbol: blk_path_error File Function Line 0 dm-mpath.c multipath_end_io 1606 if (error && blk_path_error(error)) { 1 dm-mpath.c multipath_end_io_bio 1646 if (!*error || !blk_path_error(*error)) 2 blk_types.h blk_path_error 118 static inline bool blk_path_error(blk_status_t error) Fixes: 764e9332098c0 ("nvme-multipath: do not reset on unknown status") Cc: stable@vger.kerneel.org Signed-off-by: Mike Snitzer <snitzer@redhat.com> --- drivers/nvme/host/core.c | 9 +++++++-- 1 file changed, 7 insertions(+), 2 deletions(-) diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c index 6585d57112ad..072f629da4d8 100644 --- a/drivers/nvme/host/core.c +++ b/drivers/nvme/host/core.c @@ -290,8 +290,13 @@ void nvme_complete_rq(struct request *req) nvme_req(req)->ctrl->comp_seen = true; if (unlikely(status != BLK_STS_OK && nvme_req_needs_retry(req))) { - if ((req->cmd_flags & REQ_NVME_MPATH) && nvme_failover_req(req)) - return; + if (blk_path_error(status)) { + if (req->cmd_flags & REQ_NVME_MPATH) { + if (nvme_failover_req(req)) + return; + /* fallthru to normal error handling */ + } + } This would basically undo the patch Hannes, Christoph, and I put together in commit 764e9332098c0. This patch greatly simplified and improved the whole nvme_complete_rq() code path, and I don't support undoing that change. If you want blk_path_error() to handle any new NVMe error status differently, I would suggest creating a new BLK_STS code to translate that NVMe status and add it to the one place in nvme_complete_rq() designed to do the NVMe status to BLK_STS translation: nvme_error_status(). Then you can change your application specific error handling code to handle the new BLK_STS code appropriately, further up the stack. Don't stick your application specific error handling logic into the middle of the nvme driver's nvme_complete_rq routine. 
/John if (!blk_queue_dying(req->q)) { nvme_retry_req(req); -- 2.18.0 _______________________________________________ Linux-nvme mailing list Linux-nvme@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme ^ permalink raw reply [flat|nested] 50+ messages in thread
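For reference, the blk_path_error() helper that John's cscope listing points at is short. The userspace model below mirrors its definition in include/linux/blk_types.h at the time of this thread; the enum values here are local stand-ins for illustration, not the kernel's actual blk_status_t encodings:

```c
#include <stdbool.h>

/* Local stand-ins for the kernel's blk_status_t values (illustrative only;
 * the real encodings live in include/linux/blk_types.h). */
typedef int blk_status_t;
enum {
	BLK_STS_OK, BLK_STS_NOTSUPP, BLK_STS_TIMEOUT, BLK_STS_NOSPC,
	BLK_STS_TRANSPORT, BLK_STS_TARGET, BLK_STS_NEXUS, BLK_STS_MEDIUM,
	BLK_STS_PROTECTION, BLK_STS_RESOURCE, BLK_STS_DEV_RESOURCE,
	BLK_STS_AGAIN, BLK_STS_IOERR,
};

/* Modeled on the kernel's blk_path_error(): everything not explicitly
 * listed as a hard target failure is treated as a retryable path error. */
static bool blk_path_error(blk_status_t error)
{
	switch (error) {
	case BLK_STS_NOTSUPP:
	case BLK_STS_NOSPC:
	case BLK_STS_TARGET:
	case BLK_STS_NEXUS:
	case BLK_STS_MEDIUM:
	case BLK_STS_PROTECTION:
		return false;
	}
	/* anything else could be a path failure, so should be retried */
	return true;
}
```

This is the property the whole thread turns on: BLK_STS_TARGET lands on the hard-failure list, so translating NVME_SC_CMD_INTERRUPTED to it makes dm-multipath fail the I/O upward instead of retrying, while BLK_STS_IOERR, BLK_STS_RESOURCE, etc. fall through to "retryable".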
* Re: nvme: restore use of blk_path_error() in nvme_complete_rq() 2020-08-06 22:42 ` Meneghini, John @ 2020-08-07 0:07 ` Mike Snitzer 2020-08-07 1:21 ` Sagi Grimberg 0 siblings, 1 reply; 50+ messages in thread From: Mike Snitzer @ 2020-08-07 0:07 UTC (permalink / raw) To: Meneghini, John Cc: Ewan Milne, Christoph Hellwig, dm-devel, linux-nvme, Chao Leng, Keith Busch, Hannes Reinecke On Thu, Aug 06 2020 at 6:42pm -0400, Meneghini, John <John.Meneghini@netapp.com> wrote: > On 8/6/20, 3:20 PM, "Mike Snitzer" <snitzer@redhat.com> wrote: > > From: Mike Snitzer <snitzer@redhat.com> > Date: Thu, 2 Jul 2020 01:43:27 -0400 > Subject: [PATCH] nvme: restore use of blk_path_error() in nvme_complete_rq() > > Commit 764e9332098c0 ("nvme-multipath: do not reset on unknown > status") removed NVMe's use of blk_path_error() -- presumably because > nvme_failover_req() was modified to return whether a command should be > retried or not. > > Yes, that was a major part of this patch. nvme_failover_req() was completely > reworked and fixed because it was badly broken. Sure, and I didn't change any of that. Still all in place. > By not using blk_path_error() there is serious potential for > regression for how upper layers (e.g. DM multipath) respond to NVMe's > error conditions. This has played out now due to commit 35038bffa87 > ("nvme: Translate more status codes to blk_status_t"). Had NVMe > continued to use blk_path_error() it too would not have retried an > NVMe command that got NVME_SC_CMD_INTERRUPTED. > > I'm not sure others would consider it broken. The idea was to get the blk_path_error() > logic out of the core nvme driver completion routine. > > Fix this potential for NVMe error handling regression, possibly > outside NVMe, by restoring NVMe's use of blk_path_error(). > > The point is: blk_path_error() has nothing to do with NVMe errors. > This is dm-multipath logic stuck in the middle of the NVMe error > handling code. 
No, it is a means to have multiple subsystems (to this point both SCSI and NVMe) doing the correct thing when translating subsystem specific error codes to BLK_STS codes. If you, or others you name drop below, understood the point we wouldn't be having this conversation. You'd accept the point of blk_path_error() to be valid and required codification of what constitutes a retryable path error for the Linux block layer. Any BLK_STS mapping of NVMe specific error codes would need to not screw up by categorizing a retryable error as non-retryable (and vice-versa). Again, assuming proper testing, commit 35038bffa87 wouldn't have made it upstream if NVMe still used blk_path_error(). > C symbol: blk_path_error > > File Function Line > 0 dm-mpath.c multipath_end_io 1606 if (error && blk_path_error(error)) { > 1 dm-mpath.c multipath_end_io_bio 1646 if (!*error || !blk_path_error(*error)) > 2 blk_types.h blk_path_error 118 static inline bool blk_path_error(blk_status_t error) Yes, your commit 764e9332098c0 needlessly removed NVMe's use of blk_path_error(). If you're saying it wasn't needless please explain why. 
> Fixes: 764e9332098c0 ("nvme-multipath: do not reset on unknown status") > Cc: stable@vger.kerneel.org > Signed-off-by: Mike Snitzer <snitzer@redhat.com> > --- > drivers/nvme/host/core.c | 9 +++++++-- > 1 file changed, 7 insertions(+), 2 deletions(-) > > diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c > index 6585d57112ad..072f629da4d8 100644 > --- a/drivers/nvme/host/core.c > +++ b/drivers/nvme/host/core.c > @@ -290,8 +290,13 @@ void nvme_complete_rq(struct request *req) > nvme_req(req)->ctrl->comp_seen = true; > > if (unlikely(status != BLK_STS_OK && nvme_req_needs_retry(req))) { > - if ((req->cmd_flags & REQ_NVME_MPATH) && nvme_failover_req(req)) > - return; > + if (blk_path_error(status)) { > + if (req->cmd_flags & REQ_NVME_MPATH) { > + if (nvme_failover_req(req)) > + return; > + /* fallthru to normal error handling */ > + } > + } > > This would basically undo the patch Hannes, Christoph, and I put together in > commit 764e9332098c0. This patch greatly simplified and improved the > whole nvme_complete_rq() code path, and I don't support undoing that change. Please elaborate on how I've undone anything? The only thing I may have done is forced NVMe to take more care about properly translating its NVME_SC to BLK_STS in nvme_error_status(). Which is a good thing. > If you want blk_path_error() to handle any new NVMe error status differently, I would suggest creating a new > BLK_STS code to translate that NVMe status and add it to the one place in nvme_complete_rq() designed to > do the NVMe status to BLK_STS translation: nvme_error_status(). Then you can change your application specific > error handling code to handle the new BLK_STS code appropriately, further up the stack. Don't stick your > application specific error handling logic into the middle of the nvme driver's nvme_complete_rq routine. Removing NVMe as a primary user of blk_path_error() was a pretty serious mistake. One you clearly aren't willing to own. 
Yet the associated breakage still exists as well as the potential for further regressions. This really isn't some vantage point holy war. Please don't continue to make this something it isn't. Would be cool if you could see what a disservice this hostility is causing. Otherwise you'll continue to taint and/or avoid fixing NVMe with misplaced anti-DM-multipath tribalism. Maybe (re?)read these commit headers: e96fef2c3fa3 nvme: Add more command status translation 908e45643d64 nvme/multipath: Consult blk_status_t for failover 9111e5686c8c block: Provide blk_status_t decoding for path errors e1f425e770d2 nvme/multipath: Use blk_path_error a1275677f8cd dm mpath: Use blk_path_error with: git log e96fef2c3fa3^..a1275677f8cd Anyway, no new BLK_STS is needed at this point. More discipline with how NVMe's error handling is changed is. If NVMe wants to ensure its interface isn't broken regularly it _should_ use blk_path_error() to validate future nvme_error_status() changes. Miscategorizing NVMe errors to upper layers is a bug -- not open for debate. Nor should be using block core infrastructure we put in place to help stabilize how Linux block drivers convey retryable errors. Mike _______________________________________________ Linux-nvme mailing list Linux-nvme@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: nvme: restore use of blk_path_error() in nvme_complete_rq() 2020-08-07 0:07 ` Mike Snitzer @ 2020-08-07 1:21 ` Sagi Grimberg 2020-08-07 4:50 ` Mike Snitzer 0 siblings, 1 reply; 50+ messages in thread From: Sagi Grimberg @ 2020-08-07 1:21 UTC (permalink / raw) To: Mike Snitzer, Meneghini, John Cc: linux-nvme, Christoph Hellwig, dm-devel, Ewan Milne, Chao Leng, Keith Busch, Hannes Reinecke Hey Mike, >> The point is: blk_path_error() has nothing to do with NVMe errors. >> This is dm-multipath logic stuck in the middle of the NVMe error >> handling code. > > No, it is a means to have multiple subsystems (to this point both SCSI > and NVMe) doing the correct thing when translating subsystem specific > error codes to BLK_STS codes. Not exactly, don't find any use of this in scsi. The purpose is to make sure that nvme and dm speak the same language. > If you, or others you name drop below, understood the point we wouldn't > be having this conversation. You'd accept the point of blk_path_error() > to be valid and required codification of what constitutes a retryable > path error for the Linux block layer. This incident is a case where the specific nvme status was designed to retry on the same path respecting the controller retry delay. And because nvme used blk_path_error at the time it caused us to use a non-retryable status to get around that. Granted, no one had dm-multipath in mind. So in a sense, there is consensus on changing patch 35038bffa87da _because_ nvme no longer uses blk_path_error. Otherwise it would break. > Any BLK_STS mapping of NVMe specific error codes would need to not screw > up by categorizing a retryable error as non-retryable (and vice-versa). But it is a special type of retryable. There is nothing that fits the semantics of the current behavior. > Again, assuming proper testing, commit 35038bffa87 wouldn't have made it > upstream if NVMe still used blk_path_error(). Agree. 
> Yes, your commit 764e9332098c0 needlessly removed NVMe's use of > blk_path_error(). If you're saying it wasn't needless please explain > why. > >> Fixes: 764e9332098c0 ("nvme-multipath: do not reset on unknown status") >> Cc: stable@vger.kerneel.org >> Signed-off-by: Mike Snitzer <snitzer@redhat.com> >> --- >> drivers/nvme/host/core.c | 9 +++++++-- >> 1 file changed, 7 insertions(+), 2 deletions(-) >> >> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c >> index 6585d57112ad..072f629da4d8 100644 >> --- a/drivers/nvme/host/core.c >> +++ b/drivers/nvme/host/core.c >> @@ -290,8 +290,13 @@ void nvme_complete_rq(struct request *req) >> nvme_req(req)->ctrl->comp_seen = true; >> >> if (unlikely(status != BLK_STS_OK && nvme_req_needs_retry(req))) { >> - if ((req->cmd_flags & REQ_NVME_MPATH) && nvme_failover_req(req)) >> - return; >> + if (blk_path_error(status)) { >> + if (req->cmd_flags & REQ_NVME_MPATH) { >> + if (nvme_failover_req(req)) >> + return; >> + /* fallthru to normal error handling */ >> + } >> + } >> >> This would basically undo the patch Hannes, Christoph, and I put together in >> commit 764e9332098c0. This patch greatly simplified and improved the >> whole nvme_complete_rq() code path, and I don't support undoing that change. > > Please elaborate on how I've undone anything? I think you're right, this hasn't undone the patch from John, but it breaks NVME_SC_CMD_INTERRUPTED error handling behavior. > The only thing I may have done is forced NVMe to take more care about > properly translating its NVME_SC to BLK_STS in nvme_error_status(). > Which is a good thing. I don't think there is an issue here of mistakenly converting an nvme status code to a wrong block status code. This conversion is there because there is no block status that give us the semantics we need which is apparently specific to nvme. I personally don't mind restoring blk_path_error to nvme, I don't particularly feel strong about it either. 
But for sure blk_path_error needs to first provide the semantics needed for NVME_SC_CMD_INTERRUPTED. ... > Anyway, no new BLK_STS is needed at this point. More discipline with > how NVMe's error handling is changed is. Please read the above. > If NVMe wants to ensure its > interface isn't broken regularly it _should_ use blk_path_error() to > validate future nvme_error_status() changes. Miscategorizing NVMe > errors to upper layers is a bug -- not open for debate. Again, I don't agree it's a miscategorization or a bug, it's just something that is NVMe specific. _______________________________________________ Linux-nvme mailing list Linux-nvme@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: nvme: restore use of blk_path_error() in nvme_complete_rq() 2020-08-07 1:21 ` Sagi Grimberg @ 2020-08-07 4:50 ` Mike Snitzer 2020-08-07 23:35 ` Sagi Grimberg 0 siblings, 1 reply; 50+ messages in thread From: Mike Snitzer @ 2020-08-07 4:50 UTC (permalink / raw) To: Sagi Grimberg Cc: Hannes Reinecke, linux-nvme, Christoph Hellwig, dm-devel, Ewan Milne, Chao Leng, Keith Busch, Meneghini, John On Thu, Aug 06 2020 at 9:21pm -0400, Sagi Grimberg <sagi@grimberg.me> wrote: > Hey Mike, > > >>The point is: blk_path_error() has nothing to do with NVMe errors. > >>This is dm-multipath logic stuck in the middle of the NVMe error > >>handling code. > > > >No, it is a means to have multiple subsystems (to this point both SCSI > >and NVMe) doing the correct thing when translating subsystem specific > >error codes to BLK_STS codes. > > Not exactly, don't find any use of this in scsi. The purpose is to make > sure that nvme and dm speak the same language. SCSI doesn't need to do additional work to train a multipath layer because dm-multipath _is_ SCSI's multipathing in Linux. So ensuring SCSI properly classifies its error codes happens as a side-effect of ensuring continued multipath functionality. Hannes introduced all these differentiated error codes in block core because of SCSI. NVMe is meant to build on the infrastructure that was established. > >If you, or others you name drop below, understood the point we wouldn't > >be having this conversation. You'd accept the point of blk_path_error() > >to be valid and required codification of what constitutes a retryable > >path error for the Linux block layer. > > This incident is a case where the specific nvme status was designed > to retry on the same path respecting the controller retry delay. > And because nvme used blk_path_error at the time it caused us to use a > non-retryable status to get around that. Granted, no one had > dm-multipath in mind. 
> > So in a sense, there is consensus on changing patch 35038bffa87da > _because_ nvme no longer uses blk_path_error. Otherwise it would break. "break" meaning it would do failover instead of the more optimal local retry to the same controller. I see. Wish the header for commit 35038bffa87da touched on this even a little bit ;) But AFAICT the patch I provided doesn't compromise proper local retry -- as long as we first fix nvme_error_status() to return a retryable BLK_STS for NVME_SC_CMD_INTERRUPTED -- which I assumed as a prereq. Think of blk_path_error() as a more coarse-grained "is this retryable or a hard failure?" check. So for NVME_SC_CMD_INTERRUPTED, nvme_error_status() _should_ respond with something retryable (I'd prefer BLK_STS_RESOURCE to be honest). And then nvme_failover_req() is finer-grained; by returning false it now allows short-circuiting failover and reverting back to NVMe's normal controller based error recovery -- which it does for NVME_SC_CMD_INTERRUPTED due to "default" case in nvme_failover_req(). And then the previous nvme_error_status() classification of retryable BLK_STS obviously never gets returned up the IO stack; it gets thrown away. > >Any BLK_STS mapping of NVMe specific error codes would need to not screw > >up by categorizing a retryable error as non-retryable (and vice-versa). > > But it is a special type of retryable. There is nothing that fits the > semantics of the current behavior. I agree. But that's fine actually. And this issue is the poster-child for why properly supporting a duality of driver-level vs upper level multipathing capabilities is pretty much impossible unless a clean design factors out the different error classes -- and local error retry is handled before punting to higher level failover retry. Think if NVMe were to adopt a bit more disciplined "local then failover" error handling it all gets easier. This local retry _is_ NVMe specific. 
NVMe should just own retrying on the same controller no matter what (I'll hope that such retry has awareness to not retry indefinitely though!). And this has nothing to do with multipathing, so the logic to handle it shouldn't be trapped in nvme_failover_req(). I think NVMe can easily fix this by having an earlier stage of checking, e.g. nvme_local_retry_req(), that shortcircuits ever getting to higher-level multipathing consideration (be it native NVMe or DM multipathing) for cases like NVME_SC_CMD_INTERRUPTED. To be clear: the "default" case of nvme_failover_req() that returns false to fallback to NVMe's "local" normal NVMe error handling -- that can stay.. but a more explicit handling of cases like NVME_SC_CMD_INTERRUPTED should be added to a nvme_local_retry_req() check that happens before nvme_failover_req() in nvme_complete_rq(). This is where a patch will speak a thousand words... maybe ;) > >Again, assuming proper testing, commit 35038bffa87 wouldn't have made it > >upstream if NVMe still used blk_path_error(). > > Agree. > > >Yes, your commit 764e9332098c0 needlessly removed NVMe's use of > >blk_path_error(). If you're saying it wasn't needless please explain > >why. 
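A minimal userspace sketch of that proposed ordering follows. To be clear about assumptions: nvme_local_retry_req() and the disposition model are hypothetical names illustrating the idea, not upstream code; the only real value used is the NVME_SC_CMD_INTERRUPTED status code (21h in NVMe 1.4):

```c
#include <stdbool.h>

enum { NVME_SC_CMD_INTERRUPTED = 0x21 };	/* status code 21h, NVMe 1.4 */
enum disposition { DONE, LOCAL_RETRY, FAILOVER, FAIL };

/* Hypothetical helper: statuses the controller asked to be retried on the
 * same controller (respecting any CRD delay), before multipath is even
 * consulted. */
static bool nvme_local_retry_req(unsigned status)
{
	switch (status & 0x7ff) {
	case NVME_SC_CMD_INTERRUPTED:
		return true;
	default:
		return false;
	}
}

/* Simplified completion flow under the proposal: local retry is checked
 * before failover, so neither native NVMe multipath nor dm-multipath ever
 * sees this class of error. */
static enum disposition nvme_complete_model(unsigned status, bool mpath_capable)
{
	if (status == 0)
		return DONE;
	if (nvme_local_retry_req(status))
		return LOCAL_RETRY;
	if (mpath_capable)	/* would call nvme_failover_req() here */
		return FAILOVER;
	return FAIL;
}
```

The design point of the sketch is just the ordering: because the local-retry check runs first, the question "which BLK_STS does NVME_SC_CMD_INTERRUPTED map to?" never reaches the multipath layers at all.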
> > > >> Fixes: 764e9332098c0 ("nvme-multipath: do not reset on unknown status") > >> Cc: stable@vger.kerneel.org > >> Signed-off-by: Mike Snitzer <snitzer@redhat.com> > >> --- > >> drivers/nvme/host/core.c | 9 +++++++-- > >> 1 file changed, 7 insertions(+), 2 deletions(-) > >> > >> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c > >> index 6585d57112ad..072f629da4d8 100644 > >> --- a/drivers/nvme/host/core.c > >> +++ b/drivers/nvme/host/core.c > >> @@ -290,8 +290,13 @@ void nvme_complete_rq(struct request *req) > >> nvme_req(req)->ctrl->comp_seen = true; > >> > >> if (unlikely(status != BLK_STS_OK && nvme_req_needs_retry(req))) { > >> - if ((req->cmd_flags & REQ_NVME_MPATH) && nvme_failover_req(req)) > >> - return; > >> + if (blk_path_error(status)) { > >> + if (req->cmd_flags & REQ_NVME_MPATH) { > >> + if (nvme_failover_req(req)) > >> + return; > >> + /* fallthru to normal error handling */ > >> + } > >> + } > >> > >>This would basically undo the patch Hannes, Christoph, and I put together in > >>commit 764e9332098c0. This patch greatly simplified and improved the > >>whole nvme_complete_rq() code path, and I don't support undoing that change. > > > >Please elaborate on how I've undone anything? > > I think you're right, this hasn't undone the patch from John, but it > breaks NVME_SC_CMD_INTERRUPTED error handling behavior. Yes, again we'd need to first fix nvme_error_status() to return a retryable BLK_STS for NVME_SC_CMD_INTERRUPTED -- which I assumed as a prereq. > >The only thing I may have done is forced NVMe to take more care about > >properly translating its NVME_SC to BLK_STS in nvme_error_status(). > >Which is a good thing. > > I don't think there is an issue here of mistakenly converting an nvme > status code to a wrong block status code. This conversion is there > because there is no block status that give us the semantics we need > which is apparently specific to nvme. 
> > I personally don't mind restoring blk_path_error to nvme, I don't > particularly feel strong about it either. But for sure blk_path_error > needs to first provide the semantics needed for NVME_SC_CMD_INTERRUPTED. Hopefully it doesn't. As I discuss above with my "I think NVMe can easily fix". NVMe should trap this class of "local" retryable error _before_ going on to check if higher-level failover is needed. If it does that: problem solved for both NVMe and DM multipathing. (could be missing something though, I'll try to craft a patch later today to shake it out). > ... > > >Anyway, no new BLK_STS is needed at this point. More discipline with > >how NVMe's error handling is changed is. > > Please read the above. I agree we'd need a new BLK_STS only if NVMe cannot be made to trap NVME_SC_CMD_INTERRUPTED for local retry _before_ considering path failover. > >If NVMe wants to ensure its > >interface isn't broken regularly it _should_ use blk_path_error() to > >validate future nvme_error_status() changes. Miscategorizing NVMe > >errors to upper layers is a bug -- not open for debate. > > Again, don't agree is a Miscategorization nor a bug, its just something > that is NVMe specific. Right I understand. Think it safe to assume these types of details are why Christoph wanted to avoid the notion of native NVMe and DM multipathing having compatible error handling. There was some wisdom with that position (especially with native NVMe goals in mind). But if things are tweaked slightly then both camps _can_ be made happy. There just needs to be a willingness to work through the details, defensiveness needs to be shed on both sides, so constructive review/consideration of problems can happen. Think that has already happened here with our exchange. I'm open to investing effort here if others are up for humoring my attempt to explore fixing the issues in a mutually beneficial way. What's the worst that can happen? My simple patches might continue to be ignored? 
;) Mike _______________________________________________ Linux-nvme mailing list Linux-nvme@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: nvme: restore use of blk_path_error() in nvme_complete_rq() 2020-08-07 4:50 ` Mike Snitzer @ 2020-08-07 23:35 ` Sagi Grimberg 2020-08-08 21:08 ` Meneghini, John 2020-08-10 14:36 ` Mike Snitzer 0 siblings, 2 replies; 50+ messages in thread From: Sagi Grimberg @ 2020-08-07 23:35 UTC (permalink / raw) To: Mike Snitzer Cc: Hannes Reinecke, linux-nvme, Christoph Hellwig, dm-devel, Ewan Milne, Chao Leng, Keith Busch, Meneghini, John >> Hey Mike, >> >>>> The point is: blk_path_error() has nothing to do with NVMe errors. >>>> This is dm-multipath logic stuck in the middle of the NVMe error >>>> handling code. >>> >>> No, it is a means to have multiple subsystems (to this point both SCSI >>> and NVMe) doing the correct thing when translating subsystem specific >>> error codes to BLK_STS codes. >> >> Not exactly, don't find any use of this in scsi. The purpose is to make >> sure that nvme and dm speak the same language. > > SCSI doesn't need to do additional work to train a multipath layer > because dm-multipath _is_ SCSI's multipathing in Linux. Agree. > So ensuring SCSI properly classifies its error codes happens as a > side-effect of ensuring continued multipath functionality. > > Hannes introduced all these differentiated error codes in block core > because of SCSI. NVMe is meant to build on the infrastructure that was > established. Yes, exactly my point. blk_path_error is designed to make nvme and dm-multipath speak the same language. >>> If you, or others you name drop below, understood the point we wouldn't >>> be having this conversation. You'd accept the point of blk_path_error() >>> to be valid and required codification of what constitutes a retryable >>> path error for the Linux block layer. >> >> This incident is a case where the specific nvme status was designed >> to retry on the same path respecting the controller retry delay. >> And because nvme used blk_path_error at the time it caused us to use a >> non-retryable status to get around that. 
Granted, no one had >> dm-multipath in mind. >> >> So in a sense, there is consensus on changing patch 35038bffa87da >> _because_ nvme no longer uses blk_path_error. Otherwise it would break. > > "break" meaning it would do failover instead of the more optimal local > retry to the same controller. > > I see. Wish the header for commit 35038bffa87da touched on this even a > little bit ;) I think it did, but maybe didn't put too much emphasis on it. > But AFAICT the patch I provided doesn't compromise proper local retry -- > as long as we first fix nvme_error_status() to return a retryable > BLK_STS for NVME_SC_CMD_INTERRUPTED -- which I assumed as a prereq. > > Think of blk_path_error() as a more coarse-grained "is this retryable or > a hard failure?" check. So for NVME_SC_CMD_INTERRUPTED, > nvme_error_status() _should_ respond with something retryable (I'd > prefer BLK_STS_RESOURCE to be honest). But blk_path_error semantically means "is this a pathing error", or at least that's what its name suggests. > And then nvme_failover_req() is finer-grained; by returning false it now > allows short-circuiting failover and reverting back to NVMe's normal > controller based error recovery -- which it does for > NVME_SC_CMD_INTERRUPTED due to "default" case in nvme_failover_req(). > > And then the previous nvme_error_status() classification of retryable > BLK_STS obviously never gets returned up the IO stack; it gets thrown > away. I see what you are saying. The issue is that the code becomes convoluted (it's a pathing error, oh wait, no it's not a pathing error). >>> Any BLK_STS mapping of NVMe specific error codes would need to not screw >>> up by categorizing a retryable error as non-retryable (and vice-versa). >> >> But it is a special type of retryable. There is nothing that fits the >> semantics of the current behavior. > > I agree. But that's fine actually. 
> And this issue is the poster-child for why properly supporting a duality > of driver-level vs upper level multipathing capabilities is pretty much > impossible unless a clean design factors out the different error classes > -- and local error retry is handled before punting to higher level > failover retry. Think if NVMe were to adopt a bit more disciplined > "local then failover" error handling it all gets easier. I don't think punting before is easier, because we do a local retry if: - no multipathing sw on top - request needs retry (e.g. no DNR, notretry is off etc..) - nvme error is not pathing related (nvme_failover_req returned false) > This local retry _is_ NVMe specific. NVMe should just own retrying on > the same controller no matter what (I'll hope that such retry has > awareness to not retry indefinitely though!). It will retry until the retry limit. > And this has nothing to > do with multipathing, so the logic to handle it shouldn't be trapped in > nvme_failover_req(). Well given that nvme_failover_req already may not actually failover this makes some sense to me (although I did have some resistance to make it that way in the first place, but was convinced otherwise). > I think NVMe can easily fix this by having an earlier stage of checking, > e.g. nvme_local_retry_req(), that shortcircuits ever getting to > higher-level multipathing consideration (be it native NVMe or DM > multipathing) for cases like NVME_SC_CMD_INTERRUPTED. > To be clear: the "default" case of nvme_failover_req() that returns > false to fallback to NVMe's "local" normal NVMe error handling -- that > can stay.. but a more explicit handling of cases like > NVME_SC_CMD_INTERRUPTED should be added to a nvme_local_retry_req() > check that happens before nvme_failover_req() in nvme_complete_rq(). I don't necessarily agree with having a dedicated nvme_local_retry_req(). A request that isn't failed over goes to local error handling (retry or not). 
I actually think that just adding the condition to nvme_complete_rq and having nvme_failover_req reject it would work. Keith? >>> Anyway, no new BLK_STS is needed at this point. More discipline with >>> how NVMe's error handling is changed is. >> >> Please read the above. > > I agree we'd need a new BLK_STS only if NVMe cannot be made to trap > NVME_SC_CMD_INTERRUPTED for local retry _before_ considering path > failover. Not sure that is better, but we can see a patch first to determine. >>> If NVMe wants to ensure its >>> interface isn't broken regularly it _should_ use blk_path_error() to >>> validate future nvme_error_status() changes. Miscategorizing NVMe >>> errors to upper layers is a bug -- not open for debate. >> >> Again, I don't agree it's a miscategorization nor a bug, it's just something >> that is NVMe specific. > > Right, I understand. > > Think it safe to assume these types of details are why Christoph wanted > to avoid the notion of native NVMe and DM multipathing having > compatible error handling. There was some wisdom with that position > (especially with native NVMe goals in mind). But if things are tweaked > slightly then both camps _can_ be made happy. > > There just needs to be a willingness to work through the details, > defensiveness needs to be shed on both sides, so constructive > review/consideration of problems can happen. Agreed. > Think that has already > happened here with our exchange. I'm open to investing effort here if > others are up for humoring my attempt to explore fixing the issues in a > mutually beneficial way. What's the worst that can happen? My simple > patches might continue to be ignored? ;) I won't ignore it, and I apologize for ignoring the original patch posted; I guess it flew under the radar... _______________________________________________ Linux-nvme mailing list Linux-nvme@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme ^ permalink raw reply [flat|nested] 50+ messages in thread
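The retry-versus-failover ordering debated above can be modeled in a few lines. The sketch below is an illustrative user-space model of the decision points, not kernel code: the names echo nvme_complete_rq()/nvme_failover_req(), but the outcome enum, the function bodies, and the simplified status handling are my assumptions for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Status layout as the driver stores it (CQE status field >> 1):
 * SC in bits 7:0, SCT in bits 10:8, DNR in bit 14. */
#define NVME_SC_DNR		0x4000
#define NVME_SC_CMD_INTERRUPTED	0x021
#define NVME_SC_ANA_TRANSITION	0x303

enum outcome { COMPLETE_OK, FAILOVER, LOCAL_RETRY, FAIL_HARD };

/* SCT 3h means "Path Related Status" -- the only class for which
 * retrying on a different path can change the result. */
static bool status_is_path_error(unsigned status)
{
	return (status & 0x700) == 0x300;
}

/* Model of the completion flow discussed above: fail over only for
 * path-related status; everything else retryable is handled locally
 * on the same controller, until DNR (or the retry limit) stops it. */
static enum outcome complete_rq_model(unsigned status, bool mpath)
{
	if (status == 0)
		return COMPLETE_OK;
	if (status & NVME_SC_DNR)
		return FAIL_HARD;	/* controller said: do not retry */
	if (mpath && status_is_path_error(status))
		return FAILOVER;	/* the nvme_failover_req() case */
	return LOCAL_RETRY;		/* e.g. NVME_SC_CMD_INTERRUPTED */
}
```

On this model, NVME_SC_CMD_INTERRUPTED never reaches the failover branch at all, which is the effect of "adding the condition to nvme_complete_rq and having nvme_failover_req reject it."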
* Re: nvme: restore use of blk_path_error() in nvme_complete_rq() 2020-08-07 23:35 ` Sagi Grimberg @ 2020-08-08 21:08 ` Meneghini, John 2020-08-08 21:11 ` Meneghini, John 2020-08-10 8:10 ` Chao Leng 2020-08-10 14:36 ` Mike Snitzer 1 sibling, 2 replies; 50+ messages in thread From: Meneghini, John @ 2020-08-08 21:08 UTC (permalink / raw) To: Sagi Grimberg, Mike Snitzer Cc: linux-nvme, Christoph Hellwig, dm-devel, Ewan Milne, Chao Leng, Keith Busch, Meneghini, John, Hannes Reinecke I'd like to up-level this whole conversation for a minute by talking about exactly what ACRE does. The genesis of the changes discussed in this thread is NVMe TP-4033 - Enhanced Command Retry. You can find a copy of this TP here: http://nvmexpress.org/wp-content/uploads/NVM-Express-1.3-Ratified-TPs.zip This technical proposal added a command retry delay feature which is programmed by the controller. The controller advertises a set of 3 different timing delays through the Identify Controller data structure CRDT{1-3} fields. To make use of these delay fields a new CRD field was added to the CQE Status Field. This allows the NVMe controller to specify exactly how long a command retry should be delayed, with 3 possible timers that it chooses and controls. CRDTs can range from 100 milliseconds to 6553.5 seconds. Because this capability can have such a radical effect on backwards compatibility a new NVMe Feature Identifier was added (Host Behavior Support - Feature ID 16h) with an Advanced Command Retry Enable (ACRE) bit. This allows the host to enable or disable the feature. With this background, there are a couple of misconceptions in this thread which I'd like to address. The first is: ACRE has nothing to do with the NVME_SC_CMD_INTERRUPTED status. Yes, this new error status was added as a part of TP-4033, but the CRD field of the CQE status can be set by the controller with *any* NVMe error status. As long as the DNR bit is not set, the Command Retry Delay can come into effect. 
This is how the spec is written and this is exactly how it has been implemented in the core nvme_complete_rq() function (after change 507fe46ac91276120). For example, the controller can return NVME_SC_NS_NOT_READY with a CRDT of 2 seconds.^ So CRDT needs to be supported with all error statuses if the host is going to enable ACRE, and I think it's a big mistake to get hung up over the NVME_SC_CMD_INTERRUPTED status translation. The NVME_SC_CMD_INTERRUPTED status was only added to the spec to provide a general-purpose "busy" status, something that was missing from NVMe, and focusing your solution on NVME_SC_CMD_INTERRUPTED, or any other specific NVMe error status, is the wrong thing to do. There is a much larger change in error semantics going on with ACRE than just this single error. The second is: All NVMe error statuses that do not have a Status Code Type of 3h (Path Related Status) are subsystem scoped. This is a topic that has gone through some debate on the linux-nvme mailing list and at NVMexpress.org, and there have been some ECNs to the spec to address this. There may be some exceptions to this rule because there are always implementations out there that may not follow it, and there are bugs in the spec. However, this is the intention of the NVMe spec and it matters. This means that, in a multi-pathing environment, retrying any command on a different path will not provide a different result. Retries should all occur on the same controller - unless it is a path related status. This is how NVMe error semantics work and this is a part of what was behind Keith's patch: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=35038bffa87da Retrying NVME_SC_NS_NOT_READY or NVME_SC_CMD_INTERRUPTED on another path is simply not the right thing to do, and returning BLK_STS_TARGET after all command retries, with CRDT, have been exhausted communicates the right thing to the upper layer. 
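The CRD-to-delay mapping described here is small enough to show concretely. Below is a sketch, assuming the driver's convention of storing the CQE status field shifted right by one (which puts CRD in bits 12:11) and CRDT values in units of 100 ms; it mirrors the logic of the kernel's retry path but is not a verbatim copy, and the helper name is mine.

```c
#include <assert.h>

#define NVME_SC_CRD	0x1800	/* CRD field: bits 12:11 of the status */

/*
 * Translate a completion's CRD field into a retry delay in
 * milliseconds. crdt[] holds the three CRDT values the controller
 * reported in Identify Controller, in units of 100 ms; a CRD of 0
 * means no delay was specified.
 */
static unsigned long nvme_retry_delay_ms(unsigned short status,
					 const unsigned short crdt[3])
{
	unsigned short crd = (status & NVME_SC_CRD) >> 11;

	return crd ? (unsigned long)crdt[crd - 1] * 100 : 0;
}
```

The 2-second NVME_SC_NS_NOT_READY example above corresponds to a CRDT value of 20 (20 x 100 ms); the 16-bit maximum of 65535 yields 6553.5 seconds.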
From the perspective of nvme-multipathing Keith's patch was exactly the correct thing to do. I understand that this may have caused a backwards compatibility problem with dm-multipath, and that's the only reason why I've talked about backing it out. However, ultimately, I think nvme-core should return an error status like BLK_STS_TARGET that says, semantically - the IO has failed, no retry will work - because this is what the NVMe error semantics are. Taken together, both of these facts about the NVMe protocol semantics are what's behind my patch which removed blk_path_error() from nvme_complete_rq(): https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=764e9332098c0e60251386a507fe46ac91276120 I understand that there is a goal to try and avoid having different failure/recovery handling semantically in response to different error status between nvme-multipath and dm-multipath, but NVMe error semantics are truly different from SCSI error semantics, and they are changing. The Linux host needs to enable and support those changes unhampered by the past. With this goal in mind, removing the blk_path_error() code from nvme-core was the right thing to do. Hannes and I struggled with the patch to try and make it work with blk_path_error() for weeks. As pointed out in the thread below, blk_path_error() is the SCSI multipathing logic and we can't use it in nvme_complete_rq(). All it does is import all of the legacy problems of dm-multipath, and of SCSI, into the nvme completion/multipath logic. At NVMexpress.org we consciously added the ACRE feature because the SCSI protocol had no such capability. This is something which has plagued SCSI implementations for years, and all kinds of tricks have been played, in both the SCSI host stack and the SCSI target stack, to deal with the problem. The goal of NVMe is to create a better block storage protocol, and ACRE is just one example of many places where the industry is trying to do this. 
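Since ACRE is opt-in per host, turning it on comes down to one Set Features call against the Host Behavior Support feature (FID 16h). The sketch below just builds the 512-byte payload; the structure layout follows the spec's description (byte 0 = ACRE), but the helper name is mine, not the kernel's, and issuing the actual command is left out.

```c
#include <assert.h>
#include <string.h>

#define NVME_FEAT_HOST_BEHAVIOR	0x16

/* Host Behavior Support data structure: 512 bytes, byte 0 = ACRE. */
struct nvme_feat_host_behavior {
	unsigned char acre;
	unsigned char resv[511];
};

/*
 * Fill the Set Features payload that enables or disables ACRE.
 * Returns the feature identifier to place in CDW10. A host that
 * never issues this (or issues it with enable == 0) keeps the
 * pre-ACRE error semantics.
 */
static int nvme_build_acre_payload(struct nvme_feat_host_behavior *host,
				   int enable)
{
	memset(host, 0, sizeof(*host));
	host->acre = enable ? 1 : 0;
	return NVME_FEAT_HOST_BEHAVIOR;
}
```

This feature is also the knob behind the "just disable ACRE" option raised later in the thread: clearing the bit with the same call reverts the controller's retry-delay behavior.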
There are plans to introduce more Host Behavior Support features in the future. In the end, we are consciously changing NVMe, both in the spec and in its implementation, to make it different from SCSI. I think this is what's at the bottom of the changes discussed in this thread, and this is why so many people are so passionate about this. We don't want to turn NVMe into SCSI. I know I don't want to. /John ^Note: (maybe a 2-second delay sounds unreasonable for a flash storage device, but there are implementors who want to use NVMe with spinning disks... so the CRDT gives them exactly what they need). On 8/7/20, 7:35 PM, "Sagi Grimberg" <sagi@grimberg.me> wrote: >> Hey Mike, >> >>>> The point is: blk_path_error() has nothing to do with NVMe errors. >>>> This is dm-multipath logic stuck in the middle of the NVMe error >>>> handling code. >>> >>> No, it is a means to have multiple subsystems (to this point both SCSI >>> and NVMe) doing the correct thing when translating subsystem specific >>> error codes to BLK_STS codes. >> >> Not exactly, don't find any use of this in scsi. The purpose is to make >> sure that nvme and dm speak the same language. > > SCSI doesn't need to do additional work to train a multipath layer > because dm-multipath _is_ SCSI's multipathing in Linux. Agree. > So ensuring SCSI properly classifies its error codes happens as a > side-effect of ensuring continued multipath functionality. > > Hannes introduced all these differentiated error codes in block core > because of SCSI. NVMe is meant to build on the infrastructure that was > established. Yes, exactly my point. blk_path_error is designed to make nvme and dm-multipath speak the same language. >>> If you, or others you name drop below, understood the point we wouldn't >>> be having this conversation. You'd accept the point of blk_path_error() >>> to be valid and required codification of what constitutes a retryable >>> path error for the Linux block layer. 
>> >> This incident is a case where the specific nvme status was designed >> to retry on the same path respecting the controller retry delay. >> And because nvme used blk_path_error at the time it caused us to use a >> non-retryable status to get around that. Granted, no one had >> dm-multipath in mind. >> >> So in a sense, there is consensus on changing patch 35038bffa87da >> _because_ nvme no longer uses blk_path_error. Otherwise it would break. > > "break" meaning it would do failover instead of the more optimal local > retry to the same controller. > > I see. Wish the header for commit 35038bffa87da touched on this even a > little bit ;) I think it did, but maybe didn't put too much emphasis on it. > But AFAICT the patch I provided doesn't compromise proper local retry -- > as long as we first fix nvme_error_status() to return a retryable > BLK_STS for NVME_SC_CMD_INTERRUPTED -- which I assumed as a prereq. > > Think of blk_path_error() as a more coarse-grained "is this retryable or > a hard failure?" check. So for NVME_SC_CMD_INTERRUPTED, > nvme_error_status() _should_ respond with something retryable (I'd > prefer BLK_STS_RESOURCE to be honest). But blk_path_error semantically mean "is this a pathing error", or at least that what its name suggest. > And then nvme_failover_req() is finer-grained; by returning false it now > allows short-circuiting failover and reverting back to NVMe's normal > controller based error recovery -- which it does for > NVME_SC_CMD_INTERRUPTED due to "default" case in nvme_failover_req(). > > And then the previous nvme_error_status() classification of retryable > BLK_STS obviously never gets returned up the IO stack; it gets thrown > away. I see what you are saying. The issue is that the code becomes convoluted (it's a pathing error, oh wait, no its not a pathing error). >>> Any BLK_STS mapping of NVMe specific error codes would need to not screw >>> up by categorizing a retryable error as non-retryable (and vice-versa). 
>> >> But it is a special type of retryable. There is nothing that fits the >> semantics of the current behavior. > > I agree. But that's fine actually. > > And this issue is the poster-child for why properly supporting a duality > of driver-level vs upper level multipathing capabilities is pretty much > impossible unless a clean design factors out the different error classes > -- and local error retry is handled before punting to higher level > failover retry. Think if NVMe were to adopt a bit more disciplined > "local then failover" error handling it all gets easier. I don't think punting before is easier, because we do a local retry if: - no multipathing sw on top - request needs retry (e.g. no DNR, notretry is off etc..) - nvme error is not pathing related (nvme_failover_req returned false) > This local retry _is_ NVMe specific. NVMe should just own retrying on > the same controller no matter what (I'll hope that such retry has > awareness to not retry indefinitely though!). it will retry until the retry limit. > And this has nothing to > do with multipathing, so the logic to handle it shouldn't be trapped in > nvme_failover_req(). Well given that nvme_failover_req already may not actually failover this makes some sense to me (although I did have some resistance to make it that way in the first place, but was convinced otherwise). > I think NVMe can easily fix this by having an earlier stage of checking, > e.g. nvme_local_retry_req(), that shortcircuits ever getting to > higher-level multipathing consideration (be it native NVMe or DM > multipathing) for cases like NVME_SC_CMD_INTERRUPTED. > To be clear: the "default" case of nvme_failover_req() that returns > false to fallback to NVMe's "local" normal NVMe error handling -- that > can stay.. but a more explicit handling of cases like > NVME_SC_CMD_INTERRUPTED should be added to a nvme_local_retry_req() > check that happens before nvme_failover_req() in nvme_complete_rq(). 
I don't necessarily agree with having a dedicated nvme_local_retry_req(). a request that isn't failed over, goes to local error handling (retry or not). I actually think that just adding the condition to nvme_complete_req and having nvme_failover_req reject it would work. Keith? >>> Anyway, no new BLK_STS is needed at this point. More discipline with >>> how NVMe's error handling is changed is. >> >> Please read the above. > > I agree we'd need a new BLK_STS only if NVMe cannot be made to trap > NVME_SC_CMD_INTERRUPTED for local retry _before_ considering path > failover. Not sure that is better, but we can see a patch first to determine. >>> If NVMe wants to ensure its >>> interface isn't broken regularly it _should_ use blk_path_error() to >>> validate future nvme_error_status() changes. Miscategorizing NVMe >>> errors to upper layers is a bug -- not open for debate. >> >> Again, don't agree is a Miscategorization nor a bug, its just something >> that is NVMe specific. > > Right I understand. > > Think it safe to assume these types of details are why Christoph wanted > to avoid the notion of native NVMe and DM multipathing having > compatible error handling. There was some wisdom with that position > (especially with native NVMe goals in mind). But if things are tweaked > slightly then both camps _can_ be made happy. > > There just needs to be a willingness to work through the details, > defensiveness needs to be shed on both sides, so constructive > review/consideration of problems can happen. Agreed. > Think that has already > happened here with our exchange. I'm open to investing effort here if > others are up for humoring my attempt to explore fixing the issues in a > mutually beneficial way. What's the worst that can happen? My simple > patches might continue to be ignored? ;) I won't ignore it, and I apologize of ignoring the original patch posted, I guess it flew under the radar... 
* Re: nvme: restore use of blk_path_error() in nvme_complete_rq() 2020-08-08 21:08 ` Meneghini, John @ 2020-08-08 21:11 ` Meneghini, John 2020-08-10 14:48 ` Mike Snitzer 2020-08-10 8:10 ` Chao Leng 1 sibling, 1 reply; 50+ messages in thread From: Meneghini, John @ 2020-08-08 21:11 UTC (permalink / raw) To: Sagi Grimberg, Mike Snitzer Cc: linux-nvme, Christoph Hellwig, dm-devel, Ewan Milne, Chao Leng, Keith Busch, Meneghini, John, Hannes Reinecke Having said all of this, I think one thing which could be done to help support NVMe with DMP is to approximate any new NVMe error semantics by adding new BLK_STS codes. Reusing or overloading existing BLK_STS may not work well because everyone else already has a claim on those semantics. I would suggest adding new BLK_STS to the nvme_error_status() routine. I think this is the one place where NVMe status to BLK_STS translation belongs, and changes can be made there without impacting any of the NVMe error handling logic in nvme-core - now that we've removed blk_path_error() from nvme_complete_rq(). Another choice would be to simply disable ACRE. This won't solve the problem of changing NVMe error semantics with dm-multipath but it can certainly simplify the problem. Maybe a patch that disables ACRE when REQ_NVME_MPATH is clear. Or you can always just turn the feature off with the nvme-cli. I'll be sure to keep an eye on the Host Behavior Support feature at NVMexpress.org. We created this new nvme feature control mechanism to help with problems like this and I'll be sure to watch out for new protocol features which could break compatibility with dm-multipath in the future. /John On 8/8/20, 5:08 PM, "Meneghini, John" <John.Meneghini@netapp.com> wrote: I'd like to up level this whole conversation for a minute by talking about exactly what ACRE does. The genesis of the changes discussed in this thread is NVMe TP-4033 - Enhanced Command Retry. 
You can find a copy of this TP here: http://nvmexpress.org/wp-content/uploads/NVM-Express-1.3-Ratified-TPs.zip This technical proposal added a command retry delay feature which is programmed by the controller. The controller advertises a set of 3 different timing delays though the Identify Controller data structure CRDT{1-2} fields. To make use of these delay fields a new CRD field was added to the CQE Status Field. This allows the NVMe controller to specify exactly how long a command retry should be delayed, with 3 possible timers that it chooses and controls. CRDTs can range from 100 milliseconds to 6559 seconds. Because this capability can have such a radical effect on backwards compatibility a new NVMe Feature Identifier was added (Host Behavior Support - Feature ID 16h) with an Advanced Command Retry Enable (ACRE) bit. This allows the host to enable or disable the feature. With this background there are a couple of misconceptions in this thread which I'd like to address. The first is: ACRE has nothing to do with the NVME_SC_CMD_INTERRUPTED status. Yes, this new error status was added as a part of TP-4033 but the CRD field of the CQE status can be set by the controller with *any* NVMe error status. As long as the DNR bit is not set the Command Retry Delay can come into effect. This is how the spec is written and this is exactly how it has been implemented in the core nvme_complete_rq() function (after change 507fe46ac91276120). For example, the controller can return NVME_SC_NS_NOT_READY with a CRDT of 2 seconds.^ So CDRT needs to be supported with all error status if the host is going to enable ACRE, and I think it's a big mistake to get hung up over the NVME_SC_CMD_INTERRUPTED status translation. The NVME_SC_CMD_INTERRUPTED status was only added to the spec to provide a general purpose "busy" status, something that was missing from NVMe, and focusing your solution on NVME_SC_CMD_INTERRUPTED, or any other specific NVMe error status, is the wrong thing to do. 
There is a much larger change in error semantics going on with ACRE than just this single error. The second is: All NVMe error status that do not have a Status Code Type of 3h (Path Related Status) are subsystem scoped. This is a topic that has gone through some debate on the linux-nvme mailing list and at NVMexpress.org; and there have been some ECNs to the spec to address this. There may be some exceptions to this rule because there are always implementations out there that may not follow, and there are bugs in the spec. However, this is the intention of the NVMe spec and it matters. This means that, in a multi-pathing environment, retrying any command on a different path will not provide a different result. Retries should all occur on the same controller - unless it is a path related status. This is how NVMe error semantics work and this is a part of what was behind Keith's patch . https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=35038bffa87da Retrying NVME_SC_NS_NOT_READY or NVME_SC_CMD_INTERRUPTED on another path is simply not the right thing to do, and returning BLK_STS_TARGET after all command retries, with CRDT, have been exhausted communicates the right thing to the upper layer. From the perspective of nvme-multipathing Keith's patch was exactly the correct thing to do. I understand that this may have caused a backwards compatibly problem with dm-multipath, and that's the only reason why I've talked about backing it out. However, ultimately, I think nvme-core should return an error status like BLK_STS_TARGET that says, semantically - the IO has failed, no retry will work - because this is what the NVMe error semantics are. 
Taken together both of these facts about the NVMe protocol semantics are what's behind my patch which removed blk_path_error() from nvme_complete_rq() https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=764e9332098c0e60251386a507fe46ac91276120 I understand that there is a goal to try and avoid having different failure/recovery handling semantically in response to different error status between nvme-multipath and dm-multipath, but NVMe error semantics are truly different from SCSI error semantics, and they are changing. The Linux host needs to enable and support those changes unhampered by the past. With this goal in mind, removing the blk_path_error() code from nvme-core was the right thing to do. Hannes and I struggled with the patch to try and make it work with blk_path_error() for weeks. As pointed out in the thread below, blk_path_error() is the SCSI multipathing logic and we can't use it in nvme_complete_rq(). All it does is import all of the legacy problems of dm-multipath, and of SCSI, into the nvme completion/multipath logic. At NVMexpress.org we consciously added the ACRE feature because the SCSI protocol had no such capability. This is something which has plagued SCSI implementations for years, and all kinds of tricks have been played, in both the SCSI host stack and in SCSI target stack, to deal with the problem. The goal of NVMe is to create a better block storage protocol and ACRE is just one example of many places where the industry is trying to do this. There are plans to introduce more Host Behavior Support features in the future. In the end, we are consciously changing NVMe, both in the spec and in its implementation, to make it different from SCSI. I think this is what's at the bottom of the changes discussed in this thread, and this is why so many people are so passionate about this. We don't want to turn NVMe into SCSI. I know I don't want to. 
/John ^Note: (maybe a 2 second delay sounds unreasonable for a flash storage device but there implementors that that want to use NVMe with spinning disks... so the CDRT gives them exactly what they need). On 8/7/20, 7:35 PM, "Sagi Grimberg" <sagi@grimberg.me> wrote: >> Hey Mike, >> >>>> The point is: blk_path_error() has nothing to do with NVMe errors. >>>> This is dm-multipath logic stuck in the middle of the NVMe error >>>> handling code. >>> >>> No, it is a means to have multiple subsystems (to this point both SCSI >>> and NVMe) doing the correct thing when translating subsystem specific >>> error codes to BLK_STS codes. >> >> Not exactly, don't find any use of this in scsi. The purpose is to make >> sure that nvme and dm speak the same language. > > SCSI doesn't need to do additional work to train a multipath layer > because dm-multipath _is_ SCSI's multipathing in Linux. Agree. > So ensuring SCSI properly classifies its error codes happens as a > side-effect of ensuring continued multipath functionality. > > Hannes introduced all these differentiated error codes in block core > because of SCSI. NVMe is meant to build on the infrastructure that was > established. Yes, exactly my point. blk_path_error is designed to make nvme and dm-multipath speak the same language. >>> If you, or others you name drop below, understood the point we wouldn't >>> be having this conversation. You'd accept the point of blk_path_error() >>> to be valid and required codification of what constitutes a retryable >>> path error for the Linux block layer. >> >> This incident is a case where the specific nvme status was designed >> to retry on the same path respecting the controller retry delay. >> And because nvme used blk_path_error at the time it caused us to use a >> non-retryable status to get around that. Granted, no one had >> dm-multipath in mind. >> >> So in a sense, there is consensus on changing patch 35038bffa87da >> _because_ nvme no longer uses blk_path_error. 
Otherwise it would break. > > "break" meaning it would do failover instead of the more optimal local > retry to the same controller. > > I see. Wish the header for commit 35038bffa87da touched on this even a > little bit ;) I think it did, but maybe didn't put too much emphasis on it. > But AFAICT the patch I provided doesn't compromise proper local retry -- > as long as we first fix nvme_error_status() to return a retryable > BLK_STS for NVME_SC_CMD_INTERRUPTED -- which I assumed as a prereq. > > Think of blk_path_error() as a more coarse-grained "is this retryable or > a hard failure?" check. So for NVME_SC_CMD_INTERRUPTED, > nvme_error_status() _should_ respond with something retryable (I'd > prefer BLK_STS_RESOURCE to be honest). But blk_path_error semantically mean "is this a pathing error", or at least that what its name suggest. > And then nvme_failover_req() is finer-grained; by returning false it now > allows short-circuiting failover and reverting back to NVMe's normal > controller based error recovery -- which it does for > NVME_SC_CMD_INTERRUPTED due to "default" case in nvme_failover_req(). > > And then the previous nvme_error_status() classification of retryable > BLK_STS obviously never gets returned up the IO stack; it gets thrown > away. I see what you are saying. The issue is that the code becomes convoluted (it's a pathing error, oh wait, no its not a pathing error). >>> Any BLK_STS mapping of NVMe specific error codes would need to not screw >>> up by categorizing a retryable error as non-retryable (and vice-versa). >> >> But it is a special type of retryable. There is nothing that fits the >> semantics of the current behavior. > > I agree. But that's fine actually. 
> > And this issue is the poster-child for why properly supporting a duality > of driver-level vs upper level multipathing capabilities is pretty much > impossible unless a clean design factors out the different error classes > -- and local error retry is handled before punting to higher level > failover retry. Think if NVMe were to adopt a bit more disciplined > "local then failover" error handling it all gets easier. I don't think punting before is easier, because we do a local retry if: - no multipathing sw on top - request needs retry (e.g. no DNR, notretry is off etc..) - nvme error is not pathing related (nvme_failover_req returned false) > This local retry _is_ NVMe specific. NVMe should just own retrying on > the same controller no matter what (I'll hope that such retry has > awareness to not retry indefinitely though!). it will retry until the retry limit. > And this has nothing to > do with multipathing, so the logic to handle it shouldn't be trapped in > nvme_failover_req(). Well given that nvme_failover_req already may not actually failover this makes some sense to me (although I did have some resistance to make it that way in the first place, but was convinced otherwise). > I think NVMe can easily fix this by having an earlier stage of checking, > e.g. nvme_local_retry_req(), that shortcircuits ever getting to > higher-level multipathing consideration (be it native NVMe or DM > multipathing) for cases like NVME_SC_CMD_INTERRUPTED. > To be clear: the "default" case of nvme_failover_req() that returns > false to fallback to NVMe's "local" normal NVMe error handling -- that > can stay.. but a more explicit handling of cases like > NVME_SC_CMD_INTERRUPTED should be added to a nvme_local_retry_req() > check that happens before nvme_failover_req() in nvme_complete_rq(). I don't necessarily agree with having a dedicated nvme_local_retry_req(). a request that isn't failed over, goes to local error handling (retry or not). 
I actually think that just adding the condition to nvme_complete_req and having nvme_failover_req reject it would work. Keith? >>> Anyway, no new BLK_STS is needed at this point. More discipline with >>> how NVMe's error handling is changed is. >> >> Please read the above. > > I agree we'd need a new BLK_STS only if NVMe cannot be made to trap > NVME_SC_CMD_INTERRUPTED for local retry _before_ considering path > failover. Not sure that is better, but we can see a patch first to determine. >>> If NVMe wants to ensure its >>> interface isn't broken regularly it _should_ use blk_path_error() to >>> validate future nvme_error_status() changes. Miscategorizing NVMe >>> errors to upper layers is a bug -- not open for debate. >> >> Again, don't agree is a Miscategorization nor a bug, its just something >> that is NVMe specific. > > Right I understand. > > Think it safe to assume these types of details are why Christoph wanted > to avoid the notion of native NVMe and DM multipathing having > compatible error handling. There was some wisdom with that position > (especially with native NVMe goals in mind). But if things are tweaked > slightly then both camps _can_ be made happy. > > There just needs to be a willingness to work through the details, > defensiveness needs to be shed on both sides, so constructive > review/consideration of problems can happen. Agreed. > Think that has already > happened here with our exchange. I'm open to investing effort here if > others are up for humoring my attempt to explore fixing the issues in a > mutually beneficial way. What's the worst that can happen? My simple > patches might continue to be ignored? ;) I won't ignore it, and I apologize of ignoring the original patch posted, I guess it flew under the radar... _______________________________________________ Linux-nvme mailing list Linux-nvme@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: nvme: restore use of blk_path_error() in nvme_complete_rq() 2020-08-08 21:11 ` Meneghini, John @ 2020-08-10 14:48 ` Mike Snitzer 2020-08-11 12:54 ` Meneghini, John 0 siblings, 1 reply; 50+ messages in thread From: Mike Snitzer @ 2020-08-10 14:48 UTC (permalink / raw) To: Meneghini, John Cc: Sagi Grimberg, linux-nvme, Christoph Hellwig, dm-devel, Ewan Milne, Chao Leng, Keith Busch, Hannes Reinecke Hi John, I appreciate your insight about ACRE. And that NVME_SC_CMD_INTERRUPTED is really just a preview for the kind of NVMe error handling advances to come. But, because I started harping about blk_path_error() I got you hyper-focused on it, and that led to all your thinking about me wanting to somehow normalize both SCSI and NVMe. Couldn't be further from my thinking. I've always wanted Linux's NVMe driver to do whatever it needed to handle the evolution of NVMe. I just really need it to do it in a way that respects NVMe's spec for ANA, ACRE and the like, and in a way that doesn't leave NVMe error handling broken if native NVMe multipath isn't used. The good news is I still don't see a need to model NVMe error handling with BLK_STS et al. But AFAICT the NVMe error handling needs some fixing/cleanup. To be continued, with patches so we can collectively have a more productive discussion. Thanks, Mike On Sat, Aug 08 2020 at 5:11pm -0400, Meneghini, John <John.Meneghini@netapp.com> wrote: > Having said all of this, I think one thing which could be done to help > support NVMe with DMP is to approximate any new NVMe error semantics > by adding new BLK_STS codes. Reusing or overloading existing BLK_STS > may not work well because everyone else already has a claim on those > semantics. I would suggest adding new BLK_STS to the > nvme_error_status() routine. 
I think this is the one place where NVMe > status to BLK_STS translation belongs, and changes can be made there > without impacting any of the NVMe error handling logic in nvme-core - > now that we've removed blk_path_error() from nvme_complete_rq(). > > Another choice would be to simply disable ACRE. This won't solve the > problem of changing NVMe error semantics with dm-multipath but it can > certainly simplify the problem. Maybe a patch that disables ACRE when > REQ_NVME_MPATH is clear. Or you can always just turn the feature off > with the nvme-cli. I'll be sure to keep an eye on the Host Behavior > Support feature at NVMexpress.org. We created this new nvme feature > control mechanism to help with problems like this and I'll be sure to > watch out for new protocol features which could break compatibility > with dm-multipath in the future. > > /John > > On 8/8/20, 5:08 PM, "Meneghini, John" <John.Meneghini@netapp.com> wrote: > > I'd like to up level this whole conversation for a minute by > talking about exactly what ACRE does. > > The genesis of the changes discussed in this thread is NVMe > TP-4033 - Enhanced Command Retry. You can find a copy of this TP > here: > > http://nvmexpress.org/wp-content/uploads/NVM-Express-1.3-Ratified-TPs.zip > > This technical proposal added a command retry delay feature which > is programmed by the controller. The controller advertises a set of 3 > different timing delays through the Identify Controller data structure > CRDT{1-3} fields. To make use of these delay fields a new CRD field > was added to the CQE Status Field. This allows the NVMe controller to > specify exactly how long a command retry should be delayed, with 3 > possible timers that it chooses and controls. CRDTs can range from > 100 milliseconds to 6553.5 seconds.
Because this capability can have > such a radical effect on backwards compatibility a new NVMe Feature > Identifier was added (Host Behavior Support - Feature ID 16h) with an > Advanced Command Retry Enable (ACRE) bit. This allows the host to > enable or disable the feature. > > With this background there are a couple of misconceptions in this > thread which I'd like to address. > > The first is: ACRE has nothing to do with the > NVME_SC_CMD_INTERRUPTED status. Yes, this new error status was added > as a part of TP-4033 but the CRD field of the CQE status can be set by > the controller with *any* NVMe error status. As long as the DNR bit is > not set the Command Retry Delay can come into effect. This is how the > spec is written and this is exactly how it has been implemented in the > core nvme_complete_rq() function (after change 507fe46ac91276120). For > example, the controller can return NVME_SC_NS_NOT_READY with a CRDT of > 2 seconds.^ So CDRT needs to be supported with all error status if > the host is going to enable ACRE, and I think it's a big mistake to > get hung up over the NVME_SC_CMD_INTERRUPTED status translation. The > NVME_SC_CMD_INTERRUPTED status was only added to the spec to provide a > general purpose "busy" status, something that was missing from NVMe, > and focusing your solution on NVME_SC_CMD_INTERRUPTED, or any other > specific NVMe error status, is the wrong thing to do. There is a much > larger change in error semantics going on with ACRE than just this > single error. > > The second is: All NVMe error status that do not have a Status > Code Type of 3h (Path Related Status) are subsystem scoped. This is a > topic that has gone through some debate on the linux-nvme mailing list > and at NVMexpress.org; and there have been some ECNs to the spec to > address this. There may be some exceptions to this rule because there > are always implementations out there that may not follow, and there > are bugs in the spec. 
However, this is the intention of the NVMe spec > and it matters. This means that, in a multi-pathing environment, > retrying any command on a different path will not provide a different > result. Retries should all occur on the same controller - unless it is > a path related status. This is how NVMe error semantics work and this > is a part of what was behind Keith's patch . > > https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=35038bffa87da > > Retrying NVME_SC_NS_NOT_READY or NVME_SC_CMD_INTERRUPTED on > another path is simply not the right thing to do, and returning > BLK_STS_TARGET after all command retries, with CRDT, have been > exhausted communicates the right thing to the upper layer. From the > perspective of nvme-multipathing Keith's patch was exactly the correct > thing to do. I understand that this may have caused a backwards > compatibly problem with dm-multipath, and that's the only reason why > I've talked about backing it out. However, ultimately, I think > nvme-core should return an error status like BLK_STS_TARGET that > says, semantically - the IO has failed, no retry will work - because > this is what the NVMe error semantics are. > > Taken together both of these facts about the NVMe protocol > semantics are what's behind my patch which removed blk_path_error() > from nvme_complete_rq() > > https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=764e9332098c0e60251386a507fe46ac91276120 > > I understand that there is a goal to try and avoid having > different failure/recovery handling semantically in response to > different error status between nvme-multipath and dm-multipath, but > NVMe error semantics are truly different from SCSI error semantics, > and they are changing. The Linux host needs to enable and support > those changes unhampered by the past. With this goal in mind, removing > the blk_path_error() code from nvme-core was the right thing to do. 
> Hannes and I struggled with the patch to try and make it work with > blk_path_error() for weeks. As pointed out in the thread below, > blk_path_error() is the SCSI multipathing logic and we can't use it in > nvme_complete_rq(). All it does is import all of the legacy problems > of dm-multipath, and of SCSI, into the nvme completion/multipath > logic. > > At NVMexpress.org we consciously added the ACRE feature because > the SCSI protocol had no such capability. This is something which has > plagued SCSI implementations for years, and all kinds of tricks have > been played, in both the SCSI host stack and in SCSI target stack, to > deal with the problem. The goal of NVMe is to create a better block > storage protocol and ACRE is just one example of many places where the > industry is trying to do this. There are plans to introduce more Host > Behavior Support features in the future. > > In the end, we are consciously changing NVMe, both in the spec and > in its implementation, to make it different from SCSI. I think this is > what's at the bottom of the changes discussed in this thread, and this > is why so many people are so passionate about this. We don't want to > turn NVMe into SCSI. I know I don't want to. > > /John > > ^Note: (maybe a 2 second delay sounds unreasonable for a flash storage device but there implementors that that want to use NVMe with spinning disks... so the CDRT gives them exactly what they need). > > On 8/7/20, 7:35 PM, "Sagi Grimberg" <sagi@grimberg.me> wrote: > > >> Hey Mike, > >> > >>>> The point is: blk_path_error() has nothing to do with NVMe errors. > >>>> This is dm-multipath logic stuck in the middle of the NVMe error > >>>> handling code. > >>> > >>> No, it is a means to have multiple subsystems (to this point both SCSI > >>> and NVMe) doing the correct thing when translating subsystem specific > >>> error codes to BLK_STS codes. > >> > >> Not exactly, don't find any use of this in scsi. 
The purpose is to make > >> sure that nvme and dm speak the same language. > > > > SCSI doesn't need to do additional work to train a multipath layer > > because dm-multipath _is_ SCSI's multipathing in Linux. > > Agree. > > > So ensuring SCSI properly classifies its error codes happens as a > > side-effect of ensuring continued multipath functionality. > > > > Hannes introduced all these differentiated error codes in block core > > because of SCSI. NVMe is meant to build on the infrastructure that was > > established. > > Yes, exactly my point. blk_path_error is designed to make nvme and > dm-multipath speak the same language. > > >>> If you, or others you name drop below, understood the point we wouldn't > >>> be having this conversation. You'd accept the point of blk_path_error() > >>> to be valid and required codification of what constitutes a retryable > >>> path error for the Linux block layer. > >> > >> This incident is a case where the specific nvme status was designed > >> to retry on the same path respecting the controller retry delay. > >> And because nvme used blk_path_error at the time it caused us to use a > >> non-retryable status to get around that. Granted, no one had > >> dm-multipath in mind. > >> > >> So in a sense, there is consensus on changing patch 35038bffa87da > >> _because_ nvme no longer uses blk_path_error. Otherwise it would break. > > > > "break" meaning it would do failover instead of the more optimal local > > retry to the same controller. > > > > I see. Wish the header for commit 35038bffa87da touched on this even a > > little bit ;) > > I think it did, but maybe didn't put too much emphasis on it. > > > But AFAICT the patch I provided doesn't compromise proper local retry -- > > as long as we first fix nvme_error_status() to return a retryable > > BLK_STS for NVME_SC_CMD_INTERRUPTED -- which I assumed as a prereq. > > > > Think of blk_path_error() as a more coarse-grained "is this retryable or > > a hard failure?" check. 
So for NVME_SC_CMD_INTERRUPTED, > > nvme_error_status() _should_ respond with something retryable (I'd > > prefer BLK_STS_RESOURCE to be honest). > > But blk_path_error semantically mean "is this a pathing error", or at > least that what its name suggest. > > > And then nvme_failover_req() is finer-grained; by returning false it now > > allows short-circuiting failover and reverting back to NVMe's normal > > controller based error recovery -- which it does for > > NVME_SC_CMD_INTERRUPTED due to "default" case in nvme_failover_req(). > > > > And then the previous nvme_error_status() classification of retryable > > BLK_STS obviously never gets returned up the IO stack; it gets thrown > > away. > > I see what you are saying. The issue is that the code becomes > convoluted (it's a pathing error, oh wait, no its not a pathing error). > > >>> Any BLK_STS mapping of NVMe specific error codes would need to not screw > >>> up by categorizing a retryable error as non-retryable (and vice-versa). > >> > >> But it is a special type of retryable. There is nothing that fits the > >> semantics of the current behavior. > > > > I agree. But that's fine actually. > > > > And this issue is the poster-child for why properly supporting a duality > > of driver-level vs upper level multipathing capabilities is pretty much > > impossible unless a clean design factors out the different error classes > > -- and local error retry is handled before punting to higher level > > failover retry. Think if NVMe were to adopt a bit more disciplined > > "local then failover" error handling it all gets easier. > > I don't think punting before is easier, because we do a local retry if: > - no multipathing sw on top > - request needs retry (e.g. no DNR, notretry is off etc..) > - nvme error is not pathing related (nvme_failover_req returned false) > > > This local retry _is_ NVMe specific. 
NVMe should just own retrying on > > the same controller no matter what (I'll hope that such retry has > > awareness to not retry indefinitely though!). > > it will retry until the retry limit. > > > And this has nothing to > > do with multipathing, so the logic to handle it shouldn't be trapped in > > nvme_failover_req(). > > Well given that nvme_failover_req already may not actually failover this > makes some sense to me (although I did have some resistance to make it > that way in the first place, but was convinced otherwise). > > > I think NVMe can easily fix this by having an earlier stage of checking, > > e.g. nvme_local_retry_req(), that shortcircuits ever getting to > > higher-level multipathing consideration (be it native NVMe or DM > > multipathing) for cases like NVME_SC_CMD_INTERRUPTED. > > To be clear: the "default" case of nvme_failover_req() that returns > > false to fallback to NVMe's "local" normal NVMe error handling -- that > > can stay.. but a more explicit handling of cases like > > NVME_SC_CMD_INTERRUPTED should be added to a nvme_local_retry_req() > > check that happens before nvme_failover_req() in nvme_complete_rq(). > > I don't necessarily agree with having a dedicated nvme_local_retry_req(). > a request that isn't failed over, goes to local error handling (retry or > not). I actually think that just adding the condition to > nvme_complete_req and having nvme_failover_req reject it would work. > > Keith? > > >>> Anyway, no new BLK_STS is needed at this point. More discipline with > >>> how NVMe's error handling is changed is. > >> > >> Please read the above. > > > > I agree we'd need a new BLK_STS only if NVMe cannot be made to trap > > NVME_SC_CMD_INTERRUPTED for local retry _before_ considering path > > failover. > > Not sure that is better, but we can see a patch first to determine. 
> > >>> If NVMe wants to ensure its > >>> interface isn't broken regularly it _should_ use blk_path_error() to > >>> validate future nvme_error_status() changes. Miscategorizing NVMe > >>> errors to upper layers is a bug -- not open for debate. > >> > >> Again, don't agree it is a miscategorization nor a bug, it's just something > >> that is NVMe specific. > > > > Right, I understand. > > > > Think it safe to assume these types of details are why Christoph wanted > > to avoid the notion of native NVMe and DM multipathing having > > compatible error handling. There was some wisdom with that position > > (especially with native NVMe goals in mind). But if things are tweaked > > slightly then both camps _can_ be made happy. > > > > There just needs to be a willingness to work through the details, > > defensiveness needs to be shed on both sides, so constructive > > review/consideration of problems can happen. > > Agreed. > > > Think that has already > > happened here with our exchange. I'm open to investing effort here if > > others are up for humoring my attempt to explore fixing the issues in a > > mutually beneficial way. What's the worst that can happen? My simple > > patches might continue to be ignored? ;) > > I won't ignore it, and I apologize for ignoring the original patch > posted, I guess it flew under the radar...
* Re: nvme: restore use of blk_path_error() in nvme_complete_rq() 2020-08-10 14:48 ` Mike Snitzer @ 2020-08-11 12:54 ` Meneghini, John 0 siblings, 0 replies; 50+ messages in thread From: Meneghini, John @ 2020-08-11 12:54 UTC (permalink / raw) To: Mike Snitzer Cc: Sagi Grimberg, linux-nvme, Christoph Hellwig, dm-devel, Ewan Milne, Chao Leng, Keith Busch, Meneghini, John, Hannes Reinecke On 8/10/20, 10:49 AM, "Mike Snitzer" <snitzer@redhat.com> wrote: I appreciate your insight about ACRE. And that NVME_SC_CMD_INTERRUPTED is really just a preview for the kind of NVMe error handling advances to come. But, because I started harping about blk_path_error() I got you hyper-focused on it and that led to all your thinking about me wanting to somehow normalize both SCSI and NVMe. Couldn't be further from my thinking. I've always wanted Linux's NVMe driver to do whatever it needed to handle the evolution of NVMe. I just really need it to do it in a way that: respects NVMe's spec for ANA, ACRE and the like in a way that doesn't make NVMe error handling broken if native NVMe multipath isn't used. Agreed Mike. I just wanted to be sure we are understanding the problem clearly. The good news is I still don't see a need to model NVMe error handling with BLK_STS et al. Yes, I agree some alternative to modeling things around BLK_STS would be good... perhaps even required. I'll admit my suggested solutions don't help much. But AFAICT the NVMe error handling needs some fixing/cleanup. To be continued, with patches so we can collectively have a more productive discussion. If someone can come up with a solution that allows nvme-core error handling to work freely with and without native nvme-multipathing, I'm all for it. But we also need to make sure that any solution is future proof because more changes are coming.
/John Thanks, Mike On Sat, Aug 08 2020 at 5:11pm -0400, Meneghini, John <John.Meneghini@netapp.com> wrote: > Having said all of this, I think one thing which could be done to help > support NVMe with DMP is to approximate any new NVMe error semantics > by adding new BLK_STS codes. Reusing or overloading existing BLK_STS > may not work well because everyone else already has a claim on those > semantics. I would suggest adding new BLK_STS to the > nvme_error_status() routine. I think this is the one place where NVMe > status to BLK_STS translation belongs, and changes can be made there > without impacting any of the NVMe error handling logic in nvme-core - > now that we've removed blk_path_error() from nvme_complete_rq(). > > Another choice would be to simply disable ACRE. This won't solve the > problem of changing NVMe error semantics with dm-multipath but it can > certainly simplify the problem. Maybe a patch that disables ACRE when > REQ_NVME_MPATH is clear. Or you can always just turn the feature off > with the nvme-cli. I'll be sure to keep an eye on the Host Behavior > Support feature at NVMexpress.org. We created this new nvme feature > control mechanism to help with problems like this and I'll be sure to > watch out for new protocol features which could break compatibility > with dm-multipath in the future. > > /John > > On 8/8/20, 5:08 PM, "Meneghini, John" <John.Meneghini@netapp.com> wrote: > > I'd like to up level this whole conversation for a minute by > talking about exactly what ACRE does. > > The genesis of the changes discussed in this thread is NVMe > TP-4033 - Enhanced Command Retry. You can find a copy of this TP > here: > > http://nvmexpress.org/wp-content/uploads/NVM-Express-1.3-Ratified-TPs.zip > > This technical proposal added a command retry delay feature which > is programmed by the controller. The controller advertises a set of 3 > different timing delays though the Identify Controller data structure > CRDT{1-2} fields. 
To make use of these delay fields a new CRD field > was added to the CQE Status Field. This allows the NVMe controller to > specify exactly how long a command retry should be delayed, with 3 > possible timers that it chooses and controls. CRDTs can range from > 100 milliseconds to 6559 seconds. Because this capability can have > such a radical effect on backwards compatibility a new NVMe Feature > Identifier was added (Host Behavior Support - Feature ID 16h) with an > Advanced Command Retry Enable (ACRE) bit. This allows the host to > enable or disable the feature. > > With this background there are a couple of misconceptions in this > thread which I'd like to address. > > The first is: ACRE has nothing to do with the > NVME_SC_CMD_INTERRUPTED status. Yes, this new error status was added > as a part of TP-4033 but the CRD field of the CQE status can be set by > the controller with *any* NVMe error status. As long as the DNR bit is > not set the Command Retry Delay can come into effect. This is how the > spec is written and this is exactly how it has been implemented in the > core nvme_complete_rq() function (after change 507fe46ac91276120). For > example, the controller can return NVME_SC_NS_NOT_READY with a CRDT of > 2 seconds.^ So CDRT needs to be supported with all error status if > the host is going to enable ACRE, and I think it's a big mistake to > get hung up over the NVME_SC_CMD_INTERRUPTED status translation. The > NVME_SC_CMD_INTERRUPTED status was only added to the spec to provide a > general purpose "busy" status, something that was missing from NVMe, > and focusing your solution on NVME_SC_CMD_INTERRUPTED, or any other > specific NVMe error status, is the wrong thing to do. There is a much > larger change in error semantics going on with ACRE than just this > single error. > > The second is: All NVMe error status that do not have a Status > Code Type of 3h (Path Related Status) are subsystem scoped. 
This is a > topic that has gone through some debate on the linux-nvme mailing list > and at NVMexpress.org; and there have been some ECNs to the spec to > address this. There may be some exceptions to this rule because there > are always implementations out there that may not follow, and there > are bugs in the spec. However, this is the intention of the NVMe spec > and it matters. This means that, in a multi-pathing environment, > retrying any command on a different path will not provide a different > result. Retries should all occur on the same controller - unless it is > a path related status. This is how NVMe error semantics work and this > is a part of what was behind Keith's patch . > > https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=35038bffa87da > > Retrying NVME_SC_NS_NOT_READY or NVME_SC_CMD_INTERRUPTED on > another path is simply not the right thing to do, and returning > BLK_STS_TARGET after all command retries, with CRDT, have been > exhausted communicates the right thing to the upper layer. From the > perspective of nvme-multipathing Keith's patch was exactly the correct > thing to do. I understand that this may have caused a backwards > compatibly problem with dm-multipath, and that's the only reason why > I've talked about backing it out. However, ultimately, I think > nvme-core should return an error status like BLK_STS_TARGET that > says, semantically - the IO has failed, no retry will work - because > this is what the NVMe error semantics are. 
> > Taken together both of these facts about the NVMe protocol > semantics are what's behind my patch which removed blk_path_error() > from nvme_complete_rq() > > https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=764e9332098c0e60251386a507fe46ac91276120 > > I understand that there is a goal to try and avoid having > different failure/recovery handling semantically in response to > different error status between nvme-multipath and dm-multipath, but > NVMe error semantics are truly different from SCSI error semantics, > and they are changing. The Linux host needs to enable and support > those changes unhampered by the past. With this goal in mind, removing > the blk_path_error() code from nvme-core was the right thing to do. > Hannes and I struggled with the patch to try and make it work with > blk_path_error() for weeks. As pointed out in the thread below, > blk_path_error() is the SCSI multipathing logic and we can't use it in > nvme_complete_rq(). All it does is import all of the legacy problems > of dm-multipath, and of SCSI, into the nvme completion/multipath > logic. > > At NVMexpress.org we consciously added the ACRE feature because > the SCSI protocol had no such capability. This is something which has > plagued SCSI implementations for years, and all kinds of tricks have > been played, in both the SCSI host stack and in SCSI target stack, to > deal with the problem. The goal of NVMe is to create a better block > storage protocol and ACRE is just one example of many places where the > industry is trying to do this. There are plans to introduce more Host > Behavior Support features in the future. > > In the end, we are consciously changing NVMe, both in the spec and > in its implementation, to make it different from SCSI. I think this is > what's at the bottom of the changes discussed in this thread, and this > is why so many people are so passionate about this. We don't want to > turn NVMe into SCSI. I know I don't want to. 
> > /John > > ^Note: (maybe a 2 second delay sounds unreasonable for a flash storage device, but there are implementors that want to use NVMe with spinning disks... so the CRDT gives them exactly what they need).
* Re: nvme: restore use of blk_path_error() in nvme_complete_rq() 2020-08-08 21:08 ` Meneghini, John 2020-08-08 21:11 ` Meneghini, John @ 2020-08-10 8:10 ` Chao Leng 2020-08-11 12:36 ` Meneghini, John 1 sibling, 1 reply; 50+ messages in thread From: Chao Leng @ 2020-08-10 8:10 UTC (permalink / raw) To: Meneghini, John, Sagi Grimberg, Mike Snitzer Cc: linux-nvme, Christoph Hellwig, dm-devel, Ewan Milne, Keith Busch, Hannes Reinecke I could not agree with you more. The root cause of the conflict is REQ_FAILFAST_TRANSPORT. REQ_FAILFAST_TRANSPORT was presumably designed for SCSI, because the SCSI protocol does not define a local retry mechanism. SCSI implements a fuzzy local retry mechanism, so it needs REQ_FAILFAST_TRANSPORT for multipath software: multipath software is expected to retry according to the error code. NVMe is different from SCSI in this regard. It defines both a local retry mechanism and path error codes, so NVMe should not care about REQ_FAILFAST_TRANSPORT. Also, for nvme multipath, if the error code is not a path error, multipath will not fail over to retry. But blk_queue_dying may return true, in which case the IO cannot be retried on the current path and will be interrupted. Both blk_queue_dying and path errors need failover retry.
So we can do like this: --- drivers/nvme/host/core.c | 26 +++++++++++++++++++------- drivers/nvme/host/multipath.c | 11 +++-------- drivers/nvme/host/nvme.h | 5 ++--- include/linux/nvme.h | 9 +++++++++ 4 files changed, 33 insertions(+), 18 deletions(-) diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c index 4ee2330c603e..07471bd37f60 100644 --- a/drivers/nvme/host/core.c +++ b/drivers/nvme/host/core.c @@ -243,7 +243,7 @@ static blk_status_t nvme_error_status(u16 status) static inline bool nvme_req_needs_retry(struct request *req) { - if (blk_noretry_request(req)) + if (req->cmd_flags & (REQ_FAILFAST_DEV | REQ_FAILFAST_DRIVER)) return false; if (nvme_req(req)->status & NVME_SC_DNR) return false; @@ -252,6 +252,14 @@ static inline bool nvme_req_needs_retry(struct request *req) return true; } +static inline bool nvme_req_path_error(struct request *req) +{ + if ((nvme_req(req)->status & NVME_SCT_MASK) == NVME_SCT_PATH || + blk_queue_dying(req->q)) + return true; + return false; +} + static void nvme_retry_req(struct request *req) { struct nvme_ns *ns = req->q->queuedata; @@ -270,7 +278,7 @@ static void nvme_retry_req(struct request *req) void nvme_complete_rq(struct request *req) { - blk_status_t status = nvme_error_status(nvme_req(req)->status); + blk_status_t status; trace_nvme_complete_rq(req); @@ -279,16 +287,20 @@ void nvme_complete_rq(struct request *req) if (nvme_req(req)->ctrl->kas) nvme_req(req)->ctrl->comp_seen = true; - if (unlikely(status != BLK_STS_OK && nvme_req_needs_retry(req))) { - if ((req->cmd_flags & REQ_NVME_MPATH) && nvme_failover_req(req)) - return; - - if (!blk_queue_dying(req->q)) { + if (unlikely(nvme_req(req)->status != NVME_SC_SUCCESS && + nvme_req_needs_retry(req))) { + if (nvme_req_path_error(req)) { + if (req->cmd_flags & REQ_NVME_MPATH) { + nvme_failover_req(req); + return; + } + } else { nvme_retry_req(req); return; } } + status = nvme_error_status(nvme_req(req)->status); nvme_trace_bio_complete(req, status); 
blk_mq_end_request(req, status); } diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c index 66509472fe06..e182fb3bcd0c 100644 --- a/drivers/nvme/host/multipath.c +++ b/drivers/nvme/host/multipath.c @@ -65,7 +65,7 @@ void nvme_set_disk_name(char *disk_name, struct nvme_ns *ns, } } -bool nvme_failover_req(struct request *req) +void nvme_failover_req(struct request *req) { struct nvme_ns *ns = req->q->queuedata; u16 status = nvme_req(req)->status; @@ -90,17 +90,13 @@ bool nvme_failover_req(struct request *req) queue_work(nvme_wq, &ns->ctrl->ana_work); } break; - case NVME_SC_HOST_PATH_ERROR: - case NVME_SC_HOST_ABORTED_CMD: + default: /* - * Temporary transport disruption in talking to the controller. + * Normal error path. * Try to send on a new path. */ nvme_mpath_clear_current_path(ns); break; - default: - /* This was a non-ANA error so follow the normal error path. */ - return false; } spin_lock_irqsave(&ns->head->requeue_lock, flags); @@ -109,7 +105,6 @@ bool nvme_failover_req(struct request *req) blk_mq_end_request(req, 0); kblockd_schedule_work(&ns->head->requeue_work); - return true; } void nvme_kick_requeue_lists(struct nvme_ctrl *ctrl) diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h index 09ffc3246f60..cbb5d4ba6241 100644 --- a/drivers/nvme/host/nvme.h +++ b/drivers/nvme/host/nvme.h @@ -582,7 +582,7 @@ void nvme_mpath_wait_freeze(struct nvme_subsystem *subsys); void nvme_mpath_start_freeze(struct nvme_subsystem *subsys); void nvme_set_disk_name(char *disk_name, struct nvme_ns *ns, struct nvme_ctrl *ctrl, int *flags); -bool nvme_failover_req(struct request *req); +void nvme_failover_req(struct request *req); void nvme_kick_requeue_lists(struct nvme_ctrl *ctrl); int nvme_mpath_alloc_disk(struct nvme_ctrl *ctrl,struct nvme_ns_head *head); void nvme_mpath_add_disk(struct nvme_ns *ns, struct nvme_id_ns *id); @@ -640,9 +640,8 @@ static inline void nvme_set_disk_name(char *disk_name, struct nvme_ns *ns, sprintf(disk_name, 
"nvme%dn%d", ctrl->instance, ns->head->instance); } -static inline bool nvme_failover_req(struct request *req) +static inline void nvme_failover_req(struct request *req) { - return false; } static inline void nvme_kick_requeue_lists(struct nvme_ctrl *ctrl) { diff --git a/include/linux/nvme.h b/include/linux/nvme.h index 5ce51ab4c50e..8c4a5b4d5b4d 100644 --- a/include/linux/nvme.h +++ b/include/linux/nvme.h @@ -1441,6 +1441,15 @@ enum { NVME_SC_DNR = 0x4000, }; +#define NVME_SCT_MASK 0x700 +enum { + NVME_SCT_GENERIC = 0, + NVME_SCT_COMMAND_SPECIFIC = 0x100, + NVME_SCT_MEDIA = 0x200, + NVME_SCT_PATH = 0x300, + NVME_SCT_VENDOR = 0x700 +}; + struct nvme_completion { /* * Used by Admin and Fabrics commands to return data: -- 2.16.4 On 2020/8/9 5:08, Meneghini, John wrote: > I'd like to up level this whole conversation for a minute by talking about exactly what ACRE does. > > The genesis of the changes discussed in this thread is NVMe TP-4033 - Enhanced Command Retry. You can find a copy of this TP here: > > http://nvmexpress.org/wp-content/uploads/NVM-Express-1.3-Ratified-TPs.zip > > This technical proposal added a command retry delay feature which is programmed by the controller. The controller advertises a set of 3 different timing delays though the Identify Controller data structure CRDT{1-2} fields. To make use of these delay fields a new CRD field was added to the CQE Status Field. This allows the NVMe controller to specify exactly how long a command retry should be delayed, with 3 possible timers that it chooses and controls. CRDTs can range from 100 milliseconds to 6559 seconds. Because this capability can have such a radical effect on backwards compatibility a new NVMe Feature Identifier was added (Host Behavior Support - Feature ID 16h) with an Advanced Command Retry Enable (ACRE) bit. This allows the host to enable or disable the feature. > > With this background there are a couple of misconceptions in this thread which I'd like to address. 
> > The first is: ACRE has nothing to do with the NVME_SC_CMD_INTERRUPTED status. Yes, this new error status was added as a part of TP-4033 but the CRD field of the CQE status can be set by the controller with *any* NVMe error status. As long as the DNR bit is not set the Command Retry Delay can come into effect. This is how the spec is written and this is exactly how it has been implemented in the core nvme_complete_rq() function > (after change 507fe46ac91276120). For example, the controller can return NVME_SC_NS_NOT_READY with a CRDT of 2 seconds.^ So CDRT needs to be supported with all error status if the host is going to enable ACRE, and I think it's a big mistake to get hung up over the NVME_SC_CMD_INTERRUPTED status translation. The NVME_SC_CMD_INTERRUPTED status was only added to the spec to provide a general purpose "busy" status, something that was missing from NVMe, and focusing your solution on NVME_SC_CMD_INTERRUPTED, or any other specific NVMe error status, is the wrong thing to do. There is a much larger change in error semantics going on with ACRE than just this single error. > > The second is: All NVMe error status that do not have a Status Code Type of 3h (Path Related Status) are subsystem scoped. This is a topic that has gone through some debate on the linux-nvme mailing list and at NVMexpress.org; and there have been some ECNs to the spec to address this. There may be some exceptions to this rule because there are always implementations out there that may not follow, and there are bugs in the spec. However, this is the intention of the NVMe spec and it matters. This means that, in a multi-pathing environment, retrying any command on a different path will not provide a different result. Retries should all occur on the same controller - unless it is a path related status. This is how NVMe error semantics work and this is a part of what was behind Keith's patch . 
> > https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=35038bffa87da > > Retrying NVME_SC_NS_NOT_READY or NVME_SC_CMD_INTERRUPTED on another path is simply not the right thing to do, and returning BLK_STS_TARGET after all command retries, with CRDT, have been exhausted communicates the right thing to the upper layer. From the perspective of nvme-multipathing Keith's patch was exactly the correct thing to do. I understand that this may have caused a backwards compatibility problem with dm-multipath, and that's the only reason why I've talked about backing it out. However, ultimately, I think nvme-core should return an error status like BLK_STS_TARGET that says, semantically - the IO has failed, no retry will work - because this is what the NVMe error semantics are. > > Taken together both of these facts about the NVMe protocol semantics are what's behind my patch which removed blk_path_error() from nvme_complete_rq() > > https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=764e9332098c0e60251386a507fe46ac91276120 > > I understand that there is a goal to try and avoid having different failure/recovery handling semantically in response to different error statuses between nvme-multipath and dm-multipath, but NVMe error semantics are truly different from SCSI error semantics, and they are changing. The Linux host needs to enable and support those changes unhampered by the past. With this goal in mind, removing the blk_path_error() code from nvme-core was the right thing to do. Hannes and I struggled with the patch to try and make it work with blk_path_error() for weeks. As pointed out in the thread below, blk_path_error() is the SCSI multipathing logic and we can't use it in nvme_complete_rq(). All it does is import all of the legacy problems of dm-multipath, and of SCSI, into the nvme completion/multipath logic. > > At NVMexpress.org we consciously added the ACRE feature because the SCSI protocol had no such capability.
This is something which has plagued SCSI implementations for years, and all kinds of tricks have been played, in both the SCSI host stack and in SCSI target stack, to deal with the problem. The goal of NVMe is to create a better block storage protocol and ACRE is just one example of many places where the industry is trying to do this. There are plans to introduce more Host Behavior Support features in the future. > > In the end, we are consciously changing NVMe, both in the spec and in its implementation, to make it different from SCSI. I think this is what's at the bottom of the changes discussed in this thread, and this is why so many people are so passionate about this. We don't want to turn NVMe into SCSI. I know I don't want to. > > /John > > ^Note: (maybe a 2 second delay sounds unreasonable for a flash storage device but there are implementors that want to use NVMe with spinning disks... so the CRDT gives them exactly what they need). > > On 8/7/20, 7:35 PM, "Sagi Grimberg" <sagi@grimberg.me> wrote: > > >> Hey Mike, > >> > >>>> The point is: blk_path_error() has nothing to do with NVMe errors. > >>>> This is dm-multipath logic stuck in the middle of the NVMe error > >>>> handling code. > >>> > >>> No, it is a means to have multiple subsystems (to this point both SCSI > >>> and NVMe) doing the correct thing when translating subsystem specific > >>> error codes to BLK_STS codes. > >> > >> Not exactly, don't find any use of this in scsi. The purpose is to make > >> sure that nvme and dm speak the same language. > > > > SCSI doesn't need to do additional work to train a multipath layer > > because dm-multipath _is_ SCSI's multipathing in Linux. > > Agree. > > > So ensuring SCSI properly classifies its error codes happens as a > > side-effect of ensuring continued multipath functionality. > > > > Hannes introduced all these differentiated error codes in block core > > because of SCSI. NVMe is meant to build on the infrastructure that was > > established.
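John's second point above - only Status Code Type 3h is path scoped - is exactly the test that the NVME_SCT_MASK addition in Chao's patch encodes. As a standalone sketch (the mask and SCT values match the patch hunk for include/linux/nvme.h; status_is_path_related() is an illustrative name, the patch spells the full check nvme_req_path_error() and also folds in blk_queue_dying()):

```c
#include <stdbool.h>

/* SCT lives in bits 8-10 of the 15-bit status (CQE status >> 1) */
#define NVME_SCT_MASK	0x700
#define NVME_SCT_PATH	0x300

/*
 * Illustrative check: only a path-related status (SCT 3h) can justify
 * failing over to another path; every other SCT is subsystem scoped,
 * so a retry belongs on the same controller.
 */
static bool status_is_path_related(unsigned short status)
{
	return (status & NVME_SCT_MASK) == NVME_SCT_PATH;
}
```

Under this rule NVME_SC_HOST_PATH_ERROR (0x370) fails over, while NVME_SC_CMD_INTERRUPTED (0x21) stays on the same controller.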
> > Yes, exactly my point. blk_path_error is designed to make nvme and > dm-multipath speak the same language. > > >>> If you, or others you name drop below, understood the point we wouldn't > >>> be having this conversation. You'd accept the point of blk_path_error() > >>> to be valid and required codification of what constitutes a retryable > >>> path error for the Linux block layer. > >> > >> This incident is a case where the specific nvme status was designed > >> to retry on the same path respecting the controller retry delay. > >> And because nvme used blk_path_error at the time it caused us to use a > >> non-retryable status to get around that. Granted, no one had > >> dm-multipath in mind. > >> > >> So in a sense, there is consensus on changing patch 35038bffa87da > >> _because_ nvme no longer uses blk_path_error. Otherwise it would break. > > > > "break" meaning it would do failover instead of the more optimal local > > retry to the same controller. > > > > I see. Wish the header for commit 35038bffa87da touched on this even a > > little bit ;) > > I think it did, but maybe didn't put too much emphasis on it. > > > But AFAICT the patch I provided doesn't compromise proper local retry -- > > as long as we first fix nvme_error_status() to return a retryable > > BLK_STS for NVME_SC_CMD_INTERRUPTED -- which I assumed as a prereq. > > > > Think of blk_path_error() as a more coarse-grained "is this retryable or > > a hard failure?" check. So for NVME_SC_CMD_INTERRUPTED, > > nvme_error_status() _should_ respond with something retryable (I'd > > prefer BLK_STS_RESOURCE to be honest). > > But blk_path_error semantically means "is this a pathing error", or at > least that's what its name suggests.
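For reference, this is essentially what blk_path_error() looked like at the time (include/linux/blk_types.h); the enum values here are userspace stand-ins for the kernel's opaque blk_status_t codes so the sketch compiles on its own:

```c
#include <stdbool.h>

/* Stand-ins for the kernel's opaque blk_status_t values */
enum blk_status {
	BLK_STS_OK = 0,
	BLK_STS_NOTSUPP,
	BLK_STS_TIMEOUT,
	BLK_STS_NOSPC,
	BLK_STS_TRANSPORT,
	BLK_STS_TARGET,
	BLK_STS_NEXUS,
	BLK_STS_MEDIUM,
	BLK_STS_PROTECTION,
	BLK_STS_RESOURCE,
	BLK_STS_IOERR,
};

/*
 * Mirrors blk_path_error(): a status is treated as a path error
 * (worth retrying on another path) unless it is one of the statuses
 * known to mean "any path will fail the same way".
 */
static bool blk_path_error(enum blk_status error)
{
	switch (error) {
	case BLK_STS_NOTSUPP:
	case BLK_STS_NOSPC:
	case BLK_STS_TARGET:
	case BLK_STS_NEXUS:
	case BLK_STS_MEDIUM:
	case BLK_STS_PROTECTION:
		return false;
	default:
		/* Anything else could be a path failure, so retry */
		return true;
	}
}
```

The relevant point for this thread: BLK_STS_TARGET is in the non-retryable list, which is why translating NVME_SC_CMD_INTERRUPTED to BLK_STS_TARGET made dm-multipath fail the IO instead of retrying it.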
> > > And then nvme_failover_req() is finer-grained; by returning false it now > > allows short-circuiting failover and reverting back to NVMe's normal > > controller based error recovery -- which it does for > > NVME_SC_CMD_INTERRUPTED due to "default" case in nvme_failover_req(). > > > > And then the previous nvme_error_status() classification of retryable > > BLK_STS obviously never gets returned up the IO stack; it gets thrown > > away. > > I see what you are saying. The issue is that the code becomes > convoluted (it's a pathing error, oh wait, no its not a pathing error). > > >>> Any BLK_STS mapping of NVMe specific error codes would need to not screw > >>> up by categorizing a retryable error as non-retryable (and vice-versa). > >> > >> But it is a special type of retryable. There is nothing that fits the > >> semantics of the current behavior. > > > > I agree. But that's fine actually. > > > > And this issue is the poster-child for why properly supporting a duality > > of driver-level vs upper level multipathing capabilities is pretty much > > impossible unless a clean design factors out the different error classes > > -- and local error retry is handled before punting to higher level > > failover retry. Think if NVMe were to adopt a bit more disciplined > > "local then failover" error handling it all gets easier. > > I don't think punting before is easier, because we do a local retry if: > - no multipathing sw on top > - request needs retry (e.g. no DNR, notretry is off etc..) > - nvme error is not pathing related (nvme_failover_req returned false) > > > This local retry _is_ NVMe specific. NVMe should just own retrying on > > the same controller no matter what (I'll hope that such retry has > > awareness to not retry indefinitely though!). > > it will retry until the retry limit. > > > And this has nothing to > > do with multipathing, so the logic to handle it shouldn't be trapped in > > nvme_failover_req(). 
> > Well given that nvme_failover_req already may not actually failover this > makes some sense to me (although I did have some resistance to make it > that way in the first place, but was convinced otherwise). > > > I think NVMe can easily fix this by having an earlier stage of checking, > > e.g. nvme_local_retry_req(), that shortcircuits ever getting to > > higher-level multipathing consideration (be it native NVMe or DM > > multipathing) for cases like NVME_SC_CMD_INTERRUPTED. > > To be clear: the "default" case of nvme_failover_req() that returns > > false to fallback to NVMe's "local" normal NVMe error handling -- that > > can stay.. but a more explicit handling of cases like > > NVME_SC_CMD_INTERRUPTED should be added to a nvme_local_retry_req() > > check that happens before nvme_failover_req() in nvme_complete_rq(). > > I don't necessarily agree with having a dedicated nvme_local_retry_req(). > a request that isn't failed over, goes to local error handling (retry or > not). I actually think that just adding the condition to > nvme_complete_req and having nvme_failover_req reject it would work. > > Keith? > > >>> Anyway, no new BLK_STS is needed at this point. More discipline with > >>> how NVMe's error handling is changed is. > >> > >> Please read the above. > > > > I agree we'd need a new BLK_STS only if NVMe cannot be made to trap > > NVME_SC_CMD_INTERRUPTED for local retry _before_ considering path > > failover. > > Not sure that is better, but we can see a patch first to determine. > > >>> If NVMe wants to ensure its > >>> interface isn't broken regularly it _should_ use blk_path_error() to > >>> validate future nvme_error_status() changes. Miscategorizing NVMe > >>> errors to upper layers is a bug -- not open for debate. > >> > >> Again, don't agree is a Miscategorization nor a bug, its just something > >> that is NVMe specific. > > > > Right I understand. 
> > > > Think it safe to assume these types of details are why Christoph wanted > > to avoid the notion of native NVMe and DM multipathing having > > compatible error handling. There was some wisdom with that position > > (especially with native NVMe goals in mind). But if things are tweaked > > slightly then both camps _can_ be made happy. > > > > There just needs to be a willingness to work through the details, > > defensiveness needs to be shed on both sides, so constructive > > review/consideration of problems can happen. > > Agreed. > > > Think that has already > > happened here with our exchange. I'm open to investing effort here if > > others are up for humoring my attempt to explore fixing the issues in a > > mutually beneficial way. What's the worst that can happen? My simple > > patches might continue to be ignored? ;) > > I won't ignore it, and I apologize for ignoring the original patch > posted, I guess it flew under the radar...
* Re: nvme: restore use of blk_path_error() in nvme_complete_rq() 2020-08-10 8:10 ` Chao Leng @ 2020-08-11 12:36 ` Meneghini, John 2020-08-12 7:51 ` Chao Leng 0 siblings, 1 reply; 50+ messages in thread From: Meneghini, John @ 2020-08-11 12:36 UTC (permalink / raw) To: Chao Leng, Sagi Grimberg, Mike Snitzer Cc: linux-nvme, Christoph Hellwig, dm-devel, Ewan Milne, Keith Busch, Meneghini, John, Hannes Reinecke Chao, I don't see this patch on patchwork. https://patchwork.kernel.org/project/linux-block/list/?submitter=173677 Can you upload this there. I'd like to apply this patch to my sandbox and take a closer look. Thanks, /John On 8/10/20, 4:10 AM, "Chao Leng" <lengchao@huawei.com> wrote: I could not agree with you more. The root cause of the conflict is REQ_FAILFAST_TRANSPORT. REQ_FAILFAST_TRANSPORT may have been designed for SCSI, because the SCSI protocol does not define a local retry mechanism. SCSI implements a fuzzy local retry mechanism, so it needs REQ_FAILFAST_TRANSPORT for multipath software; having the multipath software retry according to the error code is expected there. NVMe is different from SCSI in this respect: it defines a local retry mechanism and path error codes, so NVMe should not care about REQ_FAILFAST_TRANSPORT. Another point: for NVMe multipath, if the error code is not a path error, multipath will not fail over to retry. But blk_queue_dying() may return true, in which case the IO cannot be retried on the current path and will be interrupted. Both blk_queue_dying() and path errors need failover retry.
So we can do like this: --- drivers/nvme/host/core.c | 26 +++++++++++++++++++------- drivers/nvme/host/multipath.c | 11 +++-------- drivers/nvme/host/nvme.h | 5 ++--- include/linux/nvme.h | 9 +++++++++ 4 files changed, 33 insertions(+), 18 deletions(-) diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c index 4ee2330c603e..07471bd37f60 100644 --- a/drivers/nvme/host/core.c +++ b/drivers/nvme/host/core.c @@ -243,7 +243,7 @@ static blk_status_t nvme_error_status(u16 status) static inline bool nvme_req_needs_retry(struct request *req) { - if (blk_noretry_request(req)) + if (req->cmd_flags & (REQ_FAILFAST_DEV | REQ_FAILFAST_DRIVER)) return false; if (nvme_req(req)->status & NVME_SC_DNR) return false; @@ -252,6 +252,14 @@ static inline bool nvme_req_needs_retry(struct request *req) return true; } +static inline bool nvme_req_path_error(struct request *req) +{ + if ((nvme_req(req)->status & NVME_SCT_MASK) == NVME_SCT_PATH || + blk_queue_dying(req->q)) + return true; + return false; +} + static void nvme_retry_req(struct request *req) { struct nvme_ns *ns = req->q->queuedata; @@ -270,7 +278,7 @@ static void nvme_retry_req(struct request *req) void nvme_complete_rq(struct request *req) { - blk_status_t status = nvme_error_status(nvme_req(req)->status); + blk_status_t status; trace_nvme_complete_rq(req); @@ -279,16 +287,20 @@ void nvme_complete_rq(struct request *req) if (nvme_req(req)->ctrl->kas) nvme_req(req)->ctrl->comp_seen = true; - if (unlikely(status != BLK_STS_OK && nvme_req_needs_retry(req))) { - if ((req->cmd_flags & REQ_NVME_MPATH) && nvme_failover_req(req)) - return; - - if (!blk_queue_dying(req->q)) { + if (unlikely(nvme_req(req)->status != NVME_SC_SUCCESS && + nvme_req_needs_retry(req))) { + if (nvme_req_path_error(req)) { + if (req->cmd_flags & REQ_NVME_MPATH) { + nvme_failover_req(req); + return; + } + } else { nvme_retry_req(req); return; } } + status = nvme_error_status(nvme_req(req)->status); nvme_trace_bio_complete(req, status); 
blk_mq_end_request(req, status); } diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c index 66509472fe06..e182fb3bcd0c 100644 --- a/drivers/nvme/host/multipath.c +++ b/drivers/nvme/host/multipath.c @@ -65,7 +65,7 @@ void nvme_set_disk_name(char *disk_name, struct nvme_ns *ns, } } -bool nvme_failover_req(struct request *req) +void nvme_failover_req(struct request *req) { struct nvme_ns *ns = req->q->queuedata; u16 status = nvme_req(req)->status; @@ -90,17 +90,13 @@ bool nvme_failover_req(struct request *req) queue_work(nvme_wq, &ns->ctrl->ana_work); } break; - case NVME_SC_HOST_PATH_ERROR: - case NVME_SC_HOST_ABORTED_CMD: + default: /* - * Temporary transport disruption in talking to the controller. + * Normal error path. * Try to send on a new path. */ nvme_mpath_clear_current_path(ns); break; - default: - /* This was a non-ANA error so follow the normal error path. */ - return false; } spin_lock_irqsave(&ns->head->requeue_lock, flags); @@ -109,7 +105,6 @@ bool nvme_failover_req(struct request *req) blk_mq_end_request(req, 0); kblockd_schedule_work(&ns->head->requeue_work); - return true; } void nvme_kick_requeue_lists(struct nvme_ctrl *ctrl) diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h index 09ffc3246f60..cbb5d4ba6241 100644 --- a/drivers/nvme/host/nvme.h +++ b/drivers/nvme/host/nvme.h @@ -582,7 +582,7 @@ void nvme_mpath_wait_freeze(struct nvme_subsystem *subsys); void nvme_mpath_start_freeze(struct nvme_subsystem *subsys); void nvme_set_disk_name(char *disk_name, struct nvme_ns *ns, struct nvme_ctrl *ctrl, int *flags); -bool nvme_failover_req(struct request *req); +void nvme_failover_req(struct request *req); void nvme_kick_requeue_lists(struct nvme_ctrl *ctrl); int nvme_mpath_alloc_disk(struct nvme_ctrl *ctrl,struct nvme_ns_head *head); void nvme_mpath_add_disk(struct nvme_ns *ns, struct nvme_id_ns *id); @@ -640,9 +640,8 @@ static inline void nvme_set_disk_name(char *disk_name, struct nvme_ns *ns, sprintf(disk_name, 
"nvme%dn%d", ctrl->instance, ns->head->instance); } -static inline bool nvme_failover_req(struct request *req) +static inline void nvme_failover_req(struct request *req) { - return false; } static inline void nvme_kick_requeue_lists(struct nvme_ctrl *ctrl) { diff --git a/include/linux/nvme.h b/include/linux/nvme.h index 5ce51ab4c50e..8c4a5b4d5b4d 100644 --- a/include/linux/nvme.h +++ b/include/linux/nvme.h @@ -1441,6 +1441,15 @@ enum { NVME_SC_DNR = 0x4000, }; +#define NVME_SCT_MASK 0x700 +enum { + NVME_SCT_GENERIC = 0, + NVME_SCT_COMMAND_SPECIFIC = 0x100, + NVME_SCT_MEDIA = 0x200, + NVME_SCT_PATH = 0x300, + NVME_SCT_VENDOR = 0x700 +}; + struct nvme_completion { /* * Used by Admin and Fabrics commands to return data: -- 2.16.4 On 2020/8/9 5:08, Meneghini, John wrote: > I'd like to up level this whole conversation for a minute by talking about exactly what ACRE does. > > The genesis of the changes discussed in this thread is NVMe TP-4033 - Enhanced Command Retry. You can find a copy of this TP here: > > http://nvmexpress.org/wp-content/uploads/NVM-Express-1.3-Ratified-TPs.zip > > This technical proposal added a command retry delay feature which is programmed by the controller. The controller advertises a set of 3 different timing delays though the Identify Controller data structure CRDT{1-2} fields. To make use of these delay fields a new CRD field was added to the CQE Status Field. This allows the NVMe controller to specify exactly how long a command retry should be delayed, with 3 possible timers that it chooses and controls. CRDTs can range from 100 milliseconds to 6559 seconds. Because this capability can have such a radical effect on backwards compatibility a new NVMe Feature Identifier was added (Host Behavior Support - Feature ID 16h) with an Advanced Command Retry Enable (ACRE) bit. This allows the host to enable or disable the feature. > > With this background there are a couple of misconceptions in this thread which I'd like to address. 
> > The first is: ACRE has nothing to do with the NVME_SC_CMD_INTERRUPTED status. Yes, this new error status was added as a part of TP-4033 but the CRD field of the CQE status can be set by the controller with *any* NVMe error status. As long as the DNR bit is not set the Command Retry Delay can come into effect. This is how the spec is written and this is exactly how it has been implemented in the core nvme_complete_rq() function > (after change 507fe46ac91276120). For example, the controller can return NVME_SC_NS_NOT_READY with a CRDT of 2 seconds.^ So CDRT needs to be supported with all error status if the host is going to enable ACRE, and I think it's a big mistake to get hung up over the NVME_SC_CMD_INTERRUPTED status translation. The NVME_SC_CMD_INTERRUPTED status was only added to the spec to provide a general purpose "busy" status, something that was missing from NVMe, and focusing your solution on NVME_SC_CMD_INTERRUPTED, or any other specific NVMe error status, is the wrong thing to do. There is a much larger change in error semantics going on with ACRE than just this single error. > > The second is: All NVMe error status that do not have a Status Code Type of 3h (Path Related Status) are subsystem scoped. This is a topic that has gone through some debate on the linux-nvme mailing list and at NVMexpress.org; and there have been some ECNs to the spec to address this. There may be some exceptions to this rule because there are always implementations out there that may not follow, and there are bugs in the spec. However, this is the intention of the NVMe spec and it matters. This means that, in a multi-pathing environment, retrying any command on a different path will not provide a different result. Retries should all occur on the same controller - unless it is a path related status. This is how NVMe error semantics work and this is a part of what was behind Keith's patch . 
> > https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=35038bffa87da > > Retrying NVME_SC_NS_NOT_READY or NVME_SC_CMD_INTERRUPTED on another path is simply not the right thing to do, and returning BLK_STS_TARGET after all command retries, with CRDT, have been exhausted communicates the right thing to the upper layer. From the perspective of nvme-multipathing Keith's patch was exactly the correct thing to do. I understand that this may have caused a backwards compatibly problem with dm-multipath, and that's the only reason why I've talked about backing it out. However, ultimately, I think nvme-core should return an error status like BLK_STS_TARGET that says, semantically - the IO has failed, no retry will work - because this is what the NVMe error semantics are. > > Taken together both of these facts about the NVMe protocol semantics are what's behind my patch which removed blk_path_error() from nvme_complete_rq() > > https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=764e9332098c0e60251386a507fe46ac91276120 > > I understand that there is a goal to try and avoid having different failure/recovery handling semantically in response to different error status between nvme-multipath and dm-multipath, but NVMe error semantics are truly different from SCSI error semantics, and they are changing. The Linux host needs to enable and support those changes unhampered by the past. With this goal in mind, removing the blk_path_error() code from nvme-core was the right thing to do. Hannes and I struggled with the patch to try and make it work with blk_path_error() for weeks. As pointed out in the thread below, blk_path_error() is the SCSI multipathing logic and we can't use it in nvme_complete_rq(). All it does is import all of the legacy problems of dm-multipath, and of SCSI, into the nvme completion/multipath logic. > > At NVMexpress.org we consciously added the ACRE feature because the SCSI protocol had no such capability. 
This is something which has plagued SCSI implementations for years, and all kinds of tricks have been played, in both the SCSI host stack and in SCSI target stack, to deal with the problem. The goal of NVMe is to create a better block storage protocol and ACRE is just one example of many places where the industry is trying to do this. There are plans to introduce more Host Behavior Support features in the future. > > In the end, we are consciously changing NVMe, both in the spec and in its implementation, to make it different from SCSI. I think this is what's at the bottom of the changes discussed in this thread, and this is why so many people are so passionate about this. We don't want to turn NVMe into SCSI. I know I don't want to. > > /John > > ^Note: (maybe a 2 second delay sounds unreasonable for a flash storage device but there implementors that that want to use NVMe with spinning disks... so the CDRT gives them exactly what they need). > > On 8/7/20, 7:35 PM, "Sagi Grimberg" <sagi@grimberg.me> wrote: > > >> Hey Mike, > >> > >>>> The point is: blk_path_error() has nothing to do with NVMe errors. > >>>> This is dm-multipath logic stuck in the middle of the NVMe error > >>>> handling code. > >>> > >>> No, it is a means to have multiple subsystems (to this point both SCSI > >>> and NVMe) doing the correct thing when translating subsystem specific > >>> error codes to BLK_STS codes. > >> > >> Not exactly, don't find any use of this in scsi. The purpose is to make > >> sure that nvme and dm speak the same language. > > > > SCSI doesn't need to do additional work to train a multipath layer > > because dm-multipath _is_ SCSI's multipathing in Linux. > > Agree. > > > So ensuring SCSI properly classifies its error codes happens as a > > side-effect of ensuring continued multipath functionality. > > > > Hannes introduced all these differentiated error codes in block core > > because of SCSI. NVMe is meant to build on the infrastructure that was > > established. 
> > Yes, exactly my point. blk_path_error is designed to make nvme and > dm-multipath speak the same language. > > >>> If you, or others you name drop below, understood the point we wouldn't > >>> be having this conversation. You'd accept the point of blk_path_error() > >>> to be valid and required codification of what constitutes a retryable > >>> path error for the Linux block layer. > >> > >> This incident is a case where the specific nvme status was designed > >> to retry on the same path respecting the controller retry delay. > >> And because nvme used blk_path_error at the time it caused us to use a > >> non-retryable status to get around that. Granted, no one had > >> dm-multipath in mind. > >> > >> So in a sense, there is consensus on changing patch 35038bffa87da > >> _because_ nvme no longer uses blk_path_error. Otherwise it would break. > > > > "break" meaning it would do failover instead of the more optimal local > > retry to the same controller. > > > > I see. Wish the header for commit 35038bffa87da touched on this even a > > little bit ;) > > I think it did, but maybe didn't put too much emphasis on it. > > > But AFAICT the patch I provided doesn't compromise proper local retry -- > > as long as we first fix nvme_error_status() to return a retryable > > BLK_STS for NVME_SC_CMD_INTERRUPTED -- which I assumed as a prereq. > > > > Think of blk_path_error() as a more coarse-grained "is this retryable or > > a hard failure?" check. So for NVME_SC_CMD_INTERRUPTED, > > nvme_error_status() _should_ respond with something retryable (I'd > > prefer BLK_STS_RESOURCE to be honest). > > But blk_path_error semantically mean "is this a pathing error", or at > least that what its name suggest. 
> > > And then nvme_failover_req() is finer-grained; by returning false it now > > allows short-circuiting failover and reverting back to NVMe's normal > > controller based error recovery -- which it does for > > NVME_SC_CMD_INTERRUPTED due to "default" case in nvme_failover_req(). > > > > And then the previous nvme_error_status() classification of retryable > > BLK_STS obviously never gets returned up the IO stack; it gets thrown > > away. > > I see what you are saying. The issue is that the code becomes > convoluted (it's a pathing error, oh wait, no its not a pathing error). > > >>> Any BLK_STS mapping of NVMe specific error codes would need to not screw > >>> up by categorizing a retryable error as non-retryable (and vice-versa). > >> > >> But it is a special type of retryable. There is nothing that fits the > >> semantics of the current behavior. > > > > I agree. But that's fine actually. > > > > And this issue is the poster-child for why properly supporting a duality > > of driver-level vs upper level multipathing capabilities is pretty much > > impossible unless a clean design factors out the different error classes > > -- and local error retry is handled before punting to higher level > > failover retry. Think if NVMe were to adopt a bit more disciplined > > "local then failover" error handling it all gets easier. > > I don't think punting before is easier, because we do a local retry if: > - no multipathing sw on top > - request needs retry (e.g. no DNR, notretry is off etc..) > - nvme error is not pathing related (nvme_failover_req returned false) > > > This local retry _is_ NVMe specific. NVMe should just own retrying on > > the same controller no matter what (I'll hope that such retry has > > awareness to not retry indefinitely though!). > > it will retry until the retry limit. > > > And this has nothing to > > do with multipathing, so the logic to handle it shouldn't be trapped in > > nvme_failover_req(). 
> > Well given that nvme_failover_req already may not actually failover this > makes some sense to me (although I did have some resistance to make it > that way in the first place, but was convinced otherwise). > > > I think NVMe can easily fix this by having an earlier stage of checking, > > e.g. nvme_local_retry_req(), that shortcircuits ever getting to > > higher-level multipathing consideration (be it native NVMe or DM > > multipathing) for cases like NVME_SC_CMD_INTERRUPTED. > > To be clear: the "default" case of nvme_failover_req() that returns > > false to fallback to NVMe's "local" normal NVMe error handling -- that > > can stay.. but a more explicit handling of cases like > > NVME_SC_CMD_INTERRUPTED should be added to a nvme_local_retry_req() > > check that happens before nvme_failover_req() in nvme_complete_rq(). > > I don't necessarily agree with having a dedicated nvme_local_retry_req(). > a request that isn't failed over, goes to local error handling (retry or > not). I actually think that just adding the condition to > nvme_complete_req and having nvme_failover_req reject it would work. > > Keith? > > >>> Anyway, no new BLK_STS is needed at this point. More discipline with > >>> how NVMe's error handling is changed is. > >> > >> Please read the above. > > > > I agree we'd need a new BLK_STS only if NVMe cannot be made to trap > > NVME_SC_CMD_INTERRUPTED for local retry _before_ considering path > > failover. > > Not sure that is better, but we can see a patch first to determine. > > >>> If NVMe wants to ensure its > >>> interface isn't broken regularly it _should_ use blk_path_error() to > >>> validate future nvme_error_status() changes. Miscategorizing NVMe > >>> errors to upper layers is a bug -- not open for debate. > >> > >> Again, don't agree is a Miscategorization nor a bug, its just something > >> that is NVMe specific. > > > > Right I understand. 
> > > > Think it safe to assume these types of details are why Christoph wanted > > to avoid the notion of native NVMe and DM multipathing having > > compatible error handling. There was some wisdom with that position > > (especially with native NVMe goals in mind). But if things are tweaked > > slightly then both camps _can_ be made happy. > > > > There just needs to be a willingness to work through the details, > > defensiveness needs to be shed on both sides, so constructive > > review/consideration of problems can happen. > > Agreed. > > > Think that has already > > happened here with our exchange. I'm open to investing effort here if > > others are up for humoring my attempt to explore fixing the issues in a > > mutually beneficial way. What's the worst that can happen? My simple > > patches might continue to be ignored? ;) > > I won't ignore it, and I apologize of ignoring the original patch > posted, I guess it flew under the radar... > _______________________________________________ Linux-nvme mailing list Linux-nvme@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: nvme: restore use of blk_path_error() in nvme_complete_rq() 2020-08-11 12:36 ` Meneghini, John @ 2020-08-12 7:51 ` Chao Leng 0 siblings, 0 replies; 50+ messages in thread From: Chao Leng @ 2020-08-12 7:51 UTC (permalink / raw) To: Meneghini, John, Sagi Grimberg, Mike Snitzer Cc: linux-nvme, Christoph Hellwig, dm-devel, Ewan Milne, Keith Busch, Hannes Reinecke On 2020/8/11 20:36, Meneghini, John wrote: > Chao, > > I don't see this patch on patchwork. > > https://patchwork.kernel.org/project/linux-block/list/?submitter=173677 > > Can you upload this there. I'd like to apply this patch to my sandbox and take a closer look. Ok, I will send the patch later. > > Thanks, > > /John
*/ > - return false; > } > > spin_lock_irqsave(&ns->head->requeue_lock, flags); > @@ -109,7 +105,6 @@ bool nvme_failover_req(struct request *req) > blk_mq_end_request(req, 0); > > kblockd_schedule_work(&ns->head->requeue_work); > - return true; > } > > void nvme_kick_requeue_lists(struct nvme_ctrl *ctrl) > diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h > index 09ffc3246f60..cbb5d4ba6241 100644 > --- a/drivers/nvme/host/nvme.h > +++ b/drivers/nvme/host/nvme.h > @@ -582,7 +582,7 @@ void nvme_mpath_wait_freeze(struct nvme_subsystem *subsys); > void nvme_mpath_start_freeze(struct nvme_subsystem *subsys); > void nvme_set_disk_name(char *disk_name, struct nvme_ns *ns, > struct nvme_ctrl *ctrl, int *flags); > -bool nvme_failover_req(struct request *req); > +void nvme_failover_req(struct request *req); > void nvme_kick_requeue_lists(struct nvme_ctrl *ctrl); > int nvme_mpath_alloc_disk(struct nvme_ctrl *ctrl,struct nvme_ns_head *head); > void nvme_mpath_add_disk(struct nvme_ns *ns, struct nvme_id_ns *id); > @@ -640,9 +640,8 @@ static inline void nvme_set_disk_name(char *disk_name, struct nvme_ns *ns, > sprintf(disk_name, "nvme%dn%d", ctrl->instance, ns->head->instance); > } > > -static inline bool nvme_failover_req(struct request *req) > +static inline void nvme_failover_req(struct request *req) > { > - return false; > } > static inline void nvme_kick_requeue_lists(struct nvme_ctrl *ctrl) > { > diff --git a/include/linux/nvme.h b/include/linux/nvme.h > index 5ce51ab4c50e..8c4a5b4d5b4d 100644 > --- a/include/linux/nvme.h > +++ b/include/linux/nvme.h > @@ -1441,6 +1441,15 @@ enum { > NVME_SC_DNR = 0x4000, > }; > > +#define NVME_SCT_MASK 0x700 > +enum { > + NVME_SCT_GENERIC = 0, > + NVME_SCT_COMMAND_SPECIFIC = 0x100, > + NVME_SCT_MEDIA = 0x200, > + NVME_SCT_PATH = 0x300, > + NVME_SCT_VENDOR = 0x700 > +}; > + > struct nvme_completion { > /* > * Used by Admin and Fabrics commands to return data: > -- > 2.16.4 > > > On 2020/8/9 5:08, Meneghini, John wrote: > > 
I'd like to up-level this whole conversation for a minute by talking about exactly what ACRE does. > > > > The genesis of the changes discussed in this thread is NVMe TP-4033 - Enhanced Command Retry. You can find a copy of this TP here: > > > > http://nvmexpress.org/wp-content/uploads/NVM-Express-1.3-Ratified-TPs.zip > > > > This technical proposal added a command retry delay feature which is programmed by the controller. The controller advertises a set of 3 different timing delays through the Identify Controller data structure CRDT{1-3} fields. To make use of these delay fields a new CRD field was added to the CQE Status Field. This allows the NVMe controller to specify exactly how long a command retry should be delayed, with 3 possible timers that it chooses and controls. CRDTs can range from 100 milliseconds to 6553.5 seconds. Because this capability can have such a radical effect on backwards compatibility a new NVMe Feature Identifier was added (Host Behavior Support - Feature ID 16h) with an Advanced Command Retry Enable (ACRE) bit. This allows the host to enable or disable the feature. > > > > With this background there are a couple of misconceptions in this thread which I'd like to address. > > > > The first is: ACRE has nothing to do with the NVME_SC_CMD_INTERRUPTED status. Yes, this new error status was added as a part of TP-4033 but the CRD field of the CQE status can be set by the controller with *any* NVMe error status. As long as the DNR bit is not set the Command Retry Delay can come into effect. This is how the spec is written and this is exactly how it has been implemented in the core nvme_complete_rq() function > > (after change 507fe46ac91276120). For example, the controller can return NVME_SC_NS_NOT_READY with a CRDT of 2 seconds.^ So CRDT needs to be supported with all error statuses if the host is going to enable ACRE, and I think it's a big mistake to get hung up over the NVME_SC_CMD_INTERRUPTED status translation.
The NVME_SC_CMD_INTERRUPTED status was only added to the spec to provide a general purpose "busy" status, something that was missing from NVMe, and focusing your solution on NVME_SC_CMD_INTERRUPTED, or any other specific NVMe error status, is the wrong thing to do. There is a much larger change in error semantics going on with ACRE than just this single error. > > > > The second is: All NVMe error statuses that do not have a Status Code Type of 3h (Path Related Status) are subsystem scoped. This is a topic that has gone through some debate on the linux-nvme mailing list and at NVMexpress.org; and there have been some ECNs to the spec to address this. There may be some exceptions to this rule because there are always implementations out there that may not follow it, and there are bugs in the spec. However, this is the intention of the NVMe spec and it matters. This means that, in a multi-pathing environment, retrying any command on a different path will not provide a different result. Retries should all occur on the same controller - unless it is a path related status. This is how NVMe error semantics work and this is a part of what was behind Keith's patch. > > > > https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=35038bffa87da > > > > Retrying NVME_SC_NS_NOT_READY or NVME_SC_CMD_INTERRUPTED on another path is simply not the right thing to do, and returning BLK_STS_TARGET after all command retries, with CRDT, have been exhausted communicates the right thing to the upper layer. From the perspective of nvme-multipathing Keith's patch was exactly the correct thing to do. I understand that this may have caused a backwards compatibility problem with dm-multipath, and that's the only reason why I've talked about backing it out. However, ultimately, I think nvme-core should return an error status like BLK_STS_TARGET that says, semantically - the IO has failed, no retry will work - because this is what the NVMe error semantics are.
> > > > Taken together both of these facts about the NVMe protocol semantics are what's behind my patch which removed blk_path_error() from nvme_complete_rq() > > > > https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=764e9332098c0e60251386a507fe46ac91276120 > > > > I understand that there is a goal to try and avoid having different failure/recovery handling semantically in response to different error status between nvme-multipath and dm-multipath, but NVMe error semantics are truly different from SCSI error semantics, and they are changing. The Linux host needs to enable and support those changes unhampered by the past. With this goal in mind, removing the blk_path_error() code from nvme-core was the right thing to do. Hannes and I struggled with the patch to try and make it work with blk_path_error() for weeks. As pointed out in the thread below, blk_path_error() is the SCSI multipathing logic and we can't use it in nvme_complete_rq(). All it does is import all of the legacy problems of dm-multipath, and of SCSI, into the nvme completion/multipath logic. > > > > At NVMexpress.org we consciously added the ACRE feature because the SCSI protocol had no such capability. This is something which has plagued SCSI implementations for years, and all kinds of tricks have been played, in both the SCSI host stack and in SCSI target stack, to deal with the problem. The goal of NVMe is to create a better block storage protocol and ACRE is just one example of many places where the industry is trying to do this. There are plans to introduce more Host Behavior Support features in the future. > > > > In the end, we are consciously changing NVMe, both in the spec and in its implementation, to make it different from SCSI. I think this is what's at the bottom of the changes discussed in this thread, and this is why so many people are so passionate about this. We don't want to turn NVMe into SCSI. I know I don't want to. 
> > > > /John > > > > ^Note: (maybe a 2 second delay sounds unreasonable for a flash storage device but there are implementors that want to use NVMe with spinning disks... so the CRDT gives them exactly what they need). > > > > On 8/7/20, 7:35 PM, "Sagi Grimberg" <sagi@grimberg.me> wrote: > > > > >> Hey Mike, > > >> > > >>>> The point is: blk_path_error() has nothing to do with NVMe errors. > > >>>> This is dm-multipath logic stuck in the middle of the NVMe error > > >>>> handling code. > > >>> > > >>> No, it is a means to have multiple subsystems (to this point both SCSI > > >>> and NVMe) doing the correct thing when translating subsystem specific > > >>> error codes to BLK_STS codes. > > >> > > >> Not exactly, don't find any use of this in scsi. The purpose is to make > > >> sure that nvme and dm speak the same language. > > > > > > SCSI doesn't need to do additional work to train a multipath layer > > > because dm-multipath _is_ SCSI's multipathing in Linux. > > > > Agree. > > > > > So ensuring SCSI properly classifies its error codes happens as a > > > side-effect of ensuring continued multipath functionality. > > > > > > Hannes introduced all these differentiated error codes in block core > > > because of SCSI. NVMe is meant to build on the infrastructure that was > > > established. > > > > Yes, exactly my point. blk_path_error is designed to make nvme and > > dm-multipath speak the same language. > > > > >>> If you, or others you name drop below, understood the point we wouldn't > > >>> be having this conversation. You'd accept the point of blk_path_error() > > >>> to be valid and required codification of what constitutes a retryable > > >>> path error for the Linux block layer. > > >> > > >> This incident is a case where the specific nvme status was designed > > >> to retry on the same path respecting the controller retry delay. > > >> And because nvme used blk_path_error at the time it caused us to use a > > >> non-retryable status to get around that.
Granted, no one had > > >> dm-multipath in mind. > > >> > > >> So in a sense, there is consensus on changing patch 35038bffa87da > > >> _because_ nvme no longer uses blk_path_error. Otherwise it would break. > > > > > > "break" meaning it would do failover instead of the more optimal local > > > retry to the same controller. > > > > > > I see. Wish the header for commit 35038bffa87da touched on this even a > > > little bit ;) > > > > I think it did, but maybe didn't put too much emphasis on it. > > > > > But AFAICT the patch I provided doesn't compromise proper local retry -- > > > as long as we first fix nvme_error_status() to return a retryable > > > BLK_STS for NVME_SC_CMD_INTERRUPTED -- which I assumed as a prereq. > > > > > > Think of blk_path_error() as a more coarse-grained "is this retryable or > > > a hard failure?" check. So for NVME_SC_CMD_INTERRUPTED, > > > nvme_error_status() _should_ respond with something retryable (I'd > > > prefer BLK_STS_RESOURCE to be honest). > > > > But blk_path_error semantically mean "is this a pathing error", or at > > least that what its name suggest. > > > > > And then nvme_failover_req() is finer-grained; by returning false it now > > > allows short-circuiting failover and reverting back to NVMe's normal > > > controller based error recovery -- which it does for > > > NVME_SC_CMD_INTERRUPTED due to "default" case in nvme_failover_req(). > > > > > > And then the previous nvme_error_status() classification of retryable > > > BLK_STS obviously never gets returned up the IO stack; it gets thrown > > > away. > > > > I see what you are saying. The issue is that the code becomes > > convoluted (it's a pathing error, oh wait, no its not a pathing error). > > > > >>> Any BLK_STS mapping of NVMe specific error codes would need to not screw > > >>> up by categorizing a retryable error as non-retryable (and vice-versa). > > >> > > >> But it is a special type of retryable. 
There is nothing that fits the > > >> semantics of the current behavior. > > > > > > I agree. But that's fine actually. > > > > > > And this issue is the poster-child for why properly supporting a duality > > > of driver-level vs upper level multipathing capabilities is pretty much > > > impossible unless a clean design factors out the different error classes > > > -- and local error retry is handled before punting to higher level > > > failover retry. Think if NVMe were to adopt a bit more disciplined > > > "local then failover" error handling it all gets easier. > > > > I don't think punting before is easier, because we do a local retry if: > > - no multipathing sw on top > > - request needs retry (e.g. no DNR, notretry is off etc..) > > - nvme error is not pathing related (nvme_failover_req returned false) > > > > > This local retry _is_ NVMe specific. NVMe should just own retrying on > > > the same controller no matter what (I'll hope that such retry has > > > awareness to not retry indefinitely though!). > > > > it will retry until the retry limit. > > > > > And this has nothing to > > > do with multipathing, so the logic to handle it shouldn't be trapped in > > > nvme_failover_req(). > > > > Well given that nvme_failover_req already may not actually failover this > > makes some sense to me (although I did have some resistance to make it > > that way in the first place, but was convinced otherwise). > > > > > I think NVMe can easily fix this by having an earlier stage of checking, > > > e.g. nvme_local_retry_req(), that shortcircuits ever getting to > > > higher-level multipathing consideration (be it native NVMe or DM > > > multipathing) for cases like NVME_SC_CMD_INTERRUPTED. > > > To be clear: the "default" case of nvme_failover_req() that returns > > > false to fallback to NVMe's "local" normal NVMe error handling -- that > > > can stay.. 
but a more explicit handling of cases like > > > NVME_SC_CMD_INTERRUPTED should be added to a nvme_local_retry_req() > > > check that happens before nvme_failover_req() in nvme_complete_rq(). > > > > I don't necessarily agree with having a dedicated nvme_local_retry_req(). > > a request that isn't failed over, goes to local error handling (retry or > > not). I actually think that just adding the condition to > > nvme_complete_req and having nvme_failover_req reject it would work. > > > > Keith? > > > > >>> Anyway, no new BLK_STS is needed at this point. More discipline with > > >>> how NVMe's error handling is changed is. > > >> > > >> Please read the above. > > > > > > I agree we'd need a new BLK_STS only if NVMe cannot be made to trap > > > NVME_SC_CMD_INTERRUPTED for local retry _before_ considering path > > > failover. > > > > Not sure that is better, but we can see a patch first to determine. > > > > >>> If NVMe wants to ensure its > > >>> interface isn't broken regularly it _should_ use blk_path_error() to > > >>> validate future nvme_error_status() changes. Miscategorizing NVMe > > >>> errors to upper layers is a bug -- not open for debate. > > >> > > >> Again, don't agree is a Miscategorization nor a bug, its just something > > >> that is NVMe specific. > > > > > > Right I understand. > > > > > > Think it safe to assume these types of details are why Christoph wanted > > > to avoid the notion of native NVMe and DM multipathing having > > > compatible error handling. There was some wisdom with that position > > > (especially with native NVMe goals in mind). But if things are tweaked > > > slightly then both camps _can_ be made happy. > > > > > > There just needs to be a willingness to work through the details, > > > defensiveness needs to be shed on both sides, so constructive > > > review/consideration of problems can happen. > > > > Agreed. > > > > > Think that has already > > > happened here with our exchange. 
I'm open to investing effort here if > > > others are up for humoring my attempt to explore fixing the issues in a > > > mutually beneficial way. What's the worst that can happen? My simple > > > patches might continue to be ignored? ;) > > > > I won't ignore it, and I apologize for ignoring the original patch > > posted, I guess it flew under the radar... > > > _______________________________________________ Linux-nvme mailing list Linux-nvme@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: nvme: restore use of blk_path_error() in nvme_complete_rq() 2020-08-07 23:35 ` Sagi Grimberg 2020-08-08 21:08 ` Meneghini, John @ 2020-08-10 14:36 ` Mike Snitzer 2020-08-10 17:22 ` [PATCH] nvme: explicitly use normal NVMe error handling when appropriate Mike Snitzer 1 sibling, 1 reply; 50+ messages in thread From: Mike Snitzer @ 2020-08-10 14:36 UTC (permalink / raw) To: Sagi Grimberg Cc: Hannes Reinecke, linux-nvme, Christoph Hellwig, dm-devel, Ewan Milne, Chao Leng, Keith Busch, Meneghini, John On Fri, Aug 07 2020 at 7:35pm -0400, Sagi Grimberg <sagi@grimberg.me> wrote: > > >>Hey Mike, > >> > >>>>The point is: blk_path_error() has nothing to do with NVMe errors. > >>>>This is dm-multipath logic stuck in the middle of the NVMe error > >>>>handling code. > >>> > >>>No, it is a means to have multiple subsystems (to this point both SCSI > >>>and NVMe) doing the correct thing when translating subsystem specific > >>>error codes to BLK_STS codes. > >> > >>Not exactly, don't find any use of this in scsi. The purpose is to make > >>sure that nvme and dm speak the same language. > > > >SCSI doesn't need to do additional work to train a multipath layer > >because dm-multipath _is_ SCSI's multipathing in Linux. > > Agree. > > >So ensuring SCSI properly classifies its error codes happens as a > >side-effect of ensuring continued multipath functionality. > > > >Hannes introduced all these differentiated error codes in block core > >because of SCSI. NVMe is meant to build on the infrastructure that was > >established. > > Yes, exactly my point. blk_path_error is designed to make nvme and > dm-multipath speak the same language. Yes, code was lifted from dm-multipath to block core for that purpose. > >But AFAICT the patch I provided doesn't compromise proper local retry -- > >as long as we first fix nvme_error_status() to return a retryable > >BLK_STS for NVME_SC_CMD_INTERRUPTED -- which I assumed as a prereq. 
> > > >Think of blk_path_error() as a more coarse-grained "is this retryable or > >a hard failure?" check. So for NVME_SC_CMD_INTERRUPTED, > >nvme_error_status() _should_ respond with something retryable (I'd > >prefer BLK_STS_RESOURCE to be honest). > > But blk_path_error semantically mean "is this a pathing error", or at > least that what its name suggest. As you can see here: https://www.redhat.com/archives/dm-devel/2018-January/msg00001.html Keith's first pass lifted dm-mpath.c:noretry_error to blk_types.h:blk_path_failure, and I think Jens preferred the name blk_path_error; that is how it has stayed. In any case, blk_path_error is really a distraction for the problems I now see more clearly. > >And then nvme_failover_req() is finer-grained; by returning false it now > >allows short-circuiting failover and reverting back to NVMe's normal > >controller based error recovery -- which it does for > >NVME_SC_CMD_INTERRUPTED due to "default" case in nvme_failover_req(). > > > >And then the previous nvme_error_status() classification of retryable > >BLK_STS obviously never gets returned up the IO stack; it gets thrown > >away. > > I see what you are saying. The issue is that the code becomes > convoluted (it's a pathing error, oh wait, no its not a pathing error). Yeah, I don't care for it either. And I don't think we need to get bogged down with that nastiness. Really not a fan of NVMe's code that lazily routes non-path related error handling through nvme_failover_req() only to return false back to nvme_complete_rq(). It's too cute, too convoluted. I concede that it is worthwhile to trap errors that should be dealt with by NVMe's normal error handling and punt back.. but when it does so it should be treated as a WARN_ON() or at least pr_warn() worthy event. And an earlier negative check in nvme_failover_req() should be updated to latch onto that class of error that is dealt with by NVMe's normal error handling.
> >>>Any BLK_STS mapping of NVMe specific error codes would need to not screw > >>>up by categorizing a retryable error as non-retryable (and vice-versa). > >> > >>But it is a special type of retryable. There is nothing that fits the > >>semantics of the current behavior. > > > >I agree. But that's fine actually. > > > >And this issue is the poster-child for why properly supporting a duality > >of driver-level vs upper level multipathing capabilities is pretty much > >impossible unless a clean design factors out the different error classes > >-- and local error retry is handled before punting to higher level > >failover retry. Think if NVMe were to adopt a bit more disciplined > >"local then failover" error handling it all gets easier. > > I don't think punting before is easier, because we do a local retry if: > - no multipathing sw on top > - request needs retry (e.g. no DNR, notretry is off etc..) > - nvme error is not pathing related (nvme_failover_req returned false) > > >This local retry _is_ NVMe specific. NVMe should just own retrying on > >the same controller no matter what (I'll hope that such retry has > >awareness to not retry indefinitely though!). > > it will retry until the retry limit. Yes. > > And this has nothing to > >do with multipathing, so the logic to handle it shouldn't be trapped in > >nvme_failover_req(). > > Well given that nvme_failover_req already may not actually failover this > makes some sense to me (although I did have some resistance to make it > that way in the first place, but was convinced otherwise). > > >I think NVMe can easily fix this by having an earlier stage of checking, > >e.g. nvme_local_retry_req(), that shortcircuits ever getting to > >higher-level multipathing consideration (be it native NVMe or DM > >multipathing) for cases like NVME_SC_CMD_INTERRUPTED. > >To be clear: the "default" case of nvme_failover_req() that returns > >false to fallback to NVMe's "local" normal NVMe error handling -- that > >can stay.. 
but a more explicit handling of cases like > >NVME_SC_CMD_INTERRUPTED should be added to a nvme_local_retry_req() > >check that happens before nvme_failover_req() in nvme_complete_rq(). > > I don't necessarily agree with having a dedicated nvme_local_retry_req(). > a request that isn't failed over, goes to local error handling (retry or > not). I actually think that just adding the condition to > nvme_complete_req and having nvme_failover_req reject it would work. > > Keith? I think that is basically what I'm thinking too. But looking at NVMe's core.c: I do not think nvme_complete_rq() allows local error handling _unless_ nvme_req_needs_retry() returns true. > >>>Anyway, no new BLK_STS is needed at this point. More discipline with > >>>how NVMe's error handling is changed is. > >> > >>Please read the above. > > > >I agree we'd need a new BLK_STS only if NVMe cannot be made to trap > >NVME_SC_CMD_INTERRUPTED for local retry _before_ considering path > >failover. > > Not sure that is better, but we can see a patch first to determine. > > >>>If NVMe wants to ensure its > >>>interface isn't broken regularly it _should_ use blk_path_error() to > >>>validate future nvme_error_status() changes. Miscategorizing NVMe > >>>errors to upper layers is a bug -- not open for debate. > >> > >>Again, don't agree is a Miscategorization nor a bug, its just something > >>that is NVMe specific. > > > >Right I understand. > > > >Think it safe to assume these types of details are why Christoph wanted > >to avoid the notion of native NVMe and DM multipathing having > >compatible error handling. There was some wisdom with that position > >(especially with native NVMe goals in mind). But if things are tweaked > >slightly then both camps _can_ be made happy. > > > >There just needs to be a willingness to work through the details, > >defensiveness needs to be shed on both sides, so constructive > >review/consideration of problems can happen. > > Agreed. 
> > >Think that has already > >happened here with our exchange. I'm open to investing effort here if > >others are up for humoring my attempt to explore fixing the issues in a > >mutually beneficial way. What's the worst that can happen? My simple > >patches might continue to be ignored? ;) > > I won't ignore it, and I apologize for ignoring the original patch > posted, I guess it flew under the radar... No worries. We all get pulled in many directions and get a lot of email. I'll be sending some nvme patch(es) shortly, to refocus our discussion. Mike _______________________________________________ Linux-nvme mailing list Linux-nvme@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme ^ permalink raw reply [flat|nested] 50+ messages in thread
* [PATCH] nvme: explicitly use normal NVMe error handling when appropriate 2020-08-10 14:36 ` Mike Snitzer @ 2020-08-10 17:22 ` Mike Snitzer 2020-08-11 3:32 ` Chao Leng 2020-08-13 14:48 ` [RESEND PATCH] " Mike Snitzer 0 siblings, 2 replies; 50+ messages in thread From: Mike Snitzer @ 2020-08-10 17:22 UTC (permalink / raw) To: Sagi Grimberg Cc: Hannes Reinecke, linux-nvme, Christoph Hellwig, dm-devel, Ewan Milne, Chao Leng, Keith Busch, Meneghini, John On Mon, Aug 10 2020 at 10:36am -0400, Mike Snitzer <snitzer@redhat.com> wrote: > On Fri, Aug 07 2020 at 7:35pm -0400, > Sagi Grimberg <sagi@grimberg.me> wrote: > > > > > >>Hey Mike, ... > > >I think NVMe can easily fix this by having an earlier stage of checking, > > >e.g. nvme_local_retry_req(), that shortcircuits ever getting to > > >higher-level multipathing consideration (be it native NVMe or DM > > >multipathing) for cases like NVME_SC_CMD_INTERRUPTED. > > >To be clear: the "default" case of nvme_failover_req() that returns > > >false to fallback to NVMe's "local" normal NVMe error handling -- that > > >can stay.. but a more explicit handling of cases like > > >NVME_SC_CMD_INTERRUPTED should be added to a nvme_local_retry_req() > > >check that happens before nvme_failover_req() in nvme_complete_rq(). > > > > I don't necessarily agree with having a dedicated nvme_local_retry_req(). > > a request that isn't failed over, goes to local error handling (retry or > > not). I actually think that just adding the condition to > > nvme_complete_req and having nvme_failover_req reject it would work. > > > > Keith? > > I think that is basically what I'm thinking too. 
From: Mike Snitzer <snitzer@redhat.com> Subject: nvme: explicitly use normal NVMe error handling when appropriate Commit 764e9332098c0 ("nvme-multipath: do not reset on unknown status"), among other things, fixed NVME_SC_CMD_INTERRUPTED error handling by changing multipathing's nvme_failover_req() to short-circuit path failover and then fall back to NVMe's normal error handling (which takes care of NVME_SC_CMD_INTERRUPTED). This detour through native NVMe multipathing code is unwelcome because it prevents NVMe core from handling NVME_SC_CMD_INTERRUPTED independent of any multipathing concerns. Introduce nvme_status_needs_local_error_handling() to prioritize non-failover retry, when appropriate, in terms of normal NVMe error handling. nvme_status_needs_local_error_handling() will naturally evolve to include handling of any other errors that normal error handling must be used for. nvme_failover_req()'s ability to fall back to normal NVMe error handling has been preserved because it may be useful for future NVME_SC that nvme_status_needs_local_error_handling() hasn't yet been trained for.
Signed-off-by: Mike Snitzer <snitzer@redhat.com> --- drivers/nvme/host/core.c | 16 ++++++++++++++-- 1 file changed, 14 insertions(+), 2 deletions(-) diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c index 88cff309d8e4..be749b690af7 100644 --- a/drivers/nvme/host/core.c +++ b/drivers/nvme/host/core.c @@ -252,6 +252,16 @@ static inline bool nvme_req_needs_retry(struct request *req) return true; } +static inline bool nvme_status_needs_local_error_handling(u16 status) +{ + switch (status & 0x7ff) { + case NVME_SC_CMD_INTERRUPTED: + return true; + default: + return false; + } +} + static void nvme_retry_req(struct request *req) { struct nvme_ns *ns = req->q->queuedata; @@ -270,7 +280,8 @@ static void nvme_retry_req(struct request *req) void nvme_complete_rq(struct request *req) { - blk_status_t status = nvme_error_status(nvme_req(req)->status); + u16 nvme_status = nvme_req(req)->status; + blk_status_t status = nvme_error_status(nvme_status); trace_nvme_complete_rq(req); @@ -280,7 +291,8 @@ void nvme_complete_rq(struct request *req) nvme_req(req)->ctrl->comp_seen = true; if (unlikely(status != BLK_STS_OK && nvme_req_needs_retry(req))) { - if ((req->cmd_flags & REQ_NVME_MPATH) && nvme_failover_req(req)) + if (!nvme_status_needs_local_error_handling(nvme_status) && + (req->cmd_flags & REQ_NVME_MPATH) && nvme_failover_req(req)) return; if (!blk_queue_dying(req->q)) { -- 2.18.0 _______________________________________________ Linux-nvme mailing list Linux-nvme@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme ^ permalink raw reply related [flat|nested] 50+ messages in thread
* Re: [PATCH] nvme: explicitly use normal NVMe error handling when appropriate 2020-08-10 17:22 ` [PATCH] nvme: explicitly use normal NVMe error handling when appropriate Mike Snitzer @ 2020-08-11 3:32 ` Chao Leng 2020-08-11 4:20 ` Mike Snitzer 2020-08-13 14:48 ` [RESEND PATCH] " Mike Snitzer 1 sibling, 1 reply; 50+ messages in thread From: Chao Leng @ 2020-08-11 3:32 UTC (permalink / raw) To: Mike Snitzer, Sagi Grimberg Cc: Hannes Reinecke, linux-nvme, Christoph Hellwig, dm-devel, Ewan Milne, Keith Busch, Meneghini, John On 2020/8/11 1:22, Mike Snitzer wrote: > On Mon, Aug 10 2020 at 10:36am -0400, > Mike Snitzer <snitzer@redhat.com> wrote: > >> On Fri, Aug 07 2020 at 7:35pm -0400, >> Sagi Grimberg <sagi@grimberg.me> wrote: >> >>> >>>>> Hey Mike, > ... >>>> I think NVMe can easily fix this by having an earlier stage of checking, >>>> e.g. nvme_local_retry_req(), that shortcircuits ever getting to >>>> higher-level multipathing consideration (be it native NVMe or DM >>>> multipathing) for cases like NVME_SC_CMD_INTERRUPTED. >>>> To be clear: the "default" case of nvme_failover_req() that returns >>>> false to fallback to NVMe's "local" normal NVMe error handling -- that >>>> can stay.. but a more explicit handling of cases like >>>> NVME_SC_CMD_INTERRUPTED should be added to a nvme_local_retry_req() >>>> check that happens before nvme_failover_req() in nvme_complete_rq(). >>> >>> I don't necessarily agree with having a dedicated nvme_local_retry_req(). >>> a request that isn't failed over, goes to local error handling (retry or >>> not). I actually think that just adding the condition to >>> nvme_complete_req and having nvme_failover_req reject it would work. >>> >>> Keith? >> >> I think that is basically what I'm thinking too. 
> > From: Mike Snitzer <snitzer@redhat.com> > Subject: nvme: explicitly use normal NVMe error handling when appropriate > > Commit 764e9332098c0 ("nvme-multipath: do not reset on unknown > status"), among other things, fixed NVME_SC_CMD_INTERRUPTED error > handling by changing multipathing's nvme_failover_req() to short-circuit > path failover and then fallback to NVMe's normal error handling (which > takes care of NVME_SC_CMD_INTERRUPTED). > > This detour through native NVMe multipathing code is unwelcome because > it prevents NVMe core from handling NVME_SC_CMD_INTERRUPTED independent > of any multipathing concerns. > > Introduce nvme_status_needs_local_error_handling() to prioritize > non-failover retry, when appropriate, in terms of normal NVMe error > handling. nvme_status_needs_local_error_handling() will naturely evolve > to include handling of any other errors that normal error handling must > be used for. > > nvme_failover_req()'s ability to fallback to normal NVMe error handling > has been preserved because it may be useful for future NVME_SC that > nvme_status_needs_local_error_handling() hasn't yet been trained for. 
> > Signed-off-by: Mike Snitzer <snitzer@redhat.com> > --- > drivers/nvme/host/core.c | 16 ++++++++++++++-- > 1 file changed, 14 insertions(+), 2 deletions(-) > > diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c > index 88cff309d8e4..be749b690af7 100644 > --- a/drivers/nvme/host/core.c > +++ b/drivers/nvme/host/core.c > @@ -252,6 +252,16 @@ static inline bool nvme_req_needs_retry(struct request *req) > return true; > } > > +static inline bool nvme_status_needs_local_error_handling(u16 status) > +{ > + switch (status & 0x7ff) { > + case NVME_SC_CMD_INTERRUPTED: > + return true; > + default: > + return false; > + } > +} > + > static void nvme_retry_req(struct request *req) > { > struct nvme_ns *ns = req->q->queuedata; > @@ -270,7 +280,8 @@ static void nvme_retry_req(struct request *req) > > void nvme_complete_rq(struct request *req) > { > - blk_status_t status = nvme_error_status(nvme_req(req)->status); > + u16 nvme_status = nvme_req(req)->status; > + blk_status_t status = nvme_error_status(nvme_status); > > trace_nvme_complete_rq(req); > > @@ -280,7 +291,8 @@ void nvme_complete_rq(struct request *req) > nvme_req(req)->ctrl->comp_seen = true; > > if (unlikely(status != BLK_STS_OK && nvme_req_needs_retry(req))) { > - if ((req->cmd_flags & REQ_NVME_MPATH) && nvme_failover_req(req)) > + if (!nvme_status_needs_local_error_handling(nvme_status) && > + (req->cmd_flags & REQ_NVME_MPATH) && nvme_failover_req(req))

This looks to have no effect: with nvme multipath we already retry locally, and with dm-multipath the error is still returned.

> return; > > if (!blk_queue_dying(req->q)) { > Suggest: REQ_FAILFAST_TRANSPORT may have been designed for SCSI, because the SCSI protocol does not define a local retry mechanism. SCSI implements a fuzzy local retry mechanism, so it needs REQ_FAILFAST_TRANSPORT for multipath software; multipath software is expected to retry according to the error code. NVMe differs from SCSI in this respect.
It defines a local retry mechanism and path error codes, so NVMe should not care about REQ_FAILFAST_TRANSPORT. Also, for nvme multipath, if the error code is not a path error, multipath will not fail over to retry; but blk_queue_dying() may return true, in which case the I/O cannot be retried on the current path and is interrupted. Both blk_queue_dying() and path errors need failover to retry. So we can do it like this: --- drivers/nvme/host/core.c | 26 +++++++++++++++++++------- drivers/nvme/host/multipath.c | 11 +++-------- drivers/nvme/host/nvme.h | 5 ++--- include/linux/nvme.h | 9 +++++++++ 4 files changed, 33 insertions(+), 18 deletions(-) diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c index 4ee2330c603e..07471bd37f60 100644 --- a/drivers/nvme/host/core.c +++ b/drivers/nvme/host/core.c @@ -243,7 +243,7 @@ static blk_status_t nvme_error_status(u16 status) static inline bool nvme_req_needs_retry(struct request *req) { - if (blk_noretry_request(req)) + if (req->cmd_flags & (REQ_FAILFAST_DEV | REQ_FAILFAST_DRIVER)) return false; if (nvme_req(req)->status & NVME_SC_DNR) return false; @@ -252,6 +252,14 @@ static inline bool nvme_req_needs_retry(struct request *req) return true; } +static inline bool nvme_req_path_error(struct request *req) +{ + if ((nvme_req(req)->status & NVME_SCT_MASK) == NVME_SCT_PATH || + blk_queue_dying(req->q)) + return true; + return false; +} + static void nvme_retry_req(struct request *req) { struct nvme_ns *ns = req->q->queuedata; @@ -270,7 +278,7 @@ static void nvme_retry_req(struct request *req) void nvme_complete_rq(struct request *req) { - blk_status_t status = nvme_error_status(nvme_req(req)->status); + blk_status_t status; trace_nvme_complete_rq(req); @@ -279,16 +287,20 @@ void nvme_complete_rq(struct request *req) if (nvme_req(req)->ctrl->kas) nvme_req(req)->ctrl->comp_seen = true; - if (unlikely(status != BLK_STS_OK && nvme_req_needs_retry(req))) { - if ((req->cmd_flags & REQ_NVME_MPATH) && nvme_failover_req(req)) - return; - - if
(!blk_queue_dying(req->q)) { + if (unlikely(nvme_req(req)->status != NVME_SC_SUCCESS && + nvme_req_needs_retry(req))) { + if (nvme_req_path_error(req)) { + if (req->cmd_flags & REQ_NVME_MPATH) { + nvme_failover_req(req); + return; + } + } else { nvme_retry_req(req); return; } } + status = nvme_error_status(nvme_req(req)->status); nvme_trace_bio_complete(req, status); blk_mq_end_request(req, status); } diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c index 66509472fe06..e182fb3bcd0c 100644 --- a/drivers/nvme/host/multipath.c +++ b/drivers/nvme/host/multipath.c @@ -65,7 +65,7 @@ void nvme_set_disk_name(char *disk_name, struct nvme_ns *ns, } } -bool nvme_failover_req(struct request *req) +void nvme_failover_req(struct request *req) { struct nvme_ns *ns = req->q->queuedata; u16 status = nvme_req(req)->status; @@ -90,17 +90,13 @@ bool nvme_failover_req(struct request *req) queue_work(nvme_wq, &ns->ctrl->ana_work); } break; - case NVME_SC_HOST_PATH_ERROR: - case NVME_SC_HOST_ABORTED_CMD: + default: /* - * Temporary transport disruption in talking to the controller. + * Normal error path. * Try to send on a new path. */ nvme_mpath_clear_current_path(ns); break; - default: - /* This was a non-ANA error so follow the normal error path. 
*/ - return false; } spin_lock_irqsave(&ns->head->requeue_lock, flags); @@ -109,7 +105,6 @@ bool nvme_failover_req(struct request *req) blk_mq_end_request(req, 0); kblockd_schedule_work(&ns->head->requeue_work); - return true; } void nvme_kick_requeue_lists(struct nvme_ctrl *ctrl) diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h index 09ffc3246f60..cbb5d4ba6241 100644 --- a/drivers/nvme/host/nvme.h +++ b/drivers/nvme/host/nvme.h @@ -582,7 +582,7 @@ void nvme_mpath_wait_freeze(struct nvme_subsystem *subsys); void nvme_mpath_start_freeze(struct nvme_subsystem *subsys); void nvme_set_disk_name(char *disk_name, struct nvme_ns *ns, struct nvme_ctrl *ctrl, int *flags); -bool nvme_failover_req(struct request *req); +void nvme_failover_req(struct request *req); void nvme_kick_requeue_lists(struct nvme_ctrl *ctrl); int nvme_mpath_alloc_disk(struct nvme_ctrl *ctrl,struct nvme_ns_head *head); void nvme_mpath_add_disk(struct nvme_ns *ns, struct nvme_id_ns *id); @@ -640,9 +640,8 @@ static inline void nvme_set_disk_name(char *disk_name, struct nvme_ns *ns, sprintf(disk_name, "nvme%dn%d", ctrl->instance, ns->head->instance); } -static inline bool nvme_failover_req(struct request *req) +static inline void nvme_failover_req(struct request *req) { - return false; } static inline void nvme_kick_requeue_lists(struct nvme_ctrl *ctrl) { diff --git a/include/linux/nvme.h b/include/linux/nvme.h index 5ce51ab4c50e..8c4a5b4d5b4d 100644 --- a/include/linux/nvme.h +++ b/include/linux/nvme.h @@ -1441,6 +1441,15 @@ enum { NVME_SC_DNR = 0x4000, }; +#define NVME_SCT_MASK 0x700 +enum { + NVME_SCT_GENERIC = 0, + NVME_SCT_COMMAND_SPECIFIC = 0x100, + NVME_SCT_MEDIA = 0x200, + NVME_SCT_PATH = 0x300, + NVME_SCT_VENDOR = 0x700 +}; + struct nvme_completion { /* * Used by Admin and Fabrics commands to return data: -- 2.16.4 _______________________________________________ Linux-nvme mailing list Linux-nvme@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme ^ 
* Re: nvme: explicitly use normal NVMe error handling when appropriate 2020-08-11 3:32 ` Chao Leng @ 2020-08-11 4:20 ` Mike Snitzer 2020-08-11 6:17 ` Chao Leng 0 siblings, 1 reply; 50+ messages in thread From: Mike Snitzer @ 2020-08-11 4:20 UTC (permalink / raw) To: Chao Leng Cc: Sagi Grimberg, Hannes Reinecke, linux-nvme, Christoph Hellwig, dm-devel, Ewan Milne, Keith Busch, Meneghini, John On Mon, Aug 10 2020 at 11:32pm -0400, Chao Leng <lengchao@huawei.com> wrote: > > > On 2020/8/11 1:22, Mike Snitzer wrote: > >On Mon, Aug 10 2020 at 10:36am -0400, > >Mike Snitzer <snitzer@redhat.com> wrote: > > > >>On Fri, Aug 07 2020 at 7:35pm -0400, > >>Sagi Grimberg <sagi@grimberg.me> wrote: > >> > >>> > >>>>>Hey Mike, > >... > >>>>I think NVMe can easily fix this by having an earlier stage of checking, > >>>>e.g. nvme_local_retry_req(), that shortcircuits ever getting to > >>>>higher-level multipathing consideration (be it native NVMe or DM > >>>>multipathing) for cases like NVME_SC_CMD_INTERRUPTED. > >>>>To be clear: the "default" case of nvme_failover_req() that returns > >>>>false to fallback to NVMe's "local" normal NVMe error handling -- that > >>>>can stay.. but a more explicit handling of cases like > >>>>NVME_SC_CMD_INTERRUPTED should be added to a nvme_local_retry_req() > >>>>check that happens before nvme_failover_req() in nvme_complete_rq(). > >>> > >>>I don't necessarily agree with having a dedicated nvme_local_retry_req(). > >>>a request that isn't failed over, goes to local error handling (retry or > >>>not). I actually think that just adding the condition to > >>>nvme_complete_req and having nvme_failover_req reject it would work. > >>> > >>>Keith? > >> > >>I think that is basically what I'm thinking too. 
> > > >From: Mike Snitzer <snitzer@redhat.com> > >Subject: nvme: explicitly use normal NVMe error handling when appropriate > > > >Commit 764e9332098c0 ("nvme-multipath: do not reset on unknown > >status"), among other things, fixed NVME_SC_CMD_INTERRUPTED error > >handling by changing multipathing's nvme_failover_req() to short-circuit > >path failover and then fallback to NVMe's normal error handling (which > >takes care of NVME_SC_CMD_INTERRUPTED). > > > >This detour through native NVMe multipathing code is unwelcome because > >it prevents NVMe core from handling NVME_SC_CMD_INTERRUPTED independent > >of any multipathing concerns. > > > >Introduce nvme_status_needs_local_error_handling() to prioritize > >non-failover retry, when appropriate, in terms of normal NVMe error > >handling. nvme_status_needs_local_error_handling() will naturely evolve > >to include handling of any other errors that normal error handling must > >be used for. > > > >nvme_failover_req()'s ability to fallback to normal NVMe error handling > >has been preserved because it may be useful for future NVME_SC that > >nvme_status_needs_local_error_handling() hasn't yet been trained for. 
> > > >Signed-off-by: Mike Snitzer <snitzer@redhat.com> > >--- > > drivers/nvme/host/core.c | 16 ++++++++++++++-- > > 1 file changed, 14 insertions(+), 2 deletions(-) > > > >diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c > >index 88cff309d8e4..be749b690af7 100644 > >--- a/drivers/nvme/host/core.c > >+++ b/drivers/nvme/host/core.c > >@@ -252,6 +252,16 @@ static inline bool nvme_req_needs_retry(struct request *req) > > return true; > > } > >+static inline bool nvme_status_needs_local_error_handling(u16 status) > >+{ > >+ switch (status & 0x7ff) { > >+ case NVME_SC_CMD_INTERRUPTED: > >+ return true; > >+ default: > >+ return false; > >+ } > >+} > >+ > > static void nvme_retry_req(struct request *req) > > { > > struct nvme_ns *ns = req->q->queuedata; > >@@ -270,7 +280,8 @@ static void nvme_retry_req(struct request *req) > > void nvme_complete_rq(struct request *req) > > { > >- blk_status_t status = nvme_error_status(nvme_req(req)->status); > >+ u16 nvme_status = nvme_req(req)->status; > >+ blk_status_t status = nvme_error_status(nvme_status); > > trace_nvme_complete_rq(req); > >@@ -280,7 +291,8 @@ void nvme_complete_rq(struct request *req) > > nvme_req(req)->ctrl->comp_seen = true; > > if (unlikely(status != BLK_STS_OK && nvme_req_needs_retry(req))) { > >- if ((req->cmd_flags & REQ_NVME_MPATH) && nvme_failover_req(req)) > >+ if (!nvme_status_needs_local_error_handling(nvme_status) && > >+ (req->cmd_flags & REQ_NVME_MPATH) && nvme_failover_req(req)) > > This looks no affect. if work with nvme multipath, now is already retry local. Not if NVMe is built without multipathing configured. > If work with dm-multipath, still return error. Yes, I'm aware. Use of REQ_FAILFAST_TRANSPORT isn't something that is needed for NVMe, so why are you proposing hacks in NVMe to deal with it? > > return; > > if (!blk_queue_dying(req->q)) { > > > > Suggest: > REQ_FAILFAST_TRANSPORT may be designed for scsi, because scsi protocol > do not difine the local retry mechanism. 
SCSI implements a fuzzy local > retry mechanism, so need the REQ_FAILFAST_TRANSPORT for multipath > software, multipath software retry according error code is expected. > nvme is different with scsi about this. It define local retry mechanism > and path error code, so nvme should not care REQ_FAILFAST_TRANSPORT. Exactly. Except by "nvme should not care REQ_FAILFAST_TRANSPORT." your patch says you mean "nvme shouldn't disallow retry if REQ_FAILFAST_TRANSPORT is set". I'm saying: don't try to get such changes into NVMe. In general, aspects of your patch may have merit but overall it is doing too much. Mike
* Re: nvme: explicitly use normal NVMe error handling when appropriate 2020-08-11 4:20 ` Mike Snitzer @ 2020-08-11 6:17 ` Chao Leng 2020-08-11 14:12 ` Mike Snitzer 0 siblings, 1 reply; 50+ messages in thread From: Chao Leng @ 2020-08-11 6:17 UTC (permalink / raw) To: Mike Snitzer Cc: Sagi Grimberg, Hannes Reinecke, linux-nvme, Christoph Hellwig, dm-devel, Ewan Milne, Keith Busch, Meneghini, John On 2020/8/11 12:20, Mike Snitzer wrote: > On Mon, Aug 10 2020 at 11:32pm -0400, > Chao Leng <lengchao@huawei.com> wrote: > >> >> >> On 2020/8/11 1:22, Mike Snitzer wrote: >>> On Mon, Aug 10 2020 at 10:36am -0400, >>> Mike Snitzer <snitzer@redhat.com> wrote: >>> >>>> On Fri, Aug 07 2020 at 7:35pm -0400, >>>> Sagi Grimberg <sagi@grimberg.me> wrote: >>>> >>>>> >>>>>>> Hey Mike, >>> ... >>>>>> I think NVMe can easily fix this by having an earlier stage of checking, >>>>>> e.g. nvme_local_retry_req(), that shortcircuits ever getting to >>>>>> higher-level multipathing consideration (be it native NVMe or DM >>>>>> multipathing) for cases like NVME_SC_CMD_INTERRUPTED. >>>>>> To be clear: the "default" case of nvme_failover_req() that returns >>>>>> false to fallback to NVMe's "local" normal NVMe error handling -- that >>>>>> can stay.. but a more explicit handling of cases like >>>>>> NVME_SC_CMD_INTERRUPTED should be added to a nvme_local_retry_req() >>>>>> check that happens before nvme_failover_req() in nvme_complete_rq(). >>>>> >>>>> I don't necessarily agree with having a dedicated nvme_local_retry_req(). >>>>> a request that isn't failed over, goes to local error handling (retry or >>>>> not). I actually think that just adding the condition to >>>>> nvme_complete_req and having nvme_failover_req reject it would work. >>>>> >>>>> Keith? >>>> >>>> I think that is basically what I'm thinking too. 
>>> >>> From: Mike Snitzer <snitzer@redhat.com> >>> Subject: nvme: explicitly use normal NVMe error handling when appropriate >>> >>> Commit 764e9332098c0 ("nvme-multipath: do not reset on unknown >>> status"), among other things, fixed NVME_SC_CMD_INTERRUPTED error >>> handling by changing multipathing's nvme_failover_req() to short-circuit >>> path failover and then fallback to NVMe's normal error handling (which >>> takes care of NVME_SC_CMD_INTERRUPTED). >>> >>> This detour through native NVMe multipathing code is unwelcome because >>> it prevents NVMe core from handling NVME_SC_CMD_INTERRUPTED independent >>> of any multipathing concerns. >>> >>> Introduce nvme_status_needs_local_error_handling() to prioritize >>> non-failover retry, when appropriate, in terms of normal NVMe error >>> handling. nvme_status_needs_local_error_handling() will naturely evolve >>> to include handling of any other errors that normal error handling must >>> be used for. >>> >>> nvme_failover_req()'s ability to fallback to normal NVMe error handling >>> has been preserved because it may be useful for future NVME_SC that >>> nvme_status_needs_local_error_handling() hasn't yet been trained for. 
>>> >>> Signed-off-by: Mike Snitzer <snitzer@redhat.com> >>> --- >>> drivers/nvme/host/core.c | 16 ++++++++++++++-- >>> 1 file changed, 14 insertions(+), 2 deletions(-) >>> >>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c >>> index 88cff309d8e4..be749b690af7 100644 >>> --- a/drivers/nvme/host/core.c >>> +++ b/drivers/nvme/host/core.c >>> @@ -252,6 +252,16 @@ static inline bool nvme_req_needs_retry(struct request *req) >>> return true; >>> } >>> +static inline bool nvme_status_needs_local_error_handling(u16 status) >>> +{ >>> + switch (status & 0x7ff) { >>> + case NVME_SC_CMD_INTERRUPTED: >>> + return true; >>> + default: >>> + return false; >>> + } >>> +} >>> + >>> static void nvme_retry_req(struct request *req) >>> { >>> struct nvme_ns *ns = req->q->queuedata; >>> @@ -270,7 +280,8 @@ static void nvme_retry_req(struct request *req) >>> void nvme_complete_rq(struct request *req) >>> { >>> - blk_status_t status = nvme_error_status(nvme_req(req)->status); >>> + u16 nvme_status = nvme_req(req)->status; >>> + blk_status_t status = nvme_error_status(nvme_status); >>> trace_nvme_complete_rq(req); >>> @@ -280,7 +291,8 @@ void nvme_complete_rq(struct request *req) >>> nvme_req(req)->ctrl->comp_seen = true; >>> if (unlikely(status != BLK_STS_OK && nvme_req_needs_retry(req))) { >>> - if ((req->cmd_flags & REQ_NVME_MPATH) && nvme_failover_req(req)) >>> + if (!nvme_status_needs_local_error_handling(nvme_status) && >>> + (req->cmd_flags & REQ_NVME_MPATH) && nvme_failover_req(req)) >> >> This looks no affect. if work with nvme multipath, now is already retry local. > > Not if NVMe is built without multipathing configured. If without nvme multipathing configured, now is also retry local, do not need !nvme_status_needs_local_error_handling(nvme_status). > >> If work with dm-multipath, still return error. > > Yes, I'm aware. Use of REQ_FAILFAST_TRANSPORT isn't something that is > needed for NVMe, so why are you proposing hacks in NVMe to deal with it? 
I was just describing the possible scenarios: 1. nvme multipathing configured; 2. without any multipath; 3. with dm-multipath. > >>> return; >>> if (!blk_queue_dying(req->q)) { >>> >> >> Suggest: >> REQ_FAILFAST_TRANSPORT may be designed for scsi, because scsi protocol >> do not difine the local retry mechanism. SCSI implements a fuzzy local >> retry mechanism, so need the REQ_FAILFAST_TRANSPORT for multipath >> software, multipath software retry according error code is expected. >> nvme is different with scsi about this. It define local retry mechanism >> and path error code, so nvme should not care REQ_FAILFAST_TRANSPORT. > > Exactly. Except by "nvme should not care REQ_FAILFAST_TRANSPORT." your > patch says you mean "nvme shouldn't disallow retry if > REQ_FAILFAST_TRANSPORT is it". I'm saying: don't try to get such > changes into NVMe. No, the patch just means: on a path error, fail over and retry via multipath (nvme multipath or dm-multipath). Everything else needs a local retry, after a delay determined by the status (CRD) and CRDT. nvme multipath already works like this; the patch makes dm-multipath work like nvme multipath. > > In general, aspects of your patch may have merit but overall it is doing too much. > > Mike > > . >
* Re: nvme: explicitly use normal NVMe error handling when appropriate 2020-08-11 6:17 ` Chao Leng @ 2020-08-11 14:12 ` Mike Snitzer 0 siblings, 0 replies; 50+ messages in thread From: Mike Snitzer @ 2020-08-11 14:12 UTC (permalink / raw) To: Chao Leng Cc: Sagi Grimberg, Hannes Reinecke, linux-nvme, Christoph Hellwig, dm-devel, Ewan Milne, Keith Busch, Meneghini, John On Tue, Aug 11 2020 at 2:17am -0400, Chao Leng <lengchao@huawei.com> wrote: > > > On 2020/8/11 12:20, Mike Snitzer wrote: > >On Mon, Aug 10 2020 at 11:32pm -0400, > >Chao Leng <lengchao@huawei.com> wrote: > > > >> > >> > >>On 2020/8/11 1:22, Mike Snitzer wrote: > >>>On Mon, Aug 10 2020 at 10:36am -0400, > >>>Mike Snitzer <snitzer@redhat.com> wrote: > >>> > >>>>On Fri, Aug 07 2020 at 7:35pm -0400, > >>>>Sagi Grimberg <sagi@grimberg.me> wrote: > >>>> > >>>>> > >>>>>>>Hey Mike, > >>>... > >>>>>>I think NVMe can easily fix this by having an earlier stage of checking, > >>>>>>e.g. nvme_local_retry_req(), that shortcircuits ever getting to > >>>>>>higher-level multipathing consideration (be it native NVMe or DM > >>>>>>multipathing) for cases like NVME_SC_CMD_INTERRUPTED. > >>>>>>To be clear: the "default" case of nvme_failover_req() that returns > >>>>>>false to fallback to NVMe's "local" normal NVMe error handling -- that > >>>>>>can stay.. but a more explicit handling of cases like > >>>>>>NVME_SC_CMD_INTERRUPTED should be added to a nvme_local_retry_req() > >>>>>>check that happens before nvme_failover_req() in nvme_complete_rq(). > >>>>> > >>>>>I don't necessarily agree with having a dedicated nvme_local_retry_req(). > >>>>>a request that isn't failed over, goes to local error handling (retry or > >>>>>not). I actually think that just adding the condition to > >>>>>nvme_complete_req and having nvme_failover_req reject it would work. > >>>>> > >>>>>Keith? > >>>> > >>>>I think that is basically what I'm thinking too. 
> >>> > >>>From: Mike Snitzer <snitzer@redhat.com> > >>>Subject: nvme: explicitly use normal NVMe error handling when appropriate > >>> > >>>Commit 764e9332098c0 ("nvme-multipath: do not reset on unknown > >>>status"), among other things, fixed NVME_SC_CMD_INTERRUPTED error > >>>handling by changing multipathing's nvme_failover_req() to short-circuit > >>>path failover and then fallback to NVMe's normal error handling (which > >>>takes care of NVME_SC_CMD_INTERRUPTED). > >>> > >>>This detour through native NVMe multipathing code is unwelcome because > >>>it prevents NVMe core from handling NVME_SC_CMD_INTERRUPTED independent > >>>of any multipathing concerns. > >>> > >>>Introduce nvme_status_needs_local_error_handling() to prioritize > >>>non-failover retry, when appropriate, in terms of normal NVMe error > >>>handling. nvme_status_needs_local_error_handling() will naturely evolve > >>>to include handling of any other errors that normal error handling must > >>>be used for. > >>> > >>>nvme_failover_req()'s ability to fallback to normal NVMe error handling > >>>has been preserved because it may be useful for future NVME_SC that > >>>nvme_status_needs_local_error_handling() hasn't yet been trained for. 
> >>> > >>>Signed-off-by: Mike Snitzer <snitzer@redhat.com> > >>>--- > >>> drivers/nvme/host/core.c | 16 ++++++++++++++-- > >>> 1 file changed, 14 insertions(+), 2 deletions(-) > >>> > >>>diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c > >>>index 88cff309d8e4..be749b690af7 100644 > >>>--- a/drivers/nvme/host/core.c > >>>+++ b/drivers/nvme/host/core.c > >>>@@ -252,6 +252,16 @@ static inline bool nvme_req_needs_retry(struct request *req) > >>> return true; > >>> } > >>>+static inline bool nvme_status_needs_local_error_handling(u16 status) > >>>+{ > >>>+ switch (status & 0x7ff) { > >>>+ case NVME_SC_CMD_INTERRUPTED: > >>>+ return true; > >>>+ default: > >>>+ return false; > >>>+ } > >>>+} > >>>+ > >>> static void nvme_retry_req(struct request *req) > >>> { > >>> struct nvme_ns *ns = req->q->queuedata; > >>>@@ -270,7 +280,8 @@ static void nvme_retry_req(struct request *req) > >>> void nvme_complete_rq(struct request *req) > >>> { > >>>- blk_status_t status = nvme_error_status(nvme_req(req)->status); > >>>+ u16 nvme_status = nvme_req(req)->status; > >>>+ blk_status_t status = nvme_error_status(nvme_status); > >>> trace_nvme_complete_rq(req); > >>>@@ -280,7 +291,8 @@ void nvme_complete_rq(struct request *req) > >>> nvme_req(req)->ctrl->comp_seen = true; > >>> if (unlikely(status != BLK_STS_OK && nvme_req_needs_retry(req))) { > >>>- if ((req->cmd_flags & REQ_NVME_MPATH) && nvme_failover_req(req)) > >>>+ if (!nvme_status_needs_local_error_handling(nvme_status) && > >>>+ (req->cmd_flags & REQ_NVME_MPATH) && nvme_failover_req(req)) > >> > >>This looks no affect. if work with nvme multipath, now is already retry local. > > > >Not if NVMe is built without multipathing configured. > > If without nvme multipathing configured, now is also retry local, do not need > !nvme_status_needs_local_error_handling(nvme_status). True, REQ_NVME_MPATH won't be set, so it'll fall through. 
The real benefit of my patch is that there is a cleaner code flow for handling errors with normal NVMe retry (without bouncing into failover code and then falling back to normal retry -- code that your patch does remove). So my change is an obvious yet small improvement that makes the NVMe core error handling clearer -- yet preserves that fallback from failover_retry to help future-proof NVMe from new NVME_SC, which John M. is very clear about needing. > >>If work with dm-multipath, still return error. > > > >Yes, I'm aware. Use of REQ_FAILFAST_TRANSPORT isn't something that is > >needed for NVMe, so why are you proposing hacks in NVMe to deal with it? > I just describe the possible scenarios:1.nvme multipathing configured. > 2.without any multipath.3. with dm-multipath. I understand. > > > >>> return; > >>> if (!blk_queue_dying(req->q)) { > >>> > >> > >>Suggest: > >>REQ_FAILFAST_TRANSPORT may be designed for scsi, because scsi protocol > >>do not difine the local retry mechanism. SCSI implements a fuzzy local > >>retry mechanism, so need the REQ_FAILFAST_TRANSPORT for multipath > >>software, multipath software retry according error code is expected. > >>nvme is different with scsi about this. It define local retry mechanism > >>and path error code, so nvme should not care REQ_FAILFAST_TRANSPORT. > > > >Exactly. Except by "nvme should not care REQ_FAILFAST_TRANSPORT." your > >patch says you mean "nvme shouldn't disallow retry if > >REQ_FAILFAST_TRANSPORT is it". I'm saying: don't try to get such > >changes into NVMe. > > no. the patch just mean: if path error, fail over to retry by multipath > (nvme multipath or dm-multipath). Other need local retry local, retry > after a defined time according to status(CRD) and CRDT. Now nvme multipath > is already do like this, the patch make dm-multipath work like nvme multipath. I appreciate your enthusiasm for making native NVMe multipathing and dm-multipath coexist.
But I think you're naive about the willingness to make that a reality. As such, any change that only benefits dm-multipath is dead on arrival. Your patch is mixed with various changes like that. I'd be stunned if Christoph accepted it. Thanks, Mike
* [RESEND PATCH] nvme: explicitly use normal NVMe error handling when appropriate 2020-08-10 17:22 ` [PATCH] nvme: explicitly use normal NVMe error handling when appropriate Mike Snitzer 2020-08-11 3:32 ` Chao Leng @ 2020-08-13 14:48 ` Mike Snitzer 2020-08-13 15:29 ` Meneghini, John ` (2 more replies) 1 sibling, 3 replies; 50+ messages in thread From: Mike Snitzer @ 2020-08-13 14:48 UTC (permalink / raw) To: Sagi Grimberg Cc: linux-nvme, Christoph Hellwig, dm-devel, Hannes Reinecke, Chao Leng, Keith Busch, Meneghini, John, Ewan Milne Commit 764e9332098c0 ("nvme-multipath: do not reset on unknown status"), among other things, fixed NVME_SC_CMD_INTERRUPTED error handling by changing multipathing's nvme_failover_req() to short-circuit path failover and then fall back to NVMe's normal error handling (which takes care of NVME_SC_CMD_INTERRUPTED). This detour through native NVMe multipathing code is unwelcome because it prevents NVMe core from handling NVME_SC_CMD_INTERRUPTED independent of any multipathing concerns. Introduce nvme_status_needs_local_error_handling() to prioritize non-failover retry, when appropriate, in terms of normal NVMe error handling. nvme_status_needs_local_error_handling() will naturally evolve to include handling of any other errors that normal error handling must be used for. nvme_failover_req()'s ability to fall back to normal NVMe error handling has been preserved because it may be useful for future NVME_SC that nvme_status_needs_local_error_handling() hasn't been trained for yet.
Signed-off-by: Mike Snitzer <snitzer@redhat.com> --- drivers/nvme/host/core.c | 16 ++++++++++++++-- 1 file changed, 14 insertions(+), 2 deletions(-) diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c index 88cff309d8e4..be749b690af7 100644 --- a/drivers/nvme/host/core.c +++ b/drivers/nvme/host/core.c @@ -252,6 +252,16 @@ static inline bool nvme_req_needs_retry(struct request *req) return true; } +static inline bool nvme_status_needs_local_error_handling(u16 status) +{ + switch (status & 0x7ff) { + case NVME_SC_CMD_INTERRUPTED: + return true; + default: + return false; + } +} + static void nvme_retry_req(struct request *req) { struct nvme_ns *ns = req->q->queuedata; @@ -270,7 +280,8 @@ static void nvme_retry_req(struct request *req) void nvme_complete_rq(struct request *req) { - blk_status_t status = nvme_error_status(nvme_req(req)->status); + u16 nvme_status = nvme_req(req)->status; + blk_status_t status = nvme_error_status(nvme_status); trace_nvme_complete_rq(req); @@ -280,7 +291,8 @@ void nvme_complete_rq(struct request *req) nvme_req(req)->ctrl->comp_seen = true; if (unlikely(status != BLK_STS_OK && nvme_req_needs_retry(req))) { - if ((req->cmd_flags & REQ_NVME_MPATH) && nvme_failover_req(req)) + if (!nvme_status_needs_local_error_handling(nvme_status) && + (req->cmd_flags & REQ_NVME_MPATH) && nvme_failover_req(req)) return; if (!blk_queue_dying(req->q)) { -- 2.18.0 _______________________________________________ Linux-nvme mailing list Linux-nvme@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme ^ permalink raw reply related [flat|nested] 50+ messages in thread
* Re: [RESEND PATCH] nvme: explicitly use normal NVMe error handling when appropriate 2020-08-13 14:48 ` [RESEND PATCH] " Mike Snitzer @ 2020-08-13 15:29 ` Meneghini, John 2020-08-13 15:43 ` Mike Snitzer 2020-08-13 15:36 ` Christoph Hellwig 2020-08-14 3:23 ` Meneghini, John 2 siblings, 1 reply; 50+ messages in thread From: Meneghini, John @ 2020-08-13 15:29 UTC (permalink / raw) To: Mike Snitzer, Sagi Grimberg Cc: linux-nvme, Christoph Hellwig, dm-devel, Hannes Reinecke, Chao Leng, Keith Busch, Meneghini, John, Ewan Milne Mike, I don't see your patch at: https://patchwork.kernel.org/project/linux-block/list/?submitter=1643 Can you please upload this patch there? Thanks, /John On 8/13/20, 10:48 AM, "Mike Snitzer" <snitzer@redhat.com> wrote: Commit 764e9332098c0 ("nvme-multipath: do not reset on unknown status"), among other things, fixed NVME_SC_CMD_INTERRUPTED error handling by changing multipathing's nvme_failover_req() to short-circuit path failover and then fallback to NVMe's normal error handling (which takes care of NVME_SC_CMD_INTERRUPTED). This detour through native NVMe multipathing code is unwelcome because it prevents NVMe core from handling NVME_SC_CMD_INTERRUPTED independent of any multipathing concerns. Introduce nvme_status_needs_local_error_handling() to prioritize non-failover retry, when appropriate, in terms of normal NVMe error handling. nvme_status_needs_local_error_handling() will naturely evolve to include handling of any other errors that normal error handling must be used for. nvme_failover_req()'s ability to fallback to normal NVMe error handling has been preserved because it may be useful for future NVME_SC that nvme_status_needs_local_error_handling() hasn't been trained for yet. 
Signed-off-by: Mike Snitzer <snitzer@redhat.com> --- drivers/nvme/host/core.c | 16 ++++++++++++++-- 1 file changed, 14 insertions(+), 2 deletions(-) diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c index 88cff309d8e4..be749b690af7 100644 --- a/drivers/nvme/host/core.c +++ b/drivers/nvme/host/core.c @@ -252,6 +252,16 @@ static inline bool nvme_req_needs_retry(struct request *req) return true; } +static inline bool nvme_status_needs_local_error_handling(u16 status) +{ + switch (status & 0x7ff) { + case NVME_SC_CMD_INTERRUPTED: + return true; + default: + return false; + } +} + static void nvme_retry_req(struct request *req) { struct nvme_ns *ns = req->q->queuedata; @@ -270,7 +280,8 @@ static void nvme_retry_req(struct request *req) void nvme_complete_rq(struct request *req) { - blk_status_t status = nvme_error_status(nvme_req(req)->status); + u16 nvme_status = nvme_req(req)->status; + blk_status_t status = nvme_error_status(nvme_status); trace_nvme_complete_rq(req); @@ -280,7 +291,8 @@ void nvme_complete_rq(struct request *req) nvme_req(req)->ctrl->comp_seen = true; if (unlikely(status != BLK_STS_OK && nvme_req_needs_retry(req))) { - if ((req->cmd_flags & REQ_NVME_MPATH) && nvme_failover_req(req)) + if (!nvme_status_needs_local_error_handling(nvme_status) && + (req->cmd_flags & REQ_NVME_MPATH) && nvme_failover_req(req)) return; if (!blk_queue_dying(req->q)) { -- 2.18.0 _______________________________________________ Linux-nvme mailing list Linux-nvme@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [RESEND PATCH] nvme: explicitly use normal NVMe error handling when appropriate 2020-08-13 15:29 ` Meneghini, John @ 2020-08-13 15:43 ` Mike Snitzer 2020-08-13 15:59 ` Meneghini, John 0 siblings, 1 reply; 50+ messages in thread From: Mike Snitzer @ 2020-08-13 15:43 UTC (permalink / raw) To: Meneghini, John Cc: Sagi Grimberg, Ewan Milne, Christoph Hellwig, dm-devel, linux-nvme, Chao Leng, Keith Busch, Hannes Reinecke Maybe because I didn't cc linux-block? Only way that I know to "upload this patch there" is to have cc'd linux-block. But the patch is in dm-devel's patchwork here: https://patchwork.kernel.org/patch/11712563/ Is that sufficient for your needs? Thanks, Mike On Thu, Aug 13 2020 at 11:29am -0400, Meneghini, John <John.Meneghini@netapp.com> wrote: > Mike, > > I don't see your patch at: > > https://patchwork.kernel.org/project/linux-block/list/?submitter=1643 > > Can you please upload this patch there? > > Thanks, > > /John > > On 8/13/20, 10:48 AM, "Mike Snitzer" <snitzer@redhat.com> wrote: > > Commit 764e9332098c0 ("nvme-multipath: do not reset on unknown > status"), among other things, fixed NVME_SC_CMD_INTERRUPTED error > handling by changing multipathing's nvme_failover_req() to short-circuit > path failover and then fallback to NVMe's normal error handling (which > takes care of NVME_SC_CMD_INTERRUPTED). > > This detour through native NVMe multipathing code is unwelcome because > it prevents NVMe core from handling NVME_SC_CMD_INTERRUPTED independent > of any multipathing concerns. > > Introduce nvme_status_needs_local_error_handling() to prioritize > non-failover retry, when appropriate, in terms of normal NVMe error > handling. nvme_status_needs_local_error_handling() will naturely evolve > to include handling of any other errors that normal error handling must > be used for. 
> > nvme_failover_req()'s ability to fallback to normal NVMe error handling > has been preserved because it may be useful for future NVME_SC that > nvme_status_needs_local_error_handling() hasn't been trained for yet. > > Signed-off-by: Mike Snitzer <snitzer@redhat.com> > --- > drivers/nvme/host/core.c | 16 ++++++++++++++-- > 1 file changed, 14 insertions(+), 2 deletions(-) > > diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c > index 88cff309d8e4..be749b690af7 100644 > --- a/drivers/nvme/host/core.c > +++ b/drivers/nvme/host/core.c > @@ -252,6 +252,16 @@ static inline bool nvme_req_needs_retry(struct request *req) > return true; > } > > +static inline bool nvme_status_needs_local_error_handling(u16 status) > +{ > + switch (status & 0x7ff) { > + case NVME_SC_CMD_INTERRUPTED: > + return true; > + default: > + return false; > + } > +} > + > static void nvme_retry_req(struct request *req) > { > struct nvme_ns *ns = req->q->queuedata; > @@ -270,7 +280,8 @@ static void nvme_retry_req(struct request *req) > > void nvme_complete_rq(struct request *req) > { > - blk_status_t status = nvme_error_status(nvme_req(req)->status); > + u16 nvme_status = nvme_req(req)->status; > + blk_status_t status = nvme_error_status(nvme_status); > > trace_nvme_complete_rq(req); > > @@ -280,7 +291,8 @@ void nvme_complete_rq(struct request *req) > nvme_req(req)->ctrl->comp_seen = true; > > if (unlikely(status != BLK_STS_OK && nvme_req_needs_retry(req))) { > - if ((req->cmd_flags & REQ_NVME_MPATH) && nvme_failover_req(req)) > + if (!nvme_status_needs_local_error_handling(nvme_status) && > + (req->cmd_flags & REQ_NVME_MPATH) && nvme_failover_req(req)) > return; > > if (!blk_queue_dying(req->q)) { > -- > 2.18.0 > > _______________________________________________ Linux-nvme mailing list Linux-nvme@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [RESEND PATCH] nvme: explicitly use normal NVMe error handling when appropriate 2020-08-13 15:43 ` Mike Snitzer @ 2020-08-13 15:59 ` Meneghini, John 0 siblings, 0 replies; 50+ messages in thread From: Meneghini, John @ 2020-08-13 15:59 UTC (permalink / raw) To: Mike Snitzer Cc: Sagi Grimberg, Ewan Milne, Christoph Hellwig, dm-devel, linux-nvme, Chao Leng, Keith Busch, Meneghini, John, Hannes Reinecke That works Mike. Unfortunately, I have to work with the corporate MS office email handler (which sucks) and downloading the .patch file works much better for me. Thanks, /John On 8/13/20, 11:43 AM, "Mike Snitzer" <snitzer@redhat.com> wrote: Maybe because I didn't cc linux-block? Only way that I know to "upload this patch there" is to have cc'd linux-block. But the patch is in dm-devel's patchwork here: https://patchwork.kernel.org/patch/11712563/ Is that sufficient for your needs? Thanks, Mike On Thu, Aug 13 2020 at 11:29am -0400, Meneghini, John <John.Meneghini@netapp.com> wrote: > Mike, > > I don't see your patch at: > > https://patchwork.kernel.org/project/linux-block/list/?submitter=1643 > > Can you please upload this patch there? > > Thanks, > > /John > > On 8/13/20, 10:48 AM, "Mike Snitzer" <snitzer@redhat.com> wrote: > > Commit 764e9332098c0 ("nvme-multipath: do not reset on unknown > status"), among other things, fixed NVME_SC_CMD_INTERRUPTED error > handling by changing multipathing's nvme_failover_req() to short-circuit > path failover and then fallback to NVMe's normal error handling (which > takes care of NVME_SC_CMD_INTERRUPTED). > > This detour through native NVMe multipathing code is unwelcome because > it prevents NVMe core from handling NVME_SC_CMD_INTERRUPTED independent > of any multipathing concerns. > > Introduce nvme_status_needs_local_error_handling() to prioritize > non-failover retry, when appropriate, in terms of normal NVMe error > handling. 
nvme_status_needs_local_error_handling() will naturely evolve > to include handling of any other errors that normal error handling must > be used for. > > nvme_failover_req()'s ability to fallback to normal NVMe error handling > has been preserved because it may be useful for future NVME_SC that > nvme_status_needs_local_error_handling() hasn't been trained for yet. > > Signed-off-by: Mike Snitzer <snitzer@redhat.com> > --- > drivers/nvme/host/core.c | 16 ++++++++++++++-- > 1 file changed, 14 insertions(+), 2 deletions(-) > > diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c > index 88cff309d8e4..be749b690af7 100644 > --- a/drivers/nvme/host/core.c > +++ b/drivers/nvme/host/core.c > @@ -252,6 +252,16 @@ static inline bool nvme_req_needs_retry(struct request *req) > return true; > } > > +static inline bool nvme_status_needs_local_error_handling(u16 status) > +{ > + switch (status & 0x7ff) { > + case NVME_SC_CMD_INTERRUPTED: > + return true; > + default: > + return false; > + } > +} > + > static void nvme_retry_req(struct request *req) > { > struct nvme_ns *ns = req->q->queuedata; > @@ -270,7 +280,8 @@ static void nvme_retry_req(struct request *req) > > void nvme_complete_rq(struct request *req) > { > - blk_status_t status = nvme_error_status(nvme_req(req)->status); > + u16 nvme_status = nvme_req(req)->status; > + blk_status_t status = nvme_error_status(nvme_status); > > trace_nvme_complete_rq(req); > > @@ -280,7 +291,8 @@ void nvme_complete_rq(struct request *req) > nvme_req(req)->ctrl->comp_seen = true; > > if (unlikely(status != BLK_STS_OK && nvme_req_needs_retry(req))) { > - if ((req->cmd_flags & REQ_NVME_MPATH) && nvme_failover_req(req)) > + if (!nvme_status_needs_local_error_handling(nvme_status) && > + (req->cmd_flags & REQ_NVME_MPATH) && nvme_failover_req(req)) > return; > > if (!blk_queue_dying(req->q)) { > -- > 2.18.0 > > _______________________________________________ Linux-nvme mailing list Linux-nvme@lists.infradead.org 
http://lists.infradead.org/mailman/listinfo/linux-nvme ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [RESEND PATCH] nvme: explicitly use normal NVMe error handling when appropriate 2020-08-13 14:48 ` [RESEND PATCH] " Mike Snitzer 2020-08-13 15:29 ` Meneghini, John @ 2020-08-13 15:36 ` Christoph Hellwig 2020-08-13 17:47 ` Mike Snitzer 2020-08-14 3:23 ` Meneghini, John 2 siblings, 1 reply; 50+ messages in thread From: Christoph Hellwig @ 2020-08-13 15:36 UTC (permalink / raw) To: Mike Snitzer Cc: Sagi Grimberg, Ewan Milne, Christoph Hellwig, dm-devel, linux-nvme, Chao Leng, Keith Busch, Meneghini, John, Hannes Reinecke On Thu, Aug 13, 2020 at 10:48:11AM -0400, Mike Snitzer wrote: > Commit 764e9332098c0 ("nvme-multipath: do not reset on unknown > status"), among other things, fixed NVME_SC_CMD_INTERRUPTED error > handling by changing multipathing's nvme_failover_req() to short-circuit > path failover and then fallback to NVMe's normal error handling (which > takes care of NVME_SC_CMD_INTERRUPTED). > > This detour through native NVMe multipathing code is unwelcome because > it prevents NVMe core from handling NVME_SC_CMD_INTERRUPTED independent > of any multipathing concerns. > > Introduce nvme_status_needs_local_error_handling() to prioritize > non-failover retry, when appropriate, in terms of normal NVMe error > handling. nvme_status_needs_local_error_handling() will naturely evolve > to include handling of any other errors that normal error handling must > be used for. > > nvme_failover_req()'s ability to fallback to normal NVMe error handling > has been preserved because it may be useful for future NVME_SC that > nvme_status_needs_local_error_handling() hasn't been trained for yet. > > Signed-off-by: Mike Snitzer <snitzer@redhat.com> I don't see how this would change anything. nvme_failover_req simply retuns false for NVME_SC_CMD_INTERRUPTED, so your change is a no-op. 
_______________________________________________ Linux-nvme mailing list Linux-nvme@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [RESEND PATCH] nvme: explicitly use normal NVMe error handling when appropriate 2020-08-13 15:36 ` Christoph Hellwig @ 2020-08-13 17:47 ` Mike Snitzer 2020-08-13 18:43 ` Christoph Hellwig 0 siblings, 1 reply; 50+ messages in thread From: Mike Snitzer @ 2020-08-13 17:47 UTC (permalink / raw) To: Christoph Hellwig Cc: Sagi Grimberg, Ewan Milne, dm-devel, linux-nvme, Chao Leng, Keith Busch, Meneghini, John, Hannes Reinecke On Thu, Aug 13 2020 at 11:36am -0400, Christoph Hellwig <hch@infradead.org> wrote: > On Thu, Aug 13, 2020 at 10:48:11AM -0400, Mike Snitzer wrote: > > Commit 764e9332098c0 ("nvme-multipath: do not reset on unknown > > status"), among other things, fixed NVME_SC_CMD_INTERRUPTED error > > handling by changing multipathing's nvme_failover_req() to short-circuit > > path failover and then fallback to NVMe's normal error handling (which > > takes care of NVME_SC_CMD_INTERRUPTED). > > > > This detour through native NVMe multipathing code is unwelcome because > > it prevents NVMe core from handling NVME_SC_CMD_INTERRUPTED independent > > of any multipathing concerns. > > > > Introduce nvme_status_needs_local_error_handling() to prioritize > > non-failover retry, when appropriate, in terms of normal NVMe error > > handling. nvme_status_needs_local_error_handling() will naturely evolve > > to include handling of any other errors that normal error handling must > > be used for. > > > > nvme_failover_req()'s ability to fallback to normal NVMe error handling > > has been preserved because it may be useful for future NVME_SC that > > nvme_status_needs_local_error_handling() hasn't been trained for yet. > > > > Signed-off-by: Mike Snitzer <snitzer@redhat.com> > > I don't see how this would change anything. nvme_failover_req simply > retuns false for NVME_SC_CMD_INTERRUPTED, so your change is a no-op. This is just a tweak to improve the high-level fault tree of core NVMe error handling. 
No functional change, but for such basic errors, avoiding entering nvme_failover_req is meaningful on a code flow level. Makes code to handle errors that need local retry clearer by being more structured, less circuitous. Allows NVMe core's handling of such errors to be more explicit and live in core.c rather than multipath.c -- so things like ACRE handling can be made explicitly part of core and not nested under nvme_failover_req's relatively obscure failsafe that returns false for anything it doesn't care about. Mike _______________________________________________ Linux-nvme mailing list Linux-nvme@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [RESEND PATCH] nvme: explicitly use normal NVMe error handling when appropriate 2020-08-13 17:47 ` Mike Snitzer @ 2020-08-13 18:43 ` Christoph Hellwig 2020-08-13 19:03 ` Mike Snitzer ` (2 more replies) 0 siblings, 3 replies; 50+ messages in thread From: Christoph Hellwig @ 2020-08-13 18:43 UTC (permalink / raw) To: Mike Snitzer Cc: Sagi Grimberg, Hannes Reinecke, linux-nvme, Christoph Hellwig, dm-devel, Ewan Milne, Chao Leng, Keith Busch, Meneghini, John On Thu, Aug 13, 2020 at 01:47:04PM -0400, Mike Snitzer wrote: > This is just a tweak to improve the high-level fault tree of core NVMe > error handling. No functional change, but for such basic errors, > avoiding entering nvme_failover_req is meaningful on a code flow level. > Makes code to handle errors that need local retry clearer by being more > structured, less circuitous. > > Allows NVMe core's handling of such errors to be more explicit and live > in core.c rather than multipath.c -- so things like ACRE handling can be > made explicitly part of core and not nested under nvme_failover_req's > relatively obscure failsafe that returns false for anything it doesn't > care about. 
If we're going that way I'd rather do something like the (untested) patch below that adds a dispostion function with a function that decides it and then just switches on it: diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c index 88cff309d8e4f0..a740320f0d4ee7 100644 --- a/drivers/nvme/host/core.c +++ b/drivers/nvme/host/core.c @@ -241,17 +241,6 @@ static blk_status_t nvme_error_status(u16 status) } } -static inline bool nvme_req_needs_retry(struct request *req) -{ - if (blk_noretry_request(req)) - return false; - if (nvme_req(req)->status & NVME_SC_DNR) - return false; - if (nvme_req(req)->retries >= nvme_max_retries) - return false; - return true; -} - static void nvme_retry_req(struct request *req) { struct nvme_ns *ns = req->q->queuedata; @@ -268,33 +257,75 @@ static void nvme_retry_req(struct request *req) blk_mq_delay_kick_requeue_list(req->q, delay); } -void nvme_complete_rq(struct request *req) +enum nvme_disposition { + COMPLETE, + RETRY, + REDIRECT_ANA, + REDIRECT_TMP, +}; + +static inline enum nvme_disposition nvme_req_disposition(struct request *req) +{ + if (likely(nvme_req(req)->status == 0)) + return COMPLETE; + + if (blk_noretry_request(req) || + (nvme_req(req)->status & NVME_SC_DNR) || + nvme_req(req)->retries >= nvme_max_retries) + return COMPLETE; + + if (req->cmd_flags & REQ_NVME_MPATH) { + switch (nvme_req(req)->status & 0x7ff) { + case NVME_SC_ANA_TRANSITION: + case NVME_SC_ANA_INACCESSIBLE: + case NVME_SC_ANA_PERSISTENT_LOSS: + return REDIRECT_ANA; + case NVME_SC_HOST_PATH_ERROR: + case NVME_SC_HOST_ABORTED_CMD: + return REDIRECT_TMP; + } + } + + if (blk_queue_dying(req->q)) + return COMPLETE; + return RETRY; +} + +static inline void nvme_complete_req(struct request *req) { blk_status_t status = nvme_error_status(nvme_req(req)->status); - trace_nvme_complete_rq(req); + if (IS_ENABLED(CONFIG_BLK_DEV_ZONED) && + req_op(req) == REQ_OP_ZONE_APPEND) + req->__sector = nvme_lba_to_sect(req->q->queuedata, + 
le64_to_cpu(nvme_req(req)->result.u64)); + + nvme_trace_bio_complete(req, status); + blk_mq_end_request(req, status); +} +void nvme_complete_rq(struct request *req) +{ + trace_nvme_complete_rq(req); nvme_cleanup_cmd(req); if (nvme_req(req)->ctrl->kas) nvme_req(req)->ctrl->comp_seen = true; - if (unlikely(status != BLK_STS_OK && nvme_req_needs_retry(req))) { - if ((req->cmd_flags & REQ_NVME_MPATH) && nvme_failover_req(req)) - return; - - if (!blk_queue_dying(req->q)) { - nvme_retry_req(req); - return; - } - } else if (IS_ENABLED(CONFIG_BLK_DEV_ZONED) && - req_op(req) == REQ_OP_ZONE_APPEND) { - req->__sector = nvme_lba_to_sect(req->q->queuedata, - le64_to_cpu(nvme_req(req)->result.u64)); + switch (nvme_req_disposition(req)) { + case COMPLETE: + nvme_complete_req(req); + return; + case RETRY: + nvme_retry_req(req); + return; + case REDIRECT_ANA: + nvme_failover_req(req, true); + return; + case REDIRECT_TMP: + nvme_failover_req(req, false); + return; } - - nvme_trace_bio_complete(req, status); - blk_mq_end_request(req, status); } EXPORT_SYMBOL_GPL(nvme_complete_rq); diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c index 3ded54d2c9c6ad..0c22b2c88687a2 100644 --- a/drivers/nvme/host/multipath.c +++ b/drivers/nvme/host/multipath.c @@ -65,51 +65,32 @@ void nvme_set_disk_name(char *disk_name, struct nvme_ns *ns, } } -bool nvme_failover_req(struct request *req) +void nvme_failover_req(struct request *req, bool is_ana_status) { struct nvme_ns *ns = req->q->queuedata; - u16 status = nvme_req(req)->status; unsigned long flags; - switch (status & 0x7ff) { - case NVME_SC_ANA_TRANSITION: - case NVME_SC_ANA_INACCESSIBLE: - case NVME_SC_ANA_PERSISTENT_LOSS: - /* - * If we got back an ANA error we know the controller is alive, - * but not ready to serve this namespaces. The spec suggests - * we should update our general state here, but due to the fact - * that the admin and I/O queues are not serialized that is - * fundamentally racy. 
So instead just clear the current path, - * mark the the path as pending and kick of a re-read of the ANA - * log page ASAP. - */ - nvme_mpath_clear_current_path(ns); - if (ns->ctrl->ana_log_buf) { - set_bit(NVME_NS_ANA_PENDING, &ns->flags); - queue_work(nvme_wq, &ns->ctrl->ana_work); - } - break; - case NVME_SC_HOST_PATH_ERROR: - case NVME_SC_HOST_ABORTED_CMD: - /* - * Temporary transport disruption in talking to the controller. - * Try to send on a new path. - */ - nvme_mpath_clear_current_path(ns); - break; - default: - /* This was a non-ANA error so follow the normal error path. */ - return false; + nvme_mpath_clear_current_path(ns); + + /* + * If we got back an ANA error we know the controller is alive, but not + * ready to serve this namespaces. The spec suggests we should update + * our general state here, but due to the fact that the admin and I/O + * queues are not serialized that is fundamentally racy. So instead + * just clear the current path, mark the the path as pending and kick + * of a re-read of the ANA log page ASAP. 
+ */ + if (is_ana_status && ns->ctrl->ana_log_buf) { + set_bit(NVME_NS_ANA_PENDING, &ns->flags); + queue_work(nvme_wq, &ns->ctrl->ana_work); } spin_lock_irqsave(&ns->head->requeue_lock, flags); blk_steal_bios(&ns->head->requeue_list, req); spin_unlock_irqrestore(&ns->head->requeue_lock, flags); - blk_mq_end_request(req, 0); + blk_mq_end_request(req, 0); kblockd_schedule_work(&ns->head->requeue_work); - return true; } void nvme_kick_requeue_lists(struct nvme_ctrl *ctrl) diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h index ebb8c3ed388554..aeff1c491ac2ef 100644 --- a/drivers/nvme/host/nvme.h +++ b/drivers/nvme/host/nvme.h @@ -629,7 +629,7 @@ void nvme_mpath_wait_freeze(struct nvme_subsystem *subsys); void nvme_mpath_start_freeze(struct nvme_subsystem *subsys); void nvme_set_disk_name(char *disk_name, struct nvme_ns *ns, struct nvme_ctrl *ctrl, int *flags); -bool nvme_failover_req(struct request *req); +void nvme_failover_req(struct request *req, bool is_ana_status); void nvme_kick_requeue_lists(struct nvme_ctrl *ctrl); int nvme_mpath_alloc_disk(struct nvme_ctrl *ctrl,struct nvme_ns_head *head); void nvme_mpath_add_disk(struct nvme_ns *ns, struct nvme_id_ns *id); @@ -688,9 +688,8 @@ static inline void nvme_set_disk_name(char *disk_name, struct nvme_ns *ns, sprintf(disk_name, "nvme%dn%d", ctrl->instance, ns->head->instance); } -static inline bool nvme_failover_req(struct request *req) +static inline void nvme_failover_req(struct request *req, bool is_ana_status) { - return false; } static inline void nvme_kick_requeue_lists(struct nvme_ctrl *ctrl) { _______________________________________________ Linux-nvme mailing list Linux-nvme@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme ^ permalink raw reply related [flat|nested] 50+ messages in thread
* Re: [RESEND PATCH] nvme: explicitly use normal NVMe error handling when appropriate 2020-08-13 18:43 ` Christoph Hellwig @ 2020-08-13 19:03 ` Mike Snitzer 2020-08-14 4:26 ` Meneghini, John 2020-08-14 6:53 ` Sagi Grimberg 2 siblings, 0 replies; 50+ messages in thread From: Mike Snitzer @ 2020-08-13 19:03 UTC (permalink / raw) To: Christoph Hellwig Cc: Sagi Grimberg, Hannes Reinecke, linux-nvme, dm-devel, Ewan Milne, Chao Leng, Keith Busch, Meneghini, John On Thu, Aug 13 2020 at 2:43pm -0400, Christoph Hellwig <hch@infradead.org> wrote: > On Thu, Aug 13, 2020 at 01:47:04PM -0400, Mike Snitzer wrote: > > This is just a tweak to improve the high-level fault tree of core NVMe > > error handling. No functional change, but for such basic errors, > > avoiding entering nvme_failover_req is meaningful on a code flow level. > > Makes code to handle errors that need local retry clearer by being more > > structured, less circuitous. > > > > Allows NVMe core's handling of such errors to be more explicit and live > > in core.c rather than multipath.c -- so things like ACRE handling can be > > made explicitly part of core and not nested under nvme_failover_req's > > relatively obscure failsafe that returns false for anything it doesn't > > care about. > > If we're going that way I'd rather do something like the (untested) > patch below that adds a dispostion function with a function that > decides it and then just switches on it: YES! That is such a huge improvement (certainly on a code clarity level). I haven't reviewed or tested the relative performance or function of before vs after (will do) but I really like this approach. Thanks, Mike _______________________________________________ Linux-nvme mailing list Linux-nvme@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [RESEND PATCH] nvme: explicitly use normal NVMe error handling when appropriate 2020-08-13 18:43 ` Christoph Hellwig 2020-08-13 19:03 ` Mike Snitzer @ 2020-08-14 4:26 ` Meneghini, John 2020-08-14 6:53 ` Sagi Grimberg 2 siblings, 0 replies; 50+ messages in thread From: Meneghini, John @ 2020-08-14 4:26 UTC (permalink / raw) To: Christoph Hellwig, Mike Snitzer Cc: Sagi Grimberg, linux-nvme, dm-devel, Ewan Milne, Chao Leng, Keith Busch, Meneghini, John, Hannes Reinecke On 8/13/20, 2:44 PM, "Christoph Hellwig" <hch@infradead.org> wrote: On Thu, Aug 13, 2020 at 01:47:04PM -0400, Mike Snitzer wrote: > This is just a tweak to improve the high-level fault tree of core NVMe > error handling. No functional change, but for such basic errors, > avoiding entering nvme_failover_req is meaningful on a code flow level. > Makes code to handle errors that need local retry clearer by being more > structured, less circuitous. I don't understand how entering nvme_failover_req() is circuitous. This code path is only taken if REQ_NVME_MPATH is set which - unless I am mistaken - in the case that you care about it will not be set. > Allows NVMe core's handling of such errors to be more explicit and live > in core.c rather than multipath.c -- so things like ACRE handling can be > made explicitly part of core and not nested under nvme_failover_req's > relatively obscure failsafe that returns false for anything it doesn't > care about. The ACRE handling is already explicitly a part of the core. I don't understand what you are after here Mike. Are you saying that you don't want the ACRE code to run when REQ_NVME_MPATH is clear? If we're going that way I'd rather do something like the (untested) patch below that adds a dispostion function with a function that decides it and then just switches on it: Christoph, it looks like you've moved a lot of stuff around here, with no actual functional change.... but it's really hard for me to tell. 
Please be sure to cc me if this becomes a real patch. How does your patch solve the problem of making dm-multipath work with command retries? Mike, do you want the nvme-core driver to retry commands on the same path, with CRD, for the dm-multipath use case... or are you looking for a different treatment of REQ_FAILFAST_DEV... or what? Maybe I'm not seeing it. /John diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c index 88cff309d8e4f0..a740320f0d4ee7 100644 --- a/drivers/nvme/host/core.c +++ b/drivers/nvme/host/core.c @@ -241,17 +241,6 @@ static blk_status_t nvme_error_status(u16 status) } } -static inline bool nvme_req_needs_retry(struct request *req) -{ - if (blk_noretry_request(req)) - return false; - if (nvme_req(req)->status & NVME_SC_DNR) - return false; - if (nvme_req(req)->retries >= nvme_max_retries) - return false; - return true; -} - static void nvme_retry_req(struct request *req) { struct nvme_ns *ns = req->q->queuedata; @@ -268,33 +257,75 @@ static void nvme_retry_req(struct request *req) blk_mq_delay_kick_requeue_list(req->q, delay); } -void nvme_complete_rq(struct request *req) +enum nvme_disposition { + COMPLETE, + RETRY, + REDIRECT_ANA, + REDIRECT_TMP, +}; + +static inline enum nvme_disposition nvme_req_disposition(struct request *req) +{ + if (likely(nvme_req(req)->status == 0)) + return COMPLETE; + + if (blk_noretry_request(req) || + (nvme_req(req)->status & NVME_SC_DNR) || + nvme_req(req)->retries >= nvme_max_retries) + return COMPLETE; + + if (req->cmd_flags & REQ_NVME_MPATH) { + switch (nvme_req(req)->status & 0x7ff) { + case NVME_SC_ANA_TRANSITION: + case NVME_SC_ANA_INACCESSIBLE: + case NVME_SC_ANA_PERSISTENT_LOSS: + return REDIRECT_ANA; + case NVME_SC_HOST_PATH_ERROR: + case NVME_SC_HOST_ABORTED_CMD: + return REDIRECT_TMP; + } + } + + if (blk_queue_dying(req->q)) + return COMPLETE; + return RETRY; +} + +static inline void nvme_complete_req(struct request *req) { blk_status_t status = nvme_error_status(nvme_req(req)->status); 
- trace_nvme_complete_rq(req); + if (IS_ENABLED(CONFIG_BLK_DEV_ZONED) && + req_op(req) == REQ_OP_ZONE_APPEND) + req->__sector = nvme_lba_to_sect(req->q->queuedata, + le64_to_cpu(nvme_req(req)->result.u64)); + + nvme_trace_bio_complete(req, status); + blk_mq_end_request(req, status); +} +void nvme_complete_rq(struct request *req) +{ + trace_nvme_complete_rq(req); nvme_cleanup_cmd(req); if (nvme_req(req)->ctrl->kas) nvme_req(req)->ctrl->comp_seen = true; - if (unlikely(status != BLK_STS_OK && nvme_req_needs_retry(req))) { - if ((req->cmd_flags & REQ_NVME_MPATH) && nvme_failover_req(req)) - return; - - if (!blk_queue_dying(req->q)) { - nvme_retry_req(req); - return; - } - } else if (IS_ENABLED(CONFIG_BLK_DEV_ZONED) && - req_op(req) == REQ_OP_ZONE_APPEND) { - req->__sector = nvme_lba_to_sect(req->q->queuedata, - le64_to_cpu(nvme_req(req)->result.u64)); + switch (nvme_req_disposition(req)) { + case COMPLETE: + nvme_complete_req(req); + return; + case RETRY: + nvme_retry_req(req); + return; + case REDIRECT_ANA: + nvme_failover_req(req, true); + return; + case REDIRECT_TMP: + nvme_failover_req(req, false); + return; } - - nvme_trace_bio_complete(req, status); - blk_mq_end_request(req, status); } EXPORT_SYMBOL_GPL(nvme_complete_rq); diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c index 3ded54d2c9c6ad..0c22b2c88687a2 100644 --- a/drivers/nvme/host/multipath.c +++ b/drivers/nvme/host/multipath.c @@ -65,51 +65,32 @@ void nvme_set_disk_name(char *disk_name, struct nvme_ns *ns, } } -bool nvme_failover_req(struct request *req) +void nvme_failover_req(struct request *req, bool is_ana_status) { struct nvme_ns *ns = req->q->queuedata; - u16 status = nvme_req(req)->status; unsigned long flags; - switch (status & 0x7ff) { - case NVME_SC_ANA_TRANSITION: - case NVME_SC_ANA_INACCESSIBLE: - case NVME_SC_ANA_PERSISTENT_LOSS: - /* - * If we got back an ANA error we know the controller is alive, - * but not ready to serve this namespaces. 
The spec suggests - * we should update our general state here, but due to the fact - * that the admin and I/O queues are not serialized that is - * fundamentally racy. So instead just clear the current path, - * mark the the path as pending and kick of a re-read of the ANA - * log page ASAP. - */ - nvme_mpath_clear_current_path(ns); - if (ns->ctrl->ana_log_buf) { - set_bit(NVME_NS_ANA_PENDING, &ns->flags); - queue_work(nvme_wq, &ns->ctrl->ana_work); - } - break; - case NVME_SC_HOST_PATH_ERROR: - case NVME_SC_HOST_ABORTED_CMD: - /* - * Temporary transport disruption in talking to the controller. - * Try to send on a new path. - */ - nvme_mpath_clear_current_path(ns); - break; - default: - /* This was a non-ANA error so follow the normal error path. */ - return false; + nvme_mpath_clear_current_path(ns); + + /* + * If we got back an ANA error we know the controller is alive, but not + * ready to serve this namespaces. The spec suggests we should update + * our general state here, but due to the fact that the admin and I/O + * queues are not serialized that is fundamentally racy. So instead + * just clear the current path, mark the the path as pending and kick + * of a re-read of the ANA log page ASAP. 
+ */ + if (is_ana_status && ns->ctrl->ana_log_buf) { + set_bit(NVME_NS_ANA_PENDING, &ns->flags); + queue_work(nvme_wq, &ns->ctrl->ana_work); } spin_lock_irqsave(&ns->head->requeue_lock, flags); blk_steal_bios(&ns->head->requeue_list, req); spin_unlock_irqrestore(&ns->head->requeue_lock, flags); - blk_mq_end_request(req, 0); + blk_mq_end_request(req, 0); kblockd_schedule_work(&ns->head->requeue_work); - return true; } void nvme_kick_requeue_lists(struct nvme_ctrl *ctrl) diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h index ebb8c3ed388554..aeff1c491ac2ef 100644 --- a/drivers/nvme/host/nvme.h +++ b/drivers/nvme/host/nvme.h @@ -629,7 +629,7 @@ void nvme_mpath_wait_freeze(struct nvme_subsystem *subsys); void nvme_mpath_start_freeze(struct nvme_subsystem *subsys); void nvme_set_disk_name(char *disk_name, struct nvme_ns *ns, struct nvme_ctrl *ctrl, int *flags); -bool nvme_failover_req(struct request *req); +void nvme_failover_req(struct request *req, bool is_ana_status); void nvme_kick_requeue_lists(struct nvme_ctrl *ctrl); int nvme_mpath_alloc_disk(struct nvme_ctrl *ctrl,struct nvme_ns_head *head); void nvme_mpath_add_disk(struct nvme_ns *ns, struct nvme_id_ns *id); @@ -688,9 +688,8 @@ static inline void nvme_set_disk_name(char *disk_name, struct nvme_ns *ns, sprintf(disk_name, "nvme%dn%d", ctrl->instance, ns->head->instance); } -static inline bool nvme_failover_req(struct request *req) +static inline void nvme_failover_req(struct request *req, bool is_ana_status) { - return false; } static inline void nvme_kick_requeue_lists(struct nvme_ctrl *ctrl) { _______________________________________________ Linux-nvme mailing list Linux-nvme@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [RESEND PATCH] nvme: explicitly use normal NVMe error handling when appropriate 2020-08-13 18:43 ` Christoph Hellwig 2020-08-13 19:03 ` Mike Snitzer 2020-08-14 4:26 ` Meneghini, John @ 2020-08-14 6:53 ` Sagi Grimberg 2020-08-14 6:55 ` Christoph Hellwig 2 siblings, 1 reply; 50+ messages in thread From: Sagi Grimberg @ 2020-08-14 6:53 UTC (permalink / raw) To: Christoph Hellwig, Mike Snitzer Cc: Hannes Reinecke, linux-nvme, dm-devel, Ewan Milne, Chao Leng, Keith Busch, Meneghini, John > +static inline enum nvme_disposition nvme_req_disposition(struct request *req) > +{ > + if (likely(nvme_req(req)->status == 0)) > + return COMPLETE; > + > + if (blk_noretry_request(req) || > + (nvme_req(req)->status & NVME_SC_DNR) || > + nvme_req(req)->retries >= nvme_max_retries) > + return COMPLETE; > + > + if (req->cmd_flags & REQ_NVME_MPATH) { > + switch (nvme_req(req)->status & 0x7ff) { > + case NVME_SC_ANA_TRANSITION: > + case NVME_SC_ANA_INACCESSIBLE: > + case NVME_SC_ANA_PERSISTENT_LOSS: > + return REDIRECT_ANA; > + case NVME_SC_HOST_PATH_ERROR: > + case NVME_SC_HOST_ABORTED_CMD: > + return REDIRECT_TMP; > + } > + } > + > + if (blk_queue_dying(req->q)) > + return COMPLETE; > + return RETRY; > +} > + > +static inline void nvme_complete_req(struct request *req) > { > blk_status_t status = nvme_error_status(nvme_req(req)->status); > > - trace_nvme_complete_rq(req); > + if (IS_ENABLED(CONFIG_BLK_DEV_ZONED) && > + req_op(req) == REQ_OP_ZONE_APPEND) > + req->__sector = nvme_lba_to_sect(req->q->queuedata, > + le64_to_cpu(nvme_req(req)->result.u64)); > + > + nvme_trace_bio_complete(req, status); > + blk_mq_end_request(req, status); > +} > > +void nvme_complete_rq(struct request *req) > +{ > + trace_nvme_complete_rq(req); > nvme_cleanup_cmd(req); > > if (nvme_req(req)->ctrl->kas) > nvme_req(req)->ctrl->comp_seen = true; > > - if (unlikely(status != BLK_STS_OK && nvme_req_needs_retry(req))) { > - if ((req->cmd_flags & REQ_NVME_MPATH) && nvme_failover_req(req)) > - return; > - > - 
if (!blk_queue_dying(req->q)) { > - nvme_retry_req(req); > - return; > - } > - } else if (IS_ENABLED(CONFIG_BLK_DEV_ZONED) && > - req_op(req) == REQ_OP_ZONE_APPEND) { > - req->__sector = nvme_lba_to_sect(req->q->queuedata, > - le64_to_cpu(nvme_req(req)->result.u64)); > + switch (nvme_req_disposition(req)) { > + case COMPLETE: > + nvme_complete_req(req); nvme_complete_rq calling nvme_complete_req... Maybe call it __nvme_complete_rq instead? > + return; > + case RETRY: > + nvme_retry_req(req); > + return; > + case REDIRECT_ANA: > + nvme_failover_req(req, true); > + return; > + case REDIRECT_TMP: > + nvme_failover_req(req, false); > + return; > } > - > - nvme_trace_bio_complete(req, status); > - blk_mq_end_request(req, status); > } > EXPORT_SYMBOL_GPL(nvme_complete_rq); > > diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c > index 3ded54d2c9c6ad..0c22b2c88687a2 100644 > --- a/drivers/nvme/host/multipath.c > +++ b/drivers/nvme/host/multipath.c > @@ -65,51 +65,32 @@ void nvme_set_disk_name(char *disk_name, struct nvme_ns *ns, > } > } > > -bool nvme_failover_req(struct request *req) > +void nvme_failover_req(struct request *req, bool is_ana_status) > { > struct nvme_ns *ns = req->q->queuedata; > - u16 status = nvme_req(req)->status; > unsigned long flags; > > - switch (status & 0x7ff) { > - case NVME_SC_ANA_TRANSITION: > - case NVME_SC_ANA_INACCESSIBLE: > - case NVME_SC_ANA_PERSISTENT_LOSS: > - /* > - * If we got back an ANA error we know the controller is alive, > - * but not ready to serve this namespaces. The spec suggests > - * we should update our general state here, but due to the fact > - * that the admin and I/O queues are not serialized that is > - * fundamentally racy. So instead just clear the current path, > - * mark the the path as pending and kick of a re-read of the ANA > - * log page ASAP. 
> - */ > - nvme_mpath_clear_current_path(ns); > - if (ns->ctrl->ana_log_buf) { > - set_bit(NVME_NS_ANA_PENDING, &ns->flags); > - queue_work(nvme_wq, &ns->ctrl->ana_work); > - } > - break; > - case NVME_SC_HOST_PATH_ERROR: > - case NVME_SC_HOST_ABORTED_CMD: > - /* > - * Temporary transport disruption in talking to the controller. > - * Try to send on a new path. > - */ > - nvme_mpath_clear_current_path(ns); > - break; > - default: > - /* This was a non-ANA error so follow the normal error path. */ > - return false; > + nvme_mpath_clear_current_path(ns); > + > + /* > + * If we got back an ANA error we know the controller is alive, but not > + * ready to serve this namespaces. The spec suggests we should update > + * our general state here, but due to the fact that the admin and I/O > + * queues are not serialized that is fundamentally racy. So instead > + * just clear the current path, mark the the path as pending and kick > + * of a re-read of the ANA log page ASAP. > + */ > + if (is_ana_status && ns->ctrl->ana_log_buf) { Maybe call nvme_req_disposition again locally here to not carry the is_ana_status. But not a biggy.. Overall this looks good I think. _______________________________________________ Linux-nvme mailing list Linux-nvme@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [RESEND PATCH] nvme: explicitly use normal NVMe error handling when appropriate 2020-08-14 6:53 ` Sagi Grimberg @ 2020-08-14 6:55 ` Christoph Hellwig 2020-08-14 7:02 ` Sagi Grimberg 0 siblings, 1 reply; 50+ messages in thread From: Christoph Hellwig @ 2020-08-14 6:55 UTC (permalink / raw) To: Sagi Grimberg Cc: Mike Snitzer, Hannes Reinecke, linux-nvme, Christoph Hellwig, dm-devel, Ewan Milne, Chao Leng, Keith Busch, Meneghini, John On Thu, Aug 13, 2020 at 11:53:14PM -0700, Sagi Grimberg wrote: > > + switch (nvme_req_disposition(req)) { > > + case COMPLETE: > > + nvme_complete_req(req); > > nvme_complete_rq calling nvme_complete_req... Maybe call it > __nvme_complete_rq instead? That's what I had first, but it felt so strangely out of place next to the other nvme_*_req calls.. Maybe nvme_complete_rq needs to be renamed - what about nvme_req_done? > Maybe call nvme_req_disposition again locally here to not carry > the is_ana_status. But not a biggy.. That means it would have to become non-static in scope, limiting inlining possibilities, etc. _______________________________________________ Linux-nvme mailing list Linux-nvme@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [RESEND PATCH] nvme: explicitly use normal NVMe error handling when appropriate 2020-08-14 6:55 ` Christoph Hellwig @ 2020-08-14 7:02 ` Sagi Grimberg 0 siblings, 0 replies; 50+ messages in thread From: Sagi Grimberg @ 2020-08-14 7:02 UTC (permalink / raw) To: Christoph Hellwig Cc: Mike Snitzer, linux-nvme, dm-devel, Hannes Reinecke, Chao Leng, Keith Busch, Meneghini, John, Ewan Milne >>> + switch (nvme_req_disposition(req)) { >>> + case COMPLETE: >>> + nvme_complete_req(req); >> >> nvme_complete_rq calling nvme_complete_req... Maybe call it >> __nvme_complete_rq instead? > > That's what I had first, but it felt so strangely out of place next > to the other nvme_*_req calls.. > > Maybe nvme_complete_rq needs to be renamed - what about nvme_req_done? I'd suggest to call nvme_complete_rq nvme_end_rq because it really calls blk_mq_end_request. That would confuse with nvme_end_request which calls blk_mq_complete_request... Maybe we need some naming improvements throughout. >> Maybe call nvme_req_disposition again locally here to not carry >> the is_ana_status. But not a biggy.. > > That means it would have to become non-static in scope, limiting > inlining possibilities, etc. I see. _______________________________________________ Linux-nvme mailing list Linux-nvme@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [RESEND PATCH] nvme: explicitly use normal NVMe error handling when appropriate 2020-08-13 14:48 ` [RESEND PATCH] " Mike Snitzer 2020-08-13 15:29 ` Meneghini, John 2020-08-13 15:36 ` Christoph Hellwig @ 2020-08-14 3:23 ` Meneghini, John 2 siblings, 0 replies; 50+ messages in thread From: Meneghini, John @ 2020-08-14 3:23 UTC (permalink / raw) To: Mike Snitzer, Sagi Grimberg Cc: linux-nvme, Christoph Hellwig, dm-devel, Hannes Reinecke, Chao Leng, Keith Busch, Meneghini, John, Ewan Milne On 8/13/20, 10:48 AM, "Mike Snitzer" <snitzer@redhat.com> wrote: Commit 764e9332098c0 ("nvme-multipath: do not reset on unknown status"), among other things, fixed NVME_SC_CMD_INTERRUPTED error handling by changing multipathing's nvme_failover_req() to short-circuit path failover and then fall back to NVMe's normal error handling (which takes care of NVME_SC_CMD_INTERRUPTED). This detour through native NVMe multipathing code is unwelcome because it prevents NVMe core from handling NVME_SC_CMD_INTERRUPTED independent of any multipathing concerns. Introduce nvme_status_needs_local_error_handling() to prioritize non-failover retry, when appropriate, in terms of normal NVMe error handling. nvme_status_needs_local_error_handling() will naturally evolve to include handling of any other errors that normal error handling must be used for. How is this any better than blk_path_error()? nvme_failover_req()'s ability to fall back to normal NVMe error handling has been preserved because it may be useful for future NVME_SC that nvme_status_needs_local_error_handling() hasn't been trained for yet. 
Signed-off-by: Mike Snitzer <snitzer@redhat.com> --- drivers/nvme/host/core.c | 16 ++++++++++++++-- 1 file changed, 14 insertions(+), 2 deletions(-) diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c index 88cff309d8e4..be749b690af7 100644 --- a/drivers/nvme/host/core.c +++ b/drivers/nvme/host/core.c @@ -252,6 +252,16 @@ static inline bool nvme_req_needs_retry(struct request *req) return true; } +static inline bool nvme_status_needs_local_error_handling(u16 status) +{ + switch (status & 0x7ff) { + case NVME_SC_CMD_INTERRUPTED: + return true; + default: + return false; + } +} I assume that what you mean by nvme_status_needs_local_error_handling is - do you want the nvme core driver to handle the command retry? If this is true, I don't think this function will ever work correctly because, as discussed, whether or not the command needs to be retried has nothing to do with the NVMe Status Code Field itself, it has to do with the DNR bit, the CRD field, and the Status Code Type field - in that order. + static void nvme_retry_req(struct request *req) { struct nvme_ns *ns = req->q->queuedata; @@ -270,7 +280,8 @@ static void nvme_retry_req(struct request *req) void nvme_complete_rq(struct request *req) { - blk_status_t status = nvme_error_status(nvme_req(req)->status); + u16 nvme_status = nvme_req(req)->status; + blk_status_t status = nvme_error_status(nvme_status); trace_nvme_complete_rq(req); @@ -280,7 +291,8 @@ void nvme_complete_rq(struct request *req) nvme_req(req)->ctrl->comp_seen = true; if (unlikely(status != BLK_STS_OK && nvme_req_needs_retry(req))) { - if ((req->cmd_flags & REQ_NVME_MPATH) && nvme_failover_req(req)) + if (!nvme_status_needs_local_error_handling(nvme_status) && This defeats the nvme-multipath logic by inserting a second evaluation of the NVMe Status Code into the retry logic. This is basically another version of blk_path_error(). 
In fact, in your case REQ_NVME_MPATH is probably not set, so I don't see what difference this would make at all. /John + (req->cmd_flags & REQ_NVME_MPATH) && nvme_failover_req(req)) return; if (!blk_queue_dying(req->q)) { -- 2.18.0 _______________________________________________ Linux-nvme mailing list Linux-nvme@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [PATCH] nvme: restore use of blk_path_error() in nvme_complete_rq() 2020-08-06 19:19 ` [PATCH] nvme: restore use of blk_path_error() in nvme_complete_rq() Mike Snitzer 2020-08-06 22:42 ` Meneghini, John @ 2020-08-07 0:44 ` Sagi Grimberg 2020-08-10 12:43 ` Christoph Hellwig 2 siblings, 0 replies; 50+ messages in thread From: Sagi Grimberg @ 2020-08-07 0:44 UTC (permalink / raw) To: Mike Snitzer, Meneghini, John Cc: linux-nvme, Christoph Hellwig, Hannes Reinecke, Chao Leng, Keith Busch, Ewan Milne Hey Mike, > Regardless, reading back on this thread, I think there is at least some > consensus about reverting commit 35038bffa87 ("nvme: Translate more > status codes to blk_status_t") ? We can probably make NVME_SC_CMD_INTERRUPTED an I/O error, it's the closest there is to something that is somewhat close to its meaning, at least for some interpretation.. Although BLK_STS_TARGET is also semantically correct from another interpretation. > And on a related note, building on the thread I started here (but > haven't heard back from any NVMe maintainers on): > https://www.redhat.com/archives/dm-devel/2020-July/msg00051.html > I'd also be happy as a pig in shit if this patch were applied: I'll reply to the other thread. _______________________________________________ Linux-nvme mailing list Linux-nvme@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [PATCH] nvme: restore use of blk_path_error() in nvme_complete_rq() 2020-08-06 19:19 ` [PATCH] nvme: restore use of blk_path_error() in nvme_complete_rq() Mike Snitzer 2020-08-06 22:42 ` Meneghini, John 2020-08-07 0:44 ` [PATCH] nvme: restore use of blk_path_error() in nvme_complete_rq() Sagi Grimberg @ 2020-08-10 12:43 ` Christoph Hellwig 2020-08-10 15:06 ` Mike Snitzer 2020-08-11 3:45 ` [PATCH] " Chao Leng 2 siblings, 2 replies; 50+ messages in thread From: Christoph Hellwig @ 2020-08-10 12:43 UTC (permalink / raw) To: Mike Snitzer Cc: Hannes Reinecke, linux-nvme, Christoph Hellwig, Ewan Milne, Chao Leng, Keith Busch, Meneghini, John Just returning from my vacation, and I'm really surprised about this discussion. If you want to fix a device mapper / block layer interaction you do not change nvme code to use suboptimal interfaces. One of the reasons for the nvme multipath design was to allow for tighter integration with protocol features, and making nvme do a detour through the block layer for no good reason at all does not help with that at all. And if you want to support the TP that added the new command interrupted status code please read through it - the status code itself is just a small part of it, and none of the changes proposed in these discussions would lead to a proper implementation. _______________________________________________ Linux-nvme mailing list Linux-nvme@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: nvme: restore use of blk_path_error() in nvme_complete_rq() 2020-08-10 12:43 ` Christoph Hellwig @ 2020-08-10 15:06 ` Mike Snitzer 2020-08-11 3:45 ` [PATCH] " Chao Leng 1 sibling, 0 replies; 50+ messages in thread From: Mike Snitzer @ 2020-08-10 15:06 UTC (permalink / raw) To: Christoph Hellwig Cc: linux-nvme, Hannes Reinecke, Chao Leng, Keith Busch, Meneghini, John, Ewan Milne On Mon, Aug 10 2020 at 8:43am -0400, Christoph Hellwig <hch@infradead.org> wrote: > Just returning from my vacation, and I'm really surprised about > this discussion. Welcome back! FYI, I got roped into this stuff again due to reports of regression from Red Hat and partner QE testing of latest RHEL 8.3 stuff. I have been away from all things NVMe for a while so apologies for my misfire that dwelled on nvme_complete_rq()'s blk_path_error() call being removed. > If you want to fix a device mapper / block layer interaction you do > not change nvme code to use suboptimal interfaces. > > One of the reasons for the nvme multipath design was to allow for > tighter integration with protocol features, and making nvme do a detour > through the block layer for no good reason at all does not help with > that at all. Yes, reinstating blk_path_error() was a distraction. Still an outstanding point to discuss but it can be pushed to the back-back-burner. > And if you want to support the TP that added the new command interrupted > status code please read through it - the status code itself is just a > small part of it, and none of the changes proposed in these discussions > would lead to a proper implementation. Think you likely glossed over aspects of my exchange with Sagi. That's fine. I don't fault you in any way for not wanting to labor over this thread to this point! ;) Really I shouldn't _need_ to be the wiser about all this inherently local NVMe error handling, which happens to be increasingly complex, that seemingly has nothing to do with multipathing. 
NVMe should handle the command interrupted status code, ANA and now ACRE in a way where it isn't conflated to detour through nvme_failover_req() et al. So that is what I hope to rectify with some simple patches. Mike _______________________________________________ Linux-nvme mailing list Linux-nvme@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [PATCH] nvme: restore use of blk_path_error() in nvme_complete_rq() 2020-08-10 12:43 ` Christoph Hellwig 2020-08-10 15:06 ` Mike Snitzer @ 2020-08-11 3:45 ` Chao Leng 1 sibling, 0 replies; 50+ messages in thread From: Chao Leng @ 2020-08-11 3:45 UTC (permalink / raw) To: Christoph Hellwig, Mike Snitzer Cc: Keith Busch, Hannes Reinecke, linux-nvme, Ewan Milne, Meneghini, John On 2020/8/10 20:43, Christoph Hellwig wrote: > Just returning from my vacation, and I'm really surprised about > this discussion. > > If you want to fix a device mapper / block layer interaction you do > not change nvme code to use suboptimal interfaces. Agree. NVMe defines path errors and a local retry mechanism. It is more reasonable to use the new mechanism. > > One of the reasons for the nvme multipath design was to allow for > tighter integration with protocol features, and making nvme do a detour > through the block layer for no good reason at all does not help with > that at all. NVMe should use the local retry mechanism. But right now we first check the REQ_FAILFAST_TRANSPORT flag; when working with dm-multipath, REQ_FAILFAST_TRANSPORT will be set, and this interrupts the local retry. > > And if you want to support the TP that added the new command interrupted > status code please read through it - the status code itself is just a > small part of it, and none of the changes proposed in these discussions > would lead to a proper implementation. > . > _______________________________________________ Linux-nvme mailing list Linux-nvme@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [PATCH] nvme-core: fix io interrupt when work with dm-multipah 2020-08-06 15:59 ` Meneghini, John 2020-08-06 16:17 ` Meneghini, John @ 2020-08-07 0:03 ` Sagi Grimberg 2020-08-07 2:28 ` Chao Leng 1 sibling, 1 reply; 50+ messages in thread From: Sagi Grimberg @ 2020-08-07 0:03 UTC (permalink / raw) To: Meneghini, John, Keith Busch, Chao Leng Cc: Christoph Hellwig, linux-nvme, Hannes de, Ewan Milne >>> John, Wat do you think about if delete translate NVME_SC_CMD_INTERRUPTED >>> to BLK_STS_TARGET? Thank you. > > I think returning to BLK_STS_TARGET for NVME_SC_CMD_INTERRUPTED and NVME_SC_NS_NOT_READY > was a potentially non-backwards compatible change that Keith made in response to a problem that > I reported in patch 35038bffa87da. > > https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=35038bffa87da > > I don’t think it would be a mistake to back out this change. We've had a number of problems > in the distributions with dm-mp and other legacy code getting confused by this change. From what I see the point of this change was to have these status codes to do exactly what they need to do, use the local retry flow. >>> BLK_STS_TARGET means target has critical error. NVME_SC_CMD_INTERRUPTED >>> just means target need retry io. It is not suitable to translate >>> NVME_SC_CMD_INTERRUPTED to BLK_STS_TARGET. Maybe translate to >>> BLK_STS_IOERR is also not suitable, we should translate >>> NVME_SC_CMD_INTERRUPTED to BLK_STS_AGAIN. >>> We can do like this: >> >> BLK_STS_AGAIN is a bad choice as we use it for calls that block when >> the callers asked for non-blocking submission. I'm really not sure >> we want to change anything here - the error definition clearly states >> it is not a failure but a request to retry later. > > Agreed. The historical behavior in nvme-core has been to return BLK_STS_IOERR status for all new > or unknown NVMe errors. Which is the correct thing to do. > . 
On 8/6/20, 10:26 AM, "Keith Busch" <kbusch@kernel.org> wrote: > > > On Thu, Aug 06, 2020 at 01:52:42PM +0800, Chao Leng wrote: > > NVME_SC_LBA_RANGE should failfast, because retry can not success. > > The DNR bit is how NVMe conveys such things. > > I think that NVME_SC_LBA_RANGE has historically returned BLK_STS_TARGET so that > shouldn't be changed. I agree that DNR - and now the new ACRE mechanism - is what > should control the command retry behavior in NVMe. Completely agree here too. > > NVME_SC_NS_NOT_READY may retry success, but the probality is very low. > > NVME_SC_CMD_INTERRUPTED need retry, according to protocol define, retry will success. > > If the upper layers set a request such that "noretry" is true, that's > the behavior you're going to get. Sprinkling special exceptions around > is not a good idea. > > I agree. I also 100% agree here. > I think the problem here is that the current BLK_STS and FAST_FAIL mechanisms > were designed support legacy protocols like SCSI. They assume that all retry behavior is > controlled by other components in the stack. NVMe is presenting new protocol features > and semantics which probably can't be effectively supported by those legacy BLK_STS > and FAST_FAIL mechanisms without passing more information up the stack. Not sure how generic this new blk status would be.. It would probably make a lot more sense if there are other consumers for such a status code. Maybe we could set it to BLK_STS_TIMEOUT with a big fat comment for why we are doing this... _______________________________________________ Linux-nvme mailing list Linux-nvme@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [PATCH] nvme-core: fix io interrupt when work with dm-multipah 2020-08-07 0:03 ` [PATCH] nvme-core: fix io interrupt when work with dm-multipah Sagi Grimberg @ 2020-08-07 2:28 ` Chao Leng 0 siblings, 0 replies; 50+ messages in thread From: Chao Leng @ 2020-08-07 2:28 UTC (permalink / raw) To: Sagi Grimberg, Meneghini, John, Keith Busch Cc: Christoph Hellwig, linux-nvme, Hannes de, Ewan Milne On 2020/8/7 8:03, Sagi Grimberg wrote: >> I think the problem here is that the current BLK_STS and FAST_FAIL mechanisms >> were designed support legacy protocols like SCSI. They assume that all retry behavior is >> controlled by other components in the stack. NVMe is presenting new protocol features >> and semantics which probably can't be effectively supported by those legacy BLK_STS >> and FAST_FAIL mechanisms without passing more information up the stack. > > Not sure how generic this new blk status would be.. It would probably > make a lot more sense of there are other consumers for such a status > code. > > Maybe we could set it to BLK_STS_TIMEOUT with a big fat comment for why > we are doing this... Introducing a new BLK_STS error may be a good choice, but it is not friendly to forward compatibility. Now we have three choices: BLK_STS_TIMEOUT, BLK_STS_RESOURCE, or BLK_STS_IOERR by default. DM-multipath will fail the path and then fail over to retry. The difference is that BLK_STS_RESOURCE will delay 100ms before retrying, while BLK_STS_TIMEOUT and BLK_STS_IOERR will retry immediately. There is no perfect choice, but we have to choose one. I suggest keeping the default and translating NVME_SC_CMD_INTERRUPTED to BLK_STS_IOERR. _______________________________________________ Linux-nvme mailing list Linux-nvme@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme ^ permalink raw reply [flat|nested] 50+ messages in thread
end of thread, other threads:[~2020-08-14 7:02 UTC | newest] Thread overview: 50+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2020-07-27 5:58 [PATCH] nvme-core: fix io interrupt when work with dm-multipah Chao Leng 2020-07-28 11:19 ` Christoph Hellwig 2020-07-29 2:54 ` Chao Leng 2020-07-29 5:59 ` Christoph Hellwig 2020-07-30 1:49 ` Chao Leng 2020-08-05 6:40 ` Chao Leng 2020-08-05 15:29 ` Keith Busch 2020-08-06 5:52 ` Chao Leng 2020-08-06 14:26 ` Keith Busch 2020-08-06 15:59 ` Meneghini, John 2020-08-06 16:17 ` Meneghini, John 2020-08-06 18:40 ` Mike Snitzer 2020-08-06 19:19 ` [PATCH] nvme: restore use of blk_path_error() in nvme_complete_rq() Mike Snitzer 2020-08-06 22:42 ` Meneghini, John 2020-08-07 0:07 ` Mike Snitzer 2020-08-07 1:21 ` Sagi Grimberg 2020-08-07 4:50 ` Mike Snitzer 2020-08-07 23:35 ` Sagi Grimberg 2020-08-08 21:08 ` Meneghini, John 2020-08-08 21:11 ` Meneghini, John 2020-08-10 14:48 ` Mike Snitzer 2020-08-11 12:54 ` Meneghini, John 2020-08-10 8:10 ` Chao Leng 2020-08-11 12:36 ` Meneghini, John 2020-08-12 7:51 ` Chao Leng 2020-08-10 14:36 ` Mike Snitzer 2020-08-10 17:22 ` [PATCH] nvme: explicitly use normal NVMe error handling when appropriate Mike Snitzer 2020-08-11 3:32 ` Chao Leng 2020-08-11 4:20 ` Mike Snitzer 2020-08-11 6:17 ` Chao Leng 2020-08-11 14:12 ` Mike Snitzer 2020-08-13 14:48 ` [RESEND PATCH] " Mike Snitzer 2020-08-13 15:29 ` Meneghini, John 2020-08-13 15:43 ` Mike Snitzer 2020-08-13 15:59 ` Meneghini, John 2020-08-13 15:36 ` Christoph Hellwig 2020-08-13 17:47 ` Mike Snitzer 2020-08-13 18:43 ` Christoph Hellwig 2020-08-13 19:03 ` Mike Snitzer 2020-08-14 4:26 ` Meneghini, John 2020-08-14 6:53 ` Sagi Grimberg 2020-08-14 6:55 ` Christoph Hellwig 2020-08-14 7:02 ` Sagi Grimberg 2020-08-14 3:23 ` Meneghini, John 2020-08-07 0:44 ` [PATCH] nvme: restore use of blk_path_error() in nvme_complete_rq() Sagi Grimberg 2020-08-10 12:43 ` Christoph Hellwig 2020-08-10 15:06 ` Mike Snitzer 
2020-08-11 3:45 ` [PATCH] " Chao Leng 2020-08-07 0:03 ` [PATCH] nvme-core: fix io interrupt when work with dm-multipah Sagi Grimberg 2020-08-07 2:28 ` Chao Leng