* [PATCH] nvme-core: fix io interrupt when working with dm-multipath
From: Chao Leng @ 2020-07-27  5:58 UTC
To: linux-nvme; +Cc: kbusch, axboe, hch, lengchao, sagi

NVM Express 1.4 defines:
  Command Interrupted: Command processing was interrupted and the
  controller is unable to successfully complete the command. The host
  should retry the command. If this status code is returned, then
  the controller shall clear the Do Not Retry bit to '0' in the Status
  field of the CQE (refer to Figure 124). The controller shall not
  return this status code unless the host has set the Advanced Command
  Retry Enable (ACRE) field to 1h in the Host Behavior Support feature
  (refer to section 5.21.1.22).

According to this definition, NVME_SC_CMD_INTERRUPTED needs to be
retried. The status code should not be translated to BLK_STS_TARGET,
because dm-multipath then returns the error to the application, so the
I/O fails outright whenever the target returns NVME_SC_CMD_INTERRUPTED.
Letting it fall through to the default BLK_STS_IOERR translation instead
allows dm-multipath to fail over to another path and retry the I/O.

Signed-off-by: Chao Leng <lengchao@huawei.com>
---
 drivers/nvme/host/core.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index c2c5bc4fb702..1a1ad5e5212c 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -211,7 +211,6 @@ static blk_status_t nvme_error_status(u16 status)
 	case NVME_SC_CAP_EXCEEDED:
 		return BLK_STS_NOSPC;
 	case NVME_SC_LBA_RANGE:
-	case NVME_SC_CMD_INTERRUPTED:
 	case NVME_SC_NS_NOT_READY:
 		return BLK_STS_TARGET;
 	case NVME_SC_BAD_ATTRIBUTES:
--
2.16.4

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme
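[Editor's note: to make the effect of the one-line removal concrete, here is a small self-contained model of the nvme_error_status() translation after this patch. The constants and enum are simplified stand-ins for illustration, not the kernel's real definitions; only the cases relevant to this discussion are shown.]

```c
#include <assert.h>

/* Simplified stand-ins; the real definitions live in
 * include/linux/nvme.h and include/linux/blk_types.h. */
enum blk_status { BLK_STS_OK, BLK_STS_NOSPC, BLK_STS_TARGET, BLK_STS_IOERR };

#define NVME_SC_CAP_EXCEEDED	0x81
#define NVME_SC_LBA_RANGE	0x80
#define NVME_SC_NS_NOT_READY	0x82
#define NVME_SC_CMD_INTERRUPTED	0x21

/* Model of nvme_error_status() with the NVME_SC_CMD_INTERRUPTED case
 * removed: that status now falls through to the default arm and maps
 * to BLK_STS_IOERR, which dm-multipath treats as retryable. */
static enum blk_status nvme_error_status_model(unsigned short status)
{
	switch (status & 0x7ff) {	/* mask off DNR/More bits */
	case NVME_SC_CAP_EXCEEDED:
		return BLK_STS_NOSPC;
	case NVME_SC_LBA_RANGE:
	case NVME_SC_NS_NOT_READY:
		return BLK_STS_TARGET;
	default:
		return BLK_STS_IOERR;
	}
}
```

With the patch applied, NVME_SC_CMD_INTERRUPTED maps to BLK_STS_IOERR while NVME_SC_LBA_RANGE and NVME_SC_NS_NOT_READY still map to BLK_STS_TARGET.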
* Re: [PATCH] nvme-core: fix io interrupt when working with dm-multipath
From: Christoph Hellwig @ 2020-07-28 11:19 UTC
To: Chao Leng; +Cc: kbusch, axboe, hch, linux-nvme, sagi

On Mon, Jul 27, 2020 at 01:58:18PM +0800, Chao Leng wrote:
> NVM Express 1.4 defines:
>   Command Interrupted: Command processing was interrupted and the
>   controller is unable to successfully complete the command. The host
>   should retry the command. [...]
>
> According to this definition, NVME_SC_CMD_INTERRUPTED needs to be
> retried. The status code should not be translated to BLK_STS_TARGET,
> because dm-multipath then returns the error to the application, so the
> I/O fails outright whenever the target returns NVME_SC_CMD_INTERRUPTED.
> Letting it fall through to the default BLK_STS_IOERR translation
> instead allows dm-multipath to fail over to another path and retry the
> I/O.

IOERR still seems wrong, though.
* Re: [PATCH] nvme-core: fix io interrupt when working with dm-multipath
From: Chao Leng @ 2020-07-29  2:54 UTC
To: Christoph Hellwig; +Cc: kbusch, axboe, sagi, linux-nvme

On 2020/7/28 19:19, Christoph Hellwig wrote:
> On Mon, Jul 27, 2020 at 01:58:18PM +0800, Chao Leng wrote:
>> [...]
>
> IOERR still seems wrong, though.

BLK_STS_TARGET means the target hit a critical error, while
NVME_SC_CMD_INTERRUPTED only means the target wants the I/O retried, so
translating NVME_SC_CMD_INTERRUPTED to BLK_STS_TARGET is not a good
fit. BLK_STS_IOERR may not be ideal either; perhaps we should translate
NVME_SC_CMD_INTERRUPTED to BLK_STS_AGAIN.

We could do it like this:
---
 drivers/nvme/host/core.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index c2c5bc4fb702..359ce471df1d 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -211,7 +211,6 @@ static blk_status_t nvme_error_status(u16 status)
 	case NVME_SC_CAP_EXCEEDED:
 		return BLK_STS_NOSPC;
 	case NVME_SC_LBA_RANGE:
-	case NVME_SC_CMD_INTERRUPTED:
 	case NVME_SC_NS_NOT_READY:
 		return BLK_STS_TARGET;
 	case NVME_SC_BAD_ATTRIBUTES:
@@ -236,6 +235,8 @@ static blk_status_t nvme_error_status(u16 status)
 		return BLK_STS_NEXUS;
 	case NVME_SC_HOST_PATH_ERROR:
 		return BLK_STS_TRANSPORT;
+	case NVME_SC_CMD_INTERRUPTED:
+		return BLK_STS_AGAIN;
 	default:
 		return BLK_STS_IOERR;
 	}
--
2.16.4
* Re: [PATCH] nvme-core: fix io interrupt when working with dm-multipath
From: Christoph Hellwig @ 2020-07-29  5:59 UTC
To: Chao Leng; +Cc: kbusch, axboe, Christoph Hellwig, linux-nvme, sagi

On Wed, Jul 29, 2020 at 10:54:29AM +0800, Chao Leng wrote:
> [...]
> BLK_STS_IOERR may not be ideal either; perhaps we should translate
> NVME_SC_CMD_INTERRUPTED to BLK_STS_AGAIN.
>
> We could do it like this:

BLK_STS_AGAIN is a bad choice as we use it for calls that block when
the callers asked for non-blocking submission.  I'm really not sure
we want to change anything here - the error definition clearly states
it is not a failure but a request to retry later.
* Re: [PATCH] nvme-core: fix io interrupt when working with dm-multipath
From: Chao Leng @ 2020-07-30  1:49 UTC
To: Christoph Hellwig; +Cc: kbusch, axboe, sagi, linux-nvme, John.Meneghini

On 2020/7/29 13:59, Christoph Hellwig wrote:
> On Wed, Jul 29, 2020 at 10:54:29AM +0800, Chao Leng wrote:
>> [...]
>> BLK_STS_IOERR may not be ideal either; perhaps we should translate
>> NVME_SC_CMD_INTERRUPTED to BLK_STS_AGAIN.
>>
>> We could do it like this:
>
> BLK_STS_AGAIN is a bad choice as we use it for calls that block when
> the callers asked for non-blocking submission.  I'm really not sure
> we want to change anything here - the error definition clearly states
> it is not a failure but a request to retry later.

BLK_STS_AGAIN is not a good choice, but neither is BLK_STS_TARGET. I
found the patch that made NVME_SC_CMD_INTERRUPTED translate to
BLK_STS_TARGET:

commit 35038bffa87da282010b91108cadd13238bb5bbd
nvme: Translate more status codes to blk_status_t

    Decode interrupted command and not ready namespace nvme status codes
    to BLK_STS_TARGET. These are not generic IO errors and should use a
    non-path specific error so that it can use the non-failover retry
    path.

    Reported-by: John Meneghini <John.Meneghini@netapp.com>

In the old code, NVME_SC_CMD_INTERRUPTED had to be translated to
BLK_STS_TARGET because nvme multipath checked blk_path_error() and we
wanted the I/O retried on the current path; there was no better choice
then. But nvme multipath has since been improved and no longer depends
on blk_path_error(), so the translation is no longer needed.

According to the description of BLK_STS_TARGET:
	[BLK_STS_TARGET] = { -EREMOTEIO, "critical target" }
it is easily mistaken for a fatal error on the storage. I checked all
the BLK_STS_xx codes, and none of them is a good fit for
NVME_SC_CMD_INTERRUPTED now. In the absence of a suitable code, keeping
the default is a good choice, so I strongly suggest removing the
translation of NVME_SC_CMD_INTERRUPTED to BLK_STS_TARGET.

John, what do you think?
* Re: [PATCH] nvme-core: fix io interrupt when working with dm-multipath
From: Chao Leng @ 2020-08-05  6:40 UTC
Cc: linux-nvme, John.Meneghini

John, what do you think about removing the translation of
NVME_SC_CMD_INTERRUPTED to BLK_STS_TARGET? Thank you.

On 2020/7/30 9:49, Chao Leng wrote:
> [...]
> John, what do you think?
* Re: [PATCH] nvme-core: fix io interrupt when working with dm-multipath
From: Keith Busch @ 2020-08-05 15:29 UTC
To: Chao Leng; +Cc: linux-nvme, John.Meneghini

On Wed, Aug 05, 2020 at 02:40:48PM +0800, Chao Leng wrote:
> John, what do you think about removing the translation of
> NVME_SC_CMD_INTERRUPTED to BLK_STS_TARGET? Thank you.

Perhaps BLK_STS_TARGET should only failfast for REQ_FAILFAST_DEV rather
than REQ_FAILFAST_PATH.
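[Editor's note: a hedged sketch of Keith's suggestion. The names below are purely illustrative stand-ins to model the device-vs-path distinction; the kernel's real failfast flags are REQ_FAILFAST_DEV, REQ_FAILFAST_TRANSPORT and REQ_FAILFAST_DRIVER, with different values, and "REQ_FAILFAST_PATH" here is a hypothetical alias for the path-level bits.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified flags modelling the suggestion. */
#define REQ_FAILFAST_DEV	0x1	/* fail fast on device-side errors */
#define REQ_FAILFAST_PATH	0x2	/* fail fast on path errors (stand-in name) */

enum blk_status { BLK_STS_OK, BLK_STS_TARGET, BLK_STS_TRANSPORT, BLK_STS_IOERR };

/* BLK_STS_TARGET is a device-side condition, so under this idea it
 * would only honor device-level failfast; path-style errors would
 * honor the path-level failfast bits. */
static bool should_failfast(enum blk_status sts, uint32_t cmd_flags)
{
	if (sts == BLK_STS_TARGET)
		return cmd_flags & REQ_FAILFAST_DEV;
	return cmd_flags & REQ_FAILFAST_PATH;
}
```

Under this model, a path-failfast request that gets BLK_STS_TARGET would still be retried, while a device-failfast request would not.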
* Re: [PATCH] nvme-core: fix io interrupt when working with dm-multipath
From: Chao Leng @ 2020-08-06  5:52 UTC
To: Keith Busch; +Cc: linux-nvme, John.Meneghini

On 2020/8/5 23:29, Keith Busch wrote:
> On Wed, Aug 05, 2020 at 02:40:48PM +0800, Chao Leng wrote:
>> [...]
>
> Perhaps BLK_STS_TARGET should only failfast for REQ_FAILFAST_DEV rather
> than REQ_FAILFAST_PATH.

We currently translate NVME_SC_LBA_RANGE, NVME_SC_CMD_INTERRUPTED and
NVME_SC_NS_NOT_READY to BLK_STS_TARGET. NVME_SC_LBA_RANGE should
failfast, because a retry cannot succeed. NVME_SC_NS_NOT_READY may
succeed on retry, but the probability is very low.
NVME_SC_CMD_INTERRUPTED needs a retry, and according to the spec the
retry will succeed. So I recommend adding a local preferential retry
mechanism, like this:

---
 drivers/nvme/host/core.c | 2 +-
 drivers/nvme/host/nvme.h | 8 ++++++++
 2 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 1a1ad5e5212c..3d39528051ad 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -242,7 +242,7 @@ static blk_status_t nvme_error_status(u16 status)
 
 static inline bool nvme_req_needs_retry(struct request *req)
 {
-	if (blk_noretry_request(req))
+	if (!nvme_req_local_retry(req) && blk_noretry_request(req))
 		return false;
 	if (nvme_req(req)->status & NVME_SC_DNR)
 		return false;
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index c0f4226d3299..17064455f2db 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -710,4 +710,12 @@ void nvme_hwmon_init(struct nvme_ctrl *ctrl);
 static inline void nvme_hwmon_init(struct nvme_ctrl *ctrl) { }
 #endif
 
+static inline bool nvme_req_local_retry(struct request *req)
+{
+	if (nvme_req(req)->status == NVME_SC_CMD_INTERRUPTED)
+		return true;
+	return false;
+}
+
+
 #endif /* _NVME_H */
--
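[Editor's note: the retry decision in Chao's proposal can be sketched as a stand-alone model. The struct, accessor and flag values below are simplified illustrations, not the kernel's real definitions; only NVME_SC_DNR and NVME_SC_CMD_INTERRUPTED use their spec values.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-ins; real definitions are in include/linux/blk_types.h
 * and include/linux/nvme.h. */
#define REQ_FAILFAST_MASK	0x1	/* models blk_noretry_request()'s flags */
#define NVME_SC_DNR		0x4000
#define NVME_SC_CMD_INTERRUPTED	0x21

struct req_model {
	uint32_t cmd_flags;		/* failfast bits set by upper layers */
	uint16_t status;		/* NVMe completion status */
	unsigned int retries, max_retries;
};

/* Per the spec, the controller clears DNR for Command Interrupted,
 * so a plain equality check suffices in this model. */
static bool nvme_req_local_retry(const struct req_model *req)
{
	return req->status == NVME_SC_CMD_INTERRUPTED;
}

/* Model of nvme_req_needs_retry() with the proposed change: a failfast
 * request is still retried locally for Command Interrupted, but DNR and
 * the retry budget are still honored. */
static bool nvme_req_needs_retry(const struct req_model *req)
{
	if (!nvme_req_local_retry(req) && (req->cmd_flags & REQ_FAILFAST_MASK))
		return false;
	if (req->status & NVME_SC_DNR)
		return false;
	if (req->retries >= req->max_retries)
		return false;
	return true;
}
```

With this model, a failfast request completing with NVME_SC_CMD_INTERRUPTED is still retried, while a failfast request with any other status, or any status carrying DNR, is not.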
* Re: [PATCH] nvme-core: fix io interrupt when working with dm-multipath
From: Keith Busch @ 2020-08-06 14:26 UTC
To: Chao Leng; +Cc: linux-nvme, John.Meneghini

On Thu, Aug 06, 2020 at 01:52:42PM +0800, Chao Leng wrote:
> NVME_SC_LBA_RANGE should failfast, because a retry cannot succeed.

The DNR bit is how NVMe conveys such things.

> NVME_SC_NS_NOT_READY may succeed on retry, but the probability is very
> low. NVME_SC_CMD_INTERRUPTED needs a retry, and according to the spec
> the retry will succeed.

If the upper layers set a request such that "noretry" is true, that's
the behavior you're going to get. Sprinkling special exceptions around
is not a good idea.
* Re: [PATCH] nvme-core: fix io interrupt when working with dm-multipath
From: Meneghini, John @ 2020-08-06 15:59 UTC
To: Keith Busch, Chao Leng
Cc: Christoph Hellwig, Ewan Milne, Hannes de, linux-nvme, Meneghini, John

Thanks for cc'ing me on this thread. The email from this DL usually goes
into a folder, so I don't watch it too closely. Cc me directly if you
need my attention; that way the email shows up at the top of my inbox.

>> On 8/5/20, 2:40 AM, "Chao Leng" <lengchao@huawei.com> wrote:
>> John, what do you think about removing the translation of
>> NVME_SC_CMD_INTERRUPTED to BLK_STS_TARGET? Thank you.

I think returning BLK_STS_TARGET for NVME_SC_CMD_INTERRUPTED and
NVME_SC_NS_NOT_READY was a potentially non-backwards-compatible change
that Keith made in response to a problem I reported, in patch
35038bffa87da.

https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=35038bffa87da

I don't think it would be a mistake to back out this change. We've had a
number of problems in the distributions with dm-mp and other legacy code
getting confused by it.

>> [...]
>
> BLK_STS_AGAIN is a bad choice as we use it for calls that block when
> the callers asked for non-blocking submission.  I'm really not sure
> we want to change anything here - the error definition clearly states
> it is not a failure but a request to retry later.

Agreed. The historical behavior in nvme-core has been to return
BLK_STS_IOERR for all new or unknown NVMe errors.

On 8/6/20, 10:26 AM, "Keith Busch" <kbusch@kernel.org> wrote:
> On Thu, Aug 06, 2020 at 01:52:42PM +0800, Chao Leng wrote:
>> NVME_SC_LBA_RANGE should failfast, because a retry cannot succeed.
>
> The DNR bit is how NVMe conveys such things.

I think NVME_SC_LBA_RANGE has historically returned BLK_STS_TARGET, so
that shouldn't be changed. I agree that DNR - and now the new ACRE
mechanism - is what should control command retry behavior in NVMe.

> If the upper layers set a request such that "noretry" is true, that's
> the behavior you're going to get. Sprinkling special exceptions around
> is not a good idea.

I agree. I think the problem here is that the current BLK_STS and
FAST_FAIL mechanisms were designed to support legacy protocols like
SCSI. They assume that all retry behavior is controlled by other
components in the stack. NVMe is presenting new protocol features and
semantics which probably can't be effectively supported by those legacy
BLK_STS and FAST_FAIL mechanisms without passing more information up
the stack.

So I think the best approach is to maintain backwards compatibility with
the existing BLK_STS codes and semantics, or to create new errors and
mechanisms - which will require new software to handle those new error
codes all the way up the stack. But don't overload things.

John Meneghini
ONTAP SAN Target Architect
johnm@netapp.com
* Re: [PATCH] nvme-core: fix io interrupt when working with dm-multipath
From: Meneghini, John @ 2020-08-06 16:17 UTC
To: Keith Busch, Chao Leng
Cc: msnitzer, linux-nvme, Christoph Hellwig, Hannes de, Ewan Milne, Meneghini, John

On 8/6/20, 11:59 AM, "Meneghini, John" <John.Meneghini@netapp.com> wrote:

>> Maybe translating to BLK_STS_IOERR is also not suitable; we should
>> translate NVME_SC_CMD_INTERRUPTED to BLK_STS_AGAIN.

I think this depends upon what the error handling is up the stack for
BLK_STS_IOERR. What does DM do with BLK_STS_IOERR?

> BLK_STS_AGAIN is a bad choice as we use it for calls that block when
> the callers asked for non-blocking submission.  I'm really not sure
> we want to change anything here - the error definition clearly states
> it is not a failure but a request to retry later.

So it sounds like you may need a new BLK_STS error. However, even if you
add a new error, that's not going to be enough to communicate the CRDT
or DNR information up the stack.

} blk_errors[] = {
	[BLK_STS_OK]		= { 0,		"" },
	[BLK_STS_NOTSUPP]	= { -EOPNOTSUPP, "operation not supported" },
	[BLK_STS_TIMEOUT]	= { -ETIMEDOUT,	"timeout" },
	[BLK_STS_NOSPC]		= { -ENOSPC,	"critical space allocation" },
	[BLK_STS_TRANSPORT]	= { -ENOLINK,	"recoverable transport" },
	[BLK_STS_TARGET]	= { -EREMOTEIO,	"critical target" },
	[BLK_STS_NEXUS]		= { -EBADE,	"critical nexus" },
	[BLK_STS_MEDIUM]	= { -ENODATA,	"critical medium" },
	[BLK_STS_PROTECTION]	= { -EILSEQ,	"protection" },
	[BLK_STS_RESOURCE]	= { -ENOMEM,	"kernel resource" },
	[BLK_STS_DEV_RESOURCE]	= { -EBUSY,	"device resource" },
	[BLK_STS_AGAIN]		= { -EAGAIN,	"nonblocking retry" },

	/* device mapper special case, should not leak out: */
	[BLK_STS_DM_REQUEUE]	= { -EREMCHG, "dm internal retry" },

	/* everything else not covered above: */
	[BLK_STS_IOERR]		= { -EIO,	"I/O" },
};
* Re: nvme-core: fix io interrupt when working with dm-multipath
From: Mike Snitzer @ 2020-08-06 18:40 UTC
To: Meneghini, John
Cc: Ewan Milne, Christoph Hellwig, linux-nvme, Chao Leng, Keith Busch, Hannes de

On Thu, Aug 06 2020 at 12:17pm -0400, Meneghini, John <John.Meneghini@netapp.com> wrote:

>> Maybe translating to BLK_STS_IOERR is also not suitable; we should
>> translate NVME_SC_CMD_INTERRUPTED to BLK_STS_AGAIN.
>
> I think this depends upon what the error handling is up the stack for
> BLK_STS_IOERR. What does DM do with BLK_STS_IOERR?

DM treats it as retryable.  See blk_path_error().

> So it sounds like you may need a new BLK_STS error. However, even if
> you add a new error, that's not going to be enough to communicate the
> CRDT or DNR information up the stack.
>
> [...]

We've yet to determine how important it is that the target-provided
delay information be honored...

In any case, NVMe translating NVME_SC_CMD_INTERRUPTED to BLK_STS_TARGET
is definitely wrong. That conveys that the error is not retryable (see
blk_path_error()).

Shouldn't NVMe translate NVME_SC_CMD_INTERRUPTED to BLK_STS_RESOURCE or
BLK_STS_DEV_RESOURCE?

DM will retry immediately if BLK_STS_RESOURCE is returned.
DM will delay a fixed 100ms if BLK_STS_DEV_RESOURCE is used.

(Ming said BLK_STS_RESOURCE isn't Linux specific and can be used by
drivers)
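[Editor's note: since the discussion keeps coming back to blk_path_error(), here is a compilable sketch that mirrors its logic at the time of this thread. The enum values are simplified stand-ins; the real blk_status_t definitions live in include/linux/blk_types.h.]

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-ins for blk_status_t values. */
enum blk_status {
	BLK_STS_OK, BLK_STS_NOTSUPP, BLK_STS_NOSPC, BLK_STS_TARGET,
	BLK_STS_NEXUS, BLK_STS_MEDIUM, BLK_STS_PROTECTION,
	BLK_STS_TRANSPORT, BLK_STS_RESOURCE, BLK_STS_DEV_RESOURCE,
	BLK_STS_IOERR,
};

/* Mirrors blk_path_error(): the listed statuses are target-side
 * failures that failing over to another path cannot fix; anything
 * else could be a path failure and may be retried on another path. */
static bool blk_path_error(enum blk_status error)
{
	switch (error) {
	case BLK_STS_NOTSUPP:
	case BLK_STS_NOSPC:
	case BLK_STS_TARGET:
	case BLK_STS_NEXUS:
	case BLK_STS_MEDIUM:
	case BLK_STS_PROTECTION:
		return false;
	default:
		return true;
	}
}
```

This is why translating NVME_SC_CMD_INTERRUPTED to BLK_STS_TARGET makes dm-multipath fail the I/O to the application: blk_path_error(BLK_STS_TARGET) is false, so no path-based retry is attempted.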
* [PATCH] nvme: restore use of blk_path_error() in nvme_complete_rq() 2020-08-06 18:40 ` Mike Snitzer @ 2020-08-06 19:19 ` Mike Snitzer 2020-08-06 22:42 ` Meneghini, John ` (2 more replies) 0 siblings, 3 replies; 50+ messages in thread From: Mike Snitzer @ 2020-08-06 19:19 UTC (permalink / raw) To: Meneghini, John Cc: Ewan Milne, Christoph Hellwig, linux-nvme, Chao Leng, Keith Busch, Hannes Reinecke On Thu, Aug 06 2020 at 2:40pm -0400, Mike Snitzer <snitzer@redhat.com> wrote: > On Thu, Aug 06 2020 at 12:17pm -0400, > Meneghini, John <John.Meneghini@netapp.com> wrote: > > > On 8/6/20, 11:59 AM, "Meneghini, John" <John.Meneghini@netapp.com> wrote: > > > > Maybe translate to > > >> BLK_STS_IOERR is also not suitable, we should translate > > >> NVME_SC_CMD_INTERRUPTED to BLK_STS_AGAIN. > > > > I think this depends upon what the error handling is up the stack for BLK_STS_IOERR. > > > > What does DM do with BLK_STS_IOERR? > > DM treats it as retryable. See blk_path_error(). > > > > BLK_STS_AGAIN is a bad choice as we use it for calls that block when > > > the callers asked for non-blocking submission. I'm really not sure > > > we want to change anything here - the error definition clearly states > > > it is not a failure but a request to retry later. > > > > So it sounds like you may need a new BLK_STS error. However, even if you add > > a new error, that's not going to be enough to communicate the CRDT or DNR > > information up the stack. 
> > > > } blk_errors[] = { > > [BLK_STS_OK] = { 0, "" }, > > [BLK_STS_NOTSUPP] = { -EOPNOTSUPP, "operation not supported" }, > > [BLK_STS_TIMEOUT] = { -ETIMEDOUT, "timeout" }, > > [BLK_STS_NOSPC] = { -ENOSPC, "critical space allocation" }, > > [BLK_STS_TRANSPORT] = { -ENOLINK, "recoverable transport" }, > > [BLK_STS_TARGET] = { -EREMOTEIO, "critical target" }, > > [BLK_STS_NEXUS] = { -EBADE, "critical nexus" }, > > [BLK_STS_MEDIUM] = { -ENODATA, "critical medium" }, > > [BLK_STS_PROTECTION] = { -EILSEQ, "protection" }, > > [BLK_STS_RESOURCE] = { -ENOMEM, "kernel resource" }, > > [BLK_STS_DEV_RESOURCE] = { -EBUSY, "device resource" }, > > [BLK_STS_AGAIN] = { -EAGAIN, "nonblocking retry" }, > > > > /* device mapper special case, should not leak out: */ > > [BLK_STS_DM_REQUEUE] = { -EREMCHG, "dm internal retry" }, > > > > /* everything else not covered above: */ > > [BLK_STS_IOERR] = { -EIO, "I/O" }, > > }; > > > > We've yet to determine how important it is that the target provided > delay information be honored... > > In any case, NVMe translating NVME_SC_CMD_INTERRUPTED to BLK_STS_TARGET > is definitely wrong. That conveys the error is not retryable (see > blk_path_error()). > > Shouldn't NVMe translate NVME_SC_CMD_INTERRUPTED to BLK_STS_RESOURCE or > BLK_STS_DEV_RESOURCE? > > DM will retry immediately if BLK_STS_RESOURCE is returned. > DM will delay a fixed 100ms if BLK_STS_DEV_RESOURCE is used. Ngh, I got that inverted.. BLK_STS_RESOURCE will result in the 100ms delayed retry. BLK_STS_DEV_RESOURCE results in immediate retry. But going back to BLK_STS_IOERR by reverting commit 35038bffa87 would work too. > (Ming said BLK_STS_RESOURCE isn't Linux [block core] specific and can > be used by drivers) Regardless, reading back on this thread, I think there is at least some consensus about reverting commit 35038bffa87 ("nvme: Translate more status codes to blk_status_t") ? 
And on a related note, building on the thread I started here (but haven't heard back from any NVMe maintainers on): https://www.redhat.com/archives/dm-devel/2020-July/msg00051.html I'd also be happy as a pig in shit if this patch were applied: From: Mike Snitzer <snitzer@redhat.com> Date: Thu, 2 Jul 2020 01:43:27 -0400 Subject: [PATCH] nvme: restore use of blk_path_error() in nvme_complete_rq() Commit 764e9332098c0 ("nvme-multipath: do not reset on unknown status") removed NVMe's use of blk_path_error() -- presumably because nvme_failover_req() was modified to return whether a command should be retried or not. By not using blk_path_error() there is serious potential for regression for how upper layers (e.g. DM multipath) respond to NVMe's error conditions. This has played out now due to commit 35038bffa87 ("nvme: Translate more status codes to blk_status_t"). Had NVMe continued to use blk_path_error() it too would not have retried an NVMe command that got NVME_SC_CMD_INTERRUPTED. Fix this potential for NVMe error handling regression, possibly outside NVMe, by restoring NVMe's use of blk_path_error(). 
Fixes: 764e9332098c0 ("nvme-multipath: do not reset on unknown status") Cc: stable@vger.kernel.org Signed-off-by: Mike Snitzer <snitzer@redhat.com> --- drivers/nvme/host/core.c | 9 +++++++-- 1 file changed, 7 insertions(+), 2 deletions(-) diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c index 6585d57112ad..072f629da4d8 100644 --- a/drivers/nvme/host/core.c +++ b/drivers/nvme/host/core.c @@ -290,8 +290,13 @@ void nvme_complete_rq(struct request *req) nvme_req(req)->ctrl->comp_seen = true; if (unlikely(status != BLK_STS_OK && nvme_req_needs_retry(req))) { - if ((req->cmd_flags & REQ_NVME_MPATH) && nvme_failover_req(req)) - return; + if (blk_path_error(status)) { + if (req->cmd_flags & REQ_NVME_MPATH) { + if (nvme_failover_req(req)) + return; + /* fallthru to normal error handling */ + } + } if (!blk_queue_dying(req->q)) { nvme_retry_req(req); -- 2.18.0 _______________________________________________ Linux-nvme mailing list Linux-nvme@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme ^ permalink raw reply related [flat|nested] 50+ messages in thread
* Re: [PATCH] nvme: restore use of blk_path_error() in nvme_complete_rq() 2020-08-06 19:19 ` [PATCH] nvme: restore use of blk_path_error() in nvme_complete_rq() Mike Snitzer @ 2020-08-06 22:42 ` Meneghini, John 2020-08-07 0:07 ` Mike Snitzer 2020-08-07 0:44 ` [PATCH] nvme: restore use of blk_path_error() in nvme_complete_rq() Sagi Grimberg 2020-08-10 12:43 ` Christoph Hellwig 2 siblings, 1 reply; 50+ messages in thread From: Meneghini, John @ 2020-08-06 22:42 UTC (permalink / raw) To: Mike Snitzer Cc: Ewan Milne, Christoph Hellwig, linux-nvme, Chao Leng, Keith Busch, Meneghini, John, Hannes Reinecke On 8/6/20, 3:20 PM, "Mike Snitzer" <snitzer@redhat.com> wrote: From: Mike Snitzer <snitzer@redhat.com> Date: Thu, 2 Jul 2020 01:43:27 -0400 Subject: [PATCH] nvme: restore use of blk_path_error() in nvme_complete_rq() Commit 764e9332098c0 ("nvme-multipath: do not reset on unknown status") removed NVMe's use of blk_path_error() -- presumably because nvme_failover_req() was modified to return whether a command should be retried or not. Yes, that was a major part of this patch. nvme_failover_req() was completely reworked and fixed because it was badly broken. By not using blk_path_error() there is serious potential for regression for how upper layers (e.g. DM multipath) respond to NVMe's error conditions. This has played out now due to commit 35038bffa87 ("nvme: Translate more status codes to blk_status_t"). Had NVMe continued to use blk_path_error() it too would not have retried an NVMe command that got NVME_SC_CMD_INTERRUPTED. I'm not sure others would consider it broken. The idea was to get the blk_path_error() logic out of the core nvme driver completion routine. Fix this potential for NVMe error handling regression, possibly outside NVMe, by restoring NVMe's use of blk_path_error(). The point is: blk_path_error() has nothing to do with NVMe errors. This is dm-multipath logic stuck in the middle of the NVMe error handling code. 
C symbol: blk_path_error File Function Line 0 dm-mpath.c multipath_end_io 1606 if (error && blk_path_error(error)) { 1 dm-mpath.c multipath_end_io_bio 1646 if (!*error || !blk_path_error(*error)) 2 blk_types.h blk_path_error 118 static inline bool blk_path_error(blk_status_t error) Fixes: 764e9332098c0 ("nvme-multipath: do not reset on unknown status") Cc: stable@vger.kerneel.org Signed-off-by: Mike Snitzer <snitzer@redhat.com> --- drivers/nvme/host/core.c | 9 +++++++-- 1 file changed, 7 insertions(+), 2 deletions(-) diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c index 6585d57112ad..072f629da4d8 100644 --- a/drivers/nvme/host/core.c +++ b/drivers/nvme/host/core.c @@ -290,8 +290,13 @@ void nvme_complete_rq(struct request *req) nvme_req(req)->ctrl->comp_seen = true; if (unlikely(status != BLK_STS_OK && nvme_req_needs_retry(req))) { - if ((req->cmd_flags & REQ_NVME_MPATH) && nvme_failover_req(req)) - return; + if (blk_path_error(status)) { + if (req->cmd_flags & REQ_NVME_MPATH) { + if (nvme_failover_req(req)) + return; + /* fallthru to normal error handling */ + } + } This would basically undo the patch Hannes, Christoph, and I put together in commit 764e9332098c0. This patch greatly simplified and improved the whole nvme_complete_rq() code path, and I don't support undoing that change. If you want blk_path_error() to handle any new NVMe error status differently, I would suggest creating a new BLK_STS code to translate that NVMe status and add it to the one place in nvme_complete_rq() designed to do the NVMe status to BLK_STS translation: nvme_error_status(). Then you can change your application specific error handling code to handle the new BLK_STS code appropriately, further up the stack. Don't stick your application specific error handling logic into the middle of the nvme driver's nvme_complete_rq routine. 
/John if (!blk_queue_dying(req->q)) { nvme_retry_req(req); -- 2.18.0 _______________________________________________ Linux-nvme mailing list Linux-nvme@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme ^ permalink raw reply [flat|nested] 50+ messages in thread
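For reference, the blk_path_error() helper that John's cscope listing points at is short. The userspace model below mirrors its definition in include/linux/blk_types.h at the time of this thread; the enum values here are local stand-ins for illustration, not the kernel's actual blk_status_t encodings:

```c
#include <stdbool.h>

/* Local stand-ins for the kernel's blk_status_t values (illustrative only;
 * the real encodings live in include/linux/blk_types.h). */
typedef int blk_status_t;
enum {
	BLK_STS_OK, BLK_STS_NOTSUPP, BLK_STS_TIMEOUT, BLK_STS_NOSPC,
	BLK_STS_TRANSPORT, BLK_STS_TARGET, BLK_STS_NEXUS, BLK_STS_MEDIUM,
	BLK_STS_PROTECTION, BLK_STS_RESOURCE, BLK_STS_DEV_RESOURCE,
	BLK_STS_AGAIN, BLK_STS_IOERR,
};

/* Modeled on the kernel's blk_path_error(): everything not explicitly
 * listed as a hard target failure is treated as a retryable path error. */
static bool blk_path_error(blk_status_t error)
{
	switch (error) {
	case BLK_STS_NOTSUPP:
	case BLK_STS_NOSPC:
	case BLK_STS_TARGET:
	case BLK_STS_NEXUS:
	case BLK_STS_MEDIUM:
	case BLK_STS_PROTECTION:
		return false;
	}
	/* anything else could be a path failure, so should be retried */
	return true;
}
```

This is the property the whole thread turns on: BLK_STS_TARGET lands on the hard-failure list, so translating NVME_SC_CMD_INTERRUPTED to it makes dm-multipath fail the I/O upward instead of retrying, while BLK_STS_IOERR, BLK_STS_RESOURCE, etc. fall through to "retryable".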
* Re: nvme: restore use of blk_path_error() in nvme_complete_rq() 2020-08-06 22:42 ` Meneghini, John @ 2020-08-07 0:07 ` Mike Snitzer 2020-08-07 1:21 ` Sagi Grimberg 0 siblings, 1 reply; 50+ messages in thread From: Mike Snitzer @ 2020-08-07 0:07 UTC (permalink / raw) To: Meneghini, John Cc: Ewan Milne, Christoph Hellwig, dm-devel, linux-nvme, Chao Leng, Keith Busch, Hannes Reinecke On Thu, Aug 06 2020 at 6:42pm -0400, Meneghini, John <John.Meneghini@netapp.com> wrote: > On 8/6/20, 3:20 PM, "Mike Snitzer" <snitzer@redhat.com> wrote: > > From: Mike Snitzer <snitzer@redhat.com> > Date: Thu, 2 Jul 2020 01:43:27 -0400 > Subject: [PATCH] nvme: restore use of blk_path_error() in nvme_complete_rq() > > Commit 764e9332098c0 ("nvme-multipath: do not reset on unknown > status") removed NVMe's use of blk_path_error() -- presumably because > nvme_failover_req() was modified to return whether a command should be > retried or not. > > Yes, that was a major part of this patch. nvme_failover_req() was completely > reworked and fixed because it was badly broken. Sure, and I didn't change any of that. Still all in place. > By not using blk_path_error() there is serious potential for > regression for how upper layers (e.g. DM multipath) respond to NVMe's > error conditions. This has played out now due to commit 35038bffa87 > ("nvme: Translate more status codes to blk_status_t"). Had NVMe > continued to use blk_path_error() it too would not have retried an > NVMe command that got NVME_SC_CMD_INTERRUPTED. > > I'm not sure others would consider it broken. The idea was to get the blk_path_error() > logic out of the core nvme driver completion routine. > > Fix this potential for NVMe error handling regression, possibly > outside NVMe, by restoring NVMe's use of blk_path_error(). > > The point is: blk_path_error() has nothing to do with NVMe errors. > This is dm-multipath logic stuck in the middle of the NVMe error > handling code. 
No, it is a means to have multiple subsystems (to this point both SCSI and NVMe) doing the correct thing when translating subsystem specific error codes to BLK_STS codes. If you, or others you name drop below, understood the point we wouldn't be having this conversation. You'd accept the point of blk_path_error() to be valid and required codification of what constitutes a retryable path error for the Linux block layer. Any BLK_STS mapping of NVMe specific error codes would need to not screw up by categorizing a retryable error as non-retryable (and vice-versa). Again, assuming proper testing, commit 35038bffa87 wouldn't have made it upstream if NVMe still used blk_path_error(). > C symbol: blk_path_error > > File Function Line > 0 dm-mpath.c multipath_end_io 1606 if (error && blk_path_error(error)) { > 1 dm-mpath.c multipath_end_io_bio 1646 if (!*error || !blk_path_error(*error)) > 2 blk_types.h blk_path_error 118 static inline bool blk_path_error(blk_status_t error) Yes, your commit 764e9332098c0 needlessly removed NVMe's use of blk_path_error(). If you're saying it wasn't needless please explain why. 
> Fixes: 764e9332098c0 ("nvme-multipath: do not reset on unknown status") > Cc: stable@vger.kerneel.org > Signed-off-by: Mike Snitzer <snitzer@redhat.com> > --- > drivers/nvme/host/core.c | 9 +++++++-- > 1 file changed, 7 insertions(+), 2 deletions(-) > > diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c > index 6585d57112ad..072f629da4d8 100644 > --- a/drivers/nvme/host/core.c > +++ b/drivers/nvme/host/core.c > @@ -290,8 +290,13 @@ void nvme_complete_rq(struct request *req) > nvme_req(req)->ctrl->comp_seen = true; > > if (unlikely(status != BLK_STS_OK && nvme_req_needs_retry(req))) { > - if ((req->cmd_flags & REQ_NVME_MPATH) && nvme_failover_req(req)) > - return; > + if (blk_path_error(status)) { > + if (req->cmd_flags & REQ_NVME_MPATH) { > + if (nvme_failover_req(req)) > + return; > + /* fallthru to normal error handling */ > + } > + } > > This would basically undo the patch Hannes, Christoph, and I put together in > commit 764e9332098c0. This patch greatly simplified and improved the > whole nvme_complete_rq() code path, and I don't support undoing that change. Please elaborate on how I've undone anything? The only thing I may have done is forced NVMe to take more care about properly translating its NVME_SC to BLK_STS in nvme_error_status(). Which is a good thing. > If you want blk_path_error() to handle any new NVMe error status differently, I would suggest creating a new > BLK_STS code to translate that NVMe status and add it to the one place in nvme_complete_rq() designed to > do the NVMe status to BLK_STS translation: nvme_error_status(). Then you can change your application specific > error handling code to handle the new BLK_STS code appropriately, further up the stack. Don't stick your > application specific error handling logic into the middle of the nvme driver's nvme_complete_rq routine. Removing NVMe as a primary user of blk_path_error() was a pretty serious mistake. One you clearly aren't willing to own. 
Yet the associated breakage still exists as well as the potential for further regressions. This really isn't some vantage point holy war. Please don't continue to make this something it isn't. Would be cool if you could see what a disservice this hostility is causing. Otherwise you'll continue to taint and/or avoid fixing NVMe with misplaced anti-DM-multipath tribalism. Maybe (re?)read these commit headers: e96fef2c3fa3 nvme: Add more command status translation 908e45643d64 nvme/multipath: Consult blk_status_t for failover 9111e5686c8c block: Provide blk_status_t decoding for path errors e1f425e770d2 nvme/multipath: Use blk_path_error a1275677f8cd dm mpath: Use blk_path_error with: git log e96fef2c3fa3^..a1275677f8cd Anyway, no new BLK_STS is needed at this point. More discipline with how NVMe's error handling is changed is. If NVMe wants to ensure its interface isn't broken regularly it _should_ use blk_path_error() to validate future nvme_error_status() changes. Miscategorizing NVMe errors to upper layers is a bug -- not open for debate. Nor should be using block core infrastructure we put in place to help stabilize how Linux block drivers convey retryable errors. Mike _______________________________________________ Linux-nvme mailing list Linux-nvme@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: nvme: restore use of blk_path_error() in nvme_complete_rq() 2020-08-07 0:07 ` Mike Snitzer @ 2020-08-07 1:21 ` Sagi Grimberg 2020-08-07 4:50 ` Mike Snitzer 0 siblings, 1 reply; 50+ messages in thread From: Sagi Grimberg @ 2020-08-07 1:21 UTC (permalink / raw) To: Mike Snitzer, Meneghini, John Cc: linux-nvme, Christoph Hellwig, dm-devel, Ewan Milne, Chao Leng, Keith Busch, Hannes Reinecke Hey Mike, >> The point is: blk_path_error() has nothing to do with NVMe errors. >> This is dm-multipath logic stuck in the middle of the NVMe error >> handling code. > > No, it is a means to have multiple subsystems (to this point both SCSI > and NVMe) doing the correct thing when translating subsystem specific > error codes to BLK_STS codes. Not exactly, don't find any use of this in scsi. The purpose is to make sure that nvme and dm speak the same language. > If you, or others you name drop below, understood the point we wouldn't > be having this conversation. You'd accept the point of blk_path_error() > to be valid and required codification of what constitutes a retryable > path error for the Linux block layer. This incident is a case where the specific nvme status was designed to retry on the same path respecting the controller retry delay. And because nvme used blk_path_error at the time it caused us to use a non-retryable status to get around that. Granted, no one had dm-multipath in mind. So in a sense, there is consensus on changing patch 35038bffa87da _because_ nvme no longer uses blk_path_error. Otherwise it would break. > Any BLK_STS mapping of NVMe specific error codes would need to not screw > up by categorizing a retryable error as non-retryable (and vice-versa). But it is a special type of retryable. There is nothing that fits the semantics of the current behavior. > Again, assuming proper testing, commit 35038bffa87 wouldn't have made it > upstream if NVMe still used blk_path_error(). Agree. 
> Yes, your commit 764e9332098c0 needlessly removed NVMe's use of > blk_path_error(). If you're saying it wasn't needless please explain > why. > >> Fixes: 764e9332098c0 ("nvme-multipath: do not reset on unknown status") >> Cc: stable@vger.kerneel.org >> Signed-off-by: Mike Snitzer <snitzer@redhat.com> >> --- >> drivers/nvme/host/core.c | 9 +++++++-- >> 1 file changed, 7 insertions(+), 2 deletions(-) >> >> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c >> index 6585d57112ad..072f629da4d8 100644 >> --- a/drivers/nvme/host/core.c >> +++ b/drivers/nvme/host/core.c >> @@ -290,8 +290,13 @@ void nvme_complete_rq(struct request *req) >> nvme_req(req)->ctrl->comp_seen = true; >> >> if (unlikely(status != BLK_STS_OK && nvme_req_needs_retry(req))) { >> - if ((req->cmd_flags & REQ_NVME_MPATH) && nvme_failover_req(req)) >> - return; >> + if (blk_path_error(status)) { >> + if (req->cmd_flags & REQ_NVME_MPATH) { >> + if (nvme_failover_req(req)) >> + return; >> + /* fallthru to normal error handling */ >> + } >> + } >> >> This would basically undo the patch Hannes, Christoph, and I put together in >> commit 764e9332098c0. This patch greatly simplified and improved the >> whole nvme_complete_rq() code path, and I don't support undoing that change. > > Please elaborate on how I've undone anything? I think you're right, this hasn't undone the patch from John, but it breaks NVME_SC_CMD_INTERRUPTED error handling behavior. > The only thing I may have done is forced NVMe to take more care about > properly translating its NVME_SC to BLK_STS in nvme_error_status(). > Which is a good thing. I don't think there is an issue here of mistakenly converting an nvme status code to a wrong block status code. This conversion is there because there is no block status that give us the semantics we need which is apparently specific to nvme. I personally don't mind restoring blk_path_error to nvme, I don't particularly feel strong about it either. 
But for sure blk_path_error needs to first provide the semantics needed for NVME_SC_CMD_INTERRUPTED. ... > Anyway, no new BLK_STS is needed at this point. More discipline with > how NVMe's error handling is changed is. Please read the above. > If NVMe wants to ensure its > interface isn't broken regularly it _should_ use blk_path_error() to > validate future nvme_error_status() changes. Miscategorizing NVMe > errors to upper layers is a bug -- not open for debate. Again, I don't agree it's a miscategorization or a bug, it's just something that is NVMe specific. _______________________________________________ Linux-nvme mailing list Linux-nvme@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: nvme: restore use of blk_path_error() in nvme_complete_rq() 2020-08-07 1:21 ` Sagi Grimberg @ 2020-08-07 4:50 ` Mike Snitzer 2020-08-07 23:35 ` Sagi Grimberg 0 siblings, 1 reply; 50+ messages in thread From: Mike Snitzer @ 2020-08-07 4:50 UTC (permalink / raw) To: Sagi Grimberg Cc: Hannes Reinecke, linux-nvme, Christoph Hellwig, dm-devel, Ewan Milne, Chao Leng, Keith Busch, Meneghini, John On Thu, Aug 06 2020 at 9:21pm -0400, Sagi Grimberg <sagi@grimberg.me> wrote: > Hey Mike, > > >>The point is: blk_path_error() has nothing to do with NVMe errors. > >>This is dm-multipath logic stuck in the middle of the NVMe error > >>handling code. > > > >No, it is a means to have multiple subsystems (to this point both SCSI > >and NVMe) doing the correct thing when translating subsystem specific > >error codes to BLK_STS codes. > > Not exactly, don't find any use of this in scsi. The purpose is to make > sure that nvme and dm speak the same language. SCSI doesn't need to do additional work to train a multipath layer because dm-multipath _is_ SCSI's multipathing in Linux. So ensuring SCSI properly classifies its error codes happens as a side-effect of ensuring continued multipath functionality. Hannes introduced all these differentiated error codes in block core because of SCSI. NVMe is meant to build on the infrastructure that was established. > >If you, or others you name drop below, understood the point we wouldn't > >be having this conversation. You'd accept the point of blk_path_error() > >to be valid and required codification of what constitutes a retryable > >path error for the Linux block layer. > > This incident is a case where the specific nvme status was designed > to retry on the same path respecting the controller retry delay. > And because nvme used blk_path_error at the time it caused us to use a > non-retryable status to get around that. Granted, no one had > dm-multipath in mind. 
> > So in a sense, there is consensus on changing patch 35038bffa87da > _because_ nvme no longer uses blk_path_error. Otherwise it would break. "break" meaning it would do failover instead of the more optimal local retry to the same controller. I see. Wish the header for commit 35038bffa87da touched on this even a little bit ;) But AFAICT the patch I provided doesn't compromise proper local retry -- as long as we first fix nvme_error_status() to return a retryable BLK_STS for NVME_SC_CMD_INTERRUPTED -- which I assumed as a prereq. Think of blk_path_error() as a more coarse-grained "is this retryable or a hard failure?" check. So for NVME_SC_CMD_INTERRUPTED, nvme_error_status() _should_ respond with something retryable (I'd prefer BLK_STS_RESOURCE to be honest). And then nvme_failover_req() is finer-grained; by returning false it now allows short-circuiting failover and reverting back to NVMe's normal controller based error recovery -- which it does for NVME_SC_CMD_INTERRUPTED due to "default" case in nvme_failover_req(). And then the previous nvme_error_status() classification of retryable BLK_STS obviously never gets returned up the IO stack; it gets thrown away. > >Any BLK_STS mapping of NVMe specific error codes would need to not screw > >up by categorizing a retryable error as non-retryable (and vice-versa). > > But it is a special type of retryable. There is nothing that fits the > semantics of the current behavior. I agree. But that's fine actually. And this issue is the poster-child for why properly supporting a duality of driver-level vs upper level multipathing capabilities is pretty much impossible unless a clean design factors out the different error classes -- and local error retry is handled before punting to higher level failover retry. Think if NVMe were to adopt a bit more disciplined "local then failover" error handling it all gets easier. This local retry _is_ NVMe specific. 
NVMe should just own retrying on the same controller no matter what (I'll hope that such retry has awareness to not retry indefinitely though!). And this has nothing to do with multipathing, so the logic to handle it shouldn't be trapped in nvme_failover_req(). I think NVMe can easily fix this by having an earlier stage of checking, e.g. nvme_local_retry_req(), that shortcircuits ever getting to higher-level multipathing consideration (be it native NVMe or DM multipathing) for cases like NVME_SC_CMD_INTERRUPTED. To be clear: the "default" case of nvme_failover_req() that returns false to fallback to NVMe's "local" normal NVMe error handling -- that can stay.. but a more explicit handling of cases like NVME_SC_CMD_INTERRUPTED should be added to a nvme_local_retry_req() check that happens before nvme_failover_req() in nvme_complete_rq(). This is where a patch will speak a thousand words... maybe ;) > >Again, assuming proper testing, commit 35038bffa87 wouldn't have made it > >upstream if NVMe still used blk_path_error(). > > Agree. > > >Yes, your commit 764e9332098c0 needlessly removed NVMe's use of > >blk_path_error(). If you're saying it wasn't needless please explain > >why. 
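A minimal userspace sketch of that proposed ordering follows. To be clear about assumptions: nvme_local_retry_req() and the disposition model are hypothetical names illustrating the idea, not upstream code; the only real value used is the NVME_SC_CMD_INTERRUPTED status code (21h in NVMe 1.4):

```c
#include <stdbool.h>

enum { NVME_SC_CMD_INTERRUPTED = 0x21 };	/* status code 21h, NVMe 1.4 */
enum disposition { DONE, LOCAL_RETRY, FAILOVER, FAIL };

/* Hypothetical helper: statuses the controller asked to be retried on the
 * same controller (respecting any CRD delay), before multipath is even
 * consulted. */
static bool nvme_local_retry_req(unsigned status)
{
	switch (status & 0x7ff) {
	case NVME_SC_CMD_INTERRUPTED:
		return true;
	default:
		return false;
	}
}

/* Simplified completion flow under the proposal: local retry is checked
 * before failover, so neither native NVMe multipath nor dm-multipath ever
 * sees this class of error. */
static enum disposition nvme_complete_model(unsigned status, bool mpath_capable)
{
	if (status == 0)
		return DONE;
	if (nvme_local_retry_req(status))
		return LOCAL_RETRY;
	if (mpath_capable)	/* would call nvme_failover_req() here */
		return FAILOVER;
	return FAIL;
}
```

The design point of the sketch is just the ordering: because the local-retry check runs first, the question "which BLK_STS does NVME_SC_CMD_INTERRUPTED map to?" never reaches the multipath layers at all.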
> > > >> Fixes: 764e9332098c0 ("nvme-multipath: do not reset on unknown status") > >> Cc: stable@vger.kerneel.org > >> Signed-off-by: Mike Snitzer <snitzer@redhat.com> > >> --- > >> drivers/nvme/host/core.c | 9 +++++++-- > >> 1 file changed, 7 insertions(+), 2 deletions(-) > >> > >> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c > >> index 6585d57112ad..072f629da4d8 100644 > >> --- a/drivers/nvme/host/core.c > >> +++ b/drivers/nvme/host/core.c > >> @@ -290,8 +290,13 @@ void nvme_complete_rq(struct request *req) > >> nvme_req(req)->ctrl->comp_seen = true; > >> > >> if (unlikely(status != BLK_STS_OK && nvme_req_needs_retry(req))) { > >> - if ((req->cmd_flags & REQ_NVME_MPATH) && nvme_failover_req(req)) > >> - return; > >> + if (blk_path_error(status)) { > >> + if (req->cmd_flags & REQ_NVME_MPATH) { > >> + if (nvme_failover_req(req)) > >> + return; > >> + /* fallthru to normal error handling */ > >> + } > >> + } > >> > >>This would basically undo the patch Hannes, Christoph, and I put together in > >>commit 764e9332098c0. This patch greatly simplified and improved the > >>whole nvme_complete_rq() code path, and I don't support undoing that change. > > > >Please elaborate on how I've undone anything? > > I think you're right, this hasn't undone the patch from John, but it > breaks NVME_SC_CMD_INTERRUPTED error handling behavior. Yes, again we'd need to first fix nvme_error_status() to return a retryable BLK_STS for NVME_SC_CMD_INTERRUPTED -- which I assumed as a prereq. > >The only thing I may have done is forced NVMe to take more care about > >properly translating its NVME_SC to BLK_STS in nvme_error_status(). > >Which is a good thing. > > I don't think there is an issue here of mistakenly converting an nvme > status code to a wrong block status code. This conversion is there > because there is no block status that give us the semantics we need > which is apparently specific to nvme. 
> > I personally don't mind restoring blk_path_error to nvme, I don't > particularly feel strong about it either. But for sure blk_path_error > needs to first provide the semantics needed for NVME_SC_CMD_INTERRUPTED. Hopefully it doesn't. As I discuss above with my "I think NVMe can easily fix". NVMe should trap this class of "local" retryable error _before_ going on to check if higher-level failover is needed. If it does that: problem solved for both NVMe and DM multipathing. (could be missing something though, I'll try to craft a patch later today to shake it out). > ... > > >Anyway, no new BLK_STS is needed at this point. More discipline with > >how NVMe's error handling is changed is. > > Please read the above. I agree we'd need a new BLK_STS only if NVMe cannot be made to trap NVME_SC_CMD_INTERRUPTED for local retry _before_ considering path failover. > >If NVMe wants to ensure its > >interface isn't broken regularly it _should_ use blk_path_error() to > >validate future nvme_error_status() changes. Miscategorizing NVMe > >errors to upper layers is a bug -- not open for debate. > > Again, don't agree is a Miscategorization nor a bug, its just something > that is NVMe specific. Right I understand. Think it safe to assume these types of details are why Christoph wanted to avoid the notion of native NVMe and DM multipathing having compatible error handling. There was some wisdom with that position (especially with native NVMe goals in mind). But if things are tweaked slightly then both camps _can_ be made happy. There just needs to be a willingness to work through the details, defensiveness needs to be shed on both sides, so constructive review/consideration of problems can happen. Think that has already happened here with our exchange. I'm open to investing effort here if others are up for humoring my attempt to explore fixing the issues in a mutually beneficial way. What's the worst that can happen? My simple patches might continue to be ignored? 
;) Mike _______________________________________________ Linux-nvme mailing list Linux-nvme@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: nvme: restore use of blk_path_error() in nvme_complete_rq() 2020-08-07 4:50 ` Mike Snitzer @ 2020-08-07 23:35 ` Sagi Grimberg 2020-08-08 21:08 ` Meneghini, John 2020-08-10 14:36 ` Mike Snitzer 0 siblings, 2 replies; 50+ messages in thread From: Sagi Grimberg @ 2020-08-07 23:35 UTC (permalink / raw) To: Mike Snitzer Cc: Hannes Reinecke, linux-nvme, Christoph Hellwig, dm-devel, Ewan Milne, Chao Leng, Keith Busch, Meneghini, John >> Hey Mike, >> >>>> The point is: blk_path_error() has nothing to do with NVMe errors. >>>> This is dm-multipath logic stuck in the middle of the NVMe error >>>> handling code. >>> >>> No, it is a means to have multiple subsystems (to this point both SCSI >>> and NVMe) doing the correct thing when translating subsystem specific >>> error codes to BLK_STS codes. >> >> Not exactly, don't find any use of this in scsi. The purpose is to make >> sure that nvme and dm speak the same language. > > SCSI doesn't need to do additional work to train a multipath layer > because dm-multipath _is_ SCSI's multipathing in Linux. Agree. > So ensuring SCSI properly classifies its error codes happens as a > side-effect of ensuring continued multipath functionality. > > Hannes introduced all these differentiated error codes in block core > because of SCSI. NVMe is meant to build on the infrastructure that was > established. Yes, exactly my point. blk_path_error is designed to make nvme and dm-multipath speak the same language. >>> If you, or others you name drop below, understood the point we wouldn't >>> be having this conversation. You'd accept the point of blk_path_error() >>> to be valid and required codification of what constitutes a retryable >>> path error for the Linux block layer. >> >> This incident is a case where the specific nvme status was designed >> to retry on the same path respecting the controller retry delay. >> And because nvme used blk_path_error at the time it caused us to use a >> non-retryable status to get around that. 
Granted, no one had >> dm-multipath in mind. >> >> So in a sense, there is consensus on changing patch 35038bffa87da >> _because_ nvme no longer uses blk_path_error. Otherwise it would break. > > "break" meaning it would do failover instead of the more optimal local > retry to the same controller. > > I see. Wish the header for commit 35038bffa87da touched on this even a > little bit ;) I think it did, but maybe didn't put too much emphasis on it. > But AFAICT the patch I provided doesn't compromise proper local retry -- > as long as we first fix nvme_error_status() to return a retryable > BLK_STS for NVME_SC_CMD_INTERRUPTED -- which I assumed as a prereq. > > Think of blk_path_error() as a more coarse-grained "is this retryable or > a hard failure?" check. So for NVME_SC_CMD_INTERRUPTED, > nvme_error_status() _should_ respond with something retryable (I'd > prefer BLK_STS_RESOURCE to be honest). But blk_path_error semantically means "is this a pathing error", or at least that's what its name suggests. > And then nvme_failover_req() is finer-grained; by returning false it now > allows short-circuiting failover and reverting back to NVMe's normal > controller based error recovery -- which it does for > NVME_SC_CMD_INTERRUPTED due to "default" case in nvme_failover_req(). > > And then the previous nvme_error_status() classification of retryable > BLK_STS obviously never gets returned up the IO stack; it gets thrown > away. I see what you are saying. The issue is that the code becomes convoluted (it's a pathing error, oh wait, no it's not a pathing error). >>> Any BLK_STS mapping of NVMe specific error codes would need to not screw >>> up by categorizing a retryable error as non-retryable (and vice-versa). >> >> But it is a special type of retryable. There is nothing that fits the >> semantics of the current behavior. > > I agree. But that's fine actually. 
> And this issue is the poster-child for why properly supporting a duality > of driver-level vs upper level multipathing capabilities is pretty much > impossible unless a clean design factors out the different error classes > -- and local error retry is handled before punting to higher level > failover retry. Think if NVMe were to adopt a bit more disciplined > "local then failover" error handling it all gets easier. I don't think punting before is easier, because we do a local retry if: - no multipathing sw on top - request needs retry (e.g. no DNR, notretry is off etc..) - nvme error is not pathing related (nvme_failover_req returned false) > This local retry _is_ NVMe specific. NVMe should just own retrying on > the same controller no matter what (I'll hope that such retry has > awareness to not retry indefinitely though!). It will retry until the retry limit. > And this has nothing to > do with multipathing, so the logic to handle it shouldn't be trapped in > nvme_failover_req(). Well given that nvme_failover_req already may not actually failover this makes some sense to me (although I did have some resistance to make it that way in the first place, but was convinced otherwise). > I think NVMe can easily fix this by having an earlier stage of checking, > e.g. nvme_local_retry_req(), that shortcircuits ever getting to > higher-level multipathing consideration (be it native NVMe or DM > multipathing) for cases like NVME_SC_CMD_INTERRUPTED. > To be clear: the "default" case of nvme_failover_req() that returns > false to fallback to NVMe's "local" normal NVMe error handling -- that > can stay.. but a more explicit handling of cases like > NVME_SC_CMD_INTERRUPTED should be added to a nvme_local_retry_req() > check that happens before nvme_failover_req() in nvme_complete_rq(). I don't necessarily agree with having a dedicated nvme_local_retry_req(). A request that isn't failed over goes to local error handling (retry or not). 
I actually think that just adding the condition to nvme_complete_rq and having nvme_failover_req reject it would work. Keith? >>> Anyway, no new BLK_STS is needed at this point. More discipline with >>> how NVMe's error handling is changed is. >> >> Please read the above. > > I agree we'd need a new BLK_STS only if NVMe cannot be made to trap > NVME_SC_CMD_INTERRUPTED for local retry _before_ considering path > failover. Not sure that is better, but we can see a patch first to determine. >>> If NVMe wants to ensure its >>> interface isn't broken regularly it _should_ use blk_path_error() to >>> validate future nvme_error_status() changes. Miscategorizing NVMe >>> errors to upper layers is a bug -- not open for debate. >> >> Again, I don't agree it's a miscategorization nor a bug, it's just something >> that is NVMe specific. > > Right, I understand. > > Think it safe to assume these types of details are why Christoph wanted > to avoid the notion of native NVMe and DM multipathing having > compatible error handling. There was some wisdom with that position > (especially with native NVMe goals in mind). But if things are tweaked > slightly then both camps _can_ be made happy. > > There just needs to be a willingness to work through the details, > defensiveness needs to be shed on both sides, so constructive > review/consideration of problems can happen. Agreed. > Think that has already > happened here with our exchange. I'm open to investing effort here if > others are up for humoring my attempt to explore fixing the issues in a > mutually beneficial way. What's the worst that can happen? My simple > patches might continue to be ignored? ;) I won't ignore it, and I apologize for ignoring the original patch posted; I guess it flew under the radar... _______________________________________________ Linux-nvme mailing list Linux-nvme@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme ^ permalink raw reply [flat|nested] 50+ messages in thread
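The retry-versus-failover ordering debated above can be modeled in a few lines. The sketch below is an illustrative user-space model of the decision points, not kernel code: the names echo nvme_complete_rq()/nvme_failover_req(), but the outcome enum, the function bodies, and the simplified status handling are my assumptions for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Status layout as the driver stores it (CQE status field >> 1):
 * SC in bits 7:0, SCT in bits 10:8, DNR in bit 14. */
#define NVME_SC_DNR		0x4000
#define NVME_SC_CMD_INTERRUPTED	0x021
#define NVME_SC_ANA_TRANSITION	0x303

enum outcome { COMPLETE_OK, FAILOVER, LOCAL_RETRY, FAIL_HARD };

/* SCT 3h means "Path Related Status" -- the only class for which
 * retrying on a different path can change the result. */
static bool status_is_path_error(unsigned status)
{
	return (status & 0x700) == 0x300;
}

/* Model of the completion flow discussed above: fail over only for
 * path-related status; everything else retryable is handled locally
 * on the same controller, until DNR (or the retry limit) stops it. */
static enum outcome complete_rq_model(unsigned status, bool mpath)
{
	if (status == 0)
		return COMPLETE_OK;
	if (status & NVME_SC_DNR)
		return FAIL_HARD;	/* controller said: do not retry */
	if (mpath && status_is_path_error(status))
		return FAILOVER;	/* the nvme_failover_req() case */
	return LOCAL_RETRY;		/* e.g. NVME_SC_CMD_INTERRUPTED */
}
```

On this model, NVME_SC_CMD_INTERRUPTED never reaches the failover branch at all, which is the effect of "adding the condition to nvme_complete_rq and having nvme_failover_req reject it."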
* Re: nvme: restore use of blk_path_error() in nvme_complete_rq() 2020-08-07 23:35 ` Sagi Grimberg @ 2020-08-08 21:08 ` Meneghini, John 2020-08-08 21:11 ` Meneghini, John 2020-08-10 8:10 ` Chao Leng 2020-08-10 14:36 ` Mike Snitzer 1 sibling, 2 replies; 50+ messages in thread From: Meneghini, John @ 2020-08-08 21:08 UTC (permalink / raw) To: Sagi Grimberg, Mike Snitzer Cc: linux-nvme, Christoph Hellwig, dm-devel, Ewan Milne, Chao Leng, Keith Busch, Meneghini, John, Hannes Reinecke I'd like to up-level this whole conversation for a minute by talking about exactly what ACRE does. The genesis of the changes discussed in this thread is NVMe TP-4033 - Enhanced Command Retry. You can find a copy of this TP here: http://nvmexpress.org/wp-content/uploads/NVM-Express-1.3-Ratified-TPs.zip This technical proposal added a command retry delay feature which is programmed by the controller. The controller advertises a set of 3 different timing delays through the Identify Controller data structure CRDT{1-3} fields. To make use of these delay fields a new CRD field was added to the CQE Status Field. This allows the NVMe controller to specify exactly how long a command retry should be delayed, with 3 possible timers that it chooses and controls. CRDTs can range from 100 milliseconds to 6553.5 seconds. Because this capability can have such a radical effect on backwards compatibility a new NVMe Feature Identifier was added (Host Behavior Support - Feature ID 16h) with an Advanced Command Retry Enable (ACRE) bit. This allows the host to enable or disable the feature. With this background, there are a couple of misconceptions in this thread which I'd like to address. The first is: ACRE has nothing to do with the NVME_SC_CMD_INTERRUPTED status. Yes, this new error status was added as a part of TP-4033, but the CRD field of the CQE status can be set by the controller with *any* NVMe error status. As long as the DNR bit is not set, the Command Retry Delay can come into effect. 
This is how the spec is written and this is exactly how it has been implemented in the core nvme_complete_rq() function (after change 507fe46ac91276120). For example, the controller can return NVME_SC_NS_NOT_READY with a CRDT of 2 seconds.^ So CRDT needs to be supported with all error statuses if the host is going to enable ACRE, and I think it's a big mistake to get hung up over the NVME_SC_CMD_INTERRUPTED status translation. The NVME_SC_CMD_INTERRUPTED status was only added to the spec to provide a general-purpose "busy" status, something that was missing from NVMe, and focusing your solution on NVME_SC_CMD_INTERRUPTED, or any other specific NVMe error status, is the wrong thing to do. There is a much larger change in error semantics going on with ACRE than just this single error. The second is: All NVMe error statuses that do not have a Status Code Type of 3h (Path Related Status) are subsystem scoped. This is a topic that has gone through some debate on the linux-nvme mailing list and at NVMexpress.org, and there have been some ECNs to the spec to address this. There may be some exceptions to this rule because there are always implementations out there that may not follow it, and there are bugs in the spec. However, this is the intention of the NVMe spec and it matters. This means that, in a multi-pathing environment, retrying any command on a different path will not provide a different result. Retries should all occur on the same controller - unless it is a path related status. This is how NVMe error semantics work and this is a part of what was behind Keith's patch: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=35038bffa87da Retrying NVME_SC_NS_NOT_READY or NVME_SC_CMD_INTERRUPTED on another path is simply not the right thing to do, and returning BLK_STS_TARGET after all command retries, with CRDT, have been exhausted communicates the right thing to the upper layer. 
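The CRD-to-delay mapping described here is small enough to show concretely. Below is a sketch, assuming the driver's convention of storing the CQE status field shifted right by one (which puts CRD in bits 12:11) and CRDT values in units of 100 ms; it mirrors the logic of the kernel's retry path but is not a verbatim copy, and the helper name is mine.

```c
#include <assert.h>

#define NVME_SC_CRD	0x1800	/* CRD field: bits 12:11 of the status */

/*
 * Translate a completion's CRD field into a retry delay in
 * milliseconds. crdt[] holds the three CRDT values the controller
 * reported in Identify Controller, in units of 100 ms; a CRD of 0
 * means no delay was specified.
 */
static unsigned long nvme_retry_delay_ms(unsigned short status,
					 const unsigned short crdt[3])
{
	unsigned short crd = (status & NVME_SC_CRD) >> 11;

	return crd ? (unsigned long)crdt[crd - 1] * 100 : 0;
}
```

The 2-second NVME_SC_NS_NOT_READY example above corresponds to a CRDT value of 20 (20 x 100 ms); the 16-bit maximum of 65535 yields 6553.5 seconds.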
From the perspective of nvme-multipathing Keith's patch was exactly the correct thing to do. I understand that this may have caused a backwards compatibility problem with dm-multipath, and that's the only reason why I've talked about backing it out. However, ultimately, I think nvme-core should return an error status like BLK_STS_TARGET that says, semantically - the IO has failed, no retry will work - because this is what the NVMe error semantics are. Taken together, both of these facts about the NVMe protocol semantics are what's behind my patch which removed blk_path_error() from nvme_complete_rq(): https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=764e9332098c0e60251386a507fe46ac91276120 I understand that there is a goal to try and avoid having different failure/recovery handling semantically in response to different error status between nvme-multipath and dm-multipath, but NVMe error semantics are truly different from SCSI error semantics, and they are changing. The Linux host needs to enable and support those changes unhampered by the past. With this goal in mind, removing the blk_path_error() code from nvme-core was the right thing to do. Hannes and I struggled with the patch to try and make it work with blk_path_error() for weeks. As pointed out in the thread below, blk_path_error() is the SCSI multipathing logic and we can't use it in nvme_complete_rq(). All it does is import all of the legacy problems of dm-multipath, and of SCSI, into the nvme completion/multipath logic. At NVMexpress.org we consciously added the ACRE feature because the SCSI protocol had no such capability. This is something which has plagued SCSI implementations for years, and all kinds of tricks have been played, in both the SCSI host stack and the SCSI target stack, to deal with the problem. The goal of NVMe is to create a better block storage protocol, and ACRE is just one example of many places where the industry is trying to do this. 
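Since ACRE is opt-in per host, turning it on comes down to one Set Features call against the Host Behavior Support feature (FID 16h). The sketch below just builds the 512-byte payload; the structure layout follows the spec's description (byte 0 = ACRE), but the helper name is mine, not the kernel's, and issuing the actual command is left out.

```c
#include <assert.h>
#include <string.h>

#define NVME_FEAT_HOST_BEHAVIOR	0x16

/* Host Behavior Support data structure: 512 bytes, byte 0 = ACRE. */
struct nvme_feat_host_behavior {
	unsigned char acre;
	unsigned char resv[511];
};

/*
 * Fill the Set Features payload that enables or disables ACRE.
 * Returns the feature identifier to place in CDW10. A host that
 * never issues this (or issues it with enable == 0) keeps the
 * pre-ACRE error semantics.
 */
static int nvme_build_acre_payload(struct nvme_feat_host_behavior *host,
				   int enable)
{
	memset(host, 0, sizeof(*host));
	host->acre = enable ? 1 : 0;
	return NVME_FEAT_HOST_BEHAVIOR;
}
```

This feature is also the knob behind the "just disable ACRE" option raised later in the thread: clearing the bit with the same call reverts the controller's retry-delay behavior.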
There are plans to introduce more Host Behavior Support features in the future. In the end, we are consciously changing NVMe, both in the spec and in its implementation, to make it different from SCSI. I think this is what's at the bottom of the changes discussed in this thread, and this is why so many people are so passionate about this. We don't want to turn NVMe into SCSI. I know I don't want to. /John ^Note: (maybe a 2-second delay sounds unreasonable for a flash storage device, but there are implementors who want to use NVMe with spinning disks... so the CRDT gives them exactly what they need). On 8/7/20, 7:35 PM, "Sagi Grimberg" <sagi@grimberg.me> wrote: >> Hey Mike, >> >>>> The point is: blk_path_error() has nothing to do with NVMe errors. >>>> This is dm-multipath logic stuck in the middle of the NVMe error >>>> handling code. >>> >>> No, it is a means to have multiple subsystems (to this point both SCSI >>> and NVMe) doing the correct thing when translating subsystem specific >>> error codes to BLK_STS codes. >> >> Not exactly, don't find any use of this in scsi. The purpose is to make >> sure that nvme and dm speak the same language. > > SCSI doesn't need to do additional work to train a multipath layer > because dm-multipath _is_ SCSI's multipathing in Linux. Agree. > So ensuring SCSI properly classifies its error codes happens as a > side-effect of ensuring continued multipath functionality. > > Hannes introduced all these differentiated error codes in block core > because of SCSI. NVMe is meant to build on the infrastructure that was > established. Yes, exactly my point. blk_path_error is designed to make nvme and dm-multipath speak the same language. >>> If you, or others you name drop below, understood the point we wouldn't >>> be having this conversation. You'd accept the point of blk_path_error() >>> to be valid and required codification of what constitutes a retryable >>> path error for the Linux block layer. 
>> >> This incident is a case where the specific nvme status was designed >> to retry on the same path respecting the controller retry delay. >> And because nvme used blk_path_error at the time it caused us to use a >> non-retryable status to get around that. Granted, no one had >> dm-multipath in mind. >> >> So in a sense, there is consensus on changing patch 35038bffa87da >> _because_ nvme no longer uses blk_path_error. Otherwise it would break. > > "break" meaning it would do failover instead of the more optimal local > retry to the same controller. > > I see. Wish the header for commit 35038bffa87da touched on this even a > little bit ;) I think it did, but maybe didn't put too much emphasis on it. > But AFAICT the patch I provided doesn't compromise proper local retry -- > as long as we first fix nvme_error_status() to return a retryable > BLK_STS for NVME_SC_CMD_INTERRUPTED -- which I assumed as a prereq. > > Think of blk_path_error() as a more coarse-grained "is this retryable or > a hard failure?" check. So for NVME_SC_CMD_INTERRUPTED, > nvme_error_status() _should_ respond with something retryable (I'd > prefer BLK_STS_RESOURCE to be honest). But blk_path_error semantically mean "is this a pathing error", or at least that what its name suggest. > And then nvme_failover_req() is finer-grained; by returning false it now > allows short-circuiting failover and reverting back to NVMe's normal > controller based error recovery -- which it does for > NVME_SC_CMD_INTERRUPTED due to "default" case in nvme_failover_req(). > > And then the previous nvme_error_status() classification of retryable > BLK_STS obviously never gets returned up the IO stack; it gets thrown > away. I see what you are saying. The issue is that the code becomes convoluted (it's a pathing error, oh wait, no its not a pathing error). >>> Any BLK_STS mapping of NVMe specific error codes would need to not screw >>> up by categorizing a retryable error as non-retryable (and vice-versa). 
>> >> But it is a special type of retryable. There is nothing that fits the >> semantics of the current behavior. > > I agree. But that's fine actually. > > And this issue is the poster-child for why properly supporting a duality > of driver-level vs upper level multipathing capabilities is pretty much > impossible unless a clean design factors out the different error classes > -- and local error retry is handled before punting to higher level > failover retry. Think if NVMe were to adopt a bit more disciplined > "local then failover" error handling it all gets easier. I don't think punting before is easier, because we do a local retry if: - no multipathing sw on top - request needs retry (e.g. no DNR, notretry is off etc..) - nvme error is not pathing related (nvme_failover_req returned false) > This local retry _is_ NVMe specific. NVMe should just own retrying on > the same controller no matter what (I'll hope that such retry has > awareness to not retry indefinitely though!). it will retry until the retry limit. > And this has nothing to > do with multipathing, so the logic to handle it shouldn't be trapped in > nvme_failover_req(). Well given that nvme_failover_req already may not actually failover this makes some sense to me (although I did have some resistance to make it that way in the first place, but was convinced otherwise). > I think NVMe can easily fix this by having an earlier stage of checking, > e.g. nvme_local_retry_req(), that shortcircuits ever getting to > higher-level multipathing consideration (be it native NVMe or DM > multipathing) for cases like NVME_SC_CMD_INTERRUPTED. > To be clear: the "default" case of nvme_failover_req() that returns > false to fallback to NVMe's "local" normal NVMe error handling -- that > can stay.. but a more explicit handling of cases like > NVME_SC_CMD_INTERRUPTED should be added to a nvme_local_retry_req() > check that happens before nvme_failover_req() in nvme_complete_rq(). 
I don't necessarily agree with having a dedicated nvme_local_retry_req(). a request that isn't failed over, goes to local error handling (retry or not). I actually think that just adding the condition to nvme_complete_req and having nvme_failover_req reject it would work. Keith? >>> Anyway, no new BLK_STS is needed at this point. More discipline with >>> how NVMe's error handling is changed is. >> >> Please read the above. > > I agree we'd need a new BLK_STS only if NVMe cannot be made to trap > NVME_SC_CMD_INTERRUPTED for local retry _before_ considering path > failover. Not sure that is better, but we can see a patch first to determine. >>> If NVMe wants to ensure its >>> interface isn't broken regularly it _should_ use blk_path_error() to >>> validate future nvme_error_status() changes. Miscategorizing NVMe >>> errors to upper layers is a bug -- not open for debate. >> >> Again, don't agree is a Miscategorization nor a bug, its just something >> that is NVMe specific. > > Right I understand. > > Think it safe to assume these types of details are why Christoph wanted > to avoid the notion of native NVMe and DM multipathing having > compatible error handling. There was some wisdom with that position > (especially with native NVMe goals in mind). But if things are tweaked > slightly then both camps _can_ be made happy. > > There just needs to be a willingness to work through the details, > defensiveness needs to be shed on both sides, so constructive > review/consideration of problems can happen. Agreed. > Think that has already > happened here with our exchange. I'm open to investing effort here if > others are up for humoring my attempt to explore fixing the issues in a > mutually beneficial way. What's the worst that can happen? My simple > patches might continue to be ignored? ;) I won't ignore it, and I apologize of ignoring the original patch posted, I guess it flew under the radar... 
* Re: nvme: restore use of blk_path_error() in nvme_complete_rq() 2020-08-08 21:08 ` Meneghini, John @ 2020-08-08 21:11 ` Meneghini, John 2020-08-10 14:48 ` Mike Snitzer 2020-08-10 8:10 ` Chao Leng 1 sibling, 1 reply; 50+ messages in thread From: Meneghini, John @ 2020-08-08 21:11 UTC (permalink / raw) To: Sagi Grimberg, Mike Snitzer Cc: linux-nvme, Christoph Hellwig, dm-devel, Ewan Milne, Chao Leng, Keith Busch, Meneghini, John, Hannes Reinecke Having said all of this, I think one thing which could be done to help support NVMe with DMP is to approximate any new NVMe error semantics by adding new BLK_STS codes. Reusing or overloading existing BLK_STS may not work well because everyone else already has a claim on those semantics. I would suggest adding new BLK_STS to the nvme_error_status() routine. I think this is the one place where NVMe status to BLK_STS translation belongs, and changes can be made there without impacting any of the NVMe error handling logic in nvme-core - now that we've removed blk_path_error() from nvme_complete_rq(). Another choice would be to simply disable ACRE. This won't solve the problem of changing NVMe error semantics with dm-multipath but it can certainly simplify the problem. Maybe a patch that disables ACRE when REQ_NVME_MPATH is clear. Or you can always just turn the feature off with the nvme-cli. I'll be sure to keep an eye on the Host Behavior Support feature at NVMexpress.org. We created this new nvme feature control mechanism to help with problems like this and I'll be sure to watch out for new protocol features which could break compatibility with dm-multipath in the future. /John On 8/8/20, 5:08 PM, "Meneghini, John" <John.Meneghini@netapp.com> wrote: I'd like to up level this whole conversation for a minute by talking about exactly what ACRE does. The genesis of the changes discussed in this thread is NVMe TP-4033 - Enhanced Command Retry. 
You can find a copy of this TP here: http://nvmexpress.org/wp-content/uploads/NVM-Express-1.3-Ratified-TPs.zip This technical proposal added a command retry delay feature which is programmed by the controller. The controller advertises a set of 3 different timing delays though the Identify Controller data structure CRDT{1-2} fields. To make use of these delay fields a new CRD field was added to the CQE Status Field. This allows the NVMe controller to specify exactly how long a command retry should be delayed, with 3 possible timers that it chooses and controls. CRDTs can range from 100 milliseconds to 6559 seconds. Because this capability can have such a radical effect on backwards compatibility a new NVMe Feature Identifier was added (Host Behavior Support - Feature ID 16h) with an Advanced Command Retry Enable (ACRE) bit. This allows the host to enable or disable the feature. With this background there are a couple of misconceptions in this thread which I'd like to address. The first is: ACRE has nothing to do with the NVME_SC_CMD_INTERRUPTED status. Yes, this new error status was added as a part of TP-4033 but the CRD field of the CQE status can be set by the controller with *any* NVMe error status. As long as the DNR bit is not set the Command Retry Delay can come into effect. This is how the spec is written and this is exactly how it has been implemented in the core nvme_complete_rq() function (after change 507fe46ac91276120). For example, the controller can return NVME_SC_NS_NOT_READY with a CRDT of 2 seconds.^ So CDRT needs to be supported with all error status if the host is going to enable ACRE, and I think it's a big mistake to get hung up over the NVME_SC_CMD_INTERRUPTED status translation. The NVME_SC_CMD_INTERRUPTED status was only added to the spec to provide a general purpose "busy" status, something that was missing from NVMe, and focusing your solution on NVME_SC_CMD_INTERRUPTED, or any other specific NVMe error status, is the wrong thing to do. 
There is a much larger change in error semantics going on with ACRE than just this single error. The second is: All NVMe error status that do not have a Status Code Type of 3h (Path Related Status) are subsystem scoped. This is a topic that has gone through some debate on the linux-nvme mailing list and at NVMexpress.org; and there have been some ECNs to the spec to address this. There may be some exceptions to this rule because there are always implementations out there that may not follow, and there are bugs in the spec. However, this is the intention of the NVMe spec and it matters. This means that, in a multi-pathing environment, retrying any command on a different path will not provide a different result. Retries should all occur on the same controller - unless it is a path related status. This is how NVMe error semantics work and this is a part of what was behind Keith's patch . https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=35038bffa87da Retrying NVME_SC_NS_NOT_READY or NVME_SC_CMD_INTERRUPTED on another path is simply not the right thing to do, and returning BLK_STS_TARGET after all command retries, with CRDT, have been exhausted communicates the right thing to the upper layer. From the perspective of nvme-multipathing Keith's patch was exactly the correct thing to do. I understand that this may have caused a backwards compatibly problem with dm-multipath, and that's the only reason why I've talked about backing it out. However, ultimately, I think nvme-core should return an error status like BLK_STS_TARGET that says, semantically - the IO has failed, no retry will work - because this is what the NVMe error semantics are. 
Taken together both of these facts about the NVMe protocol semantics are what's behind my patch which removed blk_path_error() from nvme_complete_rq() https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=764e9332098c0e60251386a507fe46ac91276120 I understand that there is a goal to try and avoid having different failure/recovery handling semantically in response to different error status between nvme-multipath and dm-multipath, but NVMe error semantics are truly different from SCSI error semantics, and they are changing. The Linux host needs to enable and support those changes unhampered by the past. With this goal in mind, removing the blk_path_error() code from nvme-core was the right thing to do. Hannes and I struggled with the patch to try and make it work with blk_path_error() for weeks. As pointed out in the thread below, blk_path_error() is the SCSI multipathing logic and we can't use it in nvme_complete_rq(). All it does is import all of the legacy problems of dm-multipath, and of SCSI, into the nvme completion/multipath logic. At NVMexpress.org we consciously added the ACRE feature because the SCSI protocol had no such capability. This is something which has plagued SCSI implementations for years, and all kinds of tricks have been played, in both the SCSI host stack and in SCSI target stack, to deal with the problem. The goal of NVMe is to create a better block storage protocol and ACRE is just one example of many places where the industry is trying to do this. There are plans to introduce more Host Behavior Support features in the future. In the end, we are consciously changing NVMe, both in the spec and in its implementation, to make it different from SCSI. I think this is what's at the bottom of the changes discussed in this thread, and this is why so many people are so passionate about this. We don't want to turn NVMe into SCSI. I know I don't want to. 
/John ^Note: (maybe a 2 second delay sounds unreasonable for a flash storage device but there implementors that that want to use NVMe with spinning disks... so the CDRT gives them exactly what they need). On 8/7/20, 7:35 PM, "Sagi Grimberg" <sagi@grimberg.me> wrote: >> Hey Mike, >> >>>> The point is: blk_path_error() has nothing to do with NVMe errors. >>>> This is dm-multipath logic stuck in the middle of the NVMe error >>>> handling code. >>> >>> No, it is a means to have multiple subsystems (to this point both SCSI >>> and NVMe) doing the correct thing when translating subsystem specific >>> error codes to BLK_STS codes. >> >> Not exactly, don't find any use of this in scsi. The purpose is to make >> sure that nvme and dm speak the same language. > > SCSI doesn't need to do additional work to train a multipath layer > because dm-multipath _is_ SCSI's multipathing in Linux. Agree. > So ensuring SCSI properly classifies its error codes happens as a > side-effect of ensuring continued multipath functionality. > > Hannes introduced all these differentiated error codes in block core > because of SCSI. NVMe is meant to build on the infrastructure that was > established. Yes, exactly my point. blk_path_error is designed to make nvme and dm-multipath speak the same language. >>> If you, or others you name drop below, understood the point we wouldn't >>> be having this conversation. You'd accept the point of blk_path_error() >>> to be valid and required codification of what constitutes a retryable >>> path error for the Linux block layer. >> >> This incident is a case where the specific nvme status was designed >> to retry on the same path respecting the controller retry delay. >> And because nvme used blk_path_error at the time it caused us to use a >> non-retryable status to get around that. Granted, no one had >> dm-multipath in mind. >> >> So in a sense, there is consensus on changing patch 35038bffa87da >> _because_ nvme no longer uses blk_path_error. 
Otherwise it would break. > > "break" meaning it would do failover instead of the more optimal local > retry to the same controller. > > I see. Wish the header for commit 35038bffa87da touched on this even a > little bit ;) I think it did, but maybe didn't put too much emphasis on it. > But AFAICT the patch I provided doesn't compromise proper local retry -- > as long as we first fix nvme_error_status() to return a retryable > BLK_STS for NVME_SC_CMD_INTERRUPTED -- which I assumed as a prereq. > > Think of blk_path_error() as a more coarse-grained "is this retryable or > a hard failure?" check. So for NVME_SC_CMD_INTERRUPTED, > nvme_error_status() _should_ respond with something retryable (I'd > prefer BLK_STS_RESOURCE to be honest). But blk_path_error semantically mean "is this a pathing error", or at least that what its name suggest. > And then nvme_failover_req() is finer-grained; by returning false it now > allows short-circuiting failover and reverting back to NVMe's normal > controller based error recovery -- which it does for > NVME_SC_CMD_INTERRUPTED due to "default" case in nvme_failover_req(). > > And then the previous nvme_error_status() classification of retryable > BLK_STS obviously never gets returned up the IO stack; it gets thrown > away. I see what you are saying. The issue is that the code becomes convoluted (it's a pathing error, oh wait, no its not a pathing error). >>> Any BLK_STS mapping of NVMe specific error codes would need to not screw >>> up by categorizing a retryable error as non-retryable (and vice-versa). >> >> But it is a special type of retryable. There is nothing that fits the >> semantics of the current behavior. > > I agree. But that's fine actually. 
> > And this issue is the poster-child for why properly supporting a duality > of driver-level vs upper level multipathing capabilities is pretty much > impossible unless a clean design factors out the different error classes > -- and local error retry is handled before punting to higher level > failover retry. Think if NVMe were to adopt a bit more disciplined > "local then failover" error handling it all gets easier. I don't think punting before is easier, because we do a local retry if: - no multipathing sw on top - request needs retry (e.g. no DNR, notretry is off etc..) - nvme error is not pathing related (nvme_failover_req returned false) > This local retry _is_ NVMe specific. NVMe should just own retrying on > the same controller no matter what (I'll hope that such retry has > awareness to not retry indefinitely though!). it will retry until the retry limit. > And this has nothing to > do with multipathing, so the logic to handle it shouldn't be trapped in > nvme_failover_req(). Well given that nvme_failover_req already may not actually failover this makes some sense to me (although I did have some resistance to make it that way in the first place, but was convinced otherwise). > I think NVMe can easily fix this by having an earlier stage of checking, > e.g. nvme_local_retry_req(), that shortcircuits ever getting to > higher-level multipathing consideration (be it native NVMe or DM > multipathing) for cases like NVME_SC_CMD_INTERRUPTED. > To be clear: the "default" case of nvme_failover_req() that returns > false to fallback to NVMe's "local" normal NVMe error handling -- that > can stay.. but a more explicit handling of cases like > NVME_SC_CMD_INTERRUPTED should be added to a nvme_local_retry_req() > check that happens before nvme_failover_req() in nvme_complete_rq(). I don't necessarily agree with having a dedicated nvme_local_retry_req(). a request that isn't failed over, goes to local error handling (retry or not). 
I actually think that just adding the condition to nvme_complete_req and having nvme_failover_req reject it would work. Keith? >>> Anyway, no new BLK_STS is needed at this point. More discipline with >>> how NVMe's error handling is changed is. >> >> Please read the above. > > I agree we'd need a new BLK_STS only if NVMe cannot be made to trap > NVME_SC_CMD_INTERRUPTED for local retry _before_ considering path > failover. Not sure that is better, but we can see a patch first to determine. >>> If NVMe wants to ensure its >>> interface isn't broken regularly it _should_ use blk_path_error() to >>> validate future nvme_error_status() changes. Miscategorizing NVMe >>> errors to upper layers is a bug -- not open for debate. >> >> Again, don't agree is a Miscategorization nor a bug, its just something >> that is NVMe specific. > > Right I understand. > > Think it safe to assume these types of details are why Christoph wanted > to avoid the notion of native NVMe and DM multipathing having > compatible error handling. There was some wisdom with that position > (especially with native NVMe goals in mind). But if things are tweaked > slightly then both camps _can_ be made happy. > > There just needs to be a willingness to work through the details, > defensiveness needs to be shed on both sides, so constructive > review/consideration of problems can happen. Agreed. > Think that has already > happened here with our exchange. I'm open to investing effort here if > others are up for humoring my attempt to explore fixing the issues in a > mutually beneficial way. What's the worst that can happen? My simple > patches might continue to be ignored? ;) I won't ignore it, and I apologize of ignoring the original patch posted, I guess it flew under the radar... _______________________________________________ Linux-nvme mailing list Linux-nvme@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: nvme: restore use of blk_path_error() in nvme_complete_rq() 2020-08-08 21:11 ` Meneghini, John @ 2020-08-10 14:48 ` Mike Snitzer 2020-08-11 12:54 ` Meneghini, John 0 siblings, 1 reply; 50+ messages in thread From: Mike Snitzer @ 2020-08-10 14:48 UTC (permalink / raw) To: Meneghini, John Cc: Sagi Grimberg, linux-nvme, Christoph Hellwig, dm-devel, Ewan Milne, Chao Leng, Keith Busch, Hannes Reinecke Hi John, I appreciate your insight about ACRE. And that NVME_SC_CMD_INTERRUPTED is really just a preview for the kind of NVMe error handling advances to come. But, because I started harping about blk_path_error() I got you hyper-focused on it, and that led to all your thinking about me wanting to somehow normalize both SCSI and NVMe. Couldn't be further from my thinking. I've always wanted Linux's NVMe driver to do whatever it needed to handle the evolution of NVMe. I just really need it to do it in a way that respects NVMe's spec for ANA, ACRE and the like, and in a way that doesn't leave NVMe error handling broken if native NVMe multipath isn't used. The good news is I still don't see a need to model NVMe error handling with BLK_STS et al. But AFAICT the NVMe error handling needs some fixing/cleanup. To be continued, with patches so we can collectively have a more productive discussion. Thanks, Mike On Sat, Aug 08 2020 at 5:11pm -0400, Meneghini, John <John.Meneghini@netapp.com> wrote: > Having said all of this, I think one thing which could be done to help > support NVMe with DMP is to approximate any new NVMe error semantics > by adding new BLK_STS codes. Reusing or overloading existing BLK_STS > may not work well because everyone else already has a claim on those > semantics. I would suggest adding new BLK_STS to the > nvme_error_status() routine. 
I think this is the one place where NVMe > status to BLK_STS translation belongs, and changes can be made there > without impacting any of the NVMe error handling logic in nvme-core - > now that we've removed blk_path_error() from nvme_complete_rq(). > > Another choice would be to simply disable ACRE. This won't solve the > problem of changing NVMe error semantics with dm-multipath but it can > certainly simplify the problem. Maybe a patch that disables ACRE when > REQ_NVME_MPATH is clear. Or you can always just turn the feature off > with the nvme-cli. I'll be sure to keep an eye on the Host Behavior > Support feature at NVMexpress.org. We created this new nvme feature > control mechanism to help with problems like this and I'll be sure to > watch out for new protocol features which could break compatibility > with dm-multipath in the future. > > /John > > On 8/8/20, 5:08 PM, "Meneghini, John" <John.Meneghini@netapp.com> wrote: > > I'd like to up level this whole conversation for a minute by > talking about exactly what ACRE does. > > The genesis of the changes discussed in this thread is NVMe > TP-4033 - Enhanced Command Retry. You can find a copy of this TP > here: > > http://nvmexpress.org/wp-content/uploads/NVM-Express-1.3-Ratified-TPs.zip > > This technical proposal added a command retry delay feature which > is programmed by the controller. The controller advertises a set of 3 > different timing delays through the Identify Controller data structure > CRDT{1-3} fields. To make use of these delay fields a new CRD field > was added to the CQE Status Field. This allows the NVMe controller to > specify exactly how long a command retry should be delayed, with 3 > possible timers that it chooses and controls. CRDTs can range from > 100 milliseconds to 6553.5 seconds.
Because this capability can have > such a radical effect on backwards compatibility a new NVMe Feature > Identifier was added (Host Behavior Support - Feature ID 16h) with an > Advanced Command Retry Enable (ACRE) bit. This allows the host to > enable or disable the feature. > > With this background there are a couple of misconceptions in this > thread which I'd like to address. > > The first is: ACRE has nothing to do with the > NVME_SC_CMD_INTERRUPTED status. Yes, this new error status was added > as a part of TP-4033 but the CRD field of the CQE status can be set by > the controller with *any* NVMe error status. As long as the DNR bit is > not set the Command Retry Delay can come into effect. This is how the > spec is written and this is exactly how it has been implemented in the > core nvme_complete_rq() function (after change 507fe46ac91276120). For > example, the controller can return NVME_SC_NS_NOT_READY with a CRDT of > 2 seconds.^ So CDRT needs to be supported with all error status if > the host is going to enable ACRE, and I think it's a big mistake to > get hung up over the NVME_SC_CMD_INTERRUPTED status translation. The > NVME_SC_CMD_INTERRUPTED status was only added to the spec to provide a > general purpose "busy" status, something that was missing from NVMe, > and focusing your solution on NVME_SC_CMD_INTERRUPTED, or any other > specific NVMe error status, is the wrong thing to do. There is a much > larger change in error semantics going on with ACRE than just this > single error. > > The second is: All NVMe error status that do not have a Status > Code Type of 3h (Path Related Status) are subsystem scoped. This is a > topic that has gone through some debate on the linux-nvme mailing list > and at NVMexpress.org; and there have been some ECNs to the spec to > address this. There may be some exceptions to this rule because there > are always implementations out there that may not follow, and there > are bugs in the spec. 
However, this is the intention of the NVMe spec > and it matters. This means that, in a multi-pathing environment, > retrying any command on a different path will not provide a different > result. Retries should all occur on the same controller - unless it is > a path related status. This is how NVMe error semantics work and this > is a part of what was behind Keith's patch . > > https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=35038bffa87da > > Retrying NVME_SC_NS_NOT_READY or NVME_SC_CMD_INTERRUPTED on > another path is simply not the right thing to do, and returning > BLK_STS_TARGET after all command retries, with CRDT, have been > exhausted communicates the right thing to the upper layer. From the > perspective of nvme-multipathing Keith's patch was exactly the correct > thing to do. I understand that this may have caused a backwards > compatibly problem with dm-multipath, and that's the only reason why > I've talked about backing it out. However, ultimately, I think > nvme-core should return an error status like BLK_STS_TARGET that > says, semantically - the IO has failed, no retry will work - because > this is what the NVMe error semantics are. > > Taken together both of these facts about the NVMe protocol > semantics are what's behind my patch which removed blk_path_error() > from nvme_complete_rq() > > https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=764e9332098c0e60251386a507fe46ac91276120 > > I understand that there is a goal to try and avoid having > different failure/recovery handling semantically in response to > different error status between nvme-multipath and dm-multipath, but > NVMe error semantics are truly different from SCSI error semantics, > and they are changing. The Linux host needs to enable and support > those changes unhampered by the past. With this goal in mind, removing > the blk_path_error() code from nvme-core was the right thing to do. 
> Hannes and I struggled with the patch to try and make it work with > blk_path_error() for weeks. As pointed out in the thread below, > blk_path_error() is the SCSI multipathing logic and we can't use it in > nvme_complete_rq(). All it does is import all of the legacy problems > of dm-multipath, and of SCSI, into the nvme completion/multipath > logic. > > At NVMexpress.org we consciously added the ACRE feature because > the SCSI protocol had no such capability. This is something which has > plagued SCSI implementations for years, and all kinds of tricks have > been played, in both the SCSI host stack and in SCSI target stack, to > deal with the problem. The goal of NVMe is to create a better block > storage protocol and ACRE is just one example of many places where the > industry is trying to do this. There are plans to introduce more Host > Behavior Support features in the future. > > In the end, we are consciously changing NVMe, both in the spec and > in its implementation, to make it different from SCSI. I think this is > what's at the bottom of the changes discussed in this thread, and this > is why so many people are so passionate about this. We don't want to > turn NVMe into SCSI. I know I don't want to. > > /John > > ^Note: (maybe a 2 second delay sounds unreasonable for a flash storage device but there implementors that that want to use NVMe with spinning disks... so the CDRT gives them exactly what they need). > > On 8/7/20, 7:35 PM, "Sagi Grimberg" <sagi@grimberg.me> wrote: > > >> Hey Mike, > >> > >>>> The point is: blk_path_error() has nothing to do with NVMe errors. > >>>> This is dm-multipath logic stuck in the middle of the NVMe error > >>>> handling code. > >>> > >>> No, it is a means to have multiple subsystems (to this point both SCSI > >>> and NVMe) doing the correct thing when translating subsystem specific > >>> error codes to BLK_STS codes. > >> > >> Not exactly, don't find any use of this in scsi. 
The purpose is to make > >> sure that nvme and dm speak the same language. > > > > SCSI doesn't need to do additional work to train a multipath layer > > because dm-multipath _is_ SCSI's multipathing in Linux. > > Agree. > > > So ensuring SCSI properly classifies its error codes happens as a > > side-effect of ensuring continued multipath functionality. > > > > Hannes introduced all these differentiated error codes in block core > > because of SCSI. NVMe is meant to build on the infrastructure that was > > established. > > Yes, exactly my point. blk_path_error is designed to make nvme and > dm-multipath speak the same language. > > >>> If you, or others you name drop below, understood the point we wouldn't > >>> be having this conversation. You'd accept the point of blk_path_error() > >>> to be valid and required codification of what constitutes a retryable > >>> path error for the Linux block layer. > >> > >> This incident is a case where the specific nvme status was designed > >> to retry on the same path respecting the controller retry delay. > >> And because nvme used blk_path_error at the time it caused us to use a > >> non-retryable status to get around that. Granted, no one had > >> dm-multipath in mind. > >> > >> So in a sense, there is consensus on changing patch 35038bffa87da > >> _because_ nvme no longer uses blk_path_error. Otherwise it would break. > > > > "break" meaning it would do failover instead of the more optimal local > > retry to the same controller. > > > > I see. Wish the header for commit 35038bffa87da touched on this even a > > little bit ;) > > I think it did, but maybe didn't put too much emphasis on it. > > > But AFAICT the patch I provided doesn't compromise proper local retry -- > > as long as we first fix nvme_error_status() to return a retryable > > BLK_STS for NVME_SC_CMD_INTERRUPTED -- which I assumed as a prereq. > > > > Think of blk_path_error() as a more coarse-grained "is this retryable or > > a hard failure?" check. 
So for NVME_SC_CMD_INTERRUPTED, > > nvme_error_status() _should_ respond with something retryable (I'd > > prefer BLK_STS_RESOURCE to be honest). > > But blk_path_error semantically mean "is this a pathing error", or at > least that what its name suggest. > > > And then nvme_failover_req() is finer-grained; by returning false it now > > allows short-circuiting failover and reverting back to NVMe's normal > > controller based error recovery -- which it does for > > NVME_SC_CMD_INTERRUPTED due to "default" case in nvme_failover_req(). > > > > And then the previous nvme_error_status() classification of retryable > > BLK_STS obviously never gets returned up the IO stack; it gets thrown > > away. > > I see what you are saying. The issue is that the code becomes > convoluted (it's a pathing error, oh wait, no its not a pathing error). > > >>> Any BLK_STS mapping of NVMe specific error codes would need to not screw > >>> up by categorizing a retryable error as non-retryable (and vice-versa). > >> > >> But it is a special type of retryable. There is nothing that fits the > >> semantics of the current behavior. > > > > I agree. But that's fine actually. > > > > And this issue is the poster-child for why properly supporting a duality > > of driver-level vs upper level multipathing capabilities is pretty much > > impossible unless a clean design factors out the different error classes > > -- and local error retry is handled before punting to higher level > > failover retry. Think if NVMe were to adopt a bit more disciplined > > "local then failover" error handling it all gets easier. > > I don't think punting before is easier, because we do a local retry if: > - no multipathing sw on top > - request needs retry (e.g. no DNR, notretry is off etc..) > - nvme error is not pathing related (nvme_failover_req returned false) > > > This local retry _is_ NVMe specific. 
NVMe should just own retrying on > > the same controller no matter what (I'll hope that such retry has > > awareness to not retry indefinitely though!). > > it will retry until the retry limit. > > > And this has nothing to > > do with multipathing, so the logic to handle it shouldn't be trapped in > > nvme_failover_req(). > > Well given that nvme_failover_req already may not actually failover this > makes some sense to me (although I did have some resistance to make it > that way in the first place, but was convinced otherwise). > > > I think NVMe can easily fix this by having an earlier stage of checking, > > e.g. nvme_local_retry_req(), that shortcircuits ever getting to > > higher-level multipathing consideration (be it native NVMe or DM > > multipathing) for cases like NVME_SC_CMD_INTERRUPTED. > > To be clear: the "default" case of nvme_failover_req() that returns > > false to fallback to NVMe's "local" normal NVMe error handling -- that > > can stay.. but a more explicit handling of cases like > > NVME_SC_CMD_INTERRUPTED should be added to a nvme_local_retry_req() > > check that happens before nvme_failover_req() in nvme_complete_rq(). > > I don't necessarily agree with having a dedicated nvme_local_retry_req(). > a request that isn't failed over, goes to local error handling (retry or > not). I actually think that just adding the condition to > nvme_complete_req and having nvme_failover_req reject it would work. > > Keith? > > >>> Anyway, no new BLK_STS is needed at this point. More discipline with > >>> how NVMe's error handling is changed is. > >> > >> Please read the above. > > > > I agree we'd need a new BLK_STS only if NVMe cannot be made to trap > > NVME_SC_CMD_INTERRUPTED for local retry _before_ considering path > > failover. > > Not sure that is better, but we can see a patch first to determine. 
> > >>> If NVMe wants to ensure its > >>> interface isn't broken regularly it _should_ use blk_path_error() to > >>> validate future nvme_error_status() changes. Miscategorizing NVMe > >>> errors to upper layers is a bug -- not open for debate. > >> > >> Again, don't agree it is a miscategorization nor a bug, it's just something > >> that is NVMe specific. > > > > Right, I understand. > > > > Think it safe to assume these types of details are why Christoph wanted > > to avoid the notion of native NVMe and DM multipathing having > > compatible error handling. There was some wisdom with that position > > (especially with native NVMe goals in mind). But if things are tweaked > > slightly then both camps _can_ be made happy. > > > > There just needs to be a willingness to work through the details, > > defensiveness needs to be shed on both sides, so constructive > > review/consideration of problems can happen. > > Agreed. > > > Think that has already > > happened here with our exchange. I'm open to investing effort here if > > others are up for humoring my attempt to explore fixing the issues in a > > mutually beneficial way. What's the worst that can happen? My simple > > patches might continue to be ignored? ;) > > I won't ignore it, and I apologize for ignoring the original patch > posted, I guess it flew under the radar...
* Re: nvme: restore use of blk_path_error() in nvme_complete_rq() 2020-08-10 14:48 ` Mike Snitzer @ 2020-08-11 12:54 ` Meneghini, John 0 siblings, 0 replies; 50+ messages in thread From: Meneghini, John @ 2020-08-11 12:54 UTC (permalink / raw) To: Mike Snitzer Cc: Sagi Grimberg, linux-nvme, Christoph Hellwig, dm-devel, Ewan Milne, Chao Leng, Keith Busch, Meneghini, John, Hannes Reinecke On 8/10/20, 10:49 AM, "Mike Snitzer" <snitzer@redhat.com> wrote: I appreciate your insight about ACRE. And that NVME_SC_CMD_INTERRUPTED is really just a preview for the kind of NVMe error handling advances to come. But, because I started harping about blk_path_error() I got you hyper-focused on it and that led to all your thinking about me wanting to somehow normalize both SCSI and NVMe. Couldn't be further from my thinking. I've always wanted Linux's NVMe driver to do whatever it needed to handle the evolution of NVMe. I just really need it to do it in a way that: respects NVMe's spec for ANA, ACRE and the like in a way that doesn't make NVMe error handling broken if native NVMe multipath isn't used. Agreed Mike. I just wanted to be sure we are understanding the problem clearly. The good news is I still don't see a need to model NVMe error handling with BLK_STS et al. Yes, I agree some alternative to modeling things around BLK_STS would be good... perhaps even required. I'll admit my suggested solutions don't help much. But AFAICT the NVMe error handling needs some fixing/cleanup. To be continued, with patches so we can collectively have a more productive discussion. If someone can come up with a solution that allows nvme-core error handling to work freely with and without native nvme-multipathing, I'm all for it. But we also need to make sure that any solution is future proof because more changes are coming.
/John Thanks, Mike On Sat, Aug 08 2020 at 5:11pm -0400, Meneghini, John <John.Meneghini@netapp.com> wrote: > Having said all of this, I think one thing which could be done to help > support NVMe with DMP is to approximate any new NVMe error semantics > by adding new BLK_STS codes. Reusing or overloading existing BLK_STS > may not work well because everyone else already has a claim on those > semantics. I would suggest adding new BLK_STS to the > nvme_error_status() routine. I think this is the one place where NVMe > status to BLK_STS translation belongs, and changes can be made there > without impacting any of the NVMe error handling logic in nvme-core - > now that we've removed blk_path_error() from nvme_complete_rq(). > > Another choice would be to simply disable ACRE. This won't solve the > problem of changing NVMe error semantics with dm-multipath but it can > certainly simplify the problem. Maybe a patch that disables ACRE when > REQ_NVME_MPATH is clear. Or you can always just turn the feature off > with the nvme-cli. I'll be sure to keep an eye on the Host Behavior > Support feature at NVMexpress.org. We created this new nvme feature > control mechanism to help with problems like this and I'll be sure to > watch out for new protocol features which could break compatibility > with dm-multipath in the future. > > /John > > On 8/8/20, 5:08 PM, "Meneghini, John" <John.Meneghini@netapp.com> wrote: > > I'd like to up level this whole conversation for a minute by > talking about exactly what ACRE does. > > The genesis of the changes discussed in this thread is NVMe > TP-4033 - Enhanced Command Retry. You can find a copy of this TP > here: > > http://nvmexpress.org/wp-content/uploads/NVM-Express-1.3-Ratified-TPs.zip > > This technical proposal added a command retry delay feature which > is programmed by the controller. The controller advertises a set of 3 > different timing delays though the Identify Controller data structure > CRDT{1-2} fields. 
To make use of these delay fields a new CRD field > was added to the CQE Status Field. This allows the NVMe controller to > specify exactly how long a command retry should be delayed, with 3 > possible timers that it chooses and controls. CRDTs can range from > 100 milliseconds to 6559 seconds. Because this capability can have > such a radical effect on backwards compatibility a new NVMe Feature > Identifier was added (Host Behavior Support - Feature ID 16h) with an > Advanced Command Retry Enable (ACRE) bit. This allows the host to > enable or disable the feature. > > With this background there are a couple of misconceptions in this > thread which I'd like to address. > > The first is: ACRE has nothing to do with the > NVME_SC_CMD_INTERRUPTED status. Yes, this new error status was added > as a part of TP-4033 but the CRD field of the CQE status can be set by > the controller with *any* NVMe error status. As long as the DNR bit is > not set the Command Retry Delay can come into effect. This is how the > spec is written and this is exactly how it has been implemented in the > core nvme_complete_rq() function (after change 507fe46ac91276120). For > example, the controller can return NVME_SC_NS_NOT_READY with a CRDT of > 2 seconds.^ So CDRT needs to be supported with all error status if > the host is going to enable ACRE, and I think it's a big mistake to > get hung up over the NVME_SC_CMD_INTERRUPTED status translation. The > NVME_SC_CMD_INTERRUPTED status was only added to the spec to provide a > general purpose "busy" status, something that was missing from NVMe, > and focusing your solution on NVME_SC_CMD_INTERRUPTED, or any other > specific NVMe error status, is the wrong thing to do. There is a much > larger change in error semantics going on with ACRE than just this > single error. > > The second is: All NVMe error status that do not have a Status > Code Type of 3h (Path Related Status) are subsystem scoped. 
This is a > topic that has gone through some debate on the linux-nvme mailing list > and at NVMexpress.org; and there have been some ECNs to the spec to > address this. There may be some exceptions to this rule because there > are always implementations out there that may not follow, and there > are bugs in the spec. However, this is the intention of the NVMe spec > and it matters. This means that, in a multi-pathing environment, > retrying any command on a different path will not provide a different > result. Retries should all occur on the same controller - unless it is > a path related status. This is how NVMe error semantics work and this > is a part of what was behind Keith's patch . > > https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=35038bffa87da > > Retrying NVME_SC_NS_NOT_READY or NVME_SC_CMD_INTERRUPTED on > another path is simply not the right thing to do, and returning > BLK_STS_TARGET after all command retries, with CRDT, have been > exhausted communicates the right thing to the upper layer. From the > perspective of nvme-multipathing Keith's patch was exactly the correct > thing to do. I understand that this may have caused a backwards > compatibly problem with dm-multipath, and that's the only reason why > I've talked about backing it out. However, ultimately, I think > nvme-core should return an error status like BLK_STS_TARGET that > says, semantically - the IO has failed, no retry will work - because > this is what the NVMe error semantics are. 
> > Taken together both of these facts about the NVMe protocol > semantics are what's behind my patch which removed blk_path_error() > from nvme_complete_rq() > > https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=764e9332098c0e60251386a507fe46ac91276120 > > I understand that there is a goal to try and avoid having > different failure/recovery handling semantically in response to > different error status between nvme-multipath and dm-multipath, but > NVMe error semantics are truly different from SCSI error semantics, > and they are changing. The Linux host needs to enable and support > those changes unhampered by the past. With this goal in mind, removing > the blk_path_error() code from nvme-core was the right thing to do. > Hannes and I struggled with the patch to try and make it work with > blk_path_error() for weeks. As pointed out in the thread below, > blk_path_error() is the SCSI multipathing logic and we can't use it in > nvme_complete_rq(). All it does is import all of the legacy problems > of dm-multipath, and of SCSI, into the nvme completion/multipath > logic. > > At NVMexpress.org we consciously added the ACRE feature because > the SCSI protocol had no such capability. This is something which has > plagued SCSI implementations for years, and all kinds of tricks have > been played, in both the SCSI host stack and in SCSI target stack, to > deal with the problem. The goal of NVMe is to create a better block > storage protocol and ACRE is just one example of many places where the > industry is trying to do this. There are plans to introduce more Host > Behavior Support features in the future. > > In the end, we are consciously changing NVMe, both in the spec and > in its implementation, to make it different from SCSI. I think this is > what's at the bottom of the changes discussed in this thread, and this > is why so many people are so passionate about this. We don't want to > turn NVMe into SCSI. I know I don't want to. 
> > /John > > ^Note: (maybe a 2 second delay sounds unreasonable for a flash storage device, but there are implementors that want to use NVMe with spinning disks... so the CRDT gives them exactly what they need).
* Re: nvme: restore use of blk_path_error() in nvme_complete_rq() 2020-08-08 21:08 ` Meneghini, John 2020-08-08 21:11 ` Meneghini, John @ 2020-08-10 8:10 ` Chao Leng 2020-08-11 12:36 ` Meneghini, John 1 sibling, 1 reply; 50+ messages in thread From: Chao Leng @ 2020-08-10 8:10 UTC (permalink / raw) To: Meneghini, John, Sagi Grimberg, Mike Snitzer Cc: linux-nvme, Christoph Hellwig, dm-devel, Ewan Milne, Keith Busch, Hannes Reinecke I could not agree with you more. The root cause of the conflict is REQ_FAILFAST_TRANSPORT. REQ_FAILFAST_TRANSPORT was presumably designed for SCSI, because the SCSI protocol does not define a local retry mechanism. SCSI implements a fuzzy local retry mechanism, so it needs REQ_FAILFAST_TRANSPORT for multipath software: multipath software is expected to retry according to the error code. NVMe is different from SCSI in this regard. It defines both a local retry mechanism and path error codes, so NVMe should not care about REQ_FAILFAST_TRANSPORT. Also, for nvme multipath, if the error code is not a path error, multipath will not fail over to retry. But blk_queue_dying may return true, in which case the IO cannot be retried on the current path and will be interrupted. Both blk_queue_dying and path errors need failover retry.
So we can do like this: --- drivers/nvme/host/core.c | 26 +++++++++++++++++++------- drivers/nvme/host/multipath.c | 11 +++-------- drivers/nvme/host/nvme.h | 5 ++--- include/linux/nvme.h | 9 +++++++++ 4 files changed, 33 insertions(+), 18 deletions(-) diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c index 4ee2330c603e..07471bd37f60 100644 --- a/drivers/nvme/host/core.c +++ b/drivers/nvme/host/core.c @@ -243,7 +243,7 @@ static blk_status_t nvme_error_status(u16 status) static inline bool nvme_req_needs_retry(struct request *req) { - if (blk_noretry_request(req)) + if (req->cmd_flags & (REQ_FAILFAST_DEV | REQ_FAILFAST_DRIVER)) return false; if (nvme_req(req)->status & NVME_SC_DNR) return false; @@ -252,6 +252,14 @@ static inline bool nvme_req_needs_retry(struct request *req) return true; } +static inline bool nvme_req_path_error(struct request *req) +{ + if ((nvme_req(req)->status & NVME_SCT_MASK) == NVME_SCT_PATH || + blk_queue_dying(req->q)) + return true; + return false; +} + static void nvme_retry_req(struct request *req) { struct nvme_ns *ns = req->q->queuedata; @@ -270,7 +278,7 @@ static void nvme_retry_req(struct request *req) void nvme_complete_rq(struct request *req) { - blk_status_t status = nvme_error_status(nvme_req(req)->status); + blk_status_t status; trace_nvme_complete_rq(req); @@ -279,16 +287,20 @@ void nvme_complete_rq(struct request *req) if (nvme_req(req)->ctrl->kas) nvme_req(req)->ctrl->comp_seen = true; - if (unlikely(status != BLK_STS_OK && nvme_req_needs_retry(req))) { - if ((req->cmd_flags & REQ_NVME_MPATH) && nvme_failover_req(req)) - return; - - if (!blk_queue_dying(req->q)) { + if (unlikely(nvme_req(req)->status != NVME_SC_SUCCESS && + nvme_req_needs_retry(req))) { + if (nvme_req_path_error(req)) { + if (req->cmd_flags & REQ_NVME_MPATH) { + nvme_failover_req(req); + return; + } + } else { nvme_retry_req(req); return; } } + status = nvme_error_status(nvme_req(req)->status); nvme_trace_bio_complete(req, status); 
blk_mq_end_request(req, status); } diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c index 66509472fe06..e182fb3bcd0c 100644 --- a/drivers/nvme/host/multipath.c +++ b/drivers/nvme/host/multipath.c @@ -65,7 +65,7 @@ void nvme_set_disk_name(char *disk_name, struct nvme_ns *ns, } } -bool nvme_failover_req(struct request *req) +void nvme_failover_req(struct request *req) { struct nvme_ns *ns = req->q->queuedata; u16 status = nvme_req(req)->status; @@ -90,17 +90,13 @@ bool nvme_failover_req(struct request *req) queue_work(nvme_wq, &ns->ctrl->ana_work); } break; - case NVME_SC_HOST_PATH_ERROR: - case NVME_SC_HOST_ABORTED_CMD: + default: /* - * Temporary transport disruption in talking to the controller. + * Normal error path. * Try to send on a new path. */ nvme_mpath_clear_current_path(ns); break; - default: - /* This was a non-ANA error so follow the normal error path. */ - return false; } spin_lock_irqsave(&ns->head->requeue_lock, flags); @@ -109,7 +105,6 @@ bool nvme_failover_req(struct request *req) blk_mq_end_request(req, 0); kblockd_schedule_work(&ns->head->requeue_work); - return true; } void nvme_kick_requeue_lists(struct nvme_ctrl *ctrl) diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h index 09ffc3246f60..cbb5d4ba6241 100644 --- a/drivers/nvme/host/nvme.h +++ b/drivers/nvme/host/nvme.h @@ -582,7 +582,7 @@ void nvme_mpath_wait_freeze(struct nvme_subsystem *subsys); void nvme_mpath_start_freeze(struct nvme_subsystem *subsys); void nvme_set_disk_name(char *disk_name, struct nvme_ns *ns, struct nvme_ctrl *ctrl, int *flags); -bool nvme_failover_req(struct request *req); +void nvme_failover_req(struct request *req); void nvme_kick_requeue_lists(struct nvme_ctrl *ctrl); int nvme_mpath_alloc_disk(struct nvme_ctrl *ctrl,struct nvme_ns_head *head); void nvme_mpath_add_disk(struct nvme_ns *ns, struct nvme_id_ns *id); @@ -640,9 +640,8 @@ static inline void nvme_set_disk_name(char *disk_name, struct nvme_ns *ns, sprintf(disk_name, 
"nvme%dn%d", ctrl->instance, ns->head->instance); } -static inline bool nvme_failover_req(struct request *req) +static inline void nvme_failover_req(struct request *req) { - return false; } static inline void nvme_kick_requeue_lists(struct nvme_ctrl *ctrl) { diff --git a/include/linux/nvme.h b/include/linux/nvme.h index 5ce51ab4c50e..8c4a5b4d5b4d 100644 --- a/include/linux/nvme.h +++ b/include/linux/nvme.h @@ -1441,6 +1441,15 @@ enum { NVME_SC_DNR = 0x4000, }; +#define NVME_SCT_MASK 0x700 +enum { + NVME_SCT_GENERIC = 0, + NVME_SCT_COMMAND_SPECIFIC = 0x100, + NVME_SCT_MEDIA = 0x200, + NVME_SCT_PATH = 0x300, + NVME_SCT_VENDOR = 0x700 +}; + struct nvme_completion { /* * Used by Admin and Fabrics commands to return data: -- 2.16.4 On 2020/8/9 5:08, Meneghini, John wrote: > I'd like to up level this whole conversation for a minute by talking about exactly what ACRE does. > > The genesis of the changes discussed in this thread is NVMe TP-4033 - Enhanced Command Retry. You can find a copy of this TP here: > > http://nvmexpress.org/wp-content/uploads/NVM-Express-1.3-Ratified-TPs.zip > > This technical proposal added a command retry delay feature which is programmed by the controller. The controller advertises a set of 3 different timing delays though the Identify Controller data structure CRDT{1-2} fields. To make use of these delay fields a new CRD field was added to the CQE Status Field. This allows the NVMe controller to specify exactly how long a command retry should be delayed, with 3 possible timers that it chooses and controls. CRDTs can range from 100 milliseconds to 6559 seconds. Because this capability can have such a radical effect on backwards compatibility a new NVMe Feature Identifier was added (Host Behavior Support - Feature ID 16h) with an Advanced Command Retry Enable (ACRE) bit. This allows the host to enable or disable the feature. > > With this background there are a couple of misconceptions in this thread which I'd like to address. 
> > The first is: ACRE has nothing to do with the NVME_SC_CMD_INTERRUPTED status. Yes, this new error status was added as a part of TP-4033 but the CRD field of the CQE status can be set by the controller with *any* NVMe error status. As long as the DNR bit is not set the Command Retry Delay can come into effect. This is how the spec is written and this is exactly how it has been implemented in the core nvme_complete_rq() function > (after change 507fe46ac91276120). For example, the controller can return NVME_SC_NS_NOT_READY with a CRDT of 2 seconds.^ So CDRT needs to be supported with all error status if the host is going to enable ACRE, and I think it's a big mistake to get hung up over the NVME_SC_CMD_INTERRUPTED status translation. The NVME_SC_CMD_INTERRUPTED status was only added to the spec to provide a general purpose "busy" status, something that was missing from NVMe, and focusing your solution on NVME_SC_CMD_INTERRUPTED, or any other specific NVMe error status, is the wrong thing to do. There is a much larger change in error semantics going on with ACRE than just this single error. > > The second is: All NVMe error status that do not have a Status Code Type of 3h (Path Related Status) are subsystem scoped. This is a topic that has gone through some debate on the linux-nvme mailing list and at NVMexpress.org; and there have been some ECNs to the spec to address this. There may be some exceptions to this rule because there are always implementations out there that may not follow, and there are bugs in the spec. However, this is the intention of the NVMe spec and it matters. This means that, in a multi-pathing environment, retrying any command on a different path will not provide a different result. Retries should all occur on the same controller - unless it is a path related status. This is how NVMe error semantics work and this is a part of what was behind Keith's patch . 
> > https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=35038bffa87da > > Retrying NVME_SC_NS_NOT_READY or NVME_SC_CMD_INTERRUPTED on another path is simply not the right thing to do, and returning BLK_STS_TARGET after all command retries, with CRDT, have been exhausted communicates the right thing to the upper layer. From the perspective of nvme-multipathing Keith's patch was exactly the correct thing to do. I understand that this may have caused a backwards compatibility problem with dm-multipath, and that's the only reason why I've talked about backing it out. However, ultimately, I think nvme-core should return an error status like BLK_STS_TARGET that says, semantically - the IO has failed, no retry will work - because this is what the NVMe error semantics are. > > Taken together both of these facts about the NVMe protocol semantics are what's behind my patch which removed blk_path_error() from nvme_complete_rq() > > https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=764e9332098c0e60251386a507fe46ac91276120 > > I understand that there is a goal to try and avoid having different failure/recovery handling semantically in response to different error statuses between nvme-multipath and dm-multipath, but NVMe error semantics are truly different from SCSI error semantics, and they are changing. The Linux host needs to enable and support those changes unhampered by the past. With this goal in mind, removing the blk_path_error() code from nvme-core was the right thing to do. Hannes and I struggled with the patch to try and make it work with blk_path_error() for weeks. As pointed out in the thread below, blk_path_error() is the SCSI multipathing logic and we can't use it in nvme_complete_rq(). All it does is import all of the legacy problems of dm-multipath, and of SCSI, into the nvme completion/multipath logic. > > At NVMexpress.org we consciously added the ACRE feature because the SCSI protocol had no such capability.
This is something which has plagued SCSI implementations for years, and all kinds of tricks have been played, in both the SCSI host stack and in SCSI target stack, to deal with the problem. The goal of NVMe is to create a better block storage protocol and ACRE is just one example of many places where the industry is trying to do this. There are plans to introduce more Host Behavior Support features in the future. > > In the end, we are consciously changing NVMe, both in the spec and in its implementation, to make it different from SCSI. I think this is what's at the bottom of the changes discussed in this thread, and this is why so many people are so passionate about this. We don't want to turn NVMe into SCSI. I know I don't want to. > > /John > > ^Note: (maybe a 2 second delay sounds unreasonable for a flash storage device but there are implementors that want to use NVMe with spinning disks... so the CRDT gives them exactly what they need). > > On 8/7/20, 7:35 PM, "Sagi Grimberg" <sagi@grimberg.me> wrote: > > >> Hey Mike, > >> > >>>> The point is: blk_path_error() has nothing to do with NVMe errors. > >>>> This is dm-multipath logic stuck in the middle of the NVMe error > >>>> handling code. > >>> > >>> No, it is a means to have multiple subsystems (to this point both SCSI > >>> and NVMe) doing the correct thing when translating subsystem specific > >>> error codes to BLK_STS codes. > >> > >> Not exactly, don't find any use of this in scsi. The purpose is to make > >> sure that nvme and dm speak the same language. > > > > SCSI doesn't need to do additional work to train a multipath layer > > because dm-multipath _is_ SCSI's multipathing in Linux. > > Agree. > > > So ensuring SCSI properly classifies its error codes happens as a > > side-effect of ensuring continued multipath functionality. > > > > Hannes introduced all these differentiated error codes in block core > > because of SCSI. NVMe is meant to build on the infrastructure that was > > established.
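John's second point above - only Status Code Type 3h is path scoped - is exactly the test that the NVME_SCT_MASK addition in Chao's patch encodes. As a standalone sketch (the mask and SCT values match the patch hunk for include/linux/nvme.h; status_is_path_related() is an illustrative name, the patch spells the full check nvme_req_path_error() and also folds in blk_queue_dying()):

```c
#include <stdbool.h>

/* SCT lives in bits 8-10 of the 15-bit status (CQE status >> 1) */
#define NVME_SCT_MASK	0x700
#define NVME_SCT_PATH	0x300

/*
 * Illustrative check: only a path-related status (SCT 3h) can justify
 * failing over to another path; every other SCT is subsystem scoped,
 * so a retry belongs on the same controller.
 */
static bool status_is_path_related(unsigned short status)
{
	return (status & NVME_SCT_MASK) == NVME_SCT_PATH;
}
```

Under this rule NVME_SC_HOST_PATH_ERROR (0x370) fails over, while NVME_SC_CMD_INTERRUPTED (0x21) stays on the same controller.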
> > Yes, exactly my point. blk_path_error is designed to make nvme and > dm-multipath speak the same language. > > >>> If you, or others you name drop below, understood the point we wouldn't > >>> be having this conversation. You'd accept the point of blk_path_error() > >>> to be valid and required codification of what constitutes a retryable > >>> path error for the Linux block layer. > >> > >> This incident is a case where the specific nvme status was designed > >> to retry on the same path respecting the controller retry delay. > >> And because nvme used blk_path_error at the time it caused us to use a > >> non-retryable status to get around that. Granted, no one had > >> dm-multipath in mind. > >> > >> So in a sense, there is consensus on changing patch 35038bffa87da > >> _because_ nvme no longer uses blk_path_error. Otherwise it would break. > > > > "break" meaning it would do failover instead of the more optimal local > > retry to the same controller. > > > > I see. Wish the header for commit 35038bffa87da touched on this even a > > little bit ;) > > I think it did, but maybe didn't put too much emphasis on it. > > > But AFAICT the patch I provided doesn't compromise proper local retry -- > > as long as we first fix nvme_error_status() to return a retryable > > BLK_STS for NVME_SC_CMD_INTERRUPTED -- which I assumed as a prereq. > > > > Think of blk_path_error() as a more coarse-grained "is this retryable or > > a hard failure?" check. So for NVME_SC_CMD_INTERRUPTED, > > nvme_error_status() _should_ respond with something retryable (I'd > > prefer BLK_STS_RESOURCE to be honest). > > But blk_path_error semantically means "is this a pathing error", or at > least that's what its name suggests.
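For reference, this is essentially what blk_path_error() looked like at the time (include/linux/blk_types.h); the enum values here are userspace stand-ins for the kernel's opaque blk_status_t codes so the sketch compiles on its own:

```c
#include <stdbool.h>

/* Stand-ins for the kernel's opaque blk_status_t values */
enum blk_status {
	BLK_STS_OK = 0,
	BLK_STS_NOTSUPP,
	BLK_STS_TIMEOUT,
	BLK_STS_NOSPC,
	BLK_STS_TRANSPORT,
	BLK_STS_TARGET,
	BLK_STS_NEXUS,
	BLK_STS_MEDIUM,
	BLK_STS_PROTECTION,
	BLK_STS_RESOURCE,
	BLK_STS_IOERR,
};

/*
 * Mirrors blk_path_error(): a status is treated as a path error
 * (worth retrying on another path) unless it is one of the statuses
 * known to mean "any path will fail the same way".
 */
static bool blk_path_error(enum blk_status error)
{
	switch (error) {
	case BLK_STS_NOTSUPP:
	case BLK_STS_NOSPC:
	case BLK_STS_TARGET:
	case BLK_STS_NEXUS:
	case BLK_STS_MEDIUM:
	case BLK_STS_PROTECTION:
		return false;
	default:
		/* Anything else could be a path failure, so retry */
		return true;
	}
}
```

The relevant point for this thread: BLK_STS_TARGET is in the non-retryable list, which is why translating NVME_SC_CMD_INTERRUPTED to BLK_STS_TARGET made dm-multipath fail the IO instead of retrying it.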
> > > And then nvme_failover_req() is finer-grained; by returning false it now > > allows short-circuiting failover and reverting back to NVMe's normal > > controller based error recovery -- which it does for > > NVME_SC_CMD_INTERRUPTED due to "default" case in nvme_failover_req(). > > > > And then the previous nvme_error_status() classification of retryable > > BLK_STS obviously never gets returned up the IO stack; it gets thrown > > away. > > I see what you are saying. The issue is that the code becomes > convoluted (it's a pathing error, oh wait, no its not a pathing error). > > >>> Any BLK_STS mapping of NVMe specific error codes would need to not screw > >>> up by categorizing a retryable error as non-retryable (and vice-versa). > >> > >> But it is a special type of retryable. There is nothing that fits the > >> semantics of the current behavior. > > > > I agree. But that's fine actually. > > > > And this issue is the poster-child for why properly supporting a duality > > of driver-level vs upper level multipathing capabilities is pretty much > > impossible unless a clean design factors out the different error classes > > -- and local error retry is handled before punting to higher level > > failover retry. Think if NVMe were to adopt a bit more disciplined > > "local then failover" error handling it all gets easier. > > I don't think punting before is easier, because we do a local retry if: > - no multipathing sw on top > - request needs retry (e.g. no DNR, notretry is off etc..) > - nvme error is not pathing related (nvme_failover_req returned false) > > > This local retry _is_ NVMe specific. NVMe should just own retrying on > > the same controller no matter what (I'll hope that such retry has > > awareness to not retry indefinitely though!). > > it will retry until the retry limit. > > > And this has nothing to > > do with multipathing, so the logic to handle it shouldn't be trapped in > > nvme_failover_req(). 
> > Well given that nvme_failover_req already may not actually failover this > makes some sense to me (although I did have some resistance to make it > that way in the first place, but was convinced otherwise). > > > I think NVMe can easily fix this by having an earlier stage of checking, > > e.g. nvme_local_retry_req(), that shortcircuits ever getting to > > higher-level multipathing consideration (be it native NVMe or DM > > multipathing) for cases like NVME_SC_CMD_INTERRUPTED. > > To be clear: the "default" case of nvme_failover_req() that returns > > false to fallback to NVMe's "local" normal NVMe error handling -- that > > can stay.. but a more explicit handling of cases like > > NVME_SC_CMD_INTERRUPTED should be added to a nvme_local_retry_req() > > check that happens before nvme_failover_req() in nvme_complete_rq(). > > I don't necessarily agree with having a dedicated nvme_local_retry_req(). > a request that isn't failed over, goes to local error handling (retry or > not). I actually think that just adding the condition to > nvme_complete_req and having nvme_failover_req reject it would work. > > Keith? > > >>> Anyway, no new BLK_STS is needed at this point. More discipline with > >>> how NVMe's error handling is changed is. > >> > >> Please read the above. > > > > I agree we'd need a new BLK_STS only if NVMe cannot be made to trap > > NVME_SC_CMD_INTERRUPTED for local retry _before_ considering path > > failover. > > Not sure that is better, but we can see a patch first to determine. > > >>> If NVMe wants to ensure its > >>> interface isn't broken regularly it _should_ use blk_path_error() to > >>> validate future nvme_error_status() changes. Miscategorizing NVMe > >>> errors to upper layers is a bug -- not open for debate. > >> > >> Again, don't agree is a Miscategorization nor a bug, its just something > >> that is NVMe specific. > > > > Right I understand. 
> > > > Think it safe to assume these types of details are why Christoph wanted > > to avoid the notion of native NVMe and DM multipathing having > > compatible error handling. There was some wisdom with that position > > (especially with native NVMe goals in mind). But if things are tweaked > > slightly then both camps _can_ be made happy. > > > > There just needs to be a willingness to work through the details, > > defensiveness needs to be shed on both sides, so constructive > > review/consideration of problems can happen. > > Agreed. > > > Think that has already > > happened here with our exchange. I'm open to investing effort here if > > others are up for humoring my attempt to explore fixing the issues in a > > mutually beneficial way. What's the worst that can happen? My simple > > patches might continue to be ignored? ;) > > I won't ignore it, and I apologize for ignoring the original patch > posted, I guess it flew under the radar...
* Re: nvme: restore use of blk_path_error() in nvme_complete_rq() 2020-08-10 8:10 ` Chao Leng @ 2020-08-11 12:36 ` Meneghini, John 2020-08-12 7:51 ` Chao Leng 0 siblings, 1 reply; 50+ messages in thread From: Meneghini, John @ 2020-08-11 12:36 UTC (permalink / raw) To: Chao Leng, Sagi Grimberg, Mike Snitzer Cc: linux-nvme, Christoph Hellwig, dm-devel, Ewan Milne, Keith Busch, Meneghini, John, Hannes Reinecke Chao, I don't see this patch on patchwork. https://patchwork.kernel.org/project/linux-block/list/?submitter=173677 Can you upload this there. I'd like to apply this patch to my sandbox and take a closer look. Thanks, /John On 8/10/20, 4:10 AM, "Chao Leng" <lengchao@huawei.com> wrote: I could not agree with you more. The root cause of the conflict is REQ_FAILFAST_TRANSPORT. REQ_FAILFAST_TRANSPORT may have been designed for SCSI, because the SCSI protocol does not define a local retry mechanism. SCSI implements a fuzzy local retry mechanism, so it needs REQ_FAILFAST_TRANSPORT for multipath software; having the multipath software retry according to the error code is expected there. NVMe is different from SCSI in this respect: it defines a local retry mechanism and path error codes, so NVMe should not care about REQ_FAILFAST_TRANSPORT. Another point: for NVMe multipath, if the error code is not a path error, multipath will not fail over to retry. But blk_queue_dying() may return true, in which case the IO cannot be retried on the current path and will be interrupted. Both blk_queue_dying() and path errors need failover retry.
So we can do like this: --- drivers/nvme/host/core.c | 26 +++++++++++++++++++------- drivers/nvme/host/multipath.c | 11 +++-------- drivers/nvme/host/nvme.h | 5 ++--- include/linux/nvme.h | 9 +++++++++ 4 files changed, 33 insertions(+), 18 deletions(-) diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c index 4ee2330c603e..07471bd37f60 100644 --- a/drivers/nvme/host/core.c +++ b/drivers/nvme/host/core.c @@ -243,7 +243,7 @@ static blk_status_t nvme_error_status(u16 status) static inline bool nvme_req_needs_retry(struct request *req) { - if (blk_noretry_request(req)) + if (req->cmd_flags & (REQ_FAILFAST_DEV | REQ_FAILFAST_DRIVER)) return false; if (nvme_req(req)->status & NVME_SC_DNR) return false; @@ -252,6 +252,14 @@ static inline bool nvme_req_needs_retry(struct request *req) return true; } +static inline bool nvme_req_path_error(struct request *req) +{ + if ((nvme_req(req)->status & NVME_SCT_MASK) == NVME_SCT_PATH || + blk_queue_dying(req->q)) + return true; + return false; +} + static void nvme_retry_req(struct request *req) { struct nvme_ns *ns = req->q->queuedata; @@ -270,7 +278,7 @@ static void nvme_retry_req(struct request *req) void nvme_complete_rq(struct request *req) { - blk_status_t status = nvme_error_status(nvme_req(req)->status); + blk_status_t status; trace_nvme_complete_rq(req); @@ -279,16 +287,20 @@ void nvme_complete_rq(struct request *req) if (nvme_req(req)->ctrl->kas) nvme_req(req)->ctrl->comp_seen = true; - if (unlikely(status != BLK_STS_OK && nvme_req_needs_retry(req))) { - if ((req->cmd_flags & REQ_NVME_MPATH) && nvme_failover_req(req)) - return; - - if (!blk_queue_dying(req->q)) { + if (unlikely(nvme_req(req)->status != NVME_SC_SUCCESS && + nvme_req_needs_retry(req))) { + if (nvme_req_path_error(req)) { + if (req->cmd_flags & REQ_NVME_MPATH) { + nvme_failover_req(req); + return; + } + } else { nvme_retry_req(req); return; } } + status = nvme_error_status(nvme_req(req)->status); nvme_trace_bio_complete(req, status); 
blk_mq_end_request(req, status); } diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c index 66509472fe06..e182fb3bcd0c 100644 --- a/drivers/nvme/host/multipath.c +++ b/drivers/nvme/host/multipath.c @@ -65,7 +65,7 @@ void nvme_set_disk_name(char *disk_name, struct nvme_ns *ns, } } -bool nvme_failover_req(struct request *req) +void nvme_failover_req(struct request *req) { struct nvme_ns *ns = req->q->queuedata; u16 status = nvme_req(req)->status; @@ -90,17 +90,13 @@ bool nvme_failover_req(struct request *req) queue_work(nvme_wq, &ns->ctrl->ana_work); } break; - case NVME_SC_HOST_PATH_ERROR: - case NVME_SC_HOST_ABORTED_CMD: + default: /* - * Temporary transport disruption in talking to the controller. + * Normal error path. * Try to send on a new path. */ nvme_mpath_clear_current_path(ns); break; - default: - /* This was a non-ANA error so follow the normal error path. */ - return false; } spin_lock_irqsave(&ns->head->requeue_lock, flags); @@ -109,7 +105,6 @@ bool nvme_failover_req(struct request *req) blk_mq_end_request(req, 0); kblockd_schedule_work(&ns->head->requeue_work); - return true; } void nvme_kick_requeue_lists(struct nvme_ctrl *ctrl) diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h index 09ffc3246f60..cbb5d4ba6241 100644 --- a/drivers/nvme/host/nvme.h +++ b/drivers/nvme/host/nvme.h @@ -582,7 +582,7 @@ void nvme_mpath_wait_freeze(struct nvme_subsystem *subsys); void nvme_mpath_start_freeze(struct nvme_subsystem *subsys); void nvme_set_disk_name(char *disk_name, struct nvme_ns *ns, struct nvme_ctrl *ctrl, int *flags); -bool nvme_failover_req(struct request *req); +void nvme_failover_req(struct request *req); void nvme_kick_requeue_lists(struct nvme_ctrl *ctrl); int nvme_mpath_alloc_disk(struct nvme_ctrl *ctrl,struct nvme_ns_head *head); void nvme_mpath_add_disk(struct nvme_ns *ns, struct nvme_id_ns *id); @@ -640,9 +640,8 @@ static inline void nvme_set_disk_name(char *disk_name, struct nvme_ns *ns, sprintf(disk_name, 
"nvme%dn%d", ctrl->instance, ns->head->instance); } -static inline bool nvme_failover_req(struct request *req) +static inline void nvme_failover_req(struct request *req) { - return false; } static inline void nvme_kick_requeue_lists(struct nvme_ctrl *ctrl) { diff --git a/include/linux/nvme.h b/include/linux/nvme.h index 5ce51ab4c50e..8c4a5b4d5b4d 100644 --- a/include/linux/nvme.h +++ b/include/linux/nvme.h @@ -1441,6 +1441,15 @@ enum { NVME_SC_DNR = 0x4000, }; +#define NVME_SCT_MASK 0x700 +enum { + NVME_SCT_GENERIC = 0, + NVME_SCT_COMMAND_SPECIFIC = 0x100, + NVME_SCT_MEDIA = 0x200, + NVME_SCT_PATH = 0x300, + NVME_SCT_VENDOR = 0x700 +}; + struct nvme_completion { /* * Used by Admin and Fabrics commands to return data: -- 2.16.4 On 2020/8/9 5:08, Meneghini, John wrote: > I'd like to up level this whole conversation for a minute by talking about exactly what ACRE does. > > The genesis of the changes discussed in this thread is NVMe TP-4033 - Enhanced Command Retry. You can find a copy of this TP here: > > http://nvmexpress.org/wp-content/uploads/NVM-Express-1.3-Ratified-TPs.zip > > This technical proposal added a command retry delay feature which is programmed by the controller. The controller advertises a set of 3 different timing delays though the Identify Controller data structure CRDT{1-2} fields. To make use of these delay fields a new CRD field was added to the CQE Status Field. This allows the NVMe controller to specify exactly how long a command retry should be delayed, with 3 possible timers that it chooses and controls. CRDTs can range from 100 milliseconds to 6559 seconds. Because this capability can have such a radical effect on backwards compatibility a new NVMe Feature Identifier was added (Host Behavior Support - Feature ID 16h) with an Advanced Command Retry Enable (ACRE) bit. This allows the host to enable or disable the feature. > > With this background there are a couple of misconceptions in this thread which I'd like to address. 
> > The first is: ACRE has nothing to do with the NVME_SC_CMD_INTERRUPTED status. Yes, this new error status was added as a part of TP-4033 but the CRD field of the CQE status can be set by the controller with *any* NVMe error status. As long as the DNR bit is not set the Command Retry Delay can come into effect. This is how the spec is written and this is exactly how it has been implemented in the core nvme_complete_rq() function > (after change 507fe46ac91276120). For example, the controller can return NVME_SC_NS_NOT_READY with a CRDT of 2 seconds.^ So CDRT needs to be supported with all error status if the host is going to enable ACRE, and I think it's a big mistake to get hung up over the NVME_SC_CMD_INTERRUPTED status translation. The NVME_SC_CMD_INTERRUPTED status was only added to the spec to provide a general purpose "busy" status, something that was missing from NVMe, and focusing your solution on NVME_SC_CMD_INTERRUPTED, or any other specific NVMe error status, is the wrong thing to do. There is a much larger change in error semantics going on with ACRE than just this single error. > > The second is: All NVMe error status that do not have a Status Code Type of 3h (Path Related Status) are subsystem scoped. This is a topic that has gone through some debate on the linux-nvme mailing list and at NVMexpress.org; and there have been some ECNs to the spec to address this. There may be some exceptions to this rule because there are always implementations out there that may not follow, and there are bugs in the spec. However, this is the intention of the NVMe spec and it matters. This means that, in a multi-pathing environment, retrying any command on a different path will not provide a different result. Retries should all occur on the same controller - unless it is a path related status. This is how NVMe error semantics work and this is a part of what was behind Keith's patch . 
> > https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=35038bffa87da > > Retrying NVME_SC_NS_NOT_READY or NVME_SC_CMD_INTERRUPTED on another path is simply not the right thing to do, and returning BLK_STS_TARGET after all command retries, with CRDT, have been exhausted communicates the right thing to the upper layer. From the perspective of nvme-multipathing Keith's patch was exactly the correct thing to do. I understand that this may have caused a backwards compatibly problem with dm-multipath, and that's the only reason why I've talked about backing it out. However, ultimately, I think nvme-core should return an error status like BLK_STS_TARGET that says, semantically - the IO has failed, no retry will work - because this is what the NVMe error semantics are. > > Taken together both of these facts about the NVMe protocol semantics are what's behind my patch which removed blk_path_error() from nvme_complete_rq() > > https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=764e9332098c0e60251386a507fe46ac91276120 > > I understand that there is a goal to try and avoid having different failure/recovery handling semantically in response to different error status between nvme-multipath and dm-multipath, but NVMe error semantics are truly different from SCSI error semantics, and they are changing. The Linux host needs to enable and support those changes unhampered by the past. With this goal in mind, removing the blk_path_error() code from nvme-core was the right thing to do. Hannes and I struggled with the patch to try and make it work with blk_path_error() for weeks. As pointed out in the thread below, blk_path_error() is the SCSI multipathing logic and we can't use it in nvme_complete_rq(). All it does is import all of the legacy problems of dm-multipath, and of SCSI, into the nvme completion/multipath logic. > > At NVMexpress.org we consciously added the ACRE feature because the SCSI protocol had no such capability. 
This is something which has plagued SCSI implementations for years, and all kinds of tricks have been played, in both the SCSI host stack and in SCSI target stack, to deal with the problem. The goal of NVMe is to create a better block storage protocol and ACRE is just one example of many places where the industry is trying to do this. There are plans to introduce more Host Behavior Support features in the future. > > In the end, we are consciously changing NVMe, both in the spec and in its implementation, to make it different from SCSI. I think this is what's at the bottom of the changes discussed in this thread, and this is why so many people are so passionate about this. We don't want to turn NVMe into SCSI. I know I don't want to. > > /John > > ^Note: (maybe a 2 second delay sounds unreasonable for a flash storage device but there implementors that that want to use NVMe with spinning disks... so the CDRT gives them exactly what they need). > > On 8/7/20, 7:35 PM, "Sagi Grimberg" <sagi@grimberg.me> wrote: > > >> Hey Mike, > >> > >>>> The point is: blk_path_error() has nothing to do with NVMe errors. > >>>> This is dm-multipath logic stuck in the middle of the NVMe error > >>>> handling code. > >>> > >>> No, it is a means to have multiple subsystems (to this point both SCSI > >>> and NVMe) doing the correct thing when translating subsystem specific > >>> error codes to BLK_STS codes. > >> > >> Not exactly, don't find any use of this in scsi. The purpose is to make > >> sure that nvme and dm speak the same language. > > > > SCSI doesn't need to do additional work to train a multipath layer > > because dm-multipath _is_ SCSI's multipathing in Linux. > > Agree. > > > So ensuring SCSI properly classifies its error codes happens as a > > side-effect of ensuring continued multipath functionality. > > > > Hannes introduced all these differentiated error codes in block core > > because of SCSI. NVMe is meant to build on the infrastructure that was > > established. 
> > Yes, exactly my point. blk_path_error is designed to make nvme and > dm-multipath speak the same language. > > >>> If you, or others you name drop below, understood the point we wouldn't > >>> be having this conversation. You'd accept the point of blk_path_error() > >>> to be valid and required codification of what constitutes a retryable > >>> path error for the Linux block layer. > >> > >> This incident is a case where the specific nvme status was designed > >> to retry on the same path respecting the controller retry delay. > >> And because nvme used blk_path_error at the time it caused us to use a > >> non-retryable status to get around that. Granted, no one had > >> dm-multipath in mind. > >> > >> So in a sense, there is consensus on changing patch 35038bffa87da > >> _because_ nvme no longer uses blk_path_error. Otherwise it would break. > > > > "break" meaning it would do failover instead of the more optimal local > > retry to the same controller. > > > > I see. Wish the header for commit 35038bffa87da touched on this even a > > little bit ;) > > I think it did, but maybe didn't put too much emphasis on it. > > > But AFAICT the patch I provided doesn't compromise proper local retry -- > > as long as we first fix nvme_error_status() to return a retryable > > BLK_STS for NVME_SC_CMD_INTERRUPTED -- which I assumed as a prereq. > > > > Think of blk_path_error() as a more coarse-grained "is this retryable or > > a hard failure?" check. So for NVME_SC_CMD_INTERRUPTED, > > nvme_error_status() _should_ respond with something retryable (I'd > > prefer BLK_STS_RESOURCE to be honest). > > But blk_path_error semantically mean "is this a pathing error", or at > least that what its name suggest. 
> > > And then nvme_failover_req() is finer-grained; by returning false it now > > allows short-circuiting failover and reverting back to NVMe's normal > > controller based error recovery -- which it does for > > NVME_SC_CMD_INTERRUPTED due to "default" case in nvme_failover_req(). > > > > And then the previous nvme_error_status() classification of retryable > > BLK_STS obviously never gets returned up the IO stack; it gets thrown > > away. > > I see what you are saying. The issue is that the code becomes > convoluted (it's a pathing error, oh wait, no its not a pathing error). > > >>> Any BLK_STS mapping of NVMe specific error codes would need to not screw > >>> up by categorizing a retryable error as non-retryable (and vice-versa). > >> > >> But it is a special type of retryable. There is nothing that fits the > >> semantics of the current behavior. > > > > I agree. But that's fine actually. > > > > And this issue is the poster-child for why properly supporting a duality > > of driver-level vs upper level multipathing capabilities is pretty much > > impossible unless a clean design factors out the different error classes > > -- and local error retry is handled before punting to higher level > > failover retry. Think if NVMe were to adopt a bit more disciplined > > "local then failover" error handling it all gets easier. > > I don't think punting before is easier, because we do a local retry if: > - no multipathing sw on top > - request needs retry (e.g. no DNR, notretry is off etc..) > - nvme error is not pathing related (nvme_failover_req returned false) > > > This local retry _is_ NVMe specific. NVMe should just own retrying on > > the same controller no matter what (I'll hope that such retry has > > awareness to not retry indefinitely though!). > > it will retry until the retry limit. > > > And this has nothing to > > do with multipathing, so the logic to handle it shouldn't be trapped in > > nvme_failover_req(). 
> > Well given that nvme_failover_req already may not actually failover this > makes some sense to me (although I did have some resistance to make it > that way in the first place, but was convinced otherwise). > > > I think NVMe can easily fix this by having an earlier stage of checking, > > e.g. nvme_local_retry_req(), that shortcircuits ever getting to > > higher-level multipathing consideration (be it native NVMe or DM > > multipathing) for cases like NVME_SC_CMD_INTERRUPTED. > > To be clear: the "default" case of nvme_failover_req() that returns > > false to fallback to NVMe's "local" normal NVMe error handling -- that > > can stay.. but a more explicit handling of cases like > > NVME_SC_CMD_INTERRUPTED should be added to a nvme_local_retry_req() > > check that happens before nvme_failover_req() in nvme_complete_rq(). > > I don't necessarily agree with having a dedicated nvme_local_retry_req(). > a request that isn't failed over, goes to local error handling (retry or > not). I actually think that just adding the condition to > nvme_complete_req and having nvme_failover_req reject it would work. > > Keith? > > >>> Anyway, no new BLK_STS is needed at this point. More discipline with > >>> how NVMe's error handling is changed is. > >> > >> Please read the above. > > > > I agree we'd need a new BLK_STS only if NVMe cannot be made to trap > > NVME_SC_CMD_INTERRUPTED for local retry _before_ considering path > > failover. > > Not sure that is better, but we can see a patch first to determine. > > >>> If NVMe wants to ensure its > >>> interface isn't broken regularly it _should_ use blk_path_error() to > >>> validate future nvme_error_status() changes. Miscategorizing NVMe > >>> errors to upper layers is a bug -- not open for debate. > >> > >> Again, don't agree is a Miscategorization nor a bug, its just something > >> that is NVMe specific. > > > > Right I understand. 
> > > > Think it safe to assume these types of details are why Christoph wanted > > to avoid the notion of native NVMe and DM multipathing having > > compatible error handling. There was some wisdom with that position > > (especially with native NVMe goals in mind). But if things are tweaked > > slightly then both camps _can_ be made happy. > > > > There just needs to be a willingness to work through the details, > > defensiveness needs to be shed on both sides, so constructive > > review/consideration of problems can happen. > > Agreed. > > > Think that has already > > happened here with our exchange. I'm open to investing effort here if > > others are up for humoring my attempt to explore fixing the issues in a > > mutually beneficial way. What's the worst that can happen? My simple > > patches might continue to be ignored? ;) > > I won't ignore it, and I apologize of ignoring the original patch > posted, I guess it flew under the radar... > _______________________________________________ Linux-nvme mailing list Linux-nvme@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: nvme: restore use of blk_path_error() in nvme_complete_rq() 2020-08-11 12:36 ` Meneghini, John @ 2020-08-12 7:51 ` Chao Leng 0 siblings, 0 replies; 50+ messages in thread From: Chao Leng @ 2020-08-12 7:51 UTC (permalink / raw) To: Meneghini, John, Sagi Grimberg, Mike Snitzer Cc: linux-nvme, Christoph Hellwig, dm-devel, Ewan Milne, Keith Busch, Hannes Reinecke On 2020/8/11 20:36, Meneghini, John wrote: > Chao, > > I don't see this patch on patchwork. > > https://patchwork.kernel.org/project/linux-block/list/?submitter=173677 > > Can you upload this there. I'd like to apply this patch to my sandbox and take a closer look. Ok, I will send the patch later. > > Thanks, > > /John
*/ > - return false; > } > > spin_lock_irqsave(&ns->head->requeue_lock, flags); > @@ -109,7 +105,6 @@ bool nvme_failover_req(struct request *req) > blk_mq_end_request(req, 0); > > kblockd_schedule_work(&ns->head->requeue_work); > - return true; > } > > void nvme_kick_requeue_lists(struct nvme_ctrl *ctrl) > diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h > index 09ffc3246f60..cbb5d4ba6241 100644 > --- a/drivers/nvme/host/nvme.h > +++ b/drivers/nvme/host/nvme.h > @@ -582,7 +582,7 @@ void nvme_mpath_wait_freeze(struct nvme_subsystem *subsys); > void nvme_mpath_start_freeze(struct nvme_subsystem *subsys); > void nvme_set_disk_name(char *disk_name, struct nvme_ns *ns, > struct nvme_ctrl *ctrl, int *flags); > -bool nvme_failover_req(struct request *req); > +void nvme_failover_req(struct request *req); > void nvme_kick_requeue_lists(struct nvme_ctrl *ctrl); > int nvme_mpath_alloc_disk(struct nvme_ctrl *ctrl,struct nvme_ns_head *head); > void nvme_mpath_add_disk(struct nvme_ns *ns, struct nvme_id_ns *id); > @@ -640,9 +640,8 @@ static inline void nvme_set_disk_name(char *disk_name, struct nvme_ns *ns, > sprintf(disk_name, "nvme%dn%d", ctrl->instance, ns->head->instance); > } > > -static inline bool nvme_failover_req(struct request *req) > +static inline void nvme_failover_req(struct request *req) > { > - return false; > } > static inline void nvme_kick_requeue_lists(struct nvme_ctrl *ctrl) > { > diff --git a/include/linux/nvme.h b/include/linux/nvme.h > index 5ce51ab4c50e..8c4a5b4d5b4d 100644 > --- a/include/linux/nvme.h > +++ b/include/linux/nvme.h > @@ -1441,6 +1441,15 @@ enum { > NVME_SC_DNR = 0x4000, > }; > > +#define NVME_SCT_MASK 0x700 > +enum { > + NVME_SCT_GENERIC = 0, > + NVME_SCT_COMMAND_SPECIFIC = 0x100, > + NVME_SCT_MEDIA = 0x200, > + NVME_SCT_PATH = 0x300, > + NVME_SCT_VENDOR = 0x700 > +}; > + > struct nvme_completion { > /* > * Used by Admin and Fabrics commands to return data: > -- > 2.16.4 > > > On 2020/8/9 5:08, Meneghini, John wrote: > > 
I'd like to up-level this whole conversation for a minute by talking about exactly what ACRE does. > > > > The genesis of the changes discussed in this thread is NVMe TP-4033 - Enhanced Command Retry. You can find a copy of this TP here: > > > > http://nvmexpress.org/wp-content/uploads/NVM-Express-1.3-Ratified-TPs.zip > > > > This technical proposal added a command retry delay feature which is programmed by the controller. The controller advertises a set of 3 different timing delays through the Identify Controller data structure CRDT{1-3} fields. To make use of these delay fields a new CRD field was added to the CQE Status Field. This allows the NVMe controller to specify exactly how long a command retry should be delayed, with 3 possible timers that it chooses and controls. CRDTs can range from 100 milliseconds to 6553.5 seconds. Because this capability can have such a radical effect on backwards compatibility a new NVMe Feature Identifier was added (Host Behavior Support - Feature ID 16h) with an Advanced Command Retry Enable (ACRE) bit. This allows the host to enable or disable the feature. > > > > With this background there are a couple of misconceptions in this thread which I'd like to address. > > > > The first is: ACRE has nothing to do with the NVME_SC_CMD_INTERRUPTED status. Yes, this new error status was added as a part of TP-4033 but the CRD field of the CQE status can be set by the controller with *any* NVMe error status. As long as the DNR bit is not set the Command Retry Delay can come into effect. This is how the spec is written and this is exactly how it has been implemented in the core nvme_complete_rq() function > > (after change 507fe46ac91276120). For example, the controller can return NVME_SC_NS_NOT_READY with a CRDT of 2 seconds.^ So CRDT needs to be supported with all error statuses if the host is going to enable ACRE, and I think it's a big mistake to get hung up over the NVME_SC_CMD_INTERRUPTED status translation.
The NVME_SC_CMD_INTERRUPTED status was only added to the spec to provide a general purpose "busy" status, something that was missing from NVMe, and focusing your solution on NVME_SC_CMD_INTERRUPTED, or any other specific NVMe error status, is the wrong thing to do. There is a much larger change in error semantics going on with ACRE than just this single error. > > > > The second is: All NVMe error statuses that do not have a Status Code Type of 3h (Path Related Status) are subsystem scoped. This is a topic that has gone through some debate on the linux-nvme mailing list and at NVMexpress.org; and there have been some ECNs to the spec to address this. There may be some exceptions to this rule because there are always implementations out there that may not follow it, and there are bugs in the spec. However, this is the intention of the NVMe spec and it matters. This means that, in a multi-pathing environment, retrying any command on a different path will not provide a different result. Retries should all occur on the same controller - unless it is a path related status. This is how NVMe error semantics work and this is a part of what was behind Keith's patch. > > > > https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=35038bffa87da > > > > Retrying NVME_SC_NS_NOT_READY or NVME_SC_CMD_INTERRUPTED on another path is simply not the right thing to do, and returning BLK_STS_TARGET after all command retries, with CRDT, have been exhausted communicates the right thing to the upper layer. From the perspective of nvme-multipathing Keith's patch was exactly the correct thing to do. I understand that this may have caused a backwards compatibility problem with dm-multipath, and that's the only reason why I've talked about backing it out. However, ultimately, I think nvme-core should return an error status like BLK_STS_TARGET that says, semantically - the IO has failed, no retry will work - because this is what the NVMe error semantics are.
> > > > Taken together both of these facts about the NVMe protocol semantics are what's behind my patch which removed blk_path_error() from nvme_complete_rq() > > > > https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=764e9332098c0e60251386a507fe46ac91276120 > > > > I understand that there is a goal to try and avoid having different failure/recovery handling semantically in response to different error status between nvme-multipath and dm-multipath, but NVMe error semantics are truly different from SCSI error semantics, and they are changing. The Linux host needs to enable and support those changes unhampered by the past. With this goal in mind, removing the blk_path_error() code from nvme-core was the right thing to do. Hannes and I struggled with the patch to try and make it work with blk_path_error() for weeks. As pointed out in the thread below, blk_path_error() is the SCSI multipathing logic and we can't use it in nvme_complete_rq(). All it does is import all of the legacy problems of dm-multipath, and of SCSI, into the nvme completion/multipath logic. > > > > At NVMexpress.org we consciously added the ACRE feature because the SCSI protocol had no such capability. This is something which has plagued SCSI implementations for years, and all kinds of tricks have been played, in both the SCSI host stack and in SCSI target stack, to deal with the problem. The goal of NVMe is to create a better block storage protocol and ACRE is just one example of many places where the industry is trying to do this. There are plans to introduce more Host Behavior Support features in the future. > > > > In the end, we are consciously changing NVMe, both in the spec and in its implementation, to make it different from SCSI. I think this is what's at the bottom of the changes discussed in this thread, and this is why so many people are so passionate about this. We don't want to turn NVMe into SCSI. I know I don't want to. 
> > > > /John > > > > ^Note: (maybe a 2 second delay sounds unreasonable for a flash storage device but there are implementors that want to use NVMe with spinning disks... so the CRDT gives them exactly what they need). > > > > On 8/7/20, 7:35 PM, "Sagi Grimberg" <sagi@grimberg.me> wrote: > > > > >> Hey Mike, > > >> > > >>>> The point is: blk_path_error() has nothing to do with NVMe errors. > > >>>> This is dm-multipath logic stuck in the middle of the NVMe error > > >>>> handling code. > > >>> > > >>> No, it is a means to have multiple subsystems (to this point both SCSI > > >>> and NVMe) doing the correct thing when translating subsystem specific > > >>> error codes to BLK_STS codes. > > >> > > >> Not exactly, don't find any use of this in scsi. The purpose is to make > > >> sure that nvme and dm speak the same language. > > > > > > SCSI doesn't need to do additional work to train a multipath layer > > > because dm-multipath _is_ SCSI's multipathing in Linux. > > > > Agree. > > > > > So ensuring SCSI properly classifies its error codes happens as a > > > side-effect of ensuring continued multipath functionality. > > > > > > Hannes introduced all these differentiated error codes in block core > > > because of SCSI. NVMe is meant to build on the infrastructure that was > > > established. > > > > Yes, exactly my point. blk_path_error is designed to make nvme and > > dm-multipath speak the same language. > > > > >>> If you, or others you name drop below, understood the point we wouldn't > > >>> be having this conversation. You'd accept the point of blk_path_error() > > >>> to be valid and required codification of what constitutes a retryable > > >>> path error for the Linux block layer. > > >> > > >> This incident is a case where the specific nvme status was designed > > >> to retry on the same path respecting the controller retry delay. > > >> And because nvme used blk_path_error at the time it caused us to use a > > >> non-retryable status to get around that.
Granted, no one had > > >> dm-multipath in mind. > > >> > > >> So in a sense, there is consensus on changing patch 35038bffa87da > > >> _because_ nvme no longer uses blk_path_error. Otherwise it would break. > > > > > > "break" meaning it would do failover instead of the more optimal local > > > retry to the same controller. > > > > > > I see. Wish the header for commit 35038bffa87da touched on this even a > > > little bit ;) > > > > I think it did, but maybe didn't put too much emphasis on it. > > > > > But AFAICT the patch I provided doesn't compromise proper local retry -- > > > as long as we first fix nvme_error_status() to return a retryable > > > BLK_STS for NVME_SC_CMD_INTERRUPTED -- which I assumed as a prereq. > > > > > > Think of blk_path_error() as a more coarse-grained "is this retryable or > > > a hard failure?" check. So for NVME_SC_CMD_INTERRUPTED, > > > nvme_error_status() _should_ respond with something retryable (I'd > > > prefer BLK_STS_RESOURCE to be honest). > > > > But blk_path_error semantically mean "is this a pathing error", or at > > least that what its name suggest. > > > > > And then nvme_failover_req() is finer-grained; by returning false it now > > > allows short-circuiting failover and reverting back to NVMe's normal > > > controller based error recovery -- which it does for > > > NVME_SC_CMD_INTERRUPTED due to "default" case in nvme_failover_req(). > > > > > > And then the previous nvme_error_status() classification of retryable > > > BLK_STS obviously never gets returned up the IO stack; it gets thrown > > > away. > > > > I see what you are saying. The issue is that the code becomes > > convoluted (it's a pathing error, oh wait, no its not a pathing error). > > > > >>> Any BLK_STS mapping of NVMe specific error codes would need to not screw > > >>> up by categorizing a retryable error as non-retryable (and vice-versa). > > >> > > >> But it is a special type of retryable. 
There is nothing that fits the > > >> semantics of the current behavior. > > > > > > I agree. But that's fine actually. > > > > > > And this issue is the poster-child for why properly supporting a duality > > > of driver-level vs upper level multipathing capabilities is pretty much > > > impossible unless a clean design factors out the different error classes > > > -- and local error retry is handled before punting to higher level > > > failover retry. Think if NVMe were to adopt a bit more disciplined > > > "local then failover" error handling it all gets easier. > > > > I don't think punting before is easier, because we do a local retry if: > > - no multipathing sw on top > > - request needs retry (e.g. no DNR, notretry is off etc..) > > - nvme error is not pathing related (nvme_failover_req returned false) > > > > > This local retry _is_ NVMe specific. NVMe should just own retrying on > > > the same controller no matter what (I'll hope that such retry has > > > awareness to not retry indefinitely though!). > > > > it will retry until the retry limit. > > > > > And this has nothing to > > > do with multipathing, so the logic to handle it shouldn't be trapped in > > > nvme_failover_req(). > > > > Well given that nvme_failover_req already may not actually failover this > > makes some sense to me (although I did have some resistance to make it > > that way in the first place, but was convinced otherwise). > > > > > I think NVMe can easily fix this by having an earlier stage of checking, > > > e.g. nvme_local_retry_req(), that shortcircuits ever getting to > > > higher-level multipathing consideration (be it native NVMe or DM > > > multipathing) for cases like NVME_SC_CMD_INTERRUPTED. > > > To be clear: the "default" case of nvme_failover_req() that returns > > > false to fallback to NVMe's "local" normal NVMe error handling -- that > > > can stay.. 
but a more explicit handling of cases like > > > NVME_SC_CMD_INTERRUPTED should be added to a nvme_local_retry_req() > > > check that happens before nvme_failover_req() in nvme_complete_rq(). > > > > I don't necessarily agree with having a dedicated nvme_local_retry_req(). > > a request that isn't failed over, goes to local error handling (retry or > > not). I actually think that just adding the condition to > > nvme_complete_req and having nvme_failover_req reject it would work. > > > > Keith? > > > > >>> Anyway, no new BLK_STS is needed at this point. More discipline with > > >>> how NVMe's error handling is changed is. > > >> > > >> Please read the above. > > > > > > I agree we'd need a new BLK_STS only if NVMe cannot be made to trap > > > NVME_SC_CMD_INTERRUPTED for local retry _before_ considering path > > > failover. > > > > Not sure that is better, but we can see a patch first to determine. > > > > >>> If NVMe wants to ensure its > > >>> interface isn't broken regularly it _should_ use blk_path_error() to > > >>> validate future nvme_error_status() changes. Miscategorizing NVMe > > >>> errors to upper layers is a bug -- not open for debate. > > >> > > >> Again, don't agree is a Miscategorization nor a bug, its just something > > >> that is NVMe specific. > > > > > > Right I understand. > > > > > > Think it safe to assume these types of details are why Christoph wanted > > > to avoid the notion of native NVMe and DM multipathing having > > > compatible error handling. There was some wisdom with that position > > > (especially with native NVMe goals in mind). But if things are tweaked > > > slightly then both camps _can_ be made happy. > > > > > > There just needs to be a willingness to work through the details, > > > defensiveness needs to be shed on both sides, so constructive > > > review/consideration of problems can happen. > > > > Agreed. > > > > > Think that has already > > > happened here with our exchange. 
I'm open to investing effort here if > > > others are up for humoring my attempt to explore fixing the issues in a > > > mutually beneficial way. What's the worst that can happen? My simple > > > patches might continue to be ignored? ;) > > > > I won't ignore it, and I apologize for ignoring the original patch > > posted, I guess it flew under the radar... > > > _______________________________________________ Linux-nvme mailing list Linux-nvme@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: nvme: restore use of blk_path_error() in nvme_complete_rq() 2020-08-07 23:35 ` Sagi Grimberg 2020-08-08 21:08 ` Meneghini, John @ 2020-08-10 14:36 ` Mike Snitzer 2020-08-10 17:22 ` [PATCH] nvme: explicitly use normal NVMe error handling when appropriate Mike Snitzer 1 sibling, 1 reply; 50+ messages in thread From: Mike Snitzer @ 2020-08-10 14:36 UTC (permalink / raw) To: Sagi Grimberg Cc: Hannes Reinecke, linux-nvme, Christoph Hellwig, dm-devel, Ewan Milne, Chao Leng, Keith Busch, Meneghini, John On Fri, Aug 07 2020 at 7:35pm -0400, Sagi Grimberg <sagi@grimberg.me> wrote: > > >>Hey Mike, > >> > >>>>The point is: blk_path_error() has nothing to do with NVMe errors. > >>>>This is dm-multipath logic stuck in the middle of the NVMe error > >>>>handling code. > >>> > >>>No, it is a means to have multiple subsystems (to this point both SCSI > >>>and NVMe) doing the correct thing when translating subsystem specific > >>>error codes to BLK_STS codes. > >> > >>Not exactly, don't find any use of this in scsi. The purpose is to make > >>sure that nvme and dm speak the same language. > > > >SCSI doesn't need to do additional work to train a multipath layer > >because dm-multipath _is_ SCSI's multipathing in Linux. > > Agree. > > >So ensuring SCSI properly classifies its error codes happens as a > >side-effect of ensuring continued multipath functionality. > > > >Hannes introduced all these differentiated error codes in block core > >because of SCSI. NVMe is meant to build on the infrastructure that was > >established. > > Yes, exactly my point. blk_path_error is designed to make nvme and > dm-multipath speak the same language. Yes, code was lifted from dm-multipath to block core for that purpose. > >But AFAICT the patch I provided doesn't compromise proper local retry -- > >as long as we first fix nvme_error_status() to return a retryable > >BLK_STS for NVME_SC_CMD_INTERRUPTED -- which I assumed as a prereq. 
> > > >Think of blk_path_error() as a more coarse-grained "is this retryable or > >a hard failure?" check. So for NVME_SC_CMD_INTERRUPTED, > >nvme_error_status() _should_ respond with something retryable (I'd > >prefer BLK_STS_RESOURCE to be honest). > > But blk_path_error semantically mean "is this a pathing error", or at > least that what its name suggest. As you can see here: https://www.redhat.com/archives/dm-devel/2018-January/msg00001.html Keith's first pass lifted dm-mpath.c:noretry_error to blk_types.h:blk_path_failure, and I think Jens preferred the name blk_path_error; that is how it has stayed. In any case, blk_path_error is really a distraction for the problems I now see more clearly. > >And then nvme_failover_req() is finer-grained; by returning false it now > >allows short-circuiting failover and reverting back to NVMe's normal > >controller based error recovery -- which it does for > >NVME_SC_CMD_INTERRUPTED due to "default" case in nvme_failover_req(). > > > >And then the previous nvme_error_status() classification of retryable > >BLK_STS obviously never gets returned up the IO stack; it gets thrown > >away. > > I see what you are saying. The issue is that the code becomes > convoluted (it's a pathing error, oh wait, no its not a pathing error). Yeah, I don't care for it either. And I don't think we need to get bogged down with that nastiness. Really not a fan of NVMe's code that lazily routes non-path related error handling through nvme_failover_req() only to return false back to nvme_complete_rq(). It's too cute, too convoluted. I concede that it is worthwhile to trap errors that should be dealt with by NVMe's normal error handling and punt back.. but when it does so it should be treated as a WARN_ON() or at least pr_warn() worthy event. And an earlier negative check in nvme_failover_req() should be updated to latch onto that class of error that is dealt with by NVMe's normal error handling.
> >>>Any BLK_STS mapping of NVMe specific error codes would need to not screw > >>>up by categorizing a retryable error as non-retryable (and vice-versa). > >> > >>But it is a special type of retryable. There is nothing that fits the > >>semantics of the current behavior. > > > >I agree. But that's fine actually. > > > >And this issue is the poster-child for why properly supporting a duality > >of driver-level vs upper level multipathing capabilities is pretty much > >impossible unless a clean design factors out the different error classes > >-- and local error retry is handled before punting to higher level > >failover retry. Think if NVMe were to adopt a bit more disciplined > >"local then failover" error handling it all gets easier. > > I don't think punting before is easier, because we do a local retry if: > - no multipathing sw on top > - request needs retry (e.g. no DNR, notretry is off etc..) > - nvme error is not pathing related (nvme_failover_req returned false) > > >This local retry _is_ NVMe specific. NVMe should just own retrying on > >the same controller no matter what (I'll hope that such retry has > >awareness to not retry indefinitely though!). > > it will retry until the retry limit. Yes. > > And this has nothing to > >do with multipathing, so the logic to handle it shouldn't be trapped in > >nvme_failover_req(). > > Well given that nvme_failover_req already may not actually failover this > makes some sense to me (although I did have some resistance to make it > that way in the first place, but was convinced otherwise). > > >I think NVMe can easily fix this by having an earlier stage of checking, > >e.g. nvme_local_retry_req(), that shortcircuits ever getting to > >higher-level multipathing consideration (be it native NVMe or DM > >multipathing) for cases like NVME_SC_CMD_INTERRUPTED. > >To be clear: the "default" case of nvme_failover_req() that returns > >false to fallback to NVMe's "local" normal NVMe error handling -- that > >can stay.. 
but a more explicit handling of cases like > >NVME_SC_CMD_INTERRUPTED should be added to a nvme_local_retry_req() > >check that happens before nvme_failover_req() in nvme_complete_rq(). > > I don't necessarily agree with having a dedicated nvme_local_retry_req(). > a request that isn't failed over, goes to local error handling (retry or > not). I actually think that just adding the condition to > nvme_complete_req and having nvme_failover_req reject it would work. > > Keith? I think that is basically what I'm thinking too. But looking at NVMe's core.c: I do not think nvme_complete_rq() allows local error handling _unless_ nvme_req_needs_retry() returns true. > >>>Anyway, no new BLK_STS is needed at this point. More discipline with > >>>how NVMe's error handling is changed is. > >> > >>Please read the above. > > > >I agree we'd need a new BLK_STS only if NVMe cannot be made to trap > >NVME_SC_CMD_INTERRUPTED for local retry _before_ considering path > >failover. > > Not sure that is better, but we can see a patch first to determine. > > >>>If NVMe wants to ensure its > >>>interface isn't broken regularly it _should_ use blk_path_error() to > >>>validate future nvme_error_status() changes. Miscategorizing NVMe > >>>errors to upper layers is a bug -- not open for debate. > >> > >>Again, don't agree is a Miscategorization nor a bug, its just something > >>that is NVMe specific. > > > >Right I understand. > > > >Think it safe to assume these types of details are why Christoph wanted > >to avoid the notion of native NVMe and DM multipathing having > >compatible error handling. There was some wisdom with that position > >(especially with native NVMe goals in mind). But if things are tweaked > >slightly then both camps _can_ be made happy. > > > >There just needs to be a willingness to work through the details, > >defensiveness needs to be shed on both sides, so constructive > >review/consideration of problems can happen. > > Agreed. 
> > >Think that has already > >happened here with our exchange. I'm open to investing effort here if > >others are up for humoring my attempt to explore fixing the issues in a > >mutually beneficial way. What's the worst that can happen? My simple > >patches might continue to be ignored? ;) > > I won't ignore it, and I apologize for ignoring the original patch > posted, I guess it flew under the radar... No worries. We all get pulled in many directions and get a lot of email. I'll be sending some nvme patch(es) shortly, to refocus our discussion. Mike _______________________________________________ Linux-nvme mailing list Linux-nvme@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme ^ permalink raw reply [flat|nested] 50+ messages in thread
* [PATCH] nvme: explicitly use normal NVMe error handling when appropriate 2020-08-10 14:36 ` Mike Snitzer @ 2020-08-10 17:22 ` Mike Snitzer 2020-08-11 3:32 ` Chao Leng 2020-08-13 14:48 ` [RESEND PATCH] " Mike Snitzer 0 siblings, 2 replies; 50+ messages in thread From: Mike Snitzer @ 2020-08-10 17:22 UTC (permalink / raw) To: Sagi Grimberg Cc: Hannes Reinecke, linux-nvme, Christoph Hellwig, dm-devel, Ewan Milne, Chao Leng, Keith Busch, Meneghini, John On Mon, Aug 10 2020 at 10:36am -0400, Mike Snitzer <snitzer@redhat.com> wrote: > On Fri, Aug 07 2020 at 7:35pm -0400, > Sagi Grimberg <sagi@grimberg.me> wrote: > > > > > >>Hey Mike, ... > > >I think NVMe can easily fix this by having an earlier stage of checking, > > >e.g. nvme_local_retry_req(), that shortcircuits ever getting to > > >higher-level multipathing consideration (be it native NVMe or DM > > >multipathing) for cases like NVME_SC_CMD_INTERRUPTED. > > >To be clear: the "default" case of nvme_failover_req() that returns > > >false to fallback to NVMe's "local" normal NVMe error handling -- that > > >can stay.. but a more explicit handling of cases like > > >NVME_SC_CMD_INTERRUPTED should be added to a nvme_local_retry_req() > > >check that happens before nvme_failover_req() in nvme_complete_rq(). > > > > I don't necessarily agree with having a dedicated nvme_local_retry_req(). > > a request that isn't failed over, goes to local error handling (retry or > > not). I actually think that just adding the condition to > > nvme_complete_req and having nvme_failover_req reject it would work. > > > > Keith? > > I think that is basically what I'm thinking too. 
From: Mike Snitzer <snitzer@redhat.com> Subject: nvme: explicitly use normal NVMe error handling when appropriate Commit 764e9332098c0 ("nvme-multipath: do not reset on unknown status"), among other things, fixed NVME_SC_CMD_INTERRUPTED error handling by changing multipathing's nvme_failover_req() to short-circuit path failover and then fall back to NVMe's normal error handling (which takes care of NVME_SC_CMD_INTERRUPTED). This detour through native NVMe multipathing code is unwelcome because it prevents NVMe core from handling NVME_SC_CMD_INTERRUPTED independent of any multipathing concerns. Introduce nvme_status_needs_local_error_handling() to prioritize non-failover retry, when appropriate, in terms of normal NVMe error handling. nvme_status_needs_local_error_handling() will naturally evolve to include handling of any other errors that normal error handling must be used for. nvme_failover_req()'s ability to fall back to normal NVMe error handling has been preserved because it may be useful for future NVME_SC that nvme_status_needs_local_error_handling() hasn't yet been trained for.
Signed-off-by: Mike Snitzer <snitzer@redhat.com> --- drivers/nvme/host/core.c | 16 ++++++++++++++-- 1 file changed, 14 insertions(+), 2 deletions(-) diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c index 88cff309d8e4..be749b690af7 100644 --- a/drivers/nvme/host/core.c +++ b/drivers/nvme/host/core.c @@ -252,6 +252,16 @@ static inline bool nvme_req_needs_retry(struct request *req) return true; } +static inline bool nvme_status_needs_local_error_handling(u16 status) +{ + switch (status & 0x7ff) { + case NVME_SC_CMD_INTERRUPTED: + return true; + default: + return false; + } +} + static void nvme_retry_req(struct request *req) { struct nvme_ns *ns = req->q->queuedata; @@ -270,7 +280,8 @@ static void nvme_retry_req(struct request *req) void nvme_complete_rq(struct request *req) { - blk_status_t status = nvme_error_status(nvme_req(req)->status); + u16 nvme_status = nvme_req(req)->status; + blk_status_t status = nvme_error_status(nvme_status); trace_nvme_complete_rq(req); @@ -280,7 +291,8 @@ void nvme_complete_rq(struct request *req) nvme_req(req)->ctrl->comp_seen = true; if (unlikely(status != BLK_STS_OK && nvme_req_needs_retry(req))) { - if ((req->cmd_flags & REQ_NVME_MPATH) && nvme_failover_req(req)) + if (!nvme_status_needs_local_error_handling(nvme_status) && + (req->cmd_flags & REQ_NVME_MPATH) && nvme_failover_req(req)) return; if (!blk_queue_dying(req->q)) { -- 2.18.0 _______________________________________________ Linux-nvme mailing list Linux-nvme@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme ^ permalink raw reply related [flat|nested] 50+ messages in thread
* Re: [PATCH] nvme: explicitly use normal NVMe error handling when appropriate 2020-08-10 17:22 ` [PATCH] nvme: explicitly use normal NVMe error handling when appropriate Mike Snitzer @ 2020-08-11 3:32 ` Chao Leng 2020-08-11 4:20 ` Mike Snitzer 2020-08-13 14:48 ` [RESEND PATCH] " Mike Snitzer 1 sibling, 1 reply; 50+ messages in thread From: Chao Leng @ 2020-08-11 3:32 UTC (permalink / raw) To: Mike Snitzer, Sagi Grimberg Cc: Hannes Reinecke, linux-nvme, Christoph Hellwig, dm-devel, Ewan Milne, Keith Busch, Meneghini, John On 2020/8/11 1:22, Mike Snitzer wrote: > On Mon, Aug 10 2020 at 10:36am -0400, > Mike Snitzer <snitzer@redhat.com> wrote: > >> On Fri, Aug 07 2020 at 7:35pm -0400, >> Sagi Grimberg <sagi@grimberg.me> wrote: >> >>> >>>>> Hey Mike, > ... >>>> I think NVMe can easily fix this by having an earlier stage of checking, >>>> e.g. nvme_local_retry_req(), that shortcircuits ever getting to >>>> higher-level multipathing consideration (be it native NVMe or DM >>>> multipathing) for cases like NVME_SC_CMD_INTERRUPTED. >>>> To be clear: the "default" case of nvme_failover_req() that returns >>>> false to fallback to NVMe's "local" normal NVMe error handling -- that >>>> can stay.. but a more explicit handling of cases like >>>> NVME_SC_CMD_INTERRUPTED should be added to a nvme_local_retry_req() >>>> check that happens before nvme_failover_req() in nvme_complete_rq(). >>> >>> I don't necessarily agree with having a dedicated nvme_local_retry_req(). >>> a request that isn't failed over, goes to local error handling (retry or >>> not). I actually think that just adding the condition to >>> nvme_complete_req and having nvme_failover_req reject it would work. >>> >>> Keith? >> >> I think that is basically what I'm thinking too. 
> > From: Mike Snitzer <snitzer@redhat.com> > Subject: nvme: explicitly use normal NVMe error handling when appropriate > > Commit 764e9332098c0 ("nvme-multipath: do not reset on unknown > status"), among other things, fixed NVME_SC_CMD_INTERRUPTED error > handling by changing multipathing's nvme_failover_req() to short-circuit > path failover and then fallback to NVMe's normal error handling (which > takes care of NVME_SC_CMD_INTERRUPTED). > > This detour through native NVMe multipathing code is unwelcome because > it prevents NVMe core from handling NVME_SC_CMD_INTERRUPTED independent > of any multipathing concerns. > > Introduce nvme_status_needs_local_error_handling() to prioritize > non-failover retry, when appropriate, in terms of normal NVMe error > handling. nvme_status_needs_local_error_handling() will naturely evolve > to include handling of any other errors that normal error handling must > be used for. > > nvme_failover_req()'s ability to fallback to normal NVMe error handling > has been preserved because it may be useful for future NVME_SC that > nvme_status_needs_local_error_handling() hasn't yet been trained for. 
> > Signed-off-by: Mike Snitzer <snitzer@redhat.com> > --- > drivers/nvme/host/core.c | 16 ++++++++++++++-- > 1 file changed, 14 insertions(+), 2 deletions(-) > > diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c > index 88cff309d8e4..be749b690af7 100644 > --- a/drivers/nvme/host/core.c > +++ b/drivers/nvme/host/core.c > @@ -252,6 +252,16 @@ static inline bool nvme_req_needs_retry(struct request *req) > return true; > } > > +static inline bool nvme_status_needs_local_error_handling(u16 status) > +{ > + switch (status & 0x7ff) { > + case NVME_SC_CMD_INTERRUPTED: > + return true; > + default: > + return false; > + } > +} > + > static void nvme_retry_req(struct request *req) > { > struct nvme_ns *ns = req->q->queuedata; > @@ -270,7 +280,8 @@ static void nvme_retry_req(struct request *req) > > void nvme_complete_rq(struct request *req) > { > - blk_status_t status = nvme_error_status(nvme_req(req)->status); > + u16 nvme_status = nvme_req(req)->status; > + blk_status_t status = nvme_error_status(nvme_status); > > trace_nvme_complete_rq(req); > > @@ -280,7 +291,8 @@ void nvme_complete_rq(struct request *req) > nvme_req(req)->ctrl->comp_seen = true; > > if (unlikely(status != BLK_STS_OK && nvme_req_needs_retry(req))) { > - if ((req->cmd_flags & REQ_NVME_MPATH) && nvme_failover_req(req)) > + if (!nvme_status_needs_local_error_handling(nvme_status) && > + (req->cmd_flags & REQ_NVME_MPATH) && nvme_failover_req(req))

This looks to have no effect: with nvme multipath we already retry locally, and with dm-multipath the error is still returned.

> return; > > if (!blk_queue_dying(req->q)) { > Suggest: REQ_FAILFAST_TRANSPORT may have been designed for SCSI, because the SCSI protocol does not define a local retry mechanism. SCSI implements a fuzzy local retry mechanism, so it needs REQ_FAILFAST_TRANSPORT for multipath software; multipath software is expected to retry according to the error code. NVMe differs from SCSI in this respect.
It defines a local retry mechanism and path error codes, so NVMe should not care about REQ_FAILFAST_TRANSPORT. Also, for nvme multipath, if the error code is not a path error, multipath will not fail over to retry; but blk_queue_dying() may return true, in which case the I/O cannot be retried on the current path and is interrupted. Both blk_queue_dying() and path errors need failover to retry. So we can do it like this: --- drivers/nvme/host/core.c | 26 +++++++++++++++++++------- drivers/nvme/host/multipath.c | 11 +++-------- drivers/nvme/host/nvme.h | 5 ++--- include/linux/nvme.h | 9 +++++++++ 4 files changed, 33 insertions(+), 18 deletions(-) diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c index 4ee2330c603e..07471bd37f60 100644 --- a/drivers/nvme/host/core.c +++ b/drivers/nvme/host/core.c @@ -243,7 +243,7 @@ static blk_status_t nvme_error_status(u16 status) static inline bool nvme_req_needs_retry(struct request *req) { - if (blk_noretry_request(req)) + if (req->cmd_flags & (REQ_FAILFAST_DEV | REQ_FAILFAST_DRIVER)) return false; if (nvme_req(req)->status & NVME_SC_DNR) return false; @@ -252,6 +252,14 @@ static inline bool nvme_req_needs_retry(struct request *req) return true; } +static inline bool nvme_req_path_error(struct request *req) +{ + if ((nvme_req(req)->status & NVME_SCT_MASK) == NVME_SCT_PATH || + blk_queue_dying(req->q)) + return true; + return false; +} + static void nvme_retry_req(struct request *req) { struct nvme_ns *ns = req->q->queuedata; @@ -270,7 +278,7 @@ static void nvme_retry_req(struct request *req) void nvme_complete_rq(struct request *req) { - blk_status_t status = nvme_error_status(nvme_req(req)->status); + blk_status_t status; trace_nvme_complete_rq(req); @@ -279,16 +287,20 @@ void nvme_complete_rq(struct request *req) if (nvme_req(req)->ctrl->kas) nvme_req(req)->ctrl->comp_seen = true; - if (unlikely(status != BLK_STS_OK && nvme_req_needs_retry(req))) { - if ((req->cmd_flags & REQ_NVME_MPATH) && nvme_failover_req(req)) - return; - - if
(!blk_queue_dying(req->q)) { + if (unlikely(nvme_req(req)->status != NVME_SC_SUCCESS && + nvme_req_needs_retry(req))) { + if (nvme_req_path_error(req)) { + if (req->cmd_flags & REQ_NVME_MPATH) { + nvme_failover_req(req); + return; + } + } else { nvme_retry_req(req); return; } } + status = nvme_error_status(nvme_req(req)->status); nvme_trace_bio_complete(req, status); blk_mq_end_request(req, status); } diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c index 66509472fe06..e182fb3bcd0c 100644 --- a/drivers/nvme/host/multipath.c +++ b/drivers/nvme/host/multipath.c @@ -65,7 +65,7 @@ void nvme_set_disk_name(char *disk_name, struct nvme_ns *ns, } } -bool nvme_failover_req(struct request *req) +void nvme_failover_req(struct request *req) { struct nvme_ns *ns = req->q->queuedata; u16 status = nvme_req(req)->status; @@ -90,17 +90,13 @@ bool nvme_failover_req(struct request *req) queue_work(nvme_wq, &ns->ctrl->ana_work); } break; - case NVME_SC_HOST_PATH_ERROR: - case NVME_SC_HOST_ABORTED_CMD: + default: /* - * Temporary transport disruption in talking to the controller. + * Normal error path. * Try to send on a new path. */ nvme_mpath_clear_current_path(ns); break; - default: - /* This was a non-ANA error so follow the normal error path. 
*/ - return false; } spin_lock_irqsave(&ns->head->requeue_lock, flags); @@ -109,7 +105,6 @@ bool nvme_failover_req(struct request *req) blk_mq_end_request(req, 0); kblockd_schedule_work(&ns->head->requeue_work); - return true; } void nvme_kick_requeue_lists(struct nvme_ctrl *ctrl) diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h index 09ffc3246f60..cbb5d4ba6241 100644 --- a/drivers/nvme/host/nvme.h +++ b/drivers/nvme/host/nvme.h @@ -582,7 +582,7 @@ void nvme_mpath_wait_freeze(struct nvme_subsystem *subsys); void nvme_mpath_start_freeze(struct nvme_subsystem *subsys); void nvme_set_disk_name(char *disk_name, struct nvme_ns *ns, struct nvme_ctrl *ctrl, int *flags); -bool nvme_failover_req(struct request *req); +void nvme_failover_req(struct request *req); void nvme_kick_requeue_lists(struct nvme_ctrl *ctrl); int nvme_mpath_alloc_disk(struct nvme_ctrl *ctrl,struct nvme_ns_head *head); void nvme_mpath_add_disk(struct nvme_ns *ns, struct nvme_id_ns *id); @@ -640,9 +640,8 @@ static inline void nvme_set_disk_name(char *disk_name, struct nvme_ns *ns, sprintf(disk_name, "nvme%dn%d", ctrl->instance, ns->head->instance); } -static inline bool nvme_failover_req(struct request *req) +static inline void nvme_failover_req(struct request *req) { - return false; } static inline void nvme_kick_requeue_lists(struct nvme_ctrl *ctrl) { diff --git a/include/linux/nvme.h b/include/linux/nvme.h index 5ce51ab4c50e..8c4a5b4d5b4d 100644 --- a/include/linux/nvme.h +++ b/include/linux/nvme.h @@ -1441,6 +1441,15 @@ enum { NVME_SC_DNR = 0x4000, }; +#define NVME_SCT_MASK 0x700 +enum { + NVME_SCT_GENERIC = 0, + NVME_SCT_COMMAND_SPECIFIC = 0x100, + NVME_SCT_MEDIA = 0x200, + NVME_SCT_PATH = 0x300, + NVME_SCT_VENDOR = 0x700 +}; + struct nvme_completion { /* * Used by Admin and Fabrics commands to return data: -- 2.16.4 _______________________________________________ Linux-nvme mailing list Linux-nvme@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme ^ 
* Re: nvme: explicitly use normal NVMe error handling when appropriate 2020-08-11 3:32 ` Chao Leng @ 2020-08-11 4:20 ` Mike Snitzer 2020-08-11 6:17 ` Chao Leng 0 siblings, 1 reply; 50+ messages in thread From: Mike Snitzer @ 2020-08-11 4:20 UTC (permalink / raw) To: Chao Leng Cc: Sagi Grimberg, Hannes Reinecke, linux-nvme, Christoph Hellwig, dm-devel, Ewan Milne, Keith Busch, Meneghini, John On Mon, Aug 10 2020 at 11:32pm -0400, Chao Leng <lengchao@huawei.com> wrote: > > > On 2020/8/11 1:22, Mike Snitzer wrote: > >On Mon, Aug 10 2020 at 10:36am -0400, > >Mike Snitzer <snitzer@redhat.com> wrote: > > > >>On Fri, Aug 07 2020 at 7:35pm -0400, > >>Sagi Grimberg <sagi@grimberg.me> wrote: > >> > >>> > >>>>>Hey Mike, > >... > >>>>I think NVMe can easily fix this by having an earlier stage of checking, > >>>>e.g. nvme_local_retry_req(), that shortcircuits ever getting to > >>>>higher-level multipathing consideration (be it native NVMe or DM > >>>>multipathing) for cases like NVME_SC_CMD_INTERRUPTED. > >>>>To be clear: the "default" case of nvme_failover_req() that returns > >>>>false to fallback to NVMe's "local" normal NVMe error handling -- that > >>>>can stay.. but a more explicit handling of cases like > >>>>NVME_SC_CMD_INTERRUPTED should be added to a nvme_local_retry_req() > >>>>check that happens before nvme_failover_req() in nvme_complete_rq(). > >>> > >>>I don't necessarily agree with having a dedicated nvme_local_retry_req(). > >>>a request that isn't failed over, goes to local error handling (retry or > >>>not). I actually think that just adding the condition to > >>>nvme_complete_req and having nvme_failover_req reject it would work. > >>> > >>>Keith? > >> > >>I think that is basically what I'm thinking too. 
> > > >From: Mike Snitzer <snitzer@redhat.com> > >Subject: nvme: explicitly use normal NVMe error handling when appropriate > > > >Commit 764e9332098c0 ("nvme-multipath: do not reset on unknown > >status"), among other things, fixed NVME_SC_CMD_INTERRUPTED error > >handling by changing multipathing's nvme_failover_req() to short-circuit > >path failover and then fallback to NVMe's normal error handling (which > >takes care of NVME_SC_CMD_INTERRUPTED). > > > >This detour through native NVMe multipathing code is unwelcome because > >it prevents NVMe core from handling NVME_SC_CMD_INTERRUPTED independent > >of any multipathing concerns. > > > >Introduce nvme_status_needs_local_error_handling() to prioritize > >non-failover retry, when appropriate, in terms of normal NVMe error > >handling. nvme_status_needs_local_error_handling() will naturely evolve > >to include handling of any other errors that normal error handling must > >be used for. > > > >nvme_failover_req()'s ability to fallback to normal NVMe error handling > >has been preserved because it may be useful for future NVME_SC that > >nvme_status_needs_local_error_handling() hasn't yet been trained for. 
> > > >Signed-off-by: Mike Snitzer <snitzer@redhat.com> > >--- > > drivers/nvme/host/core.c | 16 ++++++++++++++-- > > 1 file changed, 14 insertions(+), 2 deletions(-) > > > >diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c > >index 88cff309d8e4..be749b690af7 100644 > >--- a/drivers/nvme/host/core.c > >+++ b/drivers/nvme/host/core.c > >@@ -252,6 +252,16 @@ static inline bool nvme_req_needs_retry(struct request *req) > > return true; > > } > >+static inline bool nvme_status_needs_local_error_handling(u16 status) > >+{ > >+ switch (status & 0x7ff) { > >+ case NVME_SC_CMD_INTERRUPTED: > >+ return true; > >+ default: > >+ return false; > >+ } > >+} > >+ > > static void nvme_retry_req(struct request *req) > > { > > struct nvme_ns *ns = req->q->queuedata; > >@@ -270,7 +280,8 @@ static void nvme_retry_req(struct request *req) > > void nvme_complete_rq(struct request *req) > > { > >- blk_status_t status = nvme_error_status(nvme_req(req)->status); > >+ u16 nvme_status = nvme_req(req)->status; > >+ blk_status_t status = nvme_error_status(nvme_status); > > trace_nvme_complete_rq(req); > >@@ -280,7 +291,8 @@ void nvme_complete_rq(struct request *req) > > nvme_req(req)->ctrl->comp_seen = true; > > if (unlikely(status != BLK_STS_OK && nvme_req_needs_retry(req))) { > >- if ((req->cmd_flags & REQ_NVME_MPATH) && nvme_failover_req(req)) > >+ if (!nvme_status_needs_local_error_handling(nvme_status) && > >+ (req->cmd_flags & REQ_NVME_MPATH) && nvme_failover_req(req)) > > This looks no affect. if work with nvme multipath, now is already retry local. Not if NVMe is built without multipathing configured. > If work with dm-multipath, still return error. Yes, I'm aware. Use of REQ_FAILFAST_TRANSPORT isn't something that is needed for NVMe, so why are you proposing hacks in NVMe to deal with it? > > return; > > if (!blk_queue_dying(req->q)) { > > > > Suggest: > REQ_FAILFAST_TRANSPORT may be designed for scsi, because scsi protocol > do not difine the local retry mechanism. 
SCSI implements a fuzzy local > retry mechanism, so need the REQ_FAILFAST_TRANSPORT for multipath > software, multipath software retry according error code is expected. > nvme is different with scsi about this. It define local retry mechanism > and path error code, so nvme should not care REQ_FAILFAST_TRANSPORT. Exactly. Except by "nvme should not care REQ_FAILFAST_TRANSPORT." your patch says you mean "nvme shouldn't disallow retry if REQ_FAILFAST_TRANSPORT is set". I'm saying: don't try to get such changes into NVMe. In general, aspects of your patch may have merit but overall it is doing too much. Mike
* Re: nvme: explicitly use normal NVMe error handling when appropriate 2020-08-11 4:20 ` Mike Snitzer @ 2020-08-11 6:17 ` Chao Leng 2020-08-11 14:12 ` Mike Snitzer 0 siblings, 1 reply; 50+ messages in thread From: Chao Leng @ 2020-08-11 6:17 UTC (permalink / raw) To: Mike Snitzer Cc: Sagi Grimberg, Hannes Reinecke, linux-nvme, Christoph Hellwig, dm-devel, Ewan Milne, Keith Busch, Meneghini, John On 2020/8/11 12:20, Mike Snitzer wrote: > On Mon, Aug 10 2020 at 11:32pm -0400, > Chao Leng <lengchao@huawei.com> wrote: > >> >> >> On 2020/8/11 1:22, Mike Snitzer wrote: >>> On Mon, Aug 10 2020 at 10:36am -0400, >>> Mike Snitzer <snitzer@redhat.com> wrote: >>> >>>> On Fri, Aug 07 2020 at 7:35pm -0400, >>>> Sagi Grimberg <sagi@grimberg.me> wrote: >>>> >>>>> >>>>>>> Hey Mike, >>> ... >>>>>> I think NVMe can easily fix this by having an earlier stage of checking, >>>>>> e.g. nvme_local_retry_req(), that shortcircuits ever getting to >>>>>> higher-level multipathing consideration (be it native NVMe or DM >>>>>> multipathing) for cases like NVME_SC_CMD_INTERRUPTED. >>>>>> To be clear: the "default" case of nvme_failover_req() that returns >>>>>> false to fallback to NVMe's "local" normal NVMe error handling -- that >>>>>> can stay.. but a more explicit handling of cases like >>>>>> NVME_SC_CMD_INTERRUPTED should be added to a nvme_local_retry_req() >>>>>> check that happens before nvme_failover_req() in nvme_complete_rq(). >>>>> >>>>> I don't necessarily agree with having a dedicated nvme_local_retry_req(). >>>>> a request that isn't failed over, goes to local error handling (retry or >>>>> not). I actually think that just adding the condition to >>>>> nvme_complete_req and having nvme_failover_req reject it would work. >>>>> >>>>> Keith? >>>> >>>> I think that is basically what I'm thinking too. 
>>> >>> From: Mike Snitzer <snitzer@redhat.com> >>> Subject: nvme: explicitly use normal NVMe error handling when appropriate >>> >>> Commit 764e9332098c0 ("nvme-multipath: do not reset on unknown >>> status"), among other things, fixed NVME_SC_CMD_INTERRUPTED error >>> handling by changing multipathing's nvme_failover_req() to short-circuit >>> path failover and then fallback to NVMe's normal error handling (which >>> takes care of NVME_SC_CMD_INTERRUPTED). >>> >>> This detour through native NVMe multipathing code is unwelcome because >>> it prevents NVMe core from handling NVME_SC_CMD_INTERRUPTED independent >>> of any multipathing concerns. >>> >>> Introduce nvme_status_needs_local_error_handling() to prioritize >>> non-failover retry, when appropriate, in terms of normal NVMe error >>> handling. nvme_status_needs_local_error_handling() will naturely evolve >>> to include handling of any other errors that normal error handling must >>> be used for. >>> >>> nvme_failover_req()'s ability to fallback to normal NVMe error handling >>> has been preserved because it may be useful for future NVME_SC that >>> nvme_status_needs_local_error_handling() hasn't yet been trained for. 
>>> >>> Signed-off-by: Mike Snitzer <snitzer@redhat.com> >>> --- >>> drivers/nvme/host/core.c | 16 ++++++++++++++-- >>> 1 file changed, 14 insertions(+), 2 deletions(-) >>> >>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c >>> index 88cff309d8e4..be749b690af7 100644 >>> --- a/drivers/nvme/host/core.c >>> +++ b/drivers/nvme/host/core.c >>> @@ -252,6 +252,16 @@ static inline bool nvme_req_needs_retry(struct request *req) >>> return true; >>> } >>> +static inline bool nvme_status_needs_local_error_handling(u16 status) >>> +{ >>> + switch (status & 0x7ff) { >>> + case NVME_SC_CMD_INTERRUPTED: >>> + return true; >>> + default: >>> + return false; >>> + } >>> +} >>> + >>> static void nvme_retry_req(struct request *req) >>> { >>> struct nvme_ns *ns = req->q->queuedata; >>> @@ -270,7 +280,8 @@ static void nvme_retry_req(struct request *req) >>> void nvme_complete_rq(struct request *req) >>> { >>> - blk_status_t status = nvme_error_status(nvme_req(req)->status); >>> + u16 nvme_status = nvme_req(req)->status; >>> + blk_status_t status = nvme_error_status(nvme_status); >>> trace_nvme_complete_rq(req); >>> @@ -280,7 +291,8 @@ void nvme_complete_rq(struct request *req) >>> nvme_req(req)->ctrl->comp_seen = true; >>> if (unlikely(status != BLK_STS_OK && nvme_req_needs_retry(req))) { >>> - if ((req->cmd_flags & REQ_NVME_MPATH) && nvme_failover_req(req)) >>> + if (!nvme_status_needs_local_error_handling(nvme_status) && >>> + (req->cmd_flags & REQ_NVME_MPATH) && nvme_failover_req(req)) >> >> This looks no affect. if work with nvme multipath, now is already retry local. > > Not if NVMe is built without multipathing configured. If without nvme multipathing configured, now is also retry local, do not need !nvme_status_needs_local_error_handling(nvme_status). > >> If work with dm-multipath, still return error. > > Yes, I'm aware. Use of REQ_FAILFAST_TRANSPORT isn't something that is > needed for NVMe, so why are you proposing hacks in NVMe to deal with it? 
I was just describing the possible scenarios: 1. nvme multipathing configured; 2. without any multipath; 3. with dm-multipath. > >>> return; >>> if (!blk_queue_dying(req->q)) { >>> >> >> Suggest: >> REQ_FAILFAST_TRANSPORT may be designed for scsi, because scsi protocol >> do not difine the local retry mechanism. SCSI implements a fuzzy local >> retry mechanism, so need the REQ_FAILFAST_TRANSPORT for multipath >> software, multipath software retry according error code is expected. >> nvme is different with scsi about this. It define local retry mechanism >> and path error code, so nvme should not care REQ_FAILFAST_TRANSPORT. > > Exactly. Except by "nvme should not care REQ_FAILFAST_TRANSPORT." your > patch says you mean "nvme shouldn't disallow retry if > REQ_FAILFAST_TRANSPORT is it". I'm saying: don't try to get such > changes into NVMe. No, the patch just means: on a path error, fail over and retry via multipath (nvme multipath or dm-multipath). Everything else needs a local retry, after a delay determined by the status (CRD) and CRDT. nvme multipath already works like this; the patch makes dm-multipath work like nvme multipath. > > In general, aspects of your patch may have merit but overall it is doing too much. > > Mike > > . >
* Re: nvme: explicitly use normal NVMe error handling when appropriate 2020-08-11 6:17 ` Chao Leng @ 2020-08-11 14:12 ` Mike Snitzer 0 siblings, 0 replies; 50+ messages in thread From: Mike Snitzer @ 2020-08-11 14:12 UTC (permalink / raw) To: Chao Leng Cc: Sagi Grimberg, Hannes Reinecke, linux-nvme, Christoph Hellwig, dm-devel, Ewan Milne, Keith Busch, Meneghini, John On Tue, Aug 11 2020 at 2:17am -0400, Chao Leng <lengchao@huawei.com> wrote: > > > On 2020/8/11 12:20, Mike Snitzer wrote: > >On Mon, Aug 10 2020 at 11:32pm -0400, > >Chao Leng <lengchao@huawei.com> wrote: > > > >> > >> > >>On 2020/8/11 1:22, Mike Snitzer wrote: > >>>On Mon, Aug 10 2020 at 10:36am -0400, > >>>Mike Snitzer <snitzer@redhat.com> wrote: > >>> > >>>>On Fri, Aug 07 2020 at 7:35pm -0400, > >>>>Sagi Grimberg <sagi@grimberg.me> wrote: > >>>> > >>>>> > >>>>>>>Hey Mike, > >>>... > >>>>>>I think NVMe can easily fix this by having an earlier stage of checking, > >>>>>>e.g. nvme_local_retry_req(), that shortcircuits ever getting to > >>>>>>higher-level multipathing consideration (be it native NVMe or DM > >>>>>>multipathing) for cases like NVME_SC_CMD_INTERRUPTED. > >>>>>>To be clear: the "default" case of nvme_failover_req() that returns > >>>>>>false to fallback to NVMe's "local" normal NVMe error handling -- that > >>>>>>can stay.. but a more explicit handling of cases like > >>>>>>NVME_SC_CMD_INTERRUPTED should be added to a nvme_local_retry_req() > >>>>>>check that happens before nvme_failover_req() in nvme_complete_rq(). > >>>>> > >>>>>I don't necessarily agree with having a dedicated nvme_local_retry_req(). > >>>>>a request that isn't failed over, goes to local error handling (retry or > >>>>>not). I actually think that just adding the condition to > >>>>>nvme_complete_req and having nvme_failover_req reject it would work. > >>>>> > >>>>>Keith? > >>>> > >>>>I think that is basically what I'm thinking too. 
> >>> > >>>From: Mike Snitzer <snitzer@redhat.com> > >>>Subject: nvme: explicitly use normal NVMe error handling when appropriate > >>> > >>>Commit 764e9332098c0 ("nvme-multipath: do not reset on unknown > >>>status"), among other things, fixed NVME_SC_CMD_INTERRUPTED error > >>>handling by changing multipathing's nvme_failover_req() to short-circuit > >>>path failover and then fallback to NVMe's normal error handling (which > >>>takes care of NVME_SC_CMD_INTERRUPTED). > >>> > >>>This detour through native NVMe multipathing code is unwelcome because > >>>it prevents NVMe core from handling NVME_SC_CMD_INTERRUPTED independent > >>>of any multipathing concerns. > >>> > >>>Introduce nvme_status_needs_local_error_handling() to prioritize > >>>non-failover retry, when appropriate, in terms of normal NVMe error > >>>handling. nvme_status_needs_local_error_handling() will naturely evolve > >>>to include handling of any other errors that normal error handling must > >>>be used for. > >>> > >>>nvme_failover_req()'s ability to fallback to normal NVMe error handling > >>>has been preserved because it may be useful for future NVME_SC that > >>>nvme_status_needs_local_error_handling() hasn't yet been trained for. 
> >>> > >>>Signed-off-by: Mike Snitzer <snitzer@redhat.com> > >>>--- > >>> drivers/nvme/host/core.c | 16 ++++++++++++++-- > >>> 1 file changed, 14 insertions(+), 2 deletions(-) > >>> > >>>diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c > >>>index 88cff309d8e4..be749b690af7 100644 > >>>--- a/drivers/nvme/host/core.c > >>>+++ b/drivers/nvme/host/core.c > >>>@@ -252,6 +252,16 @@ static inline bool nvme_req_needs_retry(struct request *req) > >>> return true; > >>> } > >>>+static inline bool nvme_status_needs_local_error_handling(u16 status) > >>>+{ > >>>+ switch (status & 0x7ff) { > >>>+ case NVME_SC_CMD_INTERRUPTED: > >>>+ return true; > >>>+ default: > >>>+ return false; > >>>+ } > >>>+} > >>>+ > >>> static void nvme_retry_req(struct request *req) > >>> { > >>> struct nvme_ns *ns = req->q->queuedata; > >>>@@ -270,7 +280,8 @@ static void nvme_retry_req(struct request *req) > >>> void nvme_complete_rq(struct request *req) > >>> { > >>>- blk_status_t status = nvme_error_status(nvme_req(req)->status); > >>>+ u16 nvme_status = nvme_req(req)->status; > >>>+ blk_status_t status = nvme_error_status(nvme_status); > >>> trace_nvme_complete_rq(req); > >>>@@ -280,7 +291,8 @@ void nvme_complete_rq(struct request *req) > >>> nvme_req(req)->ctrl->comp_seen = true; > >>> if (unlikely(status != BLK_STS_OK && nvme_req_needs_retry(req))) { > >>>- if ((req->cmd_flags & REQ_NVME_MPATH) && nvme_failover_req(req)) > >>>+ if (!nvme_status_needs_local_error_handling(nvme_status) && > >>>+ (req->cmd_flags & REQ_NVME_MPATH) && nvme_failover_req(req)) > >> > >>This looks no affect. if work with nvme multipath, now is already retry local. > > > >Not if NVMe is built without multipathing configured. > > If without nvme multipathing configured, now is also retry local, do not need > !nvme_status_needs_local_error_handling(nvme_status). True, REQ_NVME_MPATH won't be set, so it'll fall through. 
The real benefit of my patch is that there is a cleaner code flow for handling errors with normal NVMe retry (without bouncing into failover code and then falling back to normal retry -- code that your patch does remove). So my change is an obvious yet small improvement that makes the NVMe core error handling clearer -- yet preserves that fallback from failover_retry to help future-proof NVMe from new NVME_SC, which John M. is very clear about needing. > >>If work with dm-multipath, still return error. > > > >Yes, I'm aware. Use of REQ_FAILFAST_TRANSPORT isn't something that is > >needed for NVMe, so why are you proposing hacks in NVMe to deal with it? > I just describe the possible scenarios:1.nvme multipathing configured. > 2.without any multipath.3. with dm-multipath. I understand. > > > >>> return; > >>> if (!blk_queue_dying(req->q)) { > >>> > >> > >>Suggest: > >>REQ_FAILFAST_TRANSPORT may be designed for scsi, because scsi protocol > >>do not difine the local retry mechanism. SCSI implements a fuzzy local > >>retry mechanism, so need the REQ_FAILFAST_TRANSPORT for multipath > >>software, multipath software retry according error code is expected. > >>nvme is different with scsi about this. It define local retry mechanism > >>and path error code, so nvme should not care REQ_FAILFAST_TRANSPORT. > > > >Exactly. Except by "nvme should not care REQ_FAILFAST_TRANSPORT." your > >patch says you mean "nvme shouldn't disallow retry if > >REQ_FAILFAST_TRANSPORT is it". I'm saying: don't try to get such > >changes into NVMe. > > no. the patch just mean: if path error, fail over to retry by multipath > (nvme multipath or dm-multipath). Other need local retry local, retry > after a defined time according to status(CRD) and CRDT. Now nvme multipath > is already do like this, the patch make dm-multipath work like nvme multipath. I appreciate your enthusiasm for making native NVMe multipathing and dm-multipath coexist.
But I think you're naive about the willingness to make that a reality. As such, any change that only benefits dm-multipath is dead on arrival. Your patch is mixed with various changes like that. I'd be stunned if Christoph accepted it. Thanks, Mike
* [RESEND PATCH] nvme: explicitly use normal NVMe error handling when appropriate 2020-08-10 17:22 ` [PATCH] nvme: explicitly use normal NVMe error handling when appropriate Mike Snitzer 2020-08-11 3:32 ` Chao Leng @ 2020-08-13 14:48 ` Mike Snitzer 2020-08-13 15:29 ` Meneghini, John ` (2 more replies) 1 sibling, 3 replies; 50+ messages in thread From: Mike Snitzer @ 2020-08-13 14:48 UTC (permalink / raw) To: Sagi Grimberg Cc: linux-nvme, Christoph Hellwig, dm-devel, Hannes Reinecke, Chao Leng, Keith Busch, Meneghini, John, Ewan Milne Commit 764e9332098c0 ("nvme-multipath: do not reset on unknown status"), among other things, fixed NVME_SC_CMD_INTERRUPTED error handling by changing multipathing's nvme_failover_req() to short-circuit path failover and then fall back to NVMe's normal error handling (which takes care of NVME_SC_CMD_INTERRUPTED). This detour through native NVMe multipathing code is unwelcome because it prevents NVMe core from handling NVME_SC_CMD_INTERRUPTED independent of any multipathing concerns. Introduce nvme_status_needs_local_error_handling() to prioritize non-failover retry, when appropriate, in terms of normal NVMe error handling. nvme_status_needs_local_error_handling() will naturally evolve to include handling of any other errors that normal error handling must be used for. nvme_failover_req()'s ability to fall back to normal NVMe error handling has been preserved because it may be useful for future NVME_SC that nvme_status_needs_local_error_handling() hasn't been trained for yet.
Signed-off-by: Mike Snitzer <snitzer@redhat.com> --- drivers/nvme/host/core.c | 16 ++++++++++++++-- 1 file changed, 14 insertions(+), 2 deletions(-) diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c index 88cff309d8e4..be749b690af7 100644 --- a/drivers/nvme/host/core.c +++ b/drivers/nvme/host/core.c @@ -252,6 +252,16 @@ static inline bool nvme_req_needs_retry(struct request *req) return true; } +static inline bool nvme_status_needs_local_error_handling(u16 status) +{ + switch (status & 0x7ff) { + case NVME_SC_CMD_INTERRUPTED: + return true; + default: + return false; + } +} + static void nvme_retry_req(struct request *req) { struct nvme_ns *ns = req->q->queuedata; @@ -270,7 +280,8 @@ static void nvme_retry_req(struct request *req) void nvme_complete_rq(struct request *req) { - blk_status_t status = nvme_error_status(nvme_req(req)->status); + u16 nvme_status = nvme_req(req)->status; + blk_status_t status = nvme_error_status(nvme_status); trace_nvme_complete_rq(req); @@ -280,7 +291,8 @@ void nvme_complete_rq(struct request *req) nvme_req(req)->ctrl->comp_seen = true; if (unlikely(status != BLK_STS_OK && nvme_req_needs_retry(req))) { - if ((req->cmd_flags & REQ_NVME_MPATH) && nvme_failover_req(req)) + if (!nvme_status_needs_local_error_handling(nvme_status) && + (req->cmd_flags & REQ_NVME_MPATH) && nvme_failover_req(req)) return; if (!blk_queue_dying(req->q)) { -- 2.18.0 _______________________________________________ Linux-nvme mailing list Linux-nvme@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme ^ permalink raw reply related [flat|nested] 50+ messages in thread
* Re: [RESEND PATCH] nvme: explicitly use normal NVMe error handling when appropriate 2020-08-13 14:48 ` [RESEND PATCH] " Mike Snitzer @ 2020-08-13 15:29 ` Meneghini, John 2020-08-13 15:43 ` Mike Snitzer 2020-08-13 15:36 ` Christoph Hellwig 2020-08-14 3:23 ` Meneghini, John 2 siblings, 1 reply; 50+ messages in thread From: Meneghini, John @ 2020-08-13 15:29 UTC (permalink / raw) To: Mike Snitzer, Sagi Grimberg Cc: linux-nvme, Christoph Hellwig, dm-devel, Hannes Reinecke, Chao Leng, Keith Busch, Meneghini, John, Ewan Milne Mike, I don't see your patch at: https://patchwork.kernel.org/project/linux-block/list/?submitter=1643 Can you please upload this patch there? Thanks, /John On 8/13/20, 10:48 AM, "Mike Snitzer" <snitzer@redhat.com> wrote: Commit 764e9332098c0 ("nvme-multipath: do not reset on unknown status"), among other things, fixed NVME_SC_CMD_INTERRUPTED error handling by changing multipathing's nvme_failover_req() to short-circuit path failover and then fallback to NVMe's normal error handling (which takes care of NVME_SC_CMD_INTERRUPTED). This detour through native NVMe multipathing code is unwelcome because it prevents NVMe core from handling NVME_SC_CMD_INTERRUPTED independent of any multipathing concerns. Introduce nvme_status_needs_local_error_handling() to prioritize non-failover retry, when appropriate, in terms of normal NVMe error handling. nvme_status_needs_local_error_handling() will naturely evolve to include handling of any other errors that normal error handling must be used for. nvme_failover_req()'s ability to fallback to normal NVMe error handling has been preserved because it may be useful for future NVME_SC that nvme_status_needs_local_error_handling() hasn't been trained for yet. 
Signed-off-by: Mike Snitzer <snitzer@redhat.com> --- drivers/nvme/host/core.c | 16 ++++++++++++++-- 1 file changed, 14 insertions(+), 2 deletions(-) diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c index 88cff309d8e4..be749b690af7 100644 --- a/drivers/nvme/host/core.c +++ b/drivers/nvme/host/core.c @@ -252,6 +252,16 @@ static inline bool nvme_req_needs_retry(struct request *req) return true; } +static inline bool nvme_status_needs_local_error_handling(u16 status) +{ + switch (status & 0x7ff) { + case NVME_SC_CMD_INTERRUPTED: + return true; + default: + return false; + } +} + static void nvme_retry_req(struct request *req) { struct nvme_ns *ns = req->q->queuedata; @@ -270,7 +280,8 @@ static void nvme_retry_req(struct request *req) void nvme_complete_rq(struct request *req) { - blk_status_t status = nvme_error_status(nvme_req(req)->status); + u16 nvme_status = nvme_req(req)->status; + blk_status_t status = nvme_error_status(nvme_status); trace_nvme_complete_rq(req); @@ -280,7 +291,8 @@ void nvme_complete_rq(struct request *req) nvme_req(req)->ctrl->comp_seen = true; if (unlikely(status != BLK_STS_OK && nvme_req_needs_retry(req))) { - if ((req->cmd_flags & REQ_NVME_MPATH) && nvme_failover_req(req)) + if (!nvme_status_needs_local_error_handling(nvme_status) && + (req->cmd_flags & REQ_NVME_MPATH) && nvme_failover_req(req)) return; if (!blk_queue_dying(req->q)) { -- 2.18.0 _______________________________________________ Linux-nvme mailing list Linux-nvme@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [RESEND PATCH] nvme: explicitly use normal NVMe error handling when appropriate 2020-08-13 15:29 ` Meneghini, John @ 2020-08-13 15:43 ` Mike Snitzer 2020-08-13 15:59 ` Meneghini, John 0 siblings, 1 reply; 50+ messages in thread From: Mike Snitzer @ 2020-08-13 15:43 UTC (permalink / raw) To: Meneghini, John Cc: Sagi Grimberg, Ewan Milne, Christoph Hellwig, dm-devel, linux-nvme, Chao Leng, Keith Busch, Hannes Reinecke Maybe because I didn't cc linux-block? Only way that I know to "upload this patch there" is to have cc'd linux-block. But the patch is in dm-devel's patchwork here: https://patchwork.kernel.org/patch/11712563/ Is that sufficient for your needs? Thanks, Mike On Thu, Aug 13 2020 at 11:29am -0400, Meneghini, John <John.Meneghini@netapp.com> wrote: > Mike, > > I don't see your patch at: > > https://patchwork.kernel.org/project/linux-block/list/?submitter=1643 > > Can you please upload this patch there? > > Thanks, > > /John > > On 8/13/20, 10:48 AM, "Mike Snitzer" <snitzer@redhat.com> wrote: > > Commit 764e9332098c0 ("nvme-multipath: do not reset on unknown > status"), among other things, fixed NVME_SC_CMD_INTERRUPTED error > handling by changing multipathing's nvme_failover_req() to short-circuit > path failover and then fallback to NVMe's normal error handling (which > takes care of NVME_SC_CMD_INTERRUPTED). > > This detour through native NVMe multipathing code is unwelcome because > it prevents NVMe core from handling NVME_SC_CMD_INTERRUPTED independent > of any multipathing concerns. > > Introduce nvme_status_needs_local_error_handling() to prioritize > non-failover retry, when appropriate, in terms of normal NVMe error > handling. nvme_status_needs_local_error_handling() will naturely evolve > to include handling of any other errors that normal error handling must > be used for. 
> > nvme_failover_req()'s ability to fallback to normal NVMe error handling > has been preserved because it may be useful for future NVME_SC that > nvme_status_needs_local_error_handling() hasn't been trained for yet. > > Signed-off-by: Mike Snitzer <snitzer@redhat.com> > --- > drivers/nvme/host/core.c | 16 ++++++++++++++-- > 1 file changed, 14 insertions(+), 2 deletions(-) > > diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c > index 88cff309d8e4..be749b690af7 100644 > --- a/drivers/nvme/host/core.c > +++ b/drivers/nvme/host/core.c > @@ -252,6 +252,16 @@ static inline bool nvme_req_needs_retry(struct request *req) > return true; > } > > +static inline bool nvme_status_needs_local_error_handling(u16 status) > +{ > + switch (status & 0x7ff) { > + case NVME_SC_CMD_INTERRUPTED: > + return true; > + default: > + return false; > + } > +} > + > static void nvme_retry_req(struct request *req) > { > struct nvme_ns *ns = req->q->queuedata; > @@ -270,7 +280,8 @@ static void nvme_retry_req(struct request *req) > > void nvme_complete_rq(struct request *req) > { > - blk_status_t status = nvme_error_status(nvme_req(req)->status); > + u16 nvme_status = nvme_req(req)->status; > + blk_status_t status = nvme_error_status(nvme_status); > > trace_nvme_complete_rq(req); > > @@ -280,7 +291,8 @@ void nvme_complete_rq(struct request *req) > nvme_req(req)->ctrl->comp_seen = true; > > if (unlikely(status != BLK_STS_OK && nvme_req_needs_retry(req))) { > - if ((req->cmd_flags & REQ_NVME_MPATH) && nvme_failover_req(req)) > + if (!nvme_status_needs_local_error_handling(nvme_status) && > + (req->cmd_flags & REQ_NVME_MPATH) && nvme_failover_req(req)) > return; > > if (!blk_queue_dying(req->q)) { > -- > 2.18.0 > > _______________________________________________ Linux-nvme mailing list Linux-nvme@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [RESEND PATCH] nvme: explicitly use normal NVMe error handling when appropriate 2020-08-13 15:43 ` Mike Snitzer @ 2020-08-13 15:59 ` Meneghini, John 0 siblings, 0 replies; 50+ messages in thread From: Meneghini, John @ 2020-08-13 15:59 UTC (permalink / raw) To: Mike Snitzer Cc: Sagi Grimberg, Ewan Milne, Christoph Hellwig, dm-devel, linux-nvme, Chao Leng, Keith Busch, Meneghini, John, Hannes Reinecke That works Mike. Unfortunately, I have to work with the corporate MS office email handler (which sucks) and downloading the .patch file works much better for me. Thanks, /John On 8/13/20, 11:43 AM, "Mike Snitzer" <snitzer@redhat.com> wrote: Maybe because I didn't cc linux-block? Only way that I know to "upload this patch there" is to have cc'd linux-block. But the patch is in dm-devel's patchwork here: https://patchwork.kernel.org/patch/11712563/ Is that sufficient for your needs? Thanks, Mike On Thu, Aug 13 2020 at 11:29am -0400, Meneghini, John <John.Meneghini@netapp.com> wrote: > Mike, > > I don't see your patch at: > > https://patchwork.kernel.org/project/linux-block/list/?submitter=1643 > > Can you please upload this patch there? > > Thanks, > > /John > > On 8/13/20, 10:48 AM, "Mike Snitzer" <snitzer@redhat.com> wrote: > > Commit 764e9332098c0 ("nvme-multipath: do not reset on unknown > status"), among other things, fixed NVME_SC_CMD_INTERRUPTED error > handling by changing multipathing's nvme_failover_req() to short-circuit > path failover and then fallback to NVMe's normal error handling (which > takes care of NVME_SC_CMD_INTERRUPTED). > > This detour through native NVMe multipathing code is unwelcome because > it prevents NVMe core from handling NVME_SC_CMD_INTERRUPTED independent > of any multipathing concerns. > > Introduce nvme_status_needs_local_error_handling() to prioritize > non-failover retry, when appropriate, in terms of normal NVMe error > handling. 
nvme_status_needs_local_error_handling() will naturely evolve > to include handling of any other errors that normal error handling must > be used for. > > nvme_failover_req()'s ability to fallback to normal NVMe error handling > has been preserved because it may be useful for future NVME_SC that > nvme_status_needs_local_error_handling() hasn't been trained for yet. > > Signed-off-by: Mike Snitzer <snitzer@redhat.com> > --- > drivers/nvme/host/core.c | 16 ++++++++++++++-- > 1 file changed, 14 insertions(+), 2 deletions(-) > > diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c > index 88cff309d8e4..be749b690af7 100644 > --- a/drivers/nvme/host/core.c > +++ b/drivers/nvme/host/core.c > @@ -252,6 +252,16 @@ static inline bool nvme_req_needs_retry(struct request *req) > return true; > } > > +static inline bool nvme_status_needs_local_error_handling(u16 status) > +{ > + switch (status & 0x7ff) { > + case NVME_SC_CMD_INTERRUPTED: > + return true; > + default: > + return false; > + } > +} > + > static void nvme_retry_req(struct request *req) > { > struct nvme_ns *ns = req->q->queuedata; > @@ -270,7 +280,8 @@ static void nvme_retry_req(struct request *req) > > void nvme_complete_rq(struct request *req) > { > - blk_status_t status = nvme_error_status(nvme_req(req)->status); > + u16 nvme_status = nvme_req(req)->status; > + blk_status_t status = nvme_error_status(nvme_status); > > trace_nvme_complete_rq(req); > > @@ -280,7 +291,8 @@ void nvme_complete_rq(struct request *req) > nvme_req(req)->ctrl->comp_seen = true; > > if (unlikely(status != BLK_STS_OK && nvme_req_needs_retry(req))) { > - if ((req->cmd_flags & REQ_NVME_MPATH) && nvme_failover_req(req)) > + if (!nvme_status_needs_local_error_handling(nvme_status) && > + (req->cmd_flags & REQ_NVME_MPATH) && nvme_failover_req(req)) > return; > > if (!blk_queue_dying(req->q)) { > -- > 2.18.0 > > _______________________________________________ Linux-nvme mailing list Linux-nvme@lists.infradead.org 
http://lists.infradead.org/mailman/listinfo/linux-nvme ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [RESEND PATCH] nvme: explicitly use normal NVMe error handling when appropriate 2020-08-13 14:48 ` [RESEND PATCH] " Mike Snitzer 2020-08-13 15:29 ` Meneghini, John @ 2020-08-13 15:36 ` Christoph Hellwig 2020-08-13 17:47 ` Mike Snitzer 2020-08-14 3:23 ` Meneghini, John 2 siblings, 1 reply; 50+ messages in thread From: Christoph Hellwig @ 2020-08-13 15:36 UTC (permalink / raw) To: Mike Snitzer Cc: Sagi Grimberg, Ewan Milne, Christoph Hellwig, dm-devel, linux-nvme, Chao Leng, Keith Busch, Meneghini, John, Hannes Reinecke On Thu, Aug 13, 2020 at 10:48:11AM -0400, Mike Snitzer wrote: > Commit 764e9332098c0 ("nvme-multipath: do not reset on unknown > status"), among other things, fixed NVME_SC_CMD_INTERRUPTED error > handling by changing multipathing's nvme_failover_req() to short-circuit > path failover and then fallback to NVMe's normal error handling (which > takes care of NVME_SC_CMD_INTERRUPTED). > > This detour through native NVMe multipathing code is unwelcome because > it prevents NVMe core from handling NVME_SC_CMD_INTERRUPTED independent > of any multipathing concerns. > > Introduce nvme_status_needs_local_error_handling() to prioritize > non-failover retry, when appropriate, in terms of normal NVMe error > handling. nvme_status_needs_local_error_handling() will naturely evolve > to include handling of any other errors that normal error handling must > be used for. > > nvme_failover_req()'s ability to fallback to normal NVMe error handling > has been preserved because it may be useful for future NVME_SC that > nvme_status_needs_local_error_handling() hasn't been trained for yet. > > Signed-off-by: Mike Snitzer <snitzer@redhat.com> I don't see how this would change anything. nvme_failover_req simply retuns false for NVME_SC_CMD_INTERRUPTED, so your change is a no-op. 
_______________________________________________ Linux-nvme mailing list Linux-nvme@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [RESEND PATCH] nvme: explicitly use normal NVMe error handling when appropriate 2020-08-13 15:36 ` Christoph Hellwig @ 2020-08-13 17:47 ` Mike Snitzer 2020-08-13 18:43 ` Christoph Hellwig 0 siblings, 1 reply; 50+ messages in thread From: Mike Snitzer @ 2020-08-13 17:47 UTC (permalink / raw) To: Christoph Hellwig Cc: Sagi Grimberg, Ewan Milne, dm-devel, linux-nvme, Chao Leng, Keith Busch, Meneghini, John, Hannes Reinecke On Thu, Aug 13 2020 at 11:36am -0400, Christoph Hellwig <hch@infradead.org> wrote: > On Thu, Aug 13, 2020 at 10:48:11AM -0400, Mike Snitzer wrote: > > Commit 764e9332098c0 ("nvme-multipath: do not reset on unknown > > status"), among other things, fixed NVME_SC_CMD_INTERRUPTED error > > handling by changing multipathing's nvme_failover_req() to short-circuit > > path failover and then fallback to NVMe's normal error handling (which > > takes care of NVME_SC_CMD_INTERRUPTED). > > > > This detour through native NVMe multipathing code is unwelcome because > > it prevents NVMe core from handling NVME_SC_CMD_INTERRUPTED independent > > of any multipathing concerns. > > > > Introduce nvme_status_needs_local_error_handling() to prioritize > > non-failover retry, when appropriate, in terms of normal NVMe error > > handling. nvme_status_needs_local_error_handling() will naturely evolve > > to include handling of any other errors that normal error handling must > > be used for. > > > > nvme_failover_req()'s ability to fallback to normal NVMe error handling > > has been preserved because it may be useful for future NVME_SC that > > nvme_status_needs_local_error_handling() hasn't been trained for yet. > > > > Signed-off-by: Mike Snitzer <snitzer@redhat.com> > > I don't see how this would change anything. nvme_failover_req simply > retuns false for NVME_SC_CMD_INTERRUPTED, so your change is a no-op. This is just a tweak to improve the high-level fault tree of core NVMe error handling. 
No functional change, but for such basic errors, avoiding entering nvme_failover_req is meaningful on a code flow level. Makes code to handle errors that need local retry clearer by being more structured, less circuitous. Allows NVMe core's handling of such errors to be more explicit and live in core.c rather than multipath.c -- so things like ACRE handling can be made explicitly part of core and not nested under nvme_failover_req's relatively obscure failsafe that returns false for anything it doesn't care about. Mike _______________________________________________ Linux-nvme mailing list Linux-nvme@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [RESEND PATCH] nvme: explicitly use normal NVMe error handling when appropriate 2020-08-13 17:47 ` Mike Snitzer @ 2020-08-13 18:43 ` Christoph Hellwig 2020-08-13 19:03 ` Mike Snitzer ` (2 more replies) 0 siblings, 3 replies; 50+ messages in thread From: Christoph Hellwig @ 2020-08-13 18:43 UTC (permalink / raw) To: Mike Snitzer Cc: Sagi Grimberg, Hannes Reinecke, linux-nvme, Christoph Hellwig, dm-devel, Ewan Milne, Chao Leng, Keith Busch, Meneghini, John On Thu, Aug 13, 2020 at 01:47:04PM -0400, Mike Snitzer wrote: > This is just a tweak to improve the high-level fault tree of core NVMe > error handling. No functional change, but for such basic errors, > avoiding entering nvme_failover_req is meaningful on a code flow level. > Makes code to handle errors that need local retry clearer by being more > structured, less circuitous. > > Allows NVMe core's handling of such errors to be more explicit and live > in core.c rather than multipath.c -- so things like ACRE handling can be > made explicitly part of core and not nested under nvme_failover_req's > relatively obscure failsafe that returns false for anything it doesn't > care about. 
If we're going that way I'd rather do something like the (untested) patch below that adds a dispostion function with a function that decides it and then just switches on it: diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c index 88cff309d8e4f0..a740320f0d4ee7 100644 --- a/drivers/nvme/host/core.c +++ b/drivers/nvme/host/core.c @@ -241,17 +241,6 @@ static blk_status_t nvme_error_status(u16 status) } } -static inline bool nvme_req_needs_retry(struct request *req) -{ - if (blk_noretry_request(req)) - return false; - if (nvme_req(req)->status & NVME_SC_DNR) - return false; - if (nvme_req(req)->retries >= nvme_max_retries) - return false; - return true; -} - static void nvme_retry_req(struct request *req) { struct nvme_ns *ns = req->q->queuedata; @@ -268,33 +257,75 @@ static void nvme_retry_req(struct request *req) blk_mq_delay_kick_requeue_list(req->q, delay); } -void nvme_complete_rq(struct request *req) +enum nvme_disposition { + COMPLETE, + RETRY, + REDIRECT_ANA, + REDIRECT_TMP, +}; + +static inline enum nvme_disposition nvme_req_disposition(struct request *req) +{ + if (likely(nvme_req(req)->status == 0)) + return COMPLETE; + + if (blk_noretry_request(req) || + (nvme_req(req)->status & NVME_SC_DNR) || + nvme_req(req)->retries >= nvme_max_retries) + return COMPLETE; + + if (req->cmd_flags & REQ_NVME_MPATH) { + switch (nvme_req(req)->status & 0x7ff) { + case NVME_SC_ANA_TRANSITION: + case NVME_SC_ANA_INACCESSIBLE: + case NVME_SC_ANA_PERSISTENT_LOSS: + return REDIRECT_ANA; + case NVME_SC_HOST_PATH_ERROR: + case NVME_SC_HOST_ABORTED_CMD: + return REDIRECT_TMP; + } + } + + if (blk_queue_dying(req->q)) + return COMPLETE; + return RETRY; +} + +static inline void nvme_complete_req(struct request *req) { blk_status_t status = nvme_error_status(nvme_req(req)->status); - trace_nvme_complete_rq(req); + if (IS_ENABLED(CONFIG_BLK_DEV_ZONED) && + req_op(req) == REQ_OP_ZONE_APPEND) + req->__sector = nvme_lba_to_sect(req->q->queuedata, + 
le64_to_cpu(nvme_req(req)->result.u64)); + + nvme_trace_bio_complete(req, status); + blk_mq_end_request(req, status); +} +void nvme_complete_rq(struct request *req) +{ + trace_nvme_complete_rq(req); nvme_cleanup_cmd(req); if (nvme_req(req)->ctrl->kas) nvme_req(req)->ctrl->comp_seen = true; - if (unlikely(status != BLK_STS_OK && nvme_req_needs_retry(req))) { - if ((req->cmd_flags & REQ_NVME_MPATH) && nvme_failover_req(req)) - return; - - if (!blk_queue_dying(req->q)) { - nvme_retry_req(req); - return; - } - } else if (IS_ENABLED(CONFIG_BLK_DEV_ZONED) && - req_op(req) == REQ_OP_ZONE_APPEND) { - req->__sector = nvme_lba_to_sect(req->q->queuedata, - le64_to_cpu(nvme_req(req)->result.u64)); + switch (nvme_req_disposition(req)) { + case COMPLETE: + nvme_complete_req(req); + return; + case RETRY: + nvme_retry_req(req); + return; + case REDIRECT_ANA: + nvme_failover_req(req, true); + return; + case REDIRECT_TMP: + nvme_failover_req(req, false); + return; } - - nvme_trace_bio_complete(req, status); - blk_mq_end_request(req, status); } EXPORT_SYMBOL_GPL(nvme_complete_rq); diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c index 3ded54d2c9c6ad..0c22b2c88687a2 100644 --- a/drivers/nvme/host/multipath.c +++ b/drivers/nvme/host/multipath.c @@ -65,51 +65,32 @@ void nvme_set_disk_name(char *disk_name, struct nvme_ns *ns, } } -bool nvme_failover_req(struct request *req) +void nvme_failover_req(struct request *req, bool is_ana_status) { struct nvme_ns *ns = req->q->queuedata; - u16 status = nvme_req(req)->status; unsigned long flags; - switch (status & 0x7ff) { - case NVME_SC_ANA_TRANSITION: - case NVME_SC_ANA_INACCESSIBLE: - case NVME_SC_ANA_PERSISTENT_LOSS: - /* - * If we got back an ANA error we know the controller is alive, - * but not ready to serve this namespaces. The spec suggests - * we should update our general state here, but due to the fact - * that the admin and I/O queues are not serialized that is - * fundamentally racy. 
So instead just clear the current path, - * mark the the path as pending and kick of a re-read of the ANA - * log page ASAP. - */ - nvme_mpath_clear_current_path(ns); - if (ns->ctrl->ana_log_buf) { - set_bit(NVME_NS_ANA_PENDING, &ns->flags); - queue_work(nvme_wq, &ns->ctrl->ana_work); - } - break; - case NVME_SC_HOST_PATH_ERROR: - case NVME_SC_HOST_ABORTED_CMD: - /* - * Temporary transport disruption in talking to the controller. - * Try to send on a new path. - */ - nvme_mpath_clear_current_path(ns); - break; - default: - /* This was a non-ANA error so follow the normal error path. */ - return false; + nvme_mpath_clear_current_path(ns); + + /* + * If we got back an ANA error we know the controller is alive, but not + * ready to serve this namespaces. The spec suggests we should update + * our general state here, but due to the fact that the admin and I/O + * queues are not serialized that is fundamentally racy. So instead + * just clear the current path, mark the the path as pending and kick + * of a re-read of the ANA log page ASAP. 
+ */ + if (is_ana_status && ns->ctrl->ana_log_buf) { + set_bit(NVME_NS_ANA_PENDING, &ns->flags); + queue_work(nvme_wq, &ns->ctrl->ana_work); } spin_lock_irqsave(&ns->head->requeue_lock, flags); blk_steal_bios(&ns->head->requeue_list, req); spin_unlock_irqrestore(&ns->head->requeue_lock, flags); - blk_mq_end_request(req, 0); + blk_mq_end_request(req, 0); kblockd_schedule_work(&ns->head->requeue_work); - return true; } void nvme_kick_requeue_lists(struct nvme_ctrl *ctrl) diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h index ebb8c3ed388554..aeff1c491ac2ef 100644 --- a/drivers/nvme/host/nvme.h +++ b/drivers/nvme/host/nvme.h @@ -629,7 +629,7 @@ void nvme_mpath_wait_freeze(struct nvme_subsystem *subsys); void nvme_mpath_start_freeze(struct nvme_subsystem *subsys); void nvme_set_disk_name(char *disk_name, struct nvme_ns *ns, struct nvme_ctrl *ctrl, int *flags); -bool nvme_failover_req(struct request *req); +void nvme_failover_req(struct request *req, bool is_ana_status); void nvme_kick_requeue_lists(struct nvme_ctrl *ctrl); int nvme_mpath_alloc_disk(struct nvme_ctrl *ctrl,struct nvme_ns_head *head); void nvme_mpath_add_disk(struct nvme_ns *ns, struct nvme_id_ns *id); @@ -688,9 +688,8 @@ static inline void nvme_set_disk_name(char *disk_name, struct nvme_ns *ns, sprintf(disk_name, "nvme%dn%d", ctrl->instance, ns->head->instance); } -static inline bool nvme_failover_req(struct request *req) +static inline void nvme_failover_req(struct request *req, bool is_ana_status) { - return false; } static inline void nvme_kick_requeue_lists(struct nvme_ctrl *ctrl) { _______________________________________________ Linux-nvme mailing list Linux-nvme@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme ^ permalink raw reply related [flat|nested] 50+ messages in thread
* Re: [RESEND PATCH] nvme: explicitly use normal NVMe error handling when appropriate 2020-08-13 18:43 ` Christoph Hellwig @ 2020-08-13 19:03 ` Mike Snitzer 2020-08-14 4:26 ` Meneghini, John 2020-08-14 6:53 ` Sagi Grimberg 2 siblings, 0 replies; 50+ messages in thread From: Mike Snitzer @ 2020-08-13 19:03 UTC (permalink / raw) To: Christoph Hellwig Cc: Sagi Grimberg, Hannes Reinecke, linux-nvme, dm-devel, Ewan Milne, Chao Leng, Keith Busch, Meneghini, John On Thu, Aug 13 2020 at 2:43pm -0400, Christoph Hellwig <hch@infradead.org> wrote: > On Thu, Aug 13, 2020 at 01:47:04PM -0400, Mike Snitzer wrote: > > This is just a tweak to improve the high-level fault tree of core NVMe > > error handling. No functional change, but for such basic errors, > > avoiding entering nvme_failover_req is meaningful on a code flow level. > > Makes code to handle errors that need local retry clearer by being more > > structured, less circuitous. > > > > Allows NVMe core's handling of such errors to be more explicit and live > > in core.c rather than multipath.c -- so things like ACRE handling can be > > made explicitly part of core and not nested under nvme_failover_req's > > relatively obscure failsafe that returns false for anything it doesn't > > care about. > > If we're going that way I'd rather do something like the (untested) > patch below that adds a dispostion function with a function that > decides it and then just switches on it: YES! That is such a huge improvement (certainly on a code clarity level). I haven't reviewed or tested the relative performance or function of before vs after (will do) but I really like this approach. Thanks, Mike _______________________________________________ Linux-nvme mailing list Linux-nvme@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [RESEND PATCH] nvme: explicitly use normal NVMe error handling when appropriate 2020-08-13 18:43 ` Christoph Hellwig 2020-08-13 19:03 ` Mike Snitzer @ 2020-08-14 4:26 ` Meneghini, John 2020-08-14 6:53 ` Sagi Grimberg 2 siblings, 0 replies; 50+ messages in thread From: Meneghini, John @ 2020-08-14 4:26 UTC (permalink / raw) To: Christoph Hellwig, Mike Snitzer Cc: Sagi Grimberg, linux-nvme, dm-devel, Ewan Milne, Chao Leng, Keith Busch, Meneghini, John, Hannes Reinecke On 8/13/20, 2:44 PM, "Christoph Hellwig" <hch@infradead.org> wrote: On Thu, Aug 13, 2020 at 01:47:04PM -0400, Mike Snitzer wrote: > This is just a tweak to improve the high-level fault tree of core NVMe > error handling. No functional change, but for such basic errors, > avoiding entering nvme_failover_req is meaningful on a code flow level. > Makes code to handle errors that need local retry clearer by being more > structured, less circuitous. I don't understand how entering nvme_failover_req() is circuitous. This code path is only taken if REQ_NVME_MPATH is set which - unless I am mistaken - in the case that you care about it will not be set. > Allows NVMe core's handling of such errors to be more explicit and live > in core.c rather than multipath.c -- so things like ACRE handling can be > made explicitly part of core and not nested under nvme_failover_req's > relatively obscure failsafe that returns false for anything it doesn't > care about. The ACRE handling is already explicitly a part of the core. I don't understand what you are after here Mike. Are you saying that you don't want the ACRE code to run when REQ_NVME_MPATH is clear? If we're going that way I'd rather do something like the (untested) patch below that adds a dispostion function with a function that decides it and then just switches on it: Christoph, it looks like you've moved a lot of stuff around here, with no actual functional change.... but it's really hard for me to tell. 
Please be sure to cc me if this becomes a real patch. How does your patch solve the problem of making dm-multipath work with command retries? Mike, do you want the nvme-core driver to retry commands on the same path, with CRD, for the dm-multipath use case... or are you looking for a different treatment of REQ_FAILFAST_DEV... or what? Maybe I'm not seeing it. /John diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c index 88cff309d8e4f0..a740320f0d4ee7 100644 --- a/drivers/nvme/host/core.c +++ b/drivers/nvme/host/core.c @@ -241,17 +241,6 @@ static blk_status_t nvme_error_status(u16 status) } } -static inline bool nvme_req_needs_retry(struct request *req) -{ - if (blk_noretry_request(req)) - return false; - if (nvme_req(req)->status & NVME_SC_DNR) - return false; - if (nvme_req(req)->retries >= nvme_max_retries) - return false; - return true; -} - static void nvme_retry_req(struct request *req) { struct nvme_ns *ns = req->q->queuedata; @@ -268,33 +257,75 @@ static void nvme_retry_req(struct request *req) blk_mq_delay_kick_requeue_list(req->q, delay); } -void nvme_complete_rq(struct request *req) +enum nvme_disposition { + COMPLETE, + RETRY, + REDIRECT_ANA, + REDIRECT_TMP, +}; + +static inline enum nvme_disposition nvme_req_disposition(struct request *req) +{ + if (likely(nvme_req(req)->status == 0)) + return COMPLETE; + + if (blk_noretry_request(req) || + (nvme_req(req)->status & NVME_SC_DNR) || + nvme_req(req)->retries >= nvme_max_retries) + return COMPLETE; + + if (req->cmd_flags & REQ_NVME_MPATH) { + switch (nvme_req(req)->status & 0x7ff) { + case NVME_SC_ANA_TRANSITION: + case NVME_SC_ANA_INACCESSIBLE: + case NVME_SC_ANA_PERSISTENT_LOSS: + return REDIRECT_ANA; + case NVME_SC_HOST_PATH_ERROR: + case NVME_SC_HOST_ABORTED_CMD: + return REDIRECT_TMP; + } + } + + if (blk_queue_dying(req->q)) + return COMPLETE; + return RETRY; +} + +static inline void nvme_complete_req(struct request *req) { blk_status_t status = nvme_error_status(nvme_req(req)->status); 
- trace_nvme_complete_rq(req); + if (IS_ENABLED(CONFIG_BLK_DEV_ZONED) && + req_op(req) == REQ_OP_ZONE_APPEND) + req->__sector = nvme_lba_to_sect(req->q->queuedata, + le64_to_cpu(nvme_req(req)->result.u64)); + + nvme_trace_bio_complete(req, status); + blk_mq_end_request(req, status); +} +void nvme_complete_rq(struct request *req) +{ + trace_nvme_complete_rq(req); nvme_cleanup_cmd(req); if (nvme_req(req)->ctrl->kas) nvme_req(req)->ctrl->comp_seen = true; - if (unlikely(status != BLK_STS_OK && nvme_req_needs_retry(req))) { - if ((req->cmd_flags & REQ_NVME_MPATH) && nvme_failover_req(req)) - return; - - if (!blk_queue_dying(req->q)) { - nvme_retry_req(req); - return; - } - } else if (IS_ENABLED(CONFIG_BLK_DEV_ZONED) && - req_op(req) == REQ_OP_ZONE_APPEND) { - req->__sector = nvme_lba_to_sect(req->q->queuedata, - le64_to_cpu(nvme_req(req)->result.u64)); + switch (nvme_req_disposition(req)) { + case COMPLETE: + nvme_complete_req(req); + return; + case RETRY: + nvme_retry_req(req); + return; + case REDIRECT_ANA: + nvme_failover_req(req, true); + return; + case REDIRECT_TMP: + nvme_failover_req(req, false); + return; } - - nvme_trace_bio_complete(req, status); - blk_mq_end_request(req, status); } EXPORT_SYMBOL_GPL(nvme_complete_rq); diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c index 3ded54d2c9c6ad..0c22b2c88687a2 100644 --- a/drivers/nvme/host/multipath.c +++ b/drivers/nvme/host/multipath.c @@ -65,51 +65,32 @@ void nvme_set_disk_name(char *disk_name, struct nvme_ns *ns, } } -bool nvme_failover_req(struct request *req) +void nvme_failover_req(struct request *req, bool is_ana_status) { struct nvme_ns *ns = req->q->queuedata; - u16 status = nvme_req(req)->status; unsigned long flags; - switch (status & 0x7ff) { - case NVME_SC_ANA_TRANSITION: - case NVME_SC_ANA_INACCESSIBLE: - case NVME_SC_ANA_PERSISTENT_LOSS: - /* - * If we got back an ANA error we know the controller is alive, - * but not ready to serve this namespaces. 
The spec suggests - * we should update our general state here, but due to the fact - * that the admin and I/O queues are not serialized that is - * fundamentally racy. So instead just clear the current path, - * mark the the path as pending and kick of a re-read of the ANA - * log page ASAP. - */ - nvme_mpath_clear_current_path(ns); - if (ns->ctrl->ana_log_buf) { - set_bit(NVME_NS_ANA_PENDING, &ns->flags); - queue_work(nvme_wq, &ns->ctrl->ana_work); - } - break; - case NVME_SC_HOST_PATH_ERROR: - case NVME_SC_HOST_ABORTED_CMD: - /* - * Temporary transport disruption in talking to the controller. - * Try to send on a new path. - */ - nvme_mpath_clear_current_path(ns); - break; - default: - /* This was a non-ANA error so follow the normal error path. */ - return false; + nvme_mpath_clear_current_path(ns); + + /* + * If we got back an ANA error we know the controller is alive, but not + * ready to serve this namespaces. The spec suggests we should update + * our general state here, but due to the fact that the admin and I/O + * queues are not serialized that is fundamentally racy. So instead + * just clear the current path, mark the the path as pending and kick + * of a re-read of the ANA log page ASAP. 
+ */ + if (is_ana_status && ns->ctrl->ana_log_buf) { + set_bit(NVME_NS_ANA_PENDING, &ns->flags); + queue_work(nvme_wq, &ns->ctrl->ana_work); } spin_lock_irqsave(&ns->head->requeue_lock, flags); blk_steal_bios(&ns->head->requeue_list, req); spin_unlock_irqrestore(&ns->head->requeue_lock, flags); - blk_mq_end_request(req, 0); + blk_mq_end_request(req, 0); kblockd_schedule_work(&ns->head->requeue_work); - return true; } void nvme_kick_requeue_lists(struct nvme_ctrl *ctrl) diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h index ebb8c3ed388554..aeff1c491ac2ef 100644 --- a/drivers/nvme/host/nvme.h +++ b/drivers/nvme/host/nvme.h @@ -629,7 +629,7 @@ void nvme_mpath_wait_freeze(struct nvme_subsystem *subsys); void nvme_mpath_start_freeze(struct nvme_subsystem *subsys); void nvme_set_disk_name(char *disk_name, struct nvme_ns *ns, struct nvme_ctrl *ctrl, int *flags); -bool nvme_failover_req(struct request *req); +void nvme_failover_req(struct request *req, bool is_ana_status); void nvme_kick_requeue_lists(struct nvme_ctrl *ctrl); int nvme_mpath_alloc_disk(struct nvme_ctrl *ctrl,struct nvme_ns_head *head); void nvme_mpath_add_disk(struct nvme_ns *ns, struct nvme_id_ns *id); @@ -688,9 +688,8 @@ static inline void nvme_set_disk_name(char *disk_name, struct nvme_ns *ns, sprintf(disk_name, "nvme%dn%d", ctrl->instance, ns->head->instance); } -static inline bool nvme_failover_req(struct request *req) +static inline void nvme_failover_req(struct request *req, bool is_ana_status) { - return false; } static inline void nvme_kick_requeue_lists(struct nvme_ctrl *ctrl) { _______________________________________________ Linux-nvme mailing list Linux-nvme@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [RESEND PATCH] nvme: explicitly use normal NVMe error handling when appropriate 2020-08-13 18:43 ` Christoph Hellwig 2020-08-13 19:03 ` Mike Snitzer 2020-08-14 4:26 ` Meneghini, John @ 2020-08-14 6:53 ` Sagi Grimberg 2020-08-14 6:55 ` Christoph Hellwig 2 siblings, 1 reply; 50+ messages in thread From: Sagi Grimberg @ 2020-08-14 6:53 UTC (permalink / raw) To: Christoph Hellwig, Mike Snitzer Cc: Hannes Reinecke, linux-nvme, dm-devel, Ewan Milne, Chao Leng, Keith Busch, Meneghini, John > +static inline enum nvme_disposition nvme_req_disposition(struct request *req) > +{ > + if (likely(nvme_req(req)->status == 0)) > + return COMPLETE; > + > + if (blk_noretry_request(req) || > + (nvme_req(req)->status & NVME_SC_DNR) || > + nvme_req(req)->retries >= nvme_max_retries) > + return COMPLETE; > + > + if (req->cmd_flags & REQ_NVME_MPATH) { > + switch (nvme_req(req)->status & 0x7ff) { > + case NVME_SC_ANA_TRANSITION: > + case NVME_SC_ANA_INACCESSIBLE: > + case NVME_SC_ANA_PERSISTENT_LOSS: > + return REDIRECT_ANA; > + case NVME_SC_HOST_PATH_ERROR: > + case NVME_SC_HOST_ABORTED_CMD: > + return REDIRECT_TMP; > + } > + } > + > + if (blk_queue_dying(req->q)) > + return COMPLETE; > + return RETRY; > +} > + > +static inline void nvme_complete_req(struct request *req) > { > blk_status_t status = nvme_error_status(nvme_req(req)->status); > > - trace_nvme_complete_rq(req); > + if (IS_ENABLED(CONFIG_BLK_DEV_ZONED) && > + req_op(req) == REQ_OP_ZONE_APPEND) > + req->__sector = nvme_lba_to_sect(req->q->queuedata, > + le64_to_cpu(nvme_req(req)->result.u64)); > + > + nvme_trace_bio_complete(req, status); > + blk_mq_end_request(req, status); > +} > > +void nvme_complete_rq(struct request *req) > +{ > + trace_nvme_complete_rq(req); > nvme_cleanup_cmd(req); > > if (nvme_req(req)->ctrl->kas) > nvme_req(req)->ctrl->comp_seen = true; > > - if (unlikely(status != BLK_STS_OK && nvme_req_needs_retry(req))) { > - if ((req->cmd_flags & REQ_NVME_MPATH) && nvme_failover_req(req)) > - return; > - > - 
if (!blk_queue_dying(req->q)) { > - nvme_retry_req(req); > - return; > - } > - } else if (IS_ENABLED(CONFIG_BLK_DEV_ZONED) && > - req_op(req) == REQ_OP_ZONE_APPEND) { > - req->__sector = nvme_lba_to_sect(req->q->queuedata, > - le64_to_cpu(nvme_req(req)->result.u64)); > + switch (nvme_req_disposition(req)) { > + case COMPLETE: > + nvme_complete_req(req); nvme_complete_rq calling nvme_complete_req... Maybe call it __nvme_complete_rq instead? > + return; > + case RETRY: > + nvme_retry_req(req); > + return; > + case REDIRECT_ANA: > + nvme_failover_req(req, true); > + return; > + case REDIRECT_TMP: > + nvme_failover_req(req, false); > + return; > } > - > - nvme_trace_bio_complete(req, status); > - blk_mq_end_request(req, status); > } > EXPORT_SYMBOL_GPL(nvme_complete_rq); > > diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c > index 3ded54d2c9c6ad..0c22b2c88687a2 100644 > --- a/drivers/nvme/host/multipath.c > +++ b/drivers/nvme/host/multipath.c > @@ -65,51 +65,32 @@ void nvme_set_disk_name(char *disk_name, struct nvme_ns *ns, > } > } > > -bool nvme_failover_req(struct request *req) > +void nvme_failover_req(struct request *req, bool is_ana_status) > { > struct nvme_ns *ns = req->q->queuedata; > - u16 status = nvme_req(req)->status; > unsigned long flags; > > - switch (status & 0x7ff) { > - case NVME_SC_ANA_TRANSITION: > - case NVME_SC_ANA_INACCESSIBLE: > - case NVME_SC_ANA_PERSISTENT_LOSS: > - /* > - * If we got back an ANA error we know the controller is alive, > - * but not ready to serve this namespaces. The spec suggests > - * we should update our general state here, but due to the fact > - * that the admin and I/O queues are not serialized that is > - * fundamentally racy. So instead just clear the current path, > - * mark the the path as pending and kick of a re-read of the ANA > - * log page ASAP. 
> - */ > - nvme_mpath_clear_current_path(ns); > - if (ns->ctrl->ana_log_buf) { > - set_bit(NVME_NS_ANA_PENDING, &ns->flags); > - queue_work(nvme_wq, &ns->ctrl->ana_work); > - } > - break; > - case NVME_SC_HOST_PATH_ERROR: > - case NVME_SC_HOST_ABORTED_CMD: > - /* > - * Temporary transport disruption in talking to the controller. > - * Try to send on a new path. > - */ > - nvme_mpath_clear_current_path(ns); > - break; > - default: > - /* This was a non-ANA error so follow the normal error path. */ > - return false; > + nvme_mpath_clear_current_path(ns); > + > + /* > + * If we got back an ANA error we know the controller is alive, but not > + * ready to serve this namespaces. The spec suggests we should update > + * our general state here, but due to the fact that the admin and I/O > + * queues are not serialized that is fundamentally racy. So instead > + * just clear the current path, mark the the path as pending and kick > + * of a re-read of the ANA log page ASAP. > + */ > + if (is_ana_status && ns->ctrl->ana_log_buf) { Maybe call nvme_req_disposition again locally here to not carry the is_ana_status. But not a biggy.. Overall this looks good I think. _______________________________________________ Linux-nvme mailing list Linux-nvme@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [RESEND PATCH] nvme: explicitly use normal NVMe error handling when appropriate 2020-08-14 6:53 ` Sagi Grimberg @ 2020-08-14 6:55 ` Christoph Hellwig 2020-08-14 7:02 ` Sagi Grimberg 0 siblings, 1 reply; 50+ messages in thread From: Christoph Hellwig @ 2020-08-14 6:55 UTC (permalink / raw) To: Sagi Grimberg Cc: Mike Snitzer, Hannes Reinecke, linux-nvme, Christoph Hellwig, dm-devel, Ewan Milne, Chao Leng, Keith Busch, Meneghini, John On Thu, Aug 13, 2020 at 11:53:14PM -0700, Sagi Grimberg wrote: > > + switch (nvme_req_disposition(req)) { > > + case COMPLETE: > > + nvme_complete_req(req); > > nvme_complete_rq calling nvme_complete_req... Maybe call it > __nvme_complete_rq instead? That's what I had first, but it felt so strangely out of place next to the other nvme_*_req calls.. Maybe nvme_complete_rq needs to be renamed - what about nvme_req_done? > Maybe call nvme_req_disposition again locally here to not carry > the is_ana_status. But not a biggy.. That means it would have to become non-static in scope, limiting inlining possibilities, etc. _______________________________________________ Linux-nvme mailing list Linux-nvme@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [RESEND PATCH] nvme: explicitly use normal NVMe error handling when appropriate 2020-08-14 6:55 ` Christoph Hellwig @ 2020-08-14 7:02 ` Sagi Grimberg 0 siblings, 0 replies; 50+ messages in thread From: Sagi Grimberg @ 2020-08-14 7:02 UTC (permalink / raw) To: Christoph Hellwig Cc: Mike Snitzer, linux-nvme, dm-devel, Hannes Reinecke, Chao Leng, Keith Busch, Meneghini, John, Ewan Milne >>> + switch (nvme_req_disposition(req)) { >>> + case COMPLETE: >>> + nvme_complete_req(req); >> >> nvme_complete_rq calling nvme_complete_req... Maybe call it >> __nvme_complete_rq instead? > > That's what I had first, but it felt so strangely out of place next > to the other nvme_*_req calls.. > > Maybe nvme_complete_rq needs to be renamed - what about nvme_req_done? I'd suggest to call nvme_complete_rq nvme_end_rq because it really calls blk_mq_end_request. That would confuse with nvme_end_request which calls blk_mq_complete_request... Maybe we need some naming improvements throughout. >> Maybe call nvme_req_disposition again locally here to not carry >> the is_ana_status. But not a biggy.. > > That means it would have to become non-static in scope, limiting > inlining possibilities, etc. I see. _______________________________________________ Linux-nvme mailing list Linux-nvme@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [RESEND PATCH] nvme: explicitly use normal NVMe error handling when appropriate 2020-08-13 14:48 ` [RESEND PATCH] " Mike Snitzer 2020-08-13 15:29 ` Meneghini, John 2020-08-13 15:36 ` Christoph Hellwig @ 2020-08-14 3:23 ` Meneghini, John 2 siblings, 0 replies; 50+ messages in thread From: Meneghini, John @ 2020-08-14 3:23 UTC (permalink / raw) To: Mike Snitzer, Sagi Grimberg Cc: linux-nvme, Christoph Hellwig, dm-devel, Hannes Reinecke, Chao Leng, Keith Busch, Meneghini, John, Ewan Milne On 8/13/20, 10:48 AM, "Mike Snitzer" <snitzer@redhat.com> wrote: Commit 764e9332098c0 ("nvme-multipath: do not reset on unknown status"), among other things, fixed NVME_SC_CMD_INTERRUPTED error handling by changing multipathing's nvme_failover_req() to short-circuit path failover and then fall back to NVMe's normal error handling (which takes care of NVME_SC_CMD_INTERRUPTED). This detour through native NVMe multipathing code is unwelcome because it prevents NVMe core from handling NVME_SC_CMD_INTERRUPTED independent of any multipathing concerns. Introduce nvme_status_needs_local_error_handling() to prioritize non-failover retry, when appropriate, in terms of normal NVMe error handling. nvme_status_needs_local_error_handling() will naturally evolve to include handling of any other errors that normal error handling must be used for. How is this any better than blk_path_error()? nvme_failover_req()'s ability to fall back to normal NVMe error handling has been preserved because it may be useful for future NVME_SC that nvme_status_needs_local_error_handling() hasn't been trained for yet. 
Signed-off-by: Mike Snitzer <snitzer@redhat.com> --- drivers/nvme/host/core.c | 16 ++++++++++++++-- 1 file changed, 14 insertions(+), 2 deletions(-) diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c index 88cff309d8e4..be749b690af7 100644 --- a/drivers/nvme/host/core.c +++ b/drivers/nvme/host/core.c @@ -252,6 +252,16 @@ static inline bool nvme_req_needs_retry(struct request *req) return true; } +static inline bool nvme_status_needs_local_error_handling(u16 status) +{ + switch (status & 0x7ff) { + case NVME_SC_CMD_INTERRUPTED: + return true; + default: + return false; + } +} I assume that what you mean by nvme_status_needs_local_error_handling is - do you want the nvme core driver to handle the command retry? If this is true, I don't think this function will ever work correctly because, as discussed, whether or not the command needs to be retried has nothing to do with the NVMe Status Code Field itself, it has to do with the DNR bit, the CRD field, and the Status Code Type field - in that order. + static void nvme_retry_req(struct request *req) { struct nvme_ns *ns = req->q->queuedata; @@ -270,7 +280,8 @@ static void nvme_retry_req(struct request *req) void nvme_complete_rq(struct request *req) { - blk_status_t status = nvme_error_status(nvme_req(req)->status); + u16 nvme_status = nvme_req(req)->status; + blk_status_t status = nvme_error_status(nvme_status); trace_nvme_complete_rq(req); @@ -280,7 +291,8 @@ void nvme_complete_rq(struct request *req) nvme_req(req)->ctrl->comp_seen = true; if (unlikely(status != BLK_STS_OK && nvme_req_needs_retry(req))) { - if ((req->cmd_flags & REQ_NVME_MPATH) && nvme_failover_req(req)) + if (!nvme_status_needs_local_error_handling(nvme_status) && This defeats the nvme-multipath logic by inserting a second evaluation of the NVMe Status Code into the retry logic. This is basically another version of blk_path_error(). 
In fact, in your case REQ_NVME_MPATH is probably not set, so I don't see what difference this would make at all. /John + (req->cmd_flags & REQ_NVME_MPATH) && nvme_failover_req(req)) return; if (!blk_queue_dying(req->q)) { -- 2.18.0 _______________________________________________ Linux-nvme mailing list Linux-nvme@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [PATCH] nvme: restore use of blk_path_error() in nvme_complete_rq() 2020-08-06 19:19 ` [PATCH] nvme: restore use of blk_path_error() in nvme_complete_rq() Mike Snitzer 2020-08-06 22:42 ` Meneghini, John @ 2020-08-07 0:44 ` Sagi Grimberg 2020-08-10 12:43 ` Christoph Hellwig 2 siblings, 0 replies; 50+ messages in thread From: Sagi Grimberg @ 2020-08-07 0:44 UTC (permalink / raw) To: Mike Snitzer, Meneghini, John Cc: linux-nvme, Christoph Hellwig, Hannes Reinecke, Chao Leng, Keith Busch, Ewan Milne Hey Mike, > Regardless, reading back on this thread, I think there is at least some > consensus about reverting commit 35038bffa87 ("nvme: Translate more > status codes to blk_status_t") ? We can probably make NVME_SC_CMD_INTERRUPTED an I/O error, it's the closest there is to something that is somewhat close to its meaning, at least for some interpretation.. Although BLK_STS_TARGET is also semantically correct from another interpretation. > And on a related note, building on the thread I started here (but > haven't heard back from any NVMe maintainers on): > https://www.redhat.com/archives/dm-devel/2020-July/msg00051.html > I'd also be happy as a pig in shit if this patch were applied: I'll reply to the other thread. _______________________________________________ Linux-nvme mailing list Linux-nvme@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [PATCH] nvme: restore use of blk_path_error() in nvme_complete_rq() 2020-08-06 19:19 ` [PATCH] nvme: restore use of blk_path_error() in nvme_complete_rq() Mike Snitzer 2020-08-06 22:42 ` Meneghini, John 2020-08-07 0:44 ` [PATCH] nvme: restore use of blk_path_error() in nvme_complete_rq() Sagi Grimberg @ 2020-08-10 12:43 ` Christoph Hellwig 2020-08-10 15:06 ` Mike Snitzer 2020-08-11 3:45 ` [PATCH] " Chao Leng 2 siblings, 2 replies; 50+ messages in thread From: Christoph Hellwig @ 2020-08-10 12:43 UTC (permalink / raw) To: Mike Snitzer Cc: Hannes Reinecke, linux-nvme, Christoph Hellwig, Ewan Milne, Chao Leng, Keith Busch, Meneghini, John Just returning from my vacation, and I'm really surprised about this discussion. If you want to fix a device mapper / block layer interaction you do not change nvme code to use suboptimal interfaces. One of the reasons for the nvme multipath design was to allow for tighter integration with protocol features, and making nvme do a detour through the block layer for no good reason at all does not help with that at all. And if you want to support the TP that added the new command interrupted status code please read through it - the status code itself is just a small part of it, and none of the changes proposed in these discussions would lead to a proper implementation. _______________________________________________ Linux-nvme mailing list Linux-nvme@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: nvme: restore use of blk_path_error() in nvme_complete_rq() 2020-08-10 12:43 ` Christoph Hellwig @ 2020-08-10 15:06 ` Mike Snitzer 2020-08-11 3:45 ` [PATCH] " Chao Leng 1 sibling, 0 replies; 50+ messages in thread From: Mike Snitzer @ 2020-08-10 15:06 UTC (permalink / raw) To: Christoph Hellwig Cc: linux-nvme, Hannes Reinecke, Chao Leng, Keith Busch, Meneghini, John, Ewan Milne On Mon, Aug 10 2020 at 8:43am -0400, Christoph Hellwig <hch@infradead.org> wrote: > Just returning from my vacation, and I'm really surprised about > this discussion. Welcome back! FYI, I got roped into this stuff again due to reports of regression from Red Hat and partner QE testing of latest RHEL 8.3 stuff. I have been away from all things NVMe for a while so apologies for my misfire that dwelled on nvme_complete_rq()'s blk_path_error() call being removed. > If you want to fix a device mapper / block layer interaction you do > not change nvme code to use suboptimal interfaces. > > One of the reasons for the nvme multipath design was to allow for > tighter integration with protocol features, and making nvme do a detour > through the block layer for no good reason at all does not help with > that at all. Yes, reinstating blk_path_error() was a distraction. Still an outstanding point to discuss but it can be pushed to the back-back-burner. > And if you want to support the TP that added the new command interrupted > status code please read through it - the status code itself is just a > small part of it, and none of the changes proposed in these discussions > would lead to a proper implementation. Think you likely glossed over aspects of my exchange with Sagi. That's fine. I don't fault you in any way for not wanting to labor over this thread to this point! ;) Really I shouldn't _need_ to be the wiser about all this inherently local NVMe error handling, which happens to be increasingly complex, that seemingly has nothing to do with multipathing. 
NVMe should handle the command interrupted status code, ANA and now ACRE in a way where it isn't conflated to detour through nvme_failover_req() et al. So that is what I hope to rectify with some simple patches. Mike _______________________________________________ Linux-nvme mailing list Linux-nvme@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [PATCH] nvme: restore use of blk_path_error() in nvme_complete_rq() 2020-08-10 12:43 ` Christoph Hellwig 2020-08-10 15:06 ` Mike Snitzer @ 2020-08-11 3:45 ` Chao Leng 1 sibling, 0 replies; 50+ messages in thread From: Chao Leng @ 2020-08-11 3:45 UTC (permalink / raw) To: Christoph Hellwig, Mike Snitzer Cc: Keith Busch, Hannes Reinecke, linux-nvme, Ewan Milne, Meneghini, John On 2020/8/10 20:43, Christoph Hellwig wrote: > Just returning from my vacation, and I'm really surprised about > this discussion. > > If you want to fix a device mapper / block layer interaction you do > not change nvme code to use suboptimal interfaces. Agree. NVMe defines path errors and a local retry mechanism. It is more reasonable to use the new mechanism. > > One of the reasons for the nvme multipath design was to allow for > tighter integration with protocol features, and making nvme do a detour > through the block layer for no good reason at all does not help with > that at all. NVMe should use the local retry mechanism. But right now we first check the REQ_FAILFAST_TRANSPORT flag; when working with dm-multipath, REQ_FAILFAST_TRANSPORT will be set, and this interrupts the local retry. > > And if you want to support the TP that added the new command interrupted > status code please read through it - the status code itself is just a > small part of it, and none of the changes proposed in these discussions > would lead to a proper implementation. > . > _______________________________________________ Linux-nvme mailing list Linux-nvme@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [PATCH] nvme-core: fix io interrupt when work with dm-multipah 2020-08-06 15:59 ` Meneghini, John 2020-08-06 16:17 ` Meneghini, John @ 2020-08-07 0:03 ` Sagi Grimberg 2020-08-07 2:28 ` Chao Leng 1 sibling, 1 reply; 50+ messages in thread From: Sagi Grimberg @ 2020-08-07 0:03 UTC (permalink / raw) To: Meneghini, John, Keith Busch, Chao Leng Cc: Christoph Hellwig, linux-nvme, Hannes de, Ewan Milne >>> John, Wat do you think about if delete translate NVME_SC_CMD_INTERRUPTED >>> to BLK_STS_TARGET? Thank you. > > I think returning to BLK_STS_TARGET for NVME_SC_CMD_INTERRUPTED and NVME_SC_NS_NOT_READY > was a potentially non-backwards compatible change that Keith made in response to a problem that > I reported in patch 35038bffa87da. > > https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=35038bffa87da > > I don’t think it would be a mistake to back out this change. We've had a number of problems > in the distributions with dm-mp and other legacy code getting confused by this change. From what I see the point of this change was to have these status codes to do exactly what they need to do, use the local retry flow. >>> BLK_STS_TARGET means target has critical error. NVME_SC_CMD_INTERRUPTED >>> just means target need retry io. It is not suitable to translate >>> NVME_SC_CMD_INTERRUPTED to BLK_STS_TARGET. Maybe translate to >>> BLK_STS_IOERR is also not suitable, we should translate >>> NVME_SC_CMD_INTERRUPTED to BLK_STS_AGAIN. >>> We can do like this: >> >> BLK_STS_AGAIN is a bad choice as we use it for calls that block when >> the callers asked for non-blocking submission. I'm really not sure >> we want to change anything here - the error definition clearly states >> it is not a failure but a request to retry later. > > Agreed. The historical behavior in nvme-core has been to return BLK_STS_IOERR status for all new > or unknown NVMe errors. Which is the correct thing to do. > . 
On 8/6/20, 10:26 AM, "Keith Busch" <kbusch@kernel.org> wrote: > > > On Thu, Aug 06, 2020 at 01:52:42PM +0800, Chao Leng wrote: > > NVME_SC_LBA_RANGE should failfast, because retry can not success. > > The DNR bit is how NVMe conveys such things. > > I think that NVME_SC_LBA_RANGE has historically returned BLK_STS_TARGET so that > shouldn't be changed. I agree that DNR - and now the new ACRE mechanism - is what > should control the command retry behavior in NVMe. Completely agree here too. > > NVME_SC_NS_NOT_READY may retry success, but the probality is very low. > > NVME_SC_CMD_INTERRUPTED need retry, according to protocol define, retry will success. > > If the upper layers set a request such that "noretry" is true, that's > the behavior you're going to get. Sprinkling special exceptions around > is not a good idea. > > I agree. I also 100% agree here. > I think the problem here is that the current BLK_STS and FAST_FAIL mechanisms > were designed support legacy protocols like SCSI. They assume that all retry behavior is > controlled by other components in the stack. NVMe is presenting new protocol features > and semantics which probably can't be effectively supported by those legacy BLK_STS > and FAST_FAIL mechanisms without passing more information up the stack. Not sure how generic this new blk status would be.. It would probably make a lot more sense if there are other consumers for such a status code. Maybe we could set it to BLK_STS_TIMEOUT with a big fat comment for why we are doing this... _______________________________________________ Linux-nvme mailing list Linux-nvme@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [PATCH] nvme-core: fix io interrupt when work with dm-multipah 2020-08-07 0:03 ` [PATCH] nvme-core: fix io interrupt when work with dm-multipah Sagi Grimberg @ 2020-08-07 2:28 ` Chao Leng 0 siblings, 0 replies; 50+ messages in thread From: Chao Leng @ 2020-08-07 2:28 UTC (permalink / raw) To: Sagi Grimberg, Meneghini, John, Keith Busch Cc: Christoph Hellwig, linux-nvme, Hannes de, Ewan Milne On 2020/8/7 8:03, Sagi Grimberg wrote: >> I think the problem here is that the current BLK_STS and FAST_FAIL mechanisms >> were designed support legacy protocols like SCSI. They assume that all retry behavior is >> controlled by other components in the stack. NVMe is presenting new protocol features >> and semantics which probably can't be effectively supported by those legacy BLK_STS >> and FAST_FAIL mechanisms without passing more information up the stack. > > Not sure how generic this new blk status would be.. It would probably > make a lot more sense of there are other consumers for such a status > code. > > Maybe we could set it to BLK_STS_TIMEOUT with a big fat comment for why > we are doing this... Introducing a new BLK_STS error may be a good choice, but it is not friendly to forward compatibility. Now we have three choices: BLK_STS_TIMEOUT, BLK_STS_RESOURCE, or BLK_STS_IOERR by default. DM-multipath will fail the path and then fail over to retry. The difference is that BLK_STS_RESOURCE will delay 100ms before retrying, while BLK_STS_TIMEOUT and BLK_STS_IOERR will retry immediately. There is no perfect choice, but we have to choose one. I suggest keeping the default and translating NVME_SC_CMD_INTERRUPTED to BLK_STS_IOERR. _______________________________________________ Linux-nvme mailing list Linux-nvme@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme ^ permalink raw reply [flat|nested] 50+ messages in thread
end of thread, other threads:[~2020-08-14 7:02 UTC | newest] Thread overview: 50+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2020-07-27 5:58 [PATCH] nvme-core: fix io interrupt when work with dm-multipah Chao Leng 2020-07-28 11:19 ` Christoph Hellwig 2020-07-29 2:54 ` Chao Leng 2020-07-29 5:59 ` Christoph Hellwig 2020-07-30 1:49 ` Chao Leng 2020-08-05 6:40 ` Chao Leng 2020-08-05 15:29 ` Keith Busch 2020-08-06 5:52 ` Chao Leng 2020-08-06 14:26 ` Keith Busch 2020-08-06 15:59 ` Meneghini, John 2020-08-06 16:17 ` Meneghini, John 2020-08-06 18:40 ` Mike Snitzer 2020-08-06 19:19 ` [PATCH] nvme: restore use of blk_path_error() in nvme_complete_rq() Mike Snitzer 2020-08-06 22:42 ` Meneghini, John 2020-08-07 0:07 ` Mike Snitzer 2020-08-07 1:21 ` Sagi Grimberg 2020-08-07 4:50 ` Mike Snitzer 2020-08-07 23:35 ` Sagi Grimberg 2020-08-08 21:08 ` Meneghini, John 2020-08-08 21:11 ` Meneghini, John 2020-08-10 14:48 ` Mike Snitzer 2020-08-11 12:54 ` Meneghini, John 2020-08-10 8:10 ` Chao Leng 2020-08-11 12:36 ` Meneghini, John 2020-08-12 7:51 ` Chao Leng 2020-08-10 14:36 ` Mike Snitzer 2020-08-10 17:22 ` [PATCH] nvme: explicitly use normal NVMe error handling when appropriate Mike Snitzer 2020-08-11 3:32 ` Chao Leng 2020-08-11 4:20 ` Mike Snitzer 2020-08-11 6:17 ` Chao Leng 2020-08-11 14:12 ` Mike Snitzer 2020-08-13 14:48 ` [RESEND PATCH] " Mike Snitzer 2020-08-13 15:29 ` Meneghini, John 2020-08-13 15:43 ` Mike Snitzer 2020-08-13 15:59 ` Meneghini, John 2020-08-13 15:36 ` Christoph Hellwig 2020-08-13 17:47 ` Mike Snitzer 2020-08-13 18:43 ` Christoph Hellwig 2020-08-13 19:03 ` Mike Snitzer 2020-08-14 4:26 ` Meneghini, John 2020-08-14 6:53 ` Sagi Grimberg 2020-08-14 6:55 ` Christoph Hellwig 2020-08-14 7:02 ` Sagi Grimberg 2020-08-14 3:23 ` Meneghini, John 2020-08-07 0:44 ` [PATCH] nvme: restore use of blk_path_error() in nvme_complete_rq() Sagi Grimberg 2020-08-10 12:43 ` Christoph Hellwig 2020-08-10 15:06 ` Mike Snitzer 
2020-08-11 3:45 ` [PATCH] " Chao Leng 2020-08-07 0:03 ` [PATCH] nvme-core: fix io interrupt when work with dm-multipah Sagi Grimberg 2020-08-07 2:28 ` Chao Leng