From: Chao Leng <lengchao@huawei.com>
To: Sagi Grimberg <sagi@grimberg.me>, <linux-nvme@lists.infradead.org>
Cc: <kbusch@kernel.org>, <axboe@fb.com>, <hch@lst.de>,
	<linux-block@vger.kernel.org>, <axboe@kernel.dk>
Subject: Re: [PATCH v2 4/6] nvme-rdma: avoid IO error and repeated request completion
Date: Mon, 18 Jan 2021 11:22:16 +0800	[thread overview]
Message-ID: <0b5c8e31-8dc2-994a-1710-1b1be07549c9@huawei.com> (raw)
In-Reply-To: <4ff22d33-12fa-1f70-3606-54821f314c45@grimberg.me>



On 2021/1/16 9:18, Sagi Grimberg wrote:
> 
>>>>>> When queuing a request fails, its blk_status_t is returned directly
>>>>>> to blk-mq. If the status is not BLK_STS_RESOURCE,
>>>>>> BLK_STS_DEV_RESOURCE or BLK_STS_ZONE_RESOURCE, blk-mq calls
>>>>>> blk_mq_end_request to complete the request with BLK_STS_IOERR.
>>>>>> In two scenarios the request should instead be retried and may
>>>>>> succeed. First, with nvme multipath the request may be retried
>>>>>> successfully on another path, because the error is probably path
>>>>>> related. Second, without multipath software the request may be
>>>>>> retried successfully after error recovery.
>>>>>> If the request is completed with BLK_STS_IOERR in
>>>>>> blk_mq_dispatch_rq_list, its state may already have been changed to
>>>>>> MQ_RQ_IN_FLIGHT. If the request is also freed asynchronously, such as
>>>>>> in nvme_submit_user_cmd, in an extreme scenario it will be freed a
>>>>>> second time during tear down.
>>>>>> If a non-resource error occurs in queue_rq, the driver should instead
>>>>>> call nvme_complete_rq directly to complete the request and set its
>>>>>> state to MQ_RQ_COMPLETE; nvme_complete_rq will decide whether to
>>>>>> retry, fail over or end the request.
>>>>>>
>>>>>> Signed-off-by: Chao Leng <lengchao@huawei.com>
>>>>>> ---
>>>>>>   drivers/nvme/host/rdma.c | 2 +-
>>>>>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
>>>>>> index df9f6f4549f1..4a89bf44ecdc 100644
>>>>>> --- a/drivers/nvme/host/rdma.c
>>>>>> +++ b/drivers/nvme/host/rdma.c
>>>>>> @@ -2093,7 +2093,7 @@ static blk_status_t nvme_rdma_queue_rq(struct blk_mq_hw_ctx *hctx,
>>>>>>   unmap_qe:
>>>>>>       ib_dma_unmap_single(dev, req->sqe.dma, sizeof(struct nvme_command),
>>>>>>                   DMA_TO_DEVICE);
>>>>>> -    return ret;
>>>>>> +    return nvme_try_complete_failed_req(rq, ret);
>>>>>
>>>>> I don't understand this. There are errors that may not be related to
>>>>> anything path related (sw bug, memory leak, mapping error, etc.), so
>>>>> why should we return this one-shot error?
>>>> Although a failover retry is not required, if we return the error to
>>>> blk-mq a low-probability crash may happen, because blk-mq does not set
>>>> the request state to MQ_RQ_COMPLETE before completing it, and the
>>>> request may be freed asynchronously, such as in nvme_submit_user_cmd.
>>>> If this races with error recovery, a request double completion may happen.
>>>
>>> Then fix that, don't work around it.
>> I'm not trying to work around it. The purpose of this is also to
>> solve the nvme native multipathing problem at the same time.
> 
> Please explain how this is an nvme-multipath issue?
> 
>>>
>>>>
>>>> So we cannot return the error to blk-mq if the blk_status_t is not
>>>> BLK_STS_RESOURCE, BLK_STS_DEV_RESOURCE or BLK_STS_ZONE_RESOURCE.
>>>
>>> This is not something we should be handling in nvme. block drivers
>>> should be able to fail queue_rq, and this all should live in the
>>> block layer.
>> Of course, fixing this in the block drivers directly is also an option.
>> However, the block layer is unaware of nvme native multipathing,
> 
> Nor should it be
> 
>> and will cause the request to complete with an error, which should be avoided.
> 
> Not sure I understand..
> requests should fail over for path-related errors;
> what queue_rq errors are expected to be failed over, from your
> perspective?
Although failing over only for path-related errors would be the best
choice, it is almost impossible to achieve.
The probability of non-path-related errors is very low. Although these
errors do not require a failover retry, the cost of retrying them is only
that completing the request with an error is delayed for a somewhat long
time (by several retries). That is not ideal, but I think it is acceptable,
because the HBA driver does not have path-related error codes, only general
error codes, and it is difficult to identify whether a general error code
is path-related.
> 
>> The scenario: two HBAs are used for nvme native multipath, and then one
>> HBA faults;
> 
> What is the specific error the driver sees?
The path-related error code depends on the HBA driver implementation; in
general it is EIO. I don't think it's a good idea to assume which general
error code the driver returns in the event of a path error.
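To illustrate, the error path at the end of nvme_rdma_queue_rq today
collapses whatever errno the lower layers hand back into a generic block
status, roughly like this (a simplified sketch, not an exact quote of the
current code):

	if (err == -ENOMEM || err == -EAGAIN)
		ret = BLK_STS_RESOURCE;
	else
		ret = BLK_STS_IOERR;	/* an EIO from a dead port looks the
					 * same as any driver-internal error */

So by the time blk-mq sees the status, a path failure is no longer
distinguishable from any other failure.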
> 
>> the blk_status_t returned by queue_rq is BLK_STS_IOERR, and blk-mq will
>> call blk_mq_end_request to complete the request, which bypasses nvme
>> native multipath. We expect the request to fail over to the healthy HBA,
>> but it is completed directly with BLK_STS_IOERR.
>> Both scenarios can be fixed by completing the request directly in queue_rq.
> Well, certainly this one-shot "always return 0 and complete the command
> with a HOST_PATH error" is not a good approach IMO
So what's the better option? Just complete the request with a host path
error for anything other than ENOMEM and EAGAIN returned by the HBA driver?
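
For concreteness, what I have in mind for the end of queue_rq is roughly
the following (a simplified sketch of the helper used by this patch,
patch 2/6 of the series; not the exact code):

	static inline blk_status_t nvme_try_complete_failed_req(struct request *req,
								 blk_status_t ret)
	{
		/* Resource errors still go back to blk-mq so it can requeue. */
		if (ret == BLK_STS_RESOURCE || ret == BLK_STS_DEV_RESOURCE ||
		    ret == BLK_STS_ZONE_RESOURCE)
			return ret;

		/*
		 * Any other error is completed inside the driver: mark the
		 * request MQ_RQ_COMPLETE (blk_mq_set_request_complete from
		 * patch 1/6), record a host path error and let
		 * nvme_complete_rq() decide to retry, fail over or end it.
		 */
		nvme_req(req)->status = NVME_SC_HOST_PATH_ERROR;
		blk_mq_set_request_complete(req);
		nvme_complete_rq(req);
		return BLK_STS_OK;
	}

That way queue_rq never hands a non-resource error back to
blk_mq_end_request, which avoids both the bypass of nvme native multipath
and the double completion described above.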

Thread overview: 44+ messages
2021-01-07  3:31 [PATCH v2 0/6] avoid repeated request completion and IO error Chao Leng
2021-01-07  3:31 ` [PATCH v2 1/6] blk-mq: introduce blk_mq_set_request_complete Chao Leng
2021-01-14  0:17   ` Sagi Grimberg
2021-01-14  6:50     ` Chao Leng
2021-01-07  3:31 ` [PATCH v2 2/6] nvme-core: introduce complete failed request Chao Leng
2021-01-21  8:14   ` Hannes Reinecke
2021-01-22  1:45     ` Chao Leng
2021-01-07  3:31 ` [PATCH v2 3/6] nvme-fabrics: avoid repeated request completion for nvmf_fail_nonready_command Chao Leng
2021-01-07  3:31 ` [PATCH v2 4/6] nvme-rdma: avoid IO error and repeated request completion Chao Leng
2021-01-14  0:19   ` Sagi Grimberg
2021-01-14  6:55     ` Chao Leng
2021-01-14 21:25       ` Sagi Grimberg
2021-01-15  2:53         ` Chao Leng
2021-01-16  1:18           ` Sagi Grimberg
2021-01-18  3:22             ` Chao Leng [this message]
2021-01-18 17:49               ` Christoph Hellwig
2021-01-19  1:50                 ` Chao Leng
2021-01-20 21:35               ` Sagi Grimberg
2021-01-21  1:34                 ` Chao Leng
2021-01-07  3:31 ` [PATCH v2 5/6] nvme-tcp: " Chao Leng
2021-01-07  3:31 ` [PATCH v2 6/6] nvme-fc: " Chao Leng
2021-01-14  0:15 ` [PATCH v2 0/6] avoid repeated request completion and IO error Sagi Grimberg
2021-01-14  6:50   ` Chao Leng
