Re: [PATCH v2 4/6] nvme-rdma: avoid IO error and repeated request completion

From: Sagi Grimberg <sagi@grimberg.me>
To: Chao Leng <lengchao@huawei.com>, linux-nvme@lists.infradead.org
Cc: kbusch@kernel.org, axboe@fb.com, hch@lst.de,
	linux-block@vger.kernel.org, axboe@kernel.dk
Subject: Re: [PATCH v2 4/6] nvme-rdma: avoid IO error and repeated request completion
Date: Fri, 15 Jan 2021 17:18:37 -0800
Message-ID: <4ff22d33-12fa-1f70-3606-54821f314c45@grimberg.me>
In-Reply-To: <695b6839-5333-c342-2189-d7aaeba797a7@huawei.com>

>>>>> When queuing a request fails, the blk_status_t is returned directly
>>>>> to blk-mq. If the blk_status_t is not BLK_STS_RESOURCE,
>>>>> BLK_STS_DEV_RESOURCE or BLK_STS_ZONE_RESOURCE, blk-mq calls
>>>>> blk_mq_end_request to complete the request with BLK_STS_IOERR.
>>>>> In two scenarios the request should be retried and may succeed.
>>>>> First, with nvme multipath, the request may be retried
>>>>> successfully on another path, because the error is probably related to
>>>>> the path. Second, without multipath software, the request may
>>>>> be retried successfully after error recovery.
>>>>> If the request is completed with BLK_STS_IOERR in
>>>>> blk_mq_dispatch_rq_list, the state of the request may already have been
>>>>> changed to MQ_RQ_IN_FLIGHT. If the request is freed asynchronously,
>>>>> such as in nvme_submit_user_cmd, in an extreme scenario the request
>>>>> will be freed repeatedly in teardown.
>>>>> If a non-resource error occurs in queue_rq, we should directly call
>>>>> nvme_complete_rq to complete the request and set the state of the
>>>>> request to MQ_RQ_COMPLETE. nvme_complete_rq will decide whether to
>>>>> retry, fail over or end the request.
>>>>>
>>>>> Signed-off-by: Chao Leng <lengchao@huawei.com>
>>>>> ---
>>>>>   drivers/nvme/host/rdma.c | 2 +-
>>>>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
>>>>> index df9f6f4549f1..4a89bf44ecdc 100644
>>>>> --- a/drivers/nvme/host/rdma.c
>>>>> +++ b/drivers/nvme/host/rdma.c
>>>>> @@ -2093,7 +2093,7 @@ static blk_status_t nvme_rdma_queue_rq(struct blk_mq_hw_ctx *hctx,
>>>>>   unmap_qe:
>>>>>       ib_dma_unmap_single(dev, req->sqe.dma, sizeof(struct nvme_command),
>>>>>                   DMA_TO_DEVICE);
>>>>> -    return ret;
>>>>> +    return nvme_try_complete_failed_req(rq, ret);
>>>>
>>>> I don't understand this. There are errors that may not be related to
>>>> anything that is pathing related (sw bug, memory leak, mapping error,
>>>> etc, etc) why should we return this one-shot error?
>>> Although failover retry is not required, if we return the error to
>>> blk-mq, a low-probability crash may happen, because blk-mq does not set
>>> the state of the request to MQ_RQ_COMPLETE before completing the request,
>>> and the request may be freed asynchronously, such as in nvme_submit_user_cmd.
>>> If this races with error recovery, a double completion of the request
>>> may happen.
>>
>> Then fix that, don't work around it.
> I'm not trying to work around it. The purpose of this is to solve
> the problem of nvme native multipathing at the same time.

Please explain how this is an nvme-multipath issue?

>>
>>>
>>> So we cannot return the error to blk-mq if the blk_status_t is not
>>> BLK_STS_RESOURCE, BLK_STS_DEV_RESOURCE or BLK_STS_ZONE_RESOURCE.
>>
>> This is not something we should be handling in nvme. block drivers
>> should be able to fail queue_rq, and this all should live in the
>> block layer.
> Of course, repairing the block drivers directly is also an option.
> However, the block layer is unaware of nvme native multipathing,

Nor should it be.

> and will cause the request to return an error that should be avoided.

Not sure I understand.
Requests should fail over for path-related errors;
which queue_rq errors do you expect to be failed over, from your
perspective?

> The scenario: two HBAs are used for nvme native multipath, and then one
> HBA faults,

What is the specific error the driver sees?

> the blk_status_t of queue_rq is BLK_STS_IOERR, and blk-mq will call
> blk_mq_end_request to complete the request, which bypasses nvme native
> multipath. We expect the request to fail over to the healthy HBA, but the
> request is directly completed with BLK_STS_IOERR.
> The two scenarios can be fixed by directly completing the request in
> queue_rq.
Well, this one-shot approach of always returning 0 and completing the
command with a HOST_PATH error is certainly not a good approach IMO.
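
For reference, the blk-mq behaviour described in the quoted commit message
(resource-type statuses are retried, anything else ends the request with an
I/O error) can be sketched as a small standalone C model. This is not kernel
code: the enum names and model_dispatch() are hypothetical stand-ins for
blk_status_t and the dispatch path, and the model only mirrors the rule as
stated in this thread.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the blk_status_t values named in the thread. */
enum model_sts {
	MODEL_STS_OK,
	MODEL_STS_RESOURCE,
	MODEL_STS_DEV_RESOURCE,
	MODEL_STS_ZONE_RESOURCE,
	MODEL_STS_IOERR,
};

/* What the (modelled) block layer does with the request afterwards. */
enum model_outcome {
	MODEL_IN_FLIGHT,	/* queue_rq succeeded, request is owned by the driver */
	MODEL_REQUEUED,		/* resource shortage, request will be dispatched again */
	MODEL_ENDED_IOERR,	/* request is ended with an I/O error, no failover */
};

/* True for the three statuses the commit message lists as retryable. */
static bool model_sts_is_resource(enum model_sts sts)
{
	return sts == MODEL_STS_RESOURCE ||
	       sts == MODEL_STS_DEV_RESOURCE ||
	       sts == MODEL_STS_ZONE_RESOURCE;
}

/* Map the value returned by a driver's queue_rq to the modelled outcome. */
static enum model_outcome model_dispatch(enum model_sts queue_rq_ret)
{
	assert(queue_rq_ret >= MODEL_STS_OK && queue_rq_ret <= MODEL_STS_IOERR);

	if (queue_rq_ret == MODEL_STS_OK)
		return MODEL_IN_FLIGHT;
	if (model_sts_is_resource(queue_rq_ret))
		return MODEL_REQUEUED;
	/* Any other error: the request is completed with an I/O error. */
	return MODEL_ENDED_IOERR;
}
```

In this model an HBA fault that surfaces as MODEL_STS_IOERR from queue_rq
lands in the last branch, which is the case debated above: the request is
ended before any multipath failover decision can be made.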