From: Hannes Reinecke <hare@suse.de>
To: Sagi Grimberg <sagi@grimberg.me>, Christoph Hellwig <hch@lst.de>
Cc: Keith Busch <keith.busch@intel.com>,
	linux-nvme@lists.infradead.org,
	John Meneghini <john.meneghini@netapp.com>
Subject: Re: [PATCH] nvme-multipath: do not reset controller on unknown status
Date: Thu, 13 Feb 2020 08:02:20 +0100
Message-ID: <3345c55f-3a42-315b-1d62-20f9aaab296e@suse.de>
In-Reply-To: <6d4d18e3-c3a1-7d93-5abf-1ae46e18ca8c@grimberg.me>

On 2/12/20 8:33 PM, Sagi Grimberg wrote:
> 
> 
> On 2/12/20 9:53 AM, Christoph Hellwig wrote:
>> On Wed, Feb 12, 2020 at 02:41:40PM +0100, Hannes Reinecke wrote:
>>> We're seeing occasional controller resets during straight I/O,
>>> but only when multipath is active.
>>> The problem here is that nvme-multipath will reset the controller
>>> on every unknown status, which really is odd behaviour, seeing
>>> that the host already received a perfectly good status; it's just
>>> that the host is not smart enough to understand it.
>>> And resetting wouldn't help at all; the error status will continue
>>> to be received.
>>> So we should rather pass up any unknown error to the generic
>>> routines and let them deal with this situation.
>>
>> What unknown status do you see?
> 
> My question exactly; I don't understand what the good status is that
> the host is not smart enough to understand.
> 
> I have a vague recollection that John sent some questions in that area
> in the past...
> 
>>
>>>
>>> Signed-off-by: Hannes Reinecke <hare@suse.de>
>>> Cc: John Meneghini <john.meneghini@netapp.com>
>>> ---
>>>   drivers/nvme/host/core.c      |  4 ++--
>>>   drivers/nvme/host/multipath.c | 18 ++++++++++--------
>>>   drivers/nvme/host/nvme.h      |  2 +-
>>>   3 files changed, 13 insertions(+), 11 deletions(-)
>>>
>>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>>> index 5dc32b72e7fa..edb081781ae7 100644
>>> --- a/drivers/nvme/host/core.c
>>> +++ b/drivers/nvme/host/core.c
>>> @@ -293,8 +293,8 @@ void nvme_complete_rq(struct request *req)
>>>       if (unlikely(status != BLK_STS_OK && nvme_req_needs_retry(req))) {
>>>           if ((req->cmd_flags & REQ_NVME_MPATH) &&
>>>               blk_path_error(status)) {
>>> -            nvme_failover_req(req);
>>> -            return;
>>> +            if (nvme_failover_req(req))
>>> +                return;
>>
>> This reads weird, why not:
>>
>>         if ((req->cmd_flags & REQ_NVME_MPATH) &&
>>             blk_path_error(status) && nvme_failover_req(req))
>>             return;
>>
>> ?
>>
>> But see below, with your updated checks (assuming this makes sense
>> as we need more explanation) we don't even need the blk_path_error
>> call.
> 
> I think it reads better if we call nvme_failover_req only for actual
> path errors.
> 
>>
>>> -void nvme_failover_req(struct request *req)
>>> +bool nvme_failover_req(struct request *req)
>>>   {
>>>       struct nvme_ns *ns = req->q->queuedata;
>>>       u16 status = nvme_req(req)->status;
>>>       unsigned long flags;
>>> +    bool handled = false;
>>>         spin_lock_irqsave(&ns->head->requeue_lock, flags);
>>>       blk_steal_bios(&ns->head->requeue_list, req);
>>>       spin_unlock_irqrestore(&ns->head->requeue_lock, flags);
>>> -    blk_mq_end_request(req, 0);
>>
>> You can't just steal the bios without completing the request.
>>
>>>         switch (status & 0x7ff) {
>>>       case NVME_SC_ANA_TRANSITION:
>>> @@ -88,11 +88,13 @@ void nvme_failover_req(struct request *req)
>>>            * mark the path as pending and kick off a re-read of the ANA
>>>            * log page ASAP.
>>>            */
>>> +        blk_mq_end_request(req, 0);
>>>           nvme_mpath_clear_current_path(ns);
>>>           if (ns->ctrl->ana_log_buf) {
>>>               set_bit(NVME_NS_ANA_PENDING, &ns->flags);
>>>               queue_work(nvme_wq, &ns->ctrl->ana_work);
>>>           }
>>> +        handled = true;
>>>           break;
>>>       case NVME_SC_HOST_PATH_ERROR:
>>>       case NVME_SC_HOST_ABORTED_CMD:
>>> @@ -100,18 +102,18 @@ void nvme_failover_req(struct request *req)
>>>            * Temporary transport disruption in talking to the controller.
>>>            * Try to send on a new path.
>>>            */
>>> +        blk_mq_end_request(req, 0);
>>>           nvme_mpath_clear_current_path(ns);
>>> +        handled = true;
>>>           break;
>>>       default:
>>> -        /*
>>> -         * Reset the controller for any non-ANA error as we don't know
>>> -         * what caused the error.
>>> -         */
>>> -        nvme_reset_ctrl(ns->ctrl);
>>> +        /* Delegate to common error handling */
>>>           break;
> 
> What will happen in the common case? Right now it will just retry
> it on the same path; is that the desired behavior?
> 
> I guess we need to understand the phenomenon better.
> 
> Right now the code says: we don't know what went wrong here, so we
> are going to reset the controller because it is acting weird. That can
> make some sense, given that the host doesn't understand what is going
> on...
> 
But this is precisely the case I'm arguing against here.
One of the lessons learned from SCSI is that a reset only makes sense if
the system misbehaves and resetting it makes the error go away.

Receiving a status code which we don't know about does _not_ fall into
this category; the very fact that we have a status code proves that the
system does _not_ misbehave.

So what exactly will be resolved by resetting?
There actually is a fair chance that we'll be getting the very same
error again...

Consequently I think that resetting the controller here is wrong, and we
should just handle it as a normal but unknown error.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		           Kernel Storage Architect
hare@suse.de			                  +49 911 74053 688
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Felix Imendörffer

