Re: [PATCH] nvme-fabrics: allow to queue requests for live queues

From: Sagi Grimberg <sagi@grimberg.me>
To: James Smart <james.smart@broadcom.com>,
	Chao Leng <lengchao@huawei.com>,
	linux-nvme@lists.infradead.org, Christoph Hellwig <hch@lst.de>,
	Keith Busch <kbusch@kernel.org>
Subject: Re: [PATCH] nvme-fabrics: allow to queue requests for live queues
Date: Tue, 28 Jul 2020 16:39:26 -0700	[thread overview]
Message-ID: <8c96aa2a-89e3-b8e5-8506-f489aa9a69f1@grimberg.me> (raw)
In-Reply-To: <520f2317-533d-b546-dd35-80ae5f707aa7@broadcom.com>

>>>> Can you please give an example? NVME_REQ_USERCMD should not be any
>>>> different from any other type of I/O.
>>> ...
>>
>> The problem I have with this code is that checking for NVME_REQ_USERCMD
>> is a big hammer that you land on what can be normal I/O
>> (e.g. nvme read/write/write-zeros), and we must not rely on this
>> indication to prevent what you are describing.
>>
>> It maybe (just maybe) OK to check NVME_REQ_USERCMD only for the admin
>> queue, but we are really circling over the fact that we cannot reliably
>> send admin connect before to go off first, because we have to unquiesce
>> it before we issue the admin connect, and there might be a stray command
>> going into execution before we submit the admin connect.
> 
> agree.
> 
> Didn't you suggest something separate to isolate the connect 
> commands/init commands ?   Can we use a different tag set for 
> "initialization" commands?  We can use different queue_rq routines with 
> different check ready checks.  I don't like the overhead, but sure makes 
> the logic more straightforward.

I think I've added this now that I'm looking back:
e7832cb48a65 ("nvme: make fabrics command run on a separate request queue")

But it still doesn't solve the user commands you are referring to.

>>>>> As for the blk_rq_is_passthrough check - I guess I can see it being 
>>>>> based on the queue state, and the check looks ok  (we should never 
>>>>> see !blk_rq_is_passthrough on the admin q).
>>>>> But...
>>>>> - I don't know why it was that important to change it. On the 
>>>>> connecting path, all you're doing is letting io start flowing 
>>>>> before all the queues have been created.  Did you really need to 
>>>>> start that much sooner ?
>>>>
>>>> The issue is that controller in RESETTING state will have requests that
>>>> are being issued, and if we don't let it pass through, it will hang
>>>> around forever being requeued preventing queue freeze to complete.
>>>
> 
> Isn't this the real problem:  we require the request to complete in 
> order to get off the request queue in order to freeze. But to complete - 
> it has to finish with some status. This is ok if running multipath, as 
> mpath queues and retries - but what mpath is not running ? The issuer of 
> the io will get the error ?

We already fail noretry errors when we cancel (or abort) anything that
was submitted before, so there is nothing new here. non-mpath
does not set FAILFAST so it gets requeued, user commands do, so they
are failed, this is not different than before.

> Should we always be putting even non-mp devices under mp, thus mp 
> catches the errors rather than the blk queue, and can have a short delay 
> while the controller reconnects ?  At least normal io can be handled 
> this way and we'd have to figure out the ioctl path.   But it should 
> still allow the core io, such as scanning, to get the error and fail 
> out, to be picked up after the reconnect.

non-mpath still gets requeued because it doesn't set any sort of
FAILFAST flag.
or

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme