Re: [PATCH v2] nvme-fabrics: allow to queue requests for live queues

From: James Smart <james.smart@broadcom.com>
To: Sagi Grimberg <sagi@grimberg.me>,
	linux-nvme@lists.infradead.org, Christoph Hellwig <hch@lst.de>,
	Keith Busch <kbusch@kernel.org>
Subject: Re: [PATCH v2] nvme-fabrics: allow to queue requests for live queues
Date: Wed, 29 Jul 2020 17:18:43 -0700	[thread overview]
Message-ID: <2ab8786b-211f-6e7c-980b-b4ea01cacd47@broadcom.com> (raw)
In-Reply-To: <45901cfb-6dc7-d021-7151-bcbab1390139@grimberg.me>

On 7/29/2020 3:09 PM, Sagi Grimberg wrote:
>
>> So what was the problem ?  It has to be limited in scope to either 
>> the start of the Resetting state while q->live had yet to transition 
>> to !live. But I don't see any freezing in this path.
>
> When the queue is frozen, the q_usage_counter is killed, and waited
> until it reaches to 0, and that will not happen if all the requests
> that got the q_usage_counter reference are completed. So we cannot
> get into a state where requests are requeued due to BLK_STS_RESOURCE,
> as they still have the reference that need to be dropped.
>
>> Looking at the original problem that lock ups are:
>> 1) the sysfs write to do a reset - which is stalled waiting on a 
>> flush of reset_work
>> 2) reset_work is stalled on a call to blk_mq_update_nr_hw_queues() 
>> which then does a blk_mq_freeze_queue_wait.
>
> Right, and it blocks because we have commands that are requeued
> indefinitely due to the fact that we refused to accept them while
> we should have.

I understand what you are saying, but.

Invariably, the queues are unquiesced to keep connect going through (all 
queues) and more (admin queue), and as new io will come in that time, we 
have no recourse but to bounce the io as the q isn't live. And it has to 
be a resource status so it is requeued as we can't fail the io as the 
failure, in a non-mpath scenario, will go all the way back to the caller 
which is bad news.  So we're in a catch-22.

That was why I suggested perhaps we needed to make mpath always be 
involved, even for a single-path device (non-mpath) so that it can catch 
and queue at it's level (as if all paths are gone), for a short time 
period, rather than letting the error go back to the issuer.

>
>> The 2nd one is odd in my mind as the the update_nr_hw_queues is 
>> something done while connecting and I wouldn't think the reset flush 
>> must wait for a reconnect as well.
>
> We cannot transition to LIVE and restart traffic before we update
> the mappings to the new number of queues (controller suddenly allowed
> less queues, or we got cpu hotplug or whatever).

agree

>
>   Although not the fix, I'd recommend
>> to not call nvme_tcp_setup_ctrl() from reset_work and instead have it 
>> schedule nvme_tcp_reconnect_ctrl_work().
>
> What is the difference?

it didn't matter for the issue we're really trying to solve. Just 
something that looked odd.

>
> As I see it what the driver needs to do in reset is:
> 1. start freeze - prevent from new requests from taking the 
> q_usage_counter
> 2. quiesce - prevent new commands from entering queue_rq
> 3. stop and drain all transport queues
> 4. cancel all inflight commands (tagset_iter or other)
> 5. start transport queues again - allow only connect to pass
augmented for admin-q to allow other controller init commands as well.

> 6. unquiesce the queues - to allow any requests that are stuck
>    between (1) and (2) (blocked on quiesced but already have
>    the q_usage_counter) to run and complete.
>
> --> at this point anyone waiting for queue freeze should now be
>     able to complete (e.g. update_nr_hw_queues).
>
> 7. wait for started freeze (paired to balance the freeze_counter)
> 8. unfreeze the queue - paired with (1)

ok. But you never stated where in these steps the transport queue is no 
longer live.  I think you're saying (6) will process the io, not abort 
it like the "drain" step, so it can run to completion on the link.  
That's ok for a user-initiated reset, but if the reset is due to a 
link-side error such as the FC case, the queues aren't live and the only 
way to complete the io is to fail it with some error. And for non-mpath, 
that error may surface back to the issuer.

>
>> As tcp did change state to Connecting before making the 
>> update_nr_hw_queues call, yet we didn't change the behavior of 
>> rescheduling if Connecting and !q->live, what did the patch actually 
>> do to help ?
>
> As I said, we shouldn't have I/O commands that are requeued forever
> preventing the freeze from completing. if we are CONNECTING, then
> the queues are not started, so we shouldn't see any I/O commands
> coming yet.

well - ok, and we're back to why they are started. Let's say we're not 
worred about connect ordering - another reason for the continued running 
is so that during an extended reconnect period that may take a couple of 
retries, we can have a fast_io_fail timeout to start failing those 
rescheduled normal ios. even then - fast_io_fail should have been with 
mpath ios so they already should have been fine. Any fast_io_fail would 
still run the risk of returning an error to the issuer.

>
>   The patch only changed the queuing behavior for Connecting
>> and q->live.  So whether or not we hit it is dependent on the timing 
>> of when a new io is received vs the call to hr_hw_queues.   The patch 
>> didn't help this.
>
> There are some supporting patches I sent for tcp and rdma (to operate
> in the order I explained above).
>
> I'll resend them as a set soon.

ok - but unless I'm missing something we have some holes.

-- james

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme