Linux-NVME Archive on lore.kernel.org
 help / color / Atom feed
From: Sagi Grimberg <sagi@grimberg.me>
To: James Smart <james.smart@broadcom.com>,
	linux-nvme@lists.infradead.org, Christoph Hellwig <hch@lst.de>,
	Keith Busch <kbusch@kernel.org>
Subject: Re: [PATCH v2] nvme-fabrics: allow to queue requests for live queues
Date: Wed, 29 Jul 2020 15:09:30 -0700
Message-ID: <45901cfb-6dc7-d021-7151-bcbab1390139@grimberg.me> (raw)
In-Reply-To: <dcfb5abb-b740-2043-a6c2-97dca47f75e4@broadcom.com>


>> Right now we are failing requests based on the controller
>> state (which is checked inline in nvmf_check_ready) however
>> we should definitely accept requests if the queue is live.
>>
>> When entering controller reset, we transition the controller
>> into NVME_CTRL_RESETTING, and then return BLK_STS_RESOURCE for
>> non-mpath requests (have blk_noretry_request set).
>>
>> This is also the case for NVME_REQ_USER for the wrong reason.
>> There shouldn't be any reason for us to reject this I/O in a
>> controller reset. We do want to prevent passthru commands on
>> the admin queue because we need the controller to fully initialize
>> first before we let user passthru admin commands to be issued.
>>
>> In a non-mpath setup, this means that the requests will simply
>> be requeued over and over forever not allowing the q_usage_counter
>> to drop its final reference, causing controller reset to hang
>> if running concurrently with heavy I/O.
> 
> I've been looking at this trying to understand the real issues.
> 
> First issue: even if the check_ready checks pass, there's nothing that 
> says the transport won't return BLK_STS_RESOURCE for it's own reasons. 
> Maybe Tcp/rdma don't, but FC does today for cases of lost connectivity - 
> and a loss of connectivity will cause a Resetting state transition.

That's fine, as I said, the controller state should only advise us to
look further (i.e. queue_live). Not blindly fail these commands.

> I agree with
> +    if (rq->q == ctrl->admin_q && (req->flags & NVME_REQ_USERCMD))
> change as until we do have acceptance of a way to serialize/prioritize 
> Connect and initialization commands. We need to do this.
> 
> Which then leaves:
> The patch isn't changing any behavior when ! q->live - which occurs in 
> the latter steps of Resetting as well as most of the Connecting state. 
> So any I/O received while !q->live is still getting queued/retried.

Correct.

> So 
> what was the problem ?  It has to be limited in scope to either the 
> start of the Resetting state while q->live had yet to transition to 
> !live. But I don't see any freezing in this path.

When the queue is frozen, the q_usage_counter is killed, and waited
until it reaches to 0, and that will not happen if all the requests
that got the q_usage_counter reference are completed. So we cannot
get into a state where requests are requeued due to BLK_STS_RESOURCE,
as they still have the reference that need to be dropped.

> Looking at the original problem that lock ups are:
> 1) the sysfs write to do a reset - which is stalled waiting on a flush 
> of reset_work
> 2) reset_work is stalled on a call to blk_mq_update_nr_hw_queues() which 
> then does a blk_mq_freeze_queue_wait.

Right, and it blocks because we have commands that are requeued
indefinitely due to the fact that we refused to accept them while
we should have.

> The 2nd one is odd in my mind as the the update_nr_hw_queues is 
> something done while connecting and I wouldn't think the reset flush 
> must wait for a reconnect as well.

We cannot transition to LIVE and restart traffic before we update
the mappings to the new number of queues (controller suddenly allowed
less queues, or we got cpu hotplug or whatever).

   Although not the fix, I'd recommend
> to not call nvme_tcp_setup_ctrl() from reset_work and instead have it 
> schedule nvme_tcp_reconnect_ctrl_work().

What is the difference? we cannot resubmit traffic, hence we cannot
complete it, hence the freeze cannot complete.

As I see it what the driver needs to do in reset is:
1. start freeze - prevent from new requests from taking the q_usage_counter
2. quiesce - prevent new commands from entering queue_rq
3. stop and drain all transport queues
4. cancel all inflight commands (tagset_iter or other)
5. start transport queues again - allow only connect to pass
6. unquiesce the queues - to allow any requests that are stuck
    between (1) and (2) (blocked on quiesced but already have
    the q_usage_counter) to run and complete.

--> at this point anyone waiting for queue freeze should now be
     able to complete (e.g. update_nr_hw_queues).

7. wait for started freeze (paired to balance the freeze_counter)
8. unfreeze the queue - paired with (1)

> As tcp did change state to Connecting before making the 
> update_nr_hw_queues call, yet we didn't change the behavior of 
> rescheduling if Connecting and !q->live, what did the patch actually do 
> to help ?

As I said, we shouldn't have I/O commands that are requeued forever
preventing the freeze from completing. if we are CONNECTING, then
the queues are not started, so we shouldn't see any I/O commands
coming yet.

   The patch only changed the queuing behavior for Connecting
> and q->live.  So whether or not we hit it is dependent on the timing of 
> when a new io is received vs the call to hr_hw_queues.   The patch 
> didn't help this.

There are some supporting patches I sent for tcp and rdma (to operate
in the order I explained above).

I'll resend them as a set soon.

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

  reply index

Thread overview: 5+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-07-29  7:58 Sagi Grimberg
2020-07-29 19:34 ` James Smart
2020-07-29 22:09   ` Sagi Grimberg [this message]
2020-07-30  0:18     ` James Smart
2020-07-30  1:23       ` Sagi Grimberg

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=45901cfb-6dc7-d021-7151-bcbab1390139@grimberg.me \
    --to=sagi@grimberg.me \
    --cc=hch@lst.de \
    --cc=james.smart@broadcom.com \
    --cc=kbusch@kernel.org \
    --cc=linux-nvme@lists.infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Linux-NVME Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-nvme/0 linux-nvme/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-nvme linux-nvme/ https://lore.kernel.org/linux-nvme \
		linux-nvme@lists.infradead.org
	public-inbox-index linux-nvme

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.infradead.lists.linux-nvme


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git