Linux-NVME Archive on
 help / color / Atom feed
From: Sagi Grimberg <>
To: James Smart <>,, Christoph Hellwig <>,
	Keith Busch <>
Subject: Re: [PATCH v2] nvme-fabrics: allow to queue requests for live queues
Date: Wed, 29 Jul 2020 18:23:15 -0700
Message-ID: <> (raw)
In-Reply-To: <>

>>> So what was the problem ?  It has to be limited in scope to either 
>>> the start of the Resetting state while q->live had yet to transition 
>>> to !live. But I don't see any freezing in this path.
>> When the queue is frozen, the q_usage_counter is killed, and waited
>> until it reaches to 0, and that will not happen if all the requests
>> that got the q_usage_counter reference are completed. So we cannot
>> get into a state where requests are requeued due to BLK_STS_RESOURCE,
>> as they still have the reference that need to be dropped.
>>> Looking at the original problem that lock ups are:
>>> 1) the sysfs write to do a reset - which is stalled waiting on a 
>>> flush of reset_work
>>> 2) reset_work is stalled on a call to blk_mq_update_nr_hw_queues() 
>>> which then does a blk_mq_freeze_queue_wait.
>> Right, and it blocks because we have commands that are requeued
>> indefinitely due to the fact that we refused to accept them while
>> we should have.
> I understand what you are saying, but.
> Invariably, the queues are unquiesced to keep connect going through (all 
> queues)

No, the connect has its own connect_q, which is not quiesced, so
the connect passes thru and I/O commands are kept quiesced.

> and more (admin queue),

For the admin queue we have the temporary w.a which still needs to
be fixed.

> and as new io will come in that time, we 
> have no recourse but to bounce the io as the q isn't live. And it has to 
> be a resource status so it is requeued as we can't fail the io as the 
> failure, in a non-mpath scenario, will go all the way back to the caller 
> which is bad news.  So we're in a catch-22.

See above, for I/O the issue is solved, for admin we need a similar

> That was why I suggested perhaps we needed to make mpath always be 
> involved, even for a single-path device (non-mpath) so that it can catch 
> and queue at it's level (as if all paths are gone), for a short time 
> period, rather than letting the error go back to the issuer.

This needs to work also for non-mpath devices. I don't think that we
should force mpath.

>>> The 2nd one is odd in my mind as the the update_nr_hw_queues is 
>>> something done while connecting and I wouldn't think the reset flush 
>>> must wait for a reconnect as well.
>> We cannot transition to LIVE and restart traffic before we update
>> the mappings to the new number of queues (controller suddenly allowed
>> less queues, or we got cpu hotplug or whatever).
> agree
>>   Although not the fix, I'd recommend
>>> to not call nvme_tcp_setup_ctrl() from reset_work and instead have it 
>>> schedule nvme_tcp_reconnect_ctrl_work().
>> What is the difference?
> it didn't matter for the issue we're really trying to solve. Just 
> something that looked odd.

I don't necessarily think so, this update_nr_hw_queues still needs
to happen before the we allow new I/O in, regardless of the context.

>> As I see it what the driver needs to do in reset is:
>> 1. start freeze - prevent from new requests from taking the 
>> q_usage_counter
>> 2. quiesce - prevent new commands from entering queue_rq
>> 3. stop and drain all transport queues
>> 4. cancel all inflight commands (tagset_iter or other)
>> 5. start transport queues again - allow only connect to pass
> augmented for admin-q to allow other controller init commands as well.


>> 6. unquiesce the queues - to allow any requests that are stuck
>>    between (1) and (2) (blocked on quiesced but already have
>>    the q_usage_counter) to run and complete.
>> --> at this point anyone waiting for queue freeze should now be
>>     able to complete (e.g. update_nr_hw_queues).
>> 7. wait for started freeze (paired to balance the freeze_counter)
>> 8. unfreeze the queue - paired with (1)
> ok. But you never stated where in these steps the transport queue is no 
> longer live.

The transport queues are not alive after the queues were frozen and

>  I think you're saying (6) will process the io, not abort 
> it like the "drain" step, so it can run to completion on the link. 
> That's ok for a user-initiated reset, but if the reset is due to a 
> link-side error such as the FC case, the queues aren't live and the only 
> way to complete the io is to fail it with some error. And for non-mpath, 
> that error may surface back to the issuer.

error recovery due to link-side errors also exist in tcp and rdma. step
(6) happens only after we successfully recreated the I/O. for the link
side errors, queues are unquiesced immediately.

But I think we do have an issue here. If the link side failed exactly
when we connected, we are allowed commands to pass through (step 6)
, sent to the controller that is now unresponsive (in the middle of a
reset/reconnect), we don't have any way to fail them because even if
we do fail them in timeout, the core will retry and will be freeze
will still not complete.

Perhaps Keith can help,

How does that work for pci?
Assume that you are in the middle of a reset, with I/O inflight, you
got the controller established and them all of a sudden, it stopped
responding. I see that the timeout handler kicks in and you delete
the controller (forcing requests to fail). That is not a good
option for fabrics, because we want to stick around for when
the controller comes back online (sometime in the future).

>>> As tcp did change state to Connecting before making the 
>>> update_nr_hw_queues call, yet we didn't change the behavior of 
>>> rescheduling if Connecting and !q->live, what did the patch actually 
>>> do to help ?
>> As I said, we shouldn't have I/O commands that are requeued forever
>> preventing the freeze from completing. if we are CONNECTING, then
>> the queues are not started, so we shouldn't see any I/O commands
>> coming yet.
> well - ok, and we're back to why they are started. Let's say we're not 
> worred about connect ordering

connect is ordered for I/O, we should fix it for admin as well.

  - another reason for the continued running
> is so that during an extended reconnect period that may take a couple of 
> retries, we can have a fast_io_fail timeout to start failing those 
> rescheduled normal ios. even then - fast_io_fail should have been with 
> mpath ios so they already should have been fine. Any fast_io_fail would 
> still run the risk of returning an error to the issuer.

Not sure I follow your point...

Linux-nvme mailing list

      reply index

Thread overview: 5+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-07-29  7:58 Sagi Grimberg
2020-07-29 19:34 ` James Smart
2020-07-29 22:09   ` Sagi Grimberg
2020-07-30  0:18     ` James Smart
2020-07-30  1:23       ` Sagi Grimberg [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \ \ \ \ \ \ \

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Linux-NVME Archive on

Archives are clonable:
	git clone --mirror linux-nvme/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-nvme linux-nvme/ \
	public-inbox-index linux-nvme

Example config snippet for mirrors

Newsgroup available over NNTP:

AGPL code for this site: git clone