Re: [PATCH v5 0/3] Handle update hardware queues and queue freeze more carefully

From: Daniel Wagner <dwagner@suse.de>
To: linux-nvme@lists.infradead.org
Cc: linux-kernel@vger.kernel.org,
	James Smart <james.smart@broadcom.com>,
	Keith Busch <kbusch@kernel.org>, Ming Lei <ming.lei@redhat.com>,
	Sagi Grimberg <sagi@grimberg.me>, Hannes Reinecke <hare@suse.de>,
	Wen Xiong <wenxiong@us.ibm.com>,
	Himanshu Madhani <himanshu.madhani@oracle.com>
Subject: Re: [PATCH v5 0/3] Handle update hardware queues and queue freeze more carefully
Date: Fri, 20 Aug 2021 10:48:32 +0200
Message-ID: <20210820084832.nlsbiztn26fv3b73@carbon.lan>
In-Reply-To: <20210818120530.130501-1-dwagner@suse.de>

On Wed, Aug 18, 2021 at 02:05:27PM +0200, Daniel Wagner wrote:
> I've dropped all non FC patches as they were bogus. I've retested this
> version with all combinations and all looks good now. Also I gave
> nvme-tcp a spin and again all is good.

I forgot to mention that I also dropped the first three patches from v4,
which seems to break Wendy's testing again.

Wendy reported that all her tests pass with Ming's V7 of 'blk-mq: fix
blk_mq_alloc_request_hctx' and this series *only* if 'nvme-fc: Update
hardware queues before using them' from the previous version is also used.

After staring at it once more, I think I finally understood the
problem. So when we do

        ret = nvme_fc_create_hw_io_queues(ctrl, ctrl->ctrl.sqsize + 1);
        if (ret)
                goto out_free_io_queues;

        ret = nvme_fc_connect_io_queues(ctrl, ctrl->ctrl.sqsize + 1);
        if (ret)
                goto out_delete_hw_queues;

and the number of queues has changed, the connect call will fail:

 nvme2: NVME-FC{2}: create association : host wwpn 0x100000109b5a4dfa rport wwpn 0x50050768101935e5: NQN "nqn.1986-03.com.ibm:nvme:2145.0000020420006CEA"
 nvme2: Connect command failed, error wo/DNR bit: -16389

and we stop the current reconnect attempt and reschedule a new
reconnect attempt:

 nvme2: NVME-FC{2}: reset: Reconnect attempt failed (-5)
 nvme2: NVME-FC{2}: Reconnect attempt in 2 seconds

Then we try to do the same thing again, which fails again, so we never
make progress.
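
To make the loop concrete, here is a toy model in plain C. It is not
kernel code: struct toy_ctrl, toy_connect_io_queues() and toy_reconnect()
are made-up names, and the rule that connect fails on any queue count
mismatch is a simplifying assumption. It only shows that retrying with an
unchanged queue count can never succeed, while refreshing the count before
connecting lets the first attempt through.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only; every name below is hypothetical, not a kernel symbol. */
struct toy_ctrl {
	unsigned int host_queues;   /* queue count the host still uses */
	unsigned int target_queues; /* queue count the target grants now */
};

/*
 * Assumption for this sketch: connect fails whenever the host's queue
 * count no longer matches what the target grants.
 */
static bool toy_connect_io_queues(const struct toy_ctrl *c)
{
	return c->host_queues == c->target_queues;
}

/*
 * Returns the number of the attempt that succeeded, or 0 if all
 * max_attempts attempts failed.
 */
static unsigned int toy_reconnect(struct toy_ctrl *c,
				  unsigned int max_attempts,
				  bool update_before_connect)
{
	assert(max_attempts > 0);

	for (unsigned int attempt = 1; attempt <= max_attempts; attempt++) {
		/* Optionally refresh the host's view before connecting. */
		if (update_before_connect)
			c->host_queues = c->target_queues;

		if (toy_connect_io_queues(c))
			return attempt;

		/* Nothing changed on failure, so the next try is identical. */
	}

	return 0;
}
```

With host_queues = 8 and target_queues = 4, toy_reconnect(&c, 5, false)
returns 0 because every attempt fails the same way, whereas
toy_reconnect(&c, 5, true) returns 1.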

So clearly we need to update the number of queues at some point. What would
be the right thing to do here? As I understand it, we need to be careful
with frozen requests. Can we abort them (is this even possible in this
state?) and requeue them before we update the queue numbers?

Daniel