All of lore.kernel.org
 help / color / mirror / Atom feed
From: Daniel Wagner <dwagner@suse.de>
To: linux-nvme@lists.infradead.org
Cc: linux-kernel@vger.kernel.org,
	James Smart <james.smart@broadcom.com>,
	Keith Busch <kbusch@kernel.org>, Ming Lei <ming.lei@redhat.com>,
	Sagi Grimberg <sagi@grimberg.me>, Hannes Reinecke <hare@suse.de>,
	Wen Xiong <wenxiong@us.ibm.com>,
	Himanshu Madhani <himanshu.madhani@oracle.com>
Subject: Re: [PATCH v5 0/3] Handle update hardware queues and queue freeze more carefully
Date: Fri, 20 Aug 2021 10:48:32 +0200	[thread overview]
Message-ID: <20210820084832.nlsbiztn26fv3b73@carbon.lan> (raw)
In-Reply-To: <20210818120530.130501-1-dwagner@suse.de>

On Wed, Aug 18, 2021 at 02:05:27PM +0200, Daniel Wagner wrote:
> I've dropped all non FC patches as they were bogus. I've retested this
> version with all combinations and all looks good now. Also I gave
> nvme-tcp a spin and again all is good.

I forgot to mention I also dropped the first three patches from v4.
Which seems to break her testing again.

Wendy reported all her tests pass with Ming's V7 of 'blk-mq: fix
blk_mq_alloc_request_hctx' and this series *only* if 'nvme-fc: Update
hardware queues before using them' from previous version is also used.

After starring at it once more, I think I finally understood the
problem. So when we do

        ret = nvme_fc_create_hw_io_queues(ctrl, ctrl->ctrl.sqsize + 1);
        if (ret)
                goto out_free_io_queues;

        ret = nvme_fc_connect_io_queues(ctrl, ctrl->ctrl.sqsize + 1);
        if (ret)
                goto out_delete_hw_queues;

and the number of queues has changed, the connect call will fail:

 nvme2: NVME-FC{2}: create association : host wwpn 0x100000109b5a4dfa rport wwpn 0x50050768101935e5: NQN "nqn.1986-03.com.ibm:nvme:2145.0000020420006CEA"
 nvme2: Connect command failed, error wo/DNR bit: -16389

and we stop the current reconnect attempt and reschedule a new
reconnect attempt:

 nvme2: NVME-FC{2}: reset: Reconnect attempt failed (-5)
 nvme2: NVME-FC{2}: Reconnect attempt in 2 seconds

Then we try to do the same thing again which fails, thus we never
make progress.

So clearly we need to update number of queues at one point. What would
be the right thing to do here? As I understood we need to be careful
with frozen requests. Can we abort them (is this even possible in this
state?) and requeue them before we update the queue numbers?

Daniel

WARNING: multiple messages have this Message-ID (diff)
From: Daniel Wagner <dwagner@suse.de>
To: linux-nvme@lists.infradead.org
Cc: linux-kernel@vger.kernel.org,
	James Smart <james.smart@broadcom.com>,
	Keith Busch <kbusch@kernel.org>, Ming Lei <ming.lei@redhat.com>,
	Sagi Grimberg <sagi@grimberg.me>, Hannes Reinecke <hare@suse.de>,
	Wen Xiong <wenxiong@us.ibm.com>,
	Himanshu Madhani <himanshu.madhani@oracle.com>
Subject: Re: [PATCH v5 0/3] Handle update hardware queues and queue freeze more carefully
Date: Fri, 20 Aug 2021 10:48:32 +0200	[thread overview]
Message-ID: <20210820084832.nlsbiztn26fv3b73@carbon.lan> (raw)
In-Reply-To: <20210818120530.130501-1-dwagner@suse.de>

On Wed, Aug 18, 2021 at 02:05:27PM +0200, Daniel Wagner wrote:
> I've dropped all non FC patches as they were bogus. I've retested this
> version with all combinations and all looks good now. Also I gave
> nvme-tcp a spin and again all is good.

I forgot to mention I also dropped the first three patches from v4.
Which seems to break her testing again.

Wendy reported all her tests pass with Ming's V7 of 'blk-mq: fix
blk_mq_alloc_request_hctx' and this series *only* if 'nvme-fc: Update
hardware queues before using them' from previous version is also used.

After starring at it once more, I think I finally understood the
problem. So when we do

        ret = nvme_fc_create_hw_io_queues(ctrl, ctrl->ctrl.sqsize + 1);
        if (ret)
                goto out_free_io_queues;

        ret = nvme_fc_connect_io_queues(ctrl, ctrl->ctrl.sqsize + 1);
        if (ret)
                goto out_delete_hw_queues;

and the number of queues has changed, the connect call will fail:

 nvme2: NVME-FC{2}: create association : host wwpn 0x100000109b5a4dfa rport wwpn 0x50050768101935e5: NQN "nqn.1986-03.com.ibm:nvme:2145.0000020420006CEA"
 nvme2: Connect command failed, error wo/DNR bit: -16389

and we stop the current reconnect attempt and reschedule a new
reconnect attempt:

 nvme2: NVME-FC{2}: reset: Reconnect attempt failed (-5)
 nvme2: NVME-FC{2}: Reconnect attempt in 2 seconds

Then we try to do the same thing again which fails, thus we never
make progress.

So clearly we need to update number of queues at one point. What would
be the right thing to do here? As I understood we need to be careful
with frozen requests. Can we abort them (is this even possible in this
state?) and requeue them before we update the queue numbers?

Daniel

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

  parent reply	other threads:[~2021-08-20  8:48 UTC|newest]

Thread overview: 22+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-08-18 12:05 [PATCH v5 0/3] Handle update hardware queues and queue freeze more carefully Daniel Wagner
2021-08-18 12:05 ` Daniel Wagner
2021-08-18 12:05 ` [PATCH v5 1/3] nvme-fc: Wait with a timeout for queue to freeze Daniel Wagner
2021-08-18 12:05   ` Daniel Wagner
2021-08-18 12:05 ` [PATCH v5 2/3] nvme-fc: avoid race between time out and tear down Daniel Wagner
2021-08-18 12:05   ` Daniel Wagner
2021-08-18 12:05 ` [PATCH v5 3/3] nvme-fc: fix controller reset hang during traffic Daniel Wagner
2021-08-18 12:05   ` Daniel Wagner
2021-08-18 12:21   ` Hannes Reinecke
2021-08-18 12:21     ` Hannes Reinecke
2021-08-20  8:48 ` Daniel Wagner [this message]
2021-08-20  8:48   ` [PATCH v5 0/3] Handle update hardware queues and queue freeze more carefully Daniel Wagner
2021-08-20 11:55   ` Daniel Wagner
2021-08-20 11:55     ` Daniel Wagner
2021-08-20 15:27     ` James Smart
2021-08-20 15:27       ` James Smart
2021-08-23  8:01 ` Christoph Hellwig
2021-08-23  8:01   ` Christoph Hellwig
2021-08-23  9:14   ` Daniel Wagner
2021-08-23  9:14     ` Daniel Wagner
2021-08-23  9:19     ` Christoph Hellwig
2021-08-23  9:19       ` Christoph Hellwig

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20210820084832.nlsbiztn26fv3b73@carbon.lan \
    --to=dwagner@suse.de \
    --cc=hare@suse.de \
    --cc=himanshu.madhani@oracle.com \
    --cc=james.smart@broadcom.com \
    --cc=kbusch@kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-nvme@lists.infradead.org \
    --cc=ming.lei@redhat.com \
    --cc=sagi@grimberg.me \
    --cc=wenxiong@us.ibm.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.