* [PATCH 0/3] improve nvme quiesce time for large amount of namespaces
@ 2022-07-29  7:39 Chao Leng
  2022-07-29  7:39 ` [PATCH 1/3] blk-mq: delete unnecessary comments Chao Leng
                   ` (3 more replies)
  0 siblings, 4 replies; 15+ messages in thread
From: Chao Leng @ 2022-07-29  7:39 UTC (permalink / raw)
  To: linux-nvme, linux-block; +Cc: hch, sagi, kbusch, axboe, lengchao

This set improves the quiesce time when using a large set of
namespaces, which also improves I/O failover time in a multipath
environment.

Chao Leng (3):
  blk-mq: delete unnecessary comments
  nvme: improve the quiesce time for non blocking transports
  nvme: improve the quiesce time for blocking transports

 block/blk-mq.c           |  4 ----
 drivers/nvme/host/core.c | 60 +++++++++++++++++++++++++++++++++++++++++-------
 2 files changed, 52 insertions(+), 12 deletions(-)

-- 
2.16.4


^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH 1/3] blk-mq: delete unnecessary comments
  2022-07-29  7:39 [PATCH 0/3] improve nvme quiesce time for large amount of namespaces Chao Leng
@ 2022-07-29  7:39 ` Chao Leng
  2022-07-29  7:39 ` [PATCH 2/3] nvme: improve the quiesce time for non blocking transports Chao Leng
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 15+ messages in thread
From: Chao Leng @ 2022-07-29  7:39 UTC (permalink / raw)
  To: linux-nvme, linux-block; +Cc: hch, sagi, kbusch, axboe, lengchao

To improve the quiesce time when using a large set of namespaces, nvme
needs to call blk_mq_quiesce_queue_nowait directly.
The FIXME comment above blk_mq_quiesce_queue_nowait, which asks for the
function to be removed, is therefore obsolete, so delete it.

Signed-off-by: Chao Leng <lengchao@huawei.com>
---
 block/blk-mq.c | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 92aae03103b7..6b525640009c 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -238,10 +238,6 @@ void blk_mq_unfreeze_queue(struct request_queue *q)
 }
 EXPORT_SYMBOL_GPL(blk_mq_unfreeze_queue);
 
-/*
- * FIXME: replace the scsi_internal_device_*block_nowait() calls in the
- * mpt3sas driver such that this function can be removed.
- */
 void blk_mq_quiesce_queue_nowait(struct request_queue *q)
 {
 	unsigned long flags;
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 2/3] nvme: improve the quiesce time for non blocking transports
  2022-07-29  7:39 [PATCH 0/3] improve nvme quiesce time for large amount of namespaces Chao Leng
  2022-07-29  7:39 ` [PATCH 1/3] blk-mq: delete unnecessary comments Chao Leng
@ 2022-07-29  7:39 ` Chao Leng
  2022-07-29  7:39 ` [PATCH 3/3] nvme: improve the quiesce time for " Chao Leng
  2022-07-29 14:26 ` [PATCH 0/3] improve nvme quiesce time for large amount of namespaces Christoph Hellwig
  3 siblings, 0 replies; 15+ messages in thread
From: Chao Leng @ 2022-07-29  7:39 UTC (permalink / raw)
  To: linux-nvme, linux-block; +Cc: hch, sagi, kbusch, axboe, lengchao

When working with nvme multipathing, if one path fails, its requests
are canceled and retried on another path. Before canceling the
requests, nvme_stop_queues quiesces the queues of all namespaces.
Currently it quiesces them one by one, so with a large set of
namespaces this takes a long time: every synchronize_rcu call may need
more than 10 milliseconds, and the waits add up.
Instead, quiesce all queues in parallel; it is then safe to call
synchronize_rcu just once. This reduces the quiesce time, and thus the
I/O failover time, to a single synchronize_rcu period.

Signed-off-by: Chao Leng <lengchao@huawei.com>
---
 drivers/nvme/host/core.c | 34 ++++++++++++++++++++++++++++++----
 1 file changed, 30 insertions(+), 4 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index eabffbc708cd..fcfa27e1078a 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -4924,6 +4924,30 @@ static void nvme_stop_ns_queue(struct nvme_ns *ns)
 		blk_mq_wait_quiesce_done(ns->queue);
 }
 
+static void nvme_stop_ns_queue_nosync(struct nvme_ns *ns)
+{
+	if (!test_and_set_bit(NVME_NS_STOPPED, &ns->flags))
+		blk_mq_quiesce_queue_nowait(ns->queue);
+}
+
+static void nvme_stop_blocking_queues(struct nvme_ctrl *ctrl)
+{
+	struct nvme_ns *ns;
+
+	list_for_each_entry(ns, &ctrl->namespaces, list)
+		nvme_stop_ns_queue(ns);
+}
+
+static void nvme_stop_nonblocking_queues(struct nvme_ctrl *ctrl)
+{
+	struct nvme_ns *ns;
+
+	list_for_each_entry(ns, &ctrl->namespaces, list)
+		nvme_stop_ns_queue_nosync(ns);
+
+	synchronize_rcu();
+}
+
 /*
  * Prepare a queue for teardown.
  *
@@ -5017,11 +5041,13 @@ EXPORT_SYMBOL_GPL(nvme_start_freeze);
 
 void nvme_stop_queues(struct nvme_ctrl *ctrl)
 {
-	struct nvme_ns *ns;
-
 	down_read(&ctrl->namespaces_rwsem);
-	list_for_each_entry(ns, &ctrl->namespaces, list)
-		nvme_stop_ns_queue(ns);
+
+	if (ctrl->tagset->flags & BLK_MQ_F_BLOCKING)
+		nvme_stop_blocking_queues(ctrl);
+	else
+		nvme_stop_nonblocking_queues(ctrl);
+
 	up_read(&ctrl->namespaces_rwsem);
 }
 EXPORT_SYMBOL_GPL(nvme_stop_queues);
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 3/3] nvme: improve the quiesce time for blocking transports
  2022-07-29  7:39 [PATCH 0/3] improve nvme quiesce time for large amount of namespaces Chao Leng
  2022-07-29  7:39 ` [PATCH 1/3] blk-mq: delete unnecessary comments Chao Leng
  2022-07-29  7:39 ` [PATCH 2/3] nvme: improve the quiesce time for non blocking transports Chao Leng
@ 2022-07-29  7:39 ` Chao Leng
  2022-07-29 14:26 ` [PATCH 0/3] improve nvme quiesce time for large amount of namespaces Christoph Hellwig
  3 siblings, 0 replies; 15+ messages in thread
From: Chao Leng @ 2022-07-29  7:39 UTC (permalink / raw)
  To: linux-nvme, linux-block; +Cc: hch, sagi, kbusch, axboe, lengchao

When working with nvme multipathing, if one path fails, its requests
are canceled and retried on another path. Before canceling the
requests, nvme_stop_queues quiesces the queues of all namespaces.
Currently it quiesces them one by one, so with a large set of
namespaces this takes a long time: every synchronize_srcu call may need
more than 10 milliseconds, and the waits add up.
Instead, quiesce all queues in parallel, start the SRCU synchronization
for every queue, and then wait for all of them to complete. This
largely reduces the quiesce time and the I/O failover time.

Signed-off-by: Chao Leng <lengchao@huawei.com>
---
 drivers/nvme/host/core.c | 38 ++++++++++++++++++++++++++++----------
 1 file changed, 28 insertions(+), 10 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index fcfa27e1078a..937d5802d41b 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -21,6 +21,7 @@
 #include <linux/nvme_ioctl.h>
 #include <linux/pm_qos.h>
 #include <asm/unaligned.h>
+#include <linux/rcupdate_wait.h>
 
 #include "nvme.h"
 #include "fabrics.h"
@@ -4916,14 +4917,6 @@ static void nvme_start_ns_queue(struct nvme_ns *ns)
 		blk_mq_unquiesce_queue(ns->queue);
 }
 
-static void nvme_stop_ns_queue(struct nvme_ns *ns)
-{
-	if (!test_and_set_bit(NVME_NS_STOPPED, &ns->flags))
-		blk_mq_quiesce_queue(ns->queue);
-	else
-		blk_mq_wait_quiesce_done(ns->queue);
-}
-
 static void nvme_stop_ns_queue_nosync(struct nvme_ns *ns)
 {
 	if (!test_and_set_bit(NVME_NS_STOPPED, &ns->flags))
@@ -4933,9 +4926,34 @@ static void nvme_stop_ns_queue_nosync(struct nvme_ns *ns)
 static void nvme_stop_blocking_queues(struct nvme_ctrl *ctrl)
 {
 	struct nvme_ns *ns;
+	int i, count;
+	struct rcu_synchronize *rcu;
 
-	list_for_each_entry(ns, &ctrl->namespaces, list)
-		nvme_stop_ns_queue(ns);
+	count = 0;
+	list_for_each_entry(ns, &ctrl->namespaces, list) {
+		nvme_stop_ns_queue_nosync(ns);
+		count++;
+	}
+
+	rcu = kvmalloc(count * sizeof(*rcu), GFP_KERNEL);
+	if (rcu) {
+		i = 0;
+		list_for_each_entry(ns, &ctrl->namespaces, list) {
+			init_rcu_head(&rcu[i].head);
+			init_completion(&rcu[i].completion);
+			call_srcu(ns->queue->srcu, &rcu[i].head, wakeme_after_rcu);
+			i++;
+		}
+
+		for (i = 0; i < count; i++) {
+			wait_for_completion(&rcu[i].completion);
+			destroy_rcu_head(&rcu[i].head);
+		}
+		kvfree(rcu);
+	} else {
+		list_for_each_entry(ns, &ctrl->namespaces, list)
+			blk_mq_wait_quiesce_done(ns->queue);
+	}
 }
 
 static void nvme_stop_nonblocking_queues(struct nvme_ctrl *ctrl)
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH 0/3] improve nvme quiesce time for large amount of namespaces
  2022-07-29  7:39 [PATCH 0/3] improve nvme quiesce time for large amount of namespaces Chao Leng
                   ` (2 preceding siblings ...)
  2022-07-29  7:39 ` [PATCH 3/3] nvme: improve the quiesce time for " Chao Leng
@ 2022-07-29 14:26 ` Christoph Hellwig
  2022-07-30  0:39   ` Chao Leng
  3 siblings, 1 reply; 15+ messages in thread
From: Christoph Hellwig @ 2022-07-29 14:26 UTC (permalink / raw)
  To: Chao Leng; +Cc: linux-nvme, linux-block, hch, sagi, kbusch, axboe

Why can't we have a per-tagset quiesce flag and just wait for the
one?  That also really nicely supports the problem with changes in
the namespace list during that time.
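
For illustration, a tagset-wide quiesce along these lines might look
like the sketch below (the function names and the "skip" helper are
illustrative only, not from this series):

```c
/*
 * Sketch only: quiesce every queue sharing a tagset, issuing the
 * nowait variant per queue and paying for a single grace-period wait
 * at the end. blk_mq_wait_quiesce_done_tagset() and
 * blk_queue_skip_tagset_quiesce() are hypothetical helpers here.
 */
void blk_mq_quiesce_tagset(struct blk_mq_tag_set *set)
{
	struct request_queue *q;

	mutex_lock(&set->tag_list_lock);
	list_for_each_entry(q, &set->tag_list, tag_set_list) {
		/* queues opted out of tagset quiesce are skipped */
		if (!blk_queue_skip_tagset_quiesce(q))
			blk_mq_quiesce_queue_nowait(q);
	}
	/* one synchronize for the whole tagset instead of one per queue */
	blk_mq_wait_quiesce_done_tagset(set);
	mutex_unlock(&set->tag_list_lock);
}
```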

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 0/3] improve nvme quiesce time for large amount of namespaces
  2022-07-29 14:26 ` [PATCH 0/3] improve nvme quiesce time for large amount of namespaces Christoph Hellwig
@ 2022-07-30  0:39   ` Chao Leng
  2022-07-31 10:23     ` Sagi Grimberg
  0 siblings, 1 reply; 15+ messages in thread
From: Chao Leng @ 2022-07-30  0:39 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-nvme, linux-block, sagi, kbusch, axboe



On 2022/7/29 22:26, Christoph Hellwig wrote:
> Why can't we have a per-tagset quiesce flag and just wait for the
> one?  That also really nicely supports the problem with changes in
> the namespace list during that time.
Because if we quiesce queues based on the tagset, it is difficult to
exclude the non-I/O queues. The I/O queues are handled differently
from the other queues such as fabrics_q, admin_q, etc., which may cause
confusion in the code logic.
> 
> .
> 

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 0/3] improve nvme quiesce time for large amount of namespaces
  2022-07-30  0:39   ` Chao Leng
@ 2022-07-31 10:23     ` Sagi Grimberg
  2022-08-01  1:45       ` Chao Leng
  2022-08-02 13:38       ` Christoph Hellwig
  0 siblings, 2 replies; 15+ messages in thread
From: Sagi Grimberg @ 2022-07-31 10:23 UTC (permalink / raw)
  To: Chao Leng, Christoph Hellwig; +Cc: linux-nvme, linux-block, kbusch, axboe


>> Why can't we have a per-tagset quiesce flag and just wait for the
>> one?  That also really nicely supports the problem with changes in
>> the namespace list during that time.
> Because If quiesce queues based on tagset, it is difficult to
> distinguish non-IO queues. The I/O queues process is different
> from other queues such as fabrics_q, admin_q, etc, which may cause
> confusion in the code logic.

It is primarily the connect_q, where we issue the io queue connect...
We should not quiesce the connect_q in nvme_stop_queues() as that
relates only to namespace queues.

In the last attempt to do a tagset flag, we ended up having to do
something like:
--
void nvme_stop_queues(struct nvme_ctrl *ctrl)
{
	blk_mq_quiesce_tagset(ctrl->tagset);
	if (ctrl->connect_q)
		blk_mq_unquiesce_queue(ctrl->connect_q);
}
EXPORT_SYMBOL_GPL(nvme_stop_queues);
--

But maybe we can avoid that: because we allocate
the connect_q ourselves, and fully know that it should
not be a part of the tagset quiesce, perhaps we can introduce
a new interface like:
--
static inline int nvme_ctrl_init_connect_q(struct nvme_ctrl *ctrl)
{
	ctrl->connect_q = blk_mq_init_queue_self_quiesce(ctrl->tagset);
	if (IS_ERR(ctrl->connect_q))
		return PTR_ERR(ctrl->connect_q);
	return 0;
}
--

And then blk_mq_quiesce_tagset can simply look into a per request-queue
self_quiesce flag and skip as needed.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 0/3] improve nvme quiesce time for large amount of namespaces
  2022-07-31 10:23     ` Sagi Grimberg
@ 2022-08-01  1:45       ` Chao Leng
  2022-08-02 13:38       ` Christoph Hellwig
  1 sibling, 0 replies; 15+ messages in thread
From: Chao Leng @ 2022-08-01  1:45 UTC (permalink / raw)
  To: Sagi Grimberg, Christoph Hellwig; +Cc: linux-nvme, linux-block, kbusch, axboe



On 2022/7/31 18:23, Sagi Grimberg wrote:
> 
>>> Why can't we have a per-tagset quiesce flag and just wait for the
>>> one?  That also really nicely supports the problem with changes in
>>> the namespace list during that time.
>> Because If quiesce queues based on tagset, it is difficult to
>> distinguish non-IO queues. The I/O queues process is different
>> from other queues such as fabrics_q, admin_q, etc, which may cause
>> confusion in the code logic.
> 
> It is primarily the connect_q where we issue io queue connect...
> We should not quiesce the connect_q in nvme_stop_queues() as that
> relates to only namespaces queues.
Although we can do special processing for connect_q, fabrics_q and
admin_q, this results in redundant semantics being implemented in
nvme_xxx_teardown_io_queues, and these actions are confusing for
nvme_xxx_teardown_admin_queue. It doesn't look clean.
Therefore, I think quiescing queues based on namespaces is the better
option. In addition, I do not see the benefit of quiescing queues
based on the tagset.
> 
> In the last attempt to do a tagset flag, we ended up having to do
> something like:
> -- 
> void nvme_stop_queues(struct nvme_ctrl *ctrl)
> {
>      blk_mq_quiesce_tagset(ctrl->tagset);
>      if (ctrl->connect_q)
>          blk_mq_unquiesce_queue(ctrl->connect_q);
> }
> EXPORT_SYMBOL_GPL(nvme_stop_queues);
> -- 
> 
> But maybe we can avoid that, and because we allocate
> the connect_q ourselves, and fully know that it should
> not be apart of the tagset quiesce, perhaps we can introduce
> a new interface like:
> -- 
> static inline int nvme_ctrl_init_connect_q(struct nvme_ctrl *ctrl)
> {
>      ctrl->connect_q = blk_mq_init_queue_self_quiesce(ctrl->tagset);
>      if (IS_ERR(ctrl->connect_q))
>          return PTR_ERR(ctrl->connect_q);
>      return 0;
> }
> -- 
> 
> And then blk_mq_quiesce_tagset can simply look into a per request-queue
> self_quiesce flag and skip as needed.
> .

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 0/3] improve nvme quiesce time for large amount of namespaces
  2022-07-31 10:23     ` Sagi Grimberg
  2022-08-01  1:45       ` Chao Leng
@ 2022-08-02 13:38       ` Christoph Hellwig
  2022-10-10  8:46         ` Chao Leng
  1 sibling, 1 reply; 15+ messages in thread
From: Christoph Hellwig @ 2022-08-02 13:38 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Chao Leng, Christoph Hellwig, linux-nvme, linux-block, kbusch, axboe

On Sun, Jul 31, 2022 at 01:23:36PM +0300, Sagi Grimberg wrote:
> But maybe we can avoid that, and because we allocate
> the connect_q ourselves, and fully know that it should
> not be apart of the tagset quiesce, perhaps we can introduce
> a new interface like:
> --
> static inline int nvme_ctrl_init_connect_q(struct nvme_ctrl *ctrl)
> {
> 	ctrl->connect_q = blk_mq_init_queue_self_quiesce(ctrl->tagset);
> 	if (IS_ERR(ctrl->connect_q))
> 		return PTR_ERR(ctrl->connect_q);
> 	return 0;
> }
> --
>
> And then blk_mq_quiesce_tagset can simply look into a per request-queue
> self_quiesce flag and skip as needed.

I'd just make that a queue flag set after allocation to keep the
interface simple, but otherwise this seems like the right thing
to do.
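
A rough sketch of that simpler variant (the flag name and the queue
setup are illustrative, not from this thread):

```c
/*
 * Sketch only: keep the plain queue allocation and mark connect_q
 * afterwards so a tagset-wide quiesce would skip it. The flag name
 * QUEUE_FLAG_SKIP_TAGSET_QUIESCE is hypothetical here.
 */
static inline int nvme_ctrl_init_connect_q(struct nvme_ctrl *ctrl)
{
	ctrl->connect_q = blk_mq_init_queue(ctrl->tagset);
	if (IS_ERR(ctrl->connect_q))
		return PTR_ERR(ctrl->connect_q);
	blk_queue_flag_set(QUEUE_FLAG_SKIP_TAGSET_QUIESCE, ctrl->connect_q);
	return 0;
}
```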

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 0/3] improve nvme quiesce time for large amount of namespaces
  2022-08-02 13:38       ` Christoph Hellwig
@ 2022-10-10  8:46         ` Chao Leng
  2022-10-12  6:37           ` Sagi Grimberg
  0 siblings, 1 reply; 15+ messages in thread
From: Chao Leng @ 2022-10-10  8:46 UTC (permalink / raw)
  To: Christoph Hellwig, Sagi Grimberg; +Cc: linux-nvme, linux-block, kbusch, axboe



On 2022/8/2 21:38, Christoph Hellwig wrote:
> On Sun, Jul 31, 2022 at 01:23:36PM +0300, Sagi Grimberg wrote:
>> But maybe we can avoid that, and because we allocate
>> the connect_q ourselves, and fully know that it should
>> not be apart of the tagset quiesce, perhaps we can introduce
>> a new interface like:
>> --
>> static inline int nvme_ctrl_init_connect_q(struct nvme_ctrl *ctrl)
>> {
>> 	ctrl->connect_q = blk_mq_init_queue_self_quiesce(ctrl->tagset);
>> 	if (IS_ERR(ctrl->connect_q))
>> 		return PTR_ERR(ctrl->connect_q);
>> 	return 0;
>> }
>> --
>>
>> And then blk_mq_quiesce_tagset can simply look into a per request-queue
>> self_quiesce flag and skip as needed.
> 
> I'd just make that a queue flag set after allocation to keep the
> interface simple, but otherwise this seems like the right thing
> to do.
The code currently uses NVME_NS_STOPPED to avoid unpaired stop/start.
If we use blk_mq_quiesce_tagset, that mechanism no longer works.
Reviewing the code, only PCI cannot ensure that stop/start are safely
paired. So one choice is to use blk_mq_quiesce_tagset only on fabrics,
not on PCI. Do you think that's acceptable?
If so, I will try to send a patch set.
> 
> .
> 

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 0/3] improve nvme quiesce time for large amount of namespaces
  2022-10-10  8:46         ` Chao Leng
@ 2022-10-12  6:37           ` Sagi Grimberg
  2022-10-12  8:43             ` Chao Leng
  0 siblings, 1 reply; 15+ messages in thread
From: Sagi Grimberg @ 2022-10-12  6:37 UTC (permalink / raw)
  To: Chao Leng, Christoph Hellwig; +Cc: linux-nvme, linux-block, kbusch, axboe


>> On Sun, Jul 31, 2022 at 01:23:36PM +0300, Sagi Grimberg wrote:
>>> But maybe we can avoid that, and because we allocate
>>> the connect_q ourselves, and fully know that it should
>>> not be apart of the tagset quiesce, perhaps we can introduce
>>> a new interface like:
>>> -- 
>>> static inline int nvme_ctrl_init_connect_q(struct nvme_ctrl *ctrl)
>>> {
>>>     ctrl->connect_q = blk_mq_init_queue_self_quiesce(ctrl->tagset);
>>>     if (IS_ERR(ctrl->connect_q))
>>>         return PTR_ERR(ctrl->connect_q);
>>>     return 0;
>>> }
>>> -- 
>>>
>>> And then blk_mq_quiesce_tagset can simply look into a per request-queue
>>> self_quiesce flag and skip as needed.
>>
>> I'd just make that a queue flag set after allocation to keep the
>> interface simple, but otherwise this seems like the right thing
>> to do.
> Now the code used NVME_NS_STOPPED to avoid unpaired stop/start.
> If we use blk_mq_quiesce_tagset, It will cause the above mechanism to fail.
> I review the code, only pci can not ensure secure stop/start pairing.
> So there is a choice, We only use blk_mq_quiesce_tagset on fabrics, not 
> PCI.
> Do you think that's acceptable?
> If that's acceptable, I will try to send a patch set.

I don't think that this is acceptable. But I don't understand how
NVME_NS_STOPPED would change anything in the behavior of tagset-wide
quiesce?

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 0/3] improve nvme quiesce time for large amount of namespaces
  2022-10-12  6:37           ` Sagi Grimberg
@ 2022-10-12  8:43             ` Chao Leng
  2022-10-12 11:13               ` Sagi Grimberg
  0 siblings, 1 reply; 15+ messages in thread
From: Chao Leng @ 2022-10-12  8:43 UTC (permalink / raw)
  To: Sagi Grimberg, Christoph Hellwig, Ming Lei
  Cc: linux-nvme, linux-block, kbusch, axboe

Add Ming Lei.

On 2022/10/12 14:37, Sagi Grimberg wrote:
> 
>>> On Sun, Jul 31, 2022 at 01:23:36PM +0300, Sagi Grimberg wrote:
>>>> But maybe we can avoid that, and because we allocate
>>>> the connect_q ourselves, and fully know that it should
>>>> not be apart of the tagset quiesce, perhaps we can introduce
>>>> a new interface like:
>>>> -- 
>>>> static inline int nvme_ctrl_init_connect_q(struct nvme_ctrl *ctrl)
>>>> {
>>>>     ctrl->connect_q = blk_mq_init_queue_self_quiesce(ctrl->tagset);
>>>>     if (IS_ERR(ctrl->connect_q))
>>>>         return PTR_ERR(ctrl->connect_q);
>>>>     return 0;
>>>> }
>>>> -- 
>>>>
>>>> And then blk_mq_quiesce_tagset can simply look into a per request-queue
>>>> self_quiesce flag and skip as needed.
>>>
>>> I'd just make that a queue flag set after allocation to keep the
>>> interface simple, but otherwise this seems like the right thing
>>> to do.
>> Now the code used NVME_NS_STOPPED to avoid unpaired stop/start.
>> If we use blk_mq_quiesce_tagset, It will cause the above mechanism to fail.
>> I review the code, only pci can not ensure secure stop/start pairing.
>> So there is a choice, We only use blk_mq_quiesce_tagset on fabrics, not PCI.
>> Do you think that's acceptable?
>> If that's acceptable, I will try to send a patch set.
> 
> I don't think that this is acceptable. But I don't understand how
> NVME_NS_STOPPED would change anything in the behavior of tagset-wide
> quiesce?
If we use blk_mq_quiesce_tagset, it quiesces the queues of all
namespaces but cannot set NVME_NS_STOPPED for each of them, so the
NVME_NS_STOPPED mechanism is invalidated.
NVMe-pci has a very complicated quiesce/unquiesce use pattern;
quiesce/unquiesce may be called unpaired.
Losing the flag would be a step backward. For example, this scenario
may be buggy:
Thread A: quiesces the queue.
Thread B: quiesces the queue.
Thread A finishes, and does not unquiesce the queue.
Thread B: unquiesces the queue, then does something that requires the
queue to be unquiesced.

Of course, I don't think guaranteeing paired access through
NVME_NS_STOPPED is a good choice either: it still allows an unexpected
unquiesce that starts the queue too early. But now that the code
behaves this way, regressing it would be unacceptable, for example in
this scenario:
Thread A: quiesces the queue.
Thread B: wants to quiesce the queue but does nothing because
NVME_NS_STOPPED is already set.
Thread A: unquiesces the queue. Now the queue is unquiesced too early
for thread B.
Thread B: does something that requires the queue to be quiesced.

Introduce NVME_NS_STOPPED link:
https://lore.kernel.org/all/20211014081710.1871747-5-ming.lei@redhat.com/
> 
> .

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 0/3] improve nvme quiesce time for large amount of namespaces
  2022-10-12  8:43             ` Chao Leng
@ 2022-10-12 11:13               ` Sagi Grimberg
  2022-10-13  1:37                 ` Chao Leng
  0 siblings, 1 reply; 15+ messages in thread
From: Sagi Grimberg @ 2022-10-12 11:13 UTC (permalink / raw)
  To: Chao Leng, Christoph Hellwig, Ming Lei
  Cc: linux-nvme, linux-block, kbusch, axboe



On 10/12/22 11:43, Chao Leng wrote:
> Add Ming Lei.
> 
> On 2022/10/12 14:37, Sagi Grimberg wrote:
>>
>>>> On Sun, Jul 31, 2022 at 01:23:36PM +0300, Sagi Grimberg wrote:
>>>>> But maybe we can avoid that, and because we allocate
>>>>> the connect_q ourselves, and fully know that it should
>>>>> not be apart of the tagset quiesce, perhaps we can introduce
>>>>> a new interface like:
>>>>> -- 
>>>>> static inline int nvme_ctrl_init_connect_q(struct nvme_ctrl *ctrl)
>>>>> {
>>>>>     ctrl->connect_q = blk_mq_init_queue_self_quiesce(ctrl->tagset);
>>>>>     if (IS_ERR(ctrl->connect_q))
>>>>>         return PTR_ERR(ctrl->connect_q);
>>>>>     return 0;
>>>>> }
>>>>> -- 
>>>>>
>>>>> And then blk_mq_quiesce_tagset can simply look into a per 
>>>>> request-queue
>>>>> self_quiesce flag and skip as needed.
>>>>
>>>> I'd just make that a queue flag set after allocation to keep the
>>>> interface simple, but otherwise this seems like the right thing
>>>> to do.
>>> Now the code used NVME_NS_STOPPED to avoid unpaired stop/start.
>>> If we use blk_mq_quiesce_tagset, It will cause the above mechanism to 
>>> fail.
>>> I review the code, only pci can not ensure secure stop/start pairing.
>>> So there is a choice, We only use blk_mq_quiesce_tagset on fabrics, 
>>> not PCI.
>>> Do you think that's acceptable?
>>> If that's acceptable, I will try to send a patch set.
>>
>> I don't think that this is acceptable. But I don't understand how
>> NVME_NS_STOPPED would change anything in the behavior of tagset-wide
>> quiesce?
> If use blk_mq_quiesce_tagset, it will quiesce all queues of all ns,
> but can not set NVME_NS_STOPPED of all ns. The mechanism of NVME_NS_STOPPED
> will be invalidated.
> NVMe-pci has very complicated quiesce/unquiesce use pattern, 
> quiesce/unquiesce
> may be called unpaired.
> It will cause some backward. There may be some bugs in this scenario:
> A thread: quiesce the queue
> B thread: quiesce the queue
> A thread end, and does not unquiesce the queue.
> B thread: unquiesce the queue, and do something which need the queue 
> must be unquiesed.
> 
> Of course, I don't think it is a good choice to guarantee paired access 
> through NVME_NS_STOPPED,
> there exist unexpected unquiesce and start queue too early.
> But now that the code has done so, the backward should be unacceptable.
> such as this scenario:
> A thread: quiesce the queue
> B thread: want to quiesce the queue but do nothing because 
> NVME_NS_STOPPED is already set.
> A thread: unquiesce the queue
> Now the queue is unquiesced too early for B thread.
> B thread: do something which need the queue must be quiesced.
> 
> Introduce NVME_NS_STOPPED link:
> https://lore.kernel.org/all/20211014081710.1871747-5-ming.lei@redhat.com/

I think we can just change a ns flag to a controller flag ala:
NVME_CTRL_STOPPED, and then do:

void nvme_stop_queues(struct nvme_ctrl *ctrl)
{
	if (!test_and_set_bit(NVME_CTRL_STOPPED, &ns->flags))
		blk_mq_quiesce_tagset(ctrl->tagset);
}
EXPORT_SYMBOL_GPL(nvme_stop_queues);

void nvme_start_queues(struct nvme_ctrl *ctrl)
{
	if (test_and_clear_bit(NVME_CTRL_STOPPED, &ns->flags))
		blk_mq_unquiesce_tagset(ctrl->tagset);
}
EXPORT_SYMBOL_GPL(nvme_start_queues);

Won't that achieve the same result?

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 0/3] improve nvme quiesce time for large amount of namespaces
  2022-10-12 11:13               ` Sagi Grimberg
@ 2022-10-13  1:37                 ` Chao Leng
  2022-10-13  2:06                   ` Chao Leng
  0 siblings, 1 reply; 15+ messages in thread
From: Chao Leng @ 2022-10-13  1:37 UTC (permalink / raw)
  To: Sagi Grimberg, Christoph Hellwig, Ming Lei
  Cc: linux-nvme, linux-block, kbusch, axboe



On 2022/10/12 19:13, Sagi Grimberg wrote:
> 
> 
> On 10/12/22 11:43, Chao Leng wrote:
>> Add Ming Lei.
>>
>> On 2022/10/12 14:37, Sagi Grimberg wrote:
>>>
>>>>> On Sun, Jul 31, 2022 at 01:23:36PM +0300, Sagi Grimberg wrote:
>>>>>> But maybe we can avoid that, and because we allocate
>>>>>> the connect_q ourselves, and fully know that it should
>>>>>> not be apart of the tagset quiesce, perhaps we can introduce
>>>>>> a new interface like:
>>>>>> -- 
>>>>>> static inline int nvme_ctrl_init_connect_q(struct nvme_ctrl *ctrl)
>>>>>> {
>>>>>>     ctrl->connect_q = blk_mq_init_queue_self_quiesce(ctrl->tagset);
>>>>>>     if (IS_ERR(ctrl->connect_q))
>>>>>>         return PTR_ERR(ctrl->connect_q);
>>>>>>     return 0;
>>>>>> }
>>>>>> -- 
>>>>>>
>>>>>> And then blk_mq_quiesce_tagset can simply look into a per request-queue
>>>>>> self_quiesce flag and skip as needed.
>>>>>
>>>>> I'd just make that a queue flag set after allocation to keep the
>>>>> interface simple, but otherwise this seems like the right thing
>>>>> to do.
>>>> Now the code used NVME_NS_STOPPED to avoid unpaired stop/start.
>>>> If we use blk_mq_quiesce_tagset, It will cause the above mechanism to fail.
>>>> I review the code, only pci can not ensure secure stop/start pairing.
>>>> So there is a choice, We only use blk_mq_quiesce_tagset on fabrics, not PCI.
>>>> Do you think that's acceptable?
>>>> If that's acceptable, I will try to send a patch set.
>>>
>>> I don't think that this is acceptable. But I don't understand how
>>> NVME_NS_STOPPED would change anything in the behavior of tagset-wide
>>> quiesce?
>> If use blk_mq_quiesce_tagset, it will quiesce all queues of all ns,
>> but can not set NVME_NS_STOPPED of all ns. The mechanism of NVME_NS_STOPPED
>> will be invalidated.
>> NVMe-pci has very complicated quiesce/unquiesce use pattern, quiesce/unquiesce
>> may be called unpaired.
>> It will cause some backward. There may be some bugs in this scenario:
>> A thread: quiesce the queue
>> B thread: quiesce the queue
>> A thread end, and does not unquiesce the queue.
>> B thread: unquiesce the queue, and do something which need the queue must be unquiesed.
>>
>> Of course, I don't think it is a good choice to guarantee paired access through NVME_NS_STOPPED,
>> there exist unexpected unquiesce and start queue too early.
>> But now that the code has done so, the backward should be unacceptable.
>> such as this scenario:
>> A thread: quiesce the queue
>> B thread: want to quiesce the queue but do nothing because NVME_NS_STOPPED is already set.
>> A thread: unquiesce the queue
>> Now the queue is unquiesced too early for B thread.
>> B thread: do something which need the queue must be quiesced.
>>
>> Introduce NVME_NS_STOPPED link:
>> https://lore.kernel.org/all/20211014081710.1871747-5-ming.lei@redhat.com/
> 
> I think we can just change a ns flag to a controller flag ala:
> NVME_CTRL_STOPPED, and then do:
> 
> void nvme_stop_queues(struct nvme_ctrl *ctrl)
> {
>      if (!test_and_set_bit(NVME_CTRL_STOPPED, &ns->flags))
>          blk_mq_quiesce_tagset(ctrl->tagset);
> }
> EXPORT_SYMBOL_GPL(nvme_stop_queues);
> 
> void nvme_start_queues(struct nvme_ctrl *ctrl)
> {
>      if (test_and_clear_bit(NVME_CTRL_STOPPED, &ns->flags))
>          blk_mq_unquiesce_tagset(ctrl->tagset);
> }
> EXPORT_SYMBOL_GPL(nvme_start_queues);
> 
> Won't that achieve the same result?
No, nvme_set_queue_dying calls nvme_start_ns_queue for a single
namespace; a flag in ctrl->flags cannot cover this scenario.
This still results in unpaired quiesce/unquiesce.

Maybe it is necessary to first clean up the confusing
quiesce/unquiesce usage of nvme-pci, so that blk_mq_quiesce_tagset
becomes easy to use for nvme-pci.
Until then, we can only optimize batched quiesce/unquiesce based on
namespaces, although I also think using blk_mq_quiesce_tagset is the
better choice.
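
For reference, the single-namespace path mentioned above looks roughly
like this (paraphrased from the core.c of that era; details may
differ):

```c
/*
 * Paraphrased sketch, not an exact copy: marking one namespace dead
 * unquiesces only that namespace's queue, which a single
 * controller-wide NVME_CTRL_STOPPED bit could not track.
 */
static void nvme_set_queue_dying(struct nvme_ns *ns)
{
	if (test_and_set_bit(NVME_NS_DEAD, &ns->flags))
		return;

	blk_mark_disk_dead(ns->disk);
	nvme_start_ns_queue(ns);	/* unquiesce just this one queue */
}
```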

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 0/3] improve nvme quiesce time for large amount of namespaces
  2022-10-13  1:37                 ` Chao Leng
@ 2022-10-13  2:06                   ` Chao Leng
  0 siblings, 0 replies; 15+ messages in thread
From: Chao Leng @ 2022-10-13  2:06 UTC (permalink / raw)
  To: Sagi Grimberg, Christoph Hellwig, Ming Lei
  Cc: linux-nvme, linux-block, kbusch, axboe



On 2022/10/13 9:37, Chao Leng wrote:
> 
> 
> On 2022/10/12 19:13, Sagi Grimberg wrote:
>>
>>
>> On 10/12/22 11:43, Chao Leng wrote:
>>> Add Ming Lei.
>>>
>>> On 2022/10/12 14:37, Sagi Grimberg wrote:
>>>>
>>>>>> On Sun, Jul 31, 2022 at 01:23:36PM +0300, Sagi Grimberg wrote:
>>>>>>> But maybe we can avoid that, and because we allocate
>>>>>>> the connect_q ourselves, and fully know that it should
>>>>>>> not be apart of the tagset quiesce, perhaps we can introduce
>>>>>>> a new interface like:
>>>>>>> -- 
>>>>>>> static inline int nvme_ctrl_init_connect_q(struct nvme_ctrl *ctrl)
>>>>>>> {
>>>>>>>     ctrl->connect_q = blk_mq_init_queue_self_quiesce(ctrl->tagset);
>>>>>>>     if (IS_ERR(ctrl->connect_q))
>>>>>>>         return PTR_ERR(ctrl->connect_q);
>>>>>>>     return 0;
>>>>>>> }
>>>>>>> -- 
>>>>>>>
>>>>>>> And then blk_mq_quiesce_tagset can simply look into a per request-queue
>>>>>>> self_quiesce flag and skip as needed.
>>>>>>
>>>>>> I'd just make that a queue flag set after allocation to keep the
>>>>>> interface simple, but otherwise this seems like the right thing
>>>>>> to do.
>>>>> The code now uses NVME_NS_STOPPED to avoid unpaired stop/start.
>>>>> If we use blk_mq_quiesce_tagset, it will break that mechanism.
>>>>> Reviewing the code, only nvme-pci cannot ensure that stop/start are paired.
>>>>> So one option is to use blk_mq_quiesce_tagset only on fabrics, not on PCI.
>>>>> Do you think that's acceptable?
>>>>> If so, I will try to send a patch set.
>>>>
>>>> I don't think that this is acceptable. But I don't understand how
>>>> NVME_NS_STOPPED would change anything in the behavior of tagset-wide
>>>> quiesce?
>>> If we use blk_mq_quiesce_tagset, it will quiesce the queues of all
>>> namespaces but cannot set NVME_NS_STOPPED on each of them, so the
>>> NVME_NS_STOPPED mechanism is invalidated.
>>> nvme-pci has a very complicated quiesce/unquiesce pattern;
>>> quiesce and unquiesce may be called unpaired.
>>> That would be a regression. For example:
>>> Thread A: quiesces the queue.
>>> Thread B: quiesces the queue.
>>> Thread A finishes without unquiescing the queue.
>>> Thread B: unquiesces the queue and does something that requires the queue to be unquiesced.
>>>
>>> Of course, I don't think guaranteeing paired access through NVME_NS_STOPPED
>>> is a good choice either; it can lead to an unexpected unquiesce that starts the queue too early.
>>> But now that the code relies on it, regressing would be unacceptable.
>>> For example:
>>> Thread A: quiesces the queue.
>>> Thread B: wants to quiesce the queue but does nothing because NVME_NS_STOPPED is already set.
>>> Thread A: unquiesces the queue.
>>> Now the queue is unquiesced too early from thread B's point of view.
>>> Thread B: does something that requires the queue to be quiesced.
>>>
>>> NVME_NS_STOPPED was introduced here:
>>> https://lore.kernel.org/all/20211014081710.1871747-5-ming.lei@redhat.com/
>>
>> I think we can just change a ns flag to a controller flag ala:
>> NVME_CTRL_STOPPED, and then do:
>>
>> void nvme_stop_queues(struct nvme_ctrl *ctrl)
>> {
>>      if (!test_and_set_bit(NVME_CTRL_STOPPED, &ctrl->flags))
>>          blk_mq_quiesce_tagset(ctrl->tagset);
>> }
>> EXPORT_SYMBOL_GPL(nvme_stop_queues);
>>
>> void nvme_start_queues(struct nvme_ctrl *ctrl)
>> {
>>      if (test_and_clear_bit(NVME_CTRL_STOPPED, &ctrl->flags))
>>          blk_mq_unquiesce_tagset(ctrl->tagset);
>> }
>> EXPORT_SYMBOL_GPL(nvme_start_queues);
>>
>> Won't that achieve the same result?
> No, nvme_set_queue_dying calls nvme_start_ns_queue for a single ns;
> a ctrl-level flag cannot cover that scenario.
> This still results in unpaired quiesce/unquiesce.
> 
> Maybe it is necessary to clean up the confusing quiesce/unquiesce usage in
> nvme-pci so that blk_mq_quiesce_tagset becomes easy to use there.
> Before that, we could only optimize batched quiesce/unquiesce per namespace,
> although I also think using blk_mq_quiesce_tagset is a better choice.
We can do something special for nvme_set_queue_dying
to achieve the same effect as NVME_NS_STOPPED.
This should be acceptable.

What do you think about this?

Show the code:
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -5100,13 +5100,14 @@ static void nvme_stop_ns_queue(struct nvme_ns *ns)
   * holding bd_butex.  This will end buffered writers dirtying pages that can't
   * be synced.
   */
-static void nvme_set_queue_dying(struct nvme_ns *ns)
+static void nvme_set_queue_dying(struct nvme_ns *ns, bool start_queue)
  {
         if (test_and_set_bit(NVME_NS_DEAD, &ns->flags))
                 return;

         blk_mark_disk_dead(ns->disk);
-       nvme_start_ns_queue(ns);
+       if (start_queue)
+               blk_mq_unquiesce_queue(ns->queue);

         set_capacity_and_notify(ns->disk, 0);
  }
@@ -5121,15 +5122,18 @@ static void nvme_set_queue_dying(struct nvme_ns *ns)
  void nvme_kill_queues(struct nvme_ctrl *ctrl)
  {
         struct nvme_ns *ns;
+       bool start_queue = false;

         down_read(&ctrl->namespaces_rwsem);

         /* Forcibly unquiesce queues to avoid blocking dispatch */
         if (ctrl->admin_q && !blk_queue_dying(ctrl->admin_q))
                 nvme_start_admin_queue(ctrl);
+       if (test_and_clear_bit(NVME_CTRL_STOPPED, &ctrl->flags))
+               start_queue = true;

         list_for_each_entry(ns, &ctrl->namespaces, list)
-               nvme_set_queue_dying(ns);
+               nvme_set_queue_dying(ns, start_queue);

         up_read(&ctrl->namespaces_rwsem);
  }

> 
> .

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2022-10-13  2:07 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-07-29  7:39 [PATCH 0/3] improve nvme quiesce time for large amount of namespaces Chao Leng
2022-07-29  7:39 ` [PATCH 1/3] blk-mq: delete unnecessary comments Chao Leng
2022-07-29  7:39 ` [PATCH 2/3] nvme: improve the quiesce time for non blocking transports Chao Leng
2022-07-29  7:39 ` [PATCH 3/3] nvme: improve the quiesce time for " Chao Leng
2022-07-29 14:26 ` [PATCH 0/3] improve nvme quiesce time for large amount of namespaces Christoph Hellwig
2022-07-30  0:39   ` Chao Leng
2022-07-31 10:23     ` Sagi Grimberg
2022-08-01  1:45       ` Chao Leng
2022-08-02 13:38       ` Christoph Hellwig
2022-10-10  8:46         ` Chao Leng
2022-10-12  6:37           ` Sagi Grimberg
2022-10-12  8:43             ` Chao Leng
2022-10-12 11:13               ` Sagi Grimberg
2022-10-13  1:37                 ` Chao Leng
2022-10-13  2:06                   ` Chao Leng
