linux-nvme.lists.infradead.org archive mirror
* [PATCH] nvme-core: reduce io pause time when fail over
@ 2020-07-22  9:36 Chao Leng
  2020-07-22 16:00 ` Keith Busch
From: Chao Leng @ 2020-07-22  9:36 UTC
  To: linux-nvme; +Cc: kbusch, axboe, hch, lengchao, sagi

We tested NVMe over RoCE failover with multipath and 1000 namespaces
configured, and saw I/O pause for more than 10 seconds. The reason:
when an I/O timeout causes a path error, nvme_stop_queues quiesces the
queue of every namespace. Quiescing a queue waits for all ongoing
dispatches to finish via synchronize_rcu(), which takes more than
10 milliseconds per queue, so with 1000 namespaces I/O pauses for more
than 10 seconds in total.

To reduce the I/O pause time, have nvme_stop_queues quiesce each queue
with blk_mq_quiesce_queue_nowait(), and then wait once for all ongoing
dispatches to complete after every queue has been quiesced.

Signed-off-by: Chao Leng <lengchao@huawei.com>
---
 drivers/nvme/host/core.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index add040168e67..877dd75d8a8c 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -4323,8 +4323,9 @@ void nvme_stop_queues(struct nvme_ctrl *ctrl)
 
 	down_read(&ctrl->namespaces_rwsem);
 	list_for_each_entry(ns, &ctrl->namespaces, list)
-		blk_mq_quiesce_queue(ns->queue);
+		blk_mq_quiesce_queue_nowait(ns->queue);
 	up_read(&ctrl->namespaces_rwsem);
+	synchronize_rcu();
 }
 EXPORT_SYMBOL_GPL(nvme_stop_queues);
 
-- 
2.16.4



* Re: [PATCH] nvme-core: reduce io pause time when fail over
  2020-07-22  9:36 [PATCH] nvme-core: reduce io pause time when fail over Chao Leng
@ 2020-07-22 16:00 ` Keith Busch
  2020-07-22 16:13   ` Sagi Grimberg
From: Keith Busch @ 2020-07-22 16:00 UTC
  To: Chao Leng; +Cc: axboe, hch, linux-nvme, sagi

On Wed, Jul 22, 2020 at 05:36:33PM +0800, Chao Leng wrote:
> To reduce the I/O pause time, have nvme_stop_queues quiesce each queue
> with blk_mq_quiesce_queue_nowait(), and then wait once for all ongoing
> dispatches to complete after every queue has been quiesced.

The comment above blk_mq_quiesce_queue_nowait() wants to remove this
function. I'm not sure we should be introducing more users if that's the
case.

Using synchronize_rcu() at the end may be looking too much into blk-mq's
internal quiesce implementation. It happens to be the right thing to do
for a non-blocking hctx, so this patch assumes that will be true for any
nvme request_queue.
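
For context, the quiesce path being referenced looks roughly like this
in v5.8-era blk-mq (paraphrased, not a verbatim copy of block/blk-mq.c;
see also the refactor proposed later in this thread):
--
/*
 * Rough sketch of blk_mq_quiesce_queue() as of v5.8 (paraphrased):
 * one grace-period wait per request_queue, SRCU for blocking hctxs,
 * a single synchronize_rcu() for the rest.
 */
void blk_mq_quiesce_queue(struct request_queue *q)
{
	struct blk_mq_hw_ctx *hctx;
	unsigned int i;
	bool rcu = false;

	blk_mq_quiesce_queue_nowait(q);

	queue_for_each_hw_ctx(q, hctx, i) {
		if (hctx->flags & BLK_MQ_F_BLOCKING)
			synchronize_srcu(hctx->srcu);
		else
			rcu = true;
	}
	if (rcu)
		synchronize_rcu();
}
--
Calling synchronize_rcu() directly in the driver replicates only the
non-blocking branch of that loop, so it is only correct as long as no
nvme transport sets BLK_MQ_F_BLOCKING.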


* Re: [PATCH] nvme-core: reduce io pause time when fail over
  2020-07-22 16:00 ` Keith Busch
@ 2020-07-22 16:13   ` Sagi Grimberg
  2020-07-23  8:06     ` Chao Leng
From: Sagi Grimberg @ 2020-07-22 16:13 UTC
  To: Keith Busch, Chao Leng; +Cc: axboe, hch, linux-nvme


>> To reduce the I/O pause time, have nvme_stop_queues quiesce each queue
>> with blk_mq_quiesce_queue_nowait(), and then wait once for all ongoing
>> dispatches to complete after every queue has been quiesced.
> 
> The comment above blk_mq_quiesce_queue_nowait() wants to remove this
> function. I'm not sure we should be introducing more users if that's the
> case.
> 
> Using synchronize_rcu() at the end may be looking too much into blk-mq's
> internal quiesce implementation. It happens to be the right thing to do
> for a non-blocking hctx, so this patch assumes that will be true for any
> nvme request_queue.

nvme-tcp became a blocking hctx since we optimized to do network sends
from inside queue_rq. So this should either be split into two functions,
or nvme-core needs to check BLK_MQ_F_BLOCKING and do the right thing.
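
A minimal, illustrative sketch of how a driver with a sleeping
->queue_rq() (such as nvme-tcp) advertises that to blk-mq — the ops
table, queue count and depth below are invented for the example, only
the BLK_MQ_F_BLOCKING flag matters:
--
#include <linux/blk-mq.h>
#include <linux/string.h>

/* Hypothetical helper; the "example_*" names are not real driver code. */
static int example_setup_tagset(struct blk_mq_tag_set *set,
				const struct blk_mq_ops *ops)
{
	memset(set, 0, sizeof(*set));
	set->ops		= ops;
	set->nr_hw_queues	= 4;
	set->queue_depth	= 128;
	set->numa_node		= NUMA_NO_NODE;
	/* ->queue_rq() may sleep, so blk-mq must protect dispatch with SRCU */
	set->flags		= BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_BLOCKING;

	return blk_mq_alloc_tag_set(set);
}
--
Once queues are created from such a tag set, draining their dispatches
requires synchronize_srcu(hctx->srcu); a bare synchronize_rcu() in nvme
core would not cover them.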


* Re: [PATCH] nvme-core: reduce io pause time when fail over
  2020-07-22 16:13   ` Sagi Grimberg
@ 2020-07-23  8:06     ` Chao Leng
  2020-07-23 16:30       ` Sagi Grimberg
From: Chao Leng @ 2020-07-23  8:06 UTC
  To: Sagi Grimberg, Keith Busch; +Cc: axboe, hch, linux-nvme


On 2020/7/23 0:13, Sagi Grimberg wrote:
>
>>> To reduce the I/O pause time, have nvme_stop_queues quiesce each queue
>>> with blk_mq_quiesce_queue_nowait(), and then wait once for all ongoing
>>> dispatches to complete after every queue has been quiesced.
>>
>> The comment above blk_mq_quiesce_queue_nowait() wants to remove this
>> function. I'm not sure we should be introducing more users if that's the
>> case.
>>
>> Using synchronize_rcu() at the end may be looking too much into blk-mq's
>> internal quiesce implementation. It happens to be the right thing to do
>> for a non-blocking hctx, so this patch assumes that will be true for any
>> nvme request_queue.
>
> nvme-tcp became a blocking hctx since we optimized to do network sends
> from inside queue_rq. So this should either be split into two functions,
> or nvme-core needs to check BLK_MQ_F_BLOCKING and do the right thing.

Another solution: introduce blk_mq_quiesce_queue_async(), which does not
wait for ongoing dispatches to complete if the hctx is not a blocking
hctx. The caller, such as nvme_stop_queues, then waits for all ongoing
dispatches to complete after all queues have been quiesced.

---
  block/blk-mq.c           | 28 ++++++++++++++++++++++++++++
  drivers/nvme/host/core.c |  5 ++++-
  include/linux/blk-mq.h   |  1 +
  3 files changed, 33 insertions(+), 1 deletion(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 4e0d173beaa3..0053ff42bb47 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -235,6 +235,34 @@ void blk_mq_quiesce_queue(struct request_queue *q)
  }
  EXPORT_SYMBOL_GPL(blk_mq_quiesce_queue);

+/*
+ * blk_mq_quiesce_queue_async() - quiesce without waiting on non-blocking hctxs
+ * @q: request queue.
+ *
+ * Note: this function does not wait for ongoing dispatches to complete
+ * if the hctx is not a blocking hctx. It is meant to reduce the wait
+ * time when quiescing a batch of queues at the same time: the caller
+ * must wait synchronously for all ongoing dispatches to finish after
+ * all queues have been quiesced this way.
+ */
+bool blk_mq_quiesce_queue_async(struct request_queue *q)
+{
+    struct blk_mq_hw_ctx *hctx;
+    unsigned int i;
+    bool rcu = false;
+
+    blk_mq_quiesce_queue_nowait(q);
+
+    queue_for_each_hw_ctx(q, hctx, i) {
+        if (hctx->flags & BLK_MQ_F_BLOCKING)
+            synchronize_srcu(hctx->srcu);
+        else
+            rcu = true;
+    }
+    return rcu;
+}
+EXPORT_SYMBOL_GPL(blk_mq_quiesce_queue_async);
+
  /*
   * blk_mq_unquiesce_queue() - counterpart of blk_mq_quiesce_queue()
   * @q: request queue.
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index add040168e67..a1915f0276eb 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -4320,11 +4320,14 @@ EXPORT_SYMBOL_GPL(nvme_start_freeze);
  void nvme_stop_queues(struct nvme_ctrl *ctrl)
  {
      struct nvme_ns *ns;
+    bool rcu = false;

      down_read(&ctrl->namespaces_rwsem);
      list_for_each_entry(ns, &ctrl->namespaces, list)
-        blk_mq_quiesce_queue(ns->queue);
+        rcu = rcu || blk_mq_quiesce_queue_async(ns->queue);
      up_read(&ctrl->namespaces_rwsem);
+    if (rcu)
+        synchronize_rcu();
  }
  EXPORT_SYMBOL_GPL(nvme_stop_queues);

diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index d6fcae17da5a..6ad83bfd17b2 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -515,6 +515,7 @@ void blk_mq_start_hw_queues(struct request_queue *q);
  void blk_mq_start_stopped_hw_queue(struct blk_mq_hw_ctx *hctx, bool async);
  void blk_mq_start_stopped_hw_queues(struct request_queue *q, bool async);
  void blk_mq_quiesce_queue(struct request_queue *q);
+bool blk_mq_quiesce_queue_async(struct request_queue *q);
  void blk_mq_unquiesce_queue(struct request_queue *q);
  void blk_mq_delay_run_hw_queue(struct blk_mq_hw_ctx *hctx, unsigned long msecs);
  void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async);
-- 




* Re: [PATCH] nvme-core: reduce io pause time when fail over
  2020-07-23  8:06     ` Chao Leng
@ 2020-07-23 16:30       ` Sagi Grimberg
  2020-07-23 16:42         ` Keith Busch
From: Sagi Grimberg @ 2020-07-23 16:30 UTC
  To: Chao Leng, Keith Busch; +Cc: axboe, hch, linux-nvme


>>> Using synchronize_rcu() at the end may be looking too much into blk-mq's
>>> internal quiesce implementation. It happens to be the right thing to do
>>> for a non-blocking hctx, so this patch assumes that will be true for any
>>> nvme request_queue.
>>
>> nvme-tcp became a blocking hctx since we optimized to do network sends
>> from inside queue_rq. So this should either be split into two functions,
>> or nvme-core needs to check BLK_MQ_F_BLOCKING and do the right thing.
> 
> Another solution: introduce blk_mq_quiesce_queue_async(), which does not
> wait for ongoing dispatches to complete if the hctx is not a blocking
> hctx. The caller, such as nvme_stop_queues, then waits for all ongoing
> dispatches to complete after all queues have been quiesced.

But then you maintain the problem for a blocking hctx, so it's not a
good solution (and it's also not good practice to abuse the return code
like you did, IMO).

What we need is something like the following untested patches:
[1]:
--
diff --git a/block/blk-mq.c b/block/blk-mq.c
index abcf590f6238..c63bf34866eb 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -209,23 +209,12 @@ void blk_mq_quiesce_queue_nowait(struct request_queue *q)
  }
  EXPORT_SYMBOL_GPL(blk_mq_quiesce_queue_nowait);

-/**
- * blk_mq_quiesce_queue() - wait until all ongoing dispatches have finished
- * @q: request queue.
- *
- * Note: this function does not prevent that the struct request end_io()
- * callback function is invoked. Once this function is returned, we make
- * sure no dispatch can happen until the queue is unquiesced via
- * blk_mq_unquiesce_queue().
- */
-void blk_mq_quiesce_queue(struct request_queue *q)
+void blk_mq_quiesce_queue_sync(struct request_queue *q)
  {
         struct blk_mq_hw_ctx *hctx;
         unsigned int i;
         bool rcu = false;

-       blk_mq_quiesce_queue_nowait(q);
-
         queue_for_each_hw_ctx(q, hctx, i) {
                 if (hctx->flags & BLK_MQ_F_BLOCKING)
                         synchronize_srcu(hctx->srcu);
@@ -235,6 +224,22 @@ void blk_mq_quiesce_queue(struct request_queue *q)
         if (rcu)
                 synchronize_rcu();
  }
+EXPORT_SYMBOL_GPL(blk_mq_quiesce_queue_sync);
+
+/**
+ * blk_mq_quiesce_queue() - wait until all ongoing dispatches have finished
+ * @q: request queue.
+ *
+ * Note: this function does not prevent that the struct request end_io()
+ * callback function is invoked. Once this function is returned, we make
+ * sure no dispatch can happen until the queue is unquiesced via
+ * blk_mq_unquiesce_queue().
+ */
+void blk_mq_quiesce_queue(struct request_queue *q)
+{
+       blk_mq_quiesce_queue_nowait(q);
+       blk_mq_quiesce_queue_sync(q);
+}
  EXPORT_SYMBOL_GPL(blk_mq_quiesce_queue);

  /*
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 23230c1d031e..7fd6994d5b32 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -532,6 +532,7 @@ int blk_mq_map_queues(struct blk_mq_queue_map *qmap);
  void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, int nr_hw_queues);

  void blk_mq_quiesce_queue_nowait(struct request_queue *q);
+void blk_mq_quiesce_queue_sync(struct request_queue *q);

  unsigned int blk_mq_rq_cpu(struct request *rq);
--

[2]:
--
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index c16bfdff2953..35c39932c491 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -4535,8 +4535,11 @@ void nvme_stop_queues(struct nvme_ctrl *ctrl)

         down_read(&ctrl->namespaces_rwsem);
         list_for_each_entry(ns, &ctrl->namespaces, list)
-               blk_mq_quiesce_queue(ns->queue);
+               blk_mq_quiesce_queue_nowait(ns->queue);
+       /* all namespaces share hctxs, so sync just once */
+       blk_mq_quiesce_queue_sync(ns->queue);
         up_read(&ctrl->namespaces_rwsem);
+
  }
  EXPORT_SYMBOL_GPL(nvme_stop_queues);
--

I can send formal patches if you care to test it...


* Re: [PATCH] nvme-core: reduce io pause time when fail over
  2020-07-23 16:30       ` Sagi Grimberg
@ 2020-07-23 16:42         ` Keith Busch
  2020-07-23 19:12           ` Sagi Grimberg
From: Keith Busch @ 2020-07-23 16:42 UTC
  To: Sagi Grimberg; +Cc: axboe, linux-nvme, Chao Leng, hch

On Thu, Jul 23, 2020 at 09:30:57AM -0700, Sagi Grimberg wrote:
> +       /* all namespaces share hctxs, so sync just once */
> +       blk_mq_quiesce_queue_sync(ns->queue);

They don't share hctxs. An hctx is a property of a specific
request_queue, and each namespace has its own.
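
A simplified sketch of why (loosely modeled on nvme_alloc_ns(); error
handling and most of the real setup omitted, so treat it as an
illustration rather than the actual driver code):
--
#include <linux/blk-mq.h>
#include <linux/err.h>
/* struct nvme_ctrl and struct nvme_ns come from drivers/nvme/host/nvme.h */

static int example_alloc_ns_queue(struct nvme_ctrl *ctrl, struct nvme_ns *ns)
{
	/*
	 * Every namespace gets its own request_queue, all created from
	 * the controller's shared tag set...
	 */
	ns->queue = blk_mq_init_queue(ctrl->tagset);
	if (IS_ERR(ns->queue))
		return PTR_ERR(ns->queue);

	/*
	 * ...but blk-mq builds a separate hctx array for each
	 * request_queue, so quiescing has to visit (and wait on)
	 * every namespace's queue individually.
	 */
	ns->queue->queuedata = ns;
	return 0;
}
--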


* Re: [PATCH] nvme-core: reduce io pause time when fail over
  2020-07-23 16:42         ` Keith Busch
@ 2020-07-23 19:12           ` Sagi Grimberg
From: Sagi Grimberg @ 2020-07-23 19:12 UTC
  To: Keith Busch; +Cc: axboe, Chao Leng, linux-nvme, hch


>> +       /* all namespaces share hctxs, so sync just once */
>> +       blk_mq_quiesce_queue_sync(ns->queue);
> 
> They don't share hctxs. An hctx is a property of a specific
> request_queue, and each namespace has its own.

ugh, right...

But I still think that we should wait for all the srcus together...

Perhaps we should do call_srcu() on all of them and wait for them all
in one pass. But I don't want the driver to do this without a proper
blk-mq interface...
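
To make that concrete, a rough sketch of the kind of helper such an
interface could provide — the names and the srcu_wait bookkeeping are
hypothetical, nothing like this exists in blk-mq at this point:
--
#include <linux/blk-mq.h>
#include <linux/completion.h>
#include <linux/srcu.h>

struct srcu_wait {
	struct rcu_head		rcu;
	struct completion	done;
};

static void srcu_wait_done(struct rcu_head *rcu)
{
	struct srcu_wait *w = container_of(rcu, struct srcu_wait, rcu);

	complete(&w->done);
}

/*
 * Hypothetical: start an asynchronous SRCU grace period for one
 * blocking hctx instead of sleeping in synchronize_srcu().
 */
static void blk_mq_hctx_quiesce_async(struct blk_mq_hw_ctx *hctx,
				      struct srcu_wait *wait)
{
	init_completion(&wait->done);
	call_srcu(hctx->srcu, &wait->rcu, srcu_wait_done);
}
--
The caller would blk_mq_quiesce_queue_nowait() every queue, start one of
these per blocking hctx, then do a single synchronize_rcu() for the
non-blocking hctxs and wait_for_completion() on each srcu_wait — one
combined grace-period wait for the whole controller rather than one per
namespace.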


