Linux-NVME Archive on lore.kernel.org
* [PATCH rfc v2 0/4] improve quiesce time for large amount of namespaces
@ 2020-07-24 23:06 Sagi Grimberg
  2020-07-24 23:06 ` [PATCH rfc v2 1/4] blk-mq: add async quiesce interface for blocking hw queues Sagi Grimberg
                   ` (3 more replies)
  0 siblings, 4 replies; 9+ messages in thread
From: Sagi Grimberg @ 2020-07-24 23:06 UTC (permalink / raw)
  To: linux-nvme, Christoph Hellwig, Keith Busch; +Cc: Jens Axboe, Chao Leng

This series attempts to improve the quiesce time when using a large
number of namespaces, which also improves I/O failover time in a
multipath environment.

We improve quiesce time for both non-blocking hctxs (e.g. pci, fc, rdma)
and blocking hctxs (e.g. tcp).

The original patch from Chao targeted rdma, hence patch #4 is included
just for testing purposes in case testing with nvme-tcp is an issue.

Changes from v1:
- fixed patch #2 wrong leftovers (start_freeze)

Chao Leng (1):
  nvme-core: reduce io failover time

Sagi Grimberg (3):
  blk-mq: add async quiesce interface for blocking hw queues
  nvme: improve quiesce for blocking queues
  nvme-rdma: use blocking quiesce interface

 block/blk-mq.c           | 31 +++++++++++++++++++++++++++++++
 drivers/nvme/host/core.c | 20 +++++++++++++++++++-
 drivers/nvme/host/nvme.h |  1 +
 drivers/nvme/host/rdma.c |  5 +++--
 drivers/nvme/host/tcp.c  |  2 +-
 include/linux/blk-mq.h   |  4 ++++
 6 files changed, 59 insertions(+), 4 deletions(-)

-- 
2.25.1


_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme


* [PATCH rfc v2 1/4] blk-mq: add async quiesce interface for blocking hw queues
  2020-07-24 23:06 [PATCH rfc v2 0/4] improve quiesce time for large amount of namespaces Sagi Grimberg
@ 2020-07-24 23:06 ` Sagi Grimberg
  2020-07-25  7:03   ` Chao Leng
  2020-07-24 23:06 ` [PATCH rfc v2 2/4] nvme: improve quiesce for blocking queues Sagi Grimberg
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 9+ messages in thread
From: Sagi Grimberg @ 2020-07-24 23:06 UTC (permalink / raw)
  To: linux-nvme, Christoph Hellwig, Keith Busch; +Cc: Jens Axboe, Chao Leng

Drivers that use blocking hw queues may have to quiesce a large number
of request queues at once (e.g. controller or adapter reset). These
drivers would benefit from an async quiesce interface such that they
can trigger quiesce asynchronously and wait for all of them in parallel.

This leaves the synchronization responsibility to the driver, but adds
a convenient interface to quiesce asynchronously and wait in a single pass.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
---
 block/blk-mq.c         | 31 +++++++++++++++++++++++++++++++
 include/linux/blk-mq.h |  4 ++++
 2 files changed, 35 insertions(+)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index abcf590f6238..7326709ed2d1 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -209,6 +209,37 @@ void blk_mq_quiesce_queue_nowait(struct request_queue *q)
 }
 EXPORT_SYMBOL_GPL(blk_mq_quiesce_queue_nowait);
 
+void blk_mq_quiesce_blocking_queue_async(struct request_queue *q)
+{
+	struct blk_mq_hw_ctx *hctx;
+	unsigned int i;
+
+	blk_mq_quiesce_queue_nowait(q);
+
+	queue_for_each_hw_ctx(q, hctx, i) {
+		if (!(hctx->flags & BLK_MQ_F_BLOCKING))
+			continue;
+		init_completion(&hctx->rcu_sync.completion);
+		init_rcu_head(&hctx->rcu_sync.head);
+		call_srcu(hctx->srcu, &hctx->rcu_sync.head, wakeme_after_rcu);
+	}
+}
+EXPORT_SYMBOL_GPL(blk_mq_quiesce_blocking_queue_async);
+
+void blk_mq_quiesce_blocking_queue_async_wait(struct request_queue *q)
+{
+	struct blk_mq_hw_ctx *hctx;
+	unsigned int i;
+
+	queue_for_each_hw_ctx(q, hctx, i) {
+		if (!(hctx->flags & BLK_MQ_F_BLOCKING))
+			continue;
+		wait_for_completion(&hctx->rcu_sync.completion);
+		destroy_rcu_head(&hctx->rcu_sync.head);
+	}
+}
+EXPORT_SYMBOL_GPL(blk_mq_quiesce_blocking_queue_async_wait);
+
 /**
  * blk_mq_quiesce_queue() - wait until all ongoing dispatches have finished
  * @q: request queue.
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 23230c1d031e..863b372d32aa 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -5,6 +5,7 @@
 #include <linux/blkdev.h>
 #include <linux/sbitmap.h>
 #include <linux/srcu.h>
+#include <linux/rcupdate_wait.h>
 
 struct blk_mq_tags;
 struct blk_flush_queue;
@@ -170,6 +171,7 @@ struct blk_mq_hw_ctx {
 	 */
 	struct list_head	hctx_list;
 
+	struct rcu_synchronize	rcu_sync;
 	/**
 	 * @srcu: Sleepable RCU. Use as lock when type of the hardware queue is
 	 * blocking (BLK_MQ_F_BLOCKING). Must be the last member - see also
@@ -532,6 +534,8 @@ int blk_mq_map_queues(struct blk_mq_queue_map *qmap);
 void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, int nr_hw_queues);
 
 void blk_mq_quiesce_queue_nowait(struct request_queue *q);
+void blk_mq_quiesce_blocking_queue_async(struct request_queue *q);
+void blk_mq_quiesce_blocking_queue_async_wait(struct request_queue *q);
 
 unsigned int blk_mq_rq_cpu(struct request *rq);
 
-- 
2.25.1



* [PATCH rfc v2 2/4] nvme: improve quiesce for blocking queues
  2020-07-24 23:06 [PATCH rfc v2 0/4] improve quiesce time for large amount of namespaces Sagi Grimberg
  2020-07-24 23:06 ` [PATCH rfc v2 1/4] blk-mq: add async quiesce interface for blocking hw queues Sagi Grimberg
@ 2020-07-24 23:06 ` Sagi Grimberg
  2020-07-24 23:06 ` [PATCH rfc v2 3/4] nvme-core: reduce io failover time Sagi Grimberg
  2020-07-24 23:06 ` [PATCH for-testing v2 4/4] nvme-rdma: use blocking quiesce interface Sagi Grimberg
  3 siblings, 0 replies; 9+ messages in thread
From: Sagi Grimberg @ 2020-07-24 23:06 UTC (permalink / raw)
  To: linux-nvme, Christoph Hellwig, Keith Busch; +Cc: Jens Axboe, Chao Leng

nvme transports that use blocking hw queues (e.g. nvme-tcp) currently
synchronize queue quiesce one namespace at a time. This can slow down
failover time (which first quiesces all ns queues) when we have a large
number of namespaces. Instead, use an async interface and quiesce the
namespace queues in parallel rather than serially.

Introduce nvme_stop_blocking_queues for transports that use blocking
hw queues and convert nvme-tcp to use the new interface.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
---
 drivers/nvme/host/core.c | 13 +++++++++++++
 drivers/nvme/host/nvme.h |  1 +
 drivers/nvme/host/tcp.c  |  2 +-
 3 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index c16bfdff2953..2ae8caa4e25f 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -4529,6 +4529,19 @@ void nvme_start_freeze(struct nvme_ctrl *ctrl)
 }
 EXPORT_SYMBOL_GPL(nvme_start_freeze);
 
+void nvme_stop_blocking_queues(struct nvme_ctrl *ctrl)
+{
+	struct nvme_ns *ns;
+
+	down_read(&ctrl->namespaces_rwsem);
+	list_for_each_entry(ns, &ctrl->namespaces, list)
+		blk_mq_quiesce_blocking_queue_async(ns->queue);
+	list_for_each_entry(ns, &ctrl->namespaces, list)
+		blk_mq_quiesce_blocking_queue_async_wait(ns->queue);
+	up_read(&ctrl->namespaces_rwsem);
+}
+EXPORT_SYMBOL_GPL(nvme_stop_blocking_queues);
+
 void nvme_stop_queues(struct nvme_ctrl *ctrl)
 {
 	struct nvme_ns *ns;
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 1609267a1f0e..f8c36176518e 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -568,6 +568,7 @@ int nvme_sec_submit(void *data, u16 spsp, u8 secp, void *buffer, size_t len,
 void nvme_complete_async_event(struct nvme_ctrl *ctrl, __le16 status,
 		volatile union nvme_result *res);
 
+void nvme_stop_blocking_queues(struct nvme_ctrl *ctrl);
 void nvme_stop_queues(struct nvme_ctrl *ctrl);
 void nvme_start_queues(struct nvme_ctrl *ctrl);
 void nvme_kill_queues(struct nvme_ctrl *ctrl);
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 7953362e7bb5..58ab07edcb4b 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -1887,7 +1887,7 @@ static void nvme_tcp_teardown_io_queues(struct nvme_ctrl *ctrl,
 {
 	if (ctrl->queue_count <= 1)
 		return;
-	nvme_stop_queues(ctrl);
+	nvme_stop_blocking_queues(ctrl);
 	nvme_tcp_stop_io_queues(ctrl);
 	if (ctrl->tagset) {
 		blk_mq_tagset_busy_iter(ctrl->tagset,
-- 
2.25.1



* [PATCH rfc v2 3/4] nvme-core: reduce io failover time
  2020-07-24 23:06 [PATCH rfc v2 0/4] improve quiesce time for large amount of namespaces Sagi Grimberg
  2020-07-24 23:06 ` [PATCH rfc v2 1/4] blk-mq: add async quiesce interface for blocking hw queues Sagi Grimberg
  2020-07-24 23:06 ` [PATCH rfc v2 2/4] nvme: improve quiesce for blocking queues Sagi Grimberg
@ 2020-07-24 23:06 ` Sagi Grimberg
  2020-07-24 23:15   ` Sagi Grimberg
  2020-07-25  8:51   ` Chao Leng
  2020-07-24 23:06 ` [PATCH for-testing v2 4/4] nvme-rdma: use blocking quiesce interface Sagi Grimberg
  3 siblings, 2 replies; 9+ messages in thread
From: Sagi Grimberg @ 2020-07-24 23:06 UTC (permalink / raw)
  To: linux-nvme, Christoph Hellwig, Keith Busch; +Cc: Jens Axboe, Chao Leng

From: Chao Leng <lengchao@huawei.com>

We tested nvme over roce failover with multipath and 1000 namespaces
configured; I/O paused for more than 10 seconds. The reason:
nvme_stop_queues quiesces the queues of every namespace when an io
timeout causes a path error. Quiescing a queue waits for all ongoing
dispatches to finish through synchronize_rcu, which takes more than
10 milliseconds per wait, so I/O pauses for more than 10 seconds.

To reduce the I/O pause time, make nvme_stop_queues use
blk_mq_quiesce_queue_nowait to quiesce each queue, then wait once for
all ongoing dispatches to complete after all queues have been quiesced.

Signed-off-by: Chao Leng <lengchao@huawei.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
---
 drivers/nvme/host/core.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 2ae8caa4e25f..e3fae68f7de6 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -4548,8 +4548,13 @@ void nvme_stop_queues(struct nvme_ctrl *ctrl)
 
 	down_read(&ctrl->namespaces_rwsem);
 	list_for_each_entry(ns, &ctrl->namespaces, list)
-		blk_mq_quiesce_queue(ns->queue);
+		blk_mq_quiesce_queue_nowait(ns->queue);
 	up_read(&ctrl->namespaces_rwsem);
+	/*
+	 * BLK_MQ_F_BLOCKING drivers should never call us
+	 */
+	WARN_ON_ONCE(ctrl->tagset.flags & BLK_MQ_F_BLOCKING);
+	synchronize_rcu();
 }
 EXPORT_SYMBOL_GPL(nvme_stop_queues);
 
-- 
2.25.1



* [PATCH for-testing v2 4/4] nvme-rdma: use blocking quiesce interface
  2020-07-24 23:06 [PATCH rfc v2 0/4] improve quiesce time for large amount of namespaces Sagi Grimberg
                   ` (2 preceding siblings ...)
  2020-07-24 23:06 ` [PATCH rfc v2 3/4] nvme-core: reduce io failover time Sagi Grimberg
@ 2020-07-24 23:06 ` Sagi Grimberg
  3 siblings, 0 replies; 9+ messages in thread
From: Sagi Grimberg @ 2020-07-24 23:06 UTC (permalink / raw)
  To: linux-nvme, Christoph Hellwig, Keith Busch; +Cc: Jens Axboe, Chao Leng

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
---
 drivers/nvme/host/rdma.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 5c3848974ccb..edd3cf1d5138 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -793,6 +793,7 @@ static struct blk_mq_tag_set *nvme_rdma_alloc_tagset(struct nvme_ctrl *nctrl,
 		set->queue_depth = NVME_AQ_MQ_TAG_DEPTH;
 		set->reserved_tags = 2; /* connect + keep-alive */
 		set->numa_node = nctrl->numa_node;
+		set->flags = BLK_MQ_F_BLOCKING;
 		set->cmd_size = sizeof(struct nvme_rdma_request) +
 				NVME_RDMA_DATA_SGL_SIZE;
 		set->driver_data = ctrl;
@@ -806,7 +807,7 @@ static struct blk_mq_tag_set *nvme_rdma_alloc_tagset(struct nvme_ctrl *nctrl,
 		set->queue_depth = nctrl->sqsize + 1;
 		set->reserved_tags = 1; /* fabric connect */
 		set->numa_node = nctrl->numa_node;
-		set->flags = BLK_MQ_F_SHOULD_MERGE;
+		set->flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_BLOCKING;
 		set->cmd_size = sizeof(struct nvme_rdma_request) +
 				NVME_RDMA_DATA_SGL_SIZE;
 		if (nctrl->max_integrity_segments)
@@ -1008,7 +1009,7 @@ static void nvme_rdma_teardown_io_queues(struct nvme_rdma_ctrl *ctrl,
 		bool remove)
 {
 	if (ctrl->ctrl.queue_count > 1) {
-		nvme_stop_queues(&ctrl->ctrl);
+		nvme_stop_blocking_queues(&ctrl->ctrl);
 		nvme_rdma_stop_io_queues(ctrl);
 		if (ctrl->ctrl.tagset) {
 			blk_mq_tagset_busy_iter(ctrl->ctrl.tagset,
-- 
2.25.1



* Re: [PATCH rfc v2 3/4] nvme-core: reduce io failover time
  2020-07-24 23:06 ` [PATCH rfc v2 3/4] nvme-core: reduce io failover time Sagi Grimberg
@ 2020-07-24 23:15   ` Sagi Grimberg
  2020-07-25  8:51   ` Chao Leng
  1 sibling, 0 replies; 9+ messages in thread
From: Sagi Grimberg @ 2020-07-24 23:15 UTC (permalink / raw)
  To: linux-nvme, Christoph Hellwig, Keith Busch; +Cc: Jens Axboe, Chao Leng


> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 2ae8caa4e25f..e3fae68f7de6 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -4548,8 +4548,13 @@ void nvme_stop_queues(struct nvme_ctrl *ctrl)
>   
>   	down_read(&ctrl->namespaces_rwsem);
>   	list_for_each_entry(ns, &ctrl->namespaces, list)
> -		blk_mq_quiesce_queue(ns->queue);
> +		blk_mq_quiesce_queue_nowait(ns->queue);
>   	up_read(&ctrl->namespaces_rwsem);
> +	/*
> +	 * BLK_MQ_F_BLOCKING drivers should never call us
> +	 */
> +	WARN_ON_ONCE(ctrl->tagset.flags & BLK_MQ_F_BLOCKING);

Woops^2 tagset is a pointer... will resend v3 after I get some
feedback...


* Re: [PATCH rfc v2 1/4] blk-mq: add async quiesce interface for blocking hw queues
  2020-07-24 23:06 ` [PATCH rfc v2 1/4] blk-mq: add async quiesce interface for blocking hw queues Sagi Grimberg
@ 2020-07-25  7:03   ` Chao Leng
  2020-07-25 17:35     ` Sagi Grimberg
  0 siblings, 1 reply; 9+ messages in thread
From: Chao Leng @ 2020-07-25  7:03 UTC (permalink / raw)
  To: Sagi Grimberg, linux-nvme, Christoph Hellwig, Keith Busch; +Cc: Jens Axboe

Looks great. One suggestion: SRCU could provide the batch sync
mechanism itself, which may be more generic. Weakness: for the same
srcu_struct, concurrent batch waiting is not supported. The code below
is just for TINY_SRCU:

---
  block/blk-mq.c           | 24 ++++++++++++++++++++++++
  include/linux/srcu.h     |  2 ++
  include/linux/srcutiny.h |  1 +
  kernel/rcu/srcutiny.c    | 16 ++++++++++++++++
  4 files changed, 43 insertions(+)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 4e0d173beaa3..97dabcf2cab8 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -235,6 +235,30 @@ void blk_mq_quiesce_queue(struct request_queue *q)
  }
  EXPORT_SYMBOL_GPL(blk_mq_quiesce_queue);

+void blk_mq_quiesce_blocking_queue_async(struct request_queue *q)
+{
+    struct blk_mq_hw_ctx *hctx;
+    unsigned int i;
+
+    blk_mq_quiesce_queue_nowait(q);
+
+    queue_for_each_hw_ctx(q, hctx, i)
+        if (hctx->flags & BLK_MQ_F_BLOCKING)
+            synchronize_srcu_async(hctx->srcu);
+}
+EXPORT_SYMBOL_GPL(blk_mq_quiesce_blocking_queue_async);
+
+void blk_mq_quiesce_blocking_queue_async_wait(struct request_queue *q)
+{
+    struct blk_mq_hw_ctx *hctx;
+    unsigned int i;
+
+    queue_for_each_hw_ctx(q, hctx, i)
+        if (hctx->flags & BLK_MQ_F_BLOCKING)
+            synchronize_srcu_async_wait(hctx->srcu);
+}
+EXPORT_SYMBOL_GPL(blk_mq_quiesce_blocking_queue_async_wait);
+
  /*
   * blk_mq_unquiesce_queue() - counterpart of blk_mq_quiesce_queue()
   * @q: request queue.
diff --git a/include/linux/srcu.h b/include/linux/srcu.h
index e432cc92c73d..7e006e51ccf9 100644
--- a/include/linux/srcu.h
+++ b/include/linux/srcu.h
@@ -60,6 +60,8 @@ void cleanup_srcu_struct(struct srcu_struct *ssp);
  int __srcu_read_lock(struct srcu_struct *ssp) __acquires(ssp);
  void __srcu_read_unlock(struct srcu_struct *ssp, int idx) __releases(ssp);
  void synchronize_srcu(struct srcu_struct *ssp);
+void synchronize_srcu_async(struct srcu_struct *ssp);
+void synchronize_srcu_async_wait(struct srcu_struct *ssp);

  #ifdef CONFIG_DEBUG_LOCK_ALLOC

diff --git a/include/linux/srcutiny.h b/include/linux/srcutiny.h
index 5a5a1941ca15..3d7d871bef61 100644
--- a/include/linux/srcutiny.h
+++ b/include/linux/srcutiny.h
@@ -23,6 +23,7 @@ struct srcu_struct {
      struct rcu_head *srcu_cb_head;    /* Pending callbacks: Head. */
      struct rcu_head **srcu_cb_tail;    /* Pending callbacks: Tail. */
      struct work_struct srcu_work;    /* For driving grace periods. */
+    struct rcu_synchronize rcu_sync;
  #ifdef CONFIG_DEBUG_LOCK_ALLOC
      struct lockdep_map dep_map;
  #endif /* #ifdef CONFIG_DEBUG_LOCK_ALLOC */
diff --git a/kernel/rcu/srcutiny.c b/kernel/rcu/srcutiny.c
index 6208c1dae5c9..6e1468175a45 100644
--- a/kernel/rcu/srcutiny.c
+++ b/kernel/rcu/srcutiny.c
@@ -190,6 +190,22 @@ void synchronize_srcu(struct srcu_struct *ssp)
  }
  EXPORT_SYMBOL_GPL(synchronize_srcu);

+void synchronize_srcu_async(struct srcu_struct *ssp)
+{
+    init_rcu_head(&ssp->rcu_sync.head);
+    init_completion(&ssp->rcu_sync.completion);
+    call_srcu(ssp, &ssp->rcu_sync.head, wakeme_after_rcu);
+
+}
+EXPORT_SYMBOL_GPL(synchronize_srcu_async);
+
+void synchronize_srcu_async_wait(struct srcu_struct *ssp)
+{
+    wait_for_completion(&ssp->rcu_sync.completion);
+    destroy_rcu_head(&ssp->rcu_sync.head);
+}
+EXPORT_SYMBOL_GPL(synchronize_srcu_async_wait);
+
  /* Lockdep diagnostics.  */
  void __init rcu_scheduler_starting(void)
  {
-- 


On 2020/7/25 7:06, Sagi Grimberg wrote:
> Drivers that use blocking hw queues may have to quiesce a large amount
> of request queues at once (e.g. controller or adapter reset). These
> drivers would benefit from an async quiesce interface such that
> they can trigger quiesce asynchronously and wait for all in parallel.
>
> This leaves the synchronization responsibility to the driver, but adds
> a convenient interface to quiesce async and wait in a single pass.
>
> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
> ---
>   block/blk-mq.c         | 31 +++++++++++++++++++++++++++++++
>   include/linux/blk-mq.h |  4 ++++
>   2 files changed, 35 insertions(+)
>
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index abcf590f6238..7326709ed2d1 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -209,6 +209,37 @@ void blk_mq_quiesce_queue_nowait(struct request_queue *q)
>   }
>   EXPORT_SYMBOL_GPL(blk_mq_quiesce_queue_nowait);
>   
> +void blk_mq_quiesce_blocking_queue_async(struct request_queue *q)
> +{
> +	struct blk_mq_hw_ctx *hctx;
> +	unsigned int i;
> +
> +	blk_mq_quiesce_queue_nowait(q);
> +
> +	queue_for_each_hw_ctx(q, hctx, i) {
> +		if (!(hctx->flags & BLK_MQ_F_BLOCKING))
> +			continue;
> +		init_completion(&hctx->rcu_sync.completion);
> +		init_rcu_head(&hctx->rcu_sync.head);
> +		call_srcu(hctx->srcu, &hctx->rcu_sync.head, wakeme_after_rcu);
> +	}
> +}
> +EXPORT_SYMBOL_GPL(blk_mq_quiesce_blocking_queue_async);
> +
> +void blk_mq_quiesce_blocking_queue_async_wait(struct request_queue *q)
> +{
> +	struct blk_mq_hw_ctx *hctx;
> +	unsigned int i;
> +
> +	queue_for_each_hw_ctx(q, hctx, i) {
> +		if (!(hctx->flags & BLK_MQ_F_BLOCKING))
> +			continue;
> +		wait_for_completion(&hctx->rcu_sync.completion);
> +		destroy_rcu_head(&hctx->rcu_sync.head);
> +	}
> +}
> +EXPORT_SYMBOL_GPL(blk_mq_quiesce_blocking_queue_async_wait);
> +
>   /**
>    * blk_mq_quiesce_queue() - wait until all ongoing dispatches have finished
>    * @q: request queue.
> diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
> index 23230c1d031e..863b372d32aa 100644
> --- a/include/linux/blk-mq.h
> +++ b/include/linux/blk-mq.h
> @@ -5,6 +5,7 @@
>   #include <linux/blkdev.h>
>   #include <linux/sbitmap.h>
>   #include <linux/srcu.h>
> +#include <linux/rcupdate_wait.h>
>   
>   struct blk_mq_tags;
>   struct blk_flush_queue;
> @@ -170,6 +171,7 @@ struct blk_mq_hw_ctx {
>   	 */
>   	struct list_head	hctx_list;
>   
> +	struct rcu_synchronize	rcu_sync;
>   	/**
>   	 * @srcu: Sleepable RCU. Use as lock when type of the hardware queue is
>   	 * blocking (BLK_MQ_F_BLOCKING). Must be the last member - see also
> @@ -532,6 +534,8 @@ int blk_mq_map_queues(struct blk_mq_queue_map *qmap);
>   void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, int nr_hw_queues);
>   
>   void blk_mq_quiesce_queue_nowait(struct request_queue *q);
> +void blk_mq_quiesce_blocking_queue_async(struct request_queue *q);
> +void blk_mq_quiesce_blocking_queue_async_wait(struct request_queue *q);
>   
>   unsigned int blk_mq_rq_cpu(struct request *rq);
>   


* Re: [PATCH rfc v2 3/4] nvme-core: reduce io failover time
  2020-07-24 23:06 ` [PATCH rfc v2 3/4] nvme-core: reduce io failover time Sagi Grimberg
  2020-07-24 23:15   ` Sagi Grimberg
@ 2020-07-25  8:51   ` Chao Leng
  1 sibling, 0 replies; 9+ messages in thread
From: Chao Leng @ 2020-07-25  8:51 UTC (permalink / raw)
  To: Sagi Grimberg, linux-nvme, Christoph Hellwig, Keith Busch; +Cc: Jens Axboe


On 2020/7/25 7:06, Sagi Grimberg wrote:
> From: Chao Leng <lengchao@huawei.com>
>
> We test nvme over roce fail over with multipath when 1000 namespaces
> configured, io pause more than 10 seconds. The reason: nvme_stop_queues
> will quiesce all queues for each namespace when io timeout cause path
> error. Quiesce queue wait all ongoing dispatches finished through
> synchronize_rcu, need more than 10 milliseconds for each wait,
> thus io pause more than 10 seconds.
>
> To reduce io pause time, nvme_stop_queues use
> blk_mq_quiesce_queue_nowait to quiesce the queue, nvme_stop_queues wait
> all ongoing dispatches completed after all queues has been quiesced.
>
> Signed-off-by: Chao Leng <lengchao@huawei.com>
> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
> ---
>   drivers/nvme/host/core.c | 7 ++++++-
>   1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 2ae8caa4e25f..e3fae68f7de6 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -4548,8 +4548,13 @@ void nvme_stop_queues(struct nvme_ctrl *ctrl)
>   
>   	down_read(&ctrl->namespaces_rwsem);
>   	list_for_each_entry(ns, &ctrl->namespaces, list)
> -		blk_mq_quiesce_queue(ns->queue);
> +		blk_mq_quiesce_queue_nowait(ns->queue);
>   	up_read(&ctrl->namespaces_rwsem);
> +	/*
> +	 * BLK_MQ_F_BLOCKING drivers should never call us
> +	 */
> +	WARN_ON_ONCE(ctrl->tagset.flags & BLK_MQ_F_BLOCKING);

There is a small compile error: ctrl->tagset is a pointer. It should be:

WARN_ON_ONCE(ctrl->tagset->flags & BLK_MQ_F_BLOCKING);


> +	synchronize_rcu();
>   }
>   EXPORT_SYMBOL_GPL(nvme_stop_queues);
>   


* Re: [PATCH rfc v2 1/4] blk-mq: add async quiesce interface for blocking hw queues
  2020-07-25  7:03   ` Chao Leng
@ 2020-07-25 17:35     ` Sagi Grimberg
  0 siblings, 0 replies; 9+ messages in thread
From: Sagi Grimberg @ 2020-07-25 17:35 UTC (permalink / raw)
  To: Chao Leng, linux-nvme, Christoph Hellwig, Keith Busch; +Cc: Jens Axboe


> Looks great. One suggestion: SRCU could provide the batch sync
> mechanism itself, which may be more generic. Weakness: for the same
> srcu_struct, concurrent batch waiting is not supported. The code below
> is just for TINY_SRCU:

Doesn't make sense to me...

But what I was thinking is to not do this only for blocking queues; we
can easily use call_rcu for non-blocking queues as well...

i.e.
--
void blk_mq_quiesce_queue_async(struct request_queue *q)
{
         struct blk_mq_hw_ctx *hctx;
         unsigned int i;

         blk_mq_quiesce_queue_nowait(q);

         queue_for_each_hw_ctx(q, hctx, i) {
                 init_completion(&hctx->rcu_sync.completion);
                 init_rcu_head(&hctx->rcu_sync.head);
                 if (hctx->flags & BLK_MQ_F_BLOCKING)
                         call_srcu(hctx->srcu, &hctx->rcu_sync.head,
                                 wakeme_after_rcu);
                 else
                         call_rcu(&hctx->rcu_sync.head,
                                 wakeme_after_rcu);
         }
}
EXPORT_SYMBOL_GPL(blk_mq_quiesce_queue_async);

void blk_mq_quiesce_queue_async_wait(struct request_queue *q)
{
         struct blk_mq_hw_ctx *hctx;
         unsigned int i;

         queue_for_each_hw_ctx(q, hctx, i) {
                 wait_for_completion(&hctx->rcu_sync.completion);
                 destroy_rcu_head(&hctx->rcu_sync.head);
         }
}
EXPORT_SYMBOL_GPL(blk_mq_quiesce_queue_async_wait);
--

And then in nvme we always use that...

Thoughts?

