* per-tagset SRCU struct and quiesce v2
@ 2022-10-25 14:40 Christoph Hellwig
  2022-10-25 14:40 ` [PATCH 01/17] block: set the disk capacity to 0 in blk_mark_disk_dead Christoph Hellwig
                   ` (16 more replies)
  0 siblings, 17 replies; 47+ messages in thread
From: Christoph Hellwig @ 2022-10-25 14:40 UTC (permalink / raw)
  To: Jens Axboe, Keith Busch, Sagi Grimberg, Chao Leng
  Cc: Ming Lei, linux-nvme, linux-block

Hi all,

this series moves the SRCU struct used for quiescing to the tag_set
as the SRCU critical sections for dispatch take about the same time
on all queues anyway, and then adopts the series from Chao that provides
tagset-wide quiesce to use that infrastructure.

To do so it does a fair amount of cleanup in how nvme quiesces queues.
Even with that, nvme is still a bit of a mess in how it quiesces queues,
so further work is probably needed later.

Note: testing is a bit limited as I'm travelling.  Before this proceeds
it should get surprise removal testing for nvme-pci.

Changes since v1:
 - a whole bunch of new nvme cleanups
 - use the notify version of set_capacity in blk_mark_disk_dead
 - improve the documentation for blk_mq_wait_quiesce_done
 - dynamically allocate the srcu_struct in struct blk_mq_tag_set
 - minor spelling fixes

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH 01/17] block: set the disk capacity to 0 in blk_mark_disk_dead
  2022-10-25 14:40 per-tagset SRCU struct and quiesce v2 Christoph Hellwig
@ 2022-10-25 14:40 ` Christoph Hellwig
  2022-10-26 12:39   ` Sagi Grimberg
  2022-10-25 14:40 ` [PATCH 02/17] nvme-pci: refactor the tagset handling in nvme_reset_work Christoph Hellwig
                   ` (15 subsequent siblings)
  16 siblings, 1 reply; 47+ messages in thread
From: Christoph Hellwig @ 2022-10-25 14:40 UTC (permalink / raw)
  To: Jens Axboe, Keith Busch, Sagi Grimberg, Chao Leng
  Cc: Ming Lei, linux-nvme, linux-block

nvme and xen-blkfront are already doing this to stop buffered writes from
creating dirty pages that can't be written out later.  Move it to the
common code.

This also removes the comment about the ordering from nvme: bd_mutex is
not only gone entirely, it also had not been used to serialize updates to
the disk size for a long time before that, so the ordering requirement
documented there no longer applies.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Chao Leng <lengchao@huawei.com>
---
 block/genhd.c                | 5 +++++
 drivers/block/xen-blkfront.c | 1 -
 drivers/nvme/host/core.c     | 7 +------
 3 files changed, 6 insertions(+), 7 deletions(-)

diff --git a/block/genhd.c b/block/genhd.c
index 17b33c62423df..70a25e8a64466 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -555,6 +555,11 @@ void blk_mark_disk_dead(struct gendisk *disk)
 {
 	set_bit(GD_DEAD, &disk->state);
 	blk_queue_start_drain(disk->queue);
+
+	/*
+	 * Stop buffered writers from dirtying pages that can't be written out.
+	 */
+	set_capacity_and_notify(disk, 0);
 }
 EXPORT_SYMBOL_GPL(blk_mark_disk_dead);
 
diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index 35b9bcad9db90..b28489290323f 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -2129,7 +2129,6 @@ static void blkfront_closing(struct blkfront_info *info)
 	if (info->rq && info->gd) {
 		blk_mq_stop_hw_queues(info->rq);
 		blk_mark_disk_dead(info->gd);
-		set_capacity(info->gd, 0);
 	}
 
 	for_each_rinfo(info, rinfo, i) {
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index dc42206005855..a92d07137cbd8 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -5110,10 +5110,7 @@ static void nvme_stop_ns_queue(struct nvme_ns *ns)
 /*
  * Prepare a queue for teardown.
  *
- * This must forcibly unquiesce queues to avoid blocking dispatch, and only set
- * the capacity to 0 after that to avoid blocking dispatchers that may be
- * holding bd_butex.  This will end buffered writers dirtying pages that can't
- * be synced.
+ * This must forcibly unquiesce queues to avoid blocking dispatch.
  */
 static void nvme_set_queue_dying(struct nvme_ns *ns)
 {
@@ -5122,8 +5119,6 @@ static void nvme_set_queue_dying(struct nvme_ns *ns)
 
 	blk_mark_disk_dead(ns->disk);
 	nvme_start_ns_queue(ns);
-
-	set_capacity_and_notify(ns->disk, 0);
 }
 
 /**
-- 
2.30.2



* [PATCH 02/17] nvme-pci: refactor the tagset handling in nvme_reset_work
  2022-10-25 14:40 per-tagset SRCU struct and quiesce v2 Christoph Hellwig
  2022-10-25 14:40 ` [PATCH 01/17] block: set the disk capacity to 0 in blk_mark_disk_dead Christoph Hellwig
@ 2022-10-25 14:40 ` Christoph Hellwig
  2022-10-26 12:46   ` Sagi Grimberg
  2022-10-25 14:40 ` [PATCH 03/17] nvme-pci: don't warn about the lack of I/O queues for admin controllers Christoph Hellwig
                   ` (14 subsequent siblings)
  16 siblings, 1 reply; 47+ messages in thread
From: Christoph Hellwig @ 2022-10-25 14:40 UTC (permalink / raw)
  To: Jens Axboe, Keith Busch, Sagi Grimberg, Chao Leng
  Cc: Ming Lei, linux-nvme, linux-block

The code to create, update or delete a tagset and namespaces in
nvme_reset_work is a bit convoluted.  Refactor it with two high-level
conditionals, first probe vs reset and I/O queues vs no I/O queues,
to make the code flow clearer.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 drivers/nvme/host/pci.c | 46 ++++++++++++++++++++++++++---------------
 1 file changed, 29 insertions(+), 17 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 31e577b01257d..51513d263d77a 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2902,25 +2902,37 @@ static void nvme_reset_work(struct work_struct *work)
 	result = nvme_setup_io_queues(dev);
 	if (result)
 		goto out;
-
-	/*
-	 * Keep the controller around but remove all namespaces if we don't have
-	 * any working I/O queue.
-	 */
-	if (dev->online_queues < 2) {
-		dev_warn(dev->ctrl.device, "IO queues not created\n");
-		nvme_kill_queues(&dev->ctrl);
-		nvme_remove_namespaces(&dev->ctrl);
-		nvme_free_tagset(dev);
+
+	if (dev->ctrl.tagset) {
+		/*
+		 * This is a controller reset and we already have a tagset.
+		 * Freeze and update the number of I/O queues as those might have
+		 * changed.  If there are no I/O queues left after this reset,
+		 * keep the controller around but remove all namespaces.
+		 */
+		if (dev->online_queues > 1) {
+			nvme_start_queues(&dev->ctrl);
+			nvme_wait_freeze(&dev->ctrl);
+			nvme_pci_update_nr_queues(dev);
+			nvme_dbbuf_set(dev);
+			nvme_unfreeze(&dev->ctrl);
+		} else {
+			dev_warn(dev->ctrl.device, "IO queues lost\n");
+			nvme_kill_queues(&dev->ctrl);
+			nvme_remove_namespaces(&dev->ctrl);
+			nvme_free_tagset(dev);
+		}
 	} else {
-		nvme_start_queues(&dev->ctrl);
-		nvme_wait_freeze(&dev->ctrl);
-		if (!dev->ctrl.tagset)
+		/*
+		 * First probe.  Still allow the controller to show up even if
+		 * there are no namespaces.
+		 */
+		if (dev->online_queues > 1) {
 			nvme_pci_alloc_tag_set(dev);
-		else
-			nvme_pci_update_nr_queues(dev);
-		nvme_dbbuf_set(dev);
-		nvme_unfreeze(&dev->ctrl);
+			nvme_dbbuf_set(dev);
+		} else {
+			dev_warn(dev->ctrl.device, "IO queues not created\n");
+		}
 	}
 
 	/*
-- 
2.30.2



* [PATCH 03/17] nvme-pci: don't warn about the lack of I/O queues for admin controllers
  2022-10-25 14:40 per-tagset SRCU struct and quiesce v2 Christoph Hellwig
  2022-10-25 14:40 ` [PATCH 01/17] block: set the disk capacity to 0 in blk_mark_disk_dead Christoph Hellwig
  2022-10-25 14:40 ` [PATCH 02/17] nvme-pci: refactor the tagset handling in nvme_reset_work Christoph Hellwig
@ 2022-10-25 14:40 ` Christoph Hellwig
  2022-10-26 12:49   ` Sagi Grimberg
  2022-10-25 14:40 ` [PATCH 04/17] nvme: don't call nvme_kill_queues from nvme_remove_namespaces Christoph Hellwig
                   ` (13 subsequent siblings)
  16 siblings, 1 reply; 47+ messages in thread
From: Christoph Hellwig @ 2022-10-25 14:40 UTC (permalink / raw)
  To: Jens Axboe, Keith Busch, Sagi Grimberg, Chao Leng
  Cc: Ming Lei, linux-nvme, linux-block

Admin controllers never have I/O queues, so don't warn about that fact.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 drivers/nvme/host/pci.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 51513d263d77a..ec034d4dd9eff 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2917,7 +2917,8 @@ static void nvme_reset_work(struct work_struct *work)
 			nvme_dbbuf_set(dev);
 			nvme_unfreeze(&dev->ctrl);
 		} else {
-			dev_warn(dev->ctrl.device, "IO queues lost\n");
+			if (dev->ctrl.cntrltype != NVME_CTRL_ADMIN)
+				dev_warn(dev->ctrl.device, "IO queues lost\n");
 			nvme_kill_queues(&dev->ctrl);
 			nvme_remove_namespaces(&dev->ctrl);
 			nvme_free_tagset(dev);
@@ -2931,7 +2932,9 @@ static void nvme_reset_work(struct work_struct *work)
 			nvme_pci_alloc_tag_set(dev);
 			nvme_dbbuf_set(dev);
 		} else {
-			dev_warn(dev->ctrl.device, "IO queues not created\n");
+			if (dev->ctrl.cntrltype != NVME_CTRL_ADMIN)
+				dev_warn(dev->ctrl.device,
+					 "IO queues not created\n");
 		}
 	}
 
-- 
2.30.2



* [PATCH 04/17] nvme: don't call nvme_kill_queues from nvme_remove_namespaces
  2022-10-25 14:40 per-tagset SRCU struct and quiesce v2 Christoph Hellwig
                   ` (2 preceding siblings ...)
  2022-10-25 14:40 ` [PATCH 03/17] nvme-pci: don't warn about the lack of I/O queues for admin controllers Christoph Hellwig
@ 2022-10-25 14:40 ` Christoph Hellwig
  2022-10-25 17:43   ` Keith Busch
  2022-10-25 14:40 ` [PATCH 05/17] nvme: don't remove namespaces in nvme_passthru_end Christoph Hellwig
                   ` (12 subsequent siblings)
  16 siblings, 1 reply; 47+ messages in thread
From: Christoph Hellwig @ 2022-10-25 14:40 UTC (permalink / raw)
  To: Jens Axboe, Keith Busch, Sagi Grimberg, Chao Leng
  Cc: Ming Lei, linux-nvme, linux-block

Only the PCIe driver can mark a controller dead, and it does so in
exactly one spot just before calling nvme_remove_namespaces.  Move the
state-dependent call to nvme_kill_queues out of nvme_remove_namespaces
and into the one caller where this case can happen.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 drivers/nvme/host/core.c |  9 ---------
 drivers/nvme/host/pci.c  | 10 ++++++++++
 2 files changed, 10 insertions(+), 9 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index a92d07137cbd8..8eb46ca9f6acf 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -4560,15 +4560,6 @@ void nvme_remove_namespaces(struct nvme_ctrl *ctrl)
 	/* prevent racing with ns scanning */
 	flush_work(&ctrl->scan_work);
 
-	/*
-	 * The dead states indicates the controller was not gracefully
-	 * disconnected. In that case, we won't be able to flush any data while
-	 * removing the namespaces' disks; fail all the queues now to avoid
-	 * potentially having to clean up the failed sync later.
-	 */
-	if (ctrl->state == NVME_CTRL_DEAD)
-		nvme_kill_queues(ctrl);
-
 	/* this is a no-op when called from the controller reset handler */
 	nvme_change_ctrl_state(ctrl, NVME_CTRL_DELETING_NOIO);
 
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index ec034d4dd9eff..f971e96ffd3f6 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -3249,6 +3249,16 @@ static void nvme_remove(struct pci_dev *pdev)
 
 	flush_work(&dev->ctrl.reset_work);
 	nvme_stop_ctrl(&dev->ctrl);
+
+	/*
+	 * The dead states indicates the controller was not gracefully
+	 * disconnected. In that case, we won't be able to flush any data while
+	 * removing the namespaces' disks; fail all the queues now to avoid
+	 * potentially having to clean up the failed sync later.
+	 */
+	if (dev->ctrl.state == NVME_CTRL_DEAD)
+		nvme_kill_queues(&dev->ctrl);
+
 	nvme_remove_namespaces(&dev->ctrl);
 	nvme_dev_disable(dev, true);
 	nvme_remove_attrs(dev);
-- 
2.30.2



* [PATCH 05/17] nvme: don't remove namespaces in nvme_passthru_end
  2022-10-25 14:40 per-tagset SRCU struct and quiesce v2 Christoph Hellwig
                   ` (3 preceding siblings ...)
  2022-10-25 14:40 ` [PATCH 04/17] nvme: don't call nvme_kill_queues from nvme_remove_namespaces Christoph Hellwig
@ 2022-10-25 14:40 ` Christoph Hellwig
  2022-10-26 12:50   ` Sagi Grimberg
  2022-10-25 14:40 ` [PATCH 06/17] nvme: remove the NVME_NS_DEAD check in nvme_remove_invalid_namespaces Christoph Hellwig
                   ` (11 subsequent siblings)
  16 siblings, 1 reply; 47+ messages in thread
From: Christoph Hellwig @ 2022-10-25 14:40 UTC (permalink / raw)
  To: Jens Axboe, Keith Busch, Sagi Grimberg, Chao Leng
  Cc: Ming Lei, linux-nvme, linux-block

The call to nvme_remove_invalid_namespaces made sense when
nvme_passthru_end revalidated all namespaces and had to remove those that
didn't exist any more.  Since we don't revalidate from nvme_passthru_end
now, this call is entirely spurious.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 drivers/nvme/host/core.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 8eb46ca9f6acf..50b440e1a0ca0 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1118,7 +1118,6 @@ void nvme_passthru_end(struct nvme_ctrl *ctrl, u32 effects,
 		nvme_unfreeze(ctrl);
 		nvme_mpath_unfreeze(ctrl->subsys);
 		mutex_unlock(&ctrl->subsys->lock);
-		nvme_remove_invalid_namespaces(ctrl, NVME_NSID_ALL);
 		mutex_unlock(&ctrl->scan_lock);
 	}
 	if (effects & NVME_CMD_EFFECTS_CCC)
-- 
2.30.2



* [PATCH 06/17] nvme: remove the NVME_NS_DEAD check in nvme_remove_invalid_namespaces
  2022-10-25 14:40 per-tagset SRCU struct and quiesce v2 Christoph Hellwig
                   ` (4 preceding siblings ...)
  2022-10-25 14:40 ` [PATCH 05/17] nvme: don't remove namespaces in nvme_passthru_end Christoph Hellwig
@ 2022-10-25 14:40 ` Christoph Hellwig
  2022-10-26 12:50   ` Sagi Grimberg
  2022-10-25 14:40 ` [PATCH 07/17] nvme: remove the NVME_NS_DEAD check in nvme_validate_ns Christoph Hellwig
                   ` (10 subsequent siblings)
  16 siblings, 1 reply; 47+ messages in thread
From: Christoph Hellwig @ 2022-10-25 14:40 UTC (permalink / raw)
  To: Jens Axboe, Keith Busch, Sagi Grimberg, Chao Leng
  Cc: Ming Lei, linux-nvme, linux-block

The NVME_NS_DEAD check only made sense when we revalidated namespaces
in nvme_passthru_end for commands that affected the namespace inventory.
These days NVME_NS_DEAD is only set during reset or when tearing down
namespaces, and we always remove all namespaces right after that.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 drivers/nvme/host/core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 50b440e1a0ca0..2ec838e608509 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -4407,7 +4407,7 @@ static void nvme_remove_invalid_namespaces(struct nvme_ctrl *ctrl,
 
 	down_write(&ctrl->namespaces_rwsem);
 	list_for_each_entry_safe(ns, next, &ctrl->namespaces, list) {
-		if (ns->head->ns_id > nsid || test_bit(NVME_NS_DEAD, &ns->flags))
+		if (ns->head->ns_id > nsid)
 			list_move_tail(&ns->list, &rm_list);
 	}
 	up_write(&ctrl->namespaces_rwsem);
-- 
2.30.2



* [PATCH 07/17] nvme: remove the NVME_NS_DEAD check in nvme_validate_ns
  2022-10-25 14:40 per-tagset SRCU struct and quiesce v2 Christoph Hellwig
                   ` (5 preceding siblings ...)
  2022-10-25 14:40 ` [PATCH 06/17] nvme: remove the NVME_NS_DEAD check in nvme_remove_invalid_namespaces Christoph Hellwig
@ 2022-10-25 14:40 ` Christoph Hellwig
  2022-10-26 12:52   ` Sagi Grimberg
  2022-10-25 14:40 ` [PATCH 08/17] nvme: don't unquiesce the admin queue in nvme_kill_queues Christoph Hellwig
                   ` (9 subsequent siblings)
  16 siblings, 1 reply; 47+ messages in thread
From: Christoph Hellwig @ 2022-10-25 14:40 UTC (permalink / raw)
  To: Jens Axboe, Keith Busch, Sagi Grimberg, Chao Leng
  Cc: Ming Lei, linux-nvme, linux-block

We never rescan namespaces after marking them dead, so this check is dead
now.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 drivers/nvme/host/core.c | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 2ec838e608509..99eabbe7fac98 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -4333,10 +4333,6 @@ static void nvme_validate_ns(struct nvme_ns *ns, struct nvme_ns_info *info)
 {
 	int ret = NVME_SC_INVALID_NS | NVME_SC_DNR;
 
-	if (test_bit(NVME_NS_DEAD, &ns->flags))
-		goto out;
-
-	ret = NVME_SC_INVALID_NS | NVME_SC_DNR;
 	if (!nvme_ns_ids_equal(&ns->head->ids, &info->ids)) {
 		dev_err(ns->ctrl->device,
 			"identifiers changed for nsid %d\n", ns->head->ns_id);
-- 
2.30.2



* [PATCH 08/17] nvme: don't unquiesce the admin queue in nvme_kill_queues
  2022-10-25 14:40 per-tagset SRCU struct and quiesce v2 Christoph Hellwig
                   ` (6 preceding siblings ...)
  2022-10-25 14:40 ` [PATCH 07/17] nvme: remove the NVME_NS_DEAD check in nvme_validate_ns Christoph Hellwig
@ 2022-10-25 14:40 ` Christoph Hellwig
  2022-10-26 12:53   ` Sagi Grimberg
  2022-10-25 14:40 ` [PATCH 09/17] nvme: don't unquiesce the I/O queues " Christoph Hellwig
                   ` (8 subsequent siblings)
  16 siblings, 1 reply; 47+ messages in thread
From: Christoph Hellwig @ 2022-10-25 14:40 UTC (permalink / raw)
  To: Jens Axboe, Keith Busch, Sagi Grimberg, Chao Leng
  Cc: Ming Lei, linux-nvme, linux-block

There are 4 callers of nvme_kill_queues:

 1) nvme_reset_work explicitly calls nvme_start_admin_queue toward the
    beginning of the function.  The extra call to nvme_start_admin_queue
    in nvme_kill_queues in this case won't do anything as
    NVME_CTRL_ADMIN_Q_STOPPED will already be cleared.
 2) nvme_remove calls nvme_dev_disable with the shutdown flag set to true
    at the very beginning of the function if the PCIe device was not
    present, which is the precondition for the call to nvme_kill_queues.
    nvme_dev_disable already calls nvme_start_admin_queue toward the
    end of the function when the shutdown flag is set to true, so the
    admin queue is already enabled at this point.
 3) nvme_remove_dead_ctrl schedules a workqueue to unbind the driver,
    which will end up in nvme_remove, which calls nvme_dev_disable with
    the shutdown flag.  This case will call nvme_start_admin_queue a bit
    later than before.
 4) apple_nvme_remove uses the same sequence as nvme_remove_dead_ctrl
    above.

So it is safe to just remove the call to nvme_start_admin_queue in
nvme_kill_queues without replacement.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 drivers/nvme/host/core.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 99eabbe7fac98..d4d25f1dd6dba 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -5121,9 +5121,6 @@ void nvme_kill_queues(struct nvme_ctrl *ctrl)
 	down_read(&ctrl->namespaces_rwsem);
 
 	/* Forcibly unquiesce queues to avoid blocking dispatch */
-	if (ctrl->admin_q && !blk_queue_dying(ctrl->admin_q))
-		nvme_start_admin_queue(ctrl);
-
 	list_for_each_entry(ns, &ctrl->namespaces, list)
 		nvme_set_queue_dying(ns);
 
-- 
2.30.2



* [PATCH 09/17] nvme: don't unquiesce the I/O queues in nvme_kill_queues
  2022-10-25 14:40 per-tagset SRCU struct and quiesce v2 Christoph Hellwig
                   ` (7 preceding siblings ...)
  2022-10-25 14:40 ` [PATCH 08/17] nvme: don't unquiesce the admin queue in nvme_kill_queues Christoph Hellwig
@ 2022-10-25 14:40 ` Christoph Hellwig
  2022-10-26 12:54   ` Sagi Grimberg
  2022-10-25 14:40 ` [PATCH 10/17] nvme-pci: mark the namespaces dead earlier in nvme_remove Christoph Hellwig
                   ` (7 subsequent siblings)
  16 siblings, 1 reply; 47+ messages in thread
From: Christoph Hellwig @ 2022-10-25 14:40 UTC (permalink / raw)
  To: Jens Axboe, Keith Busch, Sagi Grimberg, Chao Leng
  Cc: Ming Lei, linux-nvme, linux-block

nvme_kill_queues does two things:

 1) mark the gendisk of all namespaces dead
 2) unquiesce all I/O queues

These used to be intertwined due to block layer issues, but aren't
any more.  So move the unquiescing of the I/O queues into the callers,
and rename the rest of the function to the now more descriptive
nvme_mark_namespaces_dead.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 drivers/nvme/host/apple.c |  3 ++-
 drivers/nvme/host/core.c  | 31 ++++---------------------------
 drivers/nvme/host/nvme.h  |  3 +--
 drivers/nvme/host/pci.c   | 12 ++++++++----
 4 files changed, 15 insertions(+), 34 deletions(-)

diff --git a/drivers/nvme/host/apple.c b/drivers/nvme/host/apple.c
index ff8b083dc5c6d..14bee207316a0 100644
--- a/drivers/nvme/host/apple.c
+++ b/drivers/nvme/host/apple.c
@@ -1153,7 +1153,8 @@ static void apple_nvme_reset_work(struct work_struct *work)
 	nvme_change_ctrl_state(&anv->ctrl, NVME_CTRL_DELETING);
 	nvme_get_ctrl(&anv->ctrl);
 	apple_nvme_disable(anv, false);
-	nvme_kill_queues(&anv->ctrl);
+	nvme_mark_namespaces_dead(&anv->ctrl);
+	nvme_start_queues(&anv->ctrl);
 	if (!queue_work(nvme_wq, &anv->remove_work))
 		nvme_put_ctrl(&anv->ctrl);
 }
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index d4d25f1dd6dba..00612e6857295 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -5093,40 +5093,17 @@ static void nvme_stop_ns_queue(struct nvme_ns *ns)
 		blk_mq_wait_quiesce_done(ns->queue);
 }
 
-/*
- * Prepare a queue for teardown.
- *
- * This must forcibly unquiesce queues to avoid blocking dispatch.
- */
-static void nvme_set_queue_dying(struct nvme_ns *ns)
-{
-	if (test_and_set_bit(NVME_NS_DEAD, &ns->flags))
-		return;
-
-	blk_mark_disk_dead(ns->disk);
-	nvme_start_ns_queue(ns);
-}
-
-/**
- * nvme_kill_queues(): Ends all namespace queues
- * @ctrl: the dead controller that needs to end
- *
- * Call this function when the driver determines it is unable to get the
- * controller in a state capable of servicing IO.
- */
-void nvme_kill_queues(struct nvme_ctrl *ctrl)
+/* let I/O to all namespaces fail in preparation for surprise removal */
+void nvme_mark_namespaces_dead(struct nvme_ctrl *ctrl)
 {
 	struct nvme_ns *ns;
 
 	down_read(&ctrl->namespaces_rwsem);
-
-	/* Forcibly unquiesce queues to avoid blocking dispatch */
 	list_for_each_entry(ns, &ctrl->namespaces, list)
-		nvme_set_queue_dying(ns);
-
+		blk_mark_disk_dead(ns->disk);
 	up_read(&ctrl->namespaces_rwsem);
 }
-EXPORT_SYMBOL_GPL(nvme_kill_queues);
+EXPORT_SYMBOL_GPL(nvme_mark_namespaces_dead);
 
 void nvme_unfreeze(struct nvme_ctrl *ctrl)
 {
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index a29877217ee65..1cd7168002438 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -483,7 +483,6 @@ struct nvme_ns {
 	unsigned long features;
 	unsigned long flags;
 #define NVME_NS_REMOVING	0
-#define NVME_NS_DEAD     	1
 #define NVME_NS_ANA_PENDING	2
 #define NVME_NS_FORCE_RO	3
 #define NVME_NS_READY		4
@@ -758,7 +757,7 @@ void nvme_stop_queues(struct nvme_ctrl *ctrl);
 void nvme_start_queues(struct nvme_ctrl *ctrl);
 void nvme_stop_admin_queue(struct nvme_ctrl *ctrl);
 void nvme_start_admin_queue(struct nvme_ctrl *ctrl);
-void nvme_kill_queues(struct nvme_ctrl *ctrl);
+void nvme_mark_namespaces_dead(struct nvme_ctrl *ctrl);
 void nvme_sync_queues(struct nvme_ctrl *ctrl);
 void nvme_sync_io_queues(struct nvme_ctrl *ctrl);
 void nvme_unfreeze(struct nvme_ctrl *ctrl);
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index f971e96ffd3f6..8ab54857cfd50 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2793,7 +2793,8 @@ static void nvme_remove_dead_ctrl(struct nvme_dev *dev)
 	nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_DELETING);
 	nvme_get_ctrl(&dev->ctrl);
 	nvme_dev_disable(dev, false);
-	nvme_kill_queues(&dev->ctrl);
+	nvme_mark_namespaces_dead(&dev->ctrl);
+	nvme_start_queues(&dev->ctrl);
 	if (!queue_work(nvme_wq, &dev->remove_work))
 		nvme_put_ctrl(&dev->ctrl);
 }
@@ -2919,7 +2920,8 @@ static void nvme_reset_work(struct work_struct *work)
 		} else {
 			if (dev->ctrl.cntrltype != NVME_CTRL_ADMIN)
 				dev_warn(dev->ctrl.device, "IO queues lost\n");
-			nvme_kill_queues(&dev->ctrl);
+			nvme_mark_namespaces_dead(&dev->ctrl);
+			nvme_start_queues(&dev->ctrl);
 			nvme_remove_namespaces(&dev->ctrl);
 			nvme_free_tagset(dev);
 		}
@@ -3256,8 +3258,10 @@ static void nvme_remove(struct pci_dev *pdev)
 	 * removing the namespaces' disks; fail all the queues now to avoid
 	 * potentially having to clean up the failed sync later.
 	 */
-	if (dev->ctrl.state == NVME_CTRL_DEAD)
-		nvme_kill_queues(&dev->ctrl);
+	if (dev->ctrl.state == NVME_CTRL_DEAD) {
+		nvme_mark_namespaces_dead(&dev->ctrl);
+		nvme_start_queues(&dev->ctrl);
+	}
 
 	nvme_remove_namespaces(&dev->ctrl);
 	nvme_dev_disable(dev, true);
-- 
2.30.2



* [PATCH 10/17] nvme-pci: mark the namespaces dead earlier in nvme_remove
  2022-10-25 14:40 per-tagset SRCU struct and quiesce v2 Christoph Hellwig
                   ` (8 preceding siblings ...)
  2022-10-25 14:40 ` [PATCH 09/17] nvme: don't unquiesce the I/O queues " Christoph Hellwig
@ 2022-10-25 14:40 ` Christoph Hellwig
  2022-10-25 18:53   ` Keith Busch
  2022-10-26 12:55   ` Sagi Grimberg
  2022-10-25 14:40 ` [PATCH 11/17] nvme-pci: don't unquiesce the I/O queues in nvme_remove_dead_ctrl Christoph Hellwig
                   ` (6 subsequent siblings)
  16 siblings, 2 replies; 47+ messages in thread
From: Christoph Hellwig @ 2022-10-25 14:40 UTC (permalink / raw)
  To: Jens Axboe, Keith Busch, Sagi Grimberg, Chao Leng
  Cc: Ming Lei, linux-nvme, linux-block

When we have a non-present controller there is no reason to delay
marking the namespaces dead, so do it ASAP.  Also remove the superfluous
call to nvme_start_queues, as nvme_dev_disable already did that for us.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 drivers/nvme/host/pci.c | 18 ++++++------------
 1 file changed, 6 insertions(+), 12 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 8ab54857cfd50..bef98f6e1396c 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -3244,25 +3244,19 @@ static void nvme_remove(struct pci_dev *pdev)
 	nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_DELETING);
 	pci_set_drvdata(pdev, NULL);
 
+	/*
+	 * Mark the namespaces dead as we can't flush the data, and disable the
+	 * controller ASAP as we can't shut it down properly if it was surprise
+	 * removed.
+	 */
 	if (!pci_device_is_present(pdev)) {
 		nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_DEAD);
+		nvme_mark_namespaces_dead(&dev->ctrl);
 		nvme_dev_disable(dev, true);
 	}
 
 	flush_work(&dev->ctrl.reset_work);
 	nvme_stop_ctrl(&dev->ctrl);
-
-	/*
-	 * The dead states indicates the controller was not gracefully
-	 * disconnected. In that case, we won't be able to flush any data while
-	 * removing the namespaces' disks; fail all the queues now to avoid
-	 * potentially having to clean up the failed sync later.
-	 */
-	if (dev->ctrl.state == NVME_CTRL_DEAD) {
-		nvme_mark_namespaces_dead(&dev->ctrl);
-		nvme_start_queues(&dev->ctrl);
-	}
-
 	nvme_remove_namespaces(&dev->ctrl);
 	nvme_dev_disable(dev, true);
 	nvme_remove_attrs(dev);
-- 
2.30.2



* [PATCH 11/17] nvme-pci: don't unquiesce the I/O queues in nvme_remove_dead_ctrl
  2022-10-25 14:40 per-tagset SRCU struct and quiesce v2 Christoph Hellwig
                   ` (9 preceding siblings ...)
  2022-10-25 14:40 ` [PATCH 10/17] nvme-pci: mark the namespaces dead earlier in nvme_remove Christoph Hellwig
@ 2022-10-25 14:40 ` Christoph Hellwig
  2022-10-26  8:34   ` Chao Leng
  2022-10-25 14:40 ` [PATCH 12/17] nvme-pci: don't unquiesce the I/O queues in apple_nvme_reset_work Christoph Hellwig
                   ` (5 subsequent siblings)
  16 siblings, 1 reply; 47+ messages in thread
From: Christoph Hellwig @ 2022-10-25 14:40 UTC (permalink / raw)
  To: Jens Axboe, Keith Busch, Sagi Grimberg, Chao Leng
  Cc: Ming Lei, linux-nvme, linux-block

nvme_remove_dead_ctrl schedules nvme_remove to be called, which will
call nvme_dev_disable and unquiesce the I/O queues.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 drivers/nvme/host/pci.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index bef98f6e1396c..3a26c9b2bf454 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2794,7 +2794,6 @@ static void nvme_remove_dead_ctrl(struct nvme_dev *dev)
 	nvme_get_ctrl(&dev->ctrl);
 	nvme_dev_disable(dev, false);
 	nvme_mark_namespaces_dead(&dev->ctrl);
-	nvme_start_queues(&dev->ctrl);
 	if (!queue_work(nvme_wq, &dev->remove_work))
 		nvme_put_ctrl(&dev->ctrl);
 }
-- 
2.30.2



* [PATCH 12/17] nvme-pci: don't unquiesce the I/O queues in apple_nvme_reset_work
  2022-10-25 14:40 per-tagset SRCU struct and quiesce v2 Christoph Hellwig
                   ` (10 preceding siblings ...)
  2022-10-25 14:40 ` [PATCH 11/17] nvme-pci: don't unquiesce the I/O queues in nvme_remove_dead_ctrl Christoph Hellwig
@ 2022-10-25 14:40 ` Christoph Hellwig
  2022-10-26  8:37   ` Chao Leng
  2022-10-26 12:58   ` Sagi Grimberg
  2022-10-25 14:40 ` [PATCH 13/17] blk-mq: skip non-mq queues in blk_mq_quiesce_queue Christoph Hellwig
                   ` (4 subsequent siblings)
  16 siblings, 2 replies; 47+ messages in thread
From: Christoph Hellwig @ 2022-10-25 14:40 UTC (permalink / raw)
  To: Jens Axboe, Keith Busch, Sagi Grimberg, Chao Leng
  Cc: Ming Lei, linux-nvme, linux-block

apple_nvme_reset_work schedules apple_nvme_remove to be called, which
will call apple_nvme_disable and unquiesce the I/O queues.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 drivers/nvme/host/apple.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/nvme/host/apple.c b/drivers/nvme/host/apple.c
index 14bee207316a0..44e7daf93e19c 100644
--- a/drivers/nvme/host/apple.c
+++ b/drivers/nvme/host/apple.c
@@ -1154,7 +1154,6 @@ static void apple_nvme_reset_work(struct work_struct *work)
 	nvme_get_ctrl(&anv->ctrl);
 	apple_nvme_disable(anv, false);
 	nvme_mark_namespaces_dead(&anv->ctrl);
-	nvme_start_queues(&anv->ctrl);
 	if (!queue_work(nvme_wq, &anv->remove_work))
 		nvme_put_ctrl(&anv->ctrl);
 }
-- 
2.30.2



* [PATCH 13/17] blk-mq: skip non-mq queues in blk_mq_quiesce_queue
  2022-10-25 14:40 per-tagset SRCU struct and quiesce v2 Christoph Hellwig
                   ` (11 preceding siblings ...)
  2022-10-25 14:40 ` [PATCH 12/17] nvme-pci: don't unquiesce the I/O queues in apple_nvme_reset_work Christoph Hellwig
@ 2022-10-25 14:40 ` Christoph Hellwig
  2022-10-25 14:40 ` [PATCH 14/17] blk-mq: move the srcu_struct used for quiescing to the tagset Christoph Hellwig
                   ` (3 subsequent siblings)
  16 siblings, 0 replies; 47+ messages in thread
From: Christoph Hellwig @ 2022-10-25 14:40 UTC (permalink / raw)
  To: Jens Axboe, Keith Busch, Sagi Grimberg, Chao Leng
  Cc: Ming Lei, linux-nvme, linux-block, Hannes Reinecke

For submit_bio based queues there is no (S)RCU critical section during
I/O submission and thus nothing to wait for in blk_mq_wait_quiesce_done,
so skip doing any synchronization.  No non-mq driver should be calling
this, but for now we have core callers that unconditionally call into it.
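The control flow being changed here can be modeled in plain userspace C. This is an illustrative sketch only, not the kernel API: the struct and helper names are stand-ins, and the counter stands in for the (S)RCU grace period that only mq queues have to pay for.

```c
#include <stdbool.h>

/* Toy model of the patched quiesce path: only blk-mq queues carry an
 * (S)RCU domain worth synchronizing against, so only they pay for the
 * grace period.  All names are illustrative stand-ins. */
struct toy_queue {
	bool is_mq;	/* queue_is_mq() equivalent */
	bool quiesced;	/* QUEUE_FLAG_QUIESCED equivalent */
	int syncs;	/* counts synchronize_(s)rcu() invocations */
};

static void toy_quiesce_nowait(struct toy_queue *q)
{
	q->quiesced = true;
}

static void toy_wait_quiesce_done(struct toy_queue *q)
{
	q->syncs++;	/* stands in for synchronize_(s)rcu() */
}

/* Mirrors the patched blk_mq_quiesce_queue(): always mark the queue
 * quiesced, but skip the wait for non-mq (submit_bio based) queues. */
static void toy_quiesce_queue(struct toy_queue *q)
{
	toy_quiesce_nowait(q);
	if (q->is_mq)
		toy_wait_quiesce_done(q);
}
```

A non-mq queue still ends up flagged as quiesced; it simply never triggers a synchronization it has no readers for.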

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
---
 block/blk-mq.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 098432d3caf1c..802fdd3d737e3 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -280,7 +280,9 @@ EXPORT_SYMBOL_GPL(blk_mq_wait_quiesce_done);
 void blk_mq_quiesce_queue(struct request_queue *q)
 {
 	blk_mq_quiesce_queue_nowait(q);
-	blk_mq_wait_quiesce_done(q);
+	/* nothing to wait for non-mq queues */
+	if (queue_is_mq(q))
+		blk_mq_wait_quiesce_done(q);
 }
 EXPORT_SYMBOL_GPL(blk_mq_quiesce_queue);
 
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH 14/17] blk-mq: move the srcu_struct used for quiescing to the tagset
  2022-10-25 14:40 per-tagset SRCU struct and quiesce v2 Christoph Hellwig
                   ` (12 preceding siblings ...)
  2022-10-25 14:40 ` [PATCH 13/17] blk-mq: skip non-mq queues in blk_mq_quiesce_queue Christoph Hellwig
@ 2022-10-25 14:40 ` Christoph Hellwig
  2022-10-26  8:48   ` Chao Leng
  2022-10-25 14:40 ` [PATCH 15/17] blk-mq: pass a tagset to blk_mq_wait_quiesce_done Christoph Hellwig
                   ` (2 subsequent siblings)
  16 siblings, 1 reply; 47+ messages in thread
From: Christoph Hellwig @ 2022-10-25 14:40 UTC (permalink / raw)
  To: Jens Axboe, Keith Busch, Sagi Grimberg, Chao Leng
  Cc: Ming Lei, linux-nvme, linux-block, Hannes Reinecke

All I/O submissions have fairly similar latencies, and a tagset-wide
quiesce is a fairly common operation, so there is no need for a
per-queue srcu_struct: move it to the tag_set instead.
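The ownership change can be sketched in userspace C. This is a hypothetical model, not the kernel code: `toy_srcu` stands in for `struct srcu_struct`, and the alloc/free pair mirrors the new per-tagset allocation done only for blocking tagsets.

```c
#include <stdlib.h>

#define TOY_F_BLOCKING 0x1		/* stands in for BLK_MQ_F_BLOCKING */

struct toy_srcu { int unused; };	/* stand-in for struct srcu_struct */

struct toy_tag_set {
	unsigned int flags;
	struct toy_srcu *srcu;	/* one domain shared by all queues */
};

struct toy_queue {
	struct toy_tag_set *tag_set;	/* queues only point at the set */
};

/* Mirrors blk_mq_alloc_tag_set() after this patch: the SRCU domain is
 * allocated once per tagset, and only for blocking tagsets. */
static int toy_alloc_tag_set(struct toy_tag_set *set)
{
	if (set->flags & TOY_F_BLOCKING) {
		set->srcu = malloc(sizeof(*set->srcu));
		if (!set->srcu)
			return -1;
	}
	return 0;
}

static void toy_free_tag_set(struct toy_tag_set *set)
{
	if (set->flags & TOY_F_BLOCKING) {
		free(set->srcu);
		set->srcu = NULL;
	}
}
```

Every request queue attached to a blocking tagset now dispatches under the same SRCU domain, which is what makes a single grace period cover the whole set.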

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Chao Leng <lengchao@huawei.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Hannes Reinecke <hare@suse.de>
---
 block/blk-core.c       | 27 +++++----------------------
 block/blk-mq.c         | 33 +++++++++++++++++++++++++--------
 block/blk-mq.h         | 14 +++++++-------
 block/blk-sysfs.c      |  9 ++-------
 block/blk.h            |  9 +--------
 block/genhd.c          |  2 +-
 include/linux/blk-mq.h |  4 ++++
 include/linux/blkdev.h |  9 ---------
 8 files changed, 45 insertions(+), 62 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 17667159482e0..3a2ed8dadf738 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -65,7 +65,6 @@ DEFINE_IDA(blk_queue_ida);
  * For queue allocation
  */
 struct kmem_cache *blk_requestq_cachep;
-struct kmem_cache *blk_requestq_srcu_cachep;
 
 /*
  * Controlling structure to kblockd
@@ -373,26 +372,20 @@ static void blk_timeout_work(struct work_struct *work)
 {
 }
 
-struct request_queue *blk_alloc_queue(int node_id, bool alloc_srcu)
+struct request_queue *blk_alloc_queue(int node_id)
 {
 	struct request_queue *q;
 
-	q = kmem_cache_alloc_node(blk_get_queue_kmem_cache(alloc_srcu),
-			GFP_KERNEL | __GFP_ZERO, node_id);
+	q = kmem_cache_alloc_node(blk_requestq_cachep, GFP_KERNEL | __GFP_ZERO,
+				  node_id);
 	if (!q)
 		return NULL;
 
-	if (alloc_srcu) {
-		blk_queue_flag_set(QUEUE_FLAG_HAS_SRCU, q);
-		if (init_srcu_struct(q->srcu) != 0)
-			goto fail_q;
-	}
-
 	q->last_merge = NULL;
 
 	q->id = ida_alloc(&blk_queue_ida, GFP_KERNEL);
 	if (q->id < 0)
-		goto fail_srcu;
+		goto fail_q;
 
 	q->stats = blk_alloc_queue_stats();
 	if (!q->stats)
@@ -435,11 +428,8 @@ struct request_queue *blk_alloc_queue(int node_id, bool alloc_srcu)
 	blk_free_queue_stats(q->stats);
 fail_id:
 	ida_free(&blk_queue_ida, q->id);
-fail_srcu:
-	if (alloc_srcu)
-		cleanup_srcu_struct(q->srcu);
 fail_q:
-	kmem_cache_free(blk_get_queue_kmem_cache(alloc_srcu), q);
+	kmem_cache_free(blk_requestq_cachep, q);
 	return NULL;
 }
 
@@ -1184,9 +1174,6 @@ int __init blk_dev_init(void)
 			sizeof_field(struct request, cmd_flags));
 	BUILD_BUG_ON(REQ_OP_BITS + REQ_FLAG_BITS > 8 *
 			sizeof_field(struct bio, bi_opf));
-	BUILD_BUG_ON(ALIGN(offsetof(struct request_queue, srcu),
-			   __alignof__(struct request_queue)) !=
-		     sizeof(struct request_queue));
 
 	/* used for unplugging and affects IO latency/throughput - HIGHPRI */
 	kblockd_workqueue = alloc_workqueue("kblockd",
@@ -1197,10 +1184,6 @@ int __init blk_dev_init(void)
 	blk_requestq_cachep = kmem_cache_create("request_queue",
 			sizeof(struct request_queue), 0, SLAB_PANIC, NULL);
 
-	blk_requestq_srcu_cachep = kmem_cache_create("request_queue_srcu",
-			sizeof(struct request_queue) +
-			sizeof(struct srcu_struct), 0, SLAB_PANIC, NULL);
-
 	blk_debugfs_root = debugfs_create_dir("block", NULL);
 
 	return 0;
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 802fdd3d737e3..6cbf34921e33f 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -261,8 +261,8 @@ EXPORT_SYMBOL_GPL(blk_mq_quiesce_queue_nowait);
  */
 void blk_mq_wait_quiesce_done(struct request_queue *q)
 {
-	if (blk_queue_has_srcu(q))
-		synchronize_srcu(q->srcu);
+	if (q->tag_set->flags & BLK_MQ_F_BLOCKING)
+		synchronize_srcu(q->tag_set->srcu);
 	else
 		synchronize_rcu();
 }
@@ -3971,7 +3971,7 @@ static struct request_queue *blk_mq_init_queue_data(struct blk_mq_tag_set *set,
 	struct request_queue *q;
 	int ret;
 
-	q = blk_alloc_queue(set->numa_node, set->flags & BLK_MQ_F_BLOCKING);
+	q = blk_alloc_queue(set->numa_node);
 	if (!q)
 		return ERR_PTR(-ENOMEM);
 	q->queuedata = queuedata;
@@ -4138,9 +4138,6 @@ static void blk_mq_update_poll_flag(struct request_queue *q)
 int blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
 		struct request_queue *q)
 {
-	WARN_ON_ONCE(blk_queue_has_srcu(q) !=
-			!!(set->flags & BLK_MQ_F_BLOCKING));
-
 	/* mark the queue as mq asap */
 	q->mq_ops = set->ops;
 
@@ -4398,9 +4395,19 @@ int blk_mq_alloc_tag_set(struct blk_mq_tag_set *set)
 	 */
 	if (set->nr_maps == 1 && set->nr_hw_queues > nr_cpu_ids)
 		set->nr_hw_queues = nr_cpu_ids;
+ 
+	if (set->flags & BLK_MQ_F_BLOCKING) {
+		set->srcu = kmalloc(sizeof(*set->srcu), GFP_KERNEL);
+		if (!set->srcu)
+			return -ENOMEM;
+		ret = init_srcu_struct(set->srcu);
+		if (ret)
+			goto out_free_srcu;
+	}
 
-	if (blk_mq_alloc_tag_set_tags(set, set->nr_hw_queues) < 0)
-		return -ENOMEM;
+	ret = blk_mq_alloc_tag_set_tags(set, set->nr_hw_queues);
+	if (ret)
+		goto out_cleanup_srcu;
 
 	ret = -ENOMEM;
 	for (i = 0; i < set->nr_maps; i++) {
@@ -4430,6 +4437,12 @@ int blk_mq_alloc_tag_set(struct blk_mq_tag_set *set)
 	}
 	kfree(set->tags);
 	set->tags = NULL;
+out_cleanup_srcu:
+	if (set->flags & BLK_MQ_F_BLOCKING)
+		cleanup_srcu_struct(set->srcu);
+out_free_srcu:
+	if (set->flags & BLK_MQ_F_BLOCKING)
+		kfree(set->srcu);
 	return ret;
 }
 EXPORT_SYMBOL(blk_mq_alloc_tag_set);
@@ -4469,6 +4482,10 @@ void blk_mq_free_tag_set(struct blk_mq_tag_set *set)
 
 	kfree(set->tags);
 	set->tags = NULL;
+	if (set->flags & BLK_MQ_F_BLOCKING) {
+		cleanup_srcu_struct(set->srcu);
+		kfree(set->srcu);
+	}
 }
 EXPORT_SYMBOL(blk_mq_free_tag_set);
 
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 0b2870839cdd6..ef59fee62780d 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -377,17 +377,17 @@ static inline bool hctx_may_queue(struct blk_mq_hw_ctx *hctx,
 /* run the code block in @dispatch_ops with rcu/srcu read lock held */
 #define __blk_mq_run_dispatch_ops(q, check_sleep, dispatch_ops)	\
 do {								\
-	if (!blk_queue_has_srcu(q)) {				\
-		rcu_read_lock();				\
-		(dispatch_ops);					\
-		rcu_read_unlock();				\
-	} else {						\
+	if ((q)->tag_set->flags & BLK_MQ_F_BLOCKING) {		\
 		int srcu_idx;					\
 								\
 		might_sleep_if(check_sleep);			\
-		srcu_idx = srcu_read_lock((q)->srcu);		\
+		srcu_idx = srcu_read_lock((q)->tag_set->srcu);	\
 		(dispatch_ops);					\
-		srcu_read_unlock((q)->srcu, srcu_idx);		\
+		srcu_read_unlock((q)->tag_set->srcu, srcu_idx);	\
+	} else {						\
+		rcu_read_lock();				\
+		(dispatch_ops);					\
+		rcu_read_unlock();				\
 	}							\
 } while (0)
 
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 7b98c7074771d..02e94c4beff17 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -742,10 +742,8 @@ queue_attr_store(struct kobject *kobj, struct attribute *attr,
 
 static void blk_free_queue_rcu(struct rcu_head *rcu_head)
 {
-	struct request_queue *q = container_of(rcu_head, struct request_queue,
-					       rcu_head);
-
-	kmem_cache_free(blk_get_queue_kmem_cache(blk_queue_has_srcu(q)), q);
+	kmem_cache_free(blk_requestq_cachep,
+			container_of(rcu_head, struct request_queue, rcu_head));
 }
 
 /**
@@ -782,9 +780,6 @@ static void blk_release_queue(struct kobject *kobj)
 	if (queue_is_mq(q))
 		blk_mq_release(q);
 
-	if (blk_queue_has_srcu(q))
-		cleanup_srcu_struct(q->srcu);
-
 	ida_free(&blk_queue_ida, q->id);
 	call_rcu(&q->rcu_head, blk_free_queue_rcu);
 }
diff --git a/block/blk.h b/block/blk.h
index d6ea0d1a6db0f..d259d1da4cb4e 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -27,7 +27,6 @@ struct blk_flush_queue {
 };
 
 extern struct kmem_cache *blk_requestq_cachep;
-extern struct kmem_cache *blk_requestq_srcu_cachep;
 extern struct kobj_type blk_queue_ktype;
 extern struct ida blk_queue_ida;
 
@@ -427,13 +426,7 @@ int bio_add_hw_page(struct request_queue *q, struct bio *bio,
 		struct page *page, unsigned int len, unsigned int offset,
 		unsigned int max_sectors, bool *same_page);
 
-static inline struct kmem_cache *blk_get_queue_kmem_cache(bool srcu)
-{
-	if (srcu)
-		return blk_requestq_srcu_cachep;
-	return blk_requestq_cachep;
-}
-struct request_queue *blk_alloc_queue(int node_id, bool alloc_srcu);
+struct request_queue *blk_alloc_queue(int node_id);
 
 int disk_scan_partitions(struct gendisk *disk, fmode_t mode);
 
diff --git a/block/genhd.c b/block/genhd.c
index 70a25e8a64466..7bca3ee2f6702 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -1412,7 +1412,7 @@ struct gendisk *__blk_alloc_disk(int node, struct lock_class_key *lkclass)
 	struct request_queue *q;
 	struct gendisk *disk;
 
-	q = blk_alloc_queue(node, false);
+	q = blk_alloc_queue(node);
 	if (!q)
 		return NULL;
 
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index ba18e9bdb799b..dc189d6b1a367 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -7,6 +7,7 @@
 #include <linux/lockdep.h>
 #include <linux/scatterlist.h>
 #include <linux/prefetch.h>
+#include <linux/srcu.h>
 
 struct blk_mq_tags;
 struct blk_flush_queue;
@@ -501,6 +502,8 @@ enum hctx_type {
  * @tag_list_lock: Serializes tag_list accesses.
  * @tag_list:	   List of the request queues that use this tag set. See also
  *		   request_queue.tag_set_list.
+ * @srcu:	   Use as lock when type of the request queue is blocking
+ *		   (BLK_MQ_F_BLOCKING).
  */
 struct blk_mq_tag_set {
 	struct blk_mq_queue_map	map[HCTX_MAX_TYPES];
@@ -521,6 +524,7 @@ struct blk_mq_tag_set {
 
 	struct mutex		tag_list_lock;
 	struct list_head	tag_list;
+	struct srcu_struct	*srcu;
 };
 
 /**
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 50e358a19d986..b15b6a011c028 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -22,7 +22,6 @@
 #include <linux/blkzoned.h>
 #include <linux/sched.h>
 #include <linux/sbitmap.h>
-#include <linux/srcu.h>
 #include <linux/uuid.h>
 #include <linux/xarray.h>
 
@@ -543,18 +542,11 @@ struct request_queue {
 	struct mutex		debugfs_mutex;
 
 	bool			mq_sysfs_init_done;
-
-	/**
-	 * @srcu: Sleepable RCU. Use as lock when type of the request queue
-	 * is blocking (BLK_MQ_F_BLOCKING). Must be the last member
-	 */
-	struct srcu_struct	srcu[];
 };
 
 /* Keep blk_queue_flag_name[] in sync with the definitions below */
 #define QUEUE_FLAG_STOPPED	0	/* queue is stopped */
 #define QUEUE_FLAG_DYING	1	/* queue being torn down */
-#define QUEUE_FLAG_HAS_SRCU	2	/* SRCU is allocated */
 #define QUEUE_FLAG_NOMERGES     3	/* disable merge attempts */
 #define QUEUE_FLAG_SAME_COMP	4	/* complete on same CPU-group */
 #define QUEUE_FLAG_FAIL_IO	5	/* fake timeout */
@@ -590,7 +582,6 @@ bool blk_queue_flag_test_and_set(unsigned int flag, struct request_queue *q);
 
 #define blk_queue_stopped(q)	test_bit(QUEUE_FLAG_STOPPED, &(q)->queue_flags)
 #define blk_queue_dying(q)	test_bit(QUEUE_FLAG_DYING, &(q)->queue_flags)
-#define blk_queue_has_srcu(q)	test_bit(QUEUE_FLAG_HAS_SRCU, &(q)->queue_flags)
 #define blk_queue_init_done(q)	test_bit(QUEUE_FLAG_INIT_DONE, &(q)->queue_flags)
 #define blk_queue_nomerges(q)	test_bit(QUEUE_FLAG_NOMERGES, &(q)->queue_flags)
 #define blk_queue_noxmerges(q)	\
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH 15/17] blk-mq: pass a tagset to blk_mq_wait_quiesce_done
  2022-10-25 14:40 per-tagset SRCU struct and quiesce v2 Christoph Hellwig
                   ` (13 preceding siblings ...)
  2022-10-25 14:40 ` [PATCH 14/17] blk-mq: move the srcu_struct used for quiescing to the tagset Christoph Hellwig
@ 2022-10-25 14:40 ` Christoph Hellwig
  2022-10-25 14:40 ` [PATCH 16/17] blk-mq: add tagset quiesce interface Christoph Hellwig
  2022-10-25 14:40 ` [PATCH 17/17] nvme: use blk_mq_[un]quiesce_tagset Christoph Hellwig
  16 siblings, 0 replies; 47+ messages in thread
From: Christoph Hellwig @ 2022-10-25 14:40 UTC (permalink / raw)
  To: Jens Axboe, Keith Busch, Sagi Grimberg, Chao Leng
  Cc: Ming Lei, linux-nvme, linux-block, Hannes Reinecke

Nothing in blk_mq_wait_quiesce_done needs the request_queue now, so just
pass the tagset, and move the non-mq check into the only caller that
needs it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chao Leng <lengchao@huawei.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
---
 block/blk-mq.c           | 14 ++++++++------
 drivers/nvme/host/core.c |  4 ++--
 drivers/scsi/scsi_lib.c  |  2 +-
 include/linux/blk-mq.h   |  2 +-
 4 files changed, 12 insertions(+), 10 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 6cbf34921e33f..8f88c24a55c94 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -254,15 +254,17 @@ EXPORT_SYMBOL_GPL(blk_mq_quiesce_queue_nowait);
 
 /**
  * blk_mq_wait_quiesce_done() - wait until in-progress quiesce is done
- * @q: request queue.
+ * @set: tag_set to wait on
  *
  * Note: it is driver's responsibility for making sure that quiesce has
- * been started.
+ * been started on one or more of the request_queues of the tag_set.  This
+ * function only waits for the quiesce on those request_queues that had
+ * the quiesce flag set using blk_mq_quiesce_queue_nowait.
  */
-void blk_mq_wait_quiesce_done(struct request_queue *q)
+void blk_mq_wait_quiesce_done(struct blk_mq_tag_set *set)
 {
-	if (q->tag_set->flags & BLK_MQ_F_BLOCKING)
-		synchronize_srcu(q->tag_set->srcu);
+	if (set->flags & BLK_MQ_F_BLOCKING)
+		synchronize_srcu(set->srcu);
 	else
 		synchronize_rcu();
 }
@@ -282,7 +284,7 @@ void blk_mq_quiesce_queue(struct request_queue *q)
 	blk_mq_quiesce_queue_nowait(q);
 	/* nothing to wait for non-mq queues */
 	if (queue_is_mq(q))
-		blk_mq_wait_quiesce_done(q);
+		blk_mq_wait_quiesce_done(q->tag_set);
 }
 EXPORT_SYMBOL_GPL(blk_mq_quiesce_queue);
 
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 00612e6857295..012ef0772f5b3 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -5090,7 +5090,7 @@ static void nvme_stop_ns_queue(struct nvme_ns *ns)
 	if (!test_and_set_bit(NVME_NS_STOPPED, &ns->flags))
 		blk_mq_quiesce_queue(ns->queue);
 	else
-		blk_mq_wait_quiesce_done(ns->queue);
+		blk_mq_wait_quiesce_done(ns->queue->tag_set);
 }
 
 /* let I/O to all namespaces fail in preparation for surprise removal */
@@ -5180,7 +5180,7 @@ void nvme_stop_admin_queue(struct nvme_ctrl *ctrl)
 	if (!test_and_set_bit(NVME_CTRL_ADMIN_Q_STOPPED, &ctrl->flags))
 		blk_mq_quiesce_queue(ctrl->admin_q);
 	else
-		blk_mq_wait_quiesce_done(ctrl->admin_q);
+		blk_mq_wait_quiesce_done(ctrl->admin_q->tag_set);
 }
 EXPORT_SYMBOL_GPL(nvme_stop_admin_queue);
 
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 8b89fab7c4206..249757ddd8fea 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -2735,7 +2735,7 @@ static void scsi_stop_queue(struct scsi_device *sdev, bool nowait)
 			blk_mq_quiesce_queue(sdev->request_queue);
 	} else {
 		if (!nowait)
-			blk_mq_wait_quiesce_done(sdev->request_queue);
+			blk_mq_wait_quiesce_done(sdev->request_queue->tag_set);
 	}
 }
 
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index dc189d6b1a367..678ce8a661ec9 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -881,7 +881,7 @@ void blk_mq_start_hw_queues(struct request_queue *q);
 void blk_mq_start_stopped_hw_queue(struct blk_mq_hw_ctx *hctx, bool async);
 void blk_mq_start_stopped_hw_queues(struct request_queue *q, bool async);
 void blk_mq_quiesce_queue(struct request_queue *q);
-void blk_mq_wait_quiesce_done(struct request_queue *q);
+void blk_mq_wait_quiesce_done(struct blk_mq_tag_set *set);
 void blk_mq_unquiesce_queue(struct request_queue *q);
 void blk_mq_delay_run_hw_queue(struct blk_mq_hw_ctx *hctx, unsigned long msecs);
 void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async);
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH 16/17] blk-mq: add tagset quiesce interface
  2022-10-25 14:40 per-tagset SRCU struct and quiesce v2 Christoph Hellwig
                   ` (14 preceding siblings ...)
  2022-10-25 14:40 ` [PATCH 15/17] blk-mq: pass a tagset to blk_mq_wait_quiesce_done Christoph Hellwig
@ 2022-10-25 14:40 ` Christoph Hellwig
  2022-10-26  8:51   ` Chao Leng
  2022-10-25 14:40 ` [PATCH 17/17] nvme: use blk_mq_[un]quiesce_tagset Christoph Hellwig
  16 siblings, 1 reply; 47+ messages in thread
From: Christoph Hellwig @ 2022-10-25 14:40 UTC (permalink / raw)
  To: Jens Axboe, Keith Busch, Sagi Grimberg, Chao Leng
  Cc: Ming Lei, linux-nvme, linux-block, Hannes Reinecke

From: Chao Leng <lengchao@huawei.com>

Drivers that have shared tagsets may need to quiesce a potentially
large number of request queues that all share a single tagset (e.g.
nvme). Add an interface to quiesce all the queues on a given tagset.
This interface is useful because it can speed up the quiesce by doing
it in parallel.

Because some queues do not need to be quiesced (e.g. the nvme
connect_q) when quiescing the tagset, introduce a
QUEUE_FLAG_SKIP_TAGSET_QUIESCE flag to allow this new interface to
skip quiescing a particular queue.
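The parallelism argument can be made concrete with a small userspace C model. All names here are illustrative stand-ins for the kernel helpers: every queue is marked quiesced first with the cheap nowait variant, and then a single grace period is paid for the whole tagset rather than one per queue.

```c
#include <stdbool.h>

#define TOY_NQUEUES 4

struct toy_queue {
	bool skip;	/* QUEUE_FLAG_SKIP_TAGSET_QUIESCE equivalent */
	bool quiesced;
};

struct toy_tag_set {
	struct toy_queue q[TOY_NQUEUES];
	int grace_periods;	/* synchronize_(s)rcu() invocations */
};

/* Toy model of blk_mq_quiesce_tagset(): flag every non-skipped queue
 * without waiting, then wait once for the shared (S)RCU domain. */
static void toy_quiesce_tagset(struct toy_tag_set *set)
{
	int i;

	for (i = 0; i < TOY_NQUEUES; i++)
		if (!set->q[i].skip)
			set->q[i].quiesced = true;
	set->grace_periods++;	/* one wait covers all queues */
}
```

Quiescing the queues one by one would cost one grace period per queue; here the cost is constant in the number of queues, which is the speedup the commit message refers to.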

Signed-off-by: Chao Leng <lengchao@huawei.com>
[hch: simplify for the per-tag_set srcu_struct]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Chao Leng <lengchao@huawei.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
---
 block/blk-mq.c         | 25 +++++++++++++++++++++++++
 include/linux/blk-mq.h |  2 ++
 include/linux/blkdev.h |  3 +++
 3 files changed, 30 insertions(+)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 8f88c24a55c94..d307e44d24aec 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -315,6 +315,31 @@ void blk_mq_unquiesce_queue(struct request_queue *q)
 }
 EXPORT_SYMBOL_GPL(blk_mq_unquiesce_queue);
 
+void blk_mq_quiesce_tagset(struct blk_mq_tag_set *set)
+{
+	struct request_queue *q;
+
+	mutex_lock(&set->tag_list_lock);
+	list_for_each_entry(q, &set->tag_list, tag_set_list) {
+		if (!blk_queue_skip_tagset_quiesce(q))
+			blk_mq_quiesce_queue_nowait(q);
+	}
+	blk_mq_wait_quiesce_done(set);
+	mutex_unlock(&set->tag_list_lock);
+}
+EXPORT_SYMBOL_GPL(blk_mq_quiesce_tagset);
+
+void blk_mq_unquiesce_tagset(struct blk_mq_tag_set *set)
+{
+	struct request_queue *q;
+
+	mutex_lock(&set->tag_list_lock);
+	list_for_each_entry(q, &set->tag_list, tag_set_list)
+		blk_mq_unquiesce_queue(q);
+	mutex_unlock(&set->tag_list_lock);
+}
+EXPORT_SYMBOL_GPL(blk_mq_unquiesce_tagset);
+
 void blk_mq_wake_waiters(struct request_queue *q)
 {
 	struct blk_mq_hw_ctx *hctx;
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 678ce8a661ec9..33a62600150d5 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -882,6 +882,8 @@ void blk_mq_start_stopped_hw_queue(struct blk_mq_hw_ctx *hctx, bool async);
 void blk_mq_start_stopped_hw_queues(struct request_queue *q, bool async);
 void blk_mq_quiesce_queue(struct request_queue *q);
 void blk_mq_wait_quiesce_done(struct blk_mq_tag_set *set);
+void blk_mq_quiesce_tagset(struct blk_mq_tag_set *set);
+void blk_mq_unquiesce_tagset(struct blk_mq_tag_set *set);
 void blk_mq_unquiesce_queue(struct request_queue *q);
 void blk_mq_delay_run_hw_queue(struct blk_mq_hw_ctx *hctx, unsigned long msecs);
 void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async);
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index b15b6a011c028..854b4745cdd1f 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -571,6 +571,7 @@ struct request_queue {
 #define QUEUE_FLAG_HCTX_ACTIVE	28	/* at least one blk-mq hctx is active */
 #define QUEUE_FLAG_NOWAIT       29	/* device supports NOWAIT */
 #define QUEUE_FLAG_SQ_SCHED     30	/* single queue style io dispatch */
+#define QUEUE_FLAG_SKIP_TAGSET_QUIESCE	31 /* quiesce_tagset skips the queue */
 
 #define QUEUE_FLAG_MQ_DEFAULT	((1UL << QUEUE_FLAG_IO_STAT) |		\
 				 (1UL << QUEUE_FLAG_SAME_COMP) |	\
@@ -610,6 +611,8 @@ bool blk_queue_flag_test_and_set(unsigned int flag, struct request_queue *q);
 #define blk_queue_pm_only(q)	atomic_read(&(q)->pm_only)
 #define blk_queue_registered(q)	test_bit(QUEUE_FLAG_REGISTERED, &(q)->queue_flags)
 #define blk_queue_sq_sched(q)	test_bit(QUEUE_FLAG_SQ_SCHED, &(q)->queue_flags)
+#define blk_queue_skip_tagset_quiesce(q) \
+	test_bit(QUEUE_FLAG_SKIP_TAGSET_QUIESCE, &(q)->queue_flags)
 
 extern void blk_set_pm_only(struct request_queue *q);
 extern void blk_clear_pm_only(struct request_queue *q);
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH 17/17] nvme: use blk_mq_[un]quiesce_tagset
  2022-10-25 14:40 per-tagset SRCU struct and quiesce v2 Christoph Hellwig
                   ` (15 preceding siblings ...)
  2022-10-25 14:40 ` [PATCH 16/17] blk-mq: add tagset quiesce interface Christoph Hellwig
@ 2022-10-25 14:40 ` Christoph Hellwig
  2022-10-26 13:03   ` Sagi Grimberg
  16 siblings, 1 reply; 47+ messages in thread
From: Christoph Hellwig @ 2022-10-25 14:40 UTC (permalink / raw)
  To: Jens Axboe, Keith Busch, Sagi Grimberg, Chao Leng
  Cc: Ming Lei, linux-nvme, linux-block, Hannes Reinecke

From: Chao Leng <lengchao@huawei.com>

All controller namespaces share the same tagset, so we can use this
interface, which performs the optimal operation for a parallel quiesce
based on the tagset type (e.g. blocking vs. non-blocking tagsets).

The nvme connect_q should not be quiesced when quiescing the tagset,
so set QUEUE_FLAG_SKIP_TAGSET_QUIESCE on it when initializing
connect_q.

Currently we use NVME_NS_STOPPED to ensure that quiescing and
unquiescing are paired. With blk_mq_[un]quiesce_tagset,
NVME_NS_STOPPED would be invalidated, so introduce NVME_CTRL_STOPPED
to replace it. In addition, we never really quiesce a single
namespace, so it is a better choice to move the flag from the
namespace to the controller.
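The pairing guarantee the new controller flag provides can be sketched in userspace C. This is a single-threaded illustrative model (the kernel uses atomic test_and_set_bit/test_and_clear_bit on ctrl->flags); the names are stand-ins for the patched nvme helpers.

```c
#include <stdbool.h>

/* Toy model of the NVME_CTRL_STOPPED pairing: the first stop call
 * quiesces the tagset, repeated stops only wait for an in-progress
 * quiesce, and start only unquiesces if a stop actually happened. */
struct toy_ctrl {
	bool stopped;	/* NVME_CTRL_STOPPED bit */
	int quiesces;	/* blk_mq_quiesce_tagset() calls */
	int waits;	/* blk_mq_wait_quiesce_done() calls */
	int unquiesces;	/* blk_mq_unquiesce_tagset() calls */
};

static void toy_stop_queues(struct toy_ctrl *c)
{
	if (!c->stopped) {	/* !test_and_set_bit() in the kernel */
		c->stopped = true;
		c->quiesces++;
	} else {
		c->waits++;
	}
}

static void toy_start_queues(struct toy_ctrl *c)
{
	if (c->stopped) {	/* test_and_clear_bit() in the kernel */
		c->stopped = false;
		c->unquiesces++;
	}
}
```

However many times stop and start are called, quiesce and unquiesce remain paired exactly once per stop/start cycle, which is what the flag exists to enforce.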

Signed-off-by: Chao Leng <lengchao@huawei.com>
[hch: rebased on top of prep patches]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Chao Leng <lengchao@huawei.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
---
 drivers/nvme/host/core.c | 34 ++++++++--------------------------
 drivers/nvme/host/nvme.h |  2 +-
 2 files changed, 9 insertions(+), 27 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 012ef0772f5b3..ff9ed2467fb30 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -4880,6 +4880,8 @@ int nvme_alloc_io_tag_set(struct nvme_ctrl *ctrl, struct blk_mq_tag_set *set,
 			ret = PTR_ERR(ctrl->connect_q);
 			goto out_free_tag_set;
 		}
+		blk_queue_flag_set(QUEUE_FLAG_SKIP_TAGSET_QUIESCE,
+				   ctrl->connect_q);
 	}
 
 	ctrl->tagset = set;
@@ -5079,20 +5081,6 @@ int nvme_init_ctrl(struct nvme_ctrl *ctrl, struct device *dev,
 }
 EXPORT_SYMBOL_GPL(nvme_init_ctrl);
 
-static void nvme_start_ns_queue(struct nvme_ns *ns)
-{
-	if (test_and_clear_bit(NVME_NS_STOPPED, &ns->flags))
-		blk_mq_unquiesce_queue(ns->queue);
-}
-
-static void nvme_stop_ns_queue(struct nvme_ns *ns)
-{
-	if (!test_and_set_bit(NVME_NS_STOPPED, &ns->flags))
-		blk_mq_quiesce_queue(ns->queue);
-	else
-		blk_mq_wait_quiesce_done(ns->queue->tag_set);
-}
-
 /* let I/O to all namespaces fail in preparation for surprise removal */
 void nvme_mark_namespaces_dead(struct nvme_ctrl *ctrl)
 {
@@ -5155,23 +5143,17 @@ EXPORT_SYMBOL_GPL(nvme_start_freeze);
 
 void nvme_stop_queues(struct nvme_ctrl *ctrl)
 {
-	struct nvme_ns *ns;
-
-	down_read(&ctrl->namespaces_rwsem);
-	list_for_each_entry(ns, &ctrl->namespaces, list)
-		nvme_stop_ns_queue(ns);
-	up_read(&ctrl->namespaces_rwsem);
+	if (!test_and_set_bit(NVME_CTRL_STOPPED, &ctrl->flags))
+		blk_mq_quiesce_tagset(ctrl->tagset);
+	else
+		blk_mq_wait_quiesce_done(ctrl->tagset);
 }
 EXPORT_SYMBOL_GPL(nvme_stop_queues);
 
 void nvme_start_queues(struct nvme_ctrl *ctrl)
 {
-	struct nvme_ns *ns;
-
-	down_read(&ctrl->namespaces_rwsem);
-	list_for_each_entry(ns, &ctrl->namespaces, list)
-		nvme_start_ns_queue(ns);
-	up_read(&ctrl->namespaces_rwsem);
+	if (test_and_clear_bit(NVME_CTRL_STOPPED, &ctrl->flags))
+		blk_mq_unquiesce_tagset(ctrl->tagset);
 }
 EXPORT_SYMBOL_GPL(nvme_start_queues);
 
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 1cd7168002438..f9df10653f3c5 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -237,6 +237,7 @@ enum nvme_ctrl_flags {
 	NVME_CTRL_FAILFAST_EXPIRED	= 0,
 	NVME_CTRL_ADMIN_Q_STOPPED	= 1,
 	NVME_CTRL_STARTED_ONCE		= 2,
+	NVME_CTRL_STOPPED		= 3,
 };
 
 struct nvme_ctrl {
@@ -486,7 +487,6 @@ struct nvme_ns {
 #define NVME_NS_ANA_PENDING	2
 #define NVME_NS_FORCE_RO	3
 #define NVME_NS_READY		4
-#define NVME_NS_STOPPED		5
 
 	struct cdev		cdev;
 	struct device		cdev_device;
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: [PATCH 04/17] nvme: don't call nvme_kill_queues from nvme_remove_namespaces
  2022-10-25 14:40 ` [PATCH 04/17] nvme: don't call nvme_kill_queues from nvme_remove_namespaces Christoph Hellwig
@ 2022-10-25 17:43   ` Keith Busch
  2022-10-25 20:17     ` Sagi Grimberg
  2022-10-30  9:22     ` Christoph Hellwig
  0 siblings, 2 replies; 47+ messages in thread
From: Keith Busch @ 2022-10-25 17:43 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Sagi Grimberg, Chao Leng, Ming Lei, linux-nvme, linux-block

On Tue, Oct 25, 2022 at 07:40:07AM -0700, Christoph Hellwig wrote:
> @@ -4560,15 +4560,6 @@ void nvme_remove_namespaces(struct nvme_ctrl *ctrl)
>  	/* prevent racing with ns scanning */
>  	flush_work(&ctrl->scan_work);
>  
> -	/*
> -	 * The dead states indicates the controller was not gracefully
> -	 * disconnected. In that case, we won't be able to flush any data while
> -	 * removing the namespaces' disks; fail all the queues now to avoid
> -	 * potentially having to clean up the failed sync later.
> -	 */
> -	if (ctrl->state == NVME_CTRL_DEAD)
> -		nvme_kill_queues(ctrl);
> -
>  	/* this is a no-op when called from the controller reset handler */
>  	nvme_change_ctrl_state(ctrl, NVME_CTRL_DELETING_NOIO);
>  
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index ec034d4dd9eff..f971e96ffd3f6 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -3249,6 +3249,16 @@ static void nvme_remove(struct pci_dev *pdev)
>  
>  	flush_work(&dev->ctrl.reset_work);
>  	nvme_stop_ctrl(&dev->ctrl);
> +
> +	/*
> +	 * The dead states indicates the controller was not gracefully
> +	 * disconnected. In that case, we won't be able to flush any data while
> +	 * removing the namespaces' disks; fail all the queues now to avoid
> +	 * potentially having to clean up the failed sync later.
> +	 */
> +	if (dev->ctrl.state == NVME_CTRL_DEAD)
> +		nvme_kill_queues(&dev->ctrl);
> +
>  	nvme_remove_namespaces(&dev->ctrl);
>  	nvme_dev_disable(dev, true);
>  	nvme_remove_attrs(dev);
> -- 
> 2.30.2
> 

We still need the flush_work(scan_work) prior to killing the queues. It
looks like it could safely be moved to nvme_stop_ctrl(), which might
make it easier on everyone if it were there.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 10/17] nvme-pci: mark the namespaces dead earlier in nvme_remove
  2022-10-25 14:40 ` [PATCH 10/17] nvme-pci: mark the namespaces dead earlier in nvme_remove Christoph Hellwig
@ 2022-10-25 18:53   ` Keith Busch
  2022-10-26 12:55   ` Sagi Grimberg
  1 sibling, 0 replies; 47+ messages in thread
From: Keith Busch @ 2022-10-25 18:53 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Sagi Grimberg, Chao Leng, Ming Lei, linux-nvme, linux-block

On Tue, Oct 25, 2022 at 07:40:13AM -0700, Christoph Hellwig wrote:
> When we have a non-present controller there is no reason to wait in
> marking the namespaces dead, so do it ASAP.  Also remove the superfluous
> call to nvme_start_queues as nvme_dev_disable already did that for us.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  drivers/nvme/host/pci.c | 18 ++++++------------
>  1 file changed, 6 insertions(+), 12 deletions(-)
> 
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index 8ab54857cfd50..bef98f6e1396c 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -3244,25 +3244,19 @@ static void nvme_remove(struct pci_dev *pdev)
>  	nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_DELETING);
>  	pci_set_drvdata(pdev, NULL);
>  
> +	/*
> +	 * Mark the namespaces dead as we can't flush the data, and disable the
> +	 * controller ASAP as we can't shut it down properly if it was surprise
> +	 * removed.
> +	 */
>  	if (!pci_device_is_present(pdev)) {
>  		nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_DEAD);
> +		nvme_mark_namespaces_dead(&dev->ctrl);
>  		nvme_dev_disable(dev, true);
>  	}

Ah, this changes what I mentioned about 4. Flushing the scan_work prior
to nvme_mark_namespaces_dead() still looks the right thing to do, but I
think you have to do both after nvme_dev_disable().

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 04/17] nvme: don't call nvme_kill_queues from nvme_remove_namespaces
  2022-10-25 17:43   ` Keith Busch
@ 2022-10-25 20:17     ` Sagi Grimberg
  2022-10-30  9:22     ` Christoph Hellwig
  1 sibling, 0 replies; 47+ messages in thread
From: Sagi Grimberg @ 2022-10-25 20:17 UTC (permalink / raw)
  To: Keith Busch, Christoph Hellwig
  Cc: Jens Axboe, Chao Leng, Ming Lei, linux-nvme, linux-block



On 10/25/22 20:43, Keith Busch wrote:
> On Tue, Oct 25, 2022 at 07:40:07AM -0700, Christoph Hellwig wrote:
>> @@ -4560,15 +4560,6 @@ void nvme_remove_namespaces(struct nvme_ctrl *ctrl)
>>   	/* prevent racing with ns scanning */
>>   	flush_work(&ctrl->scan_work);
>>   
>> -	/*
>> -	 * The dead states indicates the controller was not gracefully
>> -	 * disconnected. In that case, we won't be able to flush any data while
>> -	 * removing the namespaces' disks; fail all the queues now to avoid
>> -	 * potentially having to clean up the failed sync later.
>> -	 */
>> -	if (ctrl->state == NVME_CTRL_DEAD)
>> -		nvme_kill_queues(ctrl);
>> -
>>   	/* this is a no-op when called from the controller reset handler */
>>   	nvme_change_ctrl_state(ctrl, NVME_CTRL_DELETING_NOIO);
>>   
>> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
>> index ec034d4dd9eff..f971e96ffd3f6 100644
>> --- a/drivers/nvme/host/pci.c
>> +++ b/drivers/nvme/host/pci.c
>> @@ -3249,6 +3249,16 @@ static void nvme_remove(struct pci_dev *pdev)
>>   
>>   	flush_work(&dev->ctrl.reset_work);
>>   	nvme_stop_ctrl(&dev->ctrl);
>> +
>> +	/*
>> +	 * The dead states indicates the controller was not gracefully
>> +	 * disconnected. In that case, we won't be able to flush any data while
>> +	 * removing the namespaces' disks; fail all the queues now to avoid
>> +	 * potentially having to clean up the failed sync later.
>> +	 */
>> +	if (dev->ctrl.state == NVME_CTRL_DEAD)
>> +		nvme_kill_queues(&dev->ctrl);
>> +
>>   	nvme_remove_namespaces(&dev->ctrl);
>>   	nvme_dev_disable(dev, true);
>>   	nvme_remove_attrs(dev);
>> -- 
>> 2.30.2
>>
> 
> We still need the flush_work(scan_work) prior to killing the queues. It
> looks like it could safely be moved to nvme_stop_ctrl(), which might
> make it easier on everyone if it were there.

If we do end up moving it to nvme_stop_ctrl, can we make a sub-version
of nvme_stop_ctrl that cannot block on I/O (i.e. without ana/scan/auth)?
For multipathing we want to tear down the controller quickly so we can
fail over I/O asap.

IIRC this is why scan_work is not in nvme_stop_ctrl to begin with, but
it is also possible that there was some other deadlock caused by that.
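
A rough sketch of such a split (the function names and exact bodies here are hypothetical; the real nvme_stop_ctrl also flushes AEN and firmware-activation work):

```c
/* fast path: never blocks on I/O, safe for multipath failover teardown */
void nvme_stop_ctrl_noio(struct nvme_ctrl *ctrl)
{
	nvme_mpath_stop(ctrl);
	nvme_stop_keep_alive(ctrl);
}

void nvme_stop_ctrl(struct nvme_ctrl *ctrl)
{
	nvme_stop_ctrl_noio(ctrl);
	/* these can issue or block on I/O (ana/scan/auth) */
	flush_work(&ctrl->async_event_work);
	flush_work(&ctrl->scan_work);
}
```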

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 11/17] nvme-pci: don't unquiesce the I/O queues in nvme_remove_dead_ctrl
  2022-10-25 14:40 ` [PATCH 11/17] nvme-pci: don't unquiesce the I/O queues in nvme_remove_dead_ctrl Christoph Hellwig
@ 2022-10-26  8:34   ` Chao Leng
  2022-10-26 12:58     ` Sagi Grimberg
  0 siblings, 1 reply; 47+ messages in thread
From: Chao Leng @ 2022-10-26  8:34 UTC (permalink / raw)
  To: Christoph Hellwig, Jens Axboe, Keith Busch, Sagi Grimberg
  Cc: Ming Lei, linux-nvme, linux-block



On 2022/10/25 22:40, Christoph Hellwig wrote:
> nvme_remove_dead_ctrl schedules nvme_remove to be called, which will
> call nvme_dev_disable and unquiesce the I/O queues.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>   drivers/nvme/host/pci.c | 1 -
>   1 file changed, 1 deletion(-)
> 
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index bef98f6e1396c..3a26c9b2bf454 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -2794,7 +2794,6 @@ static void nvme_remove_dead_ctrl(struct nvme_dev *dev)
>   	nvme_get_ctrl(&dev->ctrl);
>   	nvme_dev_disable(dev, false);
Currently the parameter "shutdown" is set to false, so nvme_dev_disable()
does not unquiesce the queues. Actually we should set "shutdown" to true:
         nvme_dev_disable(dev, true);



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 12/17] nvme-pci: don't unquiesce the I/O queues in apple_nvme_reset_work
  2022-10-25 14:40 ` [PATCH 12/17] nvme-pci: don't unquiesce the I/O queues in apple_nvme_reset_work Christoph Hellwig
@ 2022-10-26  8:37   ` Chao Leng
  2022-10-26 12:58   ` Sagi Grimberg
  1 sibling, 0 replies; 47+ messages in thread
From: Chao Leng @ 2022-10-26  8:37 UTC (permalink / raw)
  To: Christoph Hellwig, Jens Axboe, Keith Busch, Sagi Grimberg
  Cc: Ming Lei, linux-nvme, linux-block



On 2022/10/25 22:40, Christoph Hellwig wrote:
> apple_nvme_reset_work schedules apple_nvme_remove to be called, which
> will call apple_nvme_disable and unquiesce the I/O queues.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>   drivers/nvme/host/apple.c | 1 -
>   1 file changed, 1 deletion(-)
> 
> diff --git a/drivers/nvme/host/apple.c b/drivers/nvme/host/apple.c
> index 14bee207316a0..44e7daf93e19c 100644
> --- a/drivers/nvme/host/apple.c
> +++ b/drivers/nvme/host/apple.c
> @@ -1154,7 +1154,6 @@ static void apple_nvme_reset_work(struct work_struct *work)
>   	nvme_get_ctrl(&anv->ctrl);
>   	apple_nvme_disable(anv, false);
Same suggestion as the previous patch:
	apple_nvme_disable(anv, true);

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 14/17] blk-mq: move the srcu_struct used for quiescing to the tagset
  2022-10-25 14:40 ` [PATCH 14/17] blk-mq: move the srcu_struct used for quiescing to the tagset Christoph Hellwig
@ 2022-10-26  8:48   ` Chao Leng
  2022-10-26 13:01     ` Sagi Grimberg
  0 siblings, 1 reply; 47+ messages in thread
From: Chao Leng @ 2022-10-26  8:48 UTC (permalink / raw)
  To: Christoph Hellwig, Jens Axboe, Keith Busch, Sagi Grimberg
  Cc: Ming Lei, linux-nvme, linux-block, Hannes Reinecke



On 2022/10/25 22:40, Christoph Hellwig wrote:
> All I/O submissions have fairly similar latencies, and a tagset-wide
> quiesce is a fairly common operation.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> Reviewed-by: Keith Busch <kbusch@kernel.org>
> Reviewed-by: Ming Lei <ming.lei@redhat.com>
> Reviewed-by: Chao Leng <lengchao@huawei.com>
> Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
> Reviewed-by: Hannes Reinecke <hare@suse.de>
> ---
>   block/blk-core.c       | 27 +++++----------------------
>   block/blk-mq.c         | 33 +++++++++++++++++++++++++--------
>   block/blk-mq.h         | 14 +++++++-------
>   block/blk-sysfs.c      |  9 ++-------
>   block/blk.h            |  9 +--------
>   block/genhd.c          |  2 +-
>   include/linux/blk-mq.h |  4 ++++
>   include/linux/blkdev.h |  9 ---------
>   8 files changed, 45 insertions(+), 62 deletions(-)
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 17667159482e0..3a2ed8dadf738 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -65,7 +65,6 @@ DEFINE_IDA(blk_queue_ida);
>    * For queue allocation
>    */
>   struct kmem_cache *blk_requestq_cachep;
> -struct kmem_cache *blk_requestq_srcu_cachep;
>   
>   /*
>    * Controlling structure to kblockd
> @@ -373,26 +372,20 @@ static void blk_timeout_work(struct work_struct *work)
>   {
>   }
>   
> -struct request_queue *blk_alloc_queue(int node_id, bool alloc_srcu)
> +struct request_queue *blk_alloc_queue(int node_id)
>   {
>   	struct request_queue *q;
>   
> -	q = kmem_cache_alloc_node(blk_get_queue_kmem_cache(alloc_srcu),
> -			GFP_KERNEL | __GFP_ZERO, node_id);
> +	q = kmem_cache_alloc_node(blk_requestq_cachep, GFP_KERNEL | __GFP_ZERO,
> +				  node_id);
>   	if (!q)
>   		return NULL;
>   
> -	if (alloc_srcu) {
> -		blk_queue_flag_set(QUEUE_FLAG_HAS_SRCU, q);
> -		if (init_srcu_struct(q->srcu) != 0)
> -			goto fail_q;
> -	}
> -
>   	q->last_merge = NULL;
>   
>   	q->id = ida_alloc(&blk_queue_ida, GFP_KERNEL);
>   	if (q->id < 0)
> -		goto fail_srcu;
> +		goto fail_q;
>   
>   	q->stats = blk_alloc_queue_stats();
>   	if (!q->stats)
> @@ -435,11 +428,8 @@ struct request_queue *blk_alloc_queue(int node_id, bool alloc_srcu)
>   	blk_free_queue_stats(q->stats);
>   fail_id:
>   	ida_free(&blk_queue_ida, q->id);
> -fail_srcu:
> -	if (alloc_srcu)
> -		cleanup_srcu_struct(q->srcu);
>   fail_q:
> -	kmem_cache_free(blk_get_queue_kmem_cache(alloc_srcu), q);
> +	kmem_cache_free(blk_requestq_cachep, q);
>   	return NULL;
>   }
>   
> @@ -1184,9 +1174,6 @@ int __init blk_dev_init(void)
>   			sizeof_field(struct request, cmd_flags));
>   	BUILD_BUG_ON(REQ_OP_BITS + REQ_FLAG_BITS > 8 *
>   			sizeof_field(struct bio, bi_opf));
> -	BUILD_BUG_ON(ALIGN(offsetof(struct request_queue, srcu),
> -			   __alignof__(struct request_queue)) !=
> -		     sizeof(struct request_queue));
>   
>   	/* used for unplugging and affects IO latency/throughput - HIGHPRI */
>   	kblockd_workqueue = alloc_workqueue("kblockd",
> @@ -1197,10 +1184,6 @@ int __init blk_dev_init(void)
>   	blk_requestq_cachep = kmem_cache_create("request_queue",
>   			sizeof(struct request_queue), 0, SLAB_PANIC, NULL);
>   
> -	blk_requestq_srcu_cachep = kmem_cache_create("request_queue_srcu",
> -			sizeof(struct request_queue) +
> -			sizeof(struct srcu_struct), 0, SLAB_PANIC, NULL);
> -
>   	blk_debugfs_root = debugfs_create_dir("block", NULL);
>   
>   	return 0;
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 802fdd3d737e3..6cbf34921e33f 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -261,8 +261,8 @@ EXPORT_SYMBOL_GPL(blk_mq_quiesce_queue_nowait);
>    */
>   void blk_mq_wait_quiesce_done(struct request_queue *q)
>   {
> -	if (blk_queue_has_srcu(q))
> -		synchronize_srcu(q->srcu);
> +	if (q->tag_set->flags & BLK_MQ_F_BLOCKING)
> +		synchronize_srcu(q->tag_set->srcu);
>   	else
>   		synchronize_rcu();
>   }
> @@ -3971,7 +3971,7 @@ static struct request_queue *blk_mq_init_queue_data(struct blk_mq_tag_set *set,
>   	struct request_queue *q;
>   	int ret;
>   
> -	q = blk_alloc_queue(set->numa_node, set->flags & BLK_MQ_F_BLOCKING);
> +	q = blk_alloc_queue(set->numa_node);
>   	if (!q)
>   		return ERR_PTR(-ENOMEM);
>   	q->queuedata = queuedata;
> @@ -4138,9 +4138,6 @@ static void blk_mq_update_poll_flag(struct request_queue *q)
>   int blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
>   		struct request_queue *q)
>   {
> -	WARN_ON_ONCE(blk_queue_has_srcu(q) !=
> -			!!(set->flags & BLK_MQ_F_BLOCKING));
> -
>   	/* mark the queue as mq asap */
>   	q->mq_ops = set->ops;
>   
> @@ -4398,9 +4395,19 @@ int blk_mq_alloc_tag_set(struct blk_mq_tag_set *set)
>   	 */
>   	if (set->nr_maps == 1 && set->nr_hw_queues > nr_cpu_ids)
>   		set->nr_hw_queues = nr_cpu_ids;
> +
> +	if (set->flags & BLK_MQ_F_BLOCKING) {
> +		set->srcu = kmalloc(sizeof(*set->srcu), GFP_KERNEL);
Memory with contiguous physical addresses is not necessary here, so
kvmalloc() may be the better choice, because sizeof(*set->srcu) is a
little large. kvmalloc() is friendlier to long-running systems where
memory is fragmented or insufficient.
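
A minimal sketch of what the kvmalloc() variant could look like in blk_mq_alloc_tag_set() (the error label is illustrative, not taken from the patch):

```c
	if (set->flags & BLK_MQ_F_BLOCKING) {
		set->srcu = kvmalloc(sizeof(*set->srcu), GFP_KERNEL);
		if (!set->srcu)
			return -ENOMEM;
		ret = init_srcu_struct(set->srcu);
		if (ret)
			goto out_free_srcu;	/* kvfree(set->srcu) */
	}
```

The matching release path would then need kvfree() instead of kfree() after cleanup_srcu_struct(), since kvmalloc() memory may come from vmalloc.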

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 16/17] blk-mq: add tagset quiesce interface
  2022-10-25 14:40 ` [PATCH 16/17] blk-mq: add tagset quiesce interface Christoph Hellwig
@ 2022-10-26  8:51   ` Chao Leng
  2022-10-26 13:02     ` Sagi Grimberg
  0 siblings, 1 reply; 47+ messages in thread
From: Chao Leng @ 2022-10-26  8:51 UTC (permalink / raw)
  To: Christoph Hellwig, Jens Axboe, Keith Busch, Sagi Grimberg
  Cc: Ming Lei, linux-nvme, linux-block, Hannes Reinecke



On 2022/10/25 22:40, Christoph Hellwig wrote:
> @@ -315,6 +315,31 @@ void blk_mq_unquiesce_queue(struct request_queue *q)
>   }
>   EXPORT_SYMBOL_GPL(blk_mq_unquiesce_queue);
>   
> +void blk_mq_quiesce_tagset(struct blk_mq_tag_set *set)
> +{
> +	struct request_queue *q;
> +
> +	mutex_lock(&set->tag_list_lock);
> +	list_for_each_entry(q, &set->tag_list, tag_set_list) {
> +		if (!blk_queue_skip_tagset_quiesce(q))
> +			blk_mq_quiesce_queue_nowait(q);
> +	}
> +	blk_mq_wait_quiesce_done(set);
> +	mutex_unlock(&set->tag_list_lock);
> +}
> +EXPORT_SYMBOL_GPL(blk_mq_quiesce_tagset);
> +
> +void blk_mq_unquiesce_tagset(struct blk_mq_tag_set *set)
> +{
> +	struct request_queue *q;
> +
> +	mutex_lock(&set->tag_list_lock);
> +	list_for_each_entry(q, &set->tag_list, tag_set_list)
> +		blk_mq_unquiesce_queue(q);
We should add the blk_queue_skip_tagset_quiesce() check here to stay
consistent with blk_mq_quiesce_tagset(). It doesn't change behavior much,
but it looks a little better:
		if (!blk_queue_skip_tagset_quiesce(q))
			blk_mq_unquiesce_queue(q);

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 01/17] block: set the disk capacity to 0 in blk_mark_disk_dead
  2022-10-25 14:40 ` [PATCH 01/17] block: set the disk capacity to 0 in blk_mark_disk_dead Christoph Hellwig
@ 2022-10-26 12:39   ` Sagi Grimberg
  0 siblings, 0 replies; 47+ messages in thread
From: Sagi Grimberg @ 2022-10-26 12:39 UTC (permalink / raw)
  To: Christoph Hellwig, Jens Axboe, Keith Busch, Chao Leng
  Cc: Ming Lei, linux-nvme, linux-block

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 02/17] nvme-pci: refactor the tagset handling in nvme_reset_work
  2022-10-25 14:40 ` [PATCH 02/17] nvme-pci: refactor the tagset handling in nvme_reset_work Christoph Hellwig
@ 2022-10-26 12:46   ` Sagi Grimberg
  2022-10-30  9:17     ` Christoph Hellwig
  0 siblings, 1 reply; 47+ messages in thread
From: Sagi Grimberg @ 2022-10-26 12:46 UTC (permalink / raw)
  To: Christoph Hellwig, Jens Axboe, Keith Busch, Chao Leng
  Cc: Ming Lei, linux-nvme, linux-block

> The code to create, update or delete a tagset and namespaces in
> nvme_reset_work is a bit convoluted.  Refactor it with two high-level
> conditionals for first probe vs reset and I/O queues vs no I/O queues
> to make the code flow more clear.

This is clearer, but what I think would be even cleaner is if we simply
moved the whole first-time setup to a separate probe_work and treated it
as such, instead of relying on resource existence (tagset/admin_q) as
state indicators. The shared portion can move to helpers.
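
Roughly, the shape being suggested (all names here are hypothetical; the series does not implement this):

```c
/* one-shot work item: owns first-time setup only, so reset_work never
 * has to infer "first probe vs reset" from whether the tagset exists */
static void nvme_probe_work(struct work_struct *work)
{
	struct nvme_dev *dev = container_of(work, struct nvme_dev, probe_work);

	if (nvme_pci_setup(dev))	/* shared with reset via helpers */
		goto out;
	nvme_alloc_admin_tagset(dev);	/* first time only */
	nvme_alloc_io_tagsets(dev);	/* first time only */
	nvme_start_ctrl(&dev->ctrl);
	return;
out:
	nvme_remove_dead_ctrl(dev);
}
```

reset_work would then only re-run the shared setup helpers and the unquiesce paths, with no conditionals on tagset/admin_q existence.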

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 03/17] nvme-pci: don't warn about the lack of I/O queues for admin controllers
  2022-10-25 14:40 ` [PATCH 03/17] nvme-pci: don't warn about the lack of I/O queues for admin controllers Christoph Hellwig
@ 2022-10-26 12:49   ` Sagi Grimberg
  2022-10-30  9:18     ` Christoph Hellwig
  0 siblings, 1 reply; 47+ messages in thread
From: Sagi Grimberg @ 2022-10-26 12:49 UTC (permalink / raw)
  To: Christoph Hellwig, Jens Axboe, Keith Busch, Chao Leng
  Cc: Ming Lei, linux-nvme, linux-block


> Admin controllers never have I/O queues, so don't warn about that fact.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>   drivers/nvme/host/pci.c | 7 +++++--
>   1 file changed, 5 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index 51513d263d77a..ec034d4dd9eff 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -2917,7 +2917,8 @@ static void nvme_reset_work(struct work_struct *work)
>   			nvme_dbbuf_set(dev);
>   			nvme_unfreeze(&dev->ctrl);
>   		} else {
> -			dev_warn(dev->ctrl.device, "IO queues lost\n");
> +			if (dev->ctrl.cntrltype != NVME_CTRL_ADMIN)
> +				dev_warn(dev->ctrl.device, "IO queues lost\n");

I have a feeling that we have quite a few other messages that are
irrelevant for admin controllers. And I wonder what device you have that
presents an admin controller. But this looks good,

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 05/17] nvme: don't remove namespaces in nvme_passthru_end
  2022-10-25 14:40 ` [PATCH 05/17] nvme: don't remove namespaces in nvme_passthru_end Christoph Hellwig
@ 2022-10-26 12:50   ` Sagi Grimberg
  0 siblings, 0 replies; 47+ messages in thread
From: Sagi Grimberg @ 2022-10-26 12:50 UTC (permalink / raw)
  To: Christoph Hellwig, Jens Axboe, Keith Busch, Chao Leng
  Cc: Ming Lei, linux-nvme, linux-block

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 06/17] nvme: remove the NVME_NS_DEAD check in nvme_remove_invalid_namespaces
  2022-10-25 14:40 ` [PATCH 06/17] nvme: remove the NVME_NS_DEAD check in nvme_remove_invalid_namespaces Christoph Hellwig
@ 2022-10-26 12:50   ` Sagi Grimberg
  0 siblings, 0 replies; 47+ messages in thread
From: Sagi Grimberg @ 2022-10-26 12:50 UTC (permalink / raw)
  To: Christoph Hellwig, Jens Axboe, Keith Busch, Chao Leng
  Cc: Ming Lei, linux-nvme, linux-block

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 07/17] nvme: remove the NVME_NS_DEAD check in nvme_validate_ns
  2022-10-25 14:40 ` [PATCH 07/17] nvme: remove the NVME_NS_DEAD check in nvme_validate_ns Christoph Hellwig
@ 2022-10-26 12:52   ` Sagi Grimberg
  2022-10-30  9:28     ` Christoph Hellwig
  0 siblings, 1 reply; 47+ messages in thread
From: Sagi Grimberg @ 2022-10-26 12:52 UTC (permalink / raw)
  To: Christoph Hellwig, Jens Axboe, Keith Busch, Chao Leng
  Cc: Ming Lei, linux-nvme, linux-block


> We never rescan namespaces after marking them dead, so this check is dead
> now.

scan_work can still run, I'd keep this one.

> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>   drivers/nvme/host/core.c | 4 ----
>   1 file changed, 4 deletions(-)
> 
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 2ec838e608509..99eabbe7fac98 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -4333,10 +4333,6 @@ static void nvme_validate_ns(struct nvme_ns *ns, struct nvme_ns_info *info)
>   {
>   	int ret = NVME_SC_INVALID_NS | NVME_SC_DNR;
>   
> -	if (test_bit(NVME_NS_DEAD, &ns->flags))
> -		goto out;
> -
> -	ret = NVME_SC_INVALID_NS | NVME_SC_DNR;
>   	if (!nvme_ns_ids_equal(&ns->head->ids, &info->ids)) {
>   		dev_err(ns->ctrl->device,
>   			"identifiers changed for nsid %d\n", ns->head->ns_id);

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 08/17] nvme: don't unquiesce the admin queue in nvme_kill_queues
  2022-10-25 14:40 ` [PATCH 08/17] nvme: don't unquiesce the admin queue in nvme_kill_queues Christoph Hellwig
@ 2022-10-26 12:53   ` Sagi Grimberg
  0 siblings, 0 replies; 47+ messages in thread
From: Sagi Grimberg @ 2022-10-26 12:53 UTC (permalink / raw)
  To: Christoph Hellwig, Jens Axboe, Keith Busch, Chao Leng
  Cc: Ming Lei, linux-nvme, linux-block

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 09/17] nvme: don't unquiesce the I/O queues in nvme_kill_queues
  2022-10-25 14:40 ` [PATCH 09/17] nvme: don't unquiesce the I/O queues " Christoph Hellwig
@ 2022-10-26 12:54   ` Sagi Grimberg
  0 siblings, 0 replies; 47+ messages in thread
From: Sagi Grimberg @ 2022-10-26 12:54 UTC (permalink / raw)
  To: Christoph Hellwig, Jens Axboe, Keith Busch, Chao Leng
  Cc: Ming Lei, linux-nvme, linux-block

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 10/17] nvme-pci: mark the namespaces dead earlier in nvme_remove
  2022-10-25 14:40 ` [PATCH 10/17] nvme-pci: mark the namespaces dead earlier in nvme_remove Christoph Hellwig
  2022-10-25 18:53   ` Keith Busch
@ 2022-10-26 12:55   ` Sagi Grimberg
  2022-10-26 12:57     ` Sagi Grimberg
  1 sibling, 1 reply; 47+ messages in thread
From: Sagi Grimberg @ 2022-10-26 12:55 UTC (permalink / raw)
  To: Christoph Hellwig, Jens Axboe, Keith Busch, Chao Leng
  Cc: Ming Lei, linux-nvme, linux-block


> When we have a non-present controller there is no reason to wait in
> marking the namespaces dead, so do it ASAP.  Also remove the superfluous
> call to nvme_start_queues as nvme_dev_disable already did that for us.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>   drivers/nvme/host/pci.c | 18 ++++++------------
>   1 file changed, 6 insertions(+), 12 deletions(-)
> 
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index 8ab54857cfd50..bef98f6e1396c 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -3244,25 +3244,19 @@ static void nvme_remove(struct pci_dev *pdev)
>   	nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_DELETING);
>   	pci_set_drvdata(pdev, NULL);
>   
> +	/*
> +	 * Mark the namespaces dead as we can't flush the data, and disable the
> +	 * controller ASAP as we can't shut it down properly if it was surprise
> +	 * removed.
> +	 */
>   	if (!pci_device_is_present(pdev)) {
>   		nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_DEAD);
> +		nvme_mark_namespaces_dead(&dev->ctrl);

Don't you need to start the queues as well here?

>   		nvme_dev_disable(dev, true);
>   	}
>   
>   	flush_work(&dev->ctrl.reset_work);
>   	nvme_stop_ctrl(&dev->ctrl);
> -
> -	/*
> -	 * The dead states indicates the controller was not gracefully
> -	 * disconnected. In that case, we won't be able to flush any data while
> -	 * removing the namespaces' disks; fail all the queues now to avoid
> -	 * potentially having to clean up the failed sync later.
> -	 */
> -	if (dev->ctrl.state == NVME_CTRL_DEAD) {
> -		nvme_mark_namespaces_dead(&dev->ctrl);
> -		nvme_start_queues(&dev->ctrl);
> -	}
> -
>   	nvme_remove_namespaces(&dev->ctrl);
>   	nvme_dev_disable(dev, true);
>   	nvme_remove_attrs(dev);

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 10/17] nvme-pci: mark the namespaces dead earlier in nvme_remove
  2022-10-26 12:55   ` Sagi Grimberg
@ 2022-10-26 12:57     ` Sagi Grimberg
  0 siblings, 0 replies; 47+ messages in thread
From: Sagi Grimberg @ 2022-10-26 12:57 UTC (permalink / raw)
  To: Christoph Hellwig, Jens Axboe, Keith Busch, Chao Leng
  Cc: Ming Lei, linux-nvme, linux-block


>> When we have a non-present controller there is no reason to wait in
>> marking the namespaces dead, so do it ASAP.  Also remove the superfluous
>> call to nvme_start_queues as nvme_dev_disable already did that for us.
>>
>> Signed-off-by: Christoph Hellwig <hch@lst.de>
>> ---
>>   drivers/nvme/host/pci.c | 18 ++++++------------
>>   1 file changed, 6 insertions(+), 12 deletions(-)
>>
>> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
>> index 8ab54857cfd50..bef98f6e1396c 100644
>> --- a/drivers/nvme/host/pci.c
>> +++ b/drivers/nvme/host/pci.c
>> @@ -3244,25 +3244,19 @@ static void nvme_remove(struct pci_dev *pdev)
>>       nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_DELETING);
>>       pci_set_drvdata(pdev, NULL);
>> +    /*
>> +     * Mark the namespaces dead as we can't flush the data, and disable the
>> +     * controller ASAP as we can't shut it down properly if it was surprise
>> +     * removed.
>> +     */
>>       if (!pci_device_is_present(pdev)) {
>>           nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_DEAD);
>> +        nvme_mark_namespaces_dead(&dev->ctrl);
> 
> Don't you need to start the queues as well here?

Never mind, you call disable with shutdown=true.

> 
>>           nvme_dev_disable(dev, true);

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 11/17] nvme-pci: don't unquiesce the I/O queues in nvme_remove_dead_ctrl
  2022-10-26  8:34   ` Chao Leng
@ 2022-10-26 12:58     ` Sagi Grimberg
  2022-10-27  2:46       ` Chao Leng
  0 siblings, 1 reply; 47+ messages in thread
From: Sagi Grimberg @ 2022-10-26 12:58 UTC (permalink / raw)
  To: Chao Leng, Christoph Hellwig, Jens Axboe, Keith Busch
  Cc: Ming Lei, linux-nvme, linux-block


>> nvme_remove_dead_ctrl schedules nvme_remove to be called, which will
>> call nvme_dev_disable and unquiesce the I/O queues.
>>
>> Signed-off-by: Christoph Hellwig <hch@lst.de>
>> ---
>>   drivers/nvme/host/pci.c | 1 -
>>   1 file changed, 1 deletion(-)
>>
>> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
>> index bef98f6e1396c..3a26c9b2bf454 100644
>> --- a/drivers/nvme/host/pci.c
>> +++ b/drivers/nvme/host/pci.c
>> @@ -2794,7 +2794,6 @@ static void nvme_remove_dead_ctrl(struct nvme_dev *dev)
>>       nvme_get_ctrl(&dev->ctrl);
>>       nvme_dev_disable(dev, false);
> Currently set the parameter "shutdown" to false, nvme_dev_disable() do 
> not unquiesce the queues.
> Actually we should set the parameter "shutdown" to true.
>          nvme_dev_disable(dev, true);

But the final put will trigger nvme_remove no?

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 12/17] nvme-pci: don't unquiesce the I/O queues in apple_nvme_reset_work
  2022-10-25 14:40 ` [PATCH 12/17] nvme-pci: don't unquiesce the I/O queues in apple_nvme_reset_work Christoph Hellwig
  2022-10-26  8:37   ` Chao Leng
@ 2022-10-26 12:58   ` Sagi Grimberg
  1 sibling, 0 replies; 47+ messages in thread
From: Sagi Grimberg @ 2022-10-26 12:58 UTC (permalink / raw)
  To: Christoph Hellwig, Jens Axboe, Keith Busch, Chao Leng
  Cc: Ming Lei, linux-nvme, linux-block

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 14/17] blk-mq: move the srcu_struct used for quiescing to the tagset
  2022-10-26  8:48   ` Chao Leng
@ 2022-10-26 13:01     ` Sagi Grimberg
  2022-10-27  2:49       ` Chao Leng
  0 siblings, 1 reply; 47+ messages in thread
From: Sagi Grimberg @ 2022-10-26 13:01 UTC (permalink / raw)
  To: Chao Leng, Christoph Hellwig, Jens Axboe, Keith Busch
  Cc: Ming Lei, linux-nvme, linux-block, Hannes Reinecke



On 10/26/22 11:48, Chao Leng wrote:
> 
> 
> On 2022/10/25 22:40, Christoph Hellwig wrote:
>> All I/O submissions have fairly similar latencies, and a tagset-wide
>> quiesce is a fairly common operation.
>>
>> Signed-off-by: Christoph Hellwig <hch@lst.de>
>> Reviewed-by: Keith Busch <kbusch@kernel.org>
>> Reviewed-by: Ming Lei <ming.lei@redhat.com>
>> Reviewed-by: Chao Leng <lengchao@huawei.com>
>> Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
>> Reviewed-by: Hannes Reinecke <hare@suse.de>
>> ---
>>   block/blk-core.c       | 27 +++++----------------------
>>   block/blk-mq.c         | 33 +++++++++++++++++++++++++--------
>>   block/blk-mq.h         | 14 +++++++-------
>>   block/blk-sysfs.c      |  9 ++-------
>>   block/blk.h            |  9 +--------
>>   block/genhd.c          |  2 +-
>>   include/linux/blk-mq.h |  4 ++++
>>   include/linux/blkdev.h |  9 ---------
>>   8 files changed, 45 insertions(+), 62 deletions(-)
>>
>> diff --git a/block/blk-core.c b/block/blk-core.c
>> index 17667159482e0..3a2ed8dadf738 100644
>> --- a/block/blk-core.c
>> +++ b/block/blk-core.c
>> @@ -65,7 +65,6 @@ DEFINE_IDA(blk_queue_ida);
>>    * For queue allocation
>>    */
>>   struct kmem_cache *blk_requestq_cachep;
>> -struct kmem_cache *blk_requestq_srcu_cachep;
>>   /*
>>    * Controlling structure to kblockd
>> @@ -373,26 +372,20 @@ static void blk_timeout_work(struct work_struct *work)
>>   {
>>   }
>> -struct request_queue *blk_alloc_queue(int node_id, bool alloc_srcu)
>> +struct request_queue *blk_alloc_queue(int node_id)
>>   {
>>       struct request_queue *q;
>> -    q = kmem_cache_alloc_node(blk_get_queue_kmem_cache(alloc_srcu),
>> -            GFP_KERNEL | __GFP_ZERO, node_id);
>> +    q = kmem_cache_alloc_node(blk_requestq_cachep, GFP_KERNEL | __GFP_ZERO,
>> +                  node_id);
>>       if (!q)
>>           return NULL;
>> -    if (alloc_srcu) {
>> -        blk_queue_flag_set(QUEUE_FLAG_HAS_SRCU, q);
>> -        if (init_srcu_struct(q->srcu) != 0)
>> -            goto fail_q;
>> -    }
>> -
>>       q->last_merge = NULL;
>>       q->id = ida_alloc(&blk_queue_ida, GFP_KERNEL);
>>       if (q->id < 0)
>> -        goto fail_srcu;
>> +        goto fail_q;
>>       q->stats = blk_alloc_queue_stats();
>>       if (!q->stats)
>> @@ -435,11 +428,8 @@ struct request_queue *blk_alloc_queue(int node_id, bool alloc_srcu)
>>       blk_free_queue_stats(q->stats);
>>   fail_id:
>>       ida_free(&blk_queue_ida, q->id);
>> -fail_srcu:
>> -    if (alloc_srcu)
>> -        cleanup_srcu_struct(q->srcu);
>>   fail_q:
>> -    kmem_cache_free(blk_get_queue_kmem_cache(alloc_srcu), q);
>> +    kmem_cache_free(blk_requestq_cachep, q);
>>       return NULL;
>>   }
>> @@ -1184,9 +1174,6 @@ int __init blk_dev_init(void)
>>               sizeof_field(struct request, cmd_flags));
>>       BUILD_BUG_ON(REQ_OP_BITS + REQ_FLAG_BITS > 8 *
>>               sizeof_field(struct bio, bi_opf));
>> -    BUILD_BUG_ON(ALIGN(offsetof(struct request_queue, srcu),
>> -               __alignof__(struct request_queue)) !=
>> -             sizeof(struct request_queue));
>>       /* used for unplugging and affects IO latency/throughput - HIGHPRI */
>>       kblockd_workqueue = alloc_workqueue("kblockd",
>> @@ -1197,10 +1184,6 @@ int __init blk_dev_init(void)
>>       blk_requestq_cachep = kmem_cache_create("request_queue",
>>               sizeof(struct request_queue), 0, SLAB_PANIC, NULL);
>> -    blk_requestq_srcu_cachep = kmem_cache_create("request_queue_srcu",
>> -            sizeof(struct request_queue) +
>> -            sizeof(struct srcu_struct), 0, SLAB_PANIC, NULL);
>> -
>>       blk_debugfs_root = debugfs_create_dir("block", NULL);
>>       return 0;
>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>> index 802fdd3d737e3..6cbf34921e33f 100644
>> --- a/block/blk-mq.c
>> +++ b/block/blk-mq.c
>> @@ -261,8 +261,8 @@ EXPORT_SYMBOL_GPL(blk_mq_quiesce_queue_nowait);
>>    */
>>   void blk_mq_wait_quiesce_done(struct request_queue *q)
>>   {
>> -    if (blk_queue_has_srcu(q))
>> -        synchronize_srcu(q->srcu);
>> +    if (q->tag_set->flags & BLK_MQ_F_BLOCKING)
>> +        synchronize_srcu(q->tag_set->srcu);
>>       else
>>           synchronize_rcu();
>>   }
>> @@ -3971,7 +3971,7 @@ static struct request_queue *blk_mq_init_queue_data(struct blk_mq_tag_set *set,
>>       struct request_queue *q;
>>       int ret;
>> -    q = blk_alloc_queue(set->numa_node, set->flags & BLK_MQ_F_BLOCKING);
>> +    q = blk_alloc_queue(set->numa_node);
>>       if (!q)
>>           return ERR_PTR(-ENOMEM);
>>       q->queuedata = queuedata;
>> @@ -4138,9 +4138,6 @@ static void blk_mq_update_poll_flag(struct request_queue *q)
>>   int blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
>>           struct request_queue *q)
>>   {
>> -    WARN_ON_ONCE(blk_queue_has_srcu(q) !=
>> -            !!(set->flags & BLK_MQ_F_BLOCKING));
>> -
>>       /* mark the queue as mq asap */
>>       q->mq_ops = set->ops;
>> @@ -4398,9 +4395,19 @@ int blk_mq_alloc_tag_set(struct blk_mq_tag_set *set)
>>        */
>>       if (set->nr_maps == 1 && set->nr_hw_queues > nr_cpu_ids)
>>           set->nr_hw_queues = nr_cpu_ids;
>> +
>> +    if (set->flags & BLK_MQ_F_BLOCKING) {
>> +        set->srcu = kmalloc(sizeof(*set->srcu), GFP_KERNEL);
> Memory with contiguous physical addresses is not necessary, maybe it is 
> a better choice to use kvmalloc,
> because sizeof(*set->srcu) is a little large.
> kvmalloc() is more friendly to scenarios where memory is insufficient 
> and running for a long time.

Huh?

(gdb) p sizeof(struct srcu_struct)
$1 = 392


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 16/17] blk-mq: add tagset quiesce interface
  2022-10-26  8:51   ` Chao Leng
@ 2022-10-26 13:02     ` Sagi Grimberg
  0 siblings, 0 replies; 47+ messages in thread
From: Sagi Grimberg @ 2022-10-26 13:02 UTC (permalink / raw)
  To: Chao Leng, Christoph Hellwig, Jens Axboe, Keith Busch
  Cc: Ming Lei, linux-nvme, linux-block, Hannes Reinecke


>> @@ -315,6 +315,31 @@ void blk_mq_unquiesce_queue(struct request_queue *q)
>>   }
>>   EXPORT_SYMBOL_GPL(blk_mq_unquiesce_queue);
>> +void blk_mq_quiesce_tagset(struct blk_mq_tag_set *set)
>> +{
>> +    struct request_queue *q;
>> +
>> +    mutex_lock(&set->tag_list_lock);
>> +    list_for_each_entry(q, &set->tag_list, tag_set_list) {
>> +        if (!blk_queue_skip_tagset_quiesce(q))
>> +            blk_mq_quiesce_queue_nowait(q);
>> +    }
>> +    blk_mq_wait_quiesce_done(set);
>> +    mutex_unlock(&set->tag_list_lock);
>> +}
>> +EXPORT_SYMBOL_GPL(blk_mq_quiesce_tagset);
>> +
>> +void blk_mq_unquiesce_tagset(struct blk_mq_tag_set *set)
>> +{
>> +    struct request_queue *q;
>> +
>> +    mutex_lock(&set->tag_list_lock);
>> +    list_for_each_entry(q, &set->tag_list, tag_set_list)
>> +        blk_mq_unquiesce_queue(q);
> We should add the check of blk_queue_skip_tagset_quiesce() to keep 
> consistent
> with blk_mq_quiesce_tagset(), it doesn't make much sense, but maybe look 
> a little better.
>          if (!blk_queue_skip_tagset_quiesce(q))
>              blk_mq_unquiesce_queue(q);

I agree.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 17/17] nvme: use blk_mq_[un]quiesce_tagset
  2022-10-25 14:40 ` [PATCH 17/17] nvme: use blk_mq_[un]quiesce_tagset Christoph Hellwig
@ 2022-10-26 13:03   ` Sagi Grimberg
  0 siblings, 0 replies; 47+ messages in thread
From: Sagi Grimberg @ 2022-10-26 13:03 UTC (permalink / raw)
  To: Christoph Hellwig, Jens Axboe, Keith Busch, Chao Leng
  Cc: Ming Lei, linux-nvme, linux-block, Hannes Reinecke

Looks great,

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 11/17] nvme-pci: don't unquiesce the I/O queues in nvme_remove_dead_ctrl
  2022-10-26 12:58     ` Sagi Grimberg
@ 2022-10-27  2:46       ` Chao Leng
  0 siblings, 0 replies; 47+ messages in thread
From: Chao Leng @ 2022-10-27  2:46 UTC (permalink / raw)
  To: Sagi Grimberg, Christoph Hellwig, Jens Axboe, Keith Busch
  Cc: Ming Lei, linux-nvme, linux-block



On 2022/10/26 20:58, Sagi Grimberg wrote:
> 
>>> nvme_remove_dead_ctrl schedules nvme_remove to be called, which will
>>> call nvme_dev_disable and unquiesce the I/O queues.
>>>
>>> Signed-off-by: Christoph Hellwig <hch@lst.de>
>>> ---
>>>   drivers/nvme/host/pci.c | 1 -
>>>   1 file changed, 1 deletion(-)
>>>
>>> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
>>> index bef98f6e1396c..3a26c9b2bf454 100644
>>> --- a/drivers/nvme/host/pci.c
>>> +++ b/drivers/nvme/host/pci.c
>>> @@ -2794,7 +2794,6 @@ static void nvme_remove_dead_ctrl(struct nvme_dev *dev)
>>>       nvme_get_ctrl(&dev->ctrl);
>>>       nvme_dev_disable(dev, false);
>> With the "shutdown" parameter set to false, nvme_dev_disable() does
>> not unquiesce the queues.  We should actually set "shutdown" to true:
>>          nvme_dev_disable(dev, true);
> 
> But the final put will trigger nvme_remove no?
No, it does not trigger nvme_remove.
However, the controller is actually in the deleting state.
Unquiescing the queues seems to be the better choice here.

* Re: [PATCH 14/17] blk-mq: move the srcu_struct used for quiescing to the tagset
  2022-10-26 13:01     ` Sagi Grimberg
@ 2022-10-27  2:49       ` Chao Leng
  2022-10-27 10:02         ` Sagi Grimberg
  0 siblings, 1 reply; 47+ messages in thread
From: Chao Leng @ 2022-10-27  2:49 UTC (permalink / raw)
  To: Sagi Grimberg, Christoph Hellwig, Jens Axboe, Keith Busch
  Cc: Ming Lei, linux-nvme, linux-block, Hannes Reinecke



On 2022/10/26 21:01, Sagi Grimberg wrote:
> 
> 
> On 10/26/22 11:48, Chao Leng wrote:
>>> +
>>> +    if (set->flags & BLK_MQ_F_BLOCKING) {
>>> +        set->srcu = kmalloc(sizeof(*set->srcu), GFP_KERNEL);
>> Physically contiguous memory is not necessary here; kvmalloc() may be
>> a better choice, because sizeof(*set->srcu) is a little large.
>> kvmalloc() is also friendlier on long-running systems where memory is
>> fragmented.
> 
> Huh?
> 
> (gdb) p sizeof(struct srcu_struct)
> $1 = 392
I double-checked it; the size I remembered was from an older version.
The latest version has the size you show: the srcu_node array was
changed to a pointer in April 2022.
The link:
https://github.com/torvalds/linux/commit/2ec303113d978931ef368886c4c6bc854493e8bf

Therefore, kvmalloc() is a better choice for older kernels.
For the latest version, a static member or kmalloc() is fine.
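
On kernels where struct srcu_struct is still large, the allocation in
the quoted hunk could use kvmalloc()/kvfree() instead; a sketch,
assuming the field names from the quoted patch (the error label is
illustrative):

```c
	if (set->flags & BLK_MQ_F_BLOCKING) {
		/* kvmalloc() falls back to vmalloc() when physically
		 * contiguous pages are scarce; SRCU does not need
		 * physically contiguous memory. */
		set->srcu = kvmalloc(sizeof(*set->srcu), GFP_KERNEL);
		if (!set->srcu)
			return -ENOMEM;
		ret = init_srcu_struct(set->srcu);
		if (ret)
			goto out_free_srcu;	/* kvfree(set->srcu) */
	}
```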

* Re: [PATCH 14/17] blk-mq: move the srcu_struct used for quiescing to the tagset
  2022-10-27  2:49       ` Chao Leng
@ 2022-10-27 10:02         ` Sagi Grimberg
  0 siblings, 0 replies; 47+ messages in thread
From: Sagi Grimberg @ 2022-10-27 10:02 UTC (permalink / raw)
  To: Chao Leng, Christoph Hellwig, Jens Axboe, Keith Busch
  Cc: Ming Lei, linux-nvme, linux-block, Hannes Reinecke


>>>> +
>>>> +    if (set->flags & BLK_MQ_F_BLOCKING) {
>>>> +        set->srcu = kmalloc(sizeof(*set->srcu), GFP_KERNEL);
>>> Physically contiguous memory is not necessary here; kvmalloc() may
>>> be a better choice, because sizeof(*set->srcu) is a little large.
>>> kvmalloc() is also friendlier on long-running systems where memory
>>> is fragmented.
>>
>> Huh?
>>
>> (gdb) p sizeof(struct srcu_struct)
>> $1 = 392
> I double-checked it; the size I remembered was from an older version.
> The latest version has the size you show: the srcu_node array was
> changed to a pointer in April 2022.
> The link:
> https://github.com/torvalds/linux/commit/2ec303113d978931ef368886c4c6bc854493e8bf
> 
> Therefore, kvmalloc() is a better choice for older kernels.
> For the latest version, a static member or kmalloc() is fine.

We can keep it dynamic allocation IMO.

* Re: [PATCH 02/17] nvme-pci: refactor the tagset handling in nvme_reset_work
  2022-10-26 12:46   ` Sagi Grimberg
@ 2022-10-30  9:17     ` Christoph Hellwig
  0 siblings, 0 replies; 47+ messages in thread
From: Christoph Hellwig @ 2022-10-30  9:17 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Christoph Hellwig, Jens Axboe, Keith Busch, Chao Leng, Ming Lei,
	linux-nvme, linux-block

On Wed, Oct 26, 2022 at 03:46:09PM +0300, Sagi Grimberg wrote:
> This is clearer, but what I think would be even cleaner is if we simply
> move the whole first-time initialization to a different probe_work and
> treat it as such, instead of relying on resource existence as a state
> indicator (tagset/admin_q). The shared portion can move to helpers.

I started playing with that a bit on my flight home this week.
I think the right thing is to do away with the scheduled work for probe
entirely and set the PROBE_PREFER_ASYNCHRONOUS flag in the driver, and
that should also really help with probe error handling.  But that's
another fairly big set of changes over this already quite big series
so I'd prefer to do it separately.
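
The flag in question lives on the driver core side; a minimal sketch of
what setting it on the nvme pci_driver might look like (illustrative
only, field layout abridged):

```c
static struct pci_driver nvme_driver = {
	.name		= "nvme",
	.id_table	= nvme_id_table,
	.probe		= nvme_probe,
	.remove		= nvme_remove,
	.driver		= {
		/* Let the driver core run probe asynchronously instead
		 * of the driver scheduling its own reset_work. */
		.probe_type	= PROBE_PREFER_ASYNCHRONOUS,
	},
};
```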

* Re: [PATCH 03/17] nvme-pci: don't warn about the lack of I/O queues for admin controllers
  2022-10-26 12:49   ` Sagi Grimberg
@ 2022-10-30  9:18     ` Christoph Hellwig
  0 siblings, 0 replies; 47+ messages in thread
From: Christoph Hellwig @ 2022-10-30  9:18 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Christoph Hellwig, Jens Axboe, Keith Busch, Chao Leng, Ming Lei,
	linux-nvme, linux-block

On Wed, Oct 26, 2022 at 03:49:06PM +0300, Sagi Grimberg wrote:
> I have a feeling that we have quite a few other messages that are
> irrelevant for admin controllers. And I wonder what device you have that
> presents an admin controller, but looks good,

Could be.  I don't actually have any administrative controller at hand,
I just noticed the warning while looking over this code.

* Re: [PATCH 04/17] nvme: don't call nvme_kill_queues from nvme_remove_namespaces
  2022-10-25 17:43   ` Keith Busch
  2022-10-25 20:17     ` Sagi Grimberg
@ 2022-10-30  9:22     ` Christoph Hellwig
  1 sibling, 0 replies; 47+ messages in thread
From: Christoph Hellwig @ 2022-10-30  9:22 UTC (permalink / raw)
  To: Keith Busch
  Cc: Christoph Hellwig, Jens Axboe, Sagi Grimberg, Chao Leng,
	Ming Lei, linux-nvme, linux-block

On Tue, Oct 25, 2022 at 11:43:19AM -0600, Keith Busch wrote:
> We still need the flush_work(scan_work) prior to killing the queues.

Yes.

> It
> looks like it could safely be moved to nvme_stop_ctrl(), which might
> make it easier on everyone if it were there.

Hmm.  As Sagi pointed out this could become a bit more complicated due
to multipath, but in general I'd love to bring some more regularity
into the teardown path.

* Re: [PATCH 07/17] nvme: remove the NVME_NS_DEAD check in nvme_validate_ns
  2022-10-26 12:52   ` Sagi Grimberg
@ 2022-10-30  9:28     ` Christoph Hellwig
  0 siblings, 0 replies; 47+ messages in thread
From: Christoph Hellwig @ 2022-10-30  9:28 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Christoph Hellwig, Jens Axboe, Keith Busch, Chao Leng, Ming Lei,
	linux-nvme, linux-block

On Wed, Oct 26, 2022 at 03:52:25PM +0300, Sagi Grimberg wrote:
>
>> We never rescan namespaces after marking them dead, so this check is dead
>> now.
>
> scan_work can still run, I'd keep this one.

At this point the controller is already in a non-live state, so the
command won't go anywhere; thus, except for a tiny race, we won't even
get to this call.

That being said, if we flush the scan_work earlier this will all become
moot anyway.

end of thread, other threads:[~2022-10-30  9:28 UTC | newest]

Thread overview: 47+ messages
2022-10-25 14:40 per-tagset SRCU struct and quiesce v2 Christoph Hellwig
2022-10-25 14:40 ` [PATCH 01/17] block: set the disk capacity to 0 in blk_mark_disk_dead Christoph Hellwig
2022-10-26 12:39   ` Sagi Grimberg
2022-10-25 14:40 ` [PATCH 02/17] nvme-pci: refactor the tagset handling in nvme_reset_work Christoph Hellwig
2022-10-26 12:46   ` Sagi Grimberg
2022-10-30  9:17     ` Christoph Hellwig
2022-10-25 14:40 ` [PATCH 03/17] nvme-pci: don't warn about the lack of I/O queues for admin controllers Christoph Hellwig
2022-10-26 12:49   ` Sagi Grimberg
2022-10-30  9:18     ` Christoph Hellwig
2022-10-25 14:40 ` [PATCH 04/17] nvme: don't call nvme_kill_queues from nvme_remove_namespaces Christoph Hellwig
2022-10-25 17:43   ` Keith Busch
2022-10-25 20:17     ` Sagi Grimberg
2022-10-30  9:22     ` Christoph Hellwig
2022-10-25 14:40 ` [PATCH 05/17] nvme: don't remove namespaces in nvme_passthru_end Christoph Hellwig
2022-10-26 12:50   ` Sagi Grimberg
2022-10-25 14:40 ` [PATCH 06/17] nvme: remove the NVME_NS_DEAD check in nvme_remove_invalid_namespaces Christoph Hellwig
2022-10-26 12:50   ` Sagi Grimberg
2022-10-25 14:40 ` [PATCH 07/17] nvme: remove the NVME_NS_DEAD check in nvme_validate_ns Christoph Hellwig
2022-10-26 12:52   ` Sagi Grimberg
2022-10-30  9:28     ` Christoph Hellwig
2022-10-25 14:40 ` [PATCH 08/17] nvme: don't unquiesce the admin queue in nvme_kill_queues Christoph Hellwig
2022-10-26 12:53   ` Sagi Grimberg
2022-10-25 14:40 ` [PATCH 09/17] nvme: don't unquiesce the I/O queues " Christoph Hellwig
2022-10-26 12:54   ` Sagi Grimberg
2022-10-25 14:40 ` [PATCH 10/17] nvme-pci: mark the namespaces dead earlier in nvme_remove Christoph Hellwig
2022-10-25 18:53   ` Keith Busch
2022-10-26 12:55   ` Sagi Grimberg
2022-10-26 12:57     ` Sagi Grimberg
2022-10-25 14:40 ` [PATCH 11/17] nvme-pci: don't unquiesce the I/O queues in nvme_remove_dead_ctrl Christoph Hellwig
2022-10-26  8:34   ` Chao Leng
2022-10-26 12:58     ` Sagi Grimberg
2022-10-27  2:46       ` Chao Leng
2022-10-25 14:40 ` [PATCH 12/17] nvme-pci: don't unquiesce the I/O queues in apple_nvme_reset_work Christoph Hellwig
2022-10-26  8:37   ` Chao Leng
2022-10-26 12:58   ` Sagi Grimberg
2022-10-25 14:40 ` [PATCH 13/17] blk-mq: skip non-mq queues in blk_mq_quiesce_queue Christoph Hellwig
2022-10-25 14:40 ` [PATCH 14/17] blk-mq: move the srcu_struct used for quiescing to the tagset Christoph Hellwig
2022-10-26  8:48   ` Chao Leng
2022-10-26 13:01     ` Sagi Grimberg
2022-10-27  2:49       ` Chao Leng
2022-10-27 10:02         ` Sagi Grimberg
2022-10-25 14:40 ` [PATCH 15/17] blk-mq: pass a tagset to blk_mq_wait_quiesce_done Christoph Hellwig
2022-10-25 14:40 ` [PATCH 16/17] blk-mq: add tagset quiesce interface Christoph Hellwig
2022-10-26  8:51   ` Chao Leng
2022-10-26 13:02     ` Sagi Grimberg
2022-10-25 14:40 ` [PATCH 17/17] nvme: use blk_mq_[un]quiesce_tagset Christoph Hellwig
2022-10-26 13:03   ` Sagi Grimberg
