linux-block.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v5 0/7] Elevator cleanups and improvements
@ 2019-09-05  9:51 Damien Le Moal
  2019-09-05  9:51 ` [PATCH v5 1/7] block: Cleanup elevator_init_mq() use Damien Le Moal
                   ` (8 more replies)
  0 siblings, 9 replies; 14+ messages in thread
From: Damien Le Moal @ 2019-09-05  9:51 UTC (permalink / raw)
  To: linux-block, Jens Axboe, linux-scsi, Martin K . Petersen,
	dm-devel, Mike Snitzer
  Cc: Ming Lei

This patch series implements some cleanup of the elevator initialization
code and introduces elevator features identification and device matching
to enhance checks for elevator/device compatibility and fitness.

The first 2 patches of the series are simple cleanups which simplify 
elevator initialization for newly allocated device queues.

Patch 3 introduce elevator features, allowing a clean and extensible
definition of devices and features that an elevator supports and match
these against features required by a block device. With this, the sysfs
elevator list for a device always shows only elevators matching the
features that a particular device requires, with the exception of the
none elevator which has no features but is always available for use
with any device.

The first feature defined is for zoned block device sequential write
constraint support through zone write locking which prevents the use of
any elevator that does not support this feature with zoned devices.

The last 4 patches of this series rework the default elevator selection
and initialization to allow for the elevator/device features matching
to work, doing so addressing cases not currently well supported, namely,
multi-queue zoned block devices.

Changes from v4:
* Fix patch 5 again to correctly handle request based DM devices and
  avoid that default queue elevator of these devices end up always
  being "none".

Changes from v3:
* Fixed patch 5 to correctly handle DM devices which do not register a
  request queue and so do not need elevator initialization.

Changes from v2:
* Fixed patch 4
* Call elevator_init_mq() earlier in device_add_disk() as suggested by
  Christoph (patch 5)
* Fixed title of patch 7

Changes from v1:
* Addressed Johannes comments
* Rebased on newest for-next branch to include Ming's sysfs lock changes

Damien Le Moal (7):
  block: Cleanup elevator_init_mq() use
  block: Change elevator_init_mq() to always succeed
  block: Introduce elevator features
  block: Improve default elevator selection
  block: Delay default elevator initialization
  block: Set ELEVATOR_F_ZBD_SEQ_WRITE for nullblk zoned disks
  sd: Set ELEVATOR_F_ZBD_SEQ_WRITE for ZBC disks

 block/blk-mq.c                |  20 +++--
 block/blk-settings.c          |  16 ++++
 block/blk.h                   |   2 +-
 block/elevator.c              | 137 ++++++++++++++++++++++++++--------
 block/genhd.c                 |   9 +++
 block/mq-deadline.c           |   1 +
 drivers/block/null_blk_main.c |   2 +
 drivers/md/dm-rq.c            |   2 +-
 drivers/scsi/sd_zbc.c         |   2 +
 include/linux/blk-mq.h        |   3 +-
 include/linux/blkdev.h        |   4 +
 include/linux/elevator.h      |   8 ++
 12 files changed, 161 insertions(+), 45 deletions(-)

-- 
2.21.0


^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH v5 1/7] block: Cleanup elevator_init_mq() use
  2019-09-05  9:51 [PATCH v5 0/7] Elevator cleanups and improvements Damien Le Moal
@ 2019-09-05  9:51 ` Damien Le Moal
  2019-09-05  9:51 ` [PATCH v5 2/7] block: Change elevator_init_mq() to always succeed Damien Le Moal
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 14+ messages in thread
From: Damien Le Moal @ 2019-09-05  9:51 UTC (permalink / raw)
  To: linux-block, Jens Axboe, linux-scsi, Martin K . Petersen,
	dm-devel, Mike Snitzer
  Cc: Ming Lei

Instead of checking a queue tag_set BLK_MQ_F_NO_SCHED flag before
calling elevator_init_mq() to make sure that the queue supports IO
scheduling, use the elevator.c function elv_support_iosched() in
elevator_init_mq(). This does not introduce any functional change but
ensure that elevator_init_mq() does the right thing based on the queue
settings.

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 block/blk-mq.c   |  8 +++-----
 block/elevator.c | 23 +++++++++++++----------
 2 files changed, 16 insertions(+), 15 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index b622029b19ea..13923630e00a 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2904,11 +2904,9 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
 	blk_mq_add_queue_tag_set(set, q);
 	blk_mq_map_swqueue(q);
 
-	if (!(set->flags & BLK_MQ_F_NO_SCHED)) {
-		ret = elevator_init_mq(q);
-		if (ret)
-			goto err_tag_set;
-	}
+	ret = elevator_init_mq(q);
+	if (ret)
+		goto err_tag_set;
 
 	return q;
 
diff --git a/block/elevator.c b/block/elevator.c
index 86100de88883..4721834815bb 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -619,16 +619,26 @@ int elevator_switch_mq(struct request_queue *q,
 	return ret;
 }
 
+static inline bool elv_support_iosched(struct request_queue *q)
+{
+	if (q->tag_set && (q->tag_set->flags & BLK_MQ_F_NO_SCHED))
+		return false;
+	return true;
+}
+
 /*
- * For blk-mq devices, we default to using mq-deadline, if available, for single
- * queue devices.  If deadline isn't available OR we have multiple queues,
- * default to "none".
+ * For blk-mq devices supporting IO scheduling, we default to using mq-deadline,
+ * if available, for single queue devices. If deadline isn't available OR we
+ * have multiple queues, default to "none".
  */
 int elevator_init_mq(struct request_queue *q)
 {
 	struct elevator_type *e;
 	int err = 0;
 
+	if (!elv_support_iosched(q))
+		return 0;
+
 	if (q->nr_hw_queues != 1)
 		return 0;
 
@@ -706,13 +716,6 @@ static int __elevator_change(struct request_queue *q, const char *name)
 	return elevator_switch(q, e);
 }
 
-static inline bool elv_support_iosched(struct request_queue *q)
-{
-	if (q->tag_set && (q->tag_set->flags & BLK_MQ_F_NO_SCHED))
-		return false;
-	return true;
-}
-
 ssize_t elv_iosched_store(struct request_queue *q, const char *name,
 			  size_t count)
 {
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH v5 2/7] block: Change elevator_init_mq() to always succeed
  2019-09-05  9:51 [PATCH v5 0/7] Elevator cleanups and improvements Damien Le Moal
  2019-09-05  9:51 ` [PATCH v5 1/7] block: Cleanup elevator_init_mq() use Damien Le Moal
@ 2019-09-05  9:51 ` Damien Le Moal
  2019-09-05  9:51 ` [PATCH v5 3/7] block: Introduce elevator features Damien Le Moal
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 14+ messages in thread
From: Damien Le Moal @ 2019-09-05  9:51 UTC (permalink / raw)
  To: linux-block, Jens Axboe, linux-scsi, Martin K . Petersen,
	dm-devel, Mike Snitzer
  Cc: Ming Lei

If the default elevator chosen is mq-deadline, elevator_init_mq() may
return an error if mq-deadline initialization fails, leading to
blk_mq_init_allocated_queue() returning an error, which in turn will
cause the block device initialization to fail and the device not being
exposed.

Instead of taking such extreme measure, handle mq-deadline
initialization failures in the same manner as when mq-deadline is not
available (no module to load), that is, default to the "none" scheduler.
With this change, elevator_init_mq() return type can be changed to void.

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 block/blk-mq.c   |  8 +-------
 block/blk.h      |  2 +-
 block/elevator.c | 23 ++++++++++++-----------
 3 files changed, 14 insertions(+), 19 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 13923630e00a..ee4caf0c0807 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2842,8 +2842,6 @@ static unsigned int nr_hw_queues(struct blk_mq_tag_set *set)
 struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
 						  struct request_queue *q)
 {
-	int ret = -ENOMEM;
-
 	/* mark the queue as mq asap */
 	q->mq_ops = set->ops;
 
@@ -2904,14 +2902,10 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
 	blk_mq_add_queue_tag_set(set, q);
 	blk_mq_map_swqueue(q);
 
-	ret = elevator_init_mq(q);
-	if (ret)
-		goto err_tag_set;
+	elevator_init_mq(q);
 
 	return q;
 
-err_tag_set:
-	blk_mq_del_queue_tag_set(q);
 err_hctxs:
 	kfree(q->queue_hw_ctx);
 	q->nr_hw_queues = 0;
diff --git a/block/blk.h b/block/blk.h
index e4619fc5c99a..ed347f7a97b1 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -184,7 +184,7 @@ void blk_account_io_done(struct request *req, u64 now);
 
 void blk_insert_flush(struct request *rq);
 
-int elevator_init_mq(struct request_queue *q);
+void elevator_init_mq(struct request_queue *q);
 int elevator_switch_mq(struct request_queue *q,
 			      struct elevator_type *new_e);
 void __elevator_exit(struct request_queue *, struct elevator_queue *);
diff --git a/block/elevator.c b/block/elevator.c
index 4721834815bb..2944c129760c 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -628,34 +628,35 @@ static inline bool elv_support_iosched(struct request_queue *q)
 
 /*
  * For blk-mq devices supporting IO scheduling, we default to using mq-deadline,
- * if available, for single queue devices. If deadline isn't available OR we
- * have multiple queues, default to "none".
+ * if available, for single queue devices. If deadline isn't available OR
+ * deadline initialization fails OR we have multiple queues, default to "none".
  */
-int elevator_init_mq(struct request_queue *q)
+void elevator_init_mq(struct request_queue *q)
 {
 	struct elevator_type *e;
-	int err = 0;
+	int err;
 
 	if (!elv_support_iosched(q))
-		return 0;
+		return;
 
 	if (q->nr_hw_queues != 1)
-		return 0;
+		return;
 
 	WARN_ON_ONCE(test_bit(QUEUE_FLAG_REGISTERED, &q->queue_flags));
 
 	if (unlikely(q->elevator))
-		goto out;
+		return;
 
 	e = elevator_get(q, "mq-deadline", false);
 	if (!e)
-		goto out;
+		return;
 
 	err = blk_mq_init_sched(q, e);
-	if (err)
+	if (err) {
+		pr_warn("\"%s\" elevator initialization failed, "
+			"falling back to \"none\"\n", e->elevator_name);
 		elevator_put(e);
-out:
-	return err;
+	}
 }
 
 
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH v5 3/7] block: Introduce elevator features
  2019-09-05  9:51 [PATCH v5 0/7] Elevator cleanups and improvements Damien Le Moal
  2019-09-05  9:51 ` [PATCH v5 1/7] block: Cleanup elevator_init_mq() use Damien Le Moal
  2019-09-05  9:51 ` [PATCH v5 2/7] block: Change elevator_init_mq() to always succeed Damien Le Moal
@ 2019-09-05  9:51 ` Damien Le Moal
  2019-09-05  9:51 ` [PATCH v5 4/7] block: Improve default elevator selection Damien Le Moal
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 14+ messages in thread
From: Damien Le Moal @ 2019-09-05  9:51 UTC (permalink / raw)
  To: linux-block, Jens Axboe, linux-scsi, Martin K . Petersen,
	dm-devel, Mike Snitzer
  Cc: Ming Lei

Introduce the definition of elevator features through the
elevator_features flags in the elevator_type structure. Each flag can
represent a feature supported by an elevator. The first feature defined
by this patch is support for zoned block device sequential write
constraint with the flag ELEVATOR_F_ZBD_SEQ_WRITE, which is implemented
by the mq-deadline elevator using zone write locking.

Other possible features are IO priorities, write hints, latency targets
or single-LUN dual-actuator disks (for which the elevator could maintain
one LBA ordered list per actuator).

The required_elevator_features field is also added to the request_queue
structure to allow a device driver to specify elevator feature flags
that an elevator must support for the correct operation of the device
(e.g. device drivers for zoned block devices can have the
ELEVATOR_F_ZBD_SEQ_WRITE flag as a required feature).
The helper function blk_queue_required_elevator_features() is
defined for setting this new field.

With these two new fields in place, the elevator functions
elevator_match() and elevator_find() are modified to allow a user to set
only an elevator with a set of features that satisfies the device
required features. Elevators not matching the device requirements are
not shown in the device sysfs queue/scheduler file to prevent their use.

The "none" elevator can always be selected as before.

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 block/blk-settings.c     | 16 +++++++++++++
 block/elevator.c         | 49 +++++++++++++++++++++++++++++++---------
 block/mq-deadline.c      |  1 +
 include/linux/blkdev.h   |  4 ++++
 include/linux/elevator.h |  8 +++++++
 5 files changed, 67 insertions(+), 11 deletions(-)

diff --git a/block/blk-settings.c b/block/blk-settings.c
index a058997b9cce..6bd1e3b082d8 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -832,6 +832,22 @@ void blk_queue_write_cache(struct request_queue *q, bool wc, bool fua)
 }
 EXPORT_SYMBOL_GPL(blk_queue_write_cache);
 
+/**
+ * blk_queue_required_elevator_features - Set a queue required elevator features
+ * @q:		the request queue for the target device
+ * @features:	Required elevator features OR'ed together
+ *
+ * Tell the block layer that for the device controlled through @q, only the
+ * only elevators that can be used are those that implement at least the set of
+ * features specified by @features.
+ */
+void blk_queue_required_elevator_features(struct request_queue *q,
+					  unsigned int features)
+{
+	q->required_elevator_features = features;
+}
+EXPORT_SYMBOL_GPL(blk_queue_required_elevator_features);
+
 static int __init blk_settings_init(void)
 {
 	blk_max_low_pfn = max_low_pfn - 1;
diff --git a/block/elevator.c b/block/elevator.c
index 2944c129760c..ac7c8ad580ba 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -83,8 +83,26 @@ bool elv_bio_merge_ok(struct request *rq, struct bio *bio)
 }
 EXPORT_SYMBOL(elv_bio_merge_ok);
 
-static bool elevator_match(const struct elevator_type *e, const char *name)
+static inline bool elv_support_features(unsigned int elv_features,
+					unsigned int required_features)
 {
+	return (required_features & elv_features) == required_features;
+}
+
+/**
+ * elevator_match - Test an elevator name and features
+ * @e: Scheduler to test
+ * @name: Elevator name to test
+ * @required_features: Features that the elevator must provide
+ *
+ * Return true is the elevator @e name matches @name and if @e provides all the
+ * the feratures spcified by @required_features.
+ */
+static bool elevator_match(const struct elevator_type *e, const char *name,
+			   unsigned int required_features)
+{
+	if (!elv_support_features(e->elevator_features, required_features))
+		return false;
 	if (!strcmp(e->elevator_name, name))
 		return true;
 	if (e->elevator_alias && !strcmp(e->elevator_alias, name))
@@ -93,15 +111,21 @@ static bool elevator_match(const struct elevator_type *e, const char *name)
 	return false;
 }
 
-/*
- * Return scheduler with name 'name'
+/**
+ * elevator_find - Find an elevator
+ * @name: Name of the elevator to find
+ * @required_features: Features that the elevator must provide
+ *
+ * Return the first registered scheduler with name @name and supporting the
+ * features @required_features and NULL otherwise.
  */
-static struct elevator_type *elevator_find(const char *name)
+static struct elevator_type *elevator_find(const char *name,
+					   unsigned int required_features)
 {
 	struct elevator_type *e;
 
 	list_for_each_entry(e, &elv_list, list) {
-		if (elevator_match(e, name))
+		if (elevator_match(e, name, required_features))
 			return e;
 	}
 
@@ -120,12 +144,12 @@ static struct elevator_type *elevator_get(struct request_queue *q,
 
 	spin_lock(&elv_list_lock);
 
-	e = elevator_find(name);
+	e = elevator_find(name, q->required_elevator_features);
 	if (!e && try_loading) {
 		spin_unlock(&elv_list_lock);
 		request_module("%s-iosched", name);
 		spin_lock(&elv_list_lock);
-		e = elevator_find(name);
+		e = elevator_find(name, q->required_elevator_features);
 	}
 
 	if (e && !try_module_get(e->elevator_owner))
@@ -525,7 +549,7 @@ int elv_register(struct elevator_type *e)
 
 	/* register, don't allow duplicate names */
 	spin_lock(&elv_list_lock);
-	if (elevator_find(e->elevator_name)) {
+	if (elevator_find(e->elevator_name, 0)) {
 		spin_unlock(&elv_list_lock);
 		kmem_cache_destroy(e->icq_cache);
 		return -EBUSY;
@@ -709,7 +733,8 @@ static int __elevator_change(struct request_queue *q, const char *name)
 	if (!e)
 		return -EINVAL;
 
-	if (q->elevator && elevator_match(q->elevator->type, elevator_name)) {
+	if (q->elevator &&
+	    elevator_match(q->elevator->type, elevator_name, 0)) {
 		elevator_put(e);
 		return 0;
 	}
@@ -749,11 +774,13 @@ ssize_t elv_iosched_show(struct request_queue *q, char *name)
 
 	spin_lock(&elv_list_lock);
 	list_for_each_entry(__e, &elv_list, list) {
-		if (elv && elevator_match(elv, __e->elevator_name)) {
+		if (elv && elevator_match(elv, __e->elevator_name, 0)) {
 			len += sprintf(name+len, "[%s] ", elv->elevator_name);
 			continue;
 		}
-		if (elv_support_iosched(q))
+		if (elv_support_iosched(q) &&
+		    elevator_match(__e, __e->elevator_name,
+				   q->required_elevator_features))
 			len += sprintf(name+len, "%s ", __e->elevator_name);
 	}
 	spin_unlock(&elv_list_lock);
diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index 35e84bc0ec8c..b490f47fd553 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -794,6 +794,7 @@ static struct elevator_type mq_deadline = {
 	.elevator_attrs = deadline_attrs,
 	.elevator_name = "mq-deadline",
 	.elevator_alias = "deadline",
+	.elevator_features = ELEVATOR_F_ZBD_SEQ_WRITE,
 	.elevator_owner = THIS_MODULE,
 };
 MODULE_ALIAS("mq-deadline-iosched");
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index d0ad21e4771b..b196124e3240 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -496,6 +496,8 @@ struct request_queue {
 
 	struct queue_limits	limits;
 
+	unsigned int		required_elevator_features;
+
 #ifdef CONFIG_BLK_DEV_ZONED
 	/*
 	 * Zoned block device information for request dispatch control.
@@ -1097,6 +1099,8 @@ extern void blk_queue_dma_alignment(struct request_queue *, int);
 extern void blk_queue_update_dma_alignment(struct request_queue *, int);
 extern void blk_queue_rq_timeout(struct request_queue *, unsigned int);
 extern void blk_queue_write_cache(struct request_queue *q, bool enabled, bool fua);
+extern void blk_queue_required_elevator_features(struct request_queue *q,
+						 unsigned int features);
 
 /*
  * Number of physical segments as sent to the device.
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index 1dd014c9c87b..901bda352dcb 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -76,6 +76,7 @@ struct elevator_type
 	struct elv_fs_entry *elevator_attrs;
 	const char *elevator_name;
 	const char *elevator_alias;
+	const unsigned int elevator_features;
 	struct module *elevator_owner;
 #ifdef CONFIG_BLK_DEBUG_FS
 	const struct blk_mq_debugfs_attr *queue_debugfs_attrs;
@@ -165,5 +166,12 @@ extern struct request *elv_rb_find(struct rb_root *, sector_t);
 #define rq_entry_fifo(ptr)	list_entry((ptr), struct request, queuelist)
 #define rq_fifo_clear(rq)	list_del_init(&(rq)->queuelist)
 
+/*
+ * Elevator features.
+ */
+
+/* Supports zoned block devices sequential write constraint */
+#define ELEVATOR_F_ZBD_SEQ_WRITE	(1U << 0)
+
 #endif /* CONFIG_BLOCK */
 #endif
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH v5 4/7] block: Improve default elevator selection
  2019-09-05  9:51 [PATCH v5 0/7] Elevator cleanups and improvements Damien Le Moal
                   ` (2 preceding siblings ...)
  2019-09-05  9:51 ` [PATCH v5 3/7] block: Introduce elevator features Damien Le Moal
@ 2019-09-05  9:51 ` Damien Le Moal
  2019-09-05  9:51 ` [PATCH v5 5/7] block: Delay default elevator initialization Damien Le Moal
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 14+ messages in thread
From: Damien Le Moal @ 2019-09-05  9:51 UTC (permalink / raw)
  To: linux-block, Jens Axboe, linux-scsi, Martin K . Petersen,
	dm-devel, Mike Snitzer
  Cc: Ming Lei

For block devices that do not specify required features, preserve the
current default elevator selection (mq-deadline for single queue
devices, none for multi-queue devices). However, for devices specifying
required features (e.g. zoned block devices ELEVATOR_F_ZBD_SEQ_WRITE
feature), select the first available elevator providing the required
features.

In all cases, default to "none" if no elevator is available or if the
initialization of the default elevator fails.

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
---
 block/elevator.c | 51 +++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 44 insertions(+), 7 deletions(-)

diff --git a/block/elevator.c b/block/elevator.c
index ac7c8ad580ba..520d6b224b74 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -651,9 +651,46 @@ static inline bool elv_support_iosched(struct request_queue *q)
 }
 
 /*
- * For blk-mq devices supporting IO scheduling, we default to using mq-deadline,
- * if available, for single queue devices. If deadline isn't available OR
- * deadline initialization fails OR we have multiple queues, default to "none".
+ * For single queue devices, default to using mq-deadline. If we have multiple
+ * queues or mq-deadline is not available, default to "none".
+ */
+static struct elevator_type *elevator_get_default(struct request_queue *q)
+{
+	if (q->nr_hw_queues != 1)
+		return NULL;
+
+	return elevator_get(q, "mq-deadline", false);
+}
+
+/*
+ * Get the first elevator providing the features required by the request queue.
+ * Default to "none" if no matching elevator is found.
+ */
+static struct elevator_type *elevator_get_by_features(struct request_queue *q)
+{
+	struct elevator_type *e;
+
+	spin_lock(&elv_list_lock);
+
+	list_for_each_entry(e, &elv_list, list) {
+		if (elv_support_features(e->elevator_features,
+					 q->required_elevator_features))
+			break;
+	}
+
+	if (e && !try_module_get(e->elevator_owner))
+		e = NULL;
+
+	spin_unlock(&elv_list_lock);
+
+	return e;
+}
+
+/*
+ * For a device queue that has no required features, use the default elevator
+ * settings. Otherwise, use the first elevator available matching the required
+ * features. If no suitable elevator is find or if the chosen elevator
+ * initialization fails, fall back to the "none" elevator (no elevator).
  */
 void elevator_init_mq(struct request_queue *q)
 {
@@ -663,15 +700,15 @@ void elevator_init_mq(struct request_queue *q)
 	if (!elv_support_iosched(q))
 		return;
 
-	if (q->nr_hw_queues != 1)
-		return;
-
 	WARN_ON_ONCE(test_bit(QUEUE_FLAG_REGISTERED, &q->queue_flags));
 
 	if (unlikely(q->elevator))
 		return;
 
-	e = elevator_get(q, "mq-deadline", false);
+	if (!q->required_elevator_features)
+		e = elevator_get_default(q);
+	else
+		e = elevator_get_by_features(q);
 	if (!e)
 		return;
 
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH v5 5/7] block: Delay default elevator initialization
  2019-09-05  9:51 [PATCH v5 0/7] Elevator cleanups and improvements Damien Le Moal
                   ` (3 preceding siblings ...)
  2019-09-05  9:51 ` [PATCH v5 4/7] block: Improve default elevator selection Damien Le Moal
@ 2019-09-05  9:51 ` Damien Le Moal
  2019-10-01 20:46   ` Eric Farman
  2019-09-05  9:51 ` [PATCH v5 6/7] block: Set ELEVATOR_F_ZBD_SEQ_WRITE for nullblk zoned disks Damien Le Moal
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 14+ messages in thread
From: Damien Le Moal @ 2019-09-05  9:51 UTC (permalink / raw)
  To: linux-block, Jens Axboe, linux-scsi, Martin K . Petersen,
	dm-devel, Mike Snitzer
  Cc: Ming Lei

When elevator_init_mq() is called from blk_mq_init_allocated_queue(),
the only information known about the device is the number of hardware
queues as the block device scan by the device driver is not completed
yet for most drivers. The device type and elevator required features
are not set yet, preventing to correctly select the default elevator
most suitable for the device.

This currently affects all multi-queue zoned block devices which default
to the "none" elevator instead of the required "mq-deadline" elevator.
These drives currently include host-managed SMR disks connected to a
smartpqi HBA and null_blk block devices with zoned mode enabled.
Upcoming NVMe Zoned Namespace devices will also be affected.

Fix this by adding the boolean elevator_init argument to
blk_mq_init_allocated_queue() to control the execution of
elevator_init_mq(). Two cases exist:
1) elevator_init = false is used for calls to
   blk_mq_init_allocated_queue() within blk_mq_init_queue(). In this
   case, a call to elevator_init_mq() is added to __device_add_disk(),
   resulting in the delayed initialization of the queue elevator
   after the device driver finished probing the device information. This
   effectively allows elevator_init_mq() access to more information
   about the device.
2) elevator_init = true preserves the current behavior of initializing
   the elevator directly from blk_mq_init_allocated_queue(). This case
   is used for the special request based DM devices where the device
   gendisk is created before the queue initialization and device
   information (e.g. queue limits) is already known when the queue
   initialization is executed.

Additionally, to make sure that the elevator initialization is never
done while requests are in-flight (there should be none when the device
driver calls device_add_disk()), freeze and quiesce the device request
queue before calling blk_mq_init_sched() in elevator_init_mq().

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
---
 block/blk-mq.c         | 12 +++++++++---
 block/elevator.c       |  7 +++++++
 block/genhd.c          |  9 +++++++++
 drivers/md/dm-rq.c     |  2 +-
 include/linux/blk-mq.h |  3 ++-
 5 files changed, 28 insertions(+), 5 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index ee4caf0c0807..240416057f28 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2689,7 +2689,11 @@ struct request_queue *blk_mq_init_queue(struct blk_mq_tag_set *set)
 	if (!uninit_q)
 		return ERR_PTR(-ENOMEM);
 
-	q = blk_mq_init_allocated_queue(set, uninit_q);
+	/*
+	 * Initialize the queue without an elevator. device_add_disk() will do
+	 * the initialization.
+	 */
+	q = blk_mq_init_allocated_queue(set, uninit_q, false);
 	if (IS_ERR(q))
 		blk_cleanup_queue(uninit_q);
 
@@ -2840,7 +2844,8 @@ static unsigned int nr_hw_queues(struct blk_mq_tag_set *set)
 }
 
 struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
-						  struct request_queue *q)
+						  struct request_queue *q,
+						  bool elevator_init)
 {
 	/* mark the queue as mq asap */
 	q->mq_ops = set->ops;
@@ -2902,7 +2907,8 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
 	blk_mq_add_queue_tag_set(set, q);
 	blk_mq_map_swqueue(q);
 
-	elevator_init_mq(q);
+	if (elevator_init)
+		elevator_init_mq(q);
 
 	return q;
 
diff --git a/block/elevator.c b/block/elevator.c
index 520d6b224b74..096a670d22d7 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -712,7 +712,14 @@ void elevator_init_mq(struct request_queue *q)
 	if (!e)
 		return;
 
+	blk_mq_freeze_queue(q);
+	blk_mq_quiesce_queue(q);
+
 	err = blk_mq_init_sched(q, e);
+
+	blk_mq_unquiesce_queue(q);
+	blk_mq_unfreeze_queue(q);
+
 	if (err) {
 		pr_warn("\"%s\" elevator initialization failed, "
 			"falling back to \"none\"\n", e->elevator_name);
diff --git a/block/genhd.c b/block/genhd.c
index 54f1f0d381f4..26b31fcae217 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -695,6 +695,15 @@ static void __device_add_disk(struct device *parent, struct gendisk *disk,
 	dev_t devt;
 	int retval;
 
+	/*
+	 * The disk queue should now be all set with enough information about
+	 * the device for the elevator code to pick an adequate default
+	 * elevator if one is needed, that is, for devices requesting queue
+	 * registration.
+	 */
+	if (register_queue)
+		elevator_init_mq(disk->queue);
+
 	/* minors == 0 indicates to use ext devt from part0 and should
 	 * be accompanied with EXT_DEVT flag.  Make sure all
 	 * parameters make sense.
diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
index 21d5c1784d0c..3f8577e2c13b 100644
--- a/drivers/md/dm-rq.c
+++ b/drivers/md/dm-rq.c
@@ -563,7 +563,7 @@ int dm_mq_init_request_queue(struct mapped_device *md, struct dm_table *t)
 	if (err)
 		goto out_kfree_tag_set;
 
-	q = blk_mq_init_allocated_queue(md->tag_set, md->queue);
+	q = blk_mq_init_allocated_queue(md->tag_set, md->queue, true);
 	if (IS_ERR(q)) {
 		err = PTR_ERR(q);
 		goto out_tag_set;
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 62a3bb715899..0bf056de5cc3 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -248,7 +248,8 @@ enum {
 
 struct request_queue *blk_mq_init_queue(struct blk_mq_tag_set *);
 struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
-						  struct request_queue *q);
+						  struct request_queue *q,
+						  bool elevator_init);
 struct request_queue *blk_mq_init_sq_queue(struct blk_mq_tag_set *set,
 						const struct blk_mq_ops *ops,
 						unsigned int queue_depth,
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH v5 6/7] block: Set ELEVATOR_F_ZBD_SEQ_WRITE for nullblk zoned disks
  2019-09-05  9:51 [PATCH v5 0/7] Elevator cleanups and improvements Damien Le Moal
                   ` (4 preceding siblings ...)
  2019-09-05  9:51 ` [PATCH v5 5/7] block: Delay default elevator initialization Damien Le Moal
@ 2019-09-05  9:51 ` Damien Le Moal
  2019-09-05  9:51 ` [PATCH v5 7/7] sd: Set ELEVATOR_F_ZBD_SEQ_WRITE for ZBC disks Damien Le Moal
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 14+ messages in thread
From: Damien Le Moal @ 2019-09-05  9:51 UTC (permalink / raw)
  To: linux-block, Jens Axboe, linux-scsi, Martin K . Petersen,
	dm-devel, Mike Snitzer
  Cc: Ming Lei

Using the helper blk_queue_required_elevator_features(), set the
elevator feature ELEVATOR_F_ZBD_SEQ_WRITE as required for the request
queue of null_blk devices created with zoned mode enabled.

This feature requirement can always be satisfied as the mq-deadline
elevator is always selected for in-kernel compilation when
CONFIG_BLK_DEV_ZONED (zoned block device support) is enabled.

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
---
 drivers/block/null_blk_main.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/block/null_blk_main.c b/drivers/block/null_blk_main.c
index b26a178d064d..b29b273690b0 100644
--- a/drivers/block/null_blk_main.c
+++ b/drivers/block/null_blk_main.c
@@ -1695,6 +1695,8 @@ static int null_add_dev(struct nullb_device *dev)
 		blk_queue_chunk_sectors(nullb->q, dev->zone_size_sects);
 		nullb->q->limits.zoned = BLK_ZONED_HM;
 		blk_queue_flag_set(QUEUE_FLAG_ZONE_RESETALL, nullb->q);
+		blk_queue_required_elevator_features(nullb->q,
+						ELEVATOR_F_ZBD_SEQ_WRITE);
 	}
 
 	nullb->q->queuedata = nullb;
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH v5 7/7] sd: Set ELEVATOR_F_ZBD_SEQ_WRITE for ZBC disks
  2019-09-05  9:51 [PATCH v5 0/7] Elevator cleanups and improvements Damien Le Moal
                   ` (5 preceding siblings ...)
  2019-09-05  9:51 ` [PATCH v5 6/7] block: Set ELEVATOR_F_ZBD_SEQ_WRITE for nullblk zoned disks Damien Le Moal
@ 2019-09-05  9:51 ` Damien Le Moal
  2019-09-06  0:31 ` [PATCH v5 0/7] Elevator cleanups and improvements Ming Lei
  2019-09-06  1:53 ` Jens Axboe
  8 siblings, 0 replies; 14+ messages in thread
From: Damien Le Moal @ 2019-09-05  9:51 UTC (permalink / raw)
  To: linux-block, Jens Axboe, linux-scsi, Martin K . Petersen,
	dm-devel, Mike Snitzer
  Cc: Ming Lei

Using the helper blk_queue_required_elevator_features(), set the
elevator feature ELEVATOR_F_ZBD_SEQ_WRITE as required for the request
queue of SCSI ZBC disks.

This feature requirement can always be satisfied as the mq-deadline
elevator is always selected for in-kernel compilation when
CONFIG_BLK_DEV_ZONED (zoned block device support) is enabled.

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
---
 drivers/scsi/sd_zbc.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/scsi/sd_zbc.c b/drivers/scsi/sd_zbc.c
index 367614f0e34f..de4019dc0f0b 100644
--- a/drivers/scsi/sd_zbc.c
+++ b/drivers/scsi/sd_zbc.c
@@ -493,6 +493,8 @@ int sd_zbc_read_zones(struct scsi_disk *sdkp, unsigned char *buf)
 	blk_queue_chunk_sectors(sdkp->disk->queue,
 			logical_to_sectors(sdkp->device, zone_blocks));
 	blk_queue_flag_set(QUEUE_FLAG_ZONE_RESETALL, sdkp->disk->queue);
+	blk_queue_required_elevator_features(sdkp->disk->queue,
+					     ELEVATOR_F_ZBD_SEQ_WRITE);
 	nr_zones = round_up(sdkp->capacity, zone_blocks) >> ilog2(zone_blocks);
 
 	/* READ16/WRITE16 is mandatory for ZBC disks */
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH v5 0/7] Elevator cleanups and improvements
  2019-09-05  9:51 [PATCH v5 0/7] Elevator cleanups and improvements Damien Le Moal
                   ` (6 preceding siblings ...)
  2019-09-05  9:51 ` [PATCH v5 7/7] sd: Set ELEVATOR_F_ZBD_SEQ_WRITE for ZBC disks Damien Le Moal
@ 2019-09-06  0:31 ` Ming Lei
  2019-09-06  1:53 ` Jens Axboe
  8 siblings, 0 replies; 14+ messages in thread
From: Ming Lei @ 2019-09-06  0:31 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: linux-block, Jens Axboe, linux-scsi, Martin K . Petersen,
	dm-devel, Mike Snitzer

On Thu, Sep 05, 2019 at 06:51:28PM +0900, Damien Le Moal wrote:
> This patch series implements some cleanup of the elevator initialization
> code and introduces elevator features identification and device matching
> to enhance checks for elevator/device compatibility and fitness.
> 
> The first 2 patches of the series are simple cleanups which simplify 
> elevator initialization for newly allocated device queues.
> 
> Patch 3 introduce elevator features, allowing a clean and extensible
> definition of devices and features that an elevator supports and match
> these against features required by a block device. With this, the sysfs
> elevator list for a device always shows only elevators matching the
> features that a particular device requires, with the exception of the
> none elevator which has no features but is always available for use
> with any device.
> 
> The first feature defined is for zoned block device sequential write
> constraint support through zone write locking which prevents the use of
> any elevator that does not support this feature with zoned devices.
> 
> The last 4 patches of this series rework the default elevator selection
> and initialization to allow for the elevator/device features matching
> to work, doing so addressing cases not currently well supported, namely,
> multi-queue zoned block devices.
> 
> Changes from v4:
> * Fix patch 5 again to correctly handle request based DM devices and
>   avoid that default queue elevator of these devices end up always
>   being "none".
> 
> Changes from v3:
> * Fixed patch 5 to correctly handle DM devices which do not register a
>   request queue and so do not need elevator initialization.
> 
> Changes from v2:
> * Fixed patch 4
> * Call elevator_init_mq() earlier in device_add_disk() as suggested by
>   Christoph (patch 5)
> * Fixed title of patch 7
> 
> Changes from v1:
> * Addressed Johannes comments
> * Rebased on newest for-next branch to include Ming's sysfs lock changes
> 
> Damien Le Moal (7):
>   block: Cleanup elevator_init_mq() use
>   block: Change elevator_init_mq() to always succeed
>   block: Introduce elevator features
>   block: Improve default elevator selection
>   block: Delay default elevator initialization
>   block: Set ELEVATOR_F_ZBD_SEQ_WRITE for nullblk zoned disks
>   sd: Set ELEVATOR_F_ZBD_SEQ_WRITE for ZBC disks
> 
>  block/blk-mq.c                |  20 +++--
>  block/blk-settings.c          |  16 ++++
>  block/blk.h                   |   2 +-
>  block/elevator.c              | 137 ++++++++++++++++++++++++++--------
>  block/genhd.c                 |   9 +++
>  block/mq-deadline.c           |   1 +
>  drivers/block/null_blk_main.c |   2 +
>  drivers/md/dm-rq.c            |   2 +-
>  drivers/scsi/sd_zbc.c         |   2 +
>  include/linux/blk-mq.h        |   3 +-
>  include/linux/blkdev.h        |   4 +
>  include/linux/elevator.h      |   8 ++
>  12 files changed, 161 insertions(+), 45 deletions(-)

Looks fine for the series:

Reviewed-by: Ming Lei <ming.lei@redhat.com>


Thanks,
Ming

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v5 0/7] Elevator cleanups and improvements
  2019-09-05  9:51 [PATCH v5 0/7] Elevator cleanups and improvements Damien Le Moal
                   ` (7 preceding siblings ...)
  2019-09-06  0:31 ` [PATCH v5 0/7] Elevator cleanups and improvements Ming Lei
@ 2019-09-06  1:53 ` Jens Axboe
  8 siblings, 0 replies; 14+ messages in thread
From: Jens Axboe @ 2019-09-06  1:53 UTC (permalink / raw)
  To: Damien Le Moal, linux-block, linux-scsi, Martin K . Petersen,
	dm-devel, Mike Snitzer
  Cc: Ming Lei

On 9/5/19 3:51 AM, Damien Le Moal wrote:
> This patch series implements some cleanup of the elevator initialization
> code and introduces elevator features identification and device matching
> to enhance checks for elevator/device compatibility and fitness.
> 
> The first 2 patches of the series are simple cleanups which simplify
> elevator initialization for newly allocated device queues.
> 
> Patch 3 introduce elevator features, allowing a clean and extensible
> definition of devices and features that an elevator supports and match
> these against features required by a block device. With this, the sysfs
> elevator list for a device always shows only elevators matching the
> features that a particular device requires, with the exception of the
> none elevator which has no features but is always available for use
> with any device.
> 
> The first feature defined is for zoned block device sequential write
> constraint support through zone write locking which prevents the use of
> any elevator that does not support this feature with zoned devices.
> 
> The last 4 patches of this series rework the default elevator selection
> and initialization to allow for the elevator/device features matching
> to work, doing so addressing cases not currently well supported, namely,
> multi-queue zoned block devices.

Applied for 5.4, thanks.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v5 5/7] block: Delay default elevator initialization
  2019-09-05  9:51 ` [PATCH v5 5/7] block: Delay default elevator initialization Damien Le Moal
@ 2019-10-01 20:46   ` Eric Farman
  2019-10-16 18:51     ` Eric Farman
  0 siblings, 1 reply; 14+ messages in thread
From: Eric Farman @ 2019-10-01 20:46 UTC (permalink / raw)
  To: Damien Le Moal, linux-block, Jens Axboe, linux-scsi,
	Martin K . Petersen, dm-devel, Mike Snitzer
  Cc: Ming Lei, Christian Borntraeger



On 9/5/19 5:51 AM, Damien Le Moal wrote:
> When elevator_init_mq() is called from blk_mq_init_allocated_queue(),
> the only information known about the device is the number of hardware
> queues as the block device scan by the device driver is not completed
> yet for most drivers. The device type and elevator required features
> are not set yet, preventing to correctly select the default elevator
> most suitable for the device.
> 
> This currently affects all multi-queue zoned block devices which default
> to the "none" elevator instead of the required "mq-deadline" elevator.
> These drives currently include host-managed SMR disks connected to a
> smartpqi HBA and null_blk block devices with zoned mode enabled.
> Upcoming NVMe Zoned Namespace devices will also be affected.
> 
> Fix this by adding the boolean elevator_init argument to
> blk_mq_init_allocated_queue() to control the execution of
> elevator_init_mq(). Two cases exist:
> 1) elevator_init = false is used for calls to
>    blk_mq_init_allocated_queue() within blk_mq_init_queue(). In this
>    case, a call to elevator_init_mq() is added to __device_add_disk(),
>    resulting in the delayed initialization of the queue elevator
>    after the device driver finished probing the device information. This
>    effectively allows elevator_init_mq() access to more information
>    about the device.
> 2) elevator_init = true preserves the current behavior of initializing
>    the elevator directly from blk_mq_init_allocated_queue(). This case
>    is used for the special request based DM devices where the device
>    gendisk is created before the queue initialization and device
>    information (e.g. queue limits) is already known when the queue
>    initialization is executed.
> 
> Additionally, to make sure that the elevator initialization is never
> done while requests are in-flight (there should be none when the device
> driver calls device_add_disk()), freeze and quiesce the device request
> queue before calling blk_mq_init_sched() in elevator_init_mq().
> 
> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>

Coincidentally, I had been looking into a problem that is fixed in
5.4-rc1 by this patch.  Thanks for that!

The problem was a delay during boot of a KVM guest with virtio-scsi
devices (or hotplug of such a device to a guest) in recent releases,
especially when virtio-scsi is configured as a module.  The symptoms
look like:

[    0.975315] virtio_blk virtio2: [vda] 1803060 4096-byte logical
blocks (7.39 GB/6.88 GiB)
[    0.977859] scsi host0: Virtio SCSI HBA
[    0.980339] scsi 0:0:0:0: Direct-Access     QEMU     QEMU HARDDISK
2.5+ PQ: 0 ANSI: 5
[    0.981685]  vda:VOL1/  0XA906: vda1
[    0.988253] alg: No test for crc32be (crc32be-vx)
...stall...
[   24.544920] sd 0:0:0:0: Power-on or device reset occurred
[   24.545176] sd 0:0:0:0: Attached scsi generic sg0 type 0
[   24.545292] sd 0:0:0:0: [sda] 385 512-byte logical blocks: (197
kB/193 KiB)
[   24.545368] sd 0:0:0:0: [sda] Write Protect is off
[   24.545416] sd 0:0:0:0: [sda] Mode Sense: 63 00 00 08
[   24.545456] sd 0:0:0:0: [sda] Write cache: enabled, read cache:
enabled, doesn't support DPO or FUA
[   24.547033] sd 0:0:0:0: [sda] Attached SCSI disk

I debugged this down to the same behavior described/fixed back in 3.18
by commit 17497acbdce9 ("blk-mq, percpu_ref: start q->mq_usage_counter
in atomic mode"), and for the same reason.  The delay starts occurring
as soon as q->q_usage_counter is converted to percpu for the one LUN tha
twas found, while scsi_scan_channel() is still working on its loop of
mostly non-existent devices.  Exactly when this problem started
re-occuring is not certain to me, though I did see this problem with 5.2
on linux-stable.

When I run with a 5.3 kernel, the problem is easily reproducible.  So I
bisected between 5.3 and 5.4-rc1, and got here.  Cherry-picking this
patch on top of 5.3 cleans up the boot/hotplug process and removes any
stall.  Any chance this could be cc'd to stable?  Any data someone wants
to see behavioral changes?

Thanks,
Eric

> ---
>  block/blk-mq.c         | 12 +++++++++---
>  block/elevator.c       |  7 +++++++
>  block/genhd.c          |  9 +++++++++
>  drivers/md/dm-rq.c     |  2 +-
>  include/linux/blk-mq.h |  3 ++-
>  5 files changed, 28 insertions(+), 5 deletions(-)
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index ee4caf0c0807..240416057f28 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -2689,7 +2689,11 @@ struct request_queue *blk_mq_init_queue(struct blk_mq_tag_set *set)
>  	if (!uninit_q)
>  		return ERR_PTR(-ENOMEM);
>  
> -	q = blk_mq_init_allocated_queue(set, uninit_q);
> +	/*
> +	 * Initialize the queue without an elevator. device_add_disk() will do
> +	 * the initialization.
> +	 */
> +	q = blk_mq_init_allocated_queue(set, uninit_q, false);
>  	if (IS_ERR(q))
>  		blk_cleanup_queue(uninit_q);
>  
> @@ -2840,7 +2844,8 @@ static unsigned int nr_hw_queues(struct blk_mq_tag_set *set)
>  }
>  
>  struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
> -						  struct request_queue *q)
> +						  struct request_queue *q,
> +						  bool elevator_init)
>  {
>  	/* mark the queue as mq asap */
>  	q->mq_ops = set->ops;
> @@ -2902,7 +2907,8 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
>  	blk_mq_add_queue_tag_set(set, q);
>  	blk_mq_map_swqueue(q);
>  
> -	elevator_init_mq(q);
> +	if (elevator_init)
> +		elevator_init_mq(q);
>  
>  	return q;
>  
> diff --git a/block/elevator.c b/block/elevator.c
> index 520d6b224b74..096a670d22d7 100644
> --- a/block/elevator.c
> +++ b/block/elevator.c
> @@ -712,7 +712,14 @@ void elevator_init_mq(struct request_queue *q)
>  	if (!e)
>  		return;
>  
> +	blk_mq_freeze_queue(q);
> +	blk_mq_quiesce_queue(q);
> +
>  	err = blk_mq_init_sched(q, e);
> +
> +	blk_mq_unquiesce_queue(q);
> +	blk_mq_unfreeze_queue(q);
> +
>  	if (err) {
>  		pr_warn("\"%s\" elevator initialization failed, "
>  			"falling back to \"none\"\n", e->elevator_name);
> diff --git a/block/genhd.c b/block/genhd.c
> index 54f1f0d381f4..26b31fcae217 100644
> --- a/block/genhd.c
> +++ b/block/genhd.c
> @@ -695,6 +695,15 @@ static void __device_add_disk(struct device *parent, struct gendisk *disk,
>  	dev_t devt;
>  	int retval;
>  
> +	/*
> +	 * The disk queue should now be all set with enough information about
> +	 * the device for the elevator code to pick an adequate default
> +	 * elevator if one is needed, that is, for devices requesting queue
> +	 * registration.
> +	 */
> +	if (register_queue)
> +		elevator_init_mq(disk->queue);
> +
>  	/* minors == 0 indicates to use ext devt from part0 and should
>  	 * be accompanied with EXT_DEVT flag.  Make sure all
>  	 * parameters make sense.
> diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
> index 21d5c1784d0c..3f8577e2c13b 100644
> --- a/drivers/md/dm-rq.c
> +++ b/drivers/md/dm-rq.c
> @@ -563,7 +563,7 @@ int dm_mq_init_request_queue(struct mapped_device *md, struct dm_table *t)
>  	if (err)
>  		goto out_kfree_tag_set;
>  
> -	q = blk_mq_init_allocated_queue(md->tag_set, md->queue);
> +	q = blk_mq_init_allocated_queue(md->tag_set, md->queue, true);
>  	if (IS_ERR(q)) {
>  		err = PTR_ERR(q);
>  		goto out_tag_set;
> diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
> index 62a3bb715899..0bf056de5cc3 100644
> --- a/include/linux/blk-mq.h
> +++ b/include/linux/blk-mq.h
> @@ -248,7 +248,8 @@ enum {
>  
>  struct request_queue *blk_mq_init_queue(struct blk_mq_tag_set *);
>  struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
> -						  struct request_queue *q);
> +						  struct request_queue *q,
> +						  bool elevator_init);
>  struct request_queue *blk_mq_init_sq_queue(struct blk_mq_tag_set *set,
>  						const struct blk_mq_ops *ops,
>  						unsigned int queue_depth,
> 

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v5 5/7] block: Delay default elevator initialization
  2019-10-01 20:46   ` Eric Farman
@ 2019-10-16 18:51     ` Eric Farman
  2019-10-17  2:39       ` Damien Le Moal
  0 siblings, 1 reply; 14+ messages in thread
From: Eric Farman @ 2019-10-16 18:51 UTC (permalink / raw)
  To: Damien Le Moal, linux-block, Jens Axboe, linux-scsi,
	Martin K . Petersen, dm-devel, Mike Snitzer
  Cc: Ming Lei, Christian Borntraeger



On 10/1/19 4:46 PM, Eric Farman wrote:
> 
> 
> On 9/5/19 5:51 AM, Damien Le Moal wrote:
>> When elevator_init_mq() is called from blk_mq_init_allocated_queue(),
>> the only information known about the device is the number of hardware
>> queues as the block device scan by the device driver is not completed
>> yet for most drivers. The device type and elevator required features
>> are not set yet, preventing to correctly select the default elevator
>> most suitable for the device.
>>
>> This currently affects all multi-queue zoned block devices which default
>> to the "none" elevator instead of the required "mq-deadline" elevator.
>> These drives currently include host-managed SMR disks connected to a
>> smartpqi HBA and null_blk block devices with zoned mode enabled.
>> Upcoming NVMe Zoned Namespace devices will also be affected.
>>
>> Fix this by adding the boolean elevator_init argument to
>> blk_mq_init_allocated_queue() to control the execution of
>> elevator_init_mq(). Two cases exist:
>> 1) elevator_init = false is used for calls to
>>    blk_mq_init_allocated_queue() within blk_mq_init_queue(). In this
>>    case, a call to elevator_init_mq() is added to __device_add_disk(),
>>    resulting in the delayed initialization of the queue elevator
>>    after the device driver finished probing the device information. This
>>    effectively allows elevator_init_mq() access to more information
>>    about the device.
>> 2) elevator_init = true preserves the current behavior of initializing
>>    the elevator directly from blk_mq_init_allocated_queue(). This case
>>    is used for the special request based DM devices where the device
>>    gendisk is created before the queue initialization and device
>>    information (e.g. queue limits) is already known when the queue
>>    initialization is executed.
>>
>> Additionally, to make sure that the elevator initialization is never
>> done while requests are in-flight (there should be none when the device
>> driver calls device_add_disk()), freeze and quiesce the device request
>> queue before calling blk_mq_init_sched() in elevator_init_mq().
>>
>> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
> 
> Coincidentally, I had been looking into a problem that is fixed in
> 5.4-rc1 by this patch.  Thanks for that!
> 
> The problem was a delay during boot of a KVM guest with virtio-scsi
> devices (or hotplug of such a device to a guest) in recent releases,
> especially when virtio-scsi is configured as a module.  The symptoms
> look like:
> 
> [    0.975315] virtio_blk virtio2: [vda] 1803060 4096-byte logical
> blocks (7.39 GB/6.88 GiB)
> [    0.977859] scsi host0: Virtio SCSI HBA
> [    0.980339] scsi 0:0:0:0: Direct-Access     QEMU     QEMU HARDDISK
> 2.5+ PQ: 0 ANSI: 5
> [    0.981685]  vda:VOL1/  0XA906: vda1
> [    0.988253] alg: No test for crc32be (crc32be-vx)
> ...stall...
> [   24.544920] sd 0:0:0:0: Power-on or device reset occurred
> [   24.545176] sd 0:0:0:0: Attached scsi generic sg0 type 0
> [   24.545292] sd 0:0:0:0: [sda] 385 512-byte logical blocks: (197
> kB/193 KiB)
> [   24.545368] sd 0:0:0:0: [sda] Write Protect is off
> [   24.545416] sd 0:0:0:0: [sda] Mode Sense: 63 00 00 08
> [   24.545456] sd 0:0:0:0: [sda] Write cache: enabled, read cache:
> enabled, doesn't support DPO or FUA
> [   24.547033] sd 0:0:0:0: [sda] Attached SCSI disk
> 
> I debugged this down to the same behavior described/fixed back in 3.18
> by commit 17497acbdce9 ("blk-mq, percpu_ref: start q->mq_usage_counter
> in atomic mode"), and for the same reason.  The delay starts occurring
> as soon as q->q_usage_counter is converted to percpu for the one LUN tha
> twas found, while scsi_scan_channel() is still working on its loop of
> mostly non-existent devices.  Exactly when this problem started
> re-occuring is not certain to me, though I did see this problem with 5.2
> on linux-stable.

This problem started occurring reliably with kernel 4.16 because of
commit b5b6e8c8d3b4 ("scsi: virtio_scsi: fix IO hang caused by automatic
irq vector affinity") which forced blk-mq on for virtio-scsi devices.

I'm able to reproduce the behavior (fixed by this commit) on 4.15.0, as
well as 4.14.0 and 4.9.0, if I enable SCSI_MQ_DEFAULT in the kernel
config.  I cannot reproduce it with 4.4.0, but didn't chase further to
see why that was the case.

That force_blk_mq commit went into linux-stable, so SCSI_MQ_DEFAULT need
not be enabled on 4.14.y or 4.9.y stable branches to see the behavior
fixed by this commit.

> 
> When I run with a 5.3 kernel, the problem is easily reproducible.  So I
> bisected between 5.3 and 5.4-rc1, and got here.  Cherry-picking this
> patch on top of 5.3 cleans up the boot/hotplug process and removes any
> stall.  Any chance this could be cc'd to stable?  

Please?  Anything I can do to help with that effort?

Thanks,
Eric

> Any data someone wants
> to see behavioral changes?
> 
> Thanks,
> Eric
> 
>> ---
>>  block/blk-mq.c         | 12 +++++++++---
>>  block/elevator.c       |  7 +++++++
>>  block/genhd.c          |  9 +++++++++
>>  drivers/md/dm-rq.c     |  2 +-
>>  include/linux/blk-mq.h |  3 ++-
>>  5 files changed, 28 insertions(+), 5 deletions(-)
>>
>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>> index ee4caf0c0807..240416057f28 100644
>> --- a/block/blk-mq.c
>> +++ b/block/blk-mq.c
>> @@ -2689,7 +2689,11 @@ struct request_queue *blk_mq_init_queue(struct blk_mq_tag_set *set)
>>  	if (!uninit_q)
>>  		return ERR_PTR(-ENOMEM);
>>  
>> -	q = blk_mq_init_allocated_queue(set, uninit_q);
>> +	/*
>> +	 * Initialize the queue without an elevator. device_add_disk() will do
>> +	 * the initialization.
>> +	 */
>> +	q = blk_mq_init_allocated_queue(set, uninit_q, false);
>>  	if (IS_ERR(q))
>>  		blk_cleanup_queue(uninit_q);
>>  
>> @@ -2840,7 +2844,8 @@ static unsigned int nr_hw_queues(struct blk_mq_tag_set *set)
>>  }
>>  
>>  struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
>> -						  struct request_queue *q)
>> +						  struct request_queue *q,
>> +						  bool elevator_init)
>>  {
>>  	/* mark the queue as mq asap */
>>  	q->mq_ops = set->ops;
>> @@ -2902,7 +2907,8 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
>>  	blk_mq_add_queue_tag_set(set, q);
>>  	blk_mq_map_swqueue(q);
>>  
>> -	elevator_init_mq(q);
>> +	if (elevator_init)
>> +		elevator_init_mq(q);
>>  
>>  	return q;
>>  
>> diff --git a/block/elevator.c b/block/elevator.c
>> index 520d6b224b74..096a670d22d7 100644
>> --- a/block/elevator.c
>> +++ b/block/elevator.c
>> @@ -712,7 +712,14 @@ void elevator_init_mq(struct request_queue *q)
>>  	if (!e)
>>  		return;
>>  
>> +	blk_mq_freeze_queue(q);
>> +	blk_mq_quiesce_queue(q);
>> +
>>  	err = blk_mq_init_sched(q, e);
>> +
>> +	blk_mq_unquiesce_queue(q);
>> +	blk_mq_unfreeze_queue(q);
>> +
>>  	if (err) {
>>  		pr_warn("\"%s\" elevator initialization failed, "
>>  			"falling back to \"none\"\n", e->elevator_name);
>> diff --git a/block/genhd.c b/block/genhd.c
>> index 54f1f0d381f4..26b31fcae217 100644
>> --- a/block/genhd.c
>> +++ b/block/genhd.c
>> @@ -695,6 +695,15 @@ static void __device_add_disk(struct device *parent, struct gendisk *disk,
>>  	dev_t devt;
>>  	int retval;
>>  
>> +	/*
>> +	 * The disk queue should now be all set with enough information about
>> +	 * the device for the elevator code to pick an adequate default
>> +	 * elevator if one is needed, that is, for devices requesting queue
>> +	 * registration.
>> +	 */
>> +	if (register_queue)
>> +		elevator_init_mq(disk->queue);
>> +
>>  	/* minors == 0 indicates to use ext devt from part0 and should
>>  	 * be accompanied with EXT_DEVT flag.  Make sure all
>>  	 * parameters make sense.
>> diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
>> index 21d5c1784d0c..3f8577e2c13b 100644
>> --- a/drivers/md/dm-rq.c
>> +++ b/drivers/md/dm-rq.c
>> @@ -563,7 +563,7 @@ int dm_mq_init_request_queue(struct mapped_device *md, struct dm_table *t)
>>  	if (err)
>>  		goto out_kfree_tag_set;
>>  
>> -	q = blk_mq_init_allocated_queue(md->tag_set, md->queue);
>> +	q = blk_mq_init_allocated_queue(md->tag_set, md->queue, true);
>>  	if (IS_ERR(q)) {
>>  		err = PTR_ERR(q);
>>  		goto out_tag_set;
>> diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
>> index 62a3bb715899..0bf056de5cc3 100644
>> --- a/include/linux/blk-mq.h
>> +++ b/include/linux/blk-mq.h
>> @@ -248,7 +248,8 @@ enum {
>>  
>>  struct request_queue *blk_mq_init_queue(struct blk_mq_tag_set *);
>>  struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
>> -						  struct request_queue *q);
>> +						  struct request_queue *q,
>> +						  bool elevator_init);
>>  struct request_queue *blk_mq_init_sq_queue(struct blk_mq_tag_set *set,
>>  						const struct blk_mq_ops *ops,
>>  						unsigned int queue_depth,
>>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v5 5/7] block: Delay default elevator initialization
  2019-10-16 18:51     ` Eric Farman
@ 2019-10-17  2:39       ` Damien Le Moal
  2019-10-17 13:37         ` Eric Farman
  0 siblings, 1 reply; 14+ messages in thread
From: Damien Le Moal @ 2019-10-17  2:39 UTC (permalink / raw)
  To: Eric Farman, linux-block, Jens Axboe, linux-scsi,
	Martin K . Petersen, dm-devel, Mike Snitzer
  Cc: Ming Lei, Christian Borntraeger

On 2019/10/17 3:51, Eric Farman wrote:
> 
> 
> On 10/1/19 4:46 PM, Eric Farman wrote:
>>
>>
>> On 9/5/19 5:51 AM, Damien Le Moal wrote:
>>> When elevator_init_mq() is called from blk_mq_init_allocated_queue(),
>>> the only information known about the device is the number of hardware
>>> queues as the block device scan by the device driver is not completed
>>> yet for most drivers. The device type and elevator required features
>>> are not set yet, preventing to correctly select the default elevator
>>> most suitable for the device.
>>>
>>> This currently affects all multi-queue zoned block devices which default
>>> to the "none" elevator instead of the required "mq-deadline" elevator.
>>> These drives currently include host-managed SMR disks connected to a
>>> smartpqi HBA and null_blk block devices with zoned mode enabled.
>>> Upcoming NVMe Zoned Namespace devices will also be affected.
>>>
>>> Fix this by adding the boolean elevator_init argument to
>>> blk_mq_init_allocated_queue() to control the execution of
>>> elevator_init_mq(). Two cases exist:
>>> 1) elevator_init = false is used for calls to
>>>    blk_mq_init_allocated_queue() within blk_mq_init_queue(). In this
>>>    case, a call to elevator_init_mq() is added to __device_add_disk(),
>>>    resulting in the delayed initialization of the queue elevator
>>>    after the device driver finished probing the device information. This
>>>    effectively allows elevator_init_mq() access to more information
>>>    about the device.
>>> 2) elevator_init = true preserves the current behavior of initializing
>>>    the elevator directly from blk_mq_init_allocated_queue(). This case
>>>    is used for the special request based DM devices where the device
>>>    gendisk is created before the queue initialization and device
>>>    information (e.g. queue limits) is already known when the queue
>>>    initialization is executed.
>>>
>>> Additionally, to make sure that the elevator initialization is never
>>> done while requests are in-flight (there should be none when the device
>>> driver calls device_add_disk()), freeze and quiesce the device request
>>> queue before calling blk_mq_init_sched() in elevator_init_mq().
>>>
>>> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
>>
>> Coincidentally, I had been looking into a problem that is fixed in
>> 5.4-rc1 by this patch.  Thanks for that!
>>
>> The problem was a delay during boot of a KVM guest with virtio-scsi
>> devices (or hotplug of such a device to a guest) in recent releases,
>> especially when virtio-scsi is configured as a module.  The symptoms
>> look like:
>>
>> [    0.975315] virtio_blk virtio2: [vda] 1803060 4096-byte logical
>> blocks (7.39 GB/6.88 GiB)
>> [    0.977859] scsi host0: Virtio SCSI HBA
>> [    0.980339] scsi 0:0:0:0: Direct-Access     QEMU     QEMU HARDDISK
>> 2.5+ PQ: 0 ANSI: 5
>> [    0.981685]  vda:VOL1/  0XA906: vda1
>> [    0.988253] alg: No test for crc32be (crc32be-vx)
>> ...stall...
>> [   24.544920] sd 0:0:0:0: Power-on or device reset occurred
>> [   24.545176] sd 0:0:0:0: Attached scsi generic sg0 type 0
>> [   24.545292] sd 0:0:0:0: [sda] 385 512-byte logical blocks: (197
>> kB/193 KiB)
>> [   24.545368] sd 0:0:0:0: [sda] Write Protect is off
>> [   24.545416] sd 0:0:0:0: [sda] Mode Sense: 63 00 00 08
>> [   24.545456] sd 0:0:0:0: [sda] Write cache: enabled, read cache:
>> enabled, doesn't support DPO or FUA
>> [   24.547033] sd 0:0:0:0: [sda] Attached SCSI disk
>>
>> I debugged this down to the same behavior described/fixed back in 3.18
>> by commit 17497acbdce9 ("blk-mq, percpu_ref: start q->mq_usage_counter
>> in atomic mode"), and for the same reason.  The delay starts occurring
>> as soon as q->q_usage_counter is converted to percpu for the one LUN tha
>> twas found, while scsi_scan_channel() is still working on its loop of
>> mostly non-existent devices.  Exactly when this problem started
>> re-occuring is not certain to me, though I did see this problem with 5.2
>> on linux-stable.
> 
> This problem started occurring reliably with kernel 4.16 because of
> commit b5b6e8c8d3b4 ("scsi: virtio_scsi: fix IO hang caused by automatic
> irq vector affinity") which forced blk-mq on for virtio-scsi devices.
> 
> I'm able to reproduce the behavior (fixed by this commit) on 4.15.0, as
> well as 4.14.0 and 4.9.0, if I enable SCSI_MQ_DEFAULT in the kernel
> config.  I cannot reproduce it with 4.4.0, but didn't chase further to
> see why that was the case.

This is really strange since in 4.9 there is no elevator initialization being
done for a mq device. So with SCSI_MQ_DEFAULT set, you should end up with no
elevator for your disk and the presence or not of this patch or variations of it
should not change anything.

This patch may be hiding your problem root cause rather than fixing it...

> 
> That force_blk_mq commit went into linux-stable, so SCSI_MQ_DEFAULT need
> not be enabled on 4.14.y or 4.9.y stable branches to see the behavior
> fixed by this commit.
> 
>>
>> When I run with a 5.3 kernel, the problem is easily reproducible.  So I
>> bisected between 5.3 and 5.4-rc1, and got here.  Cherry-picking this
>> patch on top of 5.3 cleans up the boot/hotplug process and removes any
>> stall.  Any chance this could be cc'd to stable?  
> 
> Please?  Anything I can do to help with that effort?
> 
> Thanks,
> Eric
> 
>> Any data someone wants
>> to see behavioral changes?
>>
>> Thanks,
>> Eric
>>
>>> ---
>>>  block/blk-mq.c         | 12 +++++++++---
>>>  block/elevator.c       |  7 +++++++
>>>  block/genhd.c          |  9 +++++++++
>>>  drivers/md/dm-rq.c     |  2 +-
>>>  include/linux/blk-mq.h |  3 ++-
>>>  5 files changed, 28 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>>> index ee4caf0c0807..240416057f28 100644
>>> --- a/block/blk-mq.c
>>> +++ b/block/blk-mq.c
>>> @@ -2689,7 +2689,11 @@ struct request_queue *blk_mq_init_queue(struct blk_mq_tag_set *set)
>>>  	if (!uninit_q)
>>>  		return ERR_PTR(-ENOMEM);
>>>  
>>> -	q = blk_mq_init_allocated_queue(set, uninit_q);
>>> +	/*
>>> +	 * Initialize the queue without an elevator. device_add_disk() will do
>>> +	 * the initialization.
>>> +	 */
>>> +	q = blk_mq_init_allocated_queue(set, uninit_q, false);
>>>  	if (IS_ERR(q))
>>>  		blk_cleanup_queue(uninit_q);
>>>  
>>> @@ -2840,7 +2844,8 @@ static unsigned int nr_hw_queues(struct blk_mq_tag_set *set)
>>>  }
>>>  
>>>  struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
>>> -						  struct request_queue *q)
>>> +						  struct request_queue *q,
>>> +						  bool elevator_init)
>>>  {
>>>  	/* mark the queue as mq asap */
>>>  	q->mq_ops = set->ops;
>>> @@ -2902,7 +2907,8 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
>>>  	blk_mq_add_queue_tag_set(set, q);
>>>  	blk_mq_map_swqueue(q);
>>>  
>>> -	elevator_init_mq(q);
>>> +	if (elevator_init)
>>> +		elevator_init_mq(q);
>>>  
>>>  	return q;
>>>  
>>> diff --git a/block/elevator.c b/block/elevator.c
>>> index 520d6b224b74..096a670d22d7 100644
>>> --- a/block/elevator.c
>>> +++ b/block/elevator.c
>>> @@ -712,7 +712,14 @@ void elevator_init_mq(struct request_queue *q)
>>>  	if (!e)
>>>  		return;
>>>  
>>> +	blk_mq_freeze_queue(q);
>>> +	blk_mq_quiesce_queue(q);
>>> +
>>>  	err = blk_mq_init_sched(q, e);
>>> +
>>> +	blk_mq_unquiesce_queue(q);
>>> +	blk_mq_unfreeze_queue(q);
>>> +
>>>  	if (err) {
>>>  		pr_warn("\"%s\" elevator initialization failed, "
>>>  			"falling back to \"none\"\n", e->elevator_name);
>>> diff --git a/block/genhd.c b/block/genhd.c
>>> index 54f1f0d381f4..26b31fcae217 100644
>>> --- a/block/genhd.c
>>> +++ b/block/genhd.c
>>> @@ -695,6 +695,15 @@ static void __device_add_disk(struct device *parent, struct gendisk *disk,
>>>  	dev_t devt;
>>>  	int retval;
>>>  
>>> +	/*
>>> +	 * The disk queue should now be all set with enough information about
>>> +	 * the device for the elevator code to pick an adequate default
>>> +	 * elevator if one is needed, that is, for devices requesting queue
>>> +	 * registration.
>>> +	 */
>>> +	if (register_queue)
>>> +		elevator_init_mq(disk->queue);
>>> +
>>>  	/* minors == 0 indicates to use ext devt from part0 and should
>>>  	 * be accompanied with EXT_DEVT flag.  Make sure all
>>>  	 * parameters make sense.
>>> diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
>>> index 21d5c1784d0c..3f8577e2c13b 100644
>>> --- a/drivers/md/dm-rq.c
>>> +++ b/drivers/md/dm-rq.c
>>> @@ -563,7 +563,7 @@ int dm_mq_init_request_queue(struct mapped_device *md, struct dm_table *t)
>>>  	if (err)
>>>  		goto out_kfree_tag_set;
>>>  
>>> -	q = blk_mq_init_allocated_queue(md->tag_set, md->queue);
>>> +	q = blk_mq_init_allocated_queue(md->tag_set, md->queue, true);
>>>  	if (IS_ERR(q)) {
>>>  		err = PTR_ERR(q);
>>>  		goto out_tag_set;
>>> diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
>>> index 62a3bb715899..0bf056de5cc3 100644
>>> --- a/include/linux/blk-mq.h
>>> +++ b/include/linux/blk-mq.h
>>> @@ -248,7 +248,8 @@ enum {
>>>  
>>>  struct request_queue *blk_mq_init_queue(struct blk_mq_tag_set *);
>>>  struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
>>> -						  struct request_queue *q);
>>> +						  struct request_queue *q,
>>> +						  bool elevator_init);
>>>  struct request_queue *blk_mq_init_sq_queue(struct blk_mq_tag_set *set,
>>>  						const struct blk_mq_ops *ops,
>>>  						unsigned int queue_depth,
>>>
> 


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v5 5/7] block: Delay default elevator initialization
  2019-10-17  2:39       ` Damien Le Moal
@ 2019-10-17 13:37         ` Eric Farman
  0 siblings, 0 replies; 14+ messages in thread
From: Eric Farman @ 2019-10-17 13:37 UTC (permalink / raw)
  To: Damien Le Moal, linux-block, Jens Axboe, linux-scsi,
	Martin K . Petersen, dm-devel, Mike Snitzer
  Cc: Ming Lei, Christian Borntraeger



On 10/16/19 10:39 PM, Damien Le Moal wrote:
> On 2019/10/17 3:51, Eric Farman wrote:
>>
>>
>> On 10/1/19 4:46 PM, Eric Farman wrote:
>>>
>>>
>>> On 9/5/19 5:51 AM, Damien Le Moal wrote:
>>>> When elevator_init_mq() is called from blk_mq_init_allocated_queue(),
>>>> the only information known about the device is the number of hardware
>>>> queues as the block device scan by the device driver is not completed
>>>> yet for most drivers. The device type and elevator required features
>>>> are not set yet, preventing to correctly select the default elevator
>>>> most suitable for the device.
>>>>
>>>> This currently affects all multi-queue zoned block devices which default
>>>> to the "none" elevator instead of the required "mq-deadline" elevator.
>>>> These drives currently include host-managed SMR disks connected to a
>>>> smartpqi HBA and null_blk block devices with zoned mode enabled.
>>>> Upcoming NVMe Zoned Namespace devices will also be affected.
>>>>
>>>> Fix this by adding the boolean elevator_init argument to
>>>> blk_mq_init_allocated_queue() to control the execution of
>>>> elevator_init_mq(). Two cases exist:
>>>> 1) elevator_init = false is used for calls to
>>>>    blk_mq_init_allocated_queue() within blk_mq_init_queue(). In this
>>>>    case, a call to elevator_init_mq() is added to __device_add_disk(),
>>>>    resulting in the delayed initialization of the queue elevator
>>>>    after the device driver finished probing the device information. This
>>>>    effectively allows elevator_init_mq() access to more information
>>>>    about the device.
>>>> 2) elevator_init = true preserves the current behavior of initializing
>>>>    the elevator directly from blk_mq_init_allocated_queue(). This case
>>>>    is used for the special request based DM devices where the device
>>>>    gendisk is created before the queue initialization and device
>>>>    information (e.g. queue limits) is already known when the queue
>>>>    initialization is executed.
>>>>
>>>> Additionally, to make sure that the elevator initialization is never
>>>> done while requests are in-flight (there should be none when the device
>>>> driver calls device_add_disk()), freeze and quiesce the device request
>>>> queue before calling blk_mq_init_sched() in elevator_init_mq().
>>>>
>>>> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
>>>
>>> Coincidentally, I had been looking into a problem that is fixed in
>>> 5.4-rc1 by this patch.  Thanks for that!
>>>
>>> The problem was a delay during boot of a KVM guest with virtio-scsi
>>> devices (or hotplug of such a device to a guest) in recent releases,
>>> especially when virtio-scsi is configured as a module.  The symptoms
>>> look like:
>>>
>>> [    0.975315] virtio_blk virtio2: [vda] 1803060 4096-byte logical
>>> blocks (7.39 GB/6.88 GiB)
>>> [    0.977859] scsi host0: Virtio SCSI HBA
>>> [    0.980339] scsi 0:0:0:0: Direct-Access     QEMU     QEMU HARDDISK
>>> 2.5+ PQ: 0 ANSI: 5
>>> [    0.981685]  vda:VOL1/  0XA906: vda1
>>> [    0.988253] alg: No test for crc32be (crc32be-vx)
>>> ...stall...
>>> [   24.544920] sd 0:0:0:0: Power-on or device reset occurred
>>> [   24.545176] sd 0:0:0:0: Attached scsi generic sg0 type 0
>>> [   24.545292] sd 0:0:0:0: [sda] 385 512-byte logical blocks: (197
>>> kB/193 KiB)
>>> [   24.545368] sd 0:0:0:0: [sda] Write Protect is off
>>> [   24.545416] sd 0:0:0:0: [sda] Mode Sense: 63 00 00 08
>>> [   24.545456] sd 0:0:0:0: [sda] Write cache: enabled, read cache:
>>> enabled, doesn't support DPO or FUA
>>> [   24.547033] sd 0:0:0:0: [sda] Attached SCSI disk
>>>
>>> I debugged this down to the same behavior described/fixed back in 3.18
>>> by commit 17497acbdce9 ("blk-mq, percpu_ref: start q->mq_usage_counter
>>> in atomic mode"), and for the same reason.  The delay starts occurring
>>> as soon as q->q_usage_counter is converted to percpu for the one LUN tha
>>> twas found, while scsi_scan_channel() is still working on its loop of
>>> mostly non-existent devices.  Exactly when this problem started
>>> re-occuring is not certain to me, though I did see this problem with 5.2
>>> on linux-stable.
>>
>> This problem started occurring reliably with kernel 4.16 because of
>> commit b5b6e8c8d3b4 ("scsi: virtio_scsi: fix IO hang caused by automatic
>> irq vector affinity") which forced blk-mq on for virtio-scsi devices.
>>
>> I'm able to reproduce the behavior (fixed by this commit) on 4.15.0, as
>> well as 4.14.0 and 4.9.0, if I enable SCSI_MQ_DEFAULT in the kernel
>> config.  I cannot reproduce it with 4.4.0, but didn't chase further to
>> see why that was the case.
> 
> This is really strange since in 4.9 there is no elevator initialization being
> done for a mq device. So with SCSI_MQ_DEFAULT set, you should end up with no
> elevator for your disk and the presence or not of this patch or variations of it
> should not change anything.
> 
> This patch may be hiding your problem root cause rather than fixing it...

Neat.  Back to debugging 5.3 then.

Thanks,
Eric

> 
>>
>> That force_blk_mq commit went into linux-stable, so SCSI_MQ_DEFAULT need
>> not be enabled on 4.14.y or 4.9.y stable branches to see the behavior
>> fixed by this commit.
>>
>>>
>>> When I run with a 5.3 kernel, the problem is easily reproducible.  So I
>>> bisected between 5.3 and 5.4-rc1, and got here.  Cherry-picking this
>>> patch on top of 5.3 cleans up the boot/hotplug process and removes any
>>> stall.  Any chance this could be cc'd to stable?  
>>
>> Please?  Anything I can do to help with that effort?
>>
>> Thanks,
>> Eric
>>
>>> Any data someone wants
>>> to see behavioral changes?
>>>
>>> Thanks,
>>> Eric
>>>
>>>> ---
>>>>  block/blk-mq.c         | 12 +++++++++---
>>>>  block/elevator.c       |  7 +++++++
>>>>  block/genhd.c          |  9 +++++++++
>>>>  drivers/md/dm-rq.c     |  2 +-
>>>>  include/linux/blk-mq.h |  3 ++-
>>>>  5 files changed, 28 insertions(+), 5 deletions(-)
>>>>
>>>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>>>> index ee4caf0c0807..240416057f28 100644
>>>> --- a/block/blk-mq.c
>>>> +++ b/block/blk-mq.c
>>>> @@ -2689,7 +2689,11 @@ struct request_queue *blk_mq_init_queue(struct blk_mq_tag_set *set)
>>>>  	if (!uninit_q)
>>>>  		return ERR_PTR(-ENOMEM);
>>>>  
>>>> -	q = blk_mq_init_allocated_queue(set, uninit_q);
>>>> +	/*
>>>> +	 * Initialize the queue without an elevator. device_add_disk() will do
>>>> +	 * the initialization.
>>>> +	 */
>>>> +	q = blk_mq_init_allocated_queue(set, uninit_q, false);
>>>>  	if (IS_ERR(q))
>>>>  		blk_cleanup_queue(uninit_q);
>>>>  
>>>> @@ -2840,7 +2844,8 @@ static unsigned int nr_hw_queues(struct blk_mq_tag_set *set)
>>>>  }
>>>>  
>>>>  struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
>>>> -						  struct request_queue *q)
>>>> +						  struct request_queue *q,
>>>> +						  bool elevator_init)
>>>>  {
>>>>  	/* mark the queue as mq asap */
>>>>  	q->mq_ops = set->ops;
>>>> @@ -2902,7 +2907,8 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
>>>>  	blk_mq_add_queue_tag_set(set, q);
>>>>  	blk_mq_map_swqueue(q);
>>>>  
>>>> -	elevator_init_mq(q);
>>>> +	if (elevator_init)
>>>> +		elevator_init_mq(q);
>>>>  
>>>>  	return q;
>>>>  
>>>> diff --git a/block/elevator.c b/block/elevator.c
>>>> index 520d6b224b74..096a670d22d7 100644
>>>> --- a/block/elevator.c
>>>> +++ b/block/elevator.c
>>>> @@ -712,7 +712,14 @@ void elevator_init_mq(struct request_queue *q)
>>>>  	if (!e)
>>>>  		return;
>>>>  
>>>> +	blk_mq_freeze_queue(q);
>>>> +	blk_mq_quiesce_queue(q);
>>>> +
>>>>  	err = blk_mq_init_sched(q, e);
>>>> +
>>>> +	blk_mq_unquiesce_queue(q);
>>>> +	blk_mq_unfreeze_queue(q);
>>>> +
>>>>  	if (err) {
>>>>  		pr_warn("\"%s\" elevator initialization failed, "
>>>>  			"falling back to \"none\"\n", e->elevator_name);
>>>> diff --git a/block/genhd.c b/block/genhd.c
>>>> index 54f1f0d381f4..26b31fcae217 100644
>>>> --- a/block/genhd.c
>>>> +++ b/block/genhd.c
>>>> @@ -695,6 +695,15 @@ static void __device_add_disk(struct device *parent, struct gendisk *disk,
>>>>  	dev_t devt;
>>>>  	int retval;
>>>>  
>>>> +	/*
>>>> +	 * The disk queue should now be all set with enough information about
>>>> +	 * the device for the elevator code to pick an adequate default
>>>> +	 * elevator if one is needed, that is, for devices requesting queue
>>>> +	 * registration.
>>>> +	 */
>>>> +	if (register_queue)
>>>> +		elevator_init_mq(disk->queue);
>>>> +
>>>>  	/* minors == 0 indicates to use ext devt from part0 and should
>>>>  	 * be accompanied with EXT_DEVT flag.  Make sure all
>>>>  	 * parameters make sense.
>>>> diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
>>>> index 21d5c1784d0c..3f8577e2c13b 100644
>>>> --- a/drivers/md/dm-rq.c
>>>> +++ b/drivers/md/dm-rq.c
>>>> @@ -563,7 +563,7 @@ int dm_mq_init_request_queue(struct mapped_device *md, struct dm_table *t)
>>>>  	if (err)
>>>>  		goto out_kfree_tag_set;
>>>>  
>>>> -	q = blk_mq_init_allocated_queue(md->tag_set, md->queue);
>>>> +	q = blk_mq_init_allocated_queue(md->tag_set, md->queue, true);
>>>>  	if (IS_ERR(q)) {
>>>>  		err = PTR_ERR(q);
>>>>  		goto out_tag_set;
>>>> diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
>>>> index 62a3bb715899..0bf056de5cc3 100644
>>>> --- a/include/linux/blk-mq.h
>>>> +++ b/include/linux/blk-mq.h
>>>> @@ -248,7 +248,8 @@ enum {
>>>>  
>>>>  struct request_queue *blk_mq_init_queue(struct blk_mq_tag_set *);
>>>>  struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
>>>> -						  struct request_queue *q);
>>>> +						  struct request_queue *q,
>>>> +						  bool elevator_init);
>>>>  struct request_queue *blk_mq_init_sq_queue(struct blk_mq_tag_set *set,
>>>>  						const struct blk_mq_ops *ops,
>>>>  						unsigned int queue_depth,
>>>>
>>
> 
> 

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2019-10-17 13:37 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-09-05  9:51 [PATCH v5 0/7] Elevator cleanups and improvements Damien Le Moal
2019-09-05  9:51 ` [PATCH v5 1/7] block: Cleanup elevator_init_mq() use Damien Le Moal
2019-09-05  9:51 ` [PATCH v5 2/7] block: Change elevator_init_mq() to always succeed Damien Le Moal
2019-09-05  9:51 ` [PATCH v5 3/7] block: Introduce elevator features Damien Le Moal
2019-09-05  9:51 ` [PATCH v5 4/7] block: Improve default elevator selection Damien Le Moal
2019-09-05  9:51 ` [PATCH v5 5/7] block: Delay default elevator initialization Damien Le Moal
2019-10-01 20:46   ` Eric Farman
2019-10-16 18:51     ` Eric Farman
2019-10-17  2:39       ` Damien Le Moal
2019-10-17 13:37         ` Eric Farman
2019-09-05  9:51 ` [PATCH v5 6/7] block: Set ELEVATOR_F_ZBD_SEQ_WRITE for nullblk zoned disks Damien Le Moal
2019-09-05  9:51 ` [PATCH v5 7/7] sd: Set ELEVATOR_F_ZBD_SEQ_WRITE for ZBC disks Damien Le Moal
2019-09-06  0:31 ` [PATCH v5 0/7] Elevator cleanups and improvements Ming Lei
2019-09-06  1:53 ` Jens Axboe

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).