All of lore.kernel.org
* [PATCH RFC v2 0/8] blk-mq: improve tag fair sharing
@ 2023-10-21 15:47 Yu Kuai
  2023-10-21 15:47 ` [PATCH RFC v2 1/8] blk-mq: factor out a structure from blk_mq_tags Yu Kuai
                   ` (8 more replies)
  0 siblings, 9 replies; 14+ messages in thread
From: Yu Kuai @ 2023-10-21 15:47 UTC (permalink / raw)
  To: bvanassche, hch, kbusch, ming.lei, axboe
  Cc: linux-block, linux-kernel, yukuai3, yukuai1, yi.zhang, yangerkun

From: Yu Kuai <yukuai3@huawei.com>

Current implementation:
 - a counter 'active_queues' records how many queues/hctxs are sharing
 the tags; it is updated while issuing new IO and cleared in
 blk_mq_timeout_work().
 - if 'active_queues' is more than 1, the tags are shared fairly among
 the nodes;

New implementation:
 - a new field 'available_tags' is added to each node; it is calculated
 in the slow path, hence the fast path won't be affected; patch 5;
 - a new counter 'busy_queues' is added to blk_mq_tags; it is updated on
 failure to get a driver tag and is also cleared in
 blk_mq_timeout_work(), and tag sharing is now based on 'busy_queues'
 instead of 'active_queues'; patches 6, 7;
 - a new counter 'busy_count' is added to each node to record how many
 times the node failed to get a driver tag; it is used to judge whether
 a node is busy and needs more tags; patch 8;
 - a new timer is added to blk_mq_tags; it starts when any node fails to
 get a driver tag, and the timer function borrows tags for busy nodes
 and returns borrowed tags; patch 8;

A simple fio test, 32 tags shared by two nodes:
[global]
ioengine=libaio
iodepth=2
bs=4k
direct=1
rw=randrw
group_reporting

[sda]
numjobs=32
filename=/dev/sda

[sdb]
numjobs=1
filename=/dev/sdb

Test result (monitored via the new debugfs entry):

time    active          available
        sda     sdb     sda     sdb
0       0       0       32      32
1       16      2       16      16      -> start fair sharing
2       19      2       20      16
3       24      2       24      16
4       26      2       28      16      -> borrow 32/8=4 tags each round
5       28      2       28      16      -> save at least 4 tags for sdb
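The numbers in the table follow from two simple formulas in the series: the fair-share depth max(ceil(nr_tags / users), 4) (the shared_tags() macro from patch 5) and a borrow step of nr_tags / 8 tags per timer round (patch 8). A standalone sketch in userspace C (the kernel macro operates on struct blk_mq_tags; plain integers are assumed here):

```c
#include <assert.h>

/* Fair share per active queue: ceil(nr_tags / users), but always allow
 * at least 4 tags (mirrors the shared_tags() macro from patch 5). */
static unsigned int fair_share(unsigned int nr_tags, unsigned int users)
{
	unsigned int depth = (nr_tags + users - 1) / users;

	return depth > 4 ? depth : 4;
}

/* Tags borrowed per timer round by a busy queue: nr_tags / 8, matching
 * the "borrow 32/8=4 tags each round" step in the table above. */
static unsigned int borrow_step(unsigned int nr_tags)
{
	return nr_tags / 8;
}
```

With 32 tags and two active queues this gives 16 tags each (the "start fair sharing" row), and sda then grows its share by 4 tags per round while sdb keeps at least its fair minimum.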

Yu Kuai (8):
  blk-mq: factor out a structure from blk_mq_tags
  blk-mq: factor out a structure to store information for tag sharing
  blk-mq: add a helper to initialize shared_tag_info
  blk-mq: support to track active queues from blk_mq_tags
  blk-mq: precalculate available tags for hctx_may_queue()
  blk-mq: add new helpers blk_mq_driver_tag_busy/idle()
  blk-mq-tag: delay tag sharing until fail to get driver tag
  blk-mq-tag: allow shared queue/hctx to get more driver tags

 block/blk-core.c       |   2 -
 block/blk-mq-debugfs.c |  30 +++++-
 block/blk-mq-tag.c     | 226 +++++++++++++++++++++++++++++++++++++++--
 block/blk-mq.c         |  12 ++-
 block/blk-mq.h         |  64 +++++++-----
 include/linux/blk-mq.h |  36 +++++--
 include/linux/blkdev.h |  11 +-
 7 files changed, 328 insertions(+), 53 deletions(-)

-- 
2.39.2


^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH RFC v2 1/8] blk-mq: factor out a structure from blk_mq_tags
  2023-10-21 15:47 [PATCH RFC v2 0/8] blk-mq: improve tag fair sharing Yu Kuai
@ 2023-10-21 15:47 ` Yu Kuai
  2023-10-21 15:48 ` [PATCH RFC v2 2/8] blk-mq: factor out a structure to store information for tag sharing Yu Kuai
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 14+ messages in thread
From: Yu Kuai @ 2023-10-21 15:47 UTC (permalink / raw)
  To: bvanassche, hch, kbusch, ming.lei, axboe
  Cc: linux-block, linux-kernel, yukuai3, yukuai1, yi.zhang, yangerkun

From: Yu Kuai <yukuai3@huawei.com>

Currently tags are shared fairly. Factor out a new structure
'tag_sharing_ctl' from blk_mq_tags to hold the information used for tag
sharing; for now it contains only one field, 'active_queues'.

There are no functional changes; this patch prepares for refactoring
tag sharing.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
 block/blk-mq-debugfs.c |  2 +-
 block/blk-mq-tag.c     |  8 ++++----
 block/blk-mq.h         |  2 +-
 include/linux/blk-mq.h | 10 ++++++++--
 4 files changed, 14 insertions(+), 8 deletions(-)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 5cbeb9344f2f..f6b77289bc1f 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -400,7 +400,7 @@ static void blk_mq_debugfs_tags_show(struct seq_file *m,
 	seq_printf(m, "nr_tags=%u\n", tags->nr_tags);
 	seq_printf(m, "nr_reserved_tags=%u\n", tags->nr_reserved_tags);
 	seq_printf(m, "active_queues=%d\n",
-		   READ_ONCE(tags->active_queues));
+		   READ_ONCE(tags->ctl.active_queues));
 
 	seq_puts(m, "\nbitmap_tags:\n");
 	sbitmap_queue_show(&tags->bitmap_tags, m);
diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index cc57e2dd9a0b..fe41a0d34fc0 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -57,8 +57,8 @@ void __blk_mq_tag_busy(struct blk_mq_hw_ctx *hctx)
 	}
 
 	spin_lock_irq(&tags->lock);
-	users = tags->active_queues + 1;
-	WRITE_ONCE(tags->active_queues, users);
+	users = tags->ctl.active_queues + 1;
+	WRITE_ONCE(tags->ctl.active_queues, users);
 	blk_mq_update_wake_batch(tags, users);
 	spin_unlock_irq(&tags->lock);
 }
@@ -94,8 +94,8 @@ void __blk_mq_tag_idle(struct blk_mq_hw_ctx *hctx)
 	}
 
 	spin_lock_irq(&tags->lock);
-	users = tags->active_queues - 1;
-	WRITE_ONCE(tags->active_queues, users);
+	users = tags->ctl.active_queues - 1;
+	WRITE_ONCE(tags->ctl.active_queues, users);
 	blk_mq_update_wake_batch(tags, users);
 	spin_unlock_irq(&tags->lock);
 
diff --git a/block/blk-mq.h b/block/blk-mq.h
index f75a9ecfebde..363c8f6e13dd 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -435,7 +435,7 @@ static inline bool hctx_may_queue(struct blk_mq_hw_ctx *hctx,
 			return true;
 	}
 
-	users = READ_ONCE(hctx->tags->active_queues);
+	users = READ_ONCE(hctx->tags->ctl.active_queues);
 	if (!users)
 		return true;
 
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 1ab3081c82ed..5a08527c53b9 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -727,13 +727,16 @@ struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
 		blk_opf_t opf, blk_mq_req_flags_t flags,
 		unsigned int hctx_idx);
 
+struct tag_sharing_ctl {
+	unsigned int active_queues;
+};
+
 /*
  * Tag address space map.
  */
 struct blk_mq_tags {
 	unsigned int nr_tags;
 	unsigned int nr_reserved_tags;
-	unsigned int active_queues;
 
 	struct sbitmap_queue bitmap_tags;
 	struct sbitmap_queue breserved_tags;
@@ -744,9 +747,12 @@ struct blk_mq_tags {
 
 	/*
 	 * used to clear request reference in rqs[] before freeing one
-	 * request pool
+	 * request pool, and to protect tag_sharing_ctl.
 	 */
 	spinlock_t lock;
+
+	/* used when tags is shared for multiple request_queue or hctx. */
+	struct tag_sharing_ctl ctl;
 };
 
 static inline struct request *blk_mq_tag_to_rq(struct blk_mq_tags *tags,
-- 
2.39.2



* [PATCH RFC v2 2/8] blk-mq: factor out a structure to store information for tag sharing
  2023-10-21 15:47 [PATCH RFC v2 0/8] blk-mq: improve tag fair sharing Yu Kuai
  2023-10-21 15:47 ` [PATCH RFC v2 1/8] blk-mq: factor out a structure from blk_mq_tags Yu Kuai
@ 2023-10-21 15:48 ` Yu Kuai
  2023-10-21 15:48 ` [PATCH RFC v2 3/8] blk-mq: add a helper to initialize shared_tag_info Yu Kuai
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 14+ messages in thread
From: Yu Kuai @ 2023-10-21 15:48 UTC (permalink / raw)
  To: bvanassche, hch, kbusch, ming.lei, axboe
  Cc: linux-block, linux-kernel, yukuai3, yukuai1, yi.zhang, yangerkun

From: Yu Kuai <yukuai3@huawei.com>

Tags can be shared by multiple request_queues or hctxs, and both have a
counter to record how many tags are in use:
'nr_active_requests_shared_tags' for the request_queue and 'nr_active'
for the hctx. Factor out a new structure 'shared_tag_info' to store
such information.

Also add a new debugfs entry 'shared_tag_info' for both queue and hctx.

This patch prepares for refactoring tag sharing.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
 block/blk-core.c       |  2 +-
 block/blk-mq-debugfs.c | 24 ++++++++++++++++++++++++
 block/blk-mq.c         |  2 +-
 block/blk-mq.h         | 23 ++++++++++++-----------
 include/linux/blk-mq.h | 11 +++++------
 include/linux/blkdev.h |  7 +++++--
 6 files changed, 48 insertions(+), 21 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 9d51e9894ece..c028b047f5d5 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -414,7 +414,7 @@ struct request_queue *blk_alloc_queue(int node_id)
 
 	q->node = node_id;
 
-	atomic_set(&q->nr_active_requests_shared_tags, 0);
+	atomic_set(&q->shared_tag_info.active_tags, 0);
 
 	timer_setup(&q->timeout, blk_rq_timed_out_timer, 0);
 	INIT_WORK(&q->timeout_work, blk_timeout_work);
diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index f6b77289bc1f..d6ebd8d9d3bb 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -155,12 +155,27 @@ static ssize_t queue_state_write(void *data, const char __user *buf,
 	return count;
 }
 
+static void shared_tag_info_show(struct shared_tag_info *info,
+				 struct seq_file *m)
+{
+	seq_printf(m, "%d\n", atomic_read(&info->active_tags));
+}
+
+static int queue_shared_tag_info_show(void *data, struct seq_file *m)
+{
+	struct request_queue *q = data;
+
+	shared_tag_info_show(&q->shared_tag_info, m);
+	return 0;
+}
+
 static const struct blk_mq_debugfs_attr blk_mq_debugfs_queue_attrs[] = {
 	{ "poll_stat", 0400, queue_poll_stat_show },
 	{ "requeue_list", 0400, .seq_ops = &queue_requeue_list_seq_ops },
 	{ "pm_only", 0600, queue_pm_only_show, NULL },
 	{ "state", 0600, queue_state_show, queue_state_write },
 	{ "zone_wlock", 0400, queue_zone_wlock_show, NULL },
+	{ "shared_tag_info", 0400, queue_shared_tag_info_show, NULL },
 	{ },
 };
 
@@ -512,6 +527,14 @@ static int hctx_dispatch_busy_show(void *data, struct seq_file *m)
 	return 0;
 }
 
+static int hctx_shared_tag_info_show(void *data, struct seq_file *m)
+{
+	struct blk_mq_hw_ctx *hctx = data;
+
+	shared_tag_info_show(&hctx->shared_tag_info, m);
+	return 0;
+}
+
 #define CTX_RQ_SEQ_OPS(name, type)					\
 static void *ctx_##name##_rq_list_start(struct seq_file *m, loff_t *pos) \
 	__acquires(&ctx->lock)						\
@@ -628,6 +651,7 @@ static const struct blk_mq_debugfs_attr blk_mq_debugfs_hctx_attrs[] = {
 	{"active", 0400, hctx_active_show},
 	{"dispatch_busy", 0400, hctx_dispatch_busy_show},
 	{"type", 0400, hctx_type_show},
+	{"shared_tag_info", 0400, hctx_shared_tag_info_show},
 	{},
 };
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index e2d11183f62e..d85b9ad816d0 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -3679,7 +3679,7 @@ blk_mq_alloc_hctx(struct request_queue *q, struct blk_mq_tag_set *set,
 	if (!zalloc_cpumask_var_node(&hctx->cpumask, gfp, node))
 		goto free_hctx;
 
-	atomic_set(&hctx->nr_active, 0);
+	atomic_set(&hctx->shared_tag_info.active_tags, 0);
 	if (node == NUMA_NO_NODE)
 		node = set->numa_node;
 	hctx->numa_node = node;
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 363c8f6e13dd..6f332dc122ff 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -274,10 +274,10 @@ static inline int blk_mq_get_rq_budget_token(struct request *rq)
 static inline void __blk_mq_add_active_requests(struct blk_mq_hw_ctx *hctx,
 						int val)
 {
-	if (blk_mq_is_shared_tags(hctx->flags))
-		atomic_add(val, &hctx->queue->nr_active_requests_shared_tags);
-	else
-		atomic_add(val, &hctx->nr_active);
+	struct shared_tag_info *info = blk_mq_is_shared_tags(hctx->flags) ?
+		&hctx->queue->shared_tag_info : &hctx->shared_tag_info;
+
+	atomic_add(val, &info->active_tags);
 }
 
 static inline void __blk_mq_inc_active_requests(struct blk_mq_hw_ctx *hctx)
@@ -288,10 +288,10 @@ static inline void __blk_mq_inc_active_requests(struct blk_mq_hw_ctx *hctx)
 static inline void __blk_mq_sub_active_requests(struct blk_mq_hw_ctx *hctx,
 		int val)
 {
-	if (blk_mq_is_shared_tags(hctx->flags))
-		atomic_sub(val, &hctx->queue->nr_active_requests_shared_tags);
-	else
-		atomic_sub(val, &hctx->nr_active);
+	struct shared_tag_info *info = blk_mq_is_shared_tags(hctx->flags) ?
+		&hctx->queue->shared_tag_info : &hctx->shared_tag_info;
+
+	atomic_sub(val, &info->active_tags);
 }
 
 static inline void __blk_mq_dec_active_requests(struct blk_mq_hw_ctx *hctx)
@@ -327,9 +327,10 @@ static inline void blk_mq_dec_active_requests(struct blk_mq_hw_ctx *hctx)
 
 static inline int __blk_mq_active_requests(struct blk_mq_hw_ctx *hctx)
 {
-	if (blk_mq_is_shared_tags(hctx->flags))
-		return atomic_read(&hctx->queue->nr_active_requests_shared_tags);
-	return atomic_read(&hctx->nr_active);
+	struct shared_tag_info *info = blk_mq_is_shared_tags(hctx->flags) ?
+		&hctx->queue->shared_tag_info : &hctx->shared_tag_info;
+
+	return atomic_read(&info->active_tags);
 }
 static inline void __blk_mq_put_driver_tag(struct blk_mq_hw_ctx *hctx,
 					   struct request *rq)
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 5a08527c53b9..4301226f311b 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -384,6 +384,11 @@ struct blk_mq_hw_ctx {
 	 * assigned when a request is dispatched from a hardware queue.
 	 */
 	struct blk_mq_tags	*tags;
+	/**
+	 * @shared_tag_info: Only used when a tag set is shared across request
+	 * queues.
+	 */
+	struct shared_tag_info	shared_tag_info;
 	/**
 	 * @sched_tags: Tags owned by I/O scheduler. If there is an I/O
 	 * scheduler associated with a request queue, a tag is assigned when
@@ -399,12 +404,6 @@ struct blk_mq_hw_ctx {
 	/** @queue_num: Index of this hardware queue. */
 	unsigned int		queue_num;
 
-	/**
-	 * @nr_active: Number of active requests. Only used when a tag set is
-	 * shared across request queues.
-	 */
-	atomic_t		nr_active;
-
 	/** @cpuhp_online: List to store request if CPU is going to die */
 	struct hlist_node	cpuhp_online;
 	/** @cpuhp_dead: List to store request if some CPU die. */
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 51fa7ffdee83..645a8e245add 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -375,6 +375,10 @@ struct blk_independent_access_ranges {
 	struct blk_independent_access_range	ia_range[];
 };
 
+struct shared_tag_info {
+	atomic_t active_tags;
+};
+
 struct request_queue {
 	struct request		*last_merge;
 	struct elevator_queue	*elevator;
@@ -455,8 +459,6 @@ struct request_queue {
 	struct timer_list	timeout;
 	struct work_struct	timeout_work;
 
-	atomic_t		nr_active_requests_shared_tags;
-
 	struct blk_mq_tags	*sched_shared_tags;
 
 	struct list_head	icq_list;
@@ -513,6 +515,7 @@ struct request_queue {
 
 	struct blk_mq_tag_set	*tag_set;
 	struct list_head	tag_set_list;
+	struct shared_tag_info	shared_tag_info;
 
 	struct dentry		*debugfs_dir;
 	struct dentry		*sched_debugfs_dir;
-- 
2.39.2



* [PATCH RFC v2 3/8] blk-mq: add a helper to initialize shared_tag_info
  2023-10-21 15:47 [PATCH RFC v2 0/8] blk-mq: improve tag fair sharing Yu Kuai
  2023-10-21 15:47 ` [PATCH RFC v2 1/8] blk-mq: factor out a structure from blk_mq_tags Yu Kuai
  2023-10-21 15:48 ` [PATCH RFC v2 2/8] blk-mq: factor out a structure to store information for tag sharing Yu Kuai
@ 2023-10-21 15:48 ` Yu Kuai
  2023-10-21 15:48 ` [PATCH RFC v2 4/8] blk-mq: support to track active queues from blk_mq_tags Yu Kuai
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 14+ messages in thread
From: Yu Kuai @ 2023-10-21 15:48 UTC (permalink / raw)
  To: bvanassche, hch, kbusch, ming.lei, axboe
  Cc: linux-block, linux-kernel, yukuai3, yukuai1, yi.zhang, yangerkun

From: Yu Kuai <yukuai3@huawei.com>

shared_tag_info is used for both the request_queue and the hctx, and
follow-up patches will add more fields to the structure; add a helper
to avoid redundant initialization code.

Also move the initialization for the request_queue from
blk_alloc_queue() to blk_mq_init_allocated_queue(), because
shared_tag_info won't be used for bio-based devices. And move the
initialization for the hctx from blk_mq_alloc_hctx() to
blk_mq_init_hctx(), because a hctx can be reused.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
 block/blk-core.c   | 2 --
 block/blk-mq-tag.c | 5 +++++
 block/blk-mq.c     | 3 ++-
 block/blk-mq.h     | 1 +
 4 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index c028b047f5d5..756ca1109f6c 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -414,8 +414,6 @@ struct request_queue *blk_alloc_queue(int node_id)
 
 	q->node = node_id;
 
-	atomic_set(&q->shared_tag_info.active_tags, 0);
-
 	timer_setup(&q->timeout, blk_rq_timed_out_timer, 0);
 	INIT_WORK(&q->timeout_work, blk_timeout_work);
 	INIT_LIST_HEAD(&q->icq_list);
diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index fe41a0d34fc0..2f91a7605d7a 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -29,6 +29,11 @@ static void blk_mq_update_wake_batch(struct blk_mq_tags *tags,
 			users);
 }
 
+void blk_mq_init_shared_tag_info(struct shared_tag_info *info)
+{
+	atomic_set(&info->active_tags, 0);
+}
+
 /*
  * If a previously inactive queue goes active, bump the active user count.
  * We need to do this before try to allocate driver tag, then even if fail
diff --git a/block/blk-mq.c b/block/blk-mq.c
index d85b9ad816d0..de5859dd9f52 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -3652,6 +3652,7 @@ static int blk_mq_init_hctx(struct request_queue *q,
 	if (xa_insert(&q->hctx_table, hctx_idx, hctx, GFP_KERNEL))
 		goto exit_flush_rq;
 
+	blk_mq_init_shared_tag_info(&hctx->shared_tag_info);
 	return 0;
 
  exit_flush_rq:
@@ -3679,7 +3680,6 @@ blk_mq_alloc_hctx(struct request_queue *q, struct blk_mq_tag_set *set,
 	if (!zalloc_cpumask_var_node(&hctx->cpumask, gfp, node))
 		goto free_hctx;
 
-	atomic_set(&hctx->shared_tag_info.active_tags, 0);
 	if (node == NUMA_NO_NODE)
 		node = set->numa_node;
 	hctx->numa_node = node;
@@ -4227,6 +4227,7 @@ int blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
 	if (blk_mq_alloc_ctxs(q))
 		goto err_exit;
 
+	blk_mq_init_shared_tag_info(&q->shared_tag_info);
 	/* init q->mq_kobj and sw queues' kobjects */
 	blk_mq_sysfs_init(q);
 
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 6f332dc122ff..ac58f2e22f20 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -63,6 +63,7 @@ struct blk_mq_tags *blk_mq_alloc_map_and_rqs(struct blk_mq_tag_set *set,
 void blk_mq_free_map_and_rqs(struct blk_mq_tag_set *set,
 			     struct blk_mq_tags *tags,
 			     unsigned int hctx_idx);
+void blk_mq_init_shared_tag_info(struct shared_tag_info *info);
 
 /*
  * CPU -> queue mappings
-- 
2.39.2



* [PATCH RFC v2 4/8] blk-mq: support to track active queues from blk_mq_tags
  2023-10-21 15:47 [PATCH RFC v2 0/8] blk-mq: improve tag fair sharing Yu Kuai
                   ` (2 preceding siblings ...)
  2023-10-21 15:48 ` [PATCH RFC v2 3/8] blk-mq: add a helper to initialize shared_tag_info Yu Kuai
@ 2023-10-21 15:48 ` Yu Kuai
  2023-10-21 15:48 ` [PATCH RFC v2 5/8] blk-mq: precalculate available tags for hctx_may_queue() Yu Kuai
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 14+ messages in thread
From: Yu Kuai @ 2023-10-21 15:48 UTC (permalink / raw)
  To: bvanassche, hch, kbusch, ming.lei, axboe
  Cc: linux-block, linux-kernel, yukuai3, yukuai1, yi.zhang, yangerkun

From: Yu Kuai <yukuai3@huawei.com>

In order to refactor how tags are shared, it's necessary to track some
information for each sharing queue/hctx, so that more tags can be
assigned to the ones under higher pressure.

This patch prepares for refactoring tag sharing.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
 block/blk-mq-tag.c     | 38 ++++++++++++++++++++++++++++++++------
 include/linux/blk-mq.h |  5 +++++
 include/linux/blkdev.h |  3 ++-
 3 files changed, 39 insertions(+), 7 deletions(-)

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 2f91a7605d7a..07d9b513990b 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -32,6 +32,7 @@ static void blk_mq_update_wake_batch(struct blk_mq_tags *tags,
 void blk_mq_init_shared_tag_info(struct shared_tag_info *info)
 {
 	atomic_set(&info->active_tags, 0);
+	INIT_LIST_HEAD(&info->node);
 }
 
 /*
@@ -44,6 +45,7 @@ void __blk_mq_tag_busy(struct blk_mq_hw_ctx *hctx)
 {
 	unsigned int users;
 	struct blk_mq_tags *tags = hctx->tags;
+	struct shared_tag_info *info;
 
 	/*
 	 * calling test_bit() prior to test_and_set_bit() is intentional,
@@ -55,13 +57,18 @@ void __blk_mq_tag_busy(struct blk_mq_hw_ctx *hctx)
 		if (test_bit(QUEUE_FLAG_HCTX_ACTIVE, &q->queue_flags) ||
 		    test_and_set_bit(QUEUE_FLAG_HCTX_ACTIVE, &q->queue_flags))
 			return;
+
+		info = &q->shared_tag_info;
 	} else {
 		if (test_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state) ||
 		    test_and_set_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
 			return;
+
+		info = &hctx->shared_tag_info;
 	}
 
 	spin_lock_irq(&tags->lock);
+	list_add(&info->node, &tags->ctl.head);
 	users = tags->ctl.active_queues + 1;
 	WRITE_ONCE(tags->ctl.active_queues, users);
 	blk_mq_update_wake_batch(tags, users);
@@ -84,26 +91,44 @@ void blk_mq_tag_wakeup_all(struct blk_mq_tags *tags, bool include_reserve)
  */
 void __blk_mq_tag_idle(struct blk_mq_hw_ctx *hctx)
 {
-	struct blk_mq_tags *tags = hctx->tags;
 	unsigned int users;
+	struct blk_mq_tags *tags = hctx->tags;
+	struct shared_tag_info *info;
 
 	if (blk_mq_is_shared_tags(hctx->flags)) {
 		struct request_queue *q = hctx->queue;
 
-		if (!test_and_clear_bit(QUEUE_FLAG_HCTX_ACTIVE,
-					&q->queue_flags))
+		if (!test_bit(QUEUE_FLAG_HCTX_ACTIVE, &q->queue_flags))
 			return;
+		spin_lock_irq(&tags->lock);
+		if (!test_bit(QUEUE_FLAG_HCTX_ACTIVE, &q->queue_flags)) {
+			spin_unlock_irq(&tags->lock);
+			return;
+		}
+
+		info = &q->shared_tag_info;
 	} else {
-		if (!test_and_clear_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
+		if (!test_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
 			return;
+		spin_lock_irq(&tags->lock);
+		if (!test_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state)) {
+			spin_unlock_irq(&tags->lock);
+			return;
+		}
+
+		info = &hctx->shared_tag_info;
 	}
 
-	spin_lock_irq(&tags->lock);
+	list_del_init(&info->node);
 	users = tags->ctl.active_queues - 1;
 	WRITE_ONCE(tags->ctl.active_queues, users);
 	blk_mq_update_wake_batch(tags, users);
-	spin_unlock_irq(&tags->lock);
 
+	if (blk_mq_is_shared_tags(hctx->flags))
+		clear_bit(QUEUE_FLAG_HCTX_ACTIVE, &hctx->queue->queue_flags);
+	else
+		clear_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state);
+	spin_unlock_irq(&tags->lock);
 	blk_mq_tag_wakeup_all(tags, false);
 }
 
@@ -586,6 +611,7 @@ struct blk_mq_tags *blk_mq_init_tags(unsigned int total_tags,
 	tags->nr_tags = total_tags;
 	tags->nr_reserved_tags = reserved_tags;
 	spin_lock_init(&tags->lock);
+	INIT_LIST_HEAD(&tags->ctl.head);
 
 	if (blk_mq_init_bitmaps(&tags->bitmap_tags, &tags->breserved_tags,
 				total_tags, reserved_tags, node,
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 4301226f311b..c93955f5f28f 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -728,6 +728,11 @@ struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
 
 struct tag_sharing_ctl {
 	unsigned int active_queues;
+	/*
+	 * If driver tags is shared for multiple queue/hctx, this is the head of
+	 * a list with request_queue/hctx->shared_tag_info.node entries.
+	 */
+	struct list_head head;
 };
 
 /*
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 645a8e245add..f97bc2c7acc9 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -376,7 +376,8 @@ struct blk_independent_access_ranges {
 };
 
 struct shared_tag_info {
-	atomic_t active_tags;
+	atomic_t		active_tags;
+	struct list_head	node;
 };
 
 struct request_queue {
-- 
2.39.2



* [PATCH RFC v2 5/8] blk-mq: precalculate available tags for hctx_may_queue()
  2023-10-21 15:47 [PATCH RFC v2 0/8] blk-mq: improve tag fair sharing Yu Kuai
                   ` (3 preceding siblings ...)
  2023-10-21 15:48 ` [PATCH RFC v2 4/8] blk-mq: support to track active queues from blk_mq_tags Yu Kuai
@ 2023-10-21 15:48 ` Yu Kuai
  2023-10-21 15:48 ` [PATCH RFC v2 6/8] blk-mq: add new helpers blk_mq_driver_tag_busy/idle() Yu Kuai
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 14+ messages in thread
From: Yu Kuai @ 2023-10-21 15:48 UTC (permalink / raw)
  To: bvanassche, hch, kbusch, ming.lei, axboe
  Cc: linux-block, linux-kernel, yukuai3, yukuai1, yi.zhang, yangerkun

From: Yu Kuai <yukuai3@huawei.com>

Currently, hctx_may_queue() only needs to know how many queues are
sharing the tags, and then it calculates how many tags are available to
each queue by fair sharing.

Add a new field 'available_tags' to struct shared_tag_info to store how
many tags are available, filled in directly from the slow path, so that
hctx_may_queue() doesn't need to do the calculation.

Tags are still shared fairly for now; with the calculation moved to the
slow path, it becomes feasible to refactor tag sharing with a more
complicated algorithm, which is implemented in the following patches.
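The change can be illustrated by putting the two fast-path checks side by side in simplified userspace C (the real code reads these values with READ_ONCE()/atomic_read() and lives in hctx_may_queue(); the function names below are invented for the comparison):

```c
#include <assert.h>
#include <stdbool.h>

/* Old fast path: recompute the fair share on every tag allocation. */
static bool may_queue_old(unsigned int nr_tags, unsigned int users,
			  unsigned int active_requests)
{
	unsigned int depth = (nr_tags + users - 1) / users;

	if (depth < 4)
		depth = 4;	/* allow at least some tags */
	return active_requests < depth;
}

/* New fast path: the division is gone; available_tags was already
 * computed in the slow path when the set of active queues changed. */
static bool may_queue_new(unsigned int active_tags,
			  unsigned int available_tags)
{
	return active_tags < available_tags;
}
```

Both checks admit the same requests while sharing is still fair; the difference is purely where the arithmetic happens.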

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
 block/blk-mq-debugfs.c |  3 ++-
 block/blk-mq-tag.c     | 35 ++++++++++++++++++++++++++++++++++-
 block/blk-mq.c         |  4 ++--
 block/blk-mq.h         | 19 ++++++++-----------
 include/linux/blkdev.h |  1 +
 5 files changed, 47 insertions(+), 15 deletions(-)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index d6ebd8d9d3bb..1d460119f5b3 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -158,7 +158,8 @@ static ssize_t queue_state_write(void *data, const char __user *buf,
 static void shared_tag_info_show(struct shared_tag_info *info,
 				 struct seq_file *m)
 {
-	seq_printf(m, "%d\n", atomic_read(&info->active_tags));
+	seq_printf(m, "active tags %d\n", atomic_read(&info->active_tags));
+	seq_printf(m, "available tags %u\n", READ_ONCE(info->available_tags));
 }
 
 static int queue_shared_tag_info_show(void *data, struct seq_file *m)
diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 07d9b513990b..261769251282 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -14,6 +14,8 @@
 #include "blk-mq.h"
 #include "blk-mq-sched.h"
 
+#define shared_tags(tags, users) max((tags->nr_tags + users - 1) / users, 4U)
+
 /*
  * Recalculate wakeup batch when tag is shared by hctx.
  */
@@ -29,10 +31,39 @@ static void blk_mq_update_wake_batch(struct blk_mq_tags *tags,
 			users);
 }
 
-void blk_mq_init_shared_tag_info(struct shared_tag_info *info)
+void blk_mq_init_shared_tag_info(struct shared_tag_info *info,
+				 unsigned int nr_tags)
 {
 	atomic_set(&info->active_tags, 0);
 	INIT_LIST_HEAD(&info->node);
+	info->available_tags = nr_tags;
+}
+
+static void blk_mq_update_available_driver_tags(struct blk_mq_tags *tags,
+						struct shared_tag_info *info,
+						unsigned int users)
+{
+	unsigned int old = tags->ctl.active_queues;
+	int nr_tags;
+	struct shared_tag_info *iter;
+
+	if (!old || !users)
+		return;
+
+	nr_tags = (int)shared_tags(tags, users);
+	if (old < users)
+		WRITE_ONCE(info->available_tags, nr_tags);
+	else
+		WRITE_ONCE(info->available_tags, tags->nr_tags);
+
+	nr_tags -= (int)shared_tags(tags, old);
+	list_for_each_entry(iter, &tags->ctl.head, node) {
+		if (iter == info)
+			continue;
+
+		WRITE_ONCE(iter->available_tags,
+			   (unsigned int)((int)iter->available_tags + nr_tags));
+	}
 }
 
 /*
@@ -70,6 +101,7 @@ void __blk_mq_tag_busy(struct blk_mq_hw_ctx *hctx)
 	spin_lock_irq(&tags->lock);
 	list_add(&info->node, &tags->ctl.head);
 	users = tags->ctl.active_queues + 1;
+	blk_mq_update_available_driver_tags(tags, info, users);
 	WRITE_ONCE(tags->ctl.active_queues, users);
 	blk_mq_update_wake_batch(tags, users);
 	spin_unlock_irq(&tags->lock);
@@ -121,6 +153,7 @@ void __blk_mq_tag_idle(struct blk_mq_hw_ctx *hctx)
 
 	list_del_init(&info->node);
 	users = tags->ctl.active_queues - 1;
+	blk_mq_update_available_driver_tags(tags, info, users);
 	WRITE_ONCE(tags->ctl.active_queues, users);
 	blk_mq_update_wake_batch(tags, users);
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index de5859dd9f52..8775616bc85c 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -3652,7 +3652,7 @@ static int blk_mq_init_hctx(struct request_queue *q,
 	if (xa_insert(&q->hctx_table, hctx_idx, hctx, GFP_KERNEL))
 		goto exit_flush_rq;
 
-	blk_mq_init_shared_tag_info(&hctx->shared_tag_info);
+	blk_mq_init_shared_tag_info(&hctx->shared_tag_info, set->queue_depth);
 	return 0;
 
  exit_flush_rq:
@@ -4227,7 +4227,7 @@ int blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
 	if (blk_mq_alloc_ctxs(q))
 		goto err_exit;
 
-	blk_mq_init_shared_tag_info(&q->shared_tag_info);
+	blk_mq_init_shared_tag_info(&q->shared_tag_info, set->queue_depth);
 	/* init q->mq_kobj and sw queues' kobjects */
 	blk_mq_sysfs_init(q);
 
diff --git a/block/blk-mq.h b/block/blk-mq.h
index ac58f2e22f20..5c0d19562848 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -63,7 +63,8 @@ struct blk_mq_tags *blk_mq_alloc_map_and_rqs(struct blk_mq_tag_set *set,
 void blk_mq_free_map_and_rqs(struct blk_mq_tag_set *set,
 			     struct blk_mq_tags *tags,
 			     unsigned int hctx_idx);
-void blk_mq_init_shared_tag_info(struct shared_tag_info *info);
+void blk_mq_init_shared_tag_info(struct shared_tag_info *info,
+				 unsigned int nr_tags);
 
 /*
  * CPU -> queue mappings
@@ -416,7 +417,7 @@ static inline void blk_mq_free_requests(struct list_head *list)
 static inline bool hctx_may_queue(struct blk_mq_hw_ctx *hctx,
 				  struct sbitmap_queue *bt)
 {
-	unsigned int depth, users;
+	struct shared_tag_info *info;
 
 	if (!hctx || !(hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED))
 		return true;
@@ -432,20 +433,16 @@ static inline bool hctx_may_queue(struct blk_mq_hw_ctx *hctx,
 
 		if (!test_bit(QUEUE_FLAG_HCTX_ACTIVE, &q->queue_flags))
 			return true;
+
+		info = &q->shared_tag_info;
 	} else {
 		if (!test_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
 			return true;
-	}
 
-	users = READ_ONCE(hctx->tags->ctl.active_queues);
-	if (!users)
-		return true;
+		info = &hctx->shared_tag_info;
+	}
 
-	/*
-	 * Allow at least some tags
-	 */
-	depth = max((bt->sb.depth + users - 1) / users, 4U);
-	return __blk_mq_active_requests(hctx) < depth;
+	return atomic_read(&info->active_tags) < READ_ONCE(info->available_tags);
 }
 
 /* run the code block in @dispatch_ops with rcu/srcu read lock held */
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index f97bc2c7acc9..b364d65fe4e5 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -377,6 +377,7 @@ struct blk_independent_access_ranges {
 
 struct shared_tag_info {
 	atomic_t		active_tags;
+	unsigned int		available_tags;
 	struct list_head	node;
 };
 
-- 
2.39.2



* [PATCH RFC v2 6/8] blk-mq: add new helpers blk_mq_driver_tag_busy/idle()
  2023-10-21 15:47 [PATCH RFC v2 0/8] blk-mq: improve tag fair sharing Yu Kuai
                   ` (4 preceding siblings ...)
  2023-10-21 15:48 ` [PATCH RFC v2 5/8] blk-mq: precalculate available tags for hctx_may_queue() Yu Kuai
@ 2023-10-21 15:48 ` Yu Kuai
  2023-10-21 15:48 ` [PATCH RFC v2 7/8] blk-mq-tag: delay tag sharing until fail to get driver tag Yu Kuai
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 14+ messages in thread
From: Yu Kuai @ 2023-10-21 15:48 UTC (permalink / raw)
  To: bvanassche, hch, kbusch, ming.lei, axboe
  Cc: linux-block, linux-kernel, yukuai3, yukuai1, yi.zhang, yangerkun

From: Yu Kuai <yukuai3@huawei.com>

Following the implementation of blk_mq_tag_busy/idle():

 - blk_mq_driver_tag_busy() is called the first time getting a driver
 tag fails;
 - blk_mq_driver_tag_idle() is called when driver tags are no longer
 exhausted;
 - a new counter 'busy_queues' is added to indicate how many shared
 queues/hctxs are busy (their driver tags are exhausted).

Tag sharing will be delayed until getting a driver tag fails, based on
these new helpers.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
 block/blk-mq-debugfs.c |  2 ++
 block/blk-mq-tag.c     | 53 +++++++++++++++++++++++++++++++++++++++++-
 block/blk-mq.c         |  9 +++++--
 block/blk-mq.h         | 25 ++++++++++++++++----
 include/linux/blk-mq.h |  7 ++++--
 include/linux/blkdev.h |  1 +
 6 files changed, 88 insertions(+), 9 deletions(-)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 1d460119f5b3..170bc2236e81 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -417,6 +417,8 @@ static void blk_mq_debugfs_tags_show(struct seq_file *m,
 	seq_printf(m, "nr_reserved_tags=%u\n", tags->nr_reserved_tags);
 	seq_printf(m, "active_queues=%d\n",
 		   READ_ONCE(tags->ctl.active_queues));
+	seq_printf(m, "busy_queues=%d\n",
+		   READ_ONCE(tags->ctl.busy_queues));
 
 	seq_puts(m, "\nbitmap_tags:\n");
 	sbitmap_queue_show(&tags->bitmap_tags, m);
diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 261769251282..cd13d8e512f7 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -165,6 +165,51 @@ void __blk_mq_tag_idle(struct blk_mq_hw_ctx *hctx)
 	blk_mq_tag_wakeup_all(tags, false);
 }
 
+void __blk_mq_driver_tag_busy(struct blk_mq_hw_ctx *hctx)
+{
+	unsigned int users;
+	struct blk_mq_tags *tags = hctx->tags;
+
+	if (blk_mq_is_shared_tags(hctx->flags)) {
+		struct request_queue *q = hctx->queue;
+
+		if (test_bit(QUEUE_FLAG_HCTX_BUSY, &q->queue_flags) ||
+		    test_and_set_bit(QUEUE_FLAG_HCTX_BUSY, &q->queue_flags))
+			return;
+	} else {
+		if (test_bit(BLK_MQ_S_DTAG_BUSY, &hctx->state) ||
+		    test_and_set_bit(BLK_MQ_S_DTAG_BUSY, &hctx->state))
+			return;
+	}
+
+	spin_lock_irq(&tags->lock);
+	users = tags->ctl.busy_queues + 1;
+	WRITE_ONCE(tags->ctl.busy_queues, users);
+	spin_unlock_irq(&tags->lock);
+}
+
+void __blk_mq_driver_tag_idle(struct blk_mq_hw_ctx *hctx)
+{
+	unsigned int users;
+	struct blk_mq_tags *tags = hctx->tags;
+
+	if (blk_mq_is_shared_tags(hctx->flags)) {
+		struct request_queue *q = hctx->queue;
+
+		if (!test_and_clear_bit(QUEUE_FLAG_HCTX_BUSY,
+					&q->queue_flags))
+			return;
+	} else {
+		if (!test_and_clear_bit(BLK_MQ_S_DTAG_BUSY, &hctx->state))
+			return;
+	}
+
+	spin_lock_irq(&tags->lock);
+	users = tags->ctl.busy_queues - 1;
+	WRITE_ONCE(tags->ctl.busy_queues, users);
+	spin_unlock_irq(&tags->lock);
+}
+
 static int __blk_mq_get_tag(struct blk_mq_alloc_data *data,
 			    struct sbitmap_queue *bt)
 {
@@ -218,8 +263,11 @@ unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
 	if (tag != BLK_MQ_NO_TAG)
 		goto found_tag;
 
-	if (data->flags & BLK_MQ_REQ_NOWAIT)
+	if (data->flags & BLK_MQ_REQ_NOWAIT) {
+		if (!(data->rq_flags & RQF_SCHED_TAGS))
+			blk_mq_driver_tag_busy(data->hctx);
 		return BLK_MQ_NO_TAG;
+	}
 
 	ws = bt_wait_ptr(bt, data->hctx);
 	do {
@@ -246,6 +294,9 @@ unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
 		if (tag != BLK_MQ_NO_TAG)
 			break;
 
+		if (!(data->rq_flags & RQF_SCHED_TAGS))
+			blk_mq_driver_tag_busy(data->hctx);
+
 		bt_prev = bt;
 		io_schedule();
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 8775616bc85c..a106533f063f 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1668,8 +1668,10 @@ static void blk_mq_timeout_work(struct work_struct *work)
 		 */
 		queue_for_each_hw_ctx(q, hctx, i) {
 			/* the hctx may be unmapped, so check it here */
-			if (blk_mq_hw_queue_mapped(hctx))
+			if (blk_mq_hw_queue_mapped(hctx)) {
 				blk_mq_tag_idle(hctx);
+				blk_mq_driver_tag_idle(hctx);
+			}
 		}
 	}
 	blk_queue_exit(q);
@@ -3594,8 +3596,10 @@ static void blk_mq_exit_hctx(struct request_queue *q,
 {
 	struct request *flush_rq = hctx->fq->flush_rq;
 
-	if (blk_mq_hw_queue_mapped(hctx))
+	if (blk_mq_hw_queue_mapped(hctx)) {
 		blk_mq_tag_idle(hctx);
+		blk_mq_driver_tag_idle(hctx);
+	}
 
 	if (blk_queue_init_done(q))
 		blk_mq_clear_flush_rq_mapping(set->tags[hctx_idx],
@@ -3931,6 +3935,7 @@ static void queue_set_hctx_shared(struct request_queue *q, bool shared)
 			hctx->flags |= BLK_MQ_F_TAG_QUEUE_SHARED;
 		} else {
 			blk_mq_tag_idle(hctx);
+			blk_mq_driver_tag_idle(hctx);
 			hctx->flags &= ~BLK_MQ_F_TAG_QUEUE_SHARED;
 		}
 	}
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 5c0d19562848..3e555af1de49 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -195,8 +195,10 @@ static inline struct sbq_wait_state *bt_wait_ptr(struct sbitmap_queue *bt,
 	return sbq_wait_ptr(bt, &hctx->wait_index);
 }
 
-void __blk_mq_tag_busy(struct blk_mq_hw_ctx *);
-void __blk_mq_tag_idle(struct blk_mq_hw_ctx *);
+void __blk_mq_tag_busy(struct blk_mq_hw_ctx *hctx);
+void __blk_mq_tag_idle(struct blk_mq_hw_ctx *hctx);
+void __blk_mq_driver_tag_busy(struct blk_mq_hw_ctx *hctx);
+void __blk_mq_driver_tag_idle(struct blk_mq_hw_ctx *hctx);
 
 static inline void blk_mq_tag_busy(struct blk_mq_hw_ctx *hctx)
 {
@@ -210,6 +212,18 @@ static inline void blk_mq_tag_idle(struct blk_mq_hw_ctx *hctx)
 		__blk_mq_tag_idle(hctx);
 }
 
+static inline void blk_mq_driver_tag_busy(struct blk_mq_hw_ctx *hctx)
+{
+	if (hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED)
+		__blk_mq_driver_tag_busy(hctx);
+}
+
+static inline void blk_mq_driver_tag_idle(struct blk_mq_hw_ctx *hctx)
+{
+	if (hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED)
+		__blk_mq_driver_tag_idle(hctx);
+}
+
 static inline bool blk_mq_tag_is_reserved(struct blk_mq_tags *tags,
 					  unsigned int tag)
 {
@@ -293,7 +307,8 @@ static inline void __blk_mq_sub_active_requests(struct blk_mq_hw_ctx *hctx,
 	struct shared_tag_info *info = blk_mq_is_shared_tags(hctx->flags) ?
 		&hctx->queue->shared_tag_info : &hctx->shared_tag_info;
 
-	atomic_sub(val, &info->active_tags);
+	if (!atomic_sub_return(val, &info->active_tags))
+		blk_mq_driver_tag_idle(hctx);
 }
 
 static inline void __blk_mq_dec_active_requests(struct blk_mq_hw_ctx *hctx)
@@ -354,8 +369,10 @@ bool __blk_mq_alloc_driver_tag(struct request *rq);
 
 static inline bool blk_mq_get_driver_tag(struct request *rq)
 {
-	if (rq->tag == BLK_MQ_NO_TAG && !__blk_mq_alloc_driver_tag(rq))
+	if (rq->tag == BLK_MQ_NO_TAG && !__blk_mq_alloc_driver_tag(rq)) {
+		blk_mq_driver_tag_busy(rq->mq_hctx);
 		return false;
+	}
 
 	return true;
 }
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index c93955f5f28f..9182ceca8c7a 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -666,10 +666,11 @@ enum {
 
 	BLK_MQ_S_STOPPED	= 0,
 	BLK_MQ_S_TAG_ACTIVE	= 1,
-	BLK_MQ_S_SCHED_RESTART	= 2,
+	BLK_MQ_S_DTAG_BUSY      = 2,
+	BLK_MQ_S_SCHED_RESTART	= 3,
 
 	/* hw queue is inactive after all its CPUs become offline */
-	BLK_MQ_S_INACTIVE	= 3,
+	BLK_MQ_S_INACTIVE	= 4,
 
 	BLK_MQ_MAX_DEPTH	= 10240,
 
@@ -728,6 +729,8 @@ struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
 
 struct tag_sharing_ctl {
 	unsigned int active_queues;
+	/* The number of shared queues/hctxs with exhausted driver tags. */
+	unsigned int busy_queues;
 	/*
 	 * If driver tags is shared for multiple queue/hctx, this is the head of
 	 * a list with request_queue/hctx->shared_tag_info.node entries.
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index b364d65fe4e5..8fd6a0a92233 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -552,6 +552,7 @@ struct request_queue {
 #define QUEUE_FLAG_DAX		19	/* device supports DAX */
 #define QUEUE_FLAG_STATS	20	/* track IO start and completion times */
 #define QUEUE_FLAG_REGISTERED	22	/* queue has been registered to a disk */
+#define QUEUE_FLAG_HCTX_BUSY	23      /* driver tag is exhausted for at least one blk-mq hctx */
 #define QUEUE_FLAG_QUIESCED	24	/* queue has been quiesced */
 #define QUEUE_FLAG_PCI_P2PDMA	25	/* device supports PCI p2p requests */
 #define QUEUE_FLAG_ZONE_RESETALL 26	/* supports Zone Reset All */
-- 
2.39.2



* [PATCH RFC v2 7/8] blk-mq-tag: delay tag sharing until fail to get driver tag
  2023-10-21 15:47 [PATCH RFC v2 0/8] blk-mq: improve tag fair sharing Yu Kuai
                   ` (5 preceding siblings ...)
  2023-10-21 15:48 ` [PATCH RFC v2 6/8] blk-mq: add new helpers blk_mq_driver_tag_busy/idle() Yu Kuai
@ 2023-10-21 15:48 ` Yu Kuai
  2023-10-21 15:48 ` [PATCH RFC v2 8/8] blk-mq-tag: allow shared queue/hctx to get more driver tags Yu Kuai
  2023-10-23  4:38 ` [PATCH RFC v2 0/8] blk-mq: improve tag fair sharing Ming Lei
  8 siblings, 0 replies; 14+ messages in thread
From: Yu Kuai @ 2023-10-21 15:48 UTC (permalink / raw)
  To: bvanassche, hch, kbusch, ming.lei, axboe
  Cc: linux-block, linux-kernel, yukuai3, yukuai1, yi.zhang, yangerkun

From: Yu Kuai <yukuai3@huawei.com>

Before this patch, tags are shared as soon as a shared node starts to
handle IO. However, this wastes tags if some node doesn't need all of
its fair share: those tags can't be used by other nodes, even if other
nodes might want more than their fair share.

Prevent this problem by delaying tag sharing from IO issue until a
driver tag allocation fails. Note that the problem still exists if all
the tags are exhausted; the next patch will implement an algorithm to
allow a busy node to borrow tags from idle nodes.
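The fair-share depth that this series reworks can be sketched as below. The rounding-up division and the max(..., 4U) floor are from the existing hctx_may_queue() code removed in patch 5; the function name is illustrative. The behavioral change of this patch is which count is passed in as 'sharers': busy_queues (queues that actually failed to get a driver tag) instead of active_queues (any queue that issued IO).

```c
#include <assert.h>

/* How many tags each sharer may use when 'sharers' queues share a tag set. */
static unsigned int fair_share(unsigned int total_tags, unsigned int sharers)
{
	unsigned int depth;

	if (sharers <= 1)
		return total_tags;	/* no sharing in effect */

	/* Round up, but allow at least some tags (max(..., 4U) in the kernel). */
	depth = (total_tags + sharers - 1) / sharers;
	return depth < 4 ? 4 : depth;
}
```

With 32 tags and two sharers this gives 16 each, matching the "start fair sharing" row in the cover-letter test result.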

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
 block/blk-mq-tag.c | 67 ++++++++++++++++++++++++++--------------------
 1 file changed, 38 insertions(+), 29 deletions(-)

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index cd13d8e512f7..a98b25c8d594 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -43,7 +43,7 @@ static void blk_mq_update_available_driver_tags(struct blk_mq_tags *tags,
 						struct shared_tag_info *info,
 						unsigned int users)
 {
-	unsigned int old = tags->ctl.active_queues;
+	unsigned int old = tags->ctl.busy_queues;
 	int nr_tags;
 	struct shared_tag_info *iter;
 
@@ -74,9 +74,7 @@ static void blk_mq_update_available_driver_tags(struct blk_mq_tags *tags,
  */
 void __blk_mq_tag_busy(struct blk_mq_hw_ctx *hctx)
 {
-	unsigned int users;
 	struct blk_mq_tags *tags = hctx->tags;
-	struct shared_tag_info *info;
 
 	/*
 	 * calling test_bit() prior to test_and_set_bit() is intentional,
@@ -88,22 +86,14 @@ void __blk_mq_tag_busy(struct blk_mq_hw_ctx *hctx)
 		if (test_bit(QUEUE_FLAG_HCTX_ACTIVE, &q->queue_flags) ||
 		    test_and_set_bit(QUEUE_FLAG_HCTX_ACTIVE, &q->queue_flags))
 			return;
-
-		info = &q->shared_tag_info;
 	} else {
 		if (test_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state) ||
 		    test_and_set_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
 			return;
-
-		info = &hctx->shared_tag_info;
 	}
 
 	spin_lock_irq(&tags->lock);
-	list_add(&info->node, &tags->ctl.head);
-	users = tags->ctl.active_queues + 1;
-	blk_mq_update_available_driver_tags(tags, info, users);
-	WRITE_ONCE(tags->ctl.active_queues, users);
-	blk_mq_update_wake_batch(tags, users);
+	WRITE_ONCE(tags->ctl.active_queues, tags->ctl.active_queues + 1);
 	spin_unlock_irq(&tags->lock);
 }
 
@@ -123,9 +113,7 @@ void blk_mq_tag_wakeup_all(struct blk_mq_tags *tags, bool include_reserve)
  */
 void __blk_mq_tag_idle(struct blk_mq_hw_ctx *hctx)
 {
-	unsigned int users;
 	struct blk_mq_tags *tags = hctx->tags;
-	struct shared_tag_info *info;
 
 	if (blk_mq_is_shared_tags(hctx->flags)) {
 		struct request_queue *q = hctx->queue;
@@ -137,8 +125,6 @@ void __blk_mq_tag_idle(struct blk_mq_hw_ctx *hctx)
 			spin_unlock_irq(&tags->lock);
 			return;
 		}
-
-		info = &q->shared_tag_info;
 	} else {
 		if (!test_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
 			return;
@@ -147,28 +133,21 @@ void __blk_mq_tag_idle(struct blk_mq_hw_ctx *hctx)
 			spin_unlock_irq(&tags->lock);
 			return;
 		}
-
-		info = &hctx->shared_tag_info;
 	}
 
-	list_del_init(&info->node);
-	users = tags->ctl.active_queues - 1;
-	blk_mq_update_available_driver_tags(tags, info, users);
-	WRITE_ONCE(tags->ctl.active_queues, users);
-	blk_mq_update_wake_batch(tags, users);
-
+	WRITE_ONCE(tags->ctl.active_queues, tags->ctl.active_queues - 1);
 	if (blk_mq_is_shared_tags(hctx->flags))
 		clear_bit(QUEUE_FLAG_HCTX_ACTIVE, &hctx->queue->queue_flags);
 	else
 		clear_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state);
 	spin_unlock_irq(&tags->lock);
-	blk_mq_tag_wakeup_all(tags, false);
 }
 
 void __blk_mq_driver_tag_busy(struct blk_mq_hw_ctx *hctx)
 {
 	unsigned int users;
 	struct blk_mq_tags *tags = hctx->tags;
+	struct shared_tag_info *info;
 
 	if (blk_mq_is_shared_tags(hctx->flags)) {
 		struct request_queue *q = hctx->queue;
@@ -176,14 +155,21 @@ void __blk_mq_driver_tag_busy(struct blk_mq_hw_ctx *hctx)
 		if (test_bit(QUEUE_FLAG_HCTX_BUSY, &q->queue_flags) ||
 		    test_and_set_bit(QUEUE_FLAG_HCTX_BUSY, &q->queue_flags))
 			return;
+
+		info = &q->shared_tag_info;
 	} else {
 		if (test_bit(BLK_MQ_S_DTAG_BUSY, &hctx->state) ||
 		    test_and_set_bit(BLK_MQ_S_DTAG_BUSY, &hctx->state))
 			return;
+
+		info = &hctx->shared_tag_info;
 	}
 
 	spin_lock_irq(&tags->lock);
+	list_add(&info->node, &tags->ctl.head);
 	users = tags->ctl.busy_queues + 1;
+	blk_mq_update_available_driver_tags(tags, info, users);
+	blk_mq_update_wake_batch(tags, users);
 	WRITE_ONCE(tags->ctl.busy_queues, users);
 	spin_unlock_irq(&tags->lock);
 }
@@ -192,22 +178,45 @@ void __blk_mq_driver_tag_idle(struct blk_mq_hw_ctx *hctx)
 {
 	unsigned int users;
 	struct blk_mq_tags *tags = hctx->tags;
+	struct shared_tag_info *info;
 
 	if (blk_mq_is_shared_tags(hctx->flags)) {
 		struct request_queue *q = hctx->queue;
 
-		if (!test_and_clear_bit(QUEUE_FLAG_HCTX_BUSY,
-					&q->queue_flags))
+		if (!test_bit(QUEUE_FLAG_HCTX_BUSY, &q->queue_flags))
 			return;
+
+		spin_lock_irq(&tags->lock);
+		if (!test_bit(QUEUE_FLAG_HCTX_BUSY, &q->queue_flags)) {
+			spin_unlock_irq(&tags->lock);
+			return;
+		}
+		info = &q->shared_tag_info;
 	} else {
-		if (!test_and_clear_bit(BLK_MQ_S_DTAG_BUSY, &hctx->state))
+		if (!test_bit(BLK_MQ_S_DTAG_BUSY, &hctx->state))
 			return;
+
+		spin_lock_irq(&tags->lock);
+		if (!test_bit(BLK_MQ_S_DTAG_BUSY, &hctx->state)) {
+			spin_unlock_irq(&tags->lock);
+			return;
+		}
+		info = &hctx->shared_tag_info;
 	}
 
-	spin_lock_irq(&tags->lock);
+	list_del_init(&info->node);
 	users = tags->ctl.busy_queues - 1;
+	blk_mq_update_available_driver_tags(tags, info, users);
+	blk_mq_update_wake_batch(tags, users);
 	WRITE_ONCE(tags->ctl.busy_queues, users);
+
+	if (blk_mq_is_shared_tags(hctx->flags))
+		clear_bit(QUEUE_FLAG_HCTX_BUSY, &hctx->queue->queue_flags);
+	else
+		clear_bit(BLK_MQ_S_DTAG_BUSY, &hctx->state);
+
 	spin_unlock_irq(&tags->lock);
+	blk_mq_tag_wakeup_all(tags, false);
 }
 
 static int __blk_mq_get_tag(struct blk_mq_alloc_data *data,
-- 
2.39.2



* [PATCH RFC v2 8/8] blk-mq-tag: allow shared queue/hctx to get more driver tags
  2023-10-21 15:47 [PATCH RFC v2 0/8] blk-mq: improve tag fair sharing Yu Kuai
                   ` (6 preceding siblings ...)
  2023-10-21 15:48 ` [PATCH RFC v2 7/8] blk-mq-tag: delay tag sharing until fail to get driver tag Yu Kuai
@ 2023-10-21 15:48 ` Yu Kuai
  2023-10-23  3:24   ` kernel test robot
  2023-10-23 20:46   ` Bart Van Assche
  2023-10-23  4:38 ` [PATCH RFC v2 0/8] blk-mq: improve tag fair sharing Ming Lei
  8 siblings, 2 replies; 14+ messages in thread
From: Yu Kuai @ 2023-10-21 15:48 UTC (permalink / raw)
  To: bvanassche, hch, kbusch, ming.lei, axboe
  Cc: linux-block, linux-kernel, yukuai3, yukuai1, yi.zhang, yangerkun

From: Yu Kuai <yukuai3@huawei.com>

Currently tags are fair shared to each node, and if some node only
issues a little IO, its share of tags is wasted and can't be used for
other nodes that are under high IO pressure.

This patch implements a new way to allow one node to borrow tags from
other nodes:

1) multiple nodes start to issue IO, and once all tags are exhausted,
   start sharing tags by fair share;
2) while sharing tags, detect whether any node needs more tags; if so,
   start a timer;
3) in the timer function:
   - if any node with borrowed tags is no longer busy, return the
     borrowed tags;
   - find the busy node with the highest pressure, calculate available
     free tags from the nodes that are not busy, then borrow tags;
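The "borrow 32/8=4 tags each round" figure in the test result below comes from the per-round step size computed in the timer function. A sketch of that clamp, using the sbitmap constants SBQ_WAIT_QUEUES (8) and SBQ_WAKE_BATCH (8); the function name is illustrative:

```c
#include <assert.h>

#define SBQ_WAIT_QUEUES 8	/* sbitmap wait-queue count */
#define SBQ_WAKE_BATCH  8	/* sbitmap wake batch limit */

/* Tags moved to the busiest node per timer round. */
static unsigned int borrow_step(unsigned int nr_tags)
{
	unsigned int step = nr_tags / SBQ_WAIT_QUEUES;

	/* clamp_t(unsigned int, step, 1, SBQ_WAKE_BATCH) in the patch */
	if (step < 1)
		step = 1;
	if (step > SBQ_WAKE_BATCH)
		step = SBQ_WAKE_BATCH;
	return step;
}
```

The step also serves as the minimum reserve kept for each non-busy node, which is why sdb never drops below 4 usable tags in the table below.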

A simple test, 32 tags with two shared node:
[global]
ioengine=libaio
iodepth=2
bs=4k
direct=1
rw=randrw
group_reporting

[sda]
numjobs=32
filename=/dev/sda

[sdb]
numjobs=1
filename=/dev/sdb

Test result(monitor new debugfs entry):

time	active		available
	sda 	sdb	sda	sdb
0	0	0	32	32
1	16	2	16	16	-> start fair sharing
2	19	2	20	16
3	24	2	24	16
4	26	2	28	16 	-> borrow 32/8=4 tags each round
5	28	2	28	16	-> save at least 4 tags for sdb

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
 block/blk-mq-debugfs.c |  1 +
 block/blk-mq-tag.c     | 92 +++++++++++++++++++++++++++++++++++++++---
 include/linux/blk-mq.h |  3 ++
 include/linux/blkdev.h |  1 +
 4 files changed, 91 insertions(+), 6 deletions(-)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 170bc2236e81..a987f4228b67 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -160,6 +160,7 @@ static void shared_tag_info_show(struct shared_tag_info *info,
 {
 	seq_printf(m, "active tags %d\n", atomic_read(&info->active_tags));
 	seq_printf(m, "available tags %u\n", READ_ONCE(info->available_tags));
+	seq_printf(m, "busy count %u\n", atomic_read(&info->busy_count));
 }
 
 static int queue_shared_tag_info_show(void *data, struct seq_file *m)
diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index a98b25c8d594..798b90d3f11a 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -35,6 +35,7 @@ void blk_mq_init_shared_tag_info(struct shared_tag_info *info,
 				 unsigned int nr_tags)
 {
 	atomic_set(&info->active_tags, 0);
+	atomic_set(&info->busy_count, 0);
 	INIT_LIST_HEAD(&info->node);
 	info->available_tags = nr_tags;
 }
@@ -143,26 +144,93 @@ void __blk_mq_tag_idle(struct blk_mq_hw_ctx *hctx)
 	spin_unlock_irq(&tags->lock);
 }
 
+static void tag_sharing_ctl_timer_fn(struct timer_list *t)
+{
+	struct tag_sharing_ctl *ctl = from_timer(ctl, t, timer);
+	struct blk_mq_tags *tags = container_of(ctl, struct blk_mq_tags, ctl);
+	struct shared_tag_info *busy = NULL;
+	struct shared_tag_info *info;
+	unsigned int nr_tags;
+	unsigned int step;
+	unsigned int free_tags = 0;
+	unsigned int borrowed_tags = 0;
+	unsigned int max_busy_count = 0;
+
+	spin_lock_irq(&tags->lock);
+
+	if (tags->ctl.busy_queues <= 1)
+		goto out;
+
+	/* First round, decrease busy_count by half. */
+	list_for_each_entry(info, &tags->ctl.head, node) {
+		int count = atomic_read(&info->busy_count);
+
+		if (count > tags->nr_tags)
+			count = count >> 1;
+		atomic_sub(count, &info->busy_count);
+	}
+
+	/* Second round, find borrowed tags that can be returned. */
+	nr_tags = shared_tags(tags, tags->ctl.busy_queues);
+	step = clamp_t(unsigned int, tags->nr_tags / SBQ_WAIT_QUEUES, 1,
+		       SBQ_WAKE_BATCH);
+	list_for_each_entry(info, &tags->ctl.head, node) {
+		if (info->available_tags > nr_tags &&
+		    atomic_read(&info->active_tags) <= nr_tags &&
+		    atomic_read(&info->busy_count) <= tags->nr_tags)
+			info->available_tags = nr_tags;
+	}
+
+	/* Last round, find available free tags, and which node need more tags. */
+	list_for_each_entry(info, &tags->ctl.head, node) {
+		unsigned int busy_count;
+
+		if (info->available_tags > nr_tags)
+			borrowed_tags += info->available_tags - nr_tags;
+		else
+			free_tags += info->available_tags - max(step,
+				(unsigned int)atomic_read(&info->active_tags));
+
+		busy_count = atomic_read(&info->busy_count);
+		if (busy_count > tags->nr_tags && busy_count > max_busy_count) {
+			busy = info;
+			max_busy_count = busy_count;
+		}
+	}
+
+	/* Borrow tags. */
+	if (busy && borrowed_tags < free_tags)
+		busy->available_tags += min(free_tags - borrowed_tags, step);
+
+out:
+	if (!busy) {
+		ctl->timer_running = false;
+	} else {
+		ctl->timer.expires = jiffies + HZ;
+		add_timer(&ctl->timer);
+	}
+	spin_unlock_irq(&tags->lock);
+}
+
 void __blk_mq_driver_tag_busy(struct blk_mq_hw_ctx *hctx)
 {
 	unsigned int users;
 	struct blk_mq_tags *tags = hctx->tags;
 	struct shared_tag_info *info;
+	bool timer_running = false;
 
 	if (blk_mq_is_shared_tags(hctx->flags)) {
 		struct request_queue *q = hctx->queue;
 
+		info = &q->shared_tag_info;
 		if (test_bit(QUEUE_FLAG_HCTX_BUSY, &q->queue_flags) ||
 		    test_and_set_bit(QUEUE_FLAG_HCTX_BUSY, &q->queue_flags))
-			return;
-
-		info = &q->shared_tag_info;
+			goto inc_busy;
 	} else {
+		info = &hctx->shared_tag_info;
 		if (test_bit(BLK_MQ_S_DTAG_BUSY, &hctx->state) ||
 		    test_and_set_bit(BLK_MQ_S_DTAG_BUSY, &hctx->state))
-			return;
-
-		info = &hctx->shared_tag_info;
+			goto inc_busy;
 	}
 
 	spin_lock_irq(&tags->lock);
@@ -172,6 +240,15 @@ void __blk_mq_driver_tag_busy(struct blk_mq_hw_ctx *hctx)
 	blk_mq_update_wake_batch(tags, users);
 	WRITE_ONCE(tags->ctl.busy_queues, users);
 	spin_unlock_irq(&tags->lock);
+	return;
+
+inc_busy:
+	atomic_inc(&info->busy_count);
+	if (!tags->ctl.timer_running &&
+	    try_cmpxchg_relaxed(&tags->ctl.timer_running, &timer_running, true)) {
+		tags->ctl.timer.expires = jiffies + HZ;
+		add_timer(&tags->ctl.timer);
+	}
 }
 
 void __blk_mq_driver_tag_idle(struct blk_mq_hw_ctx *hctx)
@@ -204,6 +281,7 @@ void __blk_mq_driver_tag_idle(struct blk_mq_hw_ctx *hctx)
 		info = &hctx->shared_tag_info;
 	}
 
+	atomic_set(&info->busy_count, 0);
 	list_del_init(&info->node);
 	users = tags->ctl.busy_queues - 1;
 	blk_mq_update_available_driver_tags(tags, info, users);
@@ -705,6 +783,7 @@ struct blk_mq_tags *blk_mq_init_tags(unsigned int total_tags,
 	tags->nr_reserved_tags = reserved_tags;
 	spin_lock_init(&tags->lock);
 	INIT_LIST_HEAD(&tags->ctl.head);
+	timer_setup(&tags->ctl.timer, tag_sharing_ctl_timer_fn, 0);
 
 	if (blk_mq_init_bitmaps(&tags->bitmap_tags, &tags->breserved_tags,
 				total_tags, reserved_tags, node,
@@ -717,6 +796,7 @@ struct blk_mq_tags *blk_mq_init_tags(unsigned int total_tags,
 
 void blk_mq_free_tags(struct blk_mq_tags *tags)
 {
+	del_timer_sync(&tags->ctl.timer);
 	sbitmap_queue_free(&tags->bitmap_tags);
 	sbitmap_queue_free(&tags->breserved_tags);
 	kfree(tags);
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 9182ceca8c7a..1e4aa733e1ab 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -736,6 +736,9 @@ struct tag_sharing_ctl {
 	 * a list with request_queue/hctx->shared_tag_info.node entries.
 	 */
 	struct list_head head;
+	bool timer_running;
+	/* Start when any queue/hctx failed to get driver tag. */
+	struct timer_list timer;
 };
 
 /*
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 8fd6a0a92233..8bfae877dd27 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -378,6 +378,7 @@ struct blk_independent_access_ranges {
 struct shared_tag_info {
 	atomic_t		active_tags;
 	unsigned int		available_tags;
+	atomic_t		busy_count;
 	struct list_head	node;
 };
 
-- 
2.39.2



* Re: [PATCH RFC v2 8/8] blk-mq-tag: allow shared queue/hctx to get more driver tags
  2023-10-21 15:48 ` [PATCH RFC v2 8/8] blk-mq-tag: allow shared queue/hctx to get more driver tags Yu Kuai
@ 2023-10-23  3:24   ` kernel test robot
  2023-10-23 20:46   ` Bart Van Assche
  1 sibling, 0 replies; 14+ messages in thread
From: kernel test robot @ 2023-10-23  3:24 UTC (permalink / raw)
  To: Yu Kuai; +Cc: oe-kbuild-all

Hi Yu,

[This is a private test report for your RFC patch.]
kernel test robot noticed the following build errors:

[auto build test ERROR on axboe-block/for-next]
[also build test ERROR on next-20231020]
[cannot apply to hch-configfs/for-next linus/master v6.6-rc7]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Yu-Kuai/blk-mq-factor-out-a-structure-from-blk_mq_tags/20231021-155339
base:   https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git for-next
patch link:    https://lore.kernel.org/r/20231021154806.4019417-9-yukuai1%40huaweicloud.com
patch subject: [PATCH RFC v2 8/8] blk-mq-tag: allow shared queue/hctx to get more driver tags
config: sh-defconfig (https://download.01.org/0day-ci/archive/20231023/202310231109.iysoP8pc-lkp@intel.com/config)
compiler: sh4-linux-gcc (GCC) 13.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20231023/202310231109.iysoP8pc-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202310231109.iysoP8pc-lkp@intel.com/

All errors (new ones prefixed by >>):

   sh4-linux-ld: block/blk-mq-tag.o: in function `__blk_mq_driver_tag_busy':
>> blk-mq-tag.c:(.text+0xa74): undefined reference to `__cmpxchg_called_with_bad_pointer'

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


* Re: [PATCH RFC v2 0/8] blk-mq: improve tag fair sharing
  2023-10-21 15:47 [PATCH RFC v2 0/8] blk-mq: improve tag fair sharing Yu Kuai
                   ` (7 preceding siblings ...)
  2023-10-21 15:48 ` [PATCH RFC v2 8/8] blk-mq-tag: allow shared queue/hctx to get more driver tags Yu Kuai
@ 2023-10-23  4:38 ` Ming Lei
  2023-10-23  7:26   ` Yu Kuai
  8 siblings, 1 reply; 14+ messages in thread
From: Ming Lei @ 2023-10-23  4:38 UTC (permalink / raw)
  To: Yu Kuai
  Cc: bvanassche, hch, kbusch, axboe, linux-block, linux-kernel,
	yukuai3, yi.zhang, yangerkun

Hello Yu Kuai,

On Sat, Oct 21, 2023 at 11:47:58PM +0800, Yu Kuai wrote:
> From: Yu Kuai <yukuai3@huawei.com>
> 
> Current implementation:
>  - a counter active_queues record how many queue/hctx is sharing tags,
>  and it's updated while issue new IO, and cleared in
>  blk_mq_timeout_work().
>  - if active_queues is more than 1, then tags is fair shared to each
>  node;

Can you explain a bit what the problem is in current tag sharing?
And what is your basic approach for this problem?

Just mentioning the implementation is not too helpful for initial
review, cause the problem and approach(correctness) need to be
understood first.

Thanks, 
Ming



* Re: [PATCH RFC v2 0/8] blk-mq: improve tag fair sharing
  2023-10-23  4:38 ` [PATCH RFC v2 0/8] blk-mq: improve tag fair sharing Ming Lei
@ 2023-10-23  7:26   ` Yu Kuai
  0 siblings, 0 replies; 14+ messages in thread
From: Yu Kuai @ 2023-10-23  7:26 UTC (permalink / raw)
  To: Ming Lei, Yu Kuai
  Cc: bvanassche, hch, kbusch, axboe, linux-block, linux-kernel,
	yi.zhang, yangerkun, yukuai (C)

Hi,

在 2023/10/23 12:38, Ming Lei 写道:
> Hello Yu Kuai,
> 
> On Sat, Oct 21, 2023 at 11:47:58PM +0800, Yu Kuai wrote:
>> From: Yu Kuai <yukuai3@huawei.com>
>>
>> Current implementation:
>>   - a counter active_queues record how many queue/hctx is sharing tags,
>>   and it's updated while issue new IO, and cleared in
>>   blk_mq_timeout_work().
>>   - if active_queues is more than 1, then tags is fair shared to each
>>   node;
> 
> Can you explain a bit what the problem is in current tag sharing?
> And what is your basic approach for this problem?
> 
> Just mentioning the implementation is not too helpful for initial
> review, cause the problem and approach(correctness) need to be
> understood first.

Of course, I'll add the following if there is a v3:

Current problems:

If there are multiple active_queues, then tags are fair shared to each
queue, and if one queue is not busy (for example, it only issues one IO
once in a while), then the tags shared to this queue are wasted and
can't be used by other queues.

Depending on the hardware, this might cause performance problems in some
use cases. For example, as reported by [1], UFS devices
have multiple logical units. One of these logical units (WLUN) is used
to submit control commands, e.g. START STOP UNIT. If any request is
submitted to the WLUN, the queue depth is reduced from 31 to 15 or
lower for data LUNs.

This patchset first delays tag sharing from IO issue until a driver tag
allocation fails; then adds a counter to record how many times a shared
queue failed to get a driver tag, to indicate whether the queue is busy;
finally, it allows a busy queue to borrow more tags from idle queues.

Thanks,
Kuai

> 
> Thanks,
> Ming
> 
> .
> 



* Re: [PATCH RFC v2 8/8] blk-mq-tag: allow shared queue/hctx to get more driver tags
  2023-10-21 15:48 ` [PATCH RFC v2 8/8] blk-mq-tag: allow shared queue/hctx to get more driver tags Yu Kuai
  2023-10-23  3:24   ` kernel test robot
@ 2023-10-23 20:46   ` Bart Van Assche
  2023-10-24  1:07     ` Yu Kuai
  1 sibling, 1 reply; 14+ messages in thread
From: Bart Van Assche @ 2023-10-23 20:46 UTC (permalink / raw)
  To: Yu Kuai, hch, kbusch, ming.lei, axboe
  Cc: linux-block, linux-kernel, yukuai3, yi.zhang, yangerkun

On 10/21/23 08:48, Yu Kuai wrote:
> +	if (!busy) {
> +		ctl->timer_running = false;
> +	} else {
> +		ctl->timer.expires = jiffies + HZ;
> +		add_timer(&ctl->timer);
> +	}
> +	spin_unlock_irq(&tags->lock);

[ ... ]

> +inc_busy:
> +	atomic_inc(&info->busy_count);
> +	if (!tags->ctl.timer_running &&
> +	    try_cmpxchg_relaxed(&tags->ctl.timer_running, &timer_running, true)) {
> +		tags->ctl.timer.expires = jiffies + HZ;
> +		add_timer(&tags->ctl.timer);
> +	}
>   }

So the choice has been made to let the timer expire after one second? I
think the choice of the timer value is important enough to mention it in
the patch description.

Thanks,

Bart.



* Re: [PATCH RFC v2 8/8] blk-mq-tag: allow shared queue/hctx to get more driver tags
  2023-10-23 20:46   ` Bart Van Assche
@ 2023-10-24  1:07     ` Yu Kuai
  0 siblings, 0 replies; 14+ messages in thread
From: Yu Kuai @ 2023-10-24  1:07 UTC (permalink / raw)
  To: Bart Van Assche, Yu Kuai, hch, kbusch, ming.lei, axboe
  Cc: linux-block, linux-kernel, yi.zhang, yangerkun, yukuai (C)

Hi,

在 2023/10/24 4:46, Bart Van Assche 写道:
> On 10/21/23 08:48, Yu Kuai wrote:
>> +    if (!busy) {
>> +        ctl->timer_running = false;
>> +    } else {
>> +        ctl->timer.expires = jiffies + HZ;
>> +        add_timer(&ctl->timer);
>> +    }
>> +    spin_unlock_irq(&tags->lock);
> 
> [ ... ]
> 
>> +inc_busy:
>> +    atomic_inc(&info->busy_count);
>> +    if (!tags->ctl.timer_running &&
>> +        try_cmpxchg_relaxed(&tags->ctl.timer_running, &timer_running, 
>> true)) {
>> +        tags->ctl.timer.expires = jiffies + HZ;
>> +        add_timer(&tags->ctl.timer);
>> +    }
>>   }
> 
> So the choice has been made to let the timer expire after one second? I
> think the choice of the timer value is important enough to mention it in
> the patch description.

Yes, I agree. I admit that I was not that careful when choosing 1 HZ in
this version; I'm not quite sure whether this approach will be accepted
yet, but this is the best (simple and workable) way that I can think of.

Thanks,
Kuai

> 
> Thanks,
> 
> Bart.
> 
> 
> .
> 


