* [PATCH -next RFC 0/6] improve large random io for HDD
@ 2022-03-29  9:40 Yu Kuai
  2022-03-29  9:40 ` [PATCH -next RFC 1/6] blk-mq: add a new flag 'BLK_MQ_F_NO_TAG_PREEMPTION' Yu Kuai
                   ` (6 more replies)
  0 siblings, 7 replies; 20+ messages in thread
From: Yu Kuai @ 2022-03-29  9:40 UTC (permalink / raw)
  To: axboe, andriy.shevchenko, john.garry, ming.lei
  Cc: linux-block, linux-kernel, yukuai3, yi.zhang

There is a defect in blk-mq compared to blk-sq: split io will end up
discontinuous if the device is under high io pressure, while split io
will still be continuous in the sq path. This is because:

1) split bios are issued one by one, and if one bio can't get a tag, it
will go to wait. - patch 2
2) each time 8 (or wake_batch) requests complete, 8 waiters will be woken
up, so a woken thread is unlikely to get multiple tags.
- patch 3,4
3) new io can preempt a tag even if there are lots of threads waiting for
tags. - patch 5
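
To put rough numbers on the above (illustrative only, not part of the
series), here is a minimal userspace sketch using the test configuration
below; the wake batch of 8 matches point 2):

	#include <stdio.h>

	int main(void)
	{
		unsigned int bs_kb = 512;          /* fio bs=512k */
		unsigned int max_sectors_kb = 128; /* queue limit in the test */
		unsigned int wake_batch = 8;       /* see point 2) above */
		unsigned int nr_tags;

		/* each split piece needs its own tag */
		nr_tags = (bs_kb + max_sectors_kb - 1) / max_sectors_kb;

		printf("tags needed per %uk bio: %u\n", bs_kb, nr_tags);
		printf("current wakeup: %u waiters woken for %u freed tags, ~1 tag each\n",
		       wake_batch, wake_batch);
		printf("budget-based wakeup (patch 3,4): %u waiters woken\n",
		       wake_batch / nr_tags);
		return 0;
	}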

Test environment:
x86 vm, nr_requests is set to 64, queue_depth is set to 32 and
max_sectors_kb is set to 128.

I haven't tested this patchset on a physical machine yet; I'll try that
later if anyone thinks this approach is meaningful.

Fio test cmd:
[global]
filename=/dev/sda
ioengine=libaio
direct=1
offset_increment=100m

[test]
rw=randwrite
bs=512k
numjobs=256
iodepth=2

Result: ratio of sequential io (calculated from the blktrace log)

original:
21%
patched (split io thoroughly, wake up based on required tags):
40%
patched with the flag set (tag preemption disabled):
69%
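
The posting does not include the script used to derive this ratio from
the blktrace log; below is a minimal sketch of one way to compute it,
assuming the events of interest have already been reduced (e.g. with
blkparse piped through awk) to "sector nr_sectors" pairs on stdin:

	#include <stdio.h>

	/* count how many ios start exactly where the previous one ended */
	int main(void)
	{
		unsigned long long sector, nr_sectors, next = 0;
		unsigned long total = 0, seq = 0;

		while (scanf("%llu %llu", &sector, &nr_sectors) == 2) {
			if (total && sector == next)
				seq++;
			next = sector + nr_sectors;
			total++;
		}
		if (total)
			printf("sequential ratio: %lu%%\n", seq * 100 / total);
		return 0;
	}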

Yu Kuai (6):
  blk-mq: add a new flag 'BLK_MQ_F_NO_TAG_PREEMPTION'
  block: refactor to split bio thoroughly
  blk-mq: record how many tags are needed for splited bio
  sbitmap: wake up the number of threads based on required tags
  blk-mq: don't preempt tag expect for split bios
  sbitmap: force tag preemption if free tags are sufficient

 block/bio.c               |  2 +
 block/blk-merge.c         | 95 ++++++++++++++++++++++++++++-----------
 block/blk-mq-debugfs.c    |  1 +
 block/blk-mq-tag.c        | 39 +++++++++++-----
 block/blk-mq.c            | 14 +++++-
 block/blk-mq.h            |  7 +++
 block/blk.h               |  3 +-
 include/linux/blk-mq.h    |  7 ++-
 include/linux/blk_types.h |  6 +++
 include/linux/sbitmap.h   |  8 ++++
 lib/sbitmap.c             | 33 +++++++++++++-
 11 files changed, 173 insertions(+), 42 deletions(-)

-- 
2.31.1



* [PATCH -next RFC 1/6] blk-mq: add a new flag 'BLK_MQ_F_NO_TAG_PREEMPTION'
  2022-03-29  9:40 [PATCH -next RFC 0/6] improve large random io for HDD Yu Kuai
@ 2022-03-29  9:40 ` Yu Kuai
  2022-03-29 12:44   ` Jens Axboe
  2022-03-29  9:40 ` [PATCH -next RFC 2/6] block: refactor to split bio thoroughly Yu Kuai
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 20+ messages in thread
From: Yu Kuai @ 2022-03-29  9:40 UTC (permalink / raw)
  To: axboe, andriy.shevchenko, john.garry, ming.lei
  Cc: linux-block, linux-kernel, yukuai3, yi.zhang

Tag preemption is the default behaviour: blk_mq_get_tag() will try to
get a tag unconditionally, which means a new io can preempt a tag even
if there are lots of ios already waiting for tags.

This patch introduces a new flag in preparation for disabling such
behaviour, in order to optimize io performance for large random io on
HDD.
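
The series itself does not switch any driver over to the new flag; as an
illustration (a hypothetical driver, not taken from this series), opting
in would simply mean setting the flag when the tag set is allocated:

	/* hypothetical driver init, illustrative only */
	set->queue_depth = 32;
	set->flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_NO_TAG_PREEMPTION;
	ret = blk_mq_alloc_tag_set(set);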

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
 block/blk-mq-debugfs.c | 1 +
 block/blk-mq.h         | 5 +++++
 include/linux/blk-mq.h | 7 ++++++-
 3 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index aa0349e9f083..f4228532ee3d 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -226,6 +226,7 @@ static const char *const hctx_flag_name[] = {
 	HCTX_FLAG_NAME(NO_SCHED),
 	HCTX_FLAG_NAME(STACKING),
 	HCTX_FLAG_NAME(TAG_HCTX_SHARED),
+	HCTX_FLAG_NAME(NO_TAG_PREEMPTION),
 };
 #undef HCTX_FLAG_NAME
 
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 2615bd58bad3..1a084b3b6097 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -168,6 +168,11 @@ static inline bool blk_mq_is_shared_tags(unsigned int flags)
 	return flags & BLK_MQ_F_TAG_HCTX_SHARED;
 }
 
+static inline bool blk_mq_is_tag_preemptive(unsigned int flags)
+{
+	return !(flags & BLK_MQ_F_NO_TAG_PREEMPTION);
+}
+
 static inline struct blk_mq_tags *blk_mq_tags_from_data(struct blk_mq_alloc_data *data)
 {
 	if (!(data->rq_flags & RQF_ELV))
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 7aa5c54901a9..c9434162acc5 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -656,7 +656,12 @@ enum {
 	 * or shared hwqs instead of 'mq-deadline'.
 	 */
 	BLK_MQ_F_NO_SCHED_BY_DEFAULT	= 1 << 7,
-	BLK_MQ_F_ALLOC_POLICY_START_BIT = 8,
+	/*
+	 * If the disk is under high io pressure, new io will wait directly
+	 * without trying to preempt tag.
+	 */
+	BLK_MQ_F_NO_TAG_PREEMPTION	= 1 << 8,
+	BLK_MQ_F_ALLOC_POLICY_START_BIT = 9,
 	BLK_MQ_F_ALLOC_POLICY_BITS = 1,
 
 	BLK_MQ_S_STOPPED	= 0,
-- 
2.31.1



* [PATCH -next RFC 2/6] block: refactor to split bio thoroughly
  2022-03-29  9:40 [PATCH -next RFC 0/6] improve large random io for HDD Yu Kuai
  2022-03-29  9:40 ` [PATCH -next RFC 1/6] blk-mq: add a new flag 'BLK_MQ_F_NO_TAG_PREEMPTION' Yu Kuai
@ 2022-03-29  9:40 ` Yu Kuai
  2022-03-29 12:46   ` Jens Axboe
  2022-03-29 13:32   ` Christoph Hellwig
  2022-03-29  9:40 ` [PATCH -next RFC 3/6] blk-mq: record how many tags are needed for splited bio Yu Kuai
                   ` (4 subsequent siblings)
  6 siblings, 2 replies; 20+ messages in thread
From: Yu Kuai @ 2022-03-29  9:40 UTC (permalink / raw)
  To: axboe, andriy.shevchenko, john.garry, ming.lei
  Cc: linux-block, linux-kernel, yukuai3, yi.zhang

Currently, the split bio is handled first, and only then does splitting
of the original bio continue. This patch splits the original bio
thoroughly up front, so that it is known in advance how many tags will
be needed.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
 block/bio.c               |  2 +
 block/blk-merge.c         | 90 ++++++++++++++++++++++++++++-----------
 block/blk-mq.c            |  7 ++-
 block/blk.h               |  3 +-
 include/linux/blk_types.h |  4 ++
 5 files changed, 77 insertions(+), 29 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index cdd7b2915c53..ac7ce8b4ba42 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -258,6 +258,8 @@ void bio_init(struct bio *bio, struct block_device *bdev, struct bio_vec *table,
 	bio->bi_flags = 0;
 	bio->bi_ioprio = 0;
 	bio->bi_status = 0;
+	bio->bi_nr_segs = 0;
+	bio->bi_nr_split = 0;
 	bio->bi_iter.bi_sector = 0;
 	bio->bi_iter.bi_size = 0;
 	bio->bi_iter.bi_idx = 0;
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 7771dacc99cb..340860746cac 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -309,44 +309,85 @@ static struct bio *blk_bio_segment_split(struct request_queue *q,
 	return bio_split(bio, sectors, GFP_NOIO, bs);
 }
 
-/**
- * __blk_queue_split - split a bio and submit the second half
- * @q:       [in] request_queue new bio is being queued at
- * @bio:     [in, out] bio to be split
- * @nr_segs: [out] number of segments in the first bio
- *
- * Split a bio into two bios, chain the two bios, submit the second half and
- * store a pointer to the first half in *@bio. If the second bio is still too
- * big it will be split by a recursive call to this function. Since this
- * function may allocate a new bio from q->bio_split, it is the responsibility
- * of the caller to ensure that q->bio_split is only released after processing
- * of the split bio has finished.
- */
-void __blk_queue_split(struct request_queue *q, struct bio **bio,
-		       unsigned int *nr_segs)
+static struct bio *blk_queue_split_one(struct request_queue *q, struct bio *bio)
 {
 	struct bio *split = NULL;
+	unsigned int nr_segs = 1;
 
-	switch (bio_op(*bio)) {
+	switch (bio_op(bio)) {
 	case REQ_OP_DISCARD:
 	case REQ_OP_SECURE_ERASE:
-		split = blk_bio_discard_split(q, *bio, &q->bio_split, nr_segs);
+		split = blk_bio_discard_split(q, bio, &q->bio_split, &nr_segs);
 		break;
 	case REQ_OP_WRITE_ZEROES:
-		split = blk_bio_write_zeroes_split(q, *bio, &q->bio_split,
-				nr_segs);
+		split = blk_bio_write_zeroes_split(q, bio, &q->bio_split,
+				&nr_segs);
 		break;
 	default:
-		split = blk_bio_segment_split(q, *bio, &q->bio_split, nr_segs);
+		split = blk_bio_segment_split(q, bio, &q->bio_split, &nr_segs);
 		break;
 	}
 
 	if (split) {
 		/* there isn't chance to merge the splitted bio */
-		split->bi_opf |= REQ_NOMERGE;
+		split->bi_opf |= (REQ_NOMERGE | REQ_SPLIT);
+		split->bi_nr_segs = nr_segs;
+
+		bio_chain(split, bio);
+		trace_block_split(split, bio->bi_iter.bi_sector);
+	} else {
+		bio->bi_nr_segs = nr_segs;
+	}
+
+	return split;
+}
+
+static unsigned short blk_queue_split_all(struct request_queue *q,
+					  struct bio *bio)
+{
+	struct bio *split = NULL;
+	struct bio *first = NULL;
+	unsigned short nr_split = 1;
+	unsigned short total;
 
-		bio_chain(split, *bio);
-		trace_block_split(split, (*bio)->bi_iter.bi_sector);
+	if (!current->bio_list)
+		return 1;
+
+	while ((split = blk_queue_split_one(q, bio))) {
+		if (!first)
+			first = split;
+
+		nr_split++;
+		submit_bio_noacct(split);
+	}
+
+	total = nr_split;
+	while (first) {
+		first->bi_nr_split = --total;
+		first = first->bi_next;
+	}
+
+	return nr_split;
+}
+
+/**
+ * __blk_queue_split - split a bio, store the first and submit others
+ * @q:       [in] request_queue new bio is being queued at
+ * @bio:     [in, out] bio to be split
+ *
+ * Split a bio into several bios, chain all the bios, store a pointer to the
+ * first in *@bio, and submit others. Since this function may allocate a new
+ * bio from q->bio_split, it is the responsibility of the caller to ensure
+ * that q->bio_split is only released after processing of the split bio has
+ * finished.
+ */
+void __blk_queue_split(struct request_queue *q, struct bio **bio)
+{
+	struct bio *split = blk_queue_split_one(q, *bio);
+
+	if (split) {
+		split->bi_nr_split = blk_queue_split_all(q, *bio);
+		(*bio)->bi_opf |= REQ_SPLIT;
 		submit_bio_noacct(*bio);
 		*bio = split;
 	}
@@ -365,10 +406,9 @@ void __blk_queue_split(struct request_queue *q, struct bio **bio,
 void blk_queue_split(struct bio **bio)
 {
 	struct request_queue *q = bdev_get_queue((*bio)->bi_bdev);
-	unsigned int nr_segs;
 
 	if (blk_may_split(q, *bio))
-		__blk_queue_split(q, bio, &nr_segs);
+		__blk_queue_split(q, bio);
 }
 EXPORT_SYMBOL(blk_queue_split);
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index e6f24fa4a4c2..cad207d2079e 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2817,8 +2817,11 @@ void blk_mq_submit_bio(struct bio *bio)
 	blk_status_t ret;
 
 	blk_queue_bounce(q, &bio);
-	if (blk_may_split(q, bio))
-		__blk_queue_split(q, &bio, &nr_segs);
+	if (blk_may_split(q, bio)) {
+		if (!(bio->bi_opf & REQ_SPLIT))
+			__blk_queue_split(q, &bio);
+		nr_segs = bio->bi_nr_segs;
+	}
 
 	if (!bio_integrity_prep(bio))
 		return;
diff --git a/block/blk.h b/block/blk.h
index 8ccbc6e07636..cd478187b525 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -303,8 +303,7 @@ static inline bool blk_may_split(struct request_queue *q, struct bio *bio)
 		bio->bi_io_vec->bv_len + bio->bi_io_vec->bv_offset > PAGE_SIZE;
 }
 
-void __blk_queue_split(struct request_queue *q, struct bio **bio,
-			unsigned int *nr_segs);
+void __blk_queue_split(struct request_queue *q, struct bio **bio);
 int ll_back_merge_fn(struct request *req, struct bio *bio,
 		unsigned int nr_segs);
 bool blk_attempt_req_merge(struct request_queue *q, struct request *rq,
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index dd0763a1c674..702f6b83dc88 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -250,6 +250,8 @@ struct bio {
 						 */
 	unsigned short		bi_flags;	/* BIO_* below */
 	unsigned short		bi_ioprio;
+	unsigned int		bi_nr_segs;
+	unsigned int		bi_nr_split;
 	blk_status_t		bi_status;
 	atomic_t		__bi_remaining;
 
@@ -416,6 +418,7 @@ enum req_flag_bits {
 	/* for driver use */
 	__REQ_DRV,
 	__REQ_SWAP,		/* swapping request. */
+	__REQ_SPLIT,		/* io is split */
 	__REQ_NR_BITS,		/* stops here */
 };
 
@@ -440,6 +443,7 @@ enum req_flag_bits {
 
 #define REQ_DRV			(1ULL << __REQ_DRV)
 #define REQ_SWAP		(1ULL << __REQ_SWAP)
+#define REQ_SPLIT		(1ULL << __REQ_SPLIT)
 
 #define REQ_FAILFAST_MASK \
 	(REQ_FAILFAST_DEV | REQ_FAILFAST_TRANSPORT | REQ_FAILFAST_DRIVER)
-- 
2.31.1



* [PATCH -next RFC 3/6] blk-mq: record how many tags are needed for splited bio
  2022-03-29  9:40 [PATCH -next RFC 0/6] improve large random io for HDD Yu Kuai
  2022-03-29  9:40 ` [PATCH -next RFC 1/6] blk-mq: add a new flag 'BLK_MQ_F_NO_TAG_PREEMPTION' Yu Kuai
  2022-03-29  9:40 ` [PATCH -next RFC 2/6] block: refactor to split bio thoroughly Yu Kuai
@ 2022-03-29  9:40 ` Yu Kuai
  2022-03-29  9:40 ` [PATCH -next RFC 4/6] sbitmap: wake up the number of threads based on required tags Yu Kuai
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 20+ messages in thread
From: Yu Kuai @ 2022-03-29  9:40 UTC (permalink / raw)
  To: axboe, andriy.shevchenko, john.garry, ming.lei
  Cc: linux-block, linux-kernel, yukuai3, yi.zhang

Prepare to wake up a number of threads based on the required tags.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
 block/blk-mq-tag.c      | 1 +
 block/blk-mq.c          | 1 +
 block/blk-mq.h          | 1 +
 include/linux/sbitmap.h | 2 ++
 4 files changed, 5 insertions(+)

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 68ac23d0b640..83dfbe2f1cfc 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -155,6 +155,7 @@ unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
 	if (data->flags & BLK_MQ_REQ_NOWAIT)
 		return BLK_MQ_NO_TAG;
 
+	wait.nr_tags += data->nr_split;
 	ws = bt_wait_ptr(bt, data->hctx);
 	do {
 		struct sbitmap_queue *bt_prev;
diff --git a/block/blk-mq.c b/block/blk-mq.c
index cad207d2079e..9bace9e2c5ca 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2737,6 +2737,7 @@ static struct request *blk_mq_get_new_requests(struct request_queue *q,
 		.q		= q,
 		.nr_tags	= 1,
 		.cmd_flags	= bio->bi_opf,
+		.nr_split	= bio->bi_nr_split,
 	};
 	struct request *rq;
 
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 1a084b3b6097..3eabe394a5a9 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -156,6 +156,7 @@ struct blk_mq_alloc_data {
 
 	/* allocate multiple requests/tags in one go */
 	unsigned int nr_tags;
+	unsigned int nr_split;
 	struct request **cached_rq;
 
 	/* input & output parameter */
diff --git a/include/linux/sbitmap.h b/include/linux/sbitmap.h
index 8f5a86e210b9..9c8c6da3d820 100644
--- a/include/linux/sbitmap.h
+++ b/include/linux/sbitmap.h
@@ -591,11 +591,13 @@ void sbitmap_queue_show(struct sbitmap_queue *sbq, struct seq_file *m);
 struct sbq_wait {
 	struct sbitmap_queue *sbq;	/* if set, sbq_wait is accounted */
 	struct wait_queue_entry wait;
+	unsigned int nr_tags;
 };
 
 #define DEFINE_SBQ_WAIT(name)							\
 	struct sbq_wait name = {						\
 		.sbq = NULL,							\
+		.nr_tags = 1,							\
 		.wait = {							\
 			.private	= current,				\
 			.func		= autoremove_wake_function,		\
-- 
2.31.1



* [PATCH -next RFC 4/6] sbitmap: wake up the number of threads based on required tags
  2022-03-29  9:40 [PATCH -next RFC 0/6] improve large random io for HDD Yu Kuai
                   ` (2 preceding siblings ...)
  2022-03-29  9:40 ` [PATCH -next RFC 3/6] blk-mq: record how many tags are needed for splited bio Yu Kuai
@ 2022-03-29  9:40 ` Yu Kuai
  2022-03-29  9:40 ` [PATCH -next RFC 5/6] blk-mq: don't preempt tag expect for split bios Yu Kuai
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 20+ messages in thread
From: Yu Kuai @ 2022-03-29  9:40 UTC (permalink / raw)
  To: axboe, andriy.shevchenko, john.garry, ming.lei
  Cc: linux-block, linux-kernel, yukuai3, yi.zhang

Currently, __sbq_wake_up() wakes up 'wake_batch' threads
unconditionally; for split io this intensifies competition, and the
split io won't be issued sequentially.

This modification improves the sequential ratio for big split io.
However, to gain the optimal result, tag preemption still needs to be
disabled, which is done in later patches.
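
As a userspace illustration of the effect (not part of the patch, just a
model of the accounting added below): with a batch of 8 freed tags and
waiters that each need 4 tags, only two waiters are woken instead of
eight.

	#include <stdio.h>

	/* walk the waiters in order and stop once the freed tags are used
	 * up, mirroring the get_wake_nr() loop introduced by this patch */
	static unsigned int wake_nr(const unsigned int *waiter_tags,
				    unsigned int nr_waiters, unsigned int nr_tags)
	{
		unsigned int nr = 1;
		unsigned int i;

		for (i = 0; i < nr_waiters; i++) {
			if (nr_tags <= waiter_tags[i])
				break;
			nr++;
			nr_tags -= waiter_tags[i];
		}
		return nr;
	}

	int main(void)
	{
		/* eight waiters, each a thread whose big bio was split into 4 */
		unsigned int waiters[8] = { 4, 4, 4, 4, 4, 4, 4, 4 };

		printf("wake %u waiter(s) for a batch of 8 freed tags\n",
		       wake_nr(waiters, 8, 8));
		return 0;
	}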

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
 lib/sbitmap.c | 22 +++++++++++++++++++++-
 1 file changed, 21 insertions(+), 1 deletion(-)

diff --git a/lib/sbitmap.c b/lib/sbitmap.c
index ae4fd4de9ebe..9d04c0ecc8f7 100644
--- a/lib/sbitmap.c
+++ b/lib/sbitmap.c
@@ -597,6 +597,26 @@ static struct sbq_wait_state *sbq_wake_ptr(struct sbitmap_queue *sbq)
 	return NULL;
 }
 
+static unsigned int get_wake_nr(struct sbq_wait_state *ws, unsigned int nr_tags)
+{
+	struct sbq_wait *wait;
+	struct wait_queue_entry *entry;
+	unsigned int nr = 1;
+
+	spin_lock_irq(&ws->wait.lock);
+	list_for_each_entry(entry, &ws->wait.head, entry) {
+		wait = container_of(entry, struct sbq_wait, wait);
+		if (nr_tags <= wait->nr_tags)
+			break;
+
+		nr++;
+		nr_tags -= wait->nr_tags;
+	}
+	spin_unlock_irq(&ws->wait.lock);
+
+	return nr;
+}
+
 static bool __sbq_wake_up(struct sbitmap_queue *sbq)
 {
 	struct sbq_wait_state *ws;
@@ -628,7 +648,7 @@ static bool __sbq_wake_up(struct sbitmap_queue *sbq)
 		ret = atomic_cmpxchg(&ws->wait_cnt, wait_cnt, wake_batch);
 		if (ret == wait_cnt) {
 			sbq_index_atomic_inc(&sbq->wake_index);
-			wake_up_nr(&ws->wait, wake_batch);
+			wake_up_nr(&ws->wait, get_wake_nr(ws, wake_batch));
 			return false;
 		}
 
-- 
2.31.1



* [PATCH -next RFC 5/6] blk-mq: don't preempt tag expect for split bios
  2022-03-29  9:40 [PATCH -next RFC 0/6] improve large random io for HDD Yu Kuai
                   ` (3 preceding siblings ...)
  2022-03-29  9:40 ` [PATCH -next RFC 4/6] sbitmap: wake up the number of threads based on required tags Yu Kuai
@ 2022-03-29  9:40 ` Yu Kuai
  2022-03-29  9:40 ` [PATCH -next RFC 6/6] sbitmap: force tag preemption if free tags are sufficient Yu Kuai
  2022-03-29 12:53 ` [PATCH -next RFC 0/6] improve large random io for HDD Jens Axboe
  6 siblings, 0 replies; 20+ messages in thread
From: Yu Kuai @ 2022-03-29  9:40 UTC (permalink / raw)
  To: axboe, andriy.shevchenko, john.garry, ming.lei
  Cc: linux-block, linux-kernel, yukuai3, yi.zhang

In order to improve the sequentiality of split io, this patch disables
tag preemption for the first split bio and for other non-split bios if
the device is under high io pressure.

Note that this solution relies on the sbitmap waitqueues being balanced,
otherwise it may happen that 'wake_batch' tags are freed but the woken
waiters don't obtain 'wake_batch' new tags, so io concurrency drops. The
next patch avoids such a problem; however, fixing the unfairness of the
waitqueues might be better.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
 block/blk-merge.c         |  7 ++++++-
 block/blk-mq-tag.c        | 37 ++++++++++++++++++++++++++-----------
 block/blk-mq.c            |  6 ++++++
 block/blk-mq.h            |  1 +
 include/linux/blk_types.h |  2 ++
 lib/sbitmap.c             | 14 ++++++++++----
 6 files changed, 51 insertions(+), 16 deletions(-)

diff --git a/block/blk-merge.c b/block/blk-merge.c
index 340860746cac..fd4bbf773b45 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -357,6 +357,11 @@ static unsigned short blk_queue_split_all(struct request_queue *q,
 		if (!first)
 			first = split;
 
+		/*
+		 * Except the first split bio, others will always preempt
+		 * tag, so that they can be sequential.
+		 */
+		split->bi_opf |= REQ_PREEMPTIVE;
 		nr_split++;
 		submit_bio_noacct(split);
 	}
@@ -387,7 +392,7 @@ void __blk_queue_split(struct request_queue *q, struct bio **bio)
 
 	if (split) {
 		split->bi_nr_split = blk_queue_split_all(q, *bio);
-		(*bio)->bi_opf |= REQ_SPLIT;
+		(*bio)->bi_opf |= (REQ_SPLIT | REQ_PREEMPTIVE);
 		submit_bio_noacct(*bio);
 		*bio = split;
 	}
diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 83dfbe2f1cfc..4e485bcc5820 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -127,6 +127,13 @@ unsigned long blk_mq_get_tags(struct blk_mq_alloc_data *data, int nr_tags,
 	return ret;
 }
 
+static inline bool preempt_tag(struct blk_mq_alloc_data *data,
+			       struct sbitmap_queue *bt)
+{
+	return data->preemption ||
+	       atomic_read(&bt->ws_active) <= SBQ_WAIT_QUEUES;
+}
+
 unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
 {
 	struct blk_mq_tags *tags = blk_mq_tags_from_data(data);
@@ -148,12 +155,14 @@ unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
 		tag_offset = tags->nr_reserved_tags;
 	}
 
-	tag = __blk_mq_get_tag(data, bt);
-	if (tag != BLK_MQ_NO_TAG)
-		goto found_tag;
+	if (data->flags & BLK_MQ_REQ_NOWAIT || preempt_tag(data, bt)) {
+		tag = __blk_mq_get_tag(data, bt);
+		if (tag != BLK_MQ_NO_TAG)
+			goto found_tag;
 
-	if (data->flags & BLK_MQ_REQ_NOWAIT)
-		return BLK_MQ_NO_TAG;
+		if (data->flags & BLK_MQ_REQ_NOWAIT)
+			return BLK_MQ_NO_TAG;
+	}
 
 	wait.nr_tags += data->nr_split;
 	ws = bt_wait_ptr(bt, data->hctx);
@@ -171,20 +180,26 @@ unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
 		 * Retry tag allocation after running the hardware queue,
 		 * as running the queue may also have found completions.
 		 */
-		tag = __blk_mq_get_tag(data, bt);
-		if (tag != BLK_MQ_NO_TAG)
-			break;
+		if (preempt_tag(data, bt)) {
+			tag = __blk_mq_get_tag(data, bt);
+			if (tag != BLK_MQ_NO_TAG)
+				break;
+		}
 
 		sbitmap_prepare_to_wait(bt, ws, &wait, TASK_UNINTERRUPTIBLE);
 
-		tag = __blk_mq_get_tag(data, bt);
-		if (tag != BLK_MQ_NO_TAG)
-			break;
+		if (preempt_tag(data, bt)) {
+			tag = __blk_mq_get_tag(data, bt);
+			if (tag != BLK_MQ_NO_TAG)
+				break;
+		}
 
 		bt_prev = bt;
 		io_schedule();
 
 		sbitmap_finish_wait(bt, ws, &wait);
+		if (!blk_mq_is_tag_preemptive(data->hctx->flags))
+			data->preemption = true;
 
 		data->ctx = blk_mq_get_ctx(data->q);
 		data->hctx = blk_mq_map_queue(data->q, data->cmd_flags,
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 9bace9e2c5ca..06ba6fa9ec1a 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -470,6 +470,9 @@ static struct request *__blk_mq_alloc_requests(struct blk_mq_alloc_data *data)
 retry:
 	data->ctx = blk_mq_get_ctx(q);
 	data->hctx = blk_mq_map_queue(q, data->cmd_flags, data->ctx);
+	if (blk_mq_is_tag_preemptive(data->hctx->flags))
+		data->preemption = true;
+
 	if (!(data->rq_flags & RQF_ELV))
 		blk_mq_tag_busy(data->hctx);
 
@@ -577,6 +580,8 @@ struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
 	data.hctx = xa_load(&q->hctx_table, hctx_idx);
 	if (!blk_mq_hw_queue_mapped(data.hctx))
 		goto out_queue_exit;
+	if (blk_mq_is_tag_preemptive(data.hctx->flags))
+		data.preemption = true;
 	cpu = cpumask_first_and(data.hctx->cpumask, cpu_online_mask);
 	data.ctx = __blk_mq_get_ctx(q, cpu);
 
@@ -2738,6 +2743,7 @@ static struct request *blk_mq_get_new_requests(struct request_queue *q,
 		.nr_tags	= 1,
 		.cmd_flags	= bio->bi_opf,
 		.nr_split	= bio->bi_nr_split,
+		.preemption	= (bio->bi_opf & REQ_PREEMPTIVE),
 	};
 	struct request *rq;
 
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 3eabe394a5a9..915bb710dd6f 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -157,6 +157,7 @@ struct blk_mq_alloc_data {
 	/* allocate multiple requests/tags in one go */
 	unsigned int nr_tags;
 	unsigned int nr_split;
+	bool preemption;
 	struct request **cached_rq;
 
 	/* input & output parameter */
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 702f6b83dc88..8fd9756f0a06 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -419,6 +419,7 @@ enum req_flag_bits {
 	__REQ_DRV,
 	__REQ_SWAP,		/* swapping request. */
 	__REQ_SPLIT,		/* io is splitted */
+	__REQ_PREEMPTIVE,	/* io can preempt tag */
 	__REQ_NR_BITS,		/* stops here */
 };
 
@@ -444,6 +445,7 @@ enum req_flag_bits {
 #define REQ_DRV			(1ULL << __REQ_DRV)
 #define REQ_SWAP		(1ULL << __REQ_SWAP)
 #define REQ_SPLIT		(1ULL << __REQ_SPLIT)
+#define REQ_PREEMPTIVE		(1ULL << __REQ_PREEMPTIVE)
 
 #define REQ_FAILFAST_MASK \
 	(REQ_FAILFAST_DEV | REQ_FAILFAST_TRANSPORT | REQ_FAILFAST_DRIVER)
diff --git a/lib/sbitmap.c b/lib/sbitmap.c
index 9d04c0ecc8f7..1655c15ee11d 100644
--- a/lib/sbitmap.c
+++ b/lib/sbitmap.c
@@ -597,7 +597,8 @@ static struct sbq_wait_state *sbq_wake_ptr(struct sbitmap_queue *sbq)
 	return NULL;
 }
 
-static unsigned int get_wake_nr(struct sbq_wait_state *ws, unsigned int nr_tags)
+static unsigned int get_wake_nr(struct sbq_wait_state *ws,
+				unsigned int *nr_tags)
 {
 	struct sbq_wait *wait;
 	struct wait_queue_entry *entry;
@@ -606,11 +607,13 @@ static unsigned int get_wake_nr(struct sbq_wait_state *ws, unsigned int nr_tags)
 	spin_lock_irq(&ws->wait.lock);
 	list_for_each_entry(entry, &ws->wait.head, entry) {
 		wait = container_of(entry, struct sbq_wait, wait);
-		if (nr_tags <= wait->nr_tags)
+		if (*nr_tags <= wait->nr_tags) {
+			*nr_tags = 0;
 			break;
+		}
 
 		nr++;
-		nr_tags -= wait->nr_tags;
+		*nr_tags -= wait->nr_tags;
 	}
 	spin_unlock_irq(&ws->wait.lock);
 
@@ -648,7 +651,10 @@ static bool __sbq_wake_up(struct sbitmap_queue *sbq)
 		ret = atomic_cmpxchg(&ws->wait_cnt, wait_cnt, wake_batch);
 		if (ret == wait_cnt) {
 			sbq_index_atomic_inc(&sbq->wake_index);
-			wake_up_nr(&ws->wait, get_wake_nr(ws, wake_batch));
+			wake_up_nr(&ws->wait, get_wake_nr(ws, &wake_batch));
+			if (wake_batch)
+				sbitmap_queue_wake_all(sbq);
+
 			return false;
 		}
 
-- 
2.31.1



* [PATCH -next RFC 6/6] sbitmap: force tag preemption if free tags are sufficient
  2022-03-29  9:40 [PATCH -next RFC 0/6] improve large random io for HDD Yu Kuai
                   ` (4 preceding siblings ...)
  2022-03-29  9:40 ` [PATCH -next RFC 5/6] blk-mq: don't preempt tag expect for split bios Yu Kuai
@ 2022-03-29  9:40 ` Yu Kuai
  2022-03-29 12:53 ` [PATCH -next RFC 0/6] improve large random io for HDD Jens Axboe
  6 siblings, 0 replies; 20+ messages in thread
From: Yu Kuai @ 2022-03-29  9:40 UTC (permalink / raw)
  To: axboe, andriy.shevchenko, john.garry, ming.lei
  Cc: linux-block, linux-kernel, yukuai3, yi.zhang

If tag preemption is disabled and the system is under high io pressure,
inflight io should use up the tags. Since new io will wait directly,
this relies on the woken-up threads obtaining at least 'wake_batch'
tags. However, this might be broken if the 8 waitqueues are unbalanced.

This patch calculates the number of free tags each time a 'ws' is woken
up, and forces tag preemption if the free tags are sufficient.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
 block/blk-mq-tag.c      | 3 ++-
 include/linux/sbitmap.h | 6 ++++++
 lib/sbitmap.c           | 5 +++++
 3 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 4e485bcc5820..55139a011e75 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -131,7 +131,8 @@ static inline bool preempt_tag(struct blk_mq_alloc_data *data,
 			       struct sbitmap_queue *bt)
 {
 	return data->preemption ||
-	       atomic_read(&bt->ws_active) <= SBQ_WAIT_QUEUES;
+	       atomic_read(&bt->ws_active) <= SBQ_WAIT_QUEUES ||
+	       READ_ONCE(bt->force_tag_preemption);
 }
 
 unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
diff --git a/include/linux/sbitmap.h b/include/linux/sbitmap.h
index 9c8c6da3d820..7a0ea8c0692b 100644
--- a/include/linux/sbitmap.h
+++ b/include/linux/sbitmap.h
@@ -118,6 +118,12 @@ struct sbitmap_queue {
 	 */
 	unsigned int wake_batch;
 
+	/**
+	 * @force_tag_preemption: preempt tag even if tag preemption is
+	 * disabled.
+	 */
+	bool force_tag_preemption;
+
 	/**
 	 * @wake_index: Next wait queue in @ws to wake up.
 	 */
diff --git a/lib/sbitmap.c b/lib/sbitmap.c
index 1655c15ee11d..49241b44f163 100644
--- a/lib/sbitmap.c
+++ b/lib/sbitmap.c
@@ -432,6 +432,7 @@ int sbitmap_queue_init_node(struct sbitmap_queue *sbq, unsigned int depth,
 
 	sbq->min_shallow_depth = UINT_MAX;
 	sbq->wake_batch = sbq_calc_wake_batch(sbq, depth);
+	sbq->force_tag_preemption = 0;
 	atomic_set(&sbq->wake_index, 0);
 	atomic_set(&sbq->ws_active, 0);
 
@@ -650,6 +651,10 @@ static bool __sbq_wake_up(struct sbitmap_queue *sbq)
 		 */
 		ret = atomic_cmpxchg(&ws->wait_cnt, wait_cnt, wake_batch);
 		if (ret == wait_cnt) {
+			bool force = (sbq->sb.depth - sbitmap_weight(&sbq->sb) >
+				      READ_ONCE(sbq->wake_batch) * 2);
+
+			WRITE_ONCE(sbq->force_tag_preemption, force);
 			sbq_index_atomic_inc(&sbq->wake_index);
 			wake_up_nr(&ws->wait, get_wake_nr(ws, &wake_batch));
 			if (wake_batch)
-- 
2.31.1



* Re: [PATCH -next RFC 1/6] blk-mq: add a new flag 'BLK_MQ_F_NO_TAG_PREEMPTION'
  2022-03-29  9:40 ` [PATCH -next RFC 1/6] blk-mq: add a new flag 'BLK_MQ_F_NO_TAG_PREEMPTION' Yu Kuai
@ 2022-03-29 12:44   ` Jens Axboe
  2022-03-30  1:18     ` yukuai (C)
  0 siblings, 1 reply; 20+ messages in thread
From: Jens Axboe @ 2022-03-29 12:44 UTC (permalink / raw)
  To: Yu Kuai, andriy.shevchenko, john.garry, ming.lei
  Cc: linux-block, linux-kernel, yi.zhang

On 3/29/22 3:40 AM, Yu Kuai wrote:
> Tag preemption is the default behaviour, specifically blk_mq_get_tag()
> will try to get tag unconditionally, which means a new io can preempt tag
> even if there are lots of ios that are waiting for tags.
> 
> This patch introduce a new flag, prepare to disable such behaviour, in
> order to optimize io performance for large random io for HHD.

Not sure why we need a flag for this behavior. Does it ever make sense
to allow preempting waiters, jumping the queue?

-- 
Jens Axboe



* Re: [PATCH -next RFC 2/6] block: refactor to split bio thoroughly
  2022-03-29  9:40 ` [PATCH -next RFC 2/6] block: refactor to split bio thoroughly Yu Kuai
@ 2022-03-29 12:46   ` Jens Axboe
  2022-03-30  1:35     ` yukuai (C)
  2022-03-29 13:32   ` Christoph Hellwig
  1 sibling, 1 reply; 20+ messages in thread
From: Jens Axboe @ 2022-03-29 12:46 UTC (permalink / raw)
  To: Yu Kuai, andriy.shevchenko, john.garry, ming.lei
  Cc: linux-block, linux-kernel, yi.zhang

On 3/29/22 3:40 AM, Yu Kuai wrote:
> Currently, the splited bio is handled first, and then continue to split
> the original bio. This patch tries to split the original bio thoroughly,
> so that it can be known in advance how many tags will be needed.

This makes the bio almost 10% bigger in size, which is NOT something
that is just done casually and without strong reasoning.

So please provide that, your commit message is completely missing
justification for why this change is useful or necessary. A good
commit always explains WHY the change needs to be made, yours is
simply stating WHAT is being done. The latter can be gleaned from
the code change anyway.

-- 
Jens Axboe



* Re: [PATCH -next RFC 0/6] improve large random io for HDD
  2022-03-29  9:40 [PATCH -next RFC 0/6] improve large random io for HDD Yu Kuai
                   ` (5 preceding siblings ...)
  2022-03-29  9:40 ` [PATCH -next RFC 6/6] sbitmap: force tag preemption if free tags are sufficient Yu Kuai
@ 2022-03-29 12:53 ` Jens Axboe
  2022-03-30  2:05   ` yukuai (C)
  6 siblings, 1 reply; 20+ messages in thread
From: Jens Axboe @ 2022-03-29 12:53 UTC (permalink / raw)
  To: Yu Kuai, andriy.shevchenko, john.garry, ming.lei
  Cc: linux-block, linux-kernel, yi.zhang

On 3/29/22 3:40 AM, Yu Kuai wrote:
> There is a defect for blk-mq compare to blk-sq, specifically split io
> will end up discontinuous if the device is under high io pressure, while
> split io will still be continuous in sq, this is because:
> 
> 1) split bio is issued one by one, if one bio can't get tag, it will go
> to wail. - patch 2
> 2) each time 8(or wake batch) requests is done, 8 waiters will be woken up.
> Thus if a thread is woken up, it will unlikey to get multiple tags.
> - patch 3,4
> 3) new io can preempt tag even if there are lots of threads waiting for
> tags. - patch 5
> 
> Test environment:
> x86 vm, nr_requests is set to 64, queue_depth is set to 32 and
> max_sectors_kb is set to 128.
> 
> I haven't tested this patchset on physical machine yet, I'll try later
> if anyone thinks this approch is meaningful.

A real machine test would definitely be a requirement. What real world
use cases is this solving? These days most devices have plenty of tags,
and I would not really expect tag starvation to be much of a concern.

However, I do think there's merit in fixing the unfairness we have
here. But not at the cost of all of this. Why not just simply enforce
more strict ordering of tag allocations? If someone is waiting, you get
to wait too.

And I don't see much utility at all in tracking how many splits (and
hence tags) would be required. Is this really a common issue, tons of
splits and needing many tags? Why not just enforce the strict ordering
as mentioned above, not allowing new allocators to get a tag if others
are waiting, but perhaps allow someone submitting a string of splits to
indeed keep allocating.

Yes, it'll be less efficient to still wake one-by-one, but honestly do
we really care about that? If you're stalled on waiting for other IO to
finish and release a tag, that isn't very efficient to begin with and
doesn't seem like a case worth optimizing for me.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH -next RFC 2/6] block: refactor to split bio thoroughly
  2022-03-29  9:40 ` [PATCH -next RFC 2/6] block: refactor to split bio thoroughly Yu Kuai
  2022-03-29 12:46   ` Jens Axboe
@ 2022-03-29 13:32   ` Christoph Hellwig
  2022-03-29 14:35     ` Jens Axboe
  1 sibling, 1 reply; 20+ messages in thread
From: Christoph Hellwig @ 2022-03-29 13:32 UTC (permalink / raw)
  To: Yu Kuai
  Cc: axboe, andriy.shevchenko, john.garry, ming.lei, linux-block,
	linux-kernel, yi.zhang

On Tue, Mar 29, 2022 at 05:40:44PM +0800, Yu Kuai wrote:
> Currently, the splited bio is handled first, and then continue to split
> the original bio. This patch tries to split the original bio thoroughly,
> so that it can be known in advance how many tags will be needed.

How do you avoid the deadlock risk with all the split bios being
submitted together?

But more importantly why does your use case even have splits that get
submitted together?  Is this a case of Linus' stupidly low default
max_sectors when the hardware supports more, or is the hardware limited
to a low number of sectors per request?  Or do we hit another reason
for the split?



* Re: [PATCH -next RFC 2/6] block: refactor to split bio thoroughly
  2022-03-29 13:32   ` Christoph Hellwig
@ 2022-03-29 14:35     ` Jens Axboe
  2022-03-29 14:40       ` Christoph Hellwig
  0 siblings, 1 reply; 20+ messages in thread
From: Jens Axboe @ 2022-03-29 14:35 UTC (permalink / raw)
  To: Christoph Hellwig, Yu Kuai
  Cc: andriy.shevchenko, john.garry, ming.lei, linux-block,
	linux-kernel, yi.zhang

On 3/29/22 7:32 AM, Christoph Hellwig wrote:
> On Tue, Mar 29, 2022 at 05:40:44PM +0800, Yu Kuai wrote:
>> Currently, the splited bio is handled first, and then continue to split
>> the original bio. This patch tries to split the original bio thoroughly,
>> so that it can be known in advance how many tags will be needed.
> 
> How do you avoid the deadlock risk with all the split bios being
> submitted together?

That too

> But more importantly why does your use case even have splits that get
> submitted together?  Is this a case of Linus' stupidly low default
> max_sectors when the hardware supports more, or is the hardware limited
> to a low number of sectors per request?  Or do we hit another reason
> for the split?

See the posted use case, it's running 512kb ios on a 128kb device.

-- 
Jens Axboe



* Re: [PATCH -next RFC 2/6] block: refactor to split bio thoroughly
  2022-03-29 14:35     ` Jens Axboe
@ 2022-03-29 14:40       ` Christoph Hellwig
  2022-03-29 14:41         ` Jens Axboe
  0 siblings, 1 reply; 20+ messages in thread
From: Christoph Hellwig @ 2022-03-29 14:40 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Christoph Hellwig, Yu Kuai, andriy.shevchenko, john.garry,
	ming.lei, linux-block, linux-kernel, yi.zhang

On Tue, Mar 29, 2022 at 08:35:29AM -0600, Jens Axboe wrote:
> > But more importantly why does your use case even have splits that get
> > submitted together?  Is this a case of Linus' stupidly low default
> > max_sectors when the hardware supports more, or is the hardware limited
> > to a low number of sectors per request?  Or do we hit another reason
> > for the split?
> 
> See the posted use case, it's running 512kb ios on a 128kb device.

That is an awfully low limit these days.  I'm really not sure we should
optimize the block layer for that.


* Re: [PATCH -next RFC 2/6] block: refactor to split bio thoroughly
  2022-03-29 14:40       ` Christoph Hellwig
@ 2022-03-29 14:41         ` Jens Axboe
  2022-03-29 14:42           ` Christoph Hellwig
  2022-03-30  1:54           ` yukuai (C)
  0 siblings, 2 replies; 20+ messages in thread
From: Jens Axboe @ 2022-03-29 14:41 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Yu Kuai, andriy.shevchenko, john.garry, ming.lei, linux-block,
	linux-kernel, yi.zhang

On 3/29/22 8:40 AM, Christoph Hellwig wrote:
> On Tue, Mar 29, 2022 at 08:35:29AM -0600, Jens Axboe wrote:
>>> But more importantly why does your use case even have splits that get
>>> submitted together?  Is this a case of Linus' stupidly low default
>>> max_sectors when the hardware supports more, or is the hardware limited
>>> to a low number of sectors per request?  Or do we hit another reason
>>> for the split?
>>
>> See the posted use case, it's running 512kb ios on a 128kb device.
> 
> That is an awfully low limit these days.  I'm really not sure we should
> optimize the block layer for that.

That's exactly what my replies have been saying. I don't think this is
a relevant thing to optimize for.

Fixing fairness for wakeups seems useful, however.

-- 
Jens Axboe



* Re: [PATCH -next RFC 2/6] block: refactor to split bio thoroughly
  2022-03-29 14:41         ` Jens Axboe
@ 2022-03-29 14:42           ` Christoph Hellwig
  2022-03-30  1:54           ` yukuai (C)
  1 sibling, 0 replies; 20+ messages in thread
From: Christoph Hellwig @ 2022-03-29 14:42 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Christoph Hellwig, Yu Kuai, andriy.shevchenko, john.garry,
	ming.lei, linux-block, linux-kernel, yi.zhang

On Tue, Mar 29, 2022 at 08:41:57AM -0600, Jens Axboe wrote:
> Fixing fairness for wakeups seems useful, however.

Agreed.


* Re: [PATCH -next RFC 1/6] blk-mq: add a new flag 'BLK_MQ_F_NO_TAG_PREEMPTION'
  2022-03-29 12:44   ` Jens Axboe
@ 2022-03-30  1:18     ` yukuai (C)
  2022-03-30  1:20       ` Jens Axboe
  0 siblings, 1 reply; 20+ messages in thread
From: yukuai (C) @ 2022-03-30  1:18 UTC (permalink / raw)
  To: Jens Axboe, andriy.shevchenko, john.garry, ming.lei
  Cc: linux-block, linux-kernel, yi.zhang

On 2022/03/29 20:44, Jens Axboe wrote:
> On 3/29/22 3:40 AM, Yu Kuai wrote:
>> Tag preemption is the default behaviour, specifically blk_mq_get_tag()
>> will try to get tag unconditionally, which means a new io can preempt tag
>> even if there are lots of ios that are waiting for tags.
>>
>> This patch introduce a new flag, prepare to disable such behaviour, in
>> order to optimize io performance for large random io for HHD.
> 
> Not sure why we need a flag for this behavior. Does it ever make sense
> to allow preempting waiters, jumping the queue?
> 

Hi,

I was thinking of using the flag to control the new behaviour, in order
to reduce the impact on the general path.

If the wake-up path is handled properly, I think it's ok to disable
tag preemption.

Thanks
Kuai


* Re: [PATCH -next RFC 1/6] blk-mq: add a new flag 'BLK_MQ_F_NO_TAG_PREEMPTION'
  2022-03-30  1:18     ` yukuai (C)
@ 2022-03-30  1:20       ` Jens Axboe
  0 siblings, 0 replies; 20+ messages in thread
From: Jens Axboe @ 2022-03-30  1:20 UTC (permalink / raw)
  To: yukuai (C), andriy.shevchenko, john.garry, ming.lei
  Cc: linux-block, linux-kernel, yi.zhang

On 3/29/22 7:18 PM, yukuai (C) wrote:
> On 2022/03/29 20:44, Jens Axboe wrote:
>> On 3/29/22 3:40 AM, Yu Kuai wrote:
>>> Tag preemption is the default behaviour, specifically blk_mq_get_tag()
>>> will try to get tag unconditionally, which means a new io can preempt tag
>>> even if there are lots of ios that are waiting for tags.
>>>
>>> This patch introduce a new flag, prepare to disable such behaviour, in
>>> order to optimize io performance for large random io for HHD.
>>
>> Not sure why we need a flag for this behavior. Does it ever make sense
>> to allow preempting waiters, jumping the queue?
>>
> 
> Hi,
> 
> I was thinking using the flag to control the new behavior, in order to
> reduce the impact on general path.
> 
> If wake up path is handled properly, I think it's ok to disable
> preempting tags.

If we hit tag starvation, we are by definition out of the fast path.
That doesn't mean that scalability should drop to the floor, something
that often happened before blk-mq and without the rolling wakeups. But
it does mean that we can throw a bit more smarts at it, if it improves
fairness/performance in that situation.

-- 
Jens Axboe



* Re: [PATCH -next RFC 2/6] block: refactor to split bio thoroughly
  2022-03-29 12:46   ` Jens Axboe
@ 2022-03-30  1:35     ` yukuai (C)
  0 siblings, 0 replies; 20+ messages in thread
From: yukuai (C) @ 2022-03-30  1:35 UTC (permalink / raw)
  To: Jens Axboe, andriy.shevchenko, john.garry, ming.lei
  Cc: linux-block, linux-kernel, yi.zhang

On 2022/03/29 20:46, Jens Axboe wrote:
> On 3/29/22 3:40 AM, Yu Kuai wrote:
>> Currently, the splited bio is handled first, and then continue to split
>> the original bio. This patch tries to split the original bio thoroughly,
>> so that it can be known in advance how many tags will be needed.
> 
> This makes the bio almost 10% bigger in size, which is NOT something
> that is just done casually and without strong reasoning.
Hi,

Thanks for taking the time to look at this patchset. You're right that
this approach is not appropriate.
> 
> So please provide that, your commit message is completely missing
> justification for why this change is useful or necessary. A good
> commit always explains WHY the change needs to be made, yours is
> simply stating WHAT is being done. The latter can be gleaned from
> the code change anyway.
> 
Thanks for the guidance, I'll pay careful attention to that in the future.

For this patch, what I want to do is to know how many tags will be
needed for the big io before getting the first tag, and use that
information to optimize the wake-up path. The problem with this patch is
that adding two fields to the bio is a bad move.

I was thinking that for segment splits, I can get that information by
calculating bio segments / queue max segments.

Thanks,
Kuai



* Re: [PATCH -next RFC 2/6] block: refactor to split bio thoroughly
  2022-03-29 14:41         ` Jens Axboe
  2022-03-29 14:42           ` Christoph Hellwig
@ 2022-03-30  1:54           ` yukuai (C)
  1 sibling, 0 replies; 20+ messages in thread
From: yukuai (C) @ 2022-03-30  1:54 UTC (permalink / raw)
  To: Jens Axboe, Christoph Hellwig
  Cc: andriy.shevchenko, john.garry, ming.lei, linux-block,
	linux-kernel, yi.zhang

On 2022/03/29 22:41, Jens Axboe wrote:
> On 3/29/22 8:40 AM, Christoph Hellwig wrote:
>> On Tue, Mar 29, 2022 at 08:35:29AM -0600, Jens Axboe wrote:
>>>> But more importantly why does your use case even have splits that get
>>>> submitted together?  Is this a case of Linus' stupidly low default
>>>> max_sectors when the hardware supports more, or is the hardware limited
>>>> to a low number of sectors per request?  Or do we hit another reason
>>>> for the split?
>>>
>>> See the posted use case, it's running 512kb ios on a 128kb device.
Hi,

The problem was first found during a kernel upgrade (v3.10 to v4.18). We
maintain a series of io performance test suites, and one of the tests
is fio random rw with a large bs. In that environment, 'max_sectors_kb'
is 256kb and the fio bs is 1m.
>>
>> That is an awfully low limit these days.  I'm really not sure we should
>> optimize the block layer for that.
> 
> That's exactly what my replies have been saying. I don't think this is
> a relevant thing to optimize for.

If the use case of large ios being submitted together is not a common
issue (probably not, since it has been a long time without complaints),
I agree that we should not optimize the block layer for it.

Thanks,
Kuai
> 
> Fixing fairness for wakeups seems useful, however.
> 


* Re: [PATCH -next RFC 0/6] improve large random io for HDD
  2022-03-29 12:53 ` [PATCH -next RFC 0/6] improve large random io for HDD Jens Axboe
@ 2022-03-30  2:05   ` yukuai (C)
  0 siblings, 0 replies; 20+ messages in thread
From: yukuai (C) @ 2022-03-30  2:05 UTC (permalink / raw)
  To: Jens Axboe, andriy.shevchenko, john.garry, ming.lei
  Cc: linux-block, linux-kernel, yi.zhang

On 2022/03/29 20:53, Jens Axboe wrote:
> On 3/29/22 3:40 AM, Yu Kuai wrote:
>> There is a defect for blk-mq compare to blk-sq, specifically split io
>> will end up discontinuous if the device is under high io pressure, while
>> split io will still be continuous in sq, this is because:
>>
>> 1) split bio is issued one by one, if one bio can't get tag, it will go
>> to wail. - patch 2
>> 2) each time 8(or wake batch) requests is done, 8 waiters will be woken up.
>> Thus if a thread is woken up, it will unlikey to get multiple tags.
>> - patch 3,4
>> 3) new io can preempt tag even if there are lots of threads waiting for
>> tags. - patch 5
>>
>> Test environment:
>> x86 vm, nr_requests is set to 64, queue_depth is set to 32 and
>> max_sectors_kb is set to 128.
>>
>> I haven't tested this patchset on physical machine yet, I'll try later
>> if anyone thinks this approch is meaningful.
> 
> A real machine test would definitely be a requirement. What real world
> uses cases is this solving? These days most devices have plenty of tags,
> and I would not really expect tag starvation to be much of a concern.
> 
> However, I do think there's merrit in fixing the unfairness we have
> here. But not at the cost of all of this. Why not just simply enforce
> more strict ordering of tag allocations? If someone is waiting, you get
> to wait too.
> 
> And I don't see much utility at all in tracking how many splits (and
> hence tags) would be required. Is this really a common issue, tons of
> splits and needing many tags? Why not just enforce the strict ordering
> as mentioned above, not allowing new allocators to get a tag if others
> are waiting, but perhaps allow someone submitting a string of splits to
> indeed keep allocating.
> 
> Yes, it'll be less efficient to still wake one-by-one, but honestly do
> we really care about that? If you're stalled on waiting for other IO to
> finish and release a tag, that isn't very efficient to begin with and
> doesn't seem like a case worth optimizing for me.
> 

Hi,

Thanks for your advice, I'll do more work based on your suggestions.

Kuai
