* [PATCH -next RFC v3 0/8] improve tag allocation under heavy load
@ 2022-04-15 10:10 Yu Kuai
  2022-04-15 10:10 ` [PATCH -next RFC v3 1/8] sbitmap: record the number of waiters for each waitqueue Yu Kuai
                   ` (9 more replies)
  0 siblings, 10 replies; 22+ messages in thread
From: Yu Kuai @ 2022-04-15 10:10 UTC (permalink / raw)
  To: axboe, bvanassche, andriy.shevchenko, john.garry, ming.lei, qiulaibin
  Cc: linux-block, linux-kernel, yukuai3, yi.zhang

Changes in v3:
 - update 'waiters_cnt' before 'ws_active' in sbitmap_prepare_to_wait()
 in patch 1, in case __sbq_wake_up() sees 'ws_active > 0' while
 'waiters_cnt' are all 0, which would cause a dead loop.
 - don't advance 'wait_index' on each loop iteration in patch 2
 - fix that 'wake_index' might mismatch on the first wake-up in patch 3,
 and improve the code of that patch.
 - add a check in patch 4 in case io hang is triggered in corner
 cases.
 - make the detection of whether free tags are sufficient more flexible.
 - fix a race in patch 8.
 - fix some wording and add some comments.

Changes in v2:
 - use a new title
 - add patches to fix waitqueues' unfairness - patches 1-3
 - delete patch to add queue flag
 - delete patch to split big io thoroughly

In this patchset:
 - patches 1-3 fix waitqueues' unfairness.
 - patches 4,5 disable tag preemption under heavy load.
 - patch 6 forces tag preemption for split bios.
 - patches 7,8 improve large random io for HDD. We do meet the problem
 and I'm trying to fix it at very low cost. However, if anyone still
 thinks this is not a common case and not worth optimizing, I'll drop
 them.

There is a defect in blk-mq compared to blk-sq: split ios will end up
discontinuous if the device is under high io pressure, while split ios
are still continuous in sq. This is because:

1) a new io can preempt a tag even if there are lots of threads waiting
for tags.
2) split bios are issued one by one; if one bio can't get a tag, it
will go to wait.
3) each time 8 (or wake_batch) requests complete, 8 waiters are woken
up. Thus when a thread is woken up, it is unlikely to get multiple tags.

The problem was first found when upgrading the kernel from v3.10 to
v4.18. The test device is an HDD with 256 'max_sectors_kb', and the
test case issues 1m ios with high concurrency.

Note that there is a precondition for such a performance problem:
there is a certain gap between the bandwidth of a single io with
bs=max_sectors_kb and the disk's upper limit.

During the test, I found that waitqueues can be extremely unbalanced
under heavy load. This is because 'wake_index' is not set properly in
__sbq_wake_up(); see details in patch 3.

Test environment:
arm64, 96 cores with 200 BogoMIPS; the test device is an HDD. The
default 'max_sectors_kb' is 1280 (sorry that I was unable to test on
the machine where 'max_sectors_kb' is 256).

The single io performance(randwrite):

| bs       | 128k | 256k | 512k | 1m   | 1280k | 2m   | 4m   |
| -------- | ---- | ---- | ---- | ---- | ----- | ---- | ---- |
| bw MiB/s | 20.1 | 33.4 | 51.8 | 67.1 | 74.7  | 82.9 | 82.9 |

It can be seen that 1280k io is already close to the upper limit, and
it would be hard to see differences with the default value, thus I set
'max_sectors_kb' to 128 in the following test.

Test cmd:
        fio \
        -filename=/dev/$dev \
        -name=test \
        -ioengine=psync \
        -allow_mounted_write=0 \
        -group_reporting \
        -direct=1 \
        -offset_increment=1g \
        -rw=randwrite \
        -bs=1024k \
        -numjobs={1,2,4,8,16,32,64,128,256,512} \
        -runtime=110 \
        -ramp_time=10

Test result: MiB/s

| numjobs | v5.18-rc1 | v5.18-rc1-patched |
| ------- | --------- | ----------------- |
| 1       | 67.7      | 67.7              |
| 2       | 67.7      | 67.7              |
| 4       | 67.7      | 67.7              |
| 8       | 67.7      | 67.7              |
| 16      | 64.8      | 65.6              |
| 32      | 59.8      | 63.8              |
| 64      | 54.9      | 59.4              |
| 128     | 49        | 56.9              |
| 256     | 37.7      | 58.3              |
| 512     | 31.8      | 57.9              |

Yu Kuai (8):
  sbitmap: record the number of waiters for each waitqueue
  blk-mq: call 'bt_wait_ptr()' later in blk_mq_get_tag()
  sbitmap: make sure waitqueues are balanced
  blk-mq: don't preempt tag under heavy load
  sbitmap: force tag preemption if free tags are sufficient
  blk-mq: force tag preemption for split bios
  blk-mq: record how many tags are needed for split bio
  sbitmap: wake up the number of threads based on required tags

 block/blk-merge.c         |   8 +-
 block/blk-mq-tag.c        |  49 +++++++++----
 block/blk-mq.c            |  54 +++++++++++++-
 block/blk-mq.h            |   4 +
 include/linux/blk_types.h |   4 +
 include/linux/sbitmap.h   |   9 +++
 lib/sbitmap.c             | 149 +++++++++++++++++++++++++++-----------
 7 files changed, 216 insertions(+), 61 deletions(-)

-- 
2.31.1


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH -next RFC v3 1/8] sbitmap: record the number of waiters for each waitqueue
  2022-04-15 10:10 [PATCH -next RFC v3 0/8] improve tag allocation under heavy load Yu Kuai
@ 2022-04-15 10:10 ` Yu Kuai
  2022-04-15 10:10 ` [PATCH -next RFC v3 2/8] blk-mq: call 'bt_wait_ptr()' later in blk_mq_get_tag() Yu Kuai
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 22+ messages in thread
From: Yu Kuai @ 2022-04-15 10:10 UTC (permalink / raw)
  To: axboe, bvanassche, andriy.shevchenko, john.garry, ming.lei, qiulaibin
  Cc: linux-block, linux-kernel, yukuai3, yi.zhang

Add a counter in struct sbq_wait_state to record how many threads are
waiting on the waitqueue. This will be used in later patches to make
sure the 8 waitqueues are balanced. The counter is also shown in
debugfs so that users can check whether waitqueues are balanced.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
 include/linux/sbitmap.h | 5 +++++
 lib/sbitmap.c           | 7 +++++--
 2 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/include/linux/sbitmap.h b/include/linux/sbitmap.h
index 8f5a86e210b9..8a64271d0696 100644
--- a/include/linux/sbitmap.h
+++ b/include/linux/sbitmap.h
@@ -91,6 +91,11 @@ struct sbq_wait_state {
 	 */
 	atomic_t wait_cnt;
 
+	/**
+	 * @waiters_cnt: Number of active waiters
+	 */
+	atomic_t waiters_cnt;
+
 	/**
 	 * @wait: Wait queue.
 	 */
diff --git a/lib/sbitmap.c b/lib/sbitmap.c
index ae4fd4de9ebe..a5105ce6d424 100644
--- a/lib/sbitmap.c
+++ b/lib/sbitmap.c
@@ -444,6 +444,7 @@ int sbitmap_queue_init_node(struct sbitmap_queue *sbq, unsigned int depth,
 	for (i = 0; i < SBQ_WAIT_QUEUES; i++) {
 		init_waitqueue_head(&sbq->ws[i].wait);
 		atomic_set(&sbq->ws[i].wait_cnt, sbq->wake_batch);
+		atomic_set(&sbq->ws[i].waiters_cnt, 0);
 	}
 
 	return 0;
@@ -759,9 +760,9 @@ void sbitmap_queue_show(struct sbitmap_queue *sbq, struct seq_file *m)
 	for (i = 0; i < SBQ_WAIT_QUEUES; i++) {
 		struct sbq_wait_state *ws = &sbq->ws[i];
 
-		seq_printf(m, "\t{.wait_cnt=%d, .wait=%s},\n",
+		seq_printf(m, "\t{.wait_cnt=%d, .waiters_cnt=%d},\n",
 			   atomic_read(&ws->wait_cnt),
-			   waitqueue_active(&ws->wait) ? "active" : "inactive");
+			   atomic_read(&ws->waiters_cnt));
 	}
 	seq_puts(m, "}\n");
 
@@ -797,6 +798,7 @@ void sbitmap_prepare_to_wait(struct sbitmap_queue *sbq,
 			     struct sbq_wait *sbq_wait, int state)
 {
 	if (!sbq_wait->sbq) {
+		atomic_inc(&ws->waiters_cnt);
 		atomic_inc(&sbq->ws_active);
 		sbq_wait->sbq = sbq;
 	}
@@ -810,6 +812,7 @@ void sbitmap_finish_wait(struct sbitmap_queue *sbq, struct sbq_wait_state *ws,
 	finish_wait(&ws->wait, &sbq_wait->wait);
 	if (sbq_wait->sbq) {
 		atomic_dec(&sbq->ws_active);
+		atomic_dec(&ws->waiters_cnt);
 		sbq_wait->sbq = NULL;
 	}
 }
-- 
2.31.1



* [PATCH -next RFC v3 2/8] blk-mq: call 'bt_wait_ptr()' later in blk_mq_get_tag()
  2022-04-15 10:10 [PATCH -next RFC v3 0/8] improve tag allocation under heavy load Yu Kuai
  2022-04-15 10:10 ` [PATCH -next RFC v3 1/8] sbitmap: record the number of waiters for each waitqueue Yu Kuai
@ 2022-04-15 10:10 ` Yu Kuai
  2022-04-15 10:10 ` [PATCH -next RFC v3 3/8] sbitmap: make sure waitqueues are balanced Yu Kuai
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 22+ messages in thread
From: Yu Kuai @ 2022-04-15 10:10 UTC (permalink / raw)
  To: axboe, bvanassche, andriy.shevchenko, john.garry, ming.lei, qiulaibin
  Cc: linux-block, linux-kernel, yukuai3, yi.zhang

bt_wait_ptr() increases 'wait_index'. However, if blk_mq_get_tag()
gets a tag successfully after bt_wait_ptr() is called and before
sbitmap_prepare_to_wait() is called, then the 'ws' is skipped. This
behavior might cause the 8 waitqueues to be unbalanced.

Moving bt_wait_ptr() later should reduce the problem when the disk is
under high io pressure.

In the meantime, instead of calling bt_wait_ptr() on every loop
iteration, call it only when the destination hw queue changes, which
should reduce the unfairness further.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
 block/blk-mq-tag.c | 11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 68ac23d0b640..5ad85063e91e 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -131,7 +131,7 @@ unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
 {
 	struct blk_mq_tags *tags = blk_mq_tags_from_data(data);
 	struct sbitmap_queue *bt;
-	struct sbq_wait_state *ws;
+	struct sbq_wait_state *ws = NULL;
 	DEFINE_SBQ_WAIT(wait);
 	unsigned int tag_offset;
 	int tag;
@@ -155,7 +155,6 @@ unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
 	if (data->flags & BLK_MQ_REQ_NOWAIT)
 		return BLK_MQ_NO_TAG;
 
-	ws = bt_wait_ptr(bt, data->hctx);
 	do {
 		struct sbitmap_queue *bt_prev;
 
@@ -174,6 +173,8 @@ unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
 		if (tag != BLK_MQ_NO_TAG)
 			break;
 
+		if (!ws)
+			ws = bt_wait_ptr(bt, data->hctx);
 		sbitmap_prepare_to_wait(bt, ws, &wait, TASK_UNINTERRUPTIBLE);
 
 		tag = __blk_mq_get_tag(data, bt);
@@ -199,10 +200,10 @@ unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
 		 * previous queue for compensating the wake up miss, so
 		 * other allocations on previous queue won't be starved.
 		 */
-		if (bt != bt_prev)
+		if (bt != bt_prev) {
 			sbitmap_queue_wake_up(bt_prev);
-
-		ws = bt_wait_ptr(bt, data->hctx);
+			ws = bt_wait_ptr(bt, data->hctx);
+		}
 	} while (1);
 
 	sbitmap_finish_wait(bt, ws, &wait);
-- 
2.31.1



* [PATCH -next RFC v3 3/8] sbitmap: make sure waitqueues are balanced
  2022-04-15 10:10 [PATCH -next RFC v3 0/8] improve tag allocation under heavy load Yu Kuai
  2022-04-15 10:10 ` [PATCH -next RFC v3 1/8] sbitmap: record the number of waiters for each waitqueue Yu Kuai
  2022-04-15 10:10 ` [PATCH -next RFC v3 2/8] blk-mq: call 'bt_wait_ptr()' later in blk_mq_get_tag() Yu Kuai
@ 2022-04-15 10:10 ` Yu Kuai
  2022-04-15 10:10 ` [PATCH -next RFC v3 4/8] blk-mq: don't preempt tag under heavy load Yu Kuai
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 22+ messages in thread
From: Yu Kuai @ 2022-04-15 10:10 UTC (permalink / raw)
  To: axboe, bvanassche, andriy.shevchenko, john.garry, ming.lei, qiulaibin
  Cc: linux-block, linux-kernel, yukuai3, yi.zhang

Currently, the same waitqueue might be woken up continuously:

__sbq_wake_up		__sbq_wake_up
 sbq_wake_ptr -> assume	0
			 sbq_wake_ptr -> 0
 atomic_dec_return
			atomic_dec_return
 atomic_cmpxchg -> succeed
			 atomic_cmpxchg -> failed
			  return true

			__sbq_wake_up
			 sbq_wake_ptr
			  atomic_read(&sbq->wake_index) -> still 0
 sbq_index_atomic_inc -> inc to 1
			  if (waitqueue_active(&ws->wait))
			   if (wake_index != atomic_read(&sbq->wake_index))
			    atomic_set -> reset from 1 to 0
 wake_up_nr -> wake up first waitqueue
			    // continue to wake up in first waitqueue

What's worse, io hang is possible in theory because wake-ups might be
missed. For example, 2 * wake_batch tags are put, while only wake_batch
threads are woken up:

__sbq_wake_up
 atomic_cmpxchg -> reset wait_cnt
			__sbq_wake_up -> decrease wait_cnt
			...
			__sbq_wake_up -> wait_cnt is decreased to 0 again
			 atomic_cmpxchg
			 sbq_index_atomic_inc -> increase wake_index
			 wake_up_nr -> wake up and waitqueue might be empty
 sbq_index_atomic_inc -> increase again, one waitqueue is skipped
 wake_up_nr -> invalid wake up because old wakequeue might be empty

To fix the problem, refactor to make sure waitqueues are woken up one
by one, and choose the next waitqueue by the number of waiting threads
to keep waitqueues balanced.

Test cmd: nr_requests is 64, and queue_depth is 32
[global]
filename=/dev/sda
ioengine=libaio
direct=1
allow_mounted_write=0
group_reporting

[test]
rw=randwrite
bs=4k
numjobs=512
iodepth=2

Before this patch, waitqueues can be extremely unbalanced, for example:
ws_active=484
ws={
        {.wait_cnt=8, .waiters_cnt=117},
        {.wait_cnt=8, .waiters_cnt=59},
        {.wait_cnt=8, .waiters_cnt=76},
        {.wait_cnt=8, .waiters_cnt=0},
        {.wait_cnt=5, .waiters_cnt=24},
        {.wait_cnt=8, .waiters_cnt=12},
        {.wait_cnt=8, .waiters_cnt=21},
        {.wait_cnt=8, .waiters_cnt=175},
}

With this patch, waitqueues are always balanced, for example:
ws_active=477
ws={
        {.wait_cnt=8, .waiters_cnt=59},
        {.wait_cnt=6, .waiters_cnt=62},
        {.wait_cnt=8, .waiters_cnt=61},
        {.wait_cnt=8, .waiters_cnt=60},
        {.wait_cnt=8, .waiters_cnt=63},
        {.wait_cnt=8, .waiters_cnt=56},
        {.wait_cnt=8, .waiters_cnt=59},
        {.wait_cnt=8, .waiters_cnt=57},
}

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
 lib/sbitmap.c | 88 ++++++++++++++++++++++++++++-----------------------
 1 file changed, 48 insertions(+), 40 deletions(-)

diff --git a/lib/sbitmap.c b/lib/sbitmap.c
index a5105ce6d424..7527527bbc86 100644
--- a/lib/sbitmap.c
+++ b/lib/sbitmap.c
@@ -575,66 +575,74 @@ void sbitmap_queue_min_shallow_depth(struct sbitmap_queue *sbq,
 }
 EXPORT_SYMBOL_GPL(sbitmap_queue_min_shallow_depth);
 
-static struct sbq_wait_state *sbq_wake_ptr(struct sbitmap_queue *sbq)
+/* always choose the 'ws' with the max waiters */
+static void sbq_update_wake_index(struct sbitmap_queue *sbq,
+				  int old_wake_index)
 {
-	int i, wake_index;
+	int index, wake_index;
+	int max_waiters = 0;
 
-	if (!atomic_read(&sbq->ws_active))
-		return NULL;
+	if (old_wake_index != atomic_read(&sbq->wake_index))
+		return;
 
-	wake_index = atomic_read(&sbq->wake_index);
-	for (i = 0; i < SBQ_WAIT_QUEUES; i++) {
-		struct sbq_wait_state *ws = &sbq->ws[wake_index];
+	for (wake_index = 0; wake_index < SBQ_WAIT_QUEUES; wake_index++) {
+		struct sbq_wait_state *ws;
+		int waiters;
 
-		if (waitqueue_active(&ws->wait)) {
-			if (wake_index != atomic_read(&sbq->wake_index))
-				atomic_set(&sbq->wake_index, wake_index);
-			return ws;
-		}
+		if (wake_index == old_wake_index)
+			continue;
 
-		wake_index = sbq_index_inc(wake_index);
+		ws = &sbq->ws[wake_index];
+		waiters = atomic_read(&ws->waiters_cnt);
+		if (waiters > max_waiters) {
+			max_waiters = waiters;
+			index = wake_index;
+		}
 	}
 
-	return NULL;
+	if (max_waiters)
+		atomic_cmpxchg(&sbq->wake_index, old_wake_index, index);
 }
 
 static bool __sbq_wake_up(struct sbitmap_queue *sbq)
 {
 	struct sbq_wait_state *ws;
 	unsigned int wake_batch;
-	int wait_cnt;
+	int wait_cnt, wake_index;
 
-	ws = sbq_wake_ptr(sbq);
-	if (!ws)
+	if (!atomic_read(&sbq->ws_active))
 		return false;
 
-	wait_cnt = atomic_dec_return(&ws->wait_cnt);
-	if (wait_cnt <= 0) {
-		int ret;
-
-		wake_batch = READ_ONCE(sbq->wake_batch);
+	wake_index = atomic_read(&sbq->wake_index);
+	ws = &sbq->ws[wake_index];
 
-		/*
-		 * Pairs with the memory barrier in sbitmap_queue_resize() to
-		 * ensure that we see the batch size update before the wait
-		 * count is reset.
-		 */
-		smp_mb__before_atomic();
+	/* A mismatched wake_index can only happen on the first wakeup. */
+	if (!atomic_read(&ws->waiters_cnt)) {
+		sbq_update_wake_index(sbq, wake_index);
+		return true;
+	}
 
-		/*
-		 * For concurrent callers of this, the one that failed the
-		 * atomic_cmpxhcg() race should call this function again
-		 * to wakeup a new batch on a different 'ws'.
-		 */
-		ret = atomic_cmpxchg(&ws->wait_cnt, wait_cnt, wake_batch);
-		if (ret == wait_cnt) {
-			sbq_index_atomic_inc(&sbq->wake_index);
-			wake_up_nr(&ws->wait, wake_batch);
-			return false;
-		}
+	wait_cnt = atomic_dec_return(&ws->wait_cnt);
+	if (wait_cnt > 0)
+		return false;
 
+	sbq_update_wake_index(sbq, wake_index);
+	/*
+	 * Concurrent callers should call this function again
+	 * to wakeup a new batch on a different 'ws'.
+	 */
+	if (wait_cnt < 0)
 		return true;
-	}
+
+	wake_batch = READ_ONCE(sbq->wake_batch);
+	/*
+	 * Pairs with the memory barrier in sbitmap_queue_resize() to
+	 * ensure that we see the batch size update before the wait
+	 * count is reset.
+	 */
+	smp_mb__before_atomic();
+	atomic_set(&ws->wait_cnt, wake_batch);
+	wake_up_nr(&ws->wait, wake_batch);
 
 	return false;
 }
-- 
2.31.1



* [PATCH -next RFC v3 4/8] blk-mq: don't preempt tag under heavy load
  2022-04-15 10:10 [PATCH -next RFC v3 0/8] improve tag allocation under heavy load Yu Kuai
                   ` (2 preceding siblings ...)
  2022-04-15 10:10 ` [PATCH -next RFC v3 3/8] sbitmap: make sure waitqueues are balanced Yu Kuai
@ 2022-04-15 10:10 ` Yu Kuai
  2022-04-15 10:10 ` [PATCH -next RFC v3 5/8] sbitmap: force tag preemption if free tags are sufficient Yu Kuai
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 22+ messages in thread
From: Yu Kuai @ 2022-04-15 10:10 UTC (permalink / raw)
  To: axboe, bvanassche, andriy.shevchenko, john.garry, ming.lei, qiulaibin
  Cc: linux-block, linux-kernel, yukuai3, yi.zhang

Tag preemption is the default behaviour: blk_mq_get_tag() will try to
get a tag unconditionally, which means a new io can preempt a tag even
if there are lots of ios waiting for tags.

Such behaviour doesn't make sense when the disk is under heavy load,
because it intensifies competition without improving performance,
especially for huge ios, as split bios are unlikely to be issued
continuously.

The ideal way to disable tag preemption is to track how many tags are
available, and wait directly in blk_mq_get_tag() if free tags are very
few. However, this is unrealistic because it would affect the fast path.

As 'ws_active' is only updated in the slow path, this patch disables tag
preemption if 'ws_active' is greater than 8, which means there are
already many threads waiting for tags.

Once tag preemption is disabled, there is a situation that can cause
performance degradation (or io hang in extreme scenarios): the waitqueue
may not have 'wake_batch' threads, thus a wake-up on this waitqueue might
decrease the concurrency of ios. The next patch fixes this problem.

This patch also adds a check in blk_mq_timeout_work(), in case io hang
is triggered because waiters can't be awakened in some corner cases.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
 block/blk-mq-tag.c | 36 +++++++++++++++++++++++++-----------
 block/blk-mq.c     | 29 +++++++++++++++++++++++++++++
 block/blk-mq.h     |  2 ++
 3 files changed, 56 insertions(+), 11 deletions(-)

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 5ad85063e91e..a6c5ec846a5e 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -127,6 +127,13 @@ unsigned long blk_mq_get_tags(struct blk_mq_alloc_data *data, int nr_tags,
 	return ret;
 }
 
+static inline bool preempt_tag(struct blk_mq_alloc_data *data,
+			       struct sbitmap_queue *bt)
+{
+	return data->preempt ||
+	       atomic_read(&bt->ws_active) <= SBQ_WAIT_QUEUES;
+}
+
 unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
 {
 	struct blk_mq_tags *tags = blk_mq_tags_from_data(data);
@@ -148,12 +155,14 @@ unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
 		tag_offset = tags->nr_reserved_tags;
 	}
 
-	tag = __blk_mq_get_tag(data, bt);
-	if (tag != BLK_MQ_NO_TAG)
-		goto found_tag;
+	if (data->flags & BLK_MQ_REQ_NOWAIT || preempt_tag(data, bt)) {
+		tag = __blk_mq_get_tag(data, bt);
+		if (tag != BLK_MQ_NO_TAG)
+			goto found_tag;
 
-	if (data->flags & BLK_MQ_REQ_NOWAIT)
-		return BLK_MQ_NO_TAG;
+		if (data->flags & BLK_MQ_REQ_NOWAIT)
+			return BLK_MQ_NO_TAG;
+	}
 
 	do {
 		struct sbitmap_queue *bt_prev;
@@ -169,21 +178,26 @@ unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
 		 * Retry tag allocation after running the hardware queue,
 		 * as running the queue may also have found completions.
 		 */
-		tag = __blk_mq_get_tag(data, bt);
-		if (tag != BLK_MQ_NO_TAG)
-			break;
+		if (preempt_tag(data, bt)) {
+			tag = __blk_mq_get_tag(data, bt);
+			if (tag != BLK_MQ_NO_TAG)
+				break;
+		}
 
 		if (!ws)
 			ws = bt_wait_ptr(bt, data->hctx);
 		sbitmap_prepare_to_wait(bt, ws, &wait, TASK_UNINTERRUPTIBLE);
 
-		tag = __blk_mq_get_tag(data, bt);
-		if (tag != BLK_MQ_NO_TAG)
-			break;
+		if (preempt_tag(data, bt)) {
+			tag = __blk_mq_get_tag(data, bt);
+			if (tag != BLK_MQ_NO_TAG)
+				break;
+		}
 
 		bt_prev = bt;
 		io_schedule();
 
+		data->preempt = true;
 		sbitmap_finish_wait(bt, ws, &wait);
 
 		data->ctx = blk_mq_get_ctx(data->q);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index ed3ed86f7dd2..32beacbad5e2 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1446,6 +1446,34 @@ static bool blk_mq_check_expired(struct request *rq, void *priv, bool reserved)
 	return true;
 }
 
+static void blk_mq_check_tag_waiters(struct blk_mq_hw_ctx *hctx)
+{
+	bool warn = false;
+	struct blk_mq_tags *tags = hctx->tags;
+
+again:
+	if (atomic_read(&tags->bitmap_tags.ws_active)) {
+		warn = true;
+		sbitmap_queue_wake_all(&tags->bitmap_tags);
+	}
+
+	if (atomic_read(&tags->breserved_tags.ws_active)) {
+		warn = true;
+		sbitmap_queue_wake_all(&tags->breserved_tags);
+	}
+
+	if (hctx->sched_tags && tags != hctx->sched_tags) {
+		tags = hctx->sched_tags;
+		goto again;
+	}
+
+	/*
+	 * This is problematic because someone is still waiting for tag while
+	 * no tag is used.
+	 */
+	WARN_ON_ONCE(warn);
+}
+
 static void blk_mq_timeout_work(struct work_struct *work)
 {
 	struct request_queue *q =
@@ -1482,6 +1510,7 @@ static void blk_mq_timeout_work(struct work_struct *work)
 		 * each hctx as idle.
 		 */
 		queue_for_each_hw_ctx(q, hctx, i) {
+			blk_mq_check_tag_waiters(hctx);
 			/* the hctx may be unmapped, so check it here */
 			if (blk_mq_hw_queue_mapped(hctx))
 				blk_mq_tag_idle(hctx);
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 2615bd58bad3..1a85bd1045d8 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -156,6 +156,8 @@ struct blk_mq_alloc_data {
 
 	/* allocate multiple requests/tags in one go */
 	unsigned int nr_tags;
+	/* true if blk_mq_get_tag() will try to preempt tag */
+	bool preempt;
 	struct request **cached_rq;
 
 	/* input & output parameter */
-- 
2.31.1



* [PATCH -next RFC v3 5/8] sbitmap: force tag preemption if free tags are sufficient
  2022-04-15 10:10 [PATCH -next RFC v3 0/8] improve tag allocation under heavy load Yu Kuai
                   ` (3 preceding siblings ...)
  2022-04-15 10:10 ` [PATCH -next RFC v3 4/8] blk-mq: don't preempt tag under heavy load Yu Kuai
@ 2022-04-15 10:10 ` Yu Kuai
  2022-04-15 10:10 ` [PATCH -next RFC v3 6/8] blk-mq: force tag preemption for split bios Yu Kuai
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 22+ messages in thread
From: Yu Kuai @ 2022-04-15 10:10 UTC (permalink / raw)
  To: axboe, bvanassche, andriy.shevchenko, john.garry, ming.lei, qiulaibin
  Cc: linux-block, linux-kernel, yukuai3, yi.zhang

Now that tag preemption is disabled under heavy load, if wakers don't
use up 'wake_batch' tags while preemption is still disabled, io
concurrency will decline.

To fix the problem, add a check before waking up, and force tag
preemption if free tags are sufficient, so that the extra tags can be
used by new ios. Tag preemption will be disabled again once the extra
tags are used up.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
 block/blk-mq-tag.c      |  3 ++-
 include/linux/sbitmap.h |  2 ++
 lib/sbitmap.c           | 30 ++++++++++++++++++++++++++++++
 3 files changed, 34 insertions(+), 1 deletion(-)

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index a6c5ec846a5e..d02710cf3355 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -131,7 +131,8 @@ static inline bool preempt_tag(struct blk_mq_alloc_data *data,
 			       struct sbitmap_queue *bt)
 {
 	return data->preempt ||
-	       atomic_read(&bt->ws_active) <= SBQ_WAIT_QUEUES;
+	       atomic_read(&bt->ws_active) <= SBQ_WAIT_QUEUES ||
+	       bt->force_tag_preemption;
 }
 
 unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
diff --git a/include/linux/sbitmap.h b/include/linux/sbitmap.h
index 8a64271d0696..ca00ccb6af48 100644
--- a/include/linux/sbitmap.h
+++ b/include/linux/sbitmap.h
@@ -143,6 +143,8 @@ struct sbitmap_queue {
 	 * sbitmap_queue_get_shallow()
 	 */
 	unsigned int min_shallow_depth;
+
+	bool force_tag_preemption;
 };
 
 /**
diff --git a/lib/sbitmap.c b/lib/sbitmap.c
index 7527527bbc86..315e5619b384 100644
--- a/lib/sbitmap.c
+++ b/lib/sbitmap.c
@@ -434,6 +434,7 @@ int sbitmap_queue_init_node(struct sbitmap_queue *sbq, unsigned int depth,
 	sbq->wake_batch = sbq_calc_wake_batch(sbq, depth);
 	atomic_set(&sbq->wake_index, 0);
 	atomic_set(&sbq->ws_active, 0);
+	sbq->force_tag_preemption = true;
 
 	sbq->ws = kzalloc_node(SBQ_WAIT_QUEUES * sizeof(*sbq->ws), flags, node);
 	if (!sbq->ws) {
@@ -604,6 +605,34 @@ static void sbq_update_wake_index(struct sbitmap_queue *sbq,
 		atomic_cmpxchg(&sbq->wake_index, old_wake_index, index);
 }
 
+static inline void sbq_update_preemption(struct sbitmap_queue *sbq,
+					 unsigned int wake_batch)
+{
+	unsigned int free;
+
+	if (wake_batch == 1) {
+		/*
+		 * Waiters will be woken up one by one, no risk of declining
+		 * io concurrency.
+		 */
+		sbq->force_tag_preemption = false;
+		return;
+	}
+
+	free = sbq->sb.depth - sbitmap_weight(&sbq->sb);
+	if (sbq->force_tag_preemption) {
+		if (free <= wake_batch)
+			sbq->force_tag_preemption = false;
+	} else {
+		if (free > wake_batch << 1)
+			sbq->force_tag_preemption = true;
+	}
+}
+
 static bool __sbq_wake_up(struct sbitmap_queue *sbq)
 {
 	struct sbq_wait_state *ws;
@@ -642,6 +671,7 @@ static bool __sbq_wake_up(struct sbitmap_queue *sbq)
 	 */
 	smp_mb__before_atomic();
 	atomic_set(&ws->wait_cnt, wake_batch);
+	sbq_update_preemption(sbq, wake_batch);
 	wake_up_nr(&ws->wait, wake_batch);
 
 	return false;
-- 
2.31.1



* [PATCH -next RFC v3 6/8] blk-mq: force tag preemption for split bios
  2022-04-15 10:10 [PATCH -next RFC v3 0/8] improve tag allocation under heavy load Yu Kuai
                   ` (4 preceding siblings ...)
  2022-04-15 10:10 ` [PATCH -next RFC v3 5/8] sbitmap: force tag preemption if free tags are sufficient Yu Kuai
@ 2022-04-15 10:10 ` Yu Kuai
  2022-04-15 10:10 ` [PATCH -next RFC v3 7/8] blk-mq: record how many tags are needed for split bio Yu Kuai
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 22+ messages in thread
From: Yu Kuai @ 2022-04-15 10:10 UTC (permalink / raw)
  To: axboe, bvanassche, andriy.shevchenko, john.garry, ming.lei, qiulaibin
  Cc: linux-block, linux-kernel, yukuai3, yi.zhang

For HDD, sequential io is much faster than random io, thus it's better
to issue split ios continuously. However, this is broken when tag
preemption is disabled, because wakers can only get one tag each time.

Thus tag preemption should be enabled for split bios, at least for HDD.
Specifically, the first bio won't preempt a tag, while the following
split bios will.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
 block/blk-merge.c         | 8 +++++++-
 block/blk-mq.c            | 1 +
 include/linux/blk_types.h | 4 ++++
 3 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/block/blk-merge.c b/block/blk-merge.c
index 7771dacc99cb..85c285023f5e 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -343,12 +343,18 @@ void __blk_queue_split(struct request_queue *q, struct bio **bio,
 
 	if (split) {
 		/* there isn't chance to merge the splitted bio */
-		split->bi_opf |= REQ_NOMERGE;
+		split->bi_opf |= (REQ_NOMERGE | REQ_SPLIT);
+		if ((*bio)->bi_opf & REQ_SPLIT)
+			split->bi_opf |= REQ_PREEMPT;
+		else
+			(*bio)->bi_opf |= REQ_SPLIT;
 
 		bio_chain(split, *bio);
 		trace_block_split(split, (*bio)->bi_iter.bi_sector);
 		submit_bio_noacct(*bio);
 		*bio = split;
+	} else if ((*bio)->bi_opf & REQ_SPLIT) {
+		(*bio)->bi_opf |= REQ_PREEMPT;
 	}
 }
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 32beacbad5e2..a889f01d2cdf 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2766,6 +2766,7 @@ static struct request *blk_mq_get_new_requests(struct request_queue *q,
 		.q		= q,
 		.nr_tags	= 1,
 		.cmd_flags	= bio->bi_opf,
+		.preempt	= (bio->bi_opf & REQ_PREEMPT),
 	};
 	struct request *rq;
 
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index c62274466e72..046a34c81ec4 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -418,6 +418,8 @@ enum req_flag_bits {
 	/* for driver use */
 	__REQ_DRV,
 	__REQ_SWAP,		/* swapping request. */
+	__REQ_SPLIT,		/* IO is split. */
+	__REQ_PREEMPT,		/* IO will preempt tag. */
 	__REQ_NR_BITS,		/* stops here */
 };
 
@@ -443,6 +445,8 @@ enum req_flag_bits {
 
 #define REQ_DRV			(1ULL << __REQ_DRV)
 #define REQ_SWAP		(1ULL << __REQ_SWAP)
+#define REQ_SPLIT		(1ULL << __REQ_SPLIT)
+#define REQ_PREEMPT		(1ULL << __REQ_PREEMPT)
 
 #define REQ_FAILFAST_MASK \
 	(REQ_FAILFAST_DEV | REQ_FAILFAST_TRANSPORT | REQ_FAILFAST_DRIVER)
-- 
2.31.1



* [PATCH -next RFC v3 7/8] blk-mq: record how many tags are needed for split bio
  2022-04-15 10:10 [PATCH -next RFC v3 0/8] improve tag allocation under heavy load Yu Kuai
                   ` (5 preceding siblings ...)
  2022-04-15 10:10 ` [PATCH -next RFC v3 6/8] blk-mq: force tag preemption for split bios Yu Kuai
@ 2022-04-15 10:10 ` Yu Kuai
  2022-04-15 10:10 ` [PATCH -next RFC v3 8/8] sbitmap: wake up the number of threads based on required tags Yu Kuai
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 22+ messages in thread
From: Yu Kuai @ 2022-04-15 10:10 UTC (permalink / raw)
  To: axboe, bvanassche, andriy.shevchenko, john.garry, ming.lei, qiulaibin
  Cc: linux-block, linux-kernel, yukuai3, yi.zhang

Currently, each time 8 (or wake_batch) requests complete, 8 waiters
will be woken up. This is not necessary, because we only need to make
sure the wakers will use up 8 tags. For example, if we know in advance
that a thread needs 8 tags, then waking up one thread is enough, which
also avoids unnecessary context switches.

On the other hand, sequential io is much faster than random io, thus
it's better to issue split ios continuously.

This patch provides the information of how many tags are needed for a
huge io; it will be used in the next patch.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
 block/blk-mq-tag.c      |  1 +
 block/blk-mq.c          | 24 +++++++++++++++++++++---
 block/blk-mq.h          |  2 ++
 include/linux/sbitmap.h |  2 ++
 4 files changed, 26 insertions(+), 3 deletions(-)

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index d02710cf3355..70ce98a5c32b 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -165,6 +165,7 @@ unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
 			return BLK_MQ_NO_TAG;
 	}
 
+	wait.nr_tags += data->nr_split;
 	do {
 		struct sbitmap_queue *bt_prev;
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index a889f01d2cdf..ac614a379a6d 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2760,12 +2760,14 @@ static bool blk_mq_attempt_bio_merge(struct request_queue *q,
 static struct request *blk_mq_get_new_requests(struct request_queue *q,
 					       struct blk_plug *plug,
 					       struct bio *bio,
-					       unsigned int nsegs)
+					       unsigned int nsegs,
+					       unsigned int nr_split)
 {
 	struct blk_mq_alloc_data data = {
 		.q		= q,
 		.nr_tags	= 1,
 		.cmd_flags	= bio->bi_opf,
+		.nr_split	= nr_split,
 		.preempt	= (bio->bi_opf & REQ_PREEMPT),
 	};
 	struct request *rq;
@@ -2824,6 +2826,19 @@ static inline struct request *blk_mq_get_cached_request(struct request_queue *q,
 	return rq;
 }
 
+static inline unsigned int caculate_sectors_split(struct bio *bio)
+{
+	switch (bio_op(bio)) {
+	case REQ_OP_DISCARD:
+	case REQ_OP_SECURE_ERASE:
+	case REQ_OP_WRITE_ZEROES:
+		return 0;
+	default:
+		return (bio_sectors(bio) - 1) /
+			queue_max_sectors(bio->bi_bdev->bd_queue);
+	}
+}
+
 /**
  * blk_mq_submit_bio - Create and send a request to block device.
  * @bio: Bio pointer.
@@ -2844,11 +2859,14 @@ void blk_mq_submit_bio(struct bio *bio)
 	const int is_sync = op_is_sync(bio->bi_opf);
 	struct request *rq;
 	unsigned int nr_segs = 1;
+	unsigned int nr_split = 0;
 	blk_status_t ret;
 
 	blk_queue_bounce(q, &bio);
-	if (blk_may_split(q, bio))
+	if (blk_may_split(q, bio)) {
+		nr_split = caculate_sectors_split(bio);
 		__blk_queue_split(q, &bio, &nr_segs);
+	}
 
 	if (!bio_integrity_prep(bio))
 		return;
@@ -2857,7 +2875,7 @@ void blk_mq_submit_bio(struct bio *bio)
 	if (!rq) {
 		if (!bio)
 			return;
-		rq = blk_mq_get_new_requests(q, plug, bio, nr_segs);
+		rq = blk_mq_get_new_requests(q, plug, bio, nr_segs, nr_split);
 		if (unlikely(!rq))
 			return;
 	}
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 1a85bd1045d8..9bad3057c1f3 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -156,6 +156,8 @@ struct blk_mq_alloc_data {
 
 	/* allocate multiple requests/tags in one go */
 	unsigned int nr_tags;
+	/* number of ios left after this io is handled */
+	unsigned int nr_split;
 	/* true if blk_mq_get_tag() will try to preempt tag */
 	bool preempt;
 	struct request **cached_rq;
diff --git a/include/linux/sbitmap.h b/include/linux/sbitmap.h
index ca00ccb6af48..1abd8ed5d406 100644
--- a/include/linux/sbitmap.h
+++ b/include/linux/sbitmap.h
@@ -596,12 +596,14 @@ void sbitmap_queue_wake_up(struct sbitmap_queue *sbq);
 void sbitmap_queue_show(struct sbitmap_queue *sbq, struct seq_file *m);
 
 struct sbq_wait {
+	unsigned int nr_tags;
 	struct sbitmap_queue *sbq;	/* if set, sbq_wait is accounted */
 	struct wait_queue_entry wait;
 };
 
 #define DEFINE_SBQ_WAIT(name)							\
 	struct sbq_wait name = {						\
+		.nr_tags = 1,							\
 		.sbq = NULL,							\
 		.wait = {							\
 			.private	= current,				\
-- 
2.31.1



* [PATCH -next RFC v3 8/8] sbitmap: wake up the number of threads based on required tags
  2022-04-15 10:10 [PATCH -next RFC v3 0/8] improve tag allocation under heavy load Yu Kuai
                   ` (6 preceding siblings ...)
  2022-04-15 10:10 ` [PATCH -next RFC v3 7/8] blk-mq: record how many tags are needed for splited bio Yu Kuai
@ 2022-04-15 10:10 ` Yu Kuai
  2022-04-24  2:43 ` [PATCH -next RFC v3 0/8] improve tag allocation under heavy load yukuai (C)
  2022-04-25  3:09 ` Bart Van Assche
  9 siblings, 0 replies; 22+ messages in thread
From: Yu Kuai @ 2022-04-15 10:10 UTC (permalink / raw)
  To: axboe, bvanassche, andriy.shevchenko, john.garry, ming.lei, qiulaibin
  Cc: linux-block, linux-kernel, yukuai3, yi.zhang

Split bios are now forced to preempt tags; however, they are unlikely
to get tags continuously, because 'wake_batch' threads are woken up
each time while only 'wake_batch' tags are available.

Since the number of tags required for a huge io is known in advance,
it's safe to base the number of wakeups on the required tags, because
we can be sure that the woken threads will use up 'wake_batch' tags.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
 lib/sbitmap.c | 28 +++++++++++++++++++++++++++-
 1 file changed, 27 insertions(+), 1 deletion(-)

diff --git a/lib/sbitmap.c b/lib/sbitmap.c
index 315e5619b384..5ac5ad1b4b1e 100644
--- a/lib/sbitmap.c
+++ b/lib/sbitmap.c
@@ -633,6 +633,32 @@ static inline void sbq_update_preemption(struct sbitmap_queue *sbq,
 		true : false;
 }
 
+static unsigned int get_wake_nr(struct sbq_wait_state *ws, unsigned int nr_tags)
+{
+	struct sbq_wait *wait;
+	struct wait_queue_entry *entry;
+	unsigned int nr = 1;
+
+	spin_lock_irq(&ws->wait.lock);
+	list_for_each_entry(entry, &ws->wait.head, entry) {
+		wait = container_of(entry, struct sbq_wait, wait);
+		if (nr_tags <= wait->nr_tags) {
+			nr_tags = 0;
+			break;
+		}
+
+		nr++;
+		nr_tags -= wait->nr_tags;
+	}
+	spin_unlock_irq(&ws->wait.lock);
+
+	/*
+	 * If nr_tags is not 0, trigger additional wakeups to handle the race
+	 * where new threads start waiting before wake_up_nr() is called.
+	 */
+	return nr + nr_tags;
+}
+
 static bool __sbq_wake_up(struct sbitmap_queue *sbq)
 {
 	struct sbq_wait_state *ws;
@@ -672,7 +698,7 @@ static bool __sbq_wake_up(struct sbitmap_queue *sbq)
 	smp_mb__before_atomic();
 	atomic_set(&ws->wait_cnt, wake_batch);
 	sbq_update_preemption(sbq, wake_batch);
-	wake_up_nr(&ws->wait, wake_batch);
+	wake_up_nr(&ws->wait, get_wake_nr(ws, wake_batch));
 
 	return false;
 }
-- 
2.31.1



* Re: [PATCH -next RFC v3 0/8] improve tag allocation under heavy load
  2022-04-15 10:10 [PATCH -next RFC v3 0/8] improve tag allocation under heavy load Yu Kuai
                   ` (7 preceding siblings ...)
  2022-04-15 10:10 ` [PATCH -next RFC v3 8/8] sbitmap: wake up the number of threads based on required tags Yu Kuai
@ 2022-04-24  2:43 ` yukuai (C)
  2022-04-25  3:24   ` Damien Le Moal
  2022-04-25  3:09 ` Bart Van Assche
  9 siblings, 1 reply; 22+ messages in thread
From: yukuai (C) @ 2022-04-24  2:43 UTC (permalink / raw)
  To: axboe, bvanassche, andriy.shevchenko, john.garry, ming.lei, qiulaibin
  Cc: linux-block, linux-kernel, yi.zhang

friendly ping ...

On 2022/04/15 18:10, Yu Kuai wrote:
> Changes in v3:
>   - update 'waiters_cnt' before 'ws_active' in sbitmap_prepare_to_wait()
>   in patch 1, in case __sbq_wake_up() see 'ws_active > 0' while
>   'waiters_cnt' are all 0, which will cause deap loop.
>   - don't add 'wait_index' during each loop in patch 2
>   - fix that 'wake_index' might mismatch in the first wake up in patch 3,
>   also improving coding for the patch.
>   - add a detection in patch 4 in case io hung is triggered in corner
>   cases.
>   - make the detection, free tags are sufficient, more flexible.
>   - fix a race in patch 8.
>   - fix some words and add some comments.
> 
> Changes in v2:
>   - use a new title
>   - add patches to fix waitqueues' unfairness - path 1-3
>   - delete patch to add queue flag
>   - delete patch to split big io thoroughly
> 
> In this patchset:
>   - patch 1-3 fix waitqueues' unfairness.
>   - patch 4,5 disable tag preemption on heavy load.
>   - patch 6 forces tag preemption for split bios.
>   - patch 7,8 improve large random io for HDD. We do meet the problem and
>   I'm trying to fix it at very low cost. However, if anyone still thinks
>   this is not a common case and not worth to optimize, I'll drop them.
> 
> There is a defect for blk-mq compare to blk-sq, specifically split io
> will end up discontinuous if the device is under high io pressure, while
> split io will still be continuous in sq, this is because:
> 
> 1) new io can preempt tag even if there are lots of threads waiting.
> 2) split bio is issued one by one, if one bio can't get tag, it will go
> to wail.
> 3) each time 8(or wake batch) requests is done, 8 waiters will be woken up.
> Thus if a thread is woken up, it will unlikey to get multiple tags.
> 
> The problem was first found by upgrading kernel from v3.10 to v4.18,
> test device is HDD with 256 'max_sectors_kb', and test case is issuing 1m
> ios with high concurrency.
> 
> Noted that there is a precondition for such performance problem:
> There is a certain gap between bandwidth for single io with
> bs=max_sectors_kb and disk upper limit.
> 
> During the test, I found that waitqueues can be extremly unbalanced on
> heavy load. This is because 'wake_index' is not set properly in
> __sbq_wake_up(), see details in patch 3.
> 
> Test environment:
> arm64, 96 core with 200 BogoMIPS, test device is HDD. The default
> 'max_sectors_kb' is 1280(Sorry that I was unable to test on the machine
> where 'max_sectors_kb' is 256).
> 
> The single io performance(randwrite):
> 
> | bs       | 128k | 256k | 512k | 1m   | 1280k | 2m   | 4m   |
> | -------- | ---- | ---- | ---- | ---- | ----- | ---- | ---- |
> | bw MiB/s | 20.1 | 33.4 | 51.8 | 67.1 | 74.7  | 82.9 | 82.9 |
> 
> It can be seen that 1280k io is already close to upper limit, and it'll
> be hard to see differences with the default value, thus I set
> 'max_sectors_kb' to 128 in the following test.
> 
> Test cmd:
>          fio \
>          -filename=/dev/$dev \
>          -name=test \
>          -ioengine=psync \
>          -allow_mounted_write=0 \
>          -group_reporting \
>          -direct=1 \
>          -offset_increment=1g \
>          -rw=randwrite \
>          -bs=1024k \
>          -numjobs={1,2,4,8,16,32,64,128,256,512} \
>          -runtime=110 \
>          -ramp_time=10
> 
> Test result: MiB/s
> 
> | numjobs | v5.18-rc1 | v5.18-rc1-patched |
> | ------- | --------- | ----------------- |
> | 1       | 67.7      | 67.7              |
> | 2       | 67.7      | 67.7              |
> | 4       | 67.7      | 67.7              |
> | 8       | 67.7      | 67.7              |
> | 16      | 64.8      | 65.6              |
> | 32      | 59.8      | 63.8              |
> | 64      | 54.9      | 59.4              |
> | 128     | 49        | 56.9              |
> | 256     | 37.7      | 58.3              |
> | 512     | 31.8      | 57.9              |
> 
> Yu Kuai (8):
>    sbitmap: record the number of waiters for each waitqueue
>    blk-mq: call 'bt_wait_ptr()' later in blk_mq_get_tag()
>    sbitmap: make sure waitqueues are balanced
>    blk-mq: don't preempt tag under heavy load
>    sbitmap: force tag preemption if free tags are sufficient
>    blk-mq: force tag preemption for split bios
>    blk-mq: record how many tags are needed for splited bio
>    sbitmap: wake up the number of threads based on required tags
> 
>   block/blk-merge.c         |   8 +-
>   block/blk-mq-tag.c        |  49 +++++++++----
>   block/blk-mq.c            |  54 +++++++++++++-
>   block/blk-mq.h            |   4 +
>   include/linux/blk_types.h |   4 +
>   include/linux/sbitmap.h   |   9 +++
>   lib/sbitmap.c             | 149 +++++++++++++++++++++++++++-----------
>   7 files changed, 216 insertions(+), 61 deletions(-)
> 


* Re: [PATCH -next RFC v3 0/8] improve tag allocation under heavy load
  2022-04-15 10:10 [PATCH -next RFC v3 0/8] improve tag allocation under heavy load Yu Kuai
                   ` (8 preceding siblings ...)
  2022-04-24  2:43 ` [PATCH -next RFC v3 0/8] improve tag allocation under heavy load yukuai (C)
@ 2022-04-25  3:09 ` Bart Van Assche
  2022-04-25  3:27   ` yukuai (C)
  9 siblings, 1 reply; 22+ messages in thread
From: Bart Van Assche @ 2022-04-25  3:09 UTC (permalink / raw)
  To: Yu Kuai, axboe, andriy.shevchenko, john.garry, ming.lei, qiulaibin
  Cc: linux-block, linux-kernel, yi.zhang

On 4/15/22 03:10, Yu Kuai wrote:
> The single io performance(randwrite):
> 
> | bs       | 128k | 256k | 512k | 1m   | 1280k | 2m   | 4m   |
> | -------- | ---- | ---- | ---- | ---- | ----- | ---- | ---- |
> | bw MiB/s | 20.1 | 33.4 | 51.8 | 67.1 | 74.7  | 82.9 | 82.9 |

Although the above data is interesting, it is not sufficient. The above 
data comes from a setup with a single hard disk. There are many other 
configurations that are relevant (hard disk array, high speed NVMe, QD=1 
USB stick, ...) but for which no conclusions can be drawn from the above 
data.

Another question is whether the approach of this patch series is the 
right approach? I would expect that round-robin wakeup of waiters would 
be ideal from a fairness point of view. However, there are patches in 
this patch series that guarantee that wakeup of tag waiters won't happen 
in a round robin fashion.

Thanks,

Bart.


* Re: [PATCH -next RFC v3 0/8] improve tag allocation under heavy load
  2022-04-24  2:43 ` [PATCH -next RFC v3 0/8] improve tag allocation under heavy load yukuai (C)
@ 2022-04-25  3:24   ` Damien Le Moal
  2022-04-25  6:14     ` yukuai (C)
  0 siblings, 1 reply; 22+ messages in thread
From: Damien Le Moal @ 2022-04-25  3:24 UTC (permalink / raw)
  To: yukuai (C),
	axboe, bvanassche, andriy.shevchenko, john.garry, ming.lei,
	qiulaibin
  Cc: linux-block, linux-kernel, yi.zhang

On 4/24/22 11:43, yukuai (C) wrote:
> friendly ping ...
> 
> On 2022/04/15 18:10, Yu Kuai wrote:
>> [...]
>>
>> Test environment:
>> arm64, 96 core with 200 BogoMIPS, test device is HDD. The default
>> 'max_sectors_kb' is 1280(Sorry that I was unable to test on the machine
>> where 'max_sectors_kb' is 256).
>> The single io performance(randwrite):
>>
>> | bs       | 128k | 256k | 512k | 1m   | 1280k | 2m   | 4m   |
>> | -------- | ---- | ---- | ---- | ---- | ----- | ---- | ---- |
>> | bw MiB/s | 20.1 | 33.4 | 51.8 | 67.1 | 74.7  | 82.9 | 82.9 |

These results are extremely strange, unless you are running with the
device write cache disabled ? If you have the device write cache enabled,
the problem you mention above would be most likely completely invisible,
which I guess is why nobody really noticed any issue until now.

Similarly, with reads, the device side read-ahead may hide the problem,
albeit that depends on how "intelligent" the drive is at identifying
sequential accesses.

>>
>> It can be seen that 1280k io is already close to upper limit, and it'll
>> be hard to see differences with the default value, thus I set
>> 'max_sectors_kb' to 128 in the following test.
>>
>> Test cmd:
>>          fio \
>>          -filename=/dev/$dev \
>>          -name=test \
>>          -ioengine=psync \
>>          -allow_mounted_write=0 \
>>          -group_reporting \
>>          -direct=1 \
>>          -offset_increment=1g \
>>          -rw=randwrite \
>>          -bs=1024k \
>>          -numjobs={1,2,4,8,16,32,64,128,256,512} \
>>          -runtime=110 \
>>          -ramp_time=10
>>
>> Test result: MiB/s
>>
>> | numjobs | v5.18-rc1 | v5.18-rc1-patched |
>> | ------- | --------- | ----------------- |
>> | 1       | 67.7      | 67.7              |
>> | 2       | 67.7      | 67.7              |
>> | 4       | 67.7      | 67.7              |
>> | 8       | 67.7      | 67.7              |
>> | 16      | 64.8      | 65.6              |
>> | 32      | 59.8      | 63.8              |
>> | 64      | 54.9      | 59.4              |
>> | 128     | 49        | 56.9              |
>> | 256     | 37.7      | 58.3              |
>> | 512     | 31.8      | 57.9              |

Device write cache disabled ?

Also, what is the max QD of this disk ?

E.g., if it is SATA, it is 32, so you will only get at most 64 scheduler
tags. So for any of your tests with more than 64 threads, many of the
threads will be waiting for a scheduler tag for the BIO before the
bio_split problem you explain triggers. Given that the numbers you show
are the same for before-after patch with a number of threads <= 64, I am
tempted to think that the problem is not really BIO splitting...

What about random read workloads ? What kind of results do you see ?

>>
>> Yu Kuai (8):
>>    sbitmap: record the number of waiters for each waitqueue
>>    blk-mq: call 'bt_wait_ptr()' later in blk_mq_get_tag()
>>    sbitmap: make sure waitqueues are balanced
>>    blk-mq: don't preempt tag under heavy load
>>    sbitmap: force tag preemption if free tags are sufficient
>>    blk-mq: force tag preemption for split bios
>>    blk-mq: record how many tags are needed for splited bio
>>    sbitmap: wake up the number of threads based on required tags
>>
>>   block/blk-merge.c         |   8 +-
>>   block/blk-mq-tag.c        |  49 +++++++++----
>>   block/blk-mq.c            |  54 +++++++++++++-
>>   block/blk-mq.h            |   4 +
>>   include/linux/blk_types.h |   4 +
>>   include/linux/sbitmap.h   |   9 +++
>>   lib/sbitmap.c             | 149 +++++++++++++++++++++++++++-----------
>>   7 files changed, 216 insertions(+), 61 deletions(-)
>>


-- 
Damien Le Moal
Western Digital Research


* Re: [PATCH -next RFC v3 0/8] improve tag allocation under heavy load
  2022-04-25  3:09 ` Bart Van Assche
@ 2022-04-25  3:27   ` yukuai (C)
  0 siblings, 0 replies; 22+ messages in thread
From: yukuai (C) @ 2022-04-25  3:27 UTC (permalink / raw)
  To: Bart Van Assche, axboe, andriy.shevchenko, john.garry, ming.lei,
	qiulaibin
  Cc: linux-block, linux-kernel, yi.zhang

On 2022/04/25 11:09, Bart Van Assche wrote:
> On 4/15/22 03:10, Yu Kuai wrote:
>> The single io performance(randwrite):
>>
>> | bs       | 128k | 256k | 512k | 1m   | 1280k | 2m   | 4m   |
>> | -------- | ---- | ---- | ---- | ---- | ----- | ---- | ---- |
>> | bw MiB/s | 20.1 | 33.4 | 51.8 | 67.1 | 74.7  | 82.9 | 82.9 |
> 
> Although the above data is interesting, it is not sufficient. The above 
> data comes from a setup with a single hard disk. There are many other 
> configurations that are relevant (hard disk array, high speed NVMe, QD=1 
> USB stick, ...) but for which no conclusions can be drawn from the above 
> data.
Hi,

The original idea is to improve large-bs randwrite performance on HDD;
here I just tested that specific case on a single HDD. You're right
that many other test cases and configurations are relevant.
> 
> Another question is whether the approach of this patch series is the 
> right approach? I would expect that round-robin wakeup of waiters would 
> be ideal from a fairness point of view. However, there are patches in 
> this patch series that guarantee that wakeup of tag waiters won't happen 
> in a round robin fashion.

I was thinking that round-robin can't guarantee fairness in the corner
case where the 8 waitqueues are not balanced. For example, if one
waitqueue somehow has lots of waiters, it's better to handle them
before newly arrived waiters in the other waitqueues.

What do you think about this approach: keep round-robin wakeup if the
waitqueues are balanced, otherwise choose the waitqueue with the most
waiters.

Thanks,
Kuai
> 
> Thanks,
> 
> Bart.
> .
> 


* Re: [PATCH -next RFC v3 0/8] improve tag allocation under heavy load
  2022-04-25  3:24   ` Damien Le Moal
@ 2022-04-25  6:14     ` yukuai (C)
  2022-04-25  6:23       ` Damien Le Moal
  0 siblings, 1 reply; 22+ messages in thread
From: yukuai (C) @ 2022-04-25  6:14 UTC (permalink / raw)
  To: Damien Le Moal, axboe, bvanassche, andriy.shevchenko, john.garry,
	ming.lei, qiulaibin
  Cc: linux-block, linux-kernel, yi.zhang

On 2022/04/25 11:24, Damien Le Moal wrote:
> On 4/24/22 11:43, yukuai (C) wrote:
>> friendly ping ...
>>
>> On 2022/04/15 18:10, Yu Kuai wrote:
>>> [...]
>>> The single io performance(randwrite):
>>>
>>> | bs       | 128k | 256k | 512k | 1m   | 1280k | 2m   | 4m   |
>>> | -------- | ---- | ---- | ---- | ---- | ----- | ---- | ---- |
>>> | bw MiB/s | 20.1 | 33.4 | 51.8 | 67.1 | 74.7  | 82.9 | 82.9 |
> 
> These results are extremely strange, unless you are running with the
> device write cache disabled ? If you have the device write cache enabled,
> the problem you mention above would be most likely completely invisible,
> which I guess is why nobody really noticed any issue until now.
> 
> Similarly, with reads, the device side read-ahead may hide the problem,
> albeit that depends on how "intelligent" the drive is at identifying
> sequential accesses.
> 
>>>
>>> [...]
>>>
>>> Test result: MiB/s
>>>
>>> | numjobs | v5.18-rc1 | v5.18-rc1-patched |
>>> | ------- | --------- | ----------------- |
>>> | 1       | 67.7      | 67.7              |
>>> | 2       | 67.7      | 67.7              |
>>> | 4       | 67.7      | 67.7              |
>>> | 8       | 67.7      | 67.7              |
>>> | 16      | 64.8      | 65.6              |
>>> | 32      | 59.8      | 63.8              |
>>> | 64      | 54.9      | 59.4              |
>>> | 128     | 49        | 56.9              |
>>> | 256     | 37.7      | 58.3              |
>>> | 512     | 31.8      | 57.9              |
> 
> Device write cache disabled ?
> 
> Also, what is the max QD of this disk ?
> 
> E.g., if it is SATA, it is 32, so you will only get at most 64 scheduler
> tags. So for any of your tests with more than 64 threads, many of the
> threads will be waiting for a scheduler tag for the BIO before the
> bio_split problem you explain triggers. Given that the numbers you show
> are the same for before-after patch with a number of threads <= 64, I am
> tempted to think that the problem is not really BIO splitting...
> 
> What about random read workloads ? What kind of results do you see ?

Hi,

Sorry that this test case was misleading.

This test case is high-concurrency huge randwrite; it's just for the
problem that split bios won't be issued continuously, which is the
root cause of the performance degradation as numjobs increases.

queue_depth is 32, and numjobs is 64; thus, when numjobs is not
greater than 8, performance is fine, because the ratio of sequential
io should be 7/8. However, as numjobs increases, performance gets
worse because the ratio is lower. For example, when numjobs is 512,
the ratio of sequential io is about 20%.

Patches 6-8 let split bios still be issued continuously under high
pressure.

Thanks,
Kuai



* Re: [PATCH -next RFC v3 0/8] improve tag allocation under heavy load
  2022-04-25  6:14     ` yukuai (C)
@ 2022-04-25  6:23       ` Damien Le Moal
  2022-04-25  6:47         ` yukuai (C)
  0 siblings, 1 reply; 22+ messages in thread
From: Damien Le Moal @ 2022-04-25  6:23 UTC (permalink / raw)
  To: yukuai (C),
	axboe, bvanassche, andriy.shevchenko, john.garry, ming.lei,
	qiulaibin
  Cc: linux-block, linux-kernel, yi.zhang

On 4/25/22 15:14, yukuai (C) wrote:
> On 2022/04/25 11:24, Damien Le Moal wrote:
>> On 4/24/22 11:43, yukuai (C) wrote:
>>> friendly ping ...
>>>
>>> On 2022/04/15 18:10, Yu Kuai wrote:
>>>> [...]
>>>> The single io performance(randwrite):
>>>>
>>>> | bs       | 128k | 256k | 512k | 1m   | 1280k | 2m   | 4m   |
>>>> | -------- | ---- | ---- | ---- | ---- | ----- | ---- | ---- |
>>>> | bw MiB/s | 20.1 | 33.4 | 51.8 | 67.1 | 74.7  | 82.9 | 82.9 |
>>
>> These results are extremely strange, unless you are running with the
>> device write cache disabled ? If you have the device write cache enabled,
>> the problem you mention above would be most likely completely invisible,
>> which I guess is why nobody really noticed any issue until now.
>>
>> Similarly, with reads, the device side read-ahead may hide the problem,
>> albeit that depends on how "intelligent" the drive is at identifying
>> sequential accesses.
>>
>>>>
>>>> [...]
>>>>
>>>> Test result: MiB/s
>>>>
>>>> | numjobs | v5.18-rc1 | v5.18-rc1-patched |
>>>> | ------- | --------- | ----------------- |
>>>> | 1       | 67.7      | 67.7              |
>>>> | 2       | 67.7      | 67.7              |
>>>> | 4       | 67.7      | 67.7              |
>>>> | 8       | 67.7      | 67.7              |
>>>> | 16      | 64.8      | 65.6              |
>>>> | 32      | 59.8      | 63.8              |
>>>> | 64      | 54.9      | 59.4              |
>>>> | 128     | 49        | 56.9              |
>>>> | 256     | 37.7      | 58.3              |
>>>> | 512     | 31.8      | 57.9              |
>>
>> Device write cache disabled ?
>>
>> Also, what is the max QD of this disk ?
>>
>> E.g., if it is SATA, it is 32, so you will only get at most 64 scheduler
>> tags. So for any of your tests with more than 64 threads, many of the
>> threads will be waiting for a scheduler tag for the BIO before the
>> bio_split problem you explain triggers. Given that the numbers you show
>> are the same for before-after patch with a number of threads <= 64, I am
>> tempted to think that the problem is not really BIO splitting...
>>
>> What about random read workloads ? What kind of results do you see ?
> 
> Hi,
> 
> Sorry that this test case was misleading.
> 
> This test case is high-concurrency large randwrite; it just targets the
> problem that split bios won't be issued continuously, which is the
> root cause of the performance degradation as numjobs increases.
> 
> queue_depth is 32, and there are 64 scheduler tags, thus when numjobs
> is not greater than 8, performance is fine, because the ratio of
> sequential io should be 7/8. However, as numjobs increases, performance
> is worse because the ratio is lower. For example, when numjobs is 512,
> the ratio of sequential io is about 20%.

But with 512 jobs, you will get only 64 jobs with IOs in the queue.
All other jobs will be waiting for a scheduler tag before being able to
issue their large BIO. No ?

It sounds like the set of scheduler tags should be a bit more elastic:
always allow BIOs from a split of a large BIO to be submitted (that is to
get a scheduler tag) even if that causes a temporary excess of the number
of requests beyond the default number of scheduler tags. Doing so, all
fragments of a large BIO can be queued immediately. From there, if the
scheduler operates correctly, all the requests from the large BIO split
would be issued in sequence to the device.


> 
> patch 6-8 will let split bios still be issued continuously under high
> pressure.
> 
> Thanks,
> Kuai
> 


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH -next RFC v3 0/8] improve tag allocation under heavy load
  2022-04-25  6:23       ` Damien Le Moal
@ 2022-04-25  6:47         ` yukuai (C)
  2022-04-25  6:50           ` Damien Le Moal
  0 siblings, 1 reply; 22+ messages in thread
From: yukuai (C) @ 2022-04-25  6:47 UTC (permalink / raw)
  To: Damien Le Moal, axboe, bvanassche, andriy.shevchenko, john.garry,
	ming.lei, qiulaibin
  Cc: linux-block, linux-kernel, yi.zhang

On 2022/04/25 14:23, Damien Le Moal wrote:
> On 4/25/22 15:14, yukuai (C) wrote:
>> On 2022/04/25 11:24, Damien Le Moal wrote:
>>> On 4/24/22 11:43, yukuai (C) wrote:
>>>> friendly ping ...
>>>>
>>>> On 2022/04/15 18:10, Yu Kuai wrote:
>>>>> Changes in v3:
>>>>>     - update 'waiters_cnt' before 'ws_active' in sbitmap_prepare_to_wait()
>>>>>     in patch 1, in case __sbq_wake_up() sees 'ws_active > 0' while
>>>>>     'waiters_cnt' are all 0, which would cause a dead loop.
>>>>>     - don't add 'wait_index' during each loop in patch 2
>>>>>     - fix that 'wake_index' might mismatch in the first wake up in patch 3,
>>>>>     also improving coding for the patch.
>>>>>     - add a detection in patch 4 in case an io hang is triggered in corner
>>>>>     cases.
>>>>>     - make the detection of whether free tags are sufficient more flexible.
>>>>>     - fix a race in patch 8.
>>>>>     - fix some words and add some comments.
>>>>>
>>>>> Changes in v2:
>>>>>     - use a new title
>>>>>     - add patches to fix waitqueues' unfairness - patch 1-3
>>>>>     - delete patch to add queue flag
>>>>>     - delete patch to split big io thoroughly
>>>>>
>>>>> In this patchset:
>>>>>     - patch 1-3 fix waitqueues' unfairness.
>>>>>     - patch 4,5 disable tag preemption on heavy load.
>>>>>     - patch 6 forces tag preemption for split bios.
>>>>>     - patch 7,8 improve large random io for HDD. We do meet the problem and
>>>>>     I'm trying to fix it at very low cost. However, if anyone still thinks
>>>>>     this is not a common case and not worth to optimize, I'll drop them.
>>>>>
>>>>> There is a defect in blk-mq compared to blk-sq: split io will end up
>>>>> discontinuous if the device is under high io pressure, while split io
>>>>> will still be continuous in sq. This is because:
>>>>>
>>>>> 1) new io can preempt a tag even if there are lots of threads waiting.
>>>>> 2) split bios are issued one by one; if one bio can't get a tag, it will
>>>>> go to wait.
>>>>> 3) each time 8 (or wake batch) requests are done, 8 waiters will be woken up.
>>>>> Thus if a thread is woken up, it is unlikely to get multiple tags.
>>>>>
>>>>> The problem was first found when upgrading the kernel from v3.10 to
>>>>> v4.18; the test device is an HDD with 256 'max_sectors_kb', and the test
>>>>> case issues 1m ios with high concurrency.
>>>>>
>>>>> Note that there is a precondition for such a performance problem:
>>>>> There is a certain gap between bandwidth for single io with
>>>>> bs=max_sectors_kb and disk upper limit.
>>>>>
>>>>> During the test, I found that waitqueues can be extremely unbalanced on
>>>>> heavy load. This is because 'wake_index' is not set properly in
>>>>> __sbq_wake_up(), see details in patch 3.
>>>>>
>>>>> Test environment:
>>>>> arm64, 96 cores with 200 BogoMIPS; the test device is an HDD. The default
>>>>> 'max_sectors_kb' is 1280 (sorry that I was unable to test on the machine
>>>>> where 'max_sectors_kb' is 256).
>>>>> The single io performance(randwrite):
>>>>>
>>>>> | bs       | 128k | 256k | 512k | 1m   | 1280k | 2m   | 4m   |
>>>>> | -------- | ---- | ---- | ---- | ---- | ----- | ---- | ---- |
>>>>> | bw MiB/s | 20.1 | 33.4 | 51.8 | 67.1 | 74.7  | 82.9 | 82.9 |
>>>
>>> These results are extremely strange, unless you are running with the
>>> device write cache disabled ? If you have the device write cache enabled,
>>> the problem you mention above would be most likely completely invisible,
>>> which I guess is why nobody really noticed any issue until now.
>>>
>>> Similarly, with reads, the device side read-ahead may hide the problem,
>>> albeit that depends on how "intelligent" the drive is at identifying
>>> sequential accesses.
>>>
>>>>>
>>>>> It can be seen that 1280k io is already close to the upper limit, and it'll
>>>>> be hard to see differences with the default value, thus I set
>>>>> 'max_sectors_kb' to 128 in the following test.
>>>>>
>>>>> Test cmd:
>>>>>            fio \
>>>>>            -filename=/dev/$dev \
>>>>>            -name=test \
>>>>>            -ioengine=psync \
>>>>>            -allow_mounted_write=0 \
>>>>>            -group_reporting \
>>>>>            -direct=1 \
>>>>>            -offset_increment=1g \
>>>>>            -rw=randwrite \
>>>>>            -bs=1024k \
>>>>>            -numjobs={1,2,4,8,16,32,64,128,256,512} \
>>>>>            -runtime=110 \
>>>>>            -ramp_time=10
>>>>>
>>>>> Test result: MiB/s
>>>>>
>>>>> | numjobs | v5.18-rc1 | v5.18-rc1-patched |
>>>>> | ------- | --------- | ----------------- |
>>>>> | 1       | 67.7      | 67.7              |
>>>>> | 2       | 67.7      | 67.7              |
>>>>> | 4       | 67.7      | 67.7              |
>>>>> | 8       | 67.7      | 67.7              |
>>>>> | 16      | 64.8      | 65.6              |
>>>>> | 32      | 59.8      | 63.8              |
>>>>> | 64      | 54.9      | 59.4              |
>>>>> | 128     | 49        | 56.9              |
>>>>> | 256     | 37.7      | 58.3              |
>>>>> | 512     | 31.8      | 57.9              |
>>>
>>> Device write cache disabled ?
>>>
>>> Also, what is the max QD of this disk ?
>>>
>>> E.g., if it is SATA, it is 32, so you will only get at most 64 scheduler
>>> tags. So for any of your tests with more than 64 threads, many of the
>>> threads will be waiting for a scheduler tag for the BIO before the
>>> bio_split problem you explain triggers. Given that the numbers you show
>>> are the same for before-after patch with a number of threads <= 64, I am
>>> tempted to think that the problem is not really BIO splitting...
>>>
>>> What about random read workloads ? What kind of results do you see ?
>>
>> Hi,
>>
>> Sorry that this test case was misleading.
>> 
>> This test case is high-concurrency large randwrite; it just targets the
>> problem that split bios won't be issued continuously, which is the
>> root cause of the performance degradation as numjobs increases.
>> 
>> queue_depth is 32, and there are 64 scheduler tags, thus when numjobs
>> is not greater than 8, performance is fine, because the ratio of
>> sequential io should be 7/8. However, as numjobs increases, performance
>> is worse because the ratio is lower. For example, when numjobs is 512,
>> the ratio of sequential io is about 20%.
> 
> But with 512 jobs, you will get only 64 jobs with IOs in the queue.
> All other jobs will be waiting for a scheduler tag before being able to
> issue their large BIO. No ?

Hi,

That's right.

In fact, after this patchset, since each large io needs 8 tags in
total, only 8 jobs can be in the queue while others are waiting for a
scheduler tag.

> 
> It sounds like the set of scheduler tags should be a bit more elastic:
> always allow BIOs from a split of a large BIO to be submitted (that is to
> get a scheduler tag) even if that causes a temporary excess of the number
> of requests beyond the default number of scheduler tags. Doing so, all
> fragments of a large BIO can be queued immediately. From there, if the
> scheduler operates correctly, all the requests from the large BIO split
> would be issued in sequence to the device.

This solution sounds feasible in theory; however, I'm not sure yet how
to implement that 'temporary excess'.

Thanks,
Kuai
> 
> 
>>
>> patch 6-8 will let split bios still be issued continuously under high
>> pressure.
>>
>> Thanks,
>> Kuai
>>
> 
> 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH -next RFC v3 0/8] improve tag allocation under heavy load
  2022-04-25  6:47         ` yukuai (C)
@ 2022-04-25  6:50           ` Damien Le Moal
  2022-04-25  7:05             ` yukuai (C)
  0 siblings, 1 reply; 22+ messages in thread
From: Damien Le Moal @ 2022-04-25  6:50 UTC (permalink / raw)
  To: yukuai (C),
	axboe, bvanassche, andriy.shevchenko, john.garry, ming.lei,
	qiulaibin
  Cc: linux-block, linux-kernel, yi.zhang

On 4/25/22 15:47, yukuai (C) wrote:
> On 2022/04/25 14:23, Damien Le Moal wrote:
>> On 4/25/22 15:14, yukuai (C) wrote:
>>> On 2022/04/25 11:24, Damien Le Moal wrote:
>>>> On 4/24/22 11:43, yukuai (C) wrote:
>>>>> friendly ping ...
>>>>>
>>>>> On 2022/04/15 18:10, Yu Kuai wrote:
>>>>>> Changes in v3:
>>>>>>     - update 'waiters_cnt' before 'ws_active' in sbitmap_prepare_to_wait()
>>>>>>     in patch 1, in case __sbq_wake_up() sees 'ws_active > 0' while
>>>>>>     'waiters_cnt' are all 0, which would cause a dead loop.
>>>>>>     - don't add 'wait_index' during each loop in patch 2
>>>>>>     - fix that 'wake_index' might mismatch in the first wake up in patch 3,
>>>>>>     also improving coding for the patch.
>>>>>>      - add a detection in patch 4 in case an io hang is triggered in corner
>>>>>>      cases.
>>>>>>      - make the detection of whether free tags are sufficient more flexible.
>>>>>>     - fix a race in patch 8.
>>>>>>     - fix some words and add some comments.
>>>>>>
>>>>>> Changes in v2:
>>>>>>     - use a new title
>>>>>>      - add patches to fix waitqueues' unfairness - patch 1-3
>>>>>>     - delete patch to add queue flag
>>>>>>     - delete patch to split big io thoroughly
>>>>>>
>>>>>> In this patchset:
>>>>>>     - patch 1-3 fix waitqueues' unfairness.
>>>>>>     - patch 4,5 disable tag preemption on heavy load.
>>>>>>     - patch 6 forces tag preemption for split bios.
>>>>>>     - patch 7,8 improve large random io for HDD. We do meet the problem and
>>>>>>     I'm trying to fix it at very low cost. However, if anyone still thinks
>>>>>>     this is not a common case and not worth to optimize, I'll drop them.
>>>>>>
>>>>>> There is a defect in blk-mq compared to blk-sq: split io will end up
>>>>>> discontinuous if the device is under high io pressure, while split io
>>>>>> will still be continuous in sq. This is because:
>>>>>>
>>>>>> 1) new io can preempt a tag even if there are lots of threads waiting.
>>>>>> 2) split bios are issued one by one; if one bio can't get a tag, it will
>>>>>> go to wait.
>>>>>> 3) each time 8 (or wake batch) requests are done, 8 waiters will be woken up.
>>>>>> Thus if a thread is woken up, it is unlikely to get multiple tags.
>>>>>>
>>>>>> The problem was first found when upgrading the kernel from v3.10 to
>>>>>> v4.18; the test device is an HDD with 256 'max_sectors_kb', and the test
>>>>>> case issues 1m ios with high concurrency.
>>>>>>
>>>>>> Note that there is a precondition for such a performance problem:
>>>>>> There is a certain gap between bandwidth for single io with
>>>>>> bs=max_sectors_kb and disk upper limit.
>>>>>>
>>>>>> During the test, I found that waitqueues can be extremely unbalanced on
>>>>>> heavy load. This is because 'wake_index' is not set properly in
>>>>>> __sbq_wake_up(), see details in patch 3.
>>>>>>
>>>>>> Test environment:
>>>>>> arm64, 96 cores with 200 BogoMIPS; the test device is an HDD. The default
>>>>>> 'max_sectors_kb' is 1280 (sorry that I was unable to test on the machine
>>>>>> where 'max_sectors_kb' is 256).
>>>>>> The single io performance(randwrite):
>>>>>>
>>>>>> | bs       | 128k | 256k | 512k | 1m   | 1280k | 2m   | 4m   |
>>>>>> | -------- | ---- | ---- | ---- | ---- | ----- | ---- | ---- |
>>>>>> | bw MiB/s | 20.1 | 33.4 | 51.8 | 67.1 | 74.7  | 82.9 | 82.9 |
>>>>
>>>> These results are extremely strange, unless you are running with the
>>>> device write cache disabled ? If you have the device write cache enabled,
>>>> the problem you mention above would be most likely completely invisible,
>>>> which I guess is why nobody really noticed any issue until now.
>>>>
>>>> Similarly, with reads, the device side read-ahead may hide the problem,
>>>> albeit that depends on how "intelligent" the drive is at identifying
>>>> sequential accesses.
>>>>
>>>>>>
>>>>>> It can be seen that 1280k io is already close to the upper limit, and it'll
>>>>>> be hard to see differences with the default value, thus I set
>>>>>> 'max_sectors_kb' to 128 in the following test.
>>>>>>
>>>>>> Test cmd:
>>>>>>            fio \
>>>>>>            -filename=/dev/$dev \
>>>>>>            -name=test \
>>>>>>            -ioengine=psync \
>>>>>>            -allow_mounted_write=0 \
>>>>>>            -group_reporting \
>>>>>>            -direct=1 \
>>>>>>            -offset_increment=1g \
>>>>>>            -rw=randwrite \
>>>>>>            -bs=1024k \
>>>>>>            -numjobs={1,2,4,8,16,32,64,128,256,512} \
>>>>>>            -runtime=110 \
>>>>>>            -ramp_time=10
>>>>>>
>>>>>> Test result: MiB/s
>>>>>>
>>>>>> | numjobs | v5.18-rc1 | v5.18-rc1-patched |
>>>>>> | ------- | --------- | ----------------- |
>>>>>> | 1       | 67.7      | 67.7              |
>>>>>> | 2       | 67.7      | 67.7              |
>>>>>> | 4       | 67.7      | 67.7              |
>>>>>> | 8       | 67.7      | 67.7              |
>>>>>> | 16      | 64.8      | 65.6              |
>>>>>> | 32      | 59.8      | 63.8              |
>>>>>> | 64      | 54.9      | 59.4              |
>>>>>> | 128     | 49        | 56.9              |
>>>>>> | 256     | 37.7      | 58.3              |
>>>>>> | 512     | 31.8      | 57.9              |
>>>>
>>>> Device write cache disabled ?
>>>>
>>>> Also, what is the max QD of this disk ?
>>>>
>>>> E.g., if it is SATA, it is 32, so you will only get at most 64 scheduler
>>>> tags. So for any of your tests with more than 64 threads, many of the
>>>> threads will be waiting for a scheduler tag for the BIO before the
>>>> bio_split problem you explain triggers. Given that the numbers you show
>>>> are the same for before-after patch with a number of threads <= 64, I am
>>>> tempted to think that the problem is not really BIO splitting...
>>>>
>>>> What about random read workloads ? What kind of results do you see ?
>>>
>>> Hi,
>>>
>>> Sorry that this test case was misleading.
>>> 
>>> This test case is high-concurrency large randwrite; it just targets the
>>> problem that split bios won't be issued continuously, which is the
>>> root cause of the performance degradation as numjobs increases.
>>> 
>>> queue_depth is 32, and there are 64 scheduler tags, thus when numjobs
>>> is not greater than 8, performance is fine, because the ratio of
>>> sequential io should be 7/8. However, as numjobs increases, performance
>>> is worse because the ratio is lower. For example, when numjobs is 512,
>>> the ratio of sequential io is about 20%.
>>
>> But with 512 jobs, you will get only 64 jobs with IOs in the queue.
>> All other jobs will be waiting for a scheduler tag before being able to
>> issue their large BIO. No ?
> 
> Hi,
> 
> That's right.
> 
> In fact, after this patchset, since each large io needs 8 tags in
> total, only 8 jobs can be in the queue while others are waiting for a
> scheduler tag.
> 
>>
>> It sounds like the set of scheduler tags should be a bit more elastic:
>> always allow BIOs from a split of a large BIO to be submitted (that is to
>> get a scheduler tag) even if that causes a temporary excess of the number
>> of requests beyond the default number of scheduler tags. Doing so, all
>> fragments of a large BIO can be queued immediately. From there, if the
>> scheduler operates correctly, all the requests from the large BIO split
>> would be issued in sequence to the device.
> 
> This solution sounds feasible in theory; however, I'm not sure yet how
> to implement that 'temporary excess'.

It should not be too hard.

By the way, did you check that doing something like:

echo 2048 > /sys/block/sdX/queue/nr_requests

improves performance for your high number of jobs test case ?

> 
> Thanks,
> Kuai
>>
>>
>>>
>>> patch 6-8 will let split bios still be issued continuously under high
>>> pressure.
>>>
>>> Thanks,
>>> Kuai
>>>
>>
>>


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH -next RFC v3 0/8] improve tag allocation under heavy load
  2022-04-25  6:50           ` Damien Le Moal
@ 2022-04-25  7:05             ` yukuai (C)
  2022-04-25  7:06               ` Damien Le Moal
  0 siblings, 1 reply; 22+ messages in thread
From: yukuai (C) @ 2022-04-25  7:05 UTC (permalink / raw)
  To: Damien Le Moal, axboe, bvanassche, andriy.shevchenko, john.garry,
	ming.lei, qiulaibin
  Cc: linux-block, linux-kernel, yi.zhang

On 2022/04/25 14:50, Damien Le Moal wrote:
> On 4/25/22 15:47, yukuai (C) wrote:
>> On 2022/04/25 14:23, Damien Le Moal wrote:
>>> On 4/25/22 15:14, yukuai (C) wrote:
>>>> On 2022/04/25 11:24, Damien Le Moal wrote:
>>>>> On 4/24/22 11:43, yukuai (C) wrote:
>>>>>> friendly ping ...
>>>>>>
>>>>>> On 2022/04/15 18:10, Yu Kuai wrote:
>>>>>>> Changes in v3:
>>>>>>>      - update 'waiters_cnt' before 'ws_active' in sbitmap_prepare_to_wait()
>>>>>>>      in patch 1, in case __sbq_wake_up() sees 'ws_active > 0' while
>>>>>>>      'waiters_cnt' are all 0, which would cause a dead loop.
>>>>>>>      - don't add 'wait_index' during each loop in patch 2
>>>>>>>      - fix that 'wake_index' might mismatch in the first wake up in patch 3,
>>>>>>>      also improving coding for the patch.
>>>>>>>      - add a detection in patch 4 in case an io hang is triggered in corner
>>>>>>>      cases.
>>>>>>>      - make the detection of whether free tags are sufficient more flexible.
>>>>>>>      - fix a race in patch 8.
>>>>>>>      - fix some words and add some comments.
>>>>>>>
>>>>>>> Changes in v2:
>>>>>>>      - use a new title
>>>>>>>      - add patches to fix waitqueues' unfairness - patch 1-3
>>>>>>>      - delete patch to add queue flag
>>>>>>>      - delete patch to split big io thoroughly
>>>>>>>
>>>>>>> In this patchset:
>>>>>>>      - patch 1-3 fix waitqueues' unfairness.
>>>>>>>      - patch 4,5 disable tag preemption on heavy load.
>>>>>>>      - patch 6 forces tag preemption for split bios.
>>>>>>>      - patch 7,8 improve large random io for HDD. We do meet the problem and
>>>>>>>      I'm trying to fix it at very low cost. However, if anyone still thinks
>>>>>>>      this is not a common case and not worth to optimize, I'll drop them.
>>>>>>>
>>>>>>> There is a defect in blk-mq compared to blk-sq: split io will end up
>>>>>>> discontinuous if the device is under high io pressure, while split io
>>>>>>> will still be continuous in sq. This is because:
>>>>>>>
>>>>>>> 1) new io can preempt a tag even if there are lots of threads waiting.
>>>>>>> 2) split bios are issued one by one; if one bio can't get a tag, it will
>>>>>>> go to wait.
>>>>>>> 3) each time 8 (or wake batch) requests are done, 8 waiters will be woken up.
>>>>>>> Thus if a thread is woken up, it is unlikely to get multiple tags.
>>>>>>>
>>>>>>> The problem was first found when upgrading the kernel from v3.10 to
>>>>>>> v4.18; the test device is an HDD with 256 'max_sectors_kb', and the test
>>>>>>> case issues 1m ios with high concurrency.
>>>>>>>
>>>>>>> Note that there is a precondition for such a performance problem:
>>>>>>> There is a certain gap between bandwidth for single io with
>>>>>>> bs=max_sectors_kb and disk upper limit.
>>>>>>>
>>>>>>> During the test, I found that waitqueues can be extremely unbalanced on
>>>>>>> heavy load. This is because 'wake_index' is not set properly in
>>>>>>> __sbq_wake_up(), see details in patch 3.
>>>>>>>
>>>>>>> Test environment:
>>>>>>> arm64, 96 cores with 200 BogoMIPS; the test device is an HDD. The default
>>>>>>> 'max_sectors_kb' is 1280 (sorry that I was unable to test on the machine
>>>>>>> where 'max_sectors_kb' is 256).
>>>>>>> The single io performance(randwrite):
>>>>>>>
>>>>>>> | bs       | 128k | 256k | 512k | 1m   | 1280k | 2m   | 4m   |
>>>>>>> | -------- | ---- | ---- | ---- | ---- | ----- | ---- | ---- |
>>>>>>> | bw MiB/s | 20.1 | 33.4 | 51.8 | 67.1 | 74.7  | 82.9 | 82.9 |
>>>>>
>>>>> These results are extremely strange, unless you are running with the
>>>>> device write cache disabled ? If you have the device write cache enabled,
>>>>> the problem you mention above would be most likely completely invisible,
>>>>> which I guess is why nobody really noticed any issue until now.
>>>>>
>>>>> Similarly, with reads, the device side read-ahead may hide the problem,
>>>>> albeit that depends on how "intelligent" the drive is at identifying
>>>>> sequential accesses.
>>>>>
>>>>>>>
>>>>>>> It can be seen that 1280k io is already close to the upper limit, and it'll
>>>>>>> be hard to see differences with the default value, thus I set
>>>>>>> 'max_sectors_kb' to 128 in the following test.
>>>>>>>
>>>>>>> Test cmd:
>>>>>>>             fio \
>>>>>>>             -filename=/dev/$dev \
>>>>>>>             -name=test \
>>>>>>>             -ioengine=psync \
>>>>>>>             -allow_mounted_write=0 \
>>>>>>>             -group_reporting \
>>>>>>>             -direct=1 \
>>>>>>>             -offset_increment=1g \
>>>>>>>             -rw=randwrite \
>>>>>>>             -bs=1024k \
>>>>>>>             -numjobs={1,2,4,8,16,32,64,128,256,512} \
>>>>>>>             -runtime=110 \
>>>>>>>             -ramp_time=10
>>>>>>>
>>>>>>> Test result: MiB/s
>>>>>>>
>>>>>>> | numjobs | v5.18-rc1 | v5.18-rc1-patched |
>>>>>>> | ------- | --------- | ----------------- |
>>>>>>> | 1       | 67.7      | 67.7              |
>>>>>>> | 2       | 67.7      | 67.7              |
>>>>>>> | 4       | 67.7      | 67.7              |
>>>>>>> | 8       | 67.7      | 67.7              |
>>>>>>> | 16      | 64.8      | 65.6              |
>>>>>>> | 32      | 59.8      | 63.8              |
>>>>>>> | 64      | 54.9      | 59.4              |
>>>>>>> | 128     | 49        | 56.9              |
>>>>>>> | 256     | 37.7      | 58.3              |
>>>>>>> | 512     | 31.8      | 57.9              |
>>>>>
>>>>> Device write cache disabled ?
>>>>>
>>>>> Also, what is the max QD of this disk ?
>>>>>
>>>>> E.g., if it is SATA, it is 32, so you will only get at most 64 scheduler
>>>>> tags. So for any of your tests with more than 64 threads, many of the
>>>>> threads will be waiting for a scheduler tag for the BIO before the
>>>>> bio_split problem you explain triggers. Given that the numbers you show
>>>>> are the same for before-after patch with a number of threads <= 64, I am
>>>>> tempted to think that the problem is not really BIO splitting...
>>>>>
>>>>> What about random read workloads ? What kind of results do you see ?
>>>>
>>>> Hi,
>>>>
>>>> Sorry that this test case was misleading.
>>>> 
>>>> This test case is high-concurrency large randwrite; it just targets the
>>>> problem that split bios won't be issued continuously, which is the
>>>> root cause of the performance degradation as numjobs increases.
>>>> 
>>>> queue_depth is 32, and there are 64 scheduler tags, thus when numjobs
>>>> is not greater than 8, performance is fine, because the ratio of
>>>> sequential io should be 7/8. However, as numjobs increases, performance
>>>> is worse because the ratio is lower. For example, when numjobs is 512,
>>>> the ratio of sequential io is about 20%.
>>>
>>> But with 512 jobs, you will get only 64 jobs with IOs in the queue.
>>> All other jobs will be waiting for a scheduler tag before being able to
>>> issue their large BIO. No ?
>>
>> Hi,
>>
>> That's right.
>> 
>> In fact, after this patchset, since each large io needs 8 tags in
>> total, only 8 jobs can be in the queue while others are waiting for a
>> scheduler tag.
>>
>>>
>>> It sounds like the set of scheduler tags should be a bit more elastic:
>>> always allow BIOs from a split of a large BIO to be submitted (that is to
>>> get a scheduler tag) even if that causes a temporary excess of the number
>>> of requests beyond the default number of scheduler tags. Doing so, all
>>> fragments of a large BIO can be queued immediately. From there, if the
>>> scheduler operates correctly, all the requests from the large BIO split
>>> would be issued in sequence to the device.
>>
>> This solution sounds feasible in theory; however, I'm not sure yet how
>> to implement that 'temporary excess'.
> 
> It should not be too hard.

I'll try to figure out a proper way; in the meantime, any suggestions
would be appreciated.
> 
> By the way, did you check that doing something like:
> 
> echo 2048 > /sys/block/sdX/queue/nr_requests
> 
> improves performance for your high number of jobs test case ?

Yes, performance will not degrade when numjobs is not greater than 256
in this case.
> 
>>
>> Thanks,
>> Kuai
>>>
>>>
>>>>
>>>> patch 6-8 will let split bios still be issued continuously under high
>>>> pressure.
>>>>
>>>> Thanks,
>>>> Kuai
>>>>
>>>
>>>
> 
> 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH -next RFC v3 0/8] improve tag allocation under heavy load
  2022-04-25  7:05             ` yukuai (C)
@ 2022-04-25  7:06               ` Damien Le Moal
  2022-04-25  7:28                 ` yukuai (C)
  0 siblings, 1 reply; 22+ messages in thread
From: Damien Le Moal @ 2022-04-25  7:06 UTC (permalink / raw)
  To: yukuai (C),
	axboe, bvanassche, andriy.shevchenko, john.garry, ming.lei,
	qiulaibin
  Cc: linux-block, linux-kernel, yi.zhang

On 4/25/22 16:05, yukuai (C) wrote:
> On 2022/04/25 14:50, Damien Le Moal wrote:
>> On 4/25/22 15:47, yukuai (C) wrote:
>>> On 2022/04/25 14:23, Damien Le Moal wrote:
>>>> On 4/25/22 15:14, yukuai (C) wrote:
>>>>> On 2022/04/25 11:24, Damien Le Moal wrote:
>>>>>> On 4/24/22 11:43, yukuai (C) wrote:
>>>>>>> friendly ping ...
>>>>>>>
>>>>>>> On 2022/04/15 18:10, Yu Kuai wrote:
>>>>>>>> Changes in v3:
>>>>>>>>      - update 'waiters_cnt' before 'ws_active' in sbitmap_prepare_to_wait()
>>>>>>>>      in patch 1, in case __sbq_wake_up() sees 'ws_active > 0' while
>>>>>>>>      'waiters_cnt' are all 0, which would cause a dead loop.
>>>>>>>>      - don't add 'wait_index' during each loop in patch 2
>>>>>>>>      - fix that 'wake_index' might mismatch in the first wake up in patch 3,
>>>>>>>>      also improving coding for the patch.
>>>>>>>>      - add a detection in patch 4 in case an io hang is triggered in corner
>>>>>>>>      cases.
>>>>>>>>      - make the detection of whether free tags are sufficient more flexible.
>>>>>>>>      - fix a race in patch 8.
>>>>>>>>      - fix some words and add some comments.
>>>>>>>>
>>>>>>>> Changes in v2:
>>>>>>>>      - use a new title
>>>>>>>>      - add patches to fix waitqueues' unfairness - patch 1-3
>>>>>>>>      - delete patch to add queue flag
>>>>>>>>      - delete patch to split big io thoroughly
>>>>>>>>
>>>>>>>> In this patchset:
>>>>>>>>      - patch 1-3 fix waitqueues' unfairness.
>>>>>>>>      - patch 4,5 disable tag preemption on heavy load.
>>>>>>>>      - patch 6 forces tag preemption for split bios.
>>>>>>>>      - patch 7,8 improve large random io for HDD. We do meet the problem and
>>>>>>>>      I'm trying to fix it at very low cost. However, if anyone still thinks
>>>>>>>>      this is not a common case and not worth to optimize, I'll drop them.
>>>>>>>>
>>>>>>>> There is a defect in blk-mq compared to blk-sq: split io will end up
>>>>>>>> discontinuous if the device is under high io pressure, while split io
>>>>>>>> will still be continuous in sq. This is because:
>>>>>>>>
>>>>>>>> 1) new io can preempt a tag even if there are lots of threads waiting.
>>>>>>>> 2) split bios are issued one by one; if one bio can't get a tag, it will
>>>>>>>> go to wait.
>>>>>>>> 3) each time 8 (or wake batch) requests are done, 8 waiters will be woken up.
>>>>>>>> Thus if a thread is woken up, it is unlikely to get multiple tags.
>>>>>>>>
>>>>>>>> The problem was first found when upgrading the kernel from v3.10 to
>>>>>>>> v4.18; the test device is an HDD with 256 'max_sectors_kb', and the test
>>>>>>>> case issues 1m ios with high concurrency.
>>>>>>>>
>>>>>>>> Note that there is a precondition for such a performance problem:
>>>>>>>> There is a certain gap between bandwidth for single io with
>>>>>>>> bs=max_sectors_kb and disk upper limit.
>>>>>>>>
>>>>>>>> During the test, I found that waitqueues can be extremely unbalanced on
>>>>>>>> heavy load. This is because 'wake_index' is not set properly in
>>>>>>>> __sbq_wake_up(), see details in patch 3.
>>>>>>>>
>>>>>>>> Test environment:
>>>>>>>> arm64, 96 cores with 200 BogoMIPS; the test device is an HDD. The default
>>>>>>>> 'max_sectors_kb' is 1280 (sorry that I was unable to test on the machine
>>>>>>>> where 'max_sectors_kb' is 256).
>>>>>>>> The single io performance(randwrite):
>>>>>>>>
>>>>>>>> | bs       | 128k | 256k | 512k | 1m   | 1280k | 2m   | 4m   |
>>>>>>>> | -------- | ---- | ---- | ---- | ---- | ----- | ---- | ---- |
>>>>>>>> | bw MiB/s | 20.1 | 33.4 | 51.8 | 67.1 | 74.7  | 82.9 | 82.9 |
>>>>>>
>>>>>> These results are extremely strange, unless you are running with the
>>>>>> device write cache disabled ? If you have the device write cache enabled,
>>>>>> the problem you mention above would be most likely completely invisible,
>>>>>> which I guess is why nobody really noticed any issue until now.
>>>>>>
>>>>>> Similarly, with reads, the device side read-ahead may hide the problem,
>>>>>> albeit that depends on how "intelligent" the drive is at identifying
>>>>>> sequential accesses.
>>>>>>
>>>>>>>>
>>>>>>>> It can be seen that 1280k io is already close to the upper limit, and
>>>>>>>> it would be hard to see differences with the default value, so I set
>>>>>>>> 'max_sectors_kb' to 128 in the following test.
>>>>>>>>
>>>>>>>> Test cmd:
>>>>>>>>             fio \
>>>>>>>>             -filename=/dev/$dev \
>>>>>>>>             -name=test \
>>>>>>>>             -ioengine=psync \
>>>>>>>>             -allow_mounted_write=0 \
>>>>>>>>             -group_reporting \
>>>>>>>>             -direct=1 \
>>>>>>>>             -offset_increment=1g \
>>>>>>>>             -rw=randwrite \
>>>>>>>>             -bs=1024k \
>>>>>>>>             -numjobs={1,2,4,8,16,32,64,128,256,512} \
>>>>>>>>             -runtime=110 \
>>>>>>>>             -ramp_time=10
>>>>>>>>
>>>>>>>> Test result: MiB/s
>>>>>>>>
>>>>>>>> | numjobs | v5.18-rc1 | v5.18-rc1-patched |
>>>>>>>> | ------- | --------- | ----------------- |
>>>>>>>> | 1       | 67.7      | 67.7              |
>>>>>>>> | 2       | 67.7      | 67.7              |
>>>>>>>> | 4       | 67.7      | 67.7              |
>>>>>>>> | 8       | 67.7      | 67.7              |
>>>>>>>> | 16      | 64.8      | 65.6              |
>>>>>>>> | 32      | 59.8      | 63.8              |
>>>>>>>> | 64      | 54.9      | 59.4              |
>>>>>>>> | 128     | 49        | 56.9              |
>>>>>>>> | 256     | 37.7      | 58.3              |
>>>>>>>> | 512     | 31.8      | 57.9              |
>>>>>>
>>>>>> Device write cache disabled ?
>>>>>>
>>>>>> Also, what is the max QD of this disk ?
>>>>>>
>>>>>> E.g., if it is SATA, it is 32, so you will only get at most 64 scheduler
>>>>>> tags. So for any of your tests with more than 64 threads, many of the
>>>>>> threads will be waiting for a scheduler tag for the BIO before the
>>>>>> bio_split problem you explain triggers. Given that the numbers you show
>>>>>> are the same for before-after patch with a number of threads <= 64, I am
>>>>>> tempted to think that the problem is not really BIO splitting...
>>>>>>
>>>>>> What about random read workloads ? What kind of results do you see ?
>>>>>
>>>>> Hi,
>>>>>
>>>>> Sorry for the misleading test case.
>>>>>
>>>>> This test case is high-concurrency large randwrite; it is just for the
>>>>> problem that split bios won't be issued continuously, which is the
>>>>> root cause of the performance degradation as numjobs increases.
>>>>>
>>>>> queue_depth is 32 and nr_requests is 64, thus when numjobs is no
>>>>> greater than 8, performance is fine, because the ratio of sequential
>>>>> io should be 7/8. However, as numjobs increases, performance gets
>>>>> worse because the ratio drops. For example, when numjobs is 512, the
>>>>> ratio of sequential io is about 20%.
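The arithmetic behind those figures, as a quick sketch (the 20% number is measured, not derived; the tag and split counts are assumptions from the test setup described above):

```python
max_sectors_kb = 128
# each 1m write is split into 1024k / 128k bios
split = 1024 // max_sectors_kb
assert split == 8

# Best case: all 8 split bios of an IO are issued back to back, so 7 of
# every 8 adjacent bios land sequentially on disk.
best_ratio = (split - 1) / split
assert abs(best_ratio - 7 / 8) < 1e-9

# With 64 scheduler tags, at most 64 // 8 jobs can hold a full set of
# split-bio tags at once, which is why degradation starts beyond 8 jobs.
jobs_with_full_split = 64 // split
assert jobs_with_full_split == 8
```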
>>>>
>>>> But with 512 jobs, you will get only 64 jobs with IOs in the queue.
>>>> All other jobs will be waiting for a scheduler tag before being able
>>>> to issue their large BIO. No ?
>>>
>>> Hi,
>>>
>>> That's right.
>>>
>>> In fact, after this patchset, since each large io needs 8 tags in
>>> total, only 8 jobs can be in the queue while the others are waiting
>>> for a scheduler tag.
>>>
>>>>
>>>> It sounds like the set of scheduler tags should be a bit more elastic:
>>>> always allow BIOs from a split of a large BIO to be submitted (that is to
>>>> get a scheduler tag) even if that causes a temporary excess of the number
>>>> of requests beyond the default number of scheduler tags. Doing so, all
>>>> fragments of a large BIOs can be queued immediately. From there, if the
>>>> scheduler operates correctly, all the requests from the large BIOs split
>>>> would be issued in sequence to the device.
>>>
>>> This solution sounds feasible in theory; however, I'm not sure yet
>>> how to implement that 'temporary excess'.
>>
>> It should not be too hard.
> 
> I'll try to figure out a proper way; in the meantime, any suggestions
> would be appreciated.
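One way to picture the 'temporary excess' idea (purely an illustrative sketch under assumed semantics, not a kernel API; `ElasticTags`, `get_tag`, and `from_split` are made-up names):

```python
class ElasticTags:
    """Toy model of an 'elastic' scheduler tag set: normal allocations
    respect nr_tags, but fragments of a split large bio may temporarily
    exceed it so the whole split can be queued together."""

    def __init__(self, nr_tags):
        self.nr_tags = nr_tags   # soft limit
        self.in_use = 0

    def get_tag(self, from_split=False):
        if self.in_use < self.nr_tags or from_split:
            self.in_use += 1
            return True          # tag granted
        return False             # caller must go to sleep and wait

    def put_tag(self):
        self.in_use -= 1

tags = ElasticTags(2)
assert tags.get_tag() and tags.get_tag()   # soft limit reached
assert not tags.get_tag()                  # normal allocation must wait
assert tags.get_tag(from_split=True)       # split fragment may exceed
```

With this policy, once the first fragment of a split gets a tag, the remaining fragments can all be queued immediately, so the scheduler sees the whole sequence at once.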
>>
>> By the way, did you check that doing something like:
>>
>> echo 2048 > /sys/block/sdX/queue/nr_requests
>>
>> improves performance for your high number of jobs test case ?
> 
> Yes, performance will not degrade when numjobs is not greater than 256
> in this case.

That is my thinking as well. I am asking if you did check that (did you run it ?).

>>
>>>
>>> Thanks,
>>> Kuai
>>>>
>>>>
>>>>>
>>>>> patch 6-8 will let split bios still be issued continuously under high
>>>>> pressure.
>>>>>
>>>>> Thanks,
>>>>> Kuai
>>>>>
>>>>
>>>>
>>
>>


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH -next RFC v3 0/8] improve tag allocation under heavy load
  2022-04-25  7:06               ` Damien Le Moal
@ 2022-04-25  7:28                 ` yukuai (C)
  2022-04-25 11:20                   ` Damien Le Moal
  0 siblings, 1 reply; 22+ messages in thread
From: yukuai (C) @ 2022-04-25  7:28 UTC (permalink / raw)
  To: Damien Le Moal, axboe, bvanassche, andriy.shevchenko, john.garry,
	ming.lei, qiulaibin
  Cc: linux-block, linux-kernel, yi.zhang

On 2022/04/25 15:06, Damien Le Moal wrote:

>>> By the way, did you check that doing something like:
>>>
>>> echo 2048 > /sys/block/sdX/queue/nr_requests
>>>
>>> improves performance for your high number of jobs test case ?
>>
>> Yes, performance will not degrade when numjobs is not greater than 256
>> in this case.
> 
> That is my thinking as well. I am asking if you did check that (did you run it ?).

Hi,

I'm sure I ran it with 256 jobs before.

However, I didn't run it with 512 jobs; the following is the result I
just tested:

ratio of sequential io: 49.1%

Read|Write seek 

cnt 99338, zero cnt 48753 

     >=(KB) .. <(KB)     : count       ratio |distribution
          0 .. 1         : 48753       49.1% |########################################|
          1 .. 2         : 0            0.0% |
          2 .. 4         : 0            0.0% |
          4 .. 8         : 0            0.0% |
          8 .. 16        : 0            0.0% |
         16 .. 32        : 0            0.0% |
         32 .. 64        : 0            0.0% |
         64 .. 128       : 4975         5.0% |#####
        128 .. 256       : 4439         4.5% |####
        256 .. 512       : 2615         2.6% |###
        512 .. 1024      : 967          1.0% |#
       1024 .. 2048      : 213          0.2% |#
       2048 .. 4096      : 375          0.4% |#
       4096 .. 8192      : 723          0.7% |#
       8192 .. 16384     : 1436         1.4% |##
      16384 .. 32768     : 2626         2.6% |###
      32768 .. 65536     : 4197         4.2% |####
      65536 .. 131072    : 6431         6.5% |######
     131072 .. 262144    : 7590         7.6% |#######
     262144 .. 524288    : 6433         6.5% |######
     524288 .. 1048576   : 4583         4.6% |####
    1048576 .. 2097152   : 2237         2.3% |##
    2097152 .. 4194304   : 489          0.5% |#
    4194304 .. 8388608   : 83           0.1% |#
    8388608 .. 16777216  : 36           0.0% |#
   16777216 .. 33554432  : 0            0.0% |
   33554432 .. 67108864  : 0            0.0% |
   67108864 .. 134217728 : 137          0.1% |#


* Re: [PATCH -next RFC v3 0/8] improve tag allocation under heavy load
  2022-04-25  7:28                 ` yukuai (C)
@ 2022-04-25 11:20                   ` Damien Le Moal
  2022-04-25 13:42                     ` yukuai (C)
  0 siblings, 1 reply; 22+ messages in thread
From: Damien Le Moal @ 2022-04-25 11:20 UTC (permalink / raw)
  To: yukuai (C),
	axboe, bvanassche, andriy.shevchenko, john.garry, ming.lei,
	qiulaibin
  Cc: linux-block, linux-kernel, yi.zhang

On 4/25/22 16:28, yukuai (C) wrote:
> On 2022/04/25 15:06, Damien Le Moal wrote:
> 
>>>> By the way, did you check that doing something like:
>>>>
>>>> echo 2048 > /sys/block/sdX/queue/nr_requests
>>>>
>>>> improves performance for your high number of jobs test case ?
>>>
>>> Yes, performance will not degrade when numjobs is not greater than 256
>>> in this case.
>>
>> That is my thinking as well. I am asking if you did check that (did you run it ?).
> 
> Hi,
> 
> I'm sure I ran it with 256 jobs before.
> 
> However, I didn't run it with 512 jobs. And following is the result I
> just tested:

What was nr_requests ? The default 64 ?
If you increase that number, do you see better throughput/more requests
being sequential ?


> 
> ratio of sequential io: 49.1%
> 
> Read|Write seek 
> 
> cnt 99338, zero cnt 48753 
> 
>      >=(KB) .. <(KB)     : count       ratio |distribution 
>               |
>           0 .. 1         : 48753       49.1% 
> |########################################|
>           1 .. 2         : 0            0.0% | 
>               |
>           2 .. 4         : 0            0.0% | 
>               |
>           4 .. 8         : 0            0.0% | 
>               |
>           8 .. 16        : 0            0.0% | 
>               |
>          16 .. 32        : 0            0.0% | 
>               |
>          32 .. 64        : 0            0.0% | 
>               |
>          64 .. 128       : 4975         5.0% |##### 
>               |
>         128 .. 256       : 4439         4.5% |#### 
>               |
>         256 .. 512       : 2615         2.6% |### 
>               |
>         512 .. 1024      : 967          1.0% |# 
>               |
>        1024 .. 2048      : 213          0.2% |# 
>               |
>        2048 .. 4096      : 375          0.4% |# 
>               |
>        4096 .. 8192      : 723          0.7% |# 
>               |
>        8192 .. 16384     : 1436         1.4% |## 
>               |
>       16384 .. 32768     : 2626         2.6% |### 
>               |
>       32768 .. 65536     : 4197         4.2% |#### 
>               |
>       65536 .. 131072    : 6431         6.5% |###### 
>               |
>      131072 .. 262144    : 7590         7.6% |####### 
>               |
>      262144 .. 524288    : 6433         6.5% |###### 
>               |
>      524288 .. 1048576   : 4583         4.6% |#### 
>               |
>     1048576 .. 2097152   : 2237         2.3% |## 
>               |
>     2097152 .. 4194304   : 489          0.5% |# 
>               |
>     4194304 .. 8388608   : 83           0.1% |# 
>               |
>     8388608 .. 16777216  : 36           0.0% |# 
>               |
>    16777216 .. 33554432  : 0            0.0% | 
>               |
>    33554432 .. 67108864  : 0            0.0% | 
>               |
>    67108864 .. 134217728 : 137          0.1% |# 
>               |


-- 
Damien Le Moal
Western Digital Research


* Re: [PATCH -next RFC v3 0/8] improve tag allocation under heavy load
  2022-04-25 11:20                   ` Damien Le Moal
@ 2022-04-25 13:42                     ` yukuai (C)
  0 siblings, 0 replies; 22+ messages in thread
From: yukuai (C) @ 2022-04-25 13:42 UTC (permalink / raw)
  To: Damien Le Moal, axboe, bvanassche, andriy.shevchenko, john.garry,
	ming.lei, qiulaibin
  Cc: linux-block, linux-kernel, yi.zhang

On 2022/04/25 19:20, Damien Le Moal wrote:
> On 4/25/22 16:28, yukuai (C) wrote:
>> On 2022/04/25 15:06, Damien Le Moal wrote:
>>
>>>>> By the way, did you check that doing something like:
>>>>>
>>>>> echo 2048 > /sys/block/sdX/queue/nr_requests
>>>>>
>>>>> improves performance for your high number of jobs test case ?
>>>>
>>>> Yes, performance will not degrade when numjobs is not greater than 256
>>>> in this case.
>>>
>>> That is my thinking as well. I am asking if you did check that (did you run it ?).
>>
>> Hi,
>>
>> I'm sure I ran it with 256 jobs before.
>>
>> However, I didn't run it with 512 jobs. And following is the result I
>> just tested:
> 
> What was nr_requests ? The default 64 ?
> If you increase that number, do you see better throughput/more requests
> being sequential ?

Sorry if I didn't explain this clearly.

If nr_requests is 64 and numjobs is 512, the ratio of sequential io is
about 20%. If nr_requests is 2048 and numjobs is 512, the ratio is 49.1%.

So yes, increasing nr_requests can improve performance in this test case.
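A back-of-the-envelope way to see why (assuming, as discussed above in the thread, that each 1m IO splits into 8 bios needing one scheduler tag each):

```python
def jobs_with_full_split(nr_requests, split=8):
    # how many jobs can hold a complete set of split-bio tags at once
    return nr_requests // split

assert jobs_with_full_split(64) == 8       # default: degrades beyond 8 jobs
assert jobs_with_full_split(2048) == 256   # matches "fine up to 256 jobs"
```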

> 
> 
>>
>> ratio of sequential io: 49.1%
>> [seek distance histogram snipped]
> 
> 


end of thread, other threads:[~2022-04-25 13:42 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-04-15 10:10 [PATCH -next RFC v3 0/8] improve tag allocation under heavy load Yu Kuai
2022-04-15 10:10 ` [PATCH -next RFC v3 1/8] sbitmap: record the number of waiters for each waitqueue Yu Kuai
2022-04-15 10:10 ` [PATCH -next RFC v3 2/8] blk-mq: call 'bt_wait_ptr()' later in blk_mq_get_tag() Yu Kuai
2022-04-15 10:10 ` [PATCH -next RFC v3 3/8] sbitmap: make sure waitqueues are balanced Yu Kuai
2022-04-15 10:10 ` [PATCH -next RFC v3 4/8] blk-mq: don't preempt tag under heavy load Yu Kuai
2022-04-15 10:10 ` [PATCH -next RFC v3 5/8] sbitmap: force tag preemption if free tags are sufficient Yu Kuai
2022-04-15 10:10 ` [PATCH -next RFC v3 6/8] blk-mq: force tag preemption for split bios Yu Kuai
2022-04-15 10:10 ` [PATCH -next RFC v3 7/8] blk-mq: record how many tags are needed for splited bio Yu Kuai
2022-04-15 10:10 ` [PATCH -next RFC v3 8/8] sbitmap: wake up the number of threads based on required tags Yu Kuai
2022-04-24  2:43 ` [PATCH -next RFC v3 0/8] improve tag allocation under heavy load yukuai (C)
2022-04-25  3:24   ` Damien Le Moal
2022-04-25  6:14     ` yukuai (C)
2022-04-25  6:23       ` Damien Le Moal
2022-04-25  6:47         ` yukuai (C)
2022-04-25  6:50           ` Damien Le Moal
2022-04-25  7:05             ` yukuai (C)
2022-04-25  7:06               ` Damien Le Moal
2022-04-25  7:28                 ` yukuai (C)
2022-04-25 11:20                   ` Damien Le Moal
2022-04-25 13:42                     ` yukuai (C)
2022-04-25  3:09 ` Bart Van Assche
2022-04-25  3:27   ` yukuai (C)
