linux-kernel.vger.kernel.org archive mirror
* [PATCH -next RFC v2 0/8] improve tag allocation under heavy load
@ 2022-04-08  7:39 Yu Kuai
  2022-04-08  7:39 ` [PATCH -next RFC v2 1/8] sbitmap: record the number of waiters for each waitqueue Yu Kuai
                   ` (8 more replies)
  0 siblings, 9 replies; 28+ messages in thread
From: Yu Kuai @ 2022-04-08  7:39 UTC (permalink / raw)
  To: axboe, yukuai3, andriy.shevchenko, john.garry, ming.lei
  Cc: linux-block, linux-kernel, yi.zhang

Changes in v2:
 - use a new title
 - add patches to fix waitqueues' unfairness - path 1-3
 - delete patch to add queue flag
 - delete patch to split big io thoroughly

There is a defect in blk-mq compared to blk-sq: split ios will end up
discontinuous if the device is under high io pressure, while they will
still be continuous in sq. This is because:

1) new io can preempt a tag even if there are lots of threads waiting.
2) split bios are issued one by one; if one bio can't get a tag, it will
go to wait.
3) each time 8 (or wake batch) requests are done, 8 waiters are woken up,
so a woken thread is unlikely to get multiple tags.

The problem was first found when upgrading the kernel from v3.10 to
v4.18; the test device is an HDD with 256 'max_sectors_kb', and the test
case issues 1m ios with high concurrency.

Note that there is a precondition for such a performance problem:
there must be a certain gap between the bandwidth of a single io with
bs=max_sectors_kb and the disk's upper limit.

During the test, I found that waitqueues can be extremely unbalanced
under heavy load. This is because 'wake_index' is not set properly in
__sbq_wake_up(); see details in patch 3.

In this patchset:
 - patch 1-3 fix waitqueues' unfairness.
 - patch 4,5 disable tag preemption on heavy load.
 - patch 6 forces tag preemption for split bios.
 - patch 7,8 improve large random io for HDD. As I mentioned above, we
 do meet the problem and I'm trying to fix it at very low cost. However,
 if anyone still thinks this is not a common case and not worth
 optimizing, I'll drop them.

Test environment:
arm64, 96 cores with 200 BogoMIPS; the test device is an HDD. The default
'max_sectors_kb' is 1280 (sorry, I was unable to test on the machine
where 'max_sectors_kb' is 256).

The single io performance(randwrite):

| bs       | 128k | 256k | 512k | 1m   | 1280k | 2m   | 4m   |
| -------- | ---- | ---- | ---- | ---- | ----- | ---- | ---- |
| bw MiB/s | 20.1 | 33.4 | 51.8 | 67.1 | 74.7  | 82.9 | 82.9 |

It can be seen that 1280k ios are already close to the upper limit, and
it would be hard to see differences with the default value, thus I set
'max_sectors_kb' to 128 in the following test.

Test cmd:
        fio \
        -filename=/dev/$dev \
        -name=test \
        -ioengine=psync \
        -allow_mounted_write=0 \
        -group_reporting \
        -direct=1 \
        -offset_increment=1g \
        -rw=randwrite \
        -bs=1024k \
        -numjobs={1,2,4,8,16,32,64,128,256,512} \
        -runtime=110 \
        -ramp_time=10

Test result: MiB/s

| numjobs | v5.18-rc1 | v5.18-rc1-patched |
| ------- | --------- | ----------------- |
| 1       | 67.7      | 67.7              |
| 2       | 67.7      | 67.7              |
| 4       | 67.7      | 67.7              |
| 8       | 67.7      | 67.7              |
| 16      | 64.8      | 65.2              |
| 32      | 59.8      | 62.8              |
| 64      | 54.9      | 58.6              |
| 128     | 49        | 55.8              |
| 256     | 37.7      | 52.3              |
| 512     | 31.8      | 51.4              |

Yu Kuai (8):
  sbitmap: record the number of waiters for each waitqueue
  blk-mq: call 'bt_wait_ptr()' later in blk_mq_get_tag()
  sbitmap: make sure waitqueues are balanced
  blk-mq: don't preempt tag on heavy load
  sbitmap: force tag preemption if free tags are sufficient
  blk-mq: force tag preemption for split bios
  blk-mq: record how many tags are needed for split bio
  sbitmap: wake up the number of threads based on required tags

 block/blk-merge.c         |   9 ++-
 block/blk-mq-tag.c        |  42 +++++++++-----
 block/blk-mq.c            |  25 +++++++-
 block/blk-mq.h            |   2 +
 include/linux/blk_types.h |   4 ++
 include/linux/sbitmap.h   |   9 +++
 lib/sbitmap.c             | 117 +++++++++++++++++++++++++-------------
 7 files changed, 150 insertions(+), 58 deletions(-)

-- 
2.31.1


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH -next RFC v2 1/8] sbitmap: record the number of waiters for each waitqueue
  2022-04-08  7:39 [PATCH -next RFC v2 0/8] improve tag allocation under heavy load Yu Kuai
@ 2022-04-08  7:39 ` Yu Kuai
  2022-04-08  7:39 ` [PATCH -next RFC v2 2/8] blk-mq: call 'bt_wait_ptr()' later in blk_mq_get_tag() Yu Kuai
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 28+ messages in thread
From: Yu Kuai @ 2022-04-08  7:39 UTC (permalink / raw)
  To: axboe, yukuai3, andriy.shevchenko, john.garry, ming.lei
  Cc: linux-block, linux-kernel, yi.zhang

Add a counter in struct sbq_wait_state to record how many threads are
waiting on the waitqueue; it will be used in later patches to make
sure the 8 waitqueues are balanced. The counter is also shown in
debugfs so that users can see whether the waitqueues are balanced.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
 include/linux/sbitmap.h | 5 +++++
 lib/sbitmap.c           | 7 +++++--
 2 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/include/linux/sbitmap.h b/include/linux/sbitmap.h
index 8f5a86e210b9..8a64271d0696 100644
--- a/include/linux/sbitmap.h
+++ b/include/linux/sbitmap.h
@@ -91,6 +91,11 @@ struct sbq_wait_state {
 	 */
 	atomic_t wait_cnt;
 
+	/**
+	 * @waiters_cnt: Number of active waiters
+	 */
+	atomic_t waiters_cnt;
+
 	/**
 	 * @wait: Wait queue.
 	 */
diff --git a/lib/sbitmap.c b/lib/sbitmap.c
index ae4fd4de9ebe..393f2b71647a 100644
--- a/lib/sbitmap.c
+++ b/lib/sbitmap.c
@@ -444,6 +444,7 @@ int sbitmap_queue_init_node(struct sbitmap_queue *sbq, unsigned int depth,
 	for (i = 0; i < SBQ_WAIT_QUEUES; i++) {
 		init_waitqueue_head(&sbq->ws[i].wait);
 		atomic_set(&sbq->ws[i].wait_cnt, sbq->wake_batch);
+		atomic_set(&sbq->ws[i].waiters_cnt, 0);
 	}
 
 	return 0;
@@ -759,9 +760,9 @@ void sbitmap_queue_show(struct sbitmap_queue *sbq, struct seq_file *m)
 	for (i = 0; i < SBQ_WAIT_QUEUES; i++) {
 		struct sbq_wait_state *ws = &sbq->ws[i];
 
-		seq_printf(m, "\t{.wait_cnt=%d, .wait=%s},\n",
+		seq_printf(m, "\t{.wait_cnt=%d, .waiters_cnt=%d},\n",
 			   atomic_read(&ws->wait_cnt),
-			   waitqueue_active(&ws->wait) ? "active" : "inactive");
+			   atomic_read(&ws->waiters_cnt));
 	}
 	seq_puts(m, "}\n");
 
@@ -798,6 +799,7 @@ void sbitmap_prepare_to_wait(struct sbitmap_queue *sbq,
 {
 	if (!sbq_wait->sbq) {
 		atomic_inc(&sbq->ws_active);
+		atomic_inc(&ws->waiters_cnt);
 		sbq_wait->sbq = sbq;
 	}
 	prepare_to_wait_exclusive(&ws->wait, &sbq_wait->wait, state);
@@ -810,6 +812,7 @@ void sbitmap_finish_wait(struct sbitmap_queue *sbq, struct sbq_wait_state *ws,
 	finish_wait(&ws->wait, &sbq_wait->wait);
 	if (sbq_wait->sbq) {
 		atomic_dec(&sbq->ws_active);
+		atomic_dec(&ws->waiters_cnt);
 		sbq_wait->sbq = NULL;
 	}
 }
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH -next RFC v2 2/8] blk-mq: call 'bt_wait_ptr()' later in blk_mq_get_tag()
  2022-04-08  7:39 [PATCH -next RFC v2 0/8] improve tag allocation under heavy load Yu Kuai
  2022-04-08  7:39 ` [PATCH -next RFC v2 1/8] sbitmap: record the number of waiters for each waitqueue Yu Kuai
@ 2022-04-08  7:39 ` Yu Kuai
  2022-04-08 14:20   ` Bart Van Assche
  2022-04-08  7:39 ` [PATCH -next RFC v2 3/8] sbitmap: make sure waitqueues are balanced Yu Kuai
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 28+ messages in thread
From: Yu Kuai @ 2022-04-08  7:39 UTC (permalink / raw)
  To: axboe, yukuai3, andriy.shevchenko, john.garry, ming.lei
  Cc: linux-block, linux-kernel, yi.zhang

bt_wait_ptr() increases 'wait_index'; however, if blk_mq_get_tag()
gets a tag successfully after bt_wait_ptr() is called and before
sbitmap_prepare_to_wait() is called, then the 'ws' is skipped. This
behavior might cause the 8 waitqueues to become unbalanced.

Moving bt_wait_ptr() later should mitigate the problem when the disk is
under high io pressure.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
 block/blk-mq-tag.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 68ac23d0b640..228a0001694f 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -155,7 +155,6 @@ unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
 	if (data->flags & BLK_MQ_REQ_NOWAIT)
 		return BLK_MQ_NO_TAG;
 
-	ws = bt_wait_ptr(bt, data->hctx);
 	do {
 		struct sbitmap_queue *bt_prev;
 
@@ -174,6 +173,7 @@ unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
 		if (tag != BLK_MQ_NO_TAG)
 			break;
 
+		ws = bt_wait_ptr(bt, data->hctx);
 		sbitmap_prepare_to_wait(bt, ws, &wait, TASK_UNINTERRUPTIBLE);
 
 		tag = __blk_mq_get_tag(data, bt);
@@ -201,8 +201,6 @@ unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
 		 */
 		if (bt != bt_prev)
 			sbitmap_queue_wake_up(bt_prev);
-
-		ws = bt_wait_ptr(bt, data->hctx);
 	} while (1);
 
 	sbitmap_finish_wait(bt, ws, &wait);
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH -next RFC v2 3/8] sbitmap: make sure waitqueues are balanced
  2022-04-08  7:39 [PATCH -next RFC v2 0/8] improve tag allocation under heavy load Yu Kuai
  2022-04-08  7:39 ` [PATCH -next RFC v2 1/8] sbitmap: record the number of waiters for each waitqueue Yu Kuai
  2022-04-08  7:39 ` [PATCH -next RFC v2 2/8] blk-mq: call 'bt_wait_ptr()' later in blk_mq_get_tag() Yu Kuai
@ 2022-04-08  7:39 ` Yu Kuai
  2022-04-15  6:31   ` Li, Ming
  2022-04-08  7:39 ` [PATCH -next RFC v2 4/8] blk-mq: don't preempt tag under heavy load Yu Kuai
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 28+ messages in thread
From: Yu Kuai @ 2022-04-08  7:39 UTC (permalink / raw)
  To: axboe, yukuai3, andriy.shevchenko, john.garry, ming.lei
  Cc: linux-block, linux-kernel, yi.zhang

Currently, the same waitqueue might be woken up continuously:

__sbq_wake_up		__sbq_wake_up
 sbq_wake_ptr -> assume	0
			 sbq_wake_ptr -> 0
 atomic_dec_return
			atomic_dec_return
 atomic_cmpxchg -> succeed
			 atomic_cmpxchg -> failed
			  return true

			__sbq_wake_up
			 sbq_wake_ptr
			  atomic_read(&sbq->wake_index) -> still 0
 sbq_index_atomic_inc -> inc to 1
			  if (waitqueue_active(&ws->wait))
			   if (wake_index != atomic_read(&sbq->wake_index))
			    atomic_set -> reset from 1 to 0
 wake_up_nr -> wake up first waitqueue
			    // continue to wake up in first waitqueue

What's worse, an io hang is possible in theory because a wake-up might
be missed. For example, 2 * wake_batch tags are put, while only
wake_batch threads are woken:

__sbq_wake_up
 atomic_cmpxchg -> reset wait_cnt
			__sbq_wake_up -> decrease wait_cnt
			...
			__sbq_wake_up -> wait_cnt is decreased to 0 again
			 atomic_cmpxchg
			 sbq_index_atomic_inc -> increase wake_index
			 wake_up_nr -> wake up and waitqueue might be empty
 sbq_index_atomic_inc -> increase again, one waitqueue is skipped
 wake_up_nr -> invalid wake up because the old waitqueue might be empty

To fix the problem, refactor the code to make sure the waitqueues are
woken up one by one, and choose the next waitqueue based on the number
of threads that are waiting, to keep the waitqueues balanced.
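For illustration only (not part of the patch), the selection policy
described above can be modelled in plain userspace C; the function name
and the plain-int array standing in for the per-waitqueue atomic
'waiters_cnt' counters are assumptions of this sketch:

```c
#include <assert.h>

#define SBQ_WAIT_QUEUES 8

/*
 * Userspace sketch of the policy: among the waitqueues other than the
 * one just woken, pick the one with the most waiters; if none has any
 * waiters, keep the old index. In the real code the winning index is
 * installed with atomic_cmpxchg() against 'old_wake_index'.
 */
static int pick_next_wake_index(const int waiters_cnt[SBQ_WAIT_QUEUES],
				int old_wake_index)
{
	int index = old_wake_index;
	int max_waiters = 0;
	int i;

	for (i = 0; i < SBQ_WAIT_QUEUES; i++) {
		if (i == old_wake_index)
			continue;
		if (waiters_cnt[i] > max_waiters) {
			max_waiters = waiters_cnt[i];
			index = i;
		}
	}
	return index;
}
```

With the unbalanced counters from the example below, the queue with 175
waiters would be chosen next.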

Test cmd: nr_requests is 64, and queue_depth is 32
[global]
filename=/dev/sda
ioengine=libaio
direct=1
allow_mounted_write=0
group_reporting

[test]
rw=randwrite
bs=4k
numjobs=512
iodepth=2

Before this patch, waitqueues can be extremely unbalanced, for example:
ws_active=484
ws={
        {.wait_cnt=8, .waiters_cnt=117},
        {.wait_cnt=8, .waiters_cnt=59},
        {.wait_cnt=8, .waiters_cnt=76},
        {.wait_cnt=8, .waiters_cnt=0},
        {.wait_cnt=5, .waiters_cnt=24},
        {.wait_cnt=8, .waiters_cnt=12},
        {.wait_cnt=8, .waiters_cnt=21},
        {.wait_cnt=8, .waiters_cnt=175},
}

With this patch, waitqueues are always balanced, for example:
ws_active=477
ws={
        {.wait_cnt=8, .waiters_cnt=59},
        {.wait_cnt=6, .waiters_cnt=62},
        {.wait_cnt=8, .waiters_cnt=61},
        {.wait_cnt=8, .waiters_cnt=60},
        {.wait_cnt=8, .waiters_cnt=63},
        {.wait_cnt=8, .waiters_cnt=56},
        {.wait_cnt=8, .waiters_cnt=59},
        {.wait_cnt=8, .waiters_cnt=57},
}

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
 lib/sbitmap.c | 81 ++++++++++++++++++++++++++-------------------------
 1 file changed, 42 insertions(+), 39 deletions(-)

diff --git a/lib/sbitmap.c b/lib/sbitmap.c
index 393f2b71647a..176fba0252d7 100644
--- a/lib/sbitmap.c
+++ b/lib/sbitmap.c
@@ -575,68 +575,71 @@ void sbitmap_queue_min_shallow_depth(struct sbitmap_queue *sbq,
 }
 EXPORT_SYMBOL_GPL(sbitmap_queue_min_shallow_depth);
 
-static struct sbq_wait_state *sbq_wake_ptr(struct sbitmap_queue *sbq)
+/* always choose the 'ws' with the max waiters */
+static void sbq_update_wake_index(struct sbitmap_queue *sbq,
+				  int old_wake_index)
 {
-	int i, wake_index;
+	int index, wake_index;
+	int max_waiters = 0;
 
-	if (!atomic_read(&sbq->ws_active))
-		return NULL;
+	if (old_wake_index != atomic_read(&sbq->wake_index))
+		return;
 
-	wake_index = atomic_read(&sbq->wake_index);
-	for (i = 0; i < SBQ_WAIT_QUEUES; i++) {
-		struct sbq_wait_state *ws = &sbq->ws[wake_index];
+	for (wake_index = 0; wake_index < SBQ_WAIT_QUEUES; wake_index++) {
+		struct sbq_wait_state *ws;
+		int waiters;
 
-		if (waitqueue_active(&ws->wait)) {
-			if (wake_index != atomic_read(&sbq->wake_index))
-				atomic_set(&sbq->wake_index, wake_index);
-			return ws;
-		}
+		if (wake_index == old_wake_index)
+			continue;
 
-		wake_index = sbq_index_inc(wake_index);
+		ws = &sbq->ws[wake_index];
+		waiters = atomic_read(&ws->waiters_cnt);
+		if (waiters > max_waiters) {
+			max_waiters = waiters;
+			index = wake_index;
+		}
 	}
 
-	return NULL;
+	if (max_waiters)
+		atomic_cmpxchg(&sbq->wake_index, old_wake_index, index);
 }
 
 static bool __sbq_wake_up(struct sbitmap_queue *sbq)
 {
 	struct sbq_wait_state *ws;
 	unsigned int wake_batch;
-	int wait_cnt;
+	int wait_cnt, wake_index;
 
-	ws = sbq_wake_ptr(sbq);
-	if (!ws)
+	if (!atomic_read(&sbq->ws_active))
 		return false;
 
+	wake_index = atomic_read(&sbq->wake_index);
+	ws = &sbq->ws[wake_index];
 	wait_cnt = atomic_dec_return(&ws->wait_cnt);
-	if (wait_cnt <= 0) {
-		int ret;
-
-		wake_batch = READ_ONCE(sbq->wake_batch);
-
-		/*
-		 * Pairs with the memory barrier in sbitmap_queue_resize() to
-		 * ensure that we see the batch size update before the wait
-		 * count is reset.
-		 */
-		smp_mb__before_atomic();
-
+	if (wait_cnt > 0) {
+		return false;
+	} else if (wait_cnt < 0) {
 		/*
-		 * For concurrent callers of this, the one that failed the
-		 * atomic_cmpxhcg() race should call this function again
+		 * Concurrent callers should call this function again
 		 * to wakeup a new batch on a different 'ws'.
 		 */
-		ret = atomic_cmpxchg(&ws->wait_cnt, wait_cnt, wake_batch);
-		if (ret == wait_cnt) {
-			sbq_index_atomic_inc(&sbq->wake_index);
-			wake_up_nr(&ws->wait, wake_batch);
-			return false;
-		}
-
+		sbq_update_wake_index(sbq, wake_index);
 		return true;
 	}
 
-	return false;
+	sbq_update_wake_index(sbq, wake_index);
+	wake_batch = READ_ONCE(sbq->wake_batch);
+
+	/*
+	 * Pairs with the memory barrier in sbitmap_queue_resize() to
+	 * ensure that we see the batch size update before the wait
+	 * count is reset.
+	 */
+	smp_mb__before_atomic();
+	atomic_set(&ws->wait_cnt, wake_batch);
+	wake_up_nr(&ws->wait, wake_batch);
+
+	return true;
 }
 
 void sbitmap_queue_wake_up(struct sbitmap_queue *sbq)
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH -next RFC v2 4/8] blk-mq: don't preempt tag under heavy load
  2022-04-08  7:39 [PATCH -next RFC v2 0/8] improve tag allocation under heavy load Yu Kuai
                   ` (2 preceding siblings ...)
  2022-04-08  7:39 ` [PATCH -next RFC v2 3/8] sbitmap: make sure waitqueues are balanced Yu Kuai
@ 2022-04-08  7:39 ` Yu Kuai
  2022-04-08 14:24   ` Bart Van Assche
  2022-04-08  7:39 ` [PATCH -next RFC v2 5/8] sbitmap: force tag preemption if free tags are sufficient Yu Kuai
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 28+ messages in thread
From: Yu Kuai @ 2022-04-08  7:39 UTC (permalink / raw)
  To: axboe, yukuai3, andriy.shevchenko, john.garry, ming.lei
  Cc: linux-block, linux-kernel, yi.zhang

Tag preemption is the default behaviour: blk_mq_get_tag() will try to
get a tag unconditionally, which means a new io can preempt a tag even
if there are lots of ios waiting for tags.

Such behaviour doesn't make sense when the disk is under heavy load,
because it will intensify competition without improving performance,
especially for huge ios, as split ios are unlikely to be issued
continuously.

The ideal way to disable tag preemption would be to track how many tags
are available, and wait directly in blk_mq_get_tag() if free tags are
scarce. However, this is impractical because it would affect the fast
path.

As 'ws_active' is only updated in the slow path, this patch disables tag
preemption if 'ws_active' is greater than 8, which means there are
already many threads waiting for tags.

Once tag preemption is disabled, there is a situation that can cause
performance degradation (or io hangs in extreme scenarios): if the
waitqueue doesn't have 'wake_batch' threads, a wake-up on this waitqueue
might decrease io concurrency. The next patch will fix this problem.
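For illustration only (not part of the patch), the threshold check can
be modelled in userspace C; the function name and the plain-int
'ws_active' are assumptions of this sketch:

```c
#include <assert.h>
#include <stdbool.h>

#define SBQ_WAIT_QUEUES 8

/*
 * Userspace sketch of the check: a new io may still grab a tag directly
 * if preemption was explicitly granted to it, or while no more than
 * SBQ_WAIT_QUEUES threads are waiting. Since 'ws_active' is only
 * updated in the slow path, reading it keeps the fast path untouched.
 */
static bool may_preempt_tag(bool preemption, int ws_active)
{
	return preemption || ws_active <= SBQ_WAIT_QUEUES;
}
```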

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
 block/blk-mq-tag.c | 36 +++++++++++++++++++++++++-----------
 block/blk-mq.h     |  1 +
 2 files changed, 26 insertions(+), 11 deletions(-)

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 228a0001694f..be2d49e6d69e 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -127,6 +127,13 @@ unsigned long blk_mq_get_tags(struct blk_mq_alloc_data *data, int nr_tags,
 	return ret;
 }
 
+static inline bool preempt_tag(struct blk_mq_alloc_data *data,
+			       struct sbitmap_queue *bt)
+{
+	return data->preemption ||
+	       atomic_read(&bt->ws_active) <= SBQ_WAIT_QUEUES;
+}
+
 unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
 {
 	struct blk_mq_tags *tags = blk_mq_tags_from_data(data);
@@ -148,12 +155,14 @@ unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
 		tag_offset = tags->nr_reserved_tags;
 	}
 
-	tag = __blk_mq_get_tag(data, bt);
-	if (tag != BLK_MQ_NO_TAG)
-		goto found_tag;
+	if (data->flags & BLK_MQ_REQ_NOWAIT || preempt_tag(data, bt)) {
+		tag = __blk_mq_get_tag(data, bt);
+		if (tag != BLK_MQ_NO_TAG)
+			goto found_tag;
 
-	if (data->flags & BLK_MQ_REQ_NOWAIT)
-		return BLK_MQ_NO_TAG;
+		if (data->flags & BLK_MQ_REQ_NOWAIT)
+			return BLK_MQ_NO_TAG;
+	}
 
 	do {
 		struct sbitmap_queue *bt_prev;
@@ -169,20 +178,25 @@ unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
 		 * Retry tag allocation after running the hardware queue,
 		 * as running the queue may also have found completions.
 		 */
-		tag = __blk_mq_get_tag(data, bt);
-		if (tag != BLK_MQ_NO_TAG)
-			break;
+		if (preempt_tag(data, bt)) {
+			tag = __blk_mq_get_tag(data, bt);
+			if (tag != BLK_MQ_NO_TAG)
+				break;
+		}
 
 		ws = bt_wait_ptr(bt, data->hctx);
 		sbitmap_prepare_to_wait(bt, ws, &wait, TASK_UNINTERRUPTIBLE);
 
-		tag = __blk_mq_get_tag(data, bt);
-		if (tag != BLK_MQ_NO_TAG)
-			break;
+		if (preempt_tag(data, bt)) {
+			tag = __blk_mq_get_tag(data, bt);
+			if (tag != BLK_MQ_NO_TAG)
+				break;
+		}
 
 		bt_prev = bt;
 		io_schedule();
 
+		data->preemption = true;
 		sbitmap_finish_wait(bt, ws, &wait);
 
 		data->ctx = blk_mq_get_ctx(data->q);
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 2615bd58bad3..b49b20e11350 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -156,6 +156,7 @@ struct blk_mq_alloc_data {
 
 	/* allocate multiple requests/tags in one go */
 	unsigned int nr_tags;
+	bool preemption;
 	struct request **cached_rq;
 
 	/* input & output parameter */
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH -next RFC v2 5/8] sbitmap: force tag preemption if free tags are sufficient
  2022-04-08  7:39 [PATCH -next RFC v2 0/8] improve tag allocation under heavy load Yu Kuai
                   ` (3 preceding siblings ...)
  2022-04-08  7:39 ` [PATCH -next RFC v2 4/8] blk-mq: don't preempt tag under heavy load Yu Kuai
@ 2022-04-08  7:39 ` Yu Kuai
  2022-04-08  7:39 ` [PATCH -next RFC v2 6/8] blk-mq: force tag preemption for split bios Yu Kuai
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 28+ messages in thread
From: Yu Kuai @ 2022-04-08  7:39 UTC (permalink / raw)
  To: axboe, yukuai3, andriy.shevchenko, john.garry, ming.lei
  Cc: linux-block, linux-kernel, yi.zhang

Now that tag preemption can be disabled, if the woken threads don't use
up 'wake_batch' tags while preemption is still disabled, io concurrency
will decline.

To fix the problem, add a check before waking up, and force tag
preemption if free tags are sufficient, so that the extra tags can be
used by new io.
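For illustration only (not part of the patch), the detection can be
modelled in userspace C; the function name is an assumption, and
'depth' and 'busy' stand in for the sbitmap depth and weight:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Userspace sketch of the check: preemption is forced again whenever at
 * least two full wake batches of tags are free, so tags that the woken
 * waiters won't consume remain usable by new io.
 */
static bool force_preemption(unsigned int depth, unsigned int busy,
			     unsigned int wake_batch)
{
	return (depth - busy) >= (wake_batch << 1);
}
```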

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
 block/blk-mq-tag.c      |  3 ++-
 include/linux/sbitmap.h |  2 ++
 lib/sbitmap.c           | 11 +++++++++++
 3 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index be2d49e6d69e..dfbb06edfbc3 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -131,7 +131,8 @@ static inline bool preempt_tag(struct blk_mq_alloc_data *data,
 			       struct sbitmap_queue *bt)
 {
 	return data->preemption ||
-	       atomic_read(&bt->ws_active) <= SBQ_WAIT_QUEUES;
+	       atomic_read(&bt->ws_active) <= SBQ_WAIT_QUEUES ||
+	       bt->force_tag_preemption;
 }
 
 unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
diff --git a/include/linux/sbitmap.h b/include/linux/sbitmap.h
index 8a64271d0696..ca00ccb6af48 100644
--- a/include/linux/sbitmap.h
+++ b/include/linux/sbitmap.h
@@ -143,6 +143,8 @@ struct sbitmap_queue {
 	 * sbitmap_queue_get_shallow()
 	 */
 	unsigned int min_shallow_depth;
+
+	bool force_tag_preemption;
 };
 
 /**
diff --git a/lib/sbitmap.c b/lib/sbitmap.c
index 176fba0252d7..8d01e02ea4b1 100644
--- a/lib/sbitmap.c
+++ b/lib/sbitmap.c
@@ -434,6 +434,7 @@ int sbitmap_queue_init_node(struct sbitmap_queue *sbq, unsigned int depth,
 	sbq->wake_batch = sbq_calc_wake_batch(sbq, depth);
 	atomic_set(&sbq->wake_index, 0);
 	atomic_set(&sbq->ws_active, 0);
+	sbq->force_tag_preemption = true;
 
 	sbq->ws = kzalloc_node(SBQ_WAIT_QUEUES * sizeof(*sbq->ws), flags, node);
 	if (!sbq->ws) {
@@ -604,6 +605,15 @@ static void sbq_update_wake_index(struct sbitmap_queue *sbq,
 		atomic_cmpxchg(&sbq->wake_index, old_wake_index, index);
 }
 
+static inline void sbq_update_preemption(struct sbitmap_queue *sbq,
+					 unsigned int wake_batch)
+{
+	bool force = (sbq->sb.depth - sbitmap_weight(&sbq->sb)) >=
+		     wake_batch << 1;
+
+	WRITE_ONCE(sbq->force_tag_preemption, force);
+}
+
 static bool __sbq_wake_up(struct sbitmap_queue *sbq)
 {
 	struct sbq_wait_state *ws;
@@ -637,6 +647,7 @@ static bool __sbq_wake_up(struct sbitmap_queue *sbq)
 	 */
 	smp_mb__before_atomic();
 	atomic_set(&ws->wait_cnt, wake_batch);
+	sbq_update_preemption(sbq, wake_batch);
 	wake_up_nr(&ws->wait, wake_batch);
 
 	return true;
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH -next RFC v2 6/8] blk-mq: force tag preemption for split bios
  2022-04-08  7:39 [PATCH -next RFC v2 0/8] improve tag allocation under heavy load Yu Kuai
                   ` (4 preceding siblings ...)
  2022-04-08  7:39 ` [PATCH -next RFC v2 5/8] sbitmap: force tag preemption if free tags are sufficient Yu Kuai
@ 2022-04-08  7:39 ` Yu Kuai
  2022-04-08  7:39 ` [PATCH -next RFC v2 7/8] blk-mq: record how many tags are needed for split bio Yu Kuai
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 28+ messages in thread
From: Yu Kuai @ 2022-04-08  7:39 UTC (permalink / raw)
  To: axboe, yukuai3, andriy.shevchenko, john.garry, ming.lei
  Cc: linux-block, linux-kernel, yi.zhang

For HDDs, sequential io is much faster than random io, thus it's better
to issue split ios continuously. However, this is broken when tag
preemption is disabled, because each waker can only get one tag at a
time.

Thus tag preemption is forced for split bios: specifically, the first
bio won't preempt a tag, while the following split bios will.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
 block/blk-merge.c         | 9 ++++++++-
 block/blk-mq.c            | 1 +
 include/linux/blk_types.h | 4 ++++
 3 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/block/blk-merge.c b/block/blk-merge.c
index 7771dacc99cb..cab6ca681513 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -343,12 +343,19 @@ void __blk_queue_split(struct request_queue *q, struct bio **bio,
 
 	if (split) {
 		/* there isn't chance to merge the splitted bio */
-		split->bi_opf |= REQ_NOMERGE;
+		split->bi_opf |= (REQ_NOMERGE | REQ_SPLIT);
+		if ((*bio)->bi_opf & REQ_SPLIT)
+			split->bi_opf |= REQ_PREEMPT;
+		else
+			(*bio)->bi_opf |= REQ_SPLIT;
 
 		bio_chain(split, *bio);
 		trace_block_split(split, (*bio)->bi_iter.bi_sector);
 		submit_bio_noacct(*bio);
 		*bio = split;
+	} else {
+		if ((*bio)->bi_opf & REQ_SPLIT)
+			(*bio)->bi_opf |= REQ_PREEMPT;
 	}
 }
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index ed3ed86f7dd2..909420c5186c 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2737,6 +2737,7 @@ static struct request *blk_mq_get_new_requests(struct request_queue *q,
 		.q		= q,
 		.nr_tags	= 1,
 		.cmd_flags	= bio->bi_opf,
+		.preemption	= (bio->bi_opf & REQ_PREEMPT),
 	};
 	struct request *rq;
 
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index c62274466e72..6b56e271f926 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -418,6 +418,8 @@ enum req_flag_bits {
 	/* for driver use */
 	__REQ_DRV,
 	__REQ_SWAP,		/* swapping request. */
+	__REQ_SPLIT,
+	__REQ_PREEMPT,
 	__REQ_NR_BITS,		/* stops here */
 };
 
@@ -443,6 +445,8 @@ enum req_flag_bits {
 
 #define REQ_DRV			(1ULL << __REQ_DRV)
 #define REQ_SWAP		(1ULL << __REQ_SWAP)
+#define REQ_SPLIT		(1ULL << __REQ_SPLIT)
+#define REQ_PREEMPT		(1ULL << __REQ_PREEMPT)
 
 #define REQ_FAILFAST_MASK \
 	(REQ_FAILFAST_DEV | REQ_FAILFAST_TRANSPORT | REQ_FAILFAST_DRIVER)
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH -next RFC v2 7/8] blk-mq: record how many tags are needed for split bio
  2022-04-08  7:39 [PATCH -next RFC v2 0/8] improve tag allocation under heavy load Yu Kuai
                   ` (5 preceding siblings ...)
  2022-04-08  7:39 ` [PATCH -next RFC v2 6/8] blk-mq: force tag preemption for split bios Yu Kuai
@ 2022-04-08  7:39 ` Yu Kuai
  2022-04-08  7:39 ` [PATCH -next RFC v2 8/8] sbitmap: wake up the number of threads based on required tags Yu Kuai
  2022-04-08 19:10 ` [PATCH -next RFC v2 0/8] improve tag allocation under heavy load Jens Axboe
  8 siblings, 0 replies; 28+ messages in thread
From: Yu Kuai @ 2022-04-08  7:39 UTC (permalink / raw)
  To: axboe, yukuai3, andriy.shevchenko, john.garry, ming.lei
  Cc: linux-block, linux-kernel, yi.zhang

Currently, each time 8 (or wake batch) requests are done, 8 waiters are
woken up. This is not always necessary, because we only need to make
sure the woken threads will use up 8 tags. For example, if we know in
advance that one thread needs 8 tags, then waking up a single thread is
enough, which also avoids unnecessary context switches.

This patch provides that information: how many tags will be needed for
a huge io. It will be used in the next patch.
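For illustration only (not part of the patch), the estimate computed by
caculate_sectors_split() below can be modelled in userspace C; the
function name here is an assumption:

```c
#include <assert.h>

/*
 * Userspace sketch of the estimate: a bio of 'bio_sectors' sectors
 * against a 'max_sectors' limit is split into
 * ceil(bio_sectors / max_sectors) bios, so the number of extra tags
 * needed beyond the first is (bio_sectors - 1) / max_sectors.
 */
static unsigned int nr_split_for(unsigned int bio_sectors,
				 unsigned int max_sectors)
{
	if (!bio_sectors)
		return 0;
	return (bio_sectors - 1) / max_sectors;
}
```

For instance, a 1m bio (2048 sectors) with 'max_sectors_kb' of 128
(a 256-sector limit) splits into 8 bios, i.e. 7 extra tags.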

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
 block/blk-mq-tag.c      |  1 +
 block/blk-mq.c          | 24 +++++++++++++++++++++---
 block/blk-mq.h          |  1 +
 include/linux/sbitmap.h |  2 ++
 4 files changed, 25 insertions(+), 3 deletions(-)

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index dfbb06edfbc3..f91879772dc8 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -165,6 +165,7 @@ unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
 			return BLK_MQ_NO_TAG;
 	}
 
+	wait.nr_tags += data->nr_split;
 	do {
 		struct sbitmap_queue *bt_prev;
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 909420c5186c..65a3b11d5c9f 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2731,12 +2731,14 @@ static bool blk_mq_attempt_bio_merge(struct request_queue *q,
 static struct request *blk_mq_get_new_requests(struct request_queue *q,
 					       struct blk_plug *plug,
 					       struct bio *bio,
-					       unsigned int nsegs)
+					       unsigned int nsegs,
+					       unsigned int nr_split)
 {
 	struct blk_mq_alloc_data data = {
 		.q		= q,
 		.nr_tags	= 1,
 		.cmd_flags	= bio->bi_opf,
+		.nr_split	= nr_split,
 		.preemption	= (bio->bi_opf & REQ_PREEMPT),
 	};
 	struct request *rq;
@@ -2795,6 +2797,19 @@ static inline struct request *blk_mq_get_cached_request(struct request_queue *q,
 	return rq;
 }
 
+static inline unsigned int caculate_sectors_split(struct bio *bio)
+{
+	switch (bio_op(bio)) {
+	case REQ_OP_DISCARD:
+	case REQ_OP_SECURE_ERASE:
+	case REQ_OP_WRITE_ZEROES:
+		return 0;
+	default:
+		return (bio_sectors(bio) - 1) /
+			queue_max_sectors(bio->bi_bdev->bd_queue);
+	}
+}
+
 /**
  * blk_mq_submit_bio - Create and send a request to block device.
  * @bio: Bio pointer.
@@ -2815,11 +2830,14 @@ void blk_mq_submit_bio(struct bio *bio)
 	const int is_sync = op_is_sync(bio->bi_opf);
 	struct request *rq;
 	unsigned int nr_segs = 1;
+	unsigned int nr_split = 0;
 	blk_status_t ret;
 
 	blk_queue_bounce(q, &bio);
-	if (blk_may_split(q, bio))
+	if (blk_may_split(q, bio)) {
+		nr_split = caculate_sectors_split(bio);
 		__blk_queue_split(q, &bio, &nr_segs);
+	}
 
 	if (!bio_integrity_prep(bio))
 		return;
@@ -2828,7 +2846,7 @@ void blk_mq_submit_bio(struct bio *bio)
 	if (!rq) {
 		if (!bio)
 			return;
-		rq = blk_mq_get_new_requests(q, plug, bio, nr_segs);
+		rq = blk_mq_get_new_requests(q, plug, bio, nr_segs, nr_split);
 		if (unlikely(!rq))
 			return;
 	}
diff --git a/block/blk-mq.h b/block/blk-mq.h
index b49b20e11350..dfb2f1b9bf06 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -156,6 +156,7 @@ struct blk_mq_alloc_data {
 
 	/* allocate multiple requests/tags in one go */
 	unsigned int nr_tags;
+	unsigned int nr_split;
 	bool preemption;
 	struct request **cached_rq;
 
diff --git a/include/linux/sbitmap.h b/include/linux/sbitmap.h
index ca00ccb6af48..1abd8ed5d406 100644
--- a/include/linux/sbitmap.h
+++ b/include/linux/sbitmap.h
@@ -596,12 +596,14 @@ void sbitmap_queue_wake_up(struct sbitmap_queue *sbq);
 void sbitmap_queue_show(struct sbitmap_queue *sbq, struct seq_file *m);
 
 struct sbq_wait {
+	unsigned int nr_tags;
 	struct sbitmap_queue *sbq;	/* if set, sbq_wait is accounted */
 	struct wait_queue_entry wait;
 };
 
 #define DEFINE_SBQ_WAIT(name)							\
 	struct sbq_wait name = {						\
+		.nr_tags = 1,							\
 		.sbq = NULL,							\
 		.wait = {							\
 			.private	= current,				\
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH -next RFC v2 8/8] sbitmap: wake up the number of threads based on required tags
  2022-04-08  7:39 [PATCH -next RFC v2 0/8] improve tag allocation under heavy load Yu Kuai
                   ` (6 preceding siblings ...)
  2022-04-08  7:39 ` [PATCH -next RFC v2 7/8] blk-mq: record how many tags are needed for split bio Yu Kuai
@ 2022-04-08  7:39 ` Yu Kuai
  2022-04-08 14:31   ` Bart Van Assche
  2022-04-08 21:13   ` Bart Van Assche
  2022-04-08 19:10 ` [PATCH -next RFC v2 0/8] improve tag allocation under heavy load Jens Axboe
  8 siblings, 2 replies; 28+ messages in thread
From: Yu Kuai @ 2022-04-08  7:39 UTC (permalink / raw)
  To: axboe, yukuai3, andriy.shevchenko, john.garry, ming.lei
  Cc: linux-block, linux-kernel, yi.zhang

Always waking up 'wake_batch' threads will intensify competition and
split io won't be issued continuously. Now that the number of tags
required is recorded for huge io, it's safe to wake up based on required
tags.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
 lib/sbitmap.c | 22 +++++++++++++++++++++-
 1 file changed, 21 insertions(+), 1 deletion(-)

diff --git a/lib/sbitmap.c b/lib/sbitmap.c
index 8d01e02ea4b1..eac9fa5c2b4d 100644
--- a/lib/sbitmap.c
+++ b/lib/sbitmap.c
@@ -614,6 +614,26 @@ static inline void sbq_update_preemption(struct sbitmap_queue *sbq,
 	WRITE_ONCE(sbq->force_tag_preemption, force);
 }
 
+static unsigned int get_wake_nr(struct sbq_wait_state *ws, unsigned int nr_tags)
+{
+	struct sbq_wait *wait;
+	struct wait_queue_entry *entry;
+	unsigned int nr = 1;
+
+	spin_lock_irq(&ws->wait.lock);
+	list_for_each_entry(entry, &ws->wait.head, entry) {
+		wait = container_of(entry, struct sbq_wait, wait);
+		if (nr_tags <= wait->nr_tags)
+			break;
+
+		nr++;
+		nr_tags -= wait->nr_tags;
+	}
+	spin_unlock_irq(&ws->wait.lock);
+
+	return nr;
+}
+
 static bool __sbq_wake_up(struct sbitmap_queue *sbq)
 {
 	struct sbq_wait_state *ws;
@@ -648,7 +668,7 @@ static bool __sbq_wake_up(struct sbitmap_queue *sbq)
 	smp_mb__before_atomic();
 	atomic_set(&ws->wait_cnt, wake_batch);
 	sbq_update_preemption(sbq, wake_batch);
-	wake_up_nr(&ws->wait, wake_batch);
+	wake_up_nr(&ws->wait, get_wake_nr(ws, wake_batch));
 
 	return true;
 }
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH -next RFC v2 2/8] blk-mq: call 'bt_wait_ptr()' later in blk_mq_get_tag()
  2022-04-08  7:39 ` [PATCH -next RFC v2 2/8] blk-mq: call 'bt_wait_ptr()' later in blk_mq_get_tag() Yu Kuai
@ 2022-04-08 14:20   ` Bart Van Assche
  2022-04-09  2:09     ` yukuai (C)
  0 siblings, 1 reply; 28+ messages in thread
From: Bart Van Assche @ 2022-04-08 14:20 UTC (permalink / raw)
  To: Yu Kuai, axboe, andriy.shevchenko, john.garry, ming.lei
  Cc: linux-block, linux-kernel, yi.zhang

On 4/8/22 00:39, Yu Kuai wrote:
> bt_wait_ptr() will increase 'wait_index', however, if blk_mq_get_tag()
> get a tag successfully after bt_wait_ptr() is called and before
> sbitmap_prepare_to_wait() is called, then the 'ws' is skipped. This
> behavior might cause 8 waitqueues to be unbalanced.
> 
> Move bt_wait_ptr() later should reduce the problem when the disk is
> under high io preesure.
> 
> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
> ---
>   block/blk-mq-tag.c | 4 +---
>   1 file changed, 1 insertion(+), 3 deletions(-)
> 
> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
> index 68ac23d0b640..228a0001694f 100644
> --- a/block/blk-mq-tag.c
> +++ b/block/blk-mq-tag.c
> @@ -155,7 +155,6 @@ unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
>   	if (data->flags & BLK_MQ_REQ_NOWAIT)
>   		return BLK_MQ_NO_TAG;
>   
> -	ws = bt_wait_ptr(bt, data->hctx);
>   	do {
>   		struct sbitmap_queue *bt_prev;
>   
> @@ -174,6 +173,7 @@ unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
>   		if (tag != BLK_MQ_NO_TAG)
>   			break;
>   
> +		ws = bt_wait_ptr(bt, data->hctx);
>   		sbitmap_prepare_to_wait(bt, ws, &wait, TASK_UNINTERRUPTIBLE);
>   
>   		tag = __blk_mq_get_tag(data, bt);
> @@ -201,8 +201,6 @@ unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
>   		 */
>   		if (bt != bt_prev)
>   			sbitmap_queue_wake_up(bt_prev);
> -
> -		ws = bt_wait_ptr(bt, data->hctx);
>   	} while (1);

Is it necessary to call bt_wait_ptr() during every loop iteration or 
only if bt != bt_prev? Would calling bt_wait_ptr() only if bt != bt_prev 
help to reduce unfairness further?

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH -next RFC v2 4/8] blk-mq: don't preempt tag under heavy load
  2022-04-08  7:39 ` [PATCH -next RFC v2 4/8] blk-mq: don't preempt tag under heavy load Yu Kuai
@ 2022-04-08 14:24   ` Bart Van Assche
  2022-04-09  2:38     ` yukuai (C)
  0 siblings, 1 reply; 28+ messages in thread
From: Bart Van Assche @ 2022-04-08 14:24 UTC (permalink / raw)
  To: Yu Kuai, axboe, andriy.shevchenko, john.garry, ming.lei
  Cc: linux-block, linux-kernel, yi.zhang

On 4/8/22 00:39, Yu Kuai wrote:
> The idle way to disable tag preemption is to track how many tags are

idle -> ideal?

> available, and wait directly in blk_mq_get_tag() if free tags are
> very little. However, this is out of reality because fast path is
> affected.
> 
> As 'ws_active' is only updated in slow path, this patch disable tag
> preemption if 'ws_active' is greater than 8, which means there are many
> threads waiting for tags already.
> 
> Once tag preemption is disabled, there is a situation that can cause
> performance degration(or io hung in extreme scenarios): the waitqueue

degration -> degradation?

> diff --git a/block/blk-mq.h b/block/blk-mq.h
> index 2615bd58bad3..b49b20e11350 100644
> --- a/block/blk-mq.h
> +++ b/block/blk-mq.h
> @@ -156,6 +156,7 @@ struct blk_mq_alloc_data {
>   
>   	/* allocate multiple requests/tags in one go */
>   	unsigned int nr_tags;
> +	bool preemption;
>   	struct request **cached_rq;
>   

Please change "preemption" into "preempt".

Thanks,

Bart.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH -next RFC v2 8/8] sbitmap: wake up the number of threads based on required tags
  2022-04-08  7:39 ` [PATCH -next RFC v2 8/8] sbitmap: wake up the number of threads based on required tags Yu Kuai
@ 2022-04-08 14:31   ` Bart Van Assche
  2022-04-09  2:19     ` yukuai (C)
  2022-04-08 21:13   ` Bart Van Assche
  1 sibling, 1 reply; 28+ messages in thread
From: Bart Van Assche @ 2022-04-08 14:31 UTC (permalink / raw)
  To: Yu Kuai, axboe, andriy.shevchenko, john.garry, ming.lei
  Cc: linux-block, linux-kernel, yi.zhang

On 4/8/22 00:39, Yu Kuai wrote:
> Always wake up 'wake_batch' threads will intensify competition and
> split io won't be issued continuously. Now that how many tags is required
> is recorded for huge io, it's safe to wake up baed on required tags.
> 
> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
> ---
>   lib/sbitmap.c | 22 +++++++++++++++++++++-
>   1 file changed, 21 insertions(+), 1 deletion(-)
> 
> diff --git a/lib/sbitmap.c b/lib/sbitmap.c
> index 8d01e02ea4b1..eac9fa5c2b4d 100644
> --- a/lib/sbitmap.c
> +++ b/lib/sbitmap.c
> @@ -614,6 +614,26 @@ static inline void sbq_update_preemption(struct sbitmap_queue *sbq,
>   	WRITE_ONCE(sbq->force_tag_preemption, force);
>   }
>   
> +static unsigned int get_wake_nr(struct sbq_wait_state *ws, unsigned int nr_tags)

Consider renaming "get_wake_nr()" into "nr_to_wake_up()".

> +{
> +	struct sbq_wait *wait;
> +	struct wait_queue_entry *entry;
> +	unsigned int nr = 1;
> +
> +	spin_lock_irq(&ws->wait.lock);
> +	list_for_each_entry(entry, &ws->wait.head, entry) {
> +		wait = container_of(entry, struct sbq_wait, wait);
> +		if (nr_tags <= wait->nr_tags)
> +			break;
> +
> +		nr++;
> +		nr_tags -= wait->nr_tags;
> +	}
> +	spin_unlock_irq(&ws->wait.lock);
> +
> +	return nr;
> +}
> +
>   static bool __sbq_wake_up(struct sbitmap_queue *sbq)
>   {
>   	struct sbq_wait_state *ws;
> @@ -648,7 +668,7 @@ static bool __sbq_wake_up(struct sbitmap_queue *sbq)
>   	smp_mb__before_atomic();
>   	atomic_set(&ws->wait_cnt, wake_batch);
>   	sbq_update_preemption(sbq, wake_batch);
> -	wake_up_nr(&ws->wait, wake_batch);
> +	wake_up_nr(&ws->wait, get_wake_nr(ws, wake_batch));
>   
>   	return true;
>   }

ws->wait.lock is unlocked after the number of threads to wake up has 
been computed and is locked again by wake_up_nr(). The ws->wait.head 
list may be modified after get_wake_nr() returns and before wake_up_nr() 
is called. Isn't that a race condition?

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH -next RFC v2 0/8] improve tag allocation under heavy load
  2022-04-08  7:39 [PATCH -next RFC v2 0/8] improve tag allocation under heavy load Yu Kuai
                   ` (7 preceding siblings ...)
  2022-04-08  7:39 ` [PATCH -next RFC v2 8/8] sbitmap: wake up the number of threads based on required tags Yu Kuai
@ 2022-04-08 19:10 ` Jens Axboe
  2022-04-09  2:26   ` yukuai (C)
  8 siblings, 1 reply; 28+ messages in thread
From: Jens Axboe @ 2022-04-08 19:10 UTC (permalink / raw)
  To: Yu Kuai, andriy.shevchenko, john.garry, ming.lei
  Cc: linux-block, linux-kernel, yi.zhang

For this one, and other patches you send, they end up in spam because
the sender can't be verified. I would encourage you to figure out what
is going wrong here, because a lot of your patches end up getting
dropped or missed because of it.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH -next RFC v2 8/8] sbitmap: wake up the number of threads based on required tags
  2022-04-08  7:39 ` [PATCH -next RFC v2 8/8] sbitmap: wake up the number of threads based on required tags Yu Kuai
  2022-04-08 14:31   ` Bart Van Assche
@ 2022-04-08 21:13   ` Bart Van Assche
  2022-04-09  2:17     ` yukuai (C)
  1 sibling, 1 reply; 28+ messages in thread
From: Bart Van Assche @ 2022-04-08 21:13 UTC (permalink / raw)
  To: Yu Kuai, axboe, andriy.shevchenko, john.garry, ming.lei
  Cc: linux-block, linux-kernel, yi.zhang

On 4/8/22 00:39, Yu Kuai wrote:
> +static unsigned int get_wake_nr(struct sbq_wait_state *ws, unsigned int nr_tags)
> +{
> +	struct sbq_wait *wait;
> +	struct wait_queue_entry *entry;
> +	unsigned int nr = 1;
> +
> +	spin_lock_irq(&ws->wait.lock);
> +	list_for_each_entry(entry, &ws->wait.head, entry) {
> +		wait = container_of(entry, struct sbq_wait, wait);
> +		if (nr_tags <= wait->nr_tags)
> +			break;
> +
> +		nr++;
> +		nr_tags -= wait->nr_tags;
> +	}
> +	spin_unlock_irq(&ws->wait.lock);
> +
> +	return nr;
> +}
> +
>   static bool __sbq_wake_up(struct sbitmap_queue *sbq)
>   {
>   	struct sbq_wait_state *ws;
> @@ -648,7 +668,7 @@ static bool __sbq_wake_up(struct sbitmap_queue *sbq)
>   	smp_mb__before_atomic();
>   	atomic_set(&ws->wait_cnt, wake_batch);
>   	sbq_update_preemption(sbq, wake_batch);
> -	wake_up_nr(&ws->wait, wake_batch);
> +	wake_up_nr(&ws->wait, get_wake_nr(ws, wake_batch));
>   
>   	return true;
>   }

An additional comment: my understanding is that __sbq_wake_up() should 
wake up exactly `wake_batch` waiters. The above patch changes that into 
waking up at most `wake_batch` waiters. I think that's wrong.

Bart.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH -next RFC v2 2/8] blk-mq: call 'bt_wait_ptr()' later in blk_mq_get_tag()
  2022-04-08 14:20   ` Bart Van Assche
@ 2022-04-09  2:09     ` yukuai (C)
  0 siblings, 0 replies; 28+ messages in thread
From: yukuai (C) @ 2022-04-09  2:09 UTC (permalink / raw)
  To: Bart Van Assche, axboe, andriy.shevchenko, john.garry, ming.lei
  Cc: linux-block, linux-kernel, yi.zhang

在 2022/04/08 22:20, Bart Van Assche 写道:
> On 4/8/22 00:39, Yu Kuai wrote:
>> bt_wait_ptr() will increase 'wait_index', however, if blk_mq_get_tag()
>> get a tag successfully after bt_wait_ptr() is called and before
>> sbitmap_prepare_to_wait() is called, then the 'ws' is skipped. This
>> behavior might cause 8 waitqueues to be unbalanced.
>>
>> Move bt_wait_ptr() later should reduce the problem when the disk is
>> under high io preesure.
>>
>> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
>> ---
>>   block/blk-mq-tag.c | 4 +---
>>   1 file changed, 1 insertion(+), 3 deletions(-)
>>
>> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
>> index 68ac23d0b640..228a0001694f 100644
>> --- a/block/blk-mq-tag.c
>> +++ b/block/blk-mq-tag.c
>> @@ -155,7 +155,6 @@ unsigned int blk_mq_get_tag(struct 
>> blk_mq_alloc_data *data)
>>       if (data->flags & BLK_MQ_REQ_NOWAIT)
>>           return BLK_MQ_NO_TAG;
>> -    ws = bt_wait_ptr(bt, data->hctx);
>>       do {
>>           struct sbitmap_queue *bt_prev;
>> @@ -174,6 +173,7 @@ unsigned int blk_mq_get_tag(struct 
>> blk_mq_alloc_data *data)
>>           if (tag != BLK_MQ_NO_TAG)
>>               break;
>> +        ws = bt_wait_ptr(bt, data->hctx);
>>           sbitmap_prepare_to_wait(bt, ws, &wait, TASK_UNINTERRUPTIBLE);
>>           tag = __blk_mq_get_tag(data, bt);
>> @@ -201,8 +201,6 @@ unsigned int blk_mq_get_tag(struct 
>> blk_mq_alloc_data *data)
>>            */
>>           if (bt != bt_prev)
>>               sbitmap_queue_wake_up(bt_prev);
>> -
>> -        ws = bt_wait_ptr(bt, data->hctx);
>>       } while (1);
> 
> Is it necessary to call bt_wait_ptr() during every loop iteration or 
> only if bt != bt_prev? Would calling bt_wait_ptr() only if bt != bt_prev 
> help to reduce unfairness further?
Hi,

You are right, that sounds reasonable.

Thanks,
Kuai
> 
> Thanks,
> 
> Bart.
> .
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH -next RFC v2 8/8] sbitmap: wake up the number of threads based on required tags
  2022-04-08 21:13   ` Bart Van Assche
@ 2022-04-09  2:17     ` yukuai (C)
  2022-04-09  4:16       ` Bart Van Assche
  0 siblings, 1 reply; 28+ messages in thread
From: yukuai (C) @ 2022-04-09  2:17 UTC (permalink / raw)
  To: Bart Van Assche, axboe, andriy.shevchenko, john.garry, ming.lei
  Cc: linux-block, linux-kernel, yi.zhang

在 2022/04/09 5:13, Bart Van Assche 写道:
> On 4/8/22 00:39, Yu Kuai wrote:
>> +static unsigned int get_wake_nr(struct sbq_wait_state *ws, unsigned 
>> int nr_tags)
>> +{
>> +    struct sbq_wait *wait;
>> +    struct wait_queue_entry *entry;
>> +    unsigned int nr = 1;
>> +
>> +    spin_lock_irq(&ws->wait.lock);
>> +    list_for_each_entry(entry, &ws->wait.head, entry) {
>> +        wait = container_of(entry, struct sbq_wait, wait);
>> +        if (nr_tags <= wait->nr_tags)
>> +            break;
>> +
>> +        nr++;
>> +        nr_tags -= wait->nr_tags;
>> +    }
>> +    spin_unlock_irq(&ws->wait.lock);
>> +
>> +    return nr;
>> +}
>> +
>>   static bool __sbq_wake_up(struct sbitmap_queue *sbq)
>>   {
>>       struct sbq_wait_state *ws;
>> @@ -648,7 +668,7 @@ static bool __sbq_wake_up(struct sbitmap_queue *sbq)
>>       smp_mb__before_atomic();
>>       atomic_set(&ws->wait_cnt, wake_batch);
>>       sbq_update_preemption(sbq, wake_batch);
>> -    wake_up_nr(&ws->wait, wake_batch);
>> +    wake_up_nr(&ws->wait, get_wake_nr(ws, wake_batch));
>>       return true;
>>   }
> 
> An additional comment: my understanding is that __sbq_wake_up() should 
> wake up exactly `wake_batch` waiters. The above patch changes that into 
> waking up at most `wake_batch` waiters. I think that's wrong.
Hi,

I think the reason to wake up 'wake_batch' waiters is to make sure
wakers will use up the 'wake_batch' tags that were just freed, because
each waker should acquire at least one tag. Thus I think if we can make
sure wakers will use up 'wake_batch' tags, it's ok to wake up fewer
waiters.

Please kindly correct me if I'm wrong.

Thanks,
Kuai
> 
> Bart.
> .
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH -next RFC v2 8/8] sbitmap: wake up the number of threads based on required tags
  2022-04-08 14:31   ` Bart Van Assche
@ 2022-04-09  2:19     ` yukuai (C)
  0 siblings, 0 replies; 28+ messages in thread
From: yukuai (C) @ 2022-04-09  2:19 UTC (permalink / raw)
  To: Bart Van Assche, axboe, andriy.shevchenko, john.garry, ming.lei
  Cc: linux-block, linux-kernel, yi.zhang

在 2022/04/08 22:31, Bart Van Assche 写道:
> On 4/8/22 00:39, Yu Kuai wrote:
>> Always wake up 'wake_batch' threads will intensify competition and
>> split io won't be issued continuously. Now that how many tags is required
>> is recorded for huge io, it's safe to wake up baed on required tags.
>>
>> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
>> ---
>>   lib/sbitmap.c | 22 +++++++++++++++++++++-
>>   1 file changed, 21 insertions(+), 1 deletion(-)
>>
>> diff --git a/lib/sbitmap.c b/lib/sbitmap.c
>> index 8d01e02ea4b1..eac9fa5c2b4d 100644
>> --- a/lib/sbitmap.c
>> +++ b/lib/sbitmap.c
>> @@ -614,6 +614,26 @@ static inline void sbq_update_preemption(struct 
>> sbitmap_queue *sbq,
>>       WRITE_ONCE(sbq->force_tag_preemption, force);
>>   }
>> +static unsigned int get_wake_nr(struct sbq_wait_state *ws, unsigned 
>> int nr_tags)
> 
> Consider renaming "get_wake_nr()" into "nr_to_wake_up()".
> 
>> +{
>> +    struct sbq_wait *wait;
>> +    struct wait_queue_entry *entry;
>> +    unsigned int nr = 1;
>> +
>> +    spin_lock_irq(&ws->wait.lock);
>> +    list_for_each_entry(entry, &ws->wait.head, entry) {
>> +        wait = container_of(entry, struct sbq_wait, wait);
>> +        if (nr_tags <= wait->nr_tags)
>> +            break;
>> +
>> +        nr++;
>> +        nr_tags -= wait->nr_tags;
>> +    }
>> +    spin_unlock_irq(&ws->wait.lock);
>> +
>> +    return nr;
>> +}
>> +
>>   static bool __sbq_wake_up(struct sbitmap_queue *sbq)
>>   {
>>       struct sbq_wait_state *ws;
>> @@ -648,7 +668,7 @@ static bool __sbq_wake_up(struct sbitmap_queue *sbq)
>>       smp_mb__before_atomic();
>>       atomic_set(&ws->wait_cnt, wake_batch);
>>       sbq_update_preemption(sbq, wake_batch);
>> -    wake_up_nr(&ws->wait, wake_batch);
>> +    wake_up_nr(&ws->wait, get_wake_nr(ws, wake_batch));
>>       return true;
>>   }
> 
> ws->wait.lock is unlocked after the number of threads to wake up has 
> been computed and is locked again by wake_up_nr(). The ws->wait.head 
> list may be modified after get_wake_nr() returns and before wake_up_nr() 
> is called. Isn't that a race condition?
Hi,

That is a race condition; I was hoping that the fix in patch 5 can
cover this.

Thanks,
Kuai
> 
> Thanks,
> 
> Bart.
> .
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH -next RFC v2 0/8] improve tag allocation under heavy load
  2022-04-08 19:10 ` [PATCH -next RFC v2 0/8] improve tag allocation under heavy load Jens Axboe
@ 2022-04-09  2:26   ` yukuai (C)
  2022-04-09  2:28     ` Jens Axboe
  0 siblings, 1 reply; 28+ messages in thread
From: yukuai (C) @ 2022-04-09  2:26 UTC (permalink / raw)
  To: Jens Axboe, andriy.shevchenko, john.garry, ming.lei
  Cc: linux-block, linux-kernel, yi.zhang

在 2022/04/09 3:10, Jens Axboe 写道:
> For this one, and other patches you send, they send up in spam because
> the sender can't be verified. I would encourage you to figure out what
> is going wrong here, because a lot of your patches end up getting
> dropped or missed because of it.
> 

Hi,

Thanks for the notice; however, I have no clue what is going on right
now. I'll look for some help and hopefully that can be fixed.

Thanks,
Kuai

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH -next RFC v2 0/8] improve tag allocation under heavy load
  2022-04-09  2:26   ` yukuai (C)
@ 2022-04-09  2:28     ` Jens Axboe
  2022-04-09  2:34       ` yukuai (C)
                         ` (2 more replies)
  0 siblings, 3 replies; 28+ messages in thread
From: Jens Axboe @ 2022-04-09  2:28 UTC (permalink / raw)
  To: yukuai (C), andriy.shevchenko, john.garry, ming.lei
  Cc: linux-block, linux-kernel, yi.zhang

On 4/8/22 8:26 PM, yukuai (C) wrote:
> 在 2022/04/09 3:10, Jens Axboe 写道:
>> For this one, and other patches you send, they send up in spam because
>> the sender can't be verified. I would encourage you to figure out what
>> is going wrong here, because a lot of your patches end up getting
>> dropped or missed because of it.
>>
> 
> Hi,
> 
> Thanks for your notice, however, I have no clue what is going on right
> now. I'll look for some help and hopefully that can be fixed.

The easiest is probably to try and send patches to a gmail account. If
you don't have one, just create one. That will help you see the issue
and verify whatever the fix might be. It might be a company email
server issue, however.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH -next RFC v2 0/8] improve tag allocation under heavy load
  2022-04-09  2:28     ` Jens Axboe
@ 2022-04-09  2:34       ` yukuai (C)
  2022-04-09  7:14       ` yukuai (C)
  2022-04-09 21:31       ` Bart Van Assche
  2 siblings, 0 replies; 28+ messages in thread
From: yukuai (C) @ 2022-04-09  2:34 UTC (permalink / raw)
  To: Jens Axboe, andriy.shevchenko, john.garry, ming.lei
  Cc: linux-block, linux-kernel, yi.zhang

在 2022/04/09 10:28, Jens Axboe 写道:
> On 4/8/22 8:26 PM, yukuai (C) wrote:
>> 在 2022/04/09 3:10, Jens Axboe 写道:
>>> For this one, and other patches you send, they send up in spam because
>>> the sender can't be verified. I would encourage you to figure out what
>>> is going wrong here, because a lot of your patches end up getting
>>> dropped or missed because of it.
>>>
>>
>> Hi,
>>
>> Thanks for your notice, however, I have no clue what is going on right
>> now. I'll look for some help and hopefully that can be fixed.
> 
> The easiest is probably to try and send patches to a gmail account. If
> you don't have one, just create one. That will help you see the issue
> and verify whatever the fix might be. It might be a company email
> server issue, hower.
> 
Thanks very much for your advice, I'll try that asap.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH -next RFC v2 4/8] blk-mq: don't preempt tag under heavy load
  2022-04-08 14:24   ` Bart Van Assche
@ 2022-04-09  2:38     ` yukuai (C)
  0 siblings, 0 replies; 28+ messages in thread
From: yukuai (C) @ 2022-04-09  2:38 UTC (permalink / raw)
  To: Bart Van Assche, axboe, andriy.shevchenko, john.garry, ming.lei
  Cc: linux-block, linux-kernel, yi.zhang

在 2022/04/08 22:24, Bart Van Assche 写道:
> On 4/8/22 00:39, Yu Kuai wrote:
>> The idle way to disable tag preemption is to track how many tags are
> 
> idle -> ideal?
> 
>> available, and wait directly in blk_mq_get_tag() if free tags are
>> very little. However, this is out of reality because fast path is
>> affected.
>>
>> As 'ws_active' is only updated in slow path, this patch disable tag
>> preemption if 'ws_active' is greater than 8, which means there are many
>> threads waiting for tags already.
>>
>> Once tag preemption is disabled, there is a situation that can cause
>> performance degration(or io hung in extreme scenarios): the waitqueue
> 
> degration -> degradation?
> 
>> diff --git a/block/blk-mq.h b/block/blk-mq.h
>> index 2615bd58bad3..b49b20e11350 100644
>> --- a/block/blk-mq.h
>> +++ b/block/blk-mq.h
>> @@ -156,6 +156,7 @@ struct blk_mq_alloc_data {
>>       /* allocate multiple requests/tags in one go */
>>       unsigned int nr_tags;
>> +    bool preemption;
>>       struct request **cached_rq;
> 
> Please change "preemption" into "preempt".

Thanks for your advice, will change that and previous spelling mistakes.

Kuai
> 
> Thanks,
> 
> Bart.
> 
> .
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH -next RFC v2 8/8] sbitmap: wake up the number of threads based on required tags
  2022-04-09  2:17     ` yukuai (C)
@ 2022-04-09  4:16       ` Bart Van Assche
  2022-04-09  7:01         ` yukuai (C)
  0 siblings, 1 reply; 28+ messages in thread
From: Bart Van Assche @ 2022-04-09  4:16 UTC (permalink / raw)
  To: yukuai (C), axboe, andriy.shevchenko, john.garry, ming.lei
  Cc: linux-block, linux-kernel, yi.zhang

On 4/8/22 19:17, yukuai (C) wrote:
> I think the reason to wake up 'wake_batch' waiters is to make sure
> wakers will use up 'wake_batch' tags that is just freed, because each
> wakers should aquire at least one tag. Thus I think if we can make sure
> wakers will use up 'wake_batch' tags, it's ok to wake up less waiters.

Hmm ... I think it's up to you to (a) explain this behavior change in 
detail in the commit message and (b) to prove that this behavior change 
won't cause trouble (I guess this change will cause trouble).

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH -next RFC v2 8/8] sbitmap: wake up the number of threads based on required tags
  2022-04-09  4:16       ` Bart Van Assche
@ 2022-04-09  7:01         ` yukuai (C)
  2022-04-12  3:20           ` Bart Van Assche
  0 siblings, 1 reply; 28+ messages in thread
From: yukuai (C) @ 2022-04-09  7:01 UTC (permalink / raw)
  To: Bart Van Assche, axboe, andriy.shevchenko, john.garry, ming.lei
  Cc: linux-block, linux-kernel, yi.zhang

在 2022/04/09 12:16, Bart Van Assche 写道:
> On 4/8/22 19:17, yukuai (C) wrote:
>> I think the reason to wake up 'wake_batch' waiters is to make sure
>> wakers will use up 'wake_batch' tags that is just freed, because each
>> wakers should aquire at least one tag. Thus I think if we can make sure
>> wakers will use up 'wake_batch' tags, it's ok to wake up less waiters.
> 
> Hmm ... I think it's up to you to (a) explain this behavior change in 
> detail in the commit message and (b) to prove that this behavior change 
> won't cause trouble (I guess this change will cause trouble).

Hi, Bart

Sorry that the commit message doesn't explain this clearly.

There are only two situations in which fewer than 'wake_batch' waiters
will be woken after this patch:

(a) some wakers will acquire multiple tags; as I mentioned above, this
is ok because the wakers will use up 'wake_batch' tags.

(b) the total number of waiters is less than 'wake_batch'; this is
problematic if tag preemption is disabled, because io concurrency will
decline. (patch 5 should fix the problem)

For the race where new threads start waiting after get_wake_nr() and
before wake_up_nr() in situation (b), I can't figure out how this can be
problematic; however, this can be optimized by triggering additional
wake ups:

@@ -623,15 +623,17 @@ static unsigned int get_wake_nr(struct sbq_wait_state *ws, unsigned int nr_tags)
         spin_lock_irq(&ws->wait.lock);
         list_for_each_entry(entry, &ws->wait.head, entry) {
                 wait = container_of(entry, struct sbq_wait, wait);
-               if (nr_tags <= wait->nr_tags)
+               if (nr_tags <= wait->nr_tags) {
+                       nr_tags = 0;
                         break;
+               }

                 nr++;
                 nr_tags -= wait->nr_tags;
         }
         spin_unlock_irq(&ws->wait.lock);

-       return nr;
+       return nr + nr_tags;
  }

What do you think?

Thanks,
Kuai

> 
> Thanks,
> 
> Bart.
> .
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH -next RFC v2 0/8] improve tag allocation under heavy load
  2022-04-09  2:28     ` Jens Axboe
  2022-04-09  2:34       ` yukuai (C)
@ 2022-04-09  7:14       ` yukuai (C)
  2022-04-09 21:31       ` Bart Van Assche
  2 siblings, 0 replies; 28+ messages in thread
From: yukuai (C) @ 2022-04-09  7:14 UTC (permalink / raw)
  To: Jens Axboe, andriy.shevchenko, john.garry, ming.lei
  Cc: linux-block, linux-kernel, yi.zhang

在 2022/04/09 10:28, Jens Axboe 写道:
> On 4/8/22 8:26 PM, yukuai (C) wrote:
>> 在 2022/04/09 3:10, Jens Axboe 写道:
>>> For this one, and other patches you send, they send up in spam because
>>> the sender can't be verified. I would encourage you to figure out what
>>> is going wrong here, because a lot of your patches end up getting
>>> dropped or missed because of it.
>>>
>>
>> Hi,
>>
>> Thanks for your notice, however, I have no clue what is going on right
>> now. I'll look for some help and hopefully that can be fixed.
> 
> The easiest is probably to try and send patches to a gmail account. If
> you don't have one, just create one. That will help you see the issue
> and verify whatever the fix might be. It might be a company email
> server issue, hower.
> 

I tried to send a patch to a gmail account; however, no issues were
found. I am contacting our IT support and hope they can figure out what
is going on.

By the way, I didn't see anything unusual for my patches in linux-block
patchwork:

https://patchwork.kernel.org/project/linux-block/list/?series=&submitter=187999&state=&q=&archive=both&delegate=

Is there a separate place to track patches?

Thanks,
Kuai

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH -next RFC v2 0/8] improve tag allocation under heavy load
  2022-04-09  2:28     ` Jens Axboe
  2022-04-09  2:34       ` yukuai (C)
  2022-04-09  7:14       ` yukuai (C)
@ 2022-04-09 21:31       ` Bart Van Assche
  2 siblings, 0 replies; 28+ messages in thread
From: Bart Van Assche @ 2022-04-09 21:31 UTC (permalink / raw)
  To: Jens Axboe, yukuai (C), andriy.shevchenko, john.garry, ming.lei
  Cc: linux-block, linux-kernel, yi.zhang

On 4/8/22 19:28, Jens Axboe wrote:
> The easiest is probably to try and send patches to a gmail account. If
> you don't have one, just create one. That will help you see the issue
> and verify whatever the fix might be. It might be a company email
> server issue, hower.

Hi Jens and Yu,

I think it's a company email issue. Many servers that receive email rely 
on the SPF, DKIM & DMARC standards to determine whether or not to 
classify email as spam. I had to add the following rule to my inbox to
receive Huawei.com emails:

Matches: from:(huawei.com)
Do this: Never send it to Spam

Thanks,

Bart.



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH -next RFC v2 8/8] sbitmap: wake up the number of threads based on required tags
  2022-04-09  7:01         ` yukuai (C)
@ 2022-04-12  3:20           ` Bart Van Assche
  0 siblings, 0 replies; 28+ messages in thread
From: Bart Van Assche @ 2022-04-12  3:20 UTC (permalink / raw)
  To: yukuai (C), axboe, andriy.shevchenko, john.garry, ming.lei
  Cc: linux-block, linux-kernel, yi.zhang

On 4/9/22 00:01, yukuai (C) wrote:
> For the race that new threads are waited after get_wake_nr() and before
> wake_up_nr() in situation (b), I can't figure out how this can be
> problematic [ ... ]

If the atomic_dec_return() statement in __sbq_wake_up() returns a value 
that is less than or equal to zero, wake_batch waiters should be woken 
up or I/O will hang until sbitmap_queue_wake_all() is called. That last 
function should not be called unless there is no alternative.

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH -next RFC v2 3/8] sbitmap: make sure waitqueues are balanced
  2022-04-08  7:39 ` [PATCH -next RFC v2 3/8] sbitmap: make sure waitqueues are balanced Yu Kuai
@ 2022-04-15  6:31   ` Li, Ming
  2022-04-15  7:07     ` yukuai (C)
  0 siblings, 1 reply; 28+ messages in thread
From: Li, Ming @ 2022-04-15  6:31 UTC (permalink / raw)
  To: Yu Kuai, axboe, andriy.shevchenko, john.garry, ming.lei
  Cc: linux-block, linux-kernel, yi.zhang



On 4/8/2022 3:39 PM, Yu Kuai wrote:
> Currently, same waitqueue might be woken up continuously:
> 
> __sbq_wake_up		__sbq_wake_up
>  sbq_wake_ptr -> assume	0
> 			 sbq_wake_ptr -> 0
>  atomic_dec_return
> 			atomic_dec_return
>  atomic_cmpxchg -> succeed
> 			 atomic_cmpxchg -> failed
> 			  return true
> 
> 			__sbq_wake_up
> 			 sbq_wake_ptr
> 			  atomic_read(&sbq->wake_index) -> still 0
>  sbq_index_atomic_inc -> inc to 1
> 			  if (waitqueue_active(&ws->wait))
> 			   if (wake_index != atomic_read(&sbq->wake_index))
> 			    atomic_set -> reset from 1 to 0
>  wake_up_nr -> wake up first waitqueue
> 			    // continue to wake up in first waitqueue
> 
> What's worse, io hung is possible in theory because wake up might be
> missed. For example, 2 * wake_batch tags are put, while only wake_batch
> threads are worken:
> 
> __sbq_wake_up
>  atomic_cmpxchg -> reset wait_cnt
> 			__sbq_wake_up -> decrease wait_cnt
> 			...
> 			__sbq_wake_up -> wait_cnt is decreased to 0 again
> 			 atomic_cmpxchg
> 			 sbq_index_atomic_inc -> increase wake_index
> 			 wake_up_nr -> wake up and waitqueue might be empty
>  sbq_index_atomic_inc -> increase again, one waitqueue is skipped
>  wake_up_nr -> invalid wake up because old wakequeue might be empty
> 
> To fix the problem, refactor to make sure waitqueues will be woken up
> one by one, and also choose the next waitqueue by the number of threads
> that are waiting to keep waitqueues balanced.
Hi, do you think that updating wake_index before atomic_cmpxchg(ws->wait_cnt) can also solve these two problems?
Like this:
__sbq_wake_up()
{
	....
	if (wait_cnt <= 0) {
		ret = atomic_cmpxchg(&sbq->wake_index, old_wake_index, next_wake_index);
		if (ret == old_wake_index) {
			ret = atomic_cmpxchg(&ws->wait_cnt, wait_cnt, wake_batch);
			if (ret == wait_cnt)
				wake_up_nr(&ws->wait, wake_batch);
		}
	}
}

Your solution picks the waitqueue with the largest waiters_cnt as the next one to be woken up; I think such a waitqueue could starve.
If lots of threads in the same waitqueue stop waiting before sbq wakes them up, its waiters_cnt will end up much smaller than the others', and it looks like sbq_update_wake_index() would never pick this waitqueue. What do you think? Is it possible?
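The starvation concern can be illustrated with a small userspace model (a sketch only, not the kernel code; `pick_max_waiters()` is a hypothetical helper mirroring the selection loop in the patched sbq_update_wake_index()):

```c
#include <assert.h>

#define SBQ_WAIT_QUEUES 8

/*
 * Hypothetical userspace model of the selection loop in the patched
 * sbq_update_wake_index(): pick the waitqueue with the most waiters,
 * skipping the one that was just woken.
 */
static int pick_max_waiters(const int waiters_cnt[SBQ_WAIT_QUEUES],
			    int old_wake_index)
{
	int wake_index, index = old_wake_index;
	int max_waiters = 0;

	for (wake_index = 0; wake_index < SBQ_WAIT_QUEUES; wake_index++) {
		if (wake_index == old_wake_index)
			continue;
		if (waiters_cnt[wake_index] > max_waiters) {
			max_waiters = waiters_cnt[wake_index];
			index = wake_index;
		}
	}
	return index;
}
```

With unbalanced counts like the ones in the commit message, a queue with only a couple of waiters is never selected as long as any other queue has more, which is exactly the starvation scenario asked about above.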


> 
> Test cmd: nr_requests is 64, and queue_depth is 32
> [global]
> filename=/dev/sda
> ioengine=libaio
> direct=1
> allow_mounted_write=0
> group_reporting
> 
> [test]
> rw=randwrite
> bs=4k
> numjobs=512
> iodepth=2
> 
> Before this patch, waitqueues can be extremely unbalanced, for example:
> ws_active=484
> ws={
>         {.wait_cnt=8, .waiters_cnt=117},
>         {.wait_cnt=8, .waiters_cnt=59},
>         {.wait_cnt=8, .waiters_cnt=76},
>         {.wait_cnt=8, .waiters_cnt=0},
>         {.wait_cnt=5, .waiters_cnt=24},
>         {.wait_cnt=8, .waiters_cnt=12},
>         {.wait_cnt=8, .waiters_cnt=21},
>         {.wait_cnt=8, .waiters_cnt=175},
> }
> 
> With this patch, waitqueues are always balanced, for example:
> ws_active=477
> ws={
>         {.wait_cnt=8, .waiters_cnt=59},
>         {.wait_cnt=6, .waiters_cnt=62},
>         {.wait_cnt=8, .waiters_cnt=61},
>         {.wait_cnt=8, .waiters_cnt=60},
>         {.wait_cnt=8, .waiters_cnt=63},
>         {.wait_cnt=8, .waiters_cnt=56},
>         {.wait_cnt=8, .waiters_cnt=59},
>         {.wait_cnt=8, .waiters_cnt=57},
> }
> 
> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
> ---
>  lib/sbitmap.c | 81 ++++++++++++++++++++++++++-------------------------
>  1 file changed, 42 insertions(+), 39 deletions(-)
> 
> diff --git a/lib/sbitmap.c b/lib/sbitmap.c
> index 393f2b71647a..176fba0252d7 100644
> --- a/lib/sbitmap.c
> +++ b/lib/sbitmap.c
> @@ -575,68 +575,71 @@ void sbitmap_queue_min_shallow_depth(struct sbitmap_queue *sbq,
>  }
>  EXPORT_SYMBOL_GPL(sbitmap_queue_min_shallow_depth);
>  
> -static struct sbq_wait_state *sbq_wake_ptr(struct sbitmap_queue *sbq)
> +/* always choose the 'ws' with the max waiters */
> +static void sbq_update_wake_index(struct sbitmap_queue *sbq,
> +				  int old_wake_index)
>  {
> -	int i, wake_index;
> +	int index, wake_index;
> +	int max_waiters = 0;
>  
> -	if (!atomic_read(&sbq->ws_active))
> -		return NULL;
> +	if (old_wake_index != atomic_read(&sbq->wake_index))
> +		return;
>  
> -	wake_index = atomic_read(&sbq->wake_index);
> -	for (i = 0; i < SBQ_WAIT_QUEUES; i++) {
> -		struct sbq_wait_state *ws = &sbq->ws[wake_index];
> +	for (wake_index = 0; wake_index < SBQ_WAIT_QUEUES; wake_index++) {
> +		struct sbq_wait_state *ws;
> +		int waiters;
>  
> -		if (waitqueue_active(&ws->wait)) {
> -			if (wake_index != atomic_read(&sbq->wake_index))
> -				atomic_set(&sbq->wake_index, wake_index);
> -			return ws;
> -		}
> +		if (wake_index == old_wake_index)
> +			continue;
>  
> -		wake_index = sbq_index_inc(wake_index);
> +		ws = &sbq->ws[wake_index];
> +		waiters = atomic_read(&ws->waiters_cnt);
> +		if (waiters > max_waiters) {
> +			max_waiters = waiters;
> +			index = wake_index;
> +		}
>  	}
>  
> -	return NULL;
> +	if (max_waiters)
> +		atomic_cmpxchg(&sbq->wake_index, old_wake_index, index);
>  }
>  
>  static bool __sbq_wake_up(struct sbitmap_queue *sbq)
>  {
>  	struct sbq_wait_state *ws;
>  	unsigned int wake_batch;
> -	int wait_cnt;
> +	int wait_cnt, wake_index;
>  
> -	ws = sbq_wake_ptr(sbq);
> -	if (!ws)
> +	if (!atomic_read(&sbq->ws_active))
>  		return false;
>  
> +	wake_index = atomic_read(&sbq->wake_index);
> +	ws = &sbq->ws[wake_index];
>  	wait_cnt = atomic_dec_return(&ws->wait_cnt);
> -	if (wait_cnt <= 0) {
> -		int ret;
> -
> -		wake_batch = READ_ONCE(sbq->wake_batch);
> -
> -		/*
> -		 * Pairs with the memory barrier in sbitmap_queue_resize() to
> -		 * ensure that we see the batch size update before the wait
> -		 * count is reset.
> -		 */
> -		smp_mb__before_atomic();
> -
> +	if (wait_cnt > 0) {
> +		return false;
> +	} else if (wait_cnt < 0) {
>  		/*
> -		 * For concurrent callers of this, the one that failed the
> -		 * atomic_cmpxhcg() race should call this function again
> +		 * Concurrent callers should call this function again
>  		 * to wakeup a new batch on a different 'ws'.
>  		 */
> -		ret = atomic_cmpxchg(&ws->wait_cnt, wait_cnt, wake_batch);
> -		if (ret == wait_cnt) {
> -			sbq_index_atomic_inc(&sbq->wake_index);
> -			wake_up_nr(&ws->wait, wake_batch);
> -			return false;
> -		}
> -
> +		sbq_update_wake_index(sbq, wake_index);
>  		return true;
>  	}
>  
> -	return false;
> +	sbq_update_wake_index(sbq, wake_index);
> +	wake_batch = READ_ONCE(sbq->wake_batch);
> +
> +	/*
> +	 * Pairs with the memory barrier in sbitmap_queue_resize() to
> +	 * ensure that we see the batch size update before the wait
> +	 * count is reset.
> +	 */
> +	smp_mb__before_atomic();
> +	atomic_set(&ws->wait_cnt, wake_batch);
> +	wake_up_nr(&ws->wait, wake_batch);
> +
> +	return true;
>  }
>  
>  void sbitmap_queue_wake_up(struct sbitmap_queue *sbq)
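As a sanity check, the patched flow above can be modeled sequentially in userspace (a sketch under simplifying assumptions: no concurrency, plain ints instead of atomics; `struct model_ws` and `model_wake_up()` are hypothetical names, not kernel API):

```c
#include <assert.h>

#define SBQ_WAIT_QUEUES 8

struct model_ws {
	int wait_cnt;
	int waiters_cnt;
};

/*
 * Sequential model of the patched __sbq_wake_up(): decrement wait_cnt,
 * and once a full batch has completed, move wake_index to the waitqueue
 * with the most waiters *before* resetting wait_cnt.
 * Returns the number of threads that would be woken.
 */
static int model_wake_up(struct model_ws ws[SBQ_WAIT_QUEUES],
			 int *wake_index, int wake_batch)
{
	struct model_ws *cur = &ws[*wake_index];
	int i, max_waiters = 0, next = *wake_index;

	if (--cur->wait_cnt > 0)
		return 0;	/* batch not complete yet */

	/* choose the next waitqueue before resetting wait_cnt */
	for (i = 0; i < SBQ_WAIT_QUEUES; i++) {
		if (i == *wake_index)
			continue;
		if (ws[i].waiters_cnt > max_waiters) {
			max_waiters = ws[i].waiters_cnt;
			next = i;
		}
	}
	if (max_waiters)
		*wake_index = next;

	cur->wait_cnt = wake_batch;
	return wake_batch;
}
```

In this model, wake_batch tag completions produce exactly one batch of wakeups, wait_cnt is reset, and wake_index advances to the busiest waitqueue, which is the behavior the patch aims for.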

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH -next RFC v2 3/8] sbitmap: make sure waitqueues are balanced
  2022-04-15  6:31   ` Li, Ming
@ 2022-04-15  7:07     ` yukuai (C)
  0 siblings, 0 replies; 28+ messages in thread
From: yukuai (C) @ 2022-04-15  7:07 UTC (permalink / raw)
  To: Li, Ming, axboe, andriy.shevchenko, john.garry, ming.lei
  Cc: linux-block, linux-kernel, yi.zhang

On 2022/04/15 14:31, Li, Ming wrote:
> 
> 
> On 4/8/2022 3:39 PM, Yu Kuai wrote:
>> Currently, the same waitqueue might be woken up continuously:
>>
>> __sbq_wake_up		__sbq_wake_up
>>   sbq_wake_ptr -> assume	0
>> 			 sbq_wake_ptr -> 0
>>   atomic_dec_return
>> 			atomic_dec_return
>>   atomic_cmpxchg -> succeed
>> 			 atomic_cmpxchg -> failed
>> 			  return true
>>
>> 			__sbq_wake_up
>> 			 sbq_wake_ptr
>> 			  atomic_read(&sbq->wake_index) -> still 0
>>   sbq_index_atomic_inc -> inc to 1
>> 			  if (waitqueue_active(&ws->wait))
>> 			   if (wake_index != atomic_read(&sbq->wake_index))
>> 			    atomic_set -> reset from 1 to 0
>>   wake_up_nr -> wake up first waitqueue
>> 			    // continue to wake up in first waitqueue
>>
>> What's worse, io hang is possible in theory because wakeups might be
>> missed. For example, 2 * wake_batch tags are put, while only wake_batch
>> threads are woken:
>>
>> __sbq_wake_up
>>   atomic_cmpxchg -> reset wait_cnt
>> 			__sbq_wake_up -> decrease wait_cnt
>> 			...
>> 			__sbq_wake_up -> wait_cnt is decreased to 0 again
>> 			 atomic_cmpxchg
>> 			 sbq_index_atomic_inc -> increase wake_index
>> 			 wake_up_nr -> wake up and waitqueue might be empty
>>   sbq_index_atomic_inc -> increase again, one waitqueue is skipped
>>   wake_up_nr -> invalid wake up because the old waitqueue might be empty
>>
>> To fix the problem, refactor to make sure waitqueues will be woken up
>> one by one, and also choose the next waitqueue by the number of threads
>> that are waiting to keep waitqueues balanced.
> Hi, do you think that updating wake_index before the atomic_cmpxchg() on ws->wait_cnt could also solve these two problems?
> like this:
Hi,

The first problem is due to sbq_wake_ptr() using atomic_set() to
update 'wake_index'.

The second problem is due to __sbq_wake_up() updating 'wait_cnt'
before 'wake_index'.

> __sbq_wake_up()
> {
> 	....
> 	if (wait_cnt <= 0) {
> 		ret = atomic_cmpxchg(&sbq->wake_index, old_wake_index, next_wake_index);
How is 'next_wake_index' chosen? And the same question applies to sbq_wake_ptr().
> 		if (ret == old_wake_index) {
> 			ret = atomic_cmpxchg(&ws->wait_cnt, wait_cnt, wake_batch);

If this failed, just return true with 'wake_index' updated? Then the
caller will call this again, so it seems this can't prevent 'wake_index'
from being updated multiple times, and 'wait_cnt' in the old 'ws' is never updated.
> 			if (ret == wait_cnt)
> 				wake_up_nr(&ws->wait, wake_batch);
> 		}
> 	}
> }
> 
> Your solution picks the waitqueue with the largest waiters_cnt as the next one to be woken up; I think such a waitqueue could starve.
> If lots of threads in the same waitqueue stop waiting before sbq wakes them up, its waiters_cnt will end up much smaller than the others', and it looks like sbq_update_wake_index() would never pick this waitqueue. What do you think? Is it possible?

That would be possible if adding threads to waitqueues were not balanced, and
I suppose it can't happen after tag preemption is disabled. However,
instead of choosing the waitqueue with the largest waiters_cnt, choosing the
next waitqueue with 'waiters_cnt > 0' might be an alternative.
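That round-robin alternative could look roughly like the following userspace sketch (`pick_next_active()` is a hypothetical helper, not proposed kernel code): starting after the current index, take the first waitqueue that has waiters, wrapping around.

```c
#include <assert.h>

#define SBQ_WAIT_QUEUES 8

/*
 * Hypothetical round-robin selection: scan forward from the current
 * wake_index and return the first waitqueue with waiters_cnt > 0.
 * Falls back to the old index if every waitqueue is empty.
 */
static int pick_next_active(const int waiters_cnt[SBQ_WAIT_QUEUES],
			    int old_wake_index)
{
	int i, wake_index = old_wake_index;

	for (i = 0; i < SBQ_WAIT_QUEUES; i++) {
		wake_index = (wake_index + 1) % SBQ_WAIT_QUEUES;
		if (waiters_cnt[wake_index] > 0)
			return wake_index;
	}
	return old_wake_index;	/* no active waitqueue found */
}
```

With this policy a waitqueue with only a few waiters still gets its turn, at the cost of no longer preferring the most loaded waitqueue.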

Thanks,
Kuai


^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2022-04-15  7:07 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-04-08  7:39 [PATCH -next RFC v2 0/8] improve tag allocation under heavy load Yu Kuai
2022-04-08  7:39 ` [PATCH -next RFC v2 1/8] sbitmap: record the number of waiters for each waitqueue Yu Kuai
2022-04-08  7:39 ` [PATCH -next RFC v2 2/8] blk-mq: call 'bt_wait_ptr()' later in blk_mq_get_tag() Yu Kuai
2022-04-08 14:20   ` Bart Van Assche
2022-04-09  2:09     ` yukuai (C)
2022-04-08  7:39 ` [PATCH -next RFC v2 3/8] sbitmap: make sure waitqueues are balanced Yu Kuai
2022-04-15  6:31   ` Li, Ming
2022-04-15  7:07     ` yukuai (C)
2022-04-08  7:39 ` [PATCH -next RFC v2 4/8] blk-mq: don't preempt tag under heavy load Yu Kuai
2022-04-08 14:24   ` Bart Van Assche
2022-04-09  2:38     ` yukuai (C)
2022-04-08  7:39 ` [PATCH -next RFC v2 5/8] sbitmap: force tag preemption if free tags are sufficient Yu Kuai
2022-04-08  7:39 ` [PATCH -next RFC v2 6/8] blk-mq: force tag preemption for split bios Yu Kuai
2022-04-08  7:39 ` [PATCH -next RFC v2 7/8] blk-mq: record how many tags are needed for splited bio Yu Kuai
2022-04-08  7:39 ` [PATCH -next RFC v2 8/8] sbitmap: wake up the number of threads based on required tags Yu Kuai
2022-04-08 14:31   ` Bart Van Assche
2022-04-09  2:19     ` yukuai (C)
2022-04-08 21:13   ` Bart Van Assche
2022-04-09  2:17     ` yukuai (C)
2022-04-09  4:16       ` Bart Van Assche
2022-04-09  7:01         ` yukuai (C)
2022-04-12  3:20           ` Bart Van Assche
2022-04-08 19:10 ` [PATCH -next RFC v2 0/8] improve tag allocation under heavy load Jens Axboe
2022-04-09  2:26   ` yukuai (C)
2022-04-09  2:28     ` Jens Axboe
2022-04-09  2:34       ` yukuai (C)
2022-04-09  7:14       ` yukuai (C)
2022-04-09 21:31       ` Bart Van Assche

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).