linux-kernel.vger.kernel.org archive mirror
* [PATCH -next RFC v2 0/8] improve tag allocation under heavy load
@ 2022-04-08  7:39 Yu Kuai
From: Yu Kuai @ 2022-04-08  7:39 UTC
  To: axboe, yukuai3, andriy.shevchenko, john.garry, ming.lei
  Cc: linux-block, linux-kernel, yi.zhang

Changes in v2:
 - use a new title
 - add patches to fix waitqueues' unfairness - patch 1-3
 - delete patch to add queue flag
 - delete patch to split big io thoroughly

blk-mq has a defect compared to blk-sq: split io ends up discontinuous
if the device is under high io pressure, while split io stays continuous
in sq. This is because:

1) new io can preempt tags even if there are lots of threads waiting.
2) split bios are issued one by one; if one bio can't get a tag, it will
go to wait.
3) each time 8 (or wake batch) requests are done, 8 waiters will be woken
up. Thus a thread that is woken up is unlikely to get multiple tags.
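
To make points 2 and 3 concrete, here is a minimal sketch (not kernel
code) of how many bios, and hence tags, a single io needs once it is
split. It assumes each split bio is capped at 'max_sectors_kb' and takes
exactly one tag; the helper name split_bio_count() is hypothetical.

```c
#include <assert.h>

/*
 * Hypothetical helper, not a kernel function: number of bios (and hence
 * tags) one io of io_kb KiB needs when each bio is capped at
 * max_sectors_kb KiB. Assumes one tag per split bio.
 */
static unsigned int split_bio_count(unsigned int io_kb,
				    unsigned int max_sectors_kb)
{
	assert(max_sectors_kb > 0);
	/* Round up: a partial last chunk still needs its own bio. */
	return (io_kb + max_sectors_kb - 1) / max_sectors_kb;
}
```

Under these assumptions a 1m io needs split_bio_count(1024, 256) = 4 tags
with 'max_sectors_kb' 256 and split_bio_count(1024, 128) = 8 tags with
128, so a thread that gets only one tag per wake-up has to wait several
times before its io is fully issued.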

The problem was first found by upgrading the kernel from v3.10 to v4.18.
The test device is an HDD with 'max_sectors_kb' set to 256, and the test
case issues 1m ios with high concurrency.

Note that there is a precondition for this performance problem:
there is a certain gap between the bandwidth for single io with
bs=max_sectors_kb and the disk upper limit.

During the test, I found that waitqueues can be extremely unbalanced
under heavy load. This is because 'wake_index' is not set properly in
__sbq_wake_up(), see details in patch 3.
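
As an illustration of what "balanced" means here, the toy model below
picks waitqueues round-robin, starting from a wake index. It is only a
sketch under stated assumptions and is not the code from patch 3: the
struct toy_sbq, its fields and toy_wake_up() are hypothetical, and the
per-queue waiter counter only loosely mirrors what patch 1 records.

```c
#include <assert.h>

#define NR_WAIT_QUEUES 8	/* toy constant, same value as SBQ_WAIT_QUEUES */

struct toy_sbq {
	unsigned int waiters[NR_WAIT_QUEUES];	/* waiter count per waitqueue */
	unsigned int wake_index;		/* where the next search starts */
};

/*
 * Wake up to 'batch' waiters from the first non-empty queue at or after
 * wake_index, then move wake_index to the following queue so the next
 * wake-up starts elsewhere. Returns the queue index used, or -1 if no
 * queue has waiters.
 */
static int toy_wake_up(struct toy_sbq *sbq, unsigned int batch)
{
	assert(sbq->wake_index < NR_WAIT_QUEUES);
	for (unsigned int i = 0; i < NR_WAIT_QUEUES; i++) {
		unsigned int idx = (sbq->wake_index + i) % NR_WAIT_QUEUES;
		unsigned int *cnt = &sbq->waiters[idx];

		if (*cnt == 0)
			continue;	/* nobody to wake here, try the next queue */
		*cnt -= (*cnt < batch) ? *cnt : batch;
		sbq->wake_index = (idx + 1) % NR_WAIT_QUEUES;
		return (int)idx;
	}
	return -1;
}
```

In this model two queues with 16 waiters each are drained alternately,
8 waiters at a time. If the wake index were left pointing at the queue
that was just served, the same queue would be drained first while the
other one keeps growing.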

In this patchset:
 - patch 1-3 fix waitqueues' unfairness.
 - patch 4,5 disable tag preemption under heavy load.
 - patch 6 forces tag preemption for split bios.
 - patch 7,8 improve large random io for HDD. As I mentioned above, we
 do meet the problem and I'm trying to fix it at very low cost. However,
 if anyone still thinks this is not a common case and not worth
 optimizing, I'll drop them.

Test environment:
arm64, 96 cores with 200 BogoMIPS, test device is an HDD. The default
'max_sectors_kb' is 1280 (sorry that I was unable to test on the machine
where 'max_sectors_kb' is 256).

The single io performance (randwrite):

| bs       | 128k | 256k | 512k | 1m   | 1280k | 2m   | 4m   |
| -------- | ---- | ---- | ---- | ---- | ----- | ---- | ---- |
| bw MiB/s | 20.1 | 33.4 | 51.8 | 67.1 | 74.7  | 82.9 | 82.9 |

It can be seen that 1280k io is already close to the upper limit, and
it'll be hard to see differences with the default value, thus I set
'max_sectors_kb' to 128 in the following test.

Test cmd:
        fio \
        -filename=/dev/$dev \
        -name=test \
        -ioengine=psync \
        -allow_mounted_write=0 \
        -group_reporting \
        -direct=1 \
        -offset_increment=1g \
        -rw=randwrite \
        -bs=1024k \
        -numjobs={1,2,4,8,16,32,64,128,256,512} \
        -runtime=110 \
        -ramp_time=10

Test result (MiB/s):

| numjobs | v5.18-rc1 | v5.18-rc1-patched |
| ------- | --------- | ----------------- |
| 1       | 67.7      | 67.7              |
| 2       | 67.7      | 67.7              |
| 4       | 67.7      | 67.7              |
| 8       | 67.7      | 67.7              |
| 16      | 64.8      | 65.2              |
| 32      | 59.8      | 62.8              |
| 64      | 54.9      | 58.6              |
| 128     | 49        | 55.8              |
| 256     | 37.7      | 52.3              |
| 512     | 31.8      | 51.4              |
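
To read the table as relative gains, a small sketch that computes the
change of the patched kernel over the baseline in percent. The helper
name gain_percent() is hypothetical; the inputs are the MiB/s values
from the table above.

```c
#include <assert.h>

/* Relative bandwidth change of 'patched' over 'base', in percent. */
static double gain_percent(double base, double patched)
{
	assert(base > 0.0);
	return (patched - base) / base * 100.0;
}
```

For numjobs=512, gain_percent(31.8, 51.4) is roughly 61.6, and for
numjobs=256, gain_percent(37.7, 52.3) is roughly 38.7; for numjobs up to
8 the result is 0 since both columns are 67.7.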

Yu Kuai (8):
  sbitmap: record the number of waiters for each waitqueue
  blk-mq: call 'bt_wait_ptr()' later in blk_mq_get_tag()
  sbitmap: make sure waitqueues are balanced
  blk-mq: don't preempt tag on heavy load
  sbitmap: force tag preemption if free tags are sufficient
  blk-mq: force tag preemption for split bios
  blk-mq: record how many tags are needed for splited bio
  sbitmap: wake up the number of threads based on required tags

 block/blk-merge.c         |   9 ++-
 block/blk-mq-tag.c        |  42 +++++++++-----
 block/blk-mq.c            |  25 +++++++-
 block/blk-mq.h            |   2 +
 include/linux/blk_types.h |   4 ++
 include/linux/sbitmap.h   |   9 +++
 lib/sbitmap.c             | 117 +++++++++++++++++++++++++-------------
 7 files changed, 150 insertions(+), 58 deletions(-)

-- 
2.31.1


Thread overview: 28+ messages
2022-04-08  7:39 [PATCH -next RFC v2 0/8] improve tag allocation under heavy load Yu Kuai
2022-04-08  7:39 ` [PATCH -next RFC v2 1/8] sbitmap: record the number of waiters for each waitqueue Yu Kuai
2022-04-08  7:39 ` [PATCH -next RFC v2 2/8] blk-mq: call 'bt_wait_ptr()' later in blk_mq_get_tag() Yu Kuai
2022-04-08 14:20   ` Bart Van Assche
2022-04-09  2:09     ` yukuai (C)
2022-04-08  7:39 ` [PATCH -next RFC v2 3/8] sbitmap: make sure waitqueues are balanced Yu Kuai
2022-04-15  6:31   ` Li, Ming
2022-04-15  7:07     ` yukuai (C)
2022-04-08  7:39 ` [PATCH -next RFC v2 4/8] blk-mq: don't preempt tag under heavy load Yu Kuai
2022-04-08 14:24   ` Bart Van Assche
2022-04-09  2:38     ` yukuai (C)
2022-04-08  7:39 ` [PATCH -next RFC v2 5/8] sbitmap: force tag preemption if free tags are sufficient Yu Kuai
2022-04-08  7:39 ` [PATCH -next RFC v2 6/8] blk-mq: force tag preemption for split bios Yu Kuai
2022-04-08  7:39 ` [PATCH -next RFC v2 7/8] blk-mq: record how many tags are needed for splited bio Yu Kuai
2022-04-08  7:39 ` [PATCH -next RFC v2 8/8] sbitmap: wake up the number of threads based on required tags Yu Kuai
2022-04-08 14:31   ` Bart Van Assche
2022-04-09  2:19     ` yukuai (C)
2022-04-08 21:13   ` Bart Van Assche
2022-04-09  2:17     ` yukuai (C)
2022-04-09  4:16       ` Bart Van Assche
2022-04-09  7:01         ` yukuai (C)
2022-04-12  3:20           ` Bart Van Assche
2022-04-08 19:10 ` [PATCH -next RFC v2 0/8] improve tag allocation under heavy load Jens Axboe
2022-04-09  2:26   ` yukuai (C)
2022-04-09  2:28     ` Jens Axboe
2022-04-09  2:34       ` yukuai (C)
2022-04-09  7:14       ` yukuai (C)
2022-04-09 21:31       ` Bart Van Assche
