From: "yukuai (C)" <yukuai3@huawei.com>
To: Damien Le Moal <damien.lemoal@opensource.wdc.com>,
	<axboe@kernel.dk>, <bvanassche@acm.org>,
	<andriy.shevchenko@linux.intel.com>, <john.garry@huawei.com>,
	<ming.lei@redhat.com>, <qiulaibin@huawei.com>
Cc: <linux-block@vger.kernel.org>, <linux-kernel@vger.kernel.org>,
	<yi.zhang@huawei.com>
Subject: Re: [PATCH -next RFC v3 0/8] improve tag allocation under heavy load
Date: Mon, 25 Apr 2022 14:14:23 +0800
Message-ID: <63e84f2a-2487-a0c3-cab2-7d2011bc2db4@huawei.com>
In-Reply-To: <3fbadd9f-11dd-9043-11cf-f0839dcf30e1@opensource.wdc.com>

On 2022/04/25 11:24, Damien Le Moal wrote:
> On 4/24/22 11:43, yukuai (C) wrote:
>> friendly ping ...
>>
>> On 2022/04/15 18:10, Yu Kuai wrote:
>>> Changes in v3:
>>>    - update 'waiters_cnt' before 'ws_active' in sbitmap_prepare_to_wait()
>>>    in patch 1, in case __sbq_wake_up() sees 'ws_active > 0' while all
>>>    'waiters_cnt' are 0, which would cause a dead loop.
>>>    - don't add 'wait_index' during each loop in patch 2
>>>    - fix that 'wake_index' might mismatch in the first wake up in patch 3,
>>>    and also clean up the code in that patch.
>>>    - add a check in patch 4 in case an io hang is triggered in corner
>>>    cases.
>>>    - make the check for whether free tags are sufficient more flexible.
>>>    - fix a race in patch 8.
>>>    - fix some wording and add some comments.
>>>
>>> Changes in v2:
>>>    - use a new title
>>>    - add patches to fix waitqueues' unfairness - patches 1-3
>>>    - delete patch to add queue flag
>>>    - delete patch to split big io thoroughly
>>>
>>> In this patchset:
>>>    - patches 1-3 fix waitqueues' unfairness.
>>>    - patches 4,5 disable tag preemption under heavy load.
>>>    - patch 6 forces tag preemption for split bios.
>>>    - patches 7,8 improve large random io for HDD. We did hit this problem
>>>    and I'm trying to fix it at very low cost. However, if anyone still
>>>    thinks this is not a common case and not worth optimizing, I'll drop
>>>    them.
>>>
>>> blk-mq has a defect compared to blk-sq: split io ends up
>>> discontinuous if the device is under high io pressure, while
>>> split io stays continuous in sq. This is because:
>>>
>>> 1) new io can preempt a tag even if there are lots of threads waiting.
>>> 2) split bios are issued one by one; if one bio can't get a tag, it
>>> goes to wait.
>>> 3) each time 8 (or wake batch) requests are done, 8 waiters are woken up.
>>> Thus a thread that is woken up is unlikely to get multiple tags.
>>>
>>> The problem was first found when upgrading the kernel from v3.10 to
>>> v4.18. The test device is an HDD with 'max_sectors_kb' set to 256, and
>>> the test case issues 1m ios with high concurrency.
>>>
>>> Note that there is a precondition for this performance problem:
>>> there must be a certain gap between the bandwidth of a single io with
>>> bs=max_sectors_kb and the disk's upper limit.
>>>
>>> During the test, I found that waitqueues can be extremely unbalanced
>>> under heavy load. This is because 'wake_index' is not set properly in
>>> __sbq_wake_up(), see details in patch 3.
>>>
>>> Test environment:
>>> arm64, 96 core with 200 BogoMIPS, test device is HDD. The default
>>> 'max_sectors_kb' is 1280 (sorry that I was unable to test on the machine
>>> where 'max_sectors_kb' is 256).
>>>
>>> The single io performance(randwrite):
>>>
>>> | bs       | 128k | 256k | 512k | 1m   | 1280k | 2m   | 4m   |
>>> | -------- | ---- | ---- | ---- | ---- | ----- | ---- | ---- |
>>> | bw MiB/s | 20.1 | 33.4 | 51.8 | 67.1 | 74.7  | 82.9 | 82.9 |
> 
> These results are extremely strange, unless you are running with the
> device write cache disabled ? If you have the device write cache enabled,
> the problem you mention above would be most likely completely invisible,
> which I guess is why nobody really noticed any issue until now.
> 
> Similarly, with reads, the device side read-ahead may hide the problem,
> albeit that depends on how "intelligent" the drive is at identifying
> sequential accesses.
> 
>>>
>>> It can be seen that 1280k io is already close to upper limit, and it'll
>>> be hard to see differences with the default value, thus I set
>>> 'max_sectors_kb' to 128 in the following test.
>>>
>>> Test cmd:
>>>           fio \
>>>           -filename=/dev/$dev \
>>>           -name=test \
>>>           -ioengine=psync \
>>>           -allow_mounted_write=0 \
>>>           -group_reporting \
>>>           -direct=1 \
>>>           -offset_increment=1g \
>>>           -rw=randwrite \
>>>           -bs=1024k \
>>>           -numjobs={1,2,4,8,16,32,64,128,256,512} \
>>>           -runtime=110 \
>>>           -ramp_time=10
>>>
>>> Test result: MiB/s
>>>
>>> | numjobs | v5.18-rc1 | v5.18-rc1-patched |
>>> | ------- | --------- | ----------------- |
>>> | 1       | 67.7      | 67.7              |
>>> | 2       | 67.7      | 67.7              |
>>> | 4       | 67.7      | 67.7              |
>>> | 8       | 67.7      | 67.7              |
>>> | 16      | 64.8      | 65.6              |
>>> | 32      | 59.8      | 63.8              |
>>> | 64      | 54.9      | 59.4              |
>>> | 128     | 49        | 56.9              |
>>> | 256     | 37.7      | 58.3              |
>>> | 512     | 31.8      | 57.9              |
> 
> Device write cache disabled ?
> 
> Also, what is the max QD of this disk ?
> 
> E.g., if it is SATA, it is 32, so you will only get at most 64 scheduler
> tags. So for any of your tests with more than 64 threads, many of the
> threads will be waiting for a scheduler tag for the BIO before the
> bio_split problem you explain triggers. Given that the numbers you show
> are the same for before-after patch with a number of threads <= 64, I am
> tempted to think that the problem is not really BIO splitting...
> 
> What about random read workloads ? What kind of results do you see ?

Hi,

Sorry that this test case was misleading.

This test case is a high-concurrency, large randwrite workload. It is
only meant to show the problem that split bios are not issued
continuously, which is the root cause of the performance degradation
as numjobs increases.

queue_depth is 32, and numjobs is 64, thus when numjobs is not greater
than 8, performance is fine, because the ratio of sequential io should
be 7/8 (each 1024k io is split into eight 128k bios, and seven of them
directly follow the previous one). However, as numjobs increases,
performance gets worse because the ratio drops. For example, when
numjobs is 512, the ratio of sequential io is about 20%.
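
The arithmetic behind the 7/8 figure and the numjobs=8 threshold can be
sketched as below. This is a toy calculation, not kernel code. The 64
scheduler tags are an assumption taken from Damien's estimate quoted
above (2 x queue depth 32), and all helper names are made up for
illustration. Since ioengine=psync is synchronous, each job is assumed
to have one 1024k io in flight at a time.

```python
# Toy arithmetic only; helper names are hypothetical, not kernel APIs.
BS_KB = 1024          # fio bs=1024k
MAX_SECTORS_KB = 128  # 'max_sectors_kb' used for the test
SCHED_TAGS = 64       # assumption: 2 * queue_depth (32)


def splits_per_io(bs_kb, max_sectors_kb):
    """Number of split bios one io turns into (ceiling division)."""
    return -(-bs_kb // max_sectors_kb)


def best_case_sequential_ratio(nsplits):
    """If all splits of one io are issued back to back, every split
    except the first one directly follows its predecessor."""
    return (nsplits - 1) / nsplits


def tags_needed(numjobs, nsplits):
    """Tags needed for every job to have one whole io in flight."""
    return numjobs * nsplits


if __name__ == "__main__":
    nsplits = splits_per_io(BS_KB, MAX_SECTORS_KB)
    print("splits per io:", nsplits)
    print("best-case sequential ratio:", best_case_sequential_ratio(nsplits))
    for numjobs in (1, 2, 4, 8, 16, 32, 64, 128, 256, 512):
        need = tags_needed(numjobs, nsplits)
        state = "fits" if need <= SCHED_TAGS else "exceeds tags"
        print(f"numjobs={numjobs:3d} needs {need:4d} tags: {state}")
```

Under these assumptions, numjobs=8 is the last point where every job
can hold tags for all of its split bios at once, which matches where
the measured bandwidth starts to drop.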

Patches 6-8 let split bios still be issued continuously under high
pressure.
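
To show why the issue order matters, here is a toy model that measures
the sequential ratio of a dispatch order. It is only an illustration
under simplifying assumptions: it does not model tags, wakeups, or the
io scheduler, the job offsets are arbitrary, and all names are
hypothetical.

```python
# Toy model of dispatch order only; all names are hypothetical.
SPLIT_KB = 128            # 'max_sectors_kb' used in the test
NSPLITS = 8               # 1024k io / 128k per split bio
JOB_STRIDE_KB = 1 << 20   # arbitrary gap so jobs never touch each other


def split_bios(job):
    """Return the (start_kb, len_kb) extents of one job's split io."""
    base = job * JOB_STRIDE_KB
    return [(base + k * SPLIT_KB, SPLIT_KB) for k in range(NSPLITS)]


def contiguous_order(numjobs):
    """Each job issues all of its split bios back to back."""
    return [bio for job in range(numjobs) for bio in split_bios(job)]


def interleaved_order(numjobs):
    """Worst case: jobs take turns, one split bio at a time."""
    per_job = [split_bios(job) for job in range(numjobs)]
    return [per_job[job][k] for k in range(NSPLITS) for job in range(numjobs)]


def sequential_ratio(order):
    """Fraction of bios that start exactly where the previous one ended."""
    seq = sum(1 for prev, cur in zip(order, order[1:])
              if prev[0] + prev[1] == cur[0])
    return seq / len(order)


if __name__ == "__main__":
    print("contiguous :", sequential_ratio(contiguous_order(4)))
    print("interleaved:", sequential_ratio(interleaved_order(4)))
```

Real workloads fall somewhere between these two extremes; the model
only shows that the ratio depends on whether the split bios of one io
stay together when they are issued.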

Thanks,
Kuai

