Re: [PATCH -next RFC v3 0/8] improve tag allocation under heavy load

From: Damien Le Moal <damien.lemoal@opensource.wdc.com>
To: "yukuai (C)" <yukuai3@huawei.com>,
	axboe@kernel.dk, bvanassche@acm.org,
	andriy.shevchenko@linux.intel.com, john.garry@huawei.com,
	ming.lei@redhat.com, qiulaibin@huawei.com
Cc: linux-block@vger.kernel.org, linux-kernel@vger.kernel.org,
	yi.zhang@huawei.com
Subject: Re: [PATCH -next RFC v3 0/8] improve tag allocation under heavy load
Date: Mon, 25 Apr 2022 16:06:59 +0900	[thread overview]
Message-ID: <2e7969ea-68d0-964a-808e-ee8943de70e3@opensource.wdc.com> (raw)
In-Reply-To: <c11ebf9e-a232-4a5d-d539-f95f584f220e@huawei.com>

On 4/25/22 16:05, yukuai (C) wrote:
> 在 2022/04/25 14:50, Damien Le Moal 写道:
>> On 4/25/22 15:47, yukuai (C) wrote:
>>> 在 2022/04/25 14:23, Damien Le Moal 写道:
>>>> On 4/25/22 15:14, yukuai (C) wrote:
>>>>> 在 2022/04/25 11:24, Damien Le Moal 写道:
>>>>>> On 4/24/22 11:43, yukuai (C) wrote:
>>>>>>> friendly ping ...
>>>>>>>
>>>>>>> 在 2022/04/15 18:10, Yu Kuai 写道:
>>>>>>>> Changes in v3:
>>>>>>>>      - update 'waiters_cnt' before 'ws_active' in sbitmap_prepare_to_wait()
>>>>>>>>      in patch 1, in case __sbq_wake_up() see 'ws_active > 0' while
>>>>>>>>      'waiters_cnt' are all 0, which will cause deap loop.
>>>>>>>>      - don't add 'wait_index' during each loop in patch 2
>>>>>>>>      - fix that 'wake_index' might mismatch in the first wake up in patch 3,
>>>>>>>>      also improving coding for the patch.
>>>>>>>>      - add a detection in patch 4 in case io hung is triggered in corner
>>>>>>>>      cases.
>>>>>>>>      - make the detection, free tags are sufficient, more flexible.
>>>>>>>>      - fix a race in patch 8.
>>>>>>>>      - fix some words and add some comments.
>>>>>>>>
>>>>>>>> Changes in v2:
>>>>>>>>      - use a new title
>>>>>>>>      - add patches to fix waitqueues' unfairness - path 1-3
>>>>>>>>      - delete patch to add queue flag
>>>>>>>>      - delete patch to split big io thoroughly
>>>>>>>>
>>>>>>>> In this patchset:
>>>>>>>>      - patch 1-3 fix waitqueues' unfairness.
>>>>>>>>      - patch 4,5 disable tag preemption on heavy load.
>>>>>>>>      - patch 6 forces tag preemption for split bios.
>>>>>>>>      - patch 7,8 improve large random io for HDD. We do meet the problem and
>>>>>>>>      I'm trying to fix it at very low cost. However, if anyone still thinks
>>>>>>>>      this is not a common case and not worth to optimize, I'll drop them.
>>>>>>>>
>>>>>>>> There is a defect for blk-mq compare to blk-sq, specifically split io
>>>>>>>> will end up discontinuous if the device is under high io pressure, while
>>>>>>>> split io will still be continuous in sq, this is because:
>>>>>>>>
>>>>>>>> 1) new io can preempt tag even if there are lots of threads waiting.
>>>>>>>> 2) split bio is issued one by one, if one bio can't get tag, it will go
>>>>>>>> to wail.
>>>>>>>> 3) each time 8(or wake batch) requests is done, 8 waiters will be woken up.
>>>>>>>> Thus if a thread is woken up, it will unlikey to get multiple tags.
>>>>>>>>
>>>>>>>> The problem was first found by upgrading kernel from v3.10 to v4.18,
>>>>>>>> test device is HDD with 256 'max_sectors_kb', and test case is issuing 1m
>>>>>>>> ios with high concurrency.
>>>>>>>>
>>>>>>>> Noted that there is a precondition for such performance problem:
>>>>>>>> There is a certain gap between bandwidth for single io with
>>>>>>>> bs=max_sectors_kb and disk upper limit.
>>>>>>>>
>>>>>>>> During the test, I found that waitqueues can be extremly unbalanced on
>>>>>>>> heavy load. This is because 'wake_index' is not set properly in
>>>>>>>> __sbq_wake_up(), see details in patch 3.
>>>>>>>>
>>>>>>>> Test environment:
>>>>>>>> arm64, 96 core with 200 BogoMIPS, test device is HDD. The default
>>>>>>>> 'max_sectors_kb' is 1280(Sorry that I was unable to test on the machine
>>>>>>>> where 'max_sectors_kb' is 256).>>
>>>>>>>> The single io performance(randwrite):
>>>>>>>>
>>>>>>>> | bs       | 128k | 256k | 512k | 1m   | 1280k | 2m   | 4m   |
>>>>>>>> | -------- | ---- | ---- | ---- | ---- | ----- | ---- | ---- |
>>>>>>>> | bw MiB/s | 20.1 | 33.4 | 51.8 | 67.1 | 74.7  | 82.9 | 82.9 |
>>>>>>
>>>>>> These results are extremely strange, unless you are running with the
>>>>>> device write cache disabled ? If you have the device write cache enabled,
>>>>>> the problem you mention above would be most likely completely invisible,
>>>>>> which I guess is why nobody really noticed any issue until now.
>>>>>>
>>>>>> Similarly, with reads, the device side read-ahead may hide the problem,
>>>>>> albeit that depends on how "intelligent" the drive is at identifying
>>>>>> sequential accesses.
>>>>>>
>>>>>>>>
>>>>>>>> It can be seen that 1280k io is already close to upper limit, and it'll
>>>>>>>> be hard to see differences with the default value, thus I set
>>>>>>>> 'max_sectors_kb' to 128 in the following test.
>>>>>>>>
>>>>>>>> Test cmd:
>>>>>>>>             fio \
>>>>>>>>             -filename=/dev/$dev \
>>>>>>>>             -name=test \
>>>>>>>>             -ioengine=psync \
>>>>>>>>             -allow_mounted_write=0 \
>>>>>>>>             -group_reporting \
>>>>>>>>             -direct=1 \
>>>>>>>>             -offset_increment=1g \
>>>>>>>>             -rw=randwrite \
>>>>>>>>             -bs=1024k \
>>>>>>>>             -numjobs={1,2,4,8,16,32,64,128,256,512} \
>>>>>>>>             -runtime=110 \
>>>>>>>>             -ramp_time=10
>>>>>>>>
>>>>>>>> Test result: MiB/s
>>>>>>>>
>>>>>>>> | numjobs | v5.18-rc1 | v5.18-rc1-patched |
>>>>>>>> | ------- | --------- | ----------------- |
>>>>>>>> | 1       | 67.7      | 67.7              |
>>>>>>>> | 2       | 67.7      | 67.7              |
>>>>>>>> | 4       | 67.7      | 67.7              |
>>>>>>>> | 8       | 67.7      | 67.7              |
>>>>>>>> | 16      | 64.8      | 65.6              |
>>>>>>>> | 32      | 59.8      | 63.8              |
>>>>>>>> | 64      | 54.9      | 59.4              |
>>>>>>>> | 128     | 49        | 56.9              |
>>>>>>>> | 256     | 37.7      | 58.3              |
>>>>>>>> | 512     | 31.8      | 57.9              |
>>>>>>
>>>>>> Device write cache disabled ?
>>>>>>
>>>>>> Also, what is the max QD of this disk ?
>>>>>>
>>>>>> E.g., if it is SATA, it is 32, so you will only get at most 64 scheduler
>>>>>> tags. So for any of your tests with more than 64 threads, many of the
>>>>>> threads will be waiting for a scheduler tag for the BIO before the
>>>>>> bio_split problem you explain triggers. Given that the numbers you show
>>>>>> are the same for before-after patch with a number of threads <= 64, I am
>>>>>> tempted to think that the problem is not really BIO splitting...
>>>>>>
>>>>>> What about random read workloads ? What kind of results do you see ?
>>>>>
>>>>> Hi,
>>>>>
>>>>> Sorry about the misleading of this test case.
>>>>>
>>>>> This testcase is high concurrency huge randwrite, it's just for the
>>>>> problem that split bios won't be issued continuously, which is the
>>>>> root cause of the performance degradation as the numjobs increases.
>>>>>
>>>>> queue_depth is 32, and numjobs is 64, thus when numjobs is not greater
>>>>> than 8, performance is fine, because the ratio of sequential io should
>>>>> be 7/8. However, as numjobs increases, performance is worse because
>>>>> the ratio is lower. For example, when numjobs is 512, the ratio of
>>>>> sequential io is about 20%.
>>>>
>>>> But with 512 jobs, you will get only 64 jobs only with IOs in the queue.
>>>> All other jobs will be waiting for a scheduler tag before being able to
>>>> issue their large BIO. No ?
>>>
>>> Hi,
>>>
>>> It's right.
>>>
>>> In fact, after this patchset, since each large io will need total 8
>>> tags, only 8 jobs can be in the queue while others are waiting for
>>> scheduler tag.
>>>
>>>>
>>>> It sounds like the set of scheduler tags should be a bit more elastic:
>>>> always allow BIOs from a split of a large BIO to be submitted (that is to
>>>> get a scheduler tag) even if that causes a temporary excess of the number
>>>> of requests beyond the default number of scheduler tags. Doing so, all
>>>> fragments of a large BIOs can be queued immediately. From there, if the
>>>> scheduler operates correctly, all the requests from the large BIOs split
>>>> would be issued in sequence to the device.
>>>
>>> This solution sounds feasible in theory, however, I'm not sure yet how
>>> to implement that 'temporary excess'.
>>
>> It should not be too hard.
> 
> I'll try to figure out a proper way, in the meantime, any suggestions
> would be appreciated.
>>
>> By the way, did you check that doing something like:
>>
>> echo 2048 > /sys/block/sdX/queue/nr_requests
>>
>> improves performance for your high number of jobs test case ?
> 
> Yes, performance will not degrade when numjobs is not greater than 256
> in this case.

That is my thinking as well. I am asking if did check that (did you run it ?).

>>
>>>
>>> Thanks,
>>> Kuai
>>>>
>>>>
>>>>>
>>>>> patch 6-8 will let split bios still be issued continuously under high
>>>>> pressure.
>>>>>
>>>>> Thanks,
>>>>> Kuai
>>>>>
>>>>
>>>>
>>
>>

-- 
Damien Le Moal
Western Digital Research