Subject: Re: [PATCH -next RFC v3 0/8] improve tag allocation under heavy load
From: "yukuai (C)"
To: Damien Le Moal
Date: Mon, 25 Apr 2022 15:05:46 +0800
In-Reply-To: <237a43f0-3b09-46d0-e73c-57ef51e39590@opensource.wdc.com>
X-Mailing-List: linux-block@vger.kernel.org

On 2022/04/25 14:50, Damien Le Moal wrote:
> On 4/25/22 15:47, yukuai (C) wrote:
>> On 2022/04/25 14:23, Damien Le Moal wrote:
>>> On 4/25/22 15:14, yukuai (C) wrote:
>>>> On 2022/04/25 11:24, Damien Le Moal wrote:
>>>>> On 4/24/22 11:43, yukuai (C) wrote:
>>>>>> friendly ping ...
>>>>>>
>>>>>> On 2022/04/15 18:10, Yu Kuai wrote:
>>>>>>> Changes in v3:
>>>>>>> - update 'waiters_cnt' before 'ws_active' in sbitmap_prepare_to_wait()
>>>>>>>   in patch 1, in case __sbq_wake_up() sees 'ws_active > 0' while
>>>>>>>   'waiters_cnt' are all 0, which would cause a dead loop.
>>>>>>> - don't add 'wait_index' during each loop in patch 2
>>>>>>> - fix that 'wake_index' might mismatch on the first wake-up in
>>>>>>>   patch 3, and improve the coding of the patch.
>>>>>>> - add a detection in patch 4 in case an io hang is triggered in
>>>>>>>   corner cases.
>>>>>>> - make the detection that free tags are sufficient more flexible.
>>>>>>> - fix a race in patch 8.
>>>>>>> - fix some words and add some comments.
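
To make the first item above concrete, here is a minimal sketch of the
ordering it describes, assuming a simplified queue structure (the
'_sketch' names are made up for illustration; this is not the exact
code of patch 1):

#include <linux/atomic.h>

struct sbq_ws_sketch {
        atomic_t waiters_cnt;           /* waiters on this waitqueue */
};

struct sbq_sketch {
        atomic_t ws_active;             /* waitqueues that have waiters */
};

/*
 * The waiter must publish the per-waitqueue count before the global
 * one. With the reverse order, a waker could observe ws_active > 0
 * while every waiters_cnt is still 0, and then loop over the
 * waitqueues forever looking for someone to wake.
 */
static void sbq_add_waiter_sketch(struct sbq_sketch *sbq,
                                  struct sbq_ws_sketch *ws)
{
        atomic_inc(&ws->waiters_cnt);   /* 1) this waitqueue has a waiter */
        smp_mb__after_atomic();         /* order against the waker's reads */
        atomic_inc(&sbq->ws_active);    /* 2) only then raise the global count */
}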
>>>>>>>
>>>>>>> Changes in v2:
>>>>>>> - use a new title
>>>>>>> - add patches to fix waitqueues' unfairness - patch 1-3
>>>>>>> - delete patch to add queue flag
>>>>>>> - delete patch to split big io thoroughly
>>>>>>>
>>>>>>> In this patchset:
>>>>>>> - patch 1-3 fix waitqueues' unfairness.
>>>>>>> - patch 4,5 disable tag preemption on heavy load.
>>>>>>> - patch 6 forces tag preemption for split bios.
>>>>>>> - patch 7,8 improve large random io for HDD. We do meet the problem,
>>>>>>>   and I'm trying to fix it at very low cost. However, if anyone
>>>>>>>   still thinks this is not a common case and not worth optimizing,
>>>>>>>   I'll drop them.
>>>>>>>
>>>>>>> There is a defect in blk-mq compared to blk-sq: split io will end
>>>>>>> up discontinuous if the device is under high io pressure, while
>>>>>>> split io will still be continuous in sq. This is because:
>>>>>>>
>>>>>>> 1) new io can preempt a tag even if there are lots of threads
>>>>>>>    waiting.
>>>>>>> 2) split bios are issued one by one; if one bio can't get a tag,
>>>>>>>    it will go to wait.
>>>>>>> 3) each time 8 (or wake_batch) requests are done, 8 waiters will
>>>>>>>    be woken up. Thus if a thread is woken up, it is unlikely to
>>>>>>>    get multiple tags.
>>>>>>>
>>>>>>> The problem was first found by upgrading the kernel from v3.10 to
>>>>>>> v4.18; the test device is an HDD with 256 'max_sectors_kb', and
>>>>>>> the test case is issuing 1m ios with high concurrency.
>>>>>>>
>>>>>>> Note that there is a precondition for such a performance problem:
>>>>>>> there is a certain gap between the bandwidth for a single io with
>>>>>>> bs=max_sectors_kb and the disk's upper limit.
>>>>>>>
>>>>>>> During the test, I found that waitqueues can be extremely
>>>>>>> unbalanced under heavy load. This is because 'wake_index' is not
>>>>>>> set properly in __sbq_wake_up(), see details in patch 3.
>>>>>>>
>>>>>>> Test environment:
>>>>>>> arm64, 96 cores with 200 BogoMIPS, test device is an HDD. The
>>>>>>> default 'max_sectors_kb' is 1280 (sorry that I was unable to test
>>>>>>> on the machine where 'max_sectors_kb' is 256).
>>>>>>>
>>>>>>> The single io performance (randwrite):
>>>>>>>
>>>>>>> | bs       | 128k | 256k | 512k | 1m   | 1280k | 2m   | 4m   |
>>>>>>> | -------- | ---- | ---- | ---- | ---- | ----- | ---- | ---- |
>>>>>>> | bw MiB/s | 20.1 | 33.4 | 51.8 | 67.1 | 74.7  | 82.9 | 82.9 |
>>>>>
>>>>> These results are extremely strange, unless you are running with
>>>>> the device write cache disabled? If you have the device write cache
>>>>> enabled, the problem you mention above would most likely be
>>>>> completely invisible, which I guess is why nobody really noticed
>>>>> any issue until now.
>>>>>
>>>>> Similarly, with reads, the device-side read-ahead may hide the
>>>>> problem, albeit that depends on how "intelligent" the drive is at
>>>>> identifying sequential accesses.
>>>>>
>>>>>>>
>>>>>>> It can be seen that 1280k io is already close to the upper limit,
>>>>>>> and it would be hard to see differences with the default value,
>>>>>>> thus I set 'max_sectors_kb' to 128 in the following test.
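
(A side note on the setup above: the 'max_sectors_kb' change is a
plain sysfs write, with sdX standing for the test disk:

echo 128 > /sys/block/sdX/queue/max_sectors_kb
cat /sys/block/sdX/queue/max_sectors_kb

With bs=1024k in the test below, every 1m io is then split into 8
sequential 128k fragments, each needing its own tag.)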
>>>>>>>
>>>>>>> Test cmd:
>>>>>>> fio \
>>>>>>>  -filename=/dev/$dev \
>>>>>>>  -name=test \
>>>>>>>  -ioengine=psync \
>>>>>>>  -allow_mounted_write=0 \
>>>>>>>  -group_reporting \
>>>>>>>  -direct=1 \
>>>>>>>  -offset_increment=1g \
>>>>>>>  -rw=randwrite \
>>>>>>>  -bs=1024k \
>>>>>>>  -numjobs={1,2,4,8,16,32,64,128,256,512} \
>>>>>>>  -runtime=110 \
>>>>>>>  -ramp_time=10
>>>>>>>
>>>>>>> Test result: MiB/s
>>>>>>>
>>>>>>> | numjobs | v5.18-rc1 | v5.18-rc1-patched |
>>>>>>> | ------- | --------- | ----------------- |
>>>>>>> | 1       | 67.7      | 67.7              |
>>>>>>> | 2       | 67.7      | 67.7              |
>>>>>>> | 4       | 67.7      | 67.7              |
>>>>>>> | 8       | 67.7      | 67.7              |
>>>>>>> | 16      | 64.8      | 65.6              |
>>>>>>> | 32      | 59.8      | 63.8              |
>>>>>>> | 64      | 54.9      | 59.4              |
>>>>>>> | 128     | 49        | 56.9              |
>>>>>>> | 256     | 37.7      | 58.3              |
>>>>>>> | 512     | 31.8      | 57.9              |
>>>>>
>>>>> Device write cache disabled?
>>>>>
>>>>> Also, what is the max QD of this disk?
>>>>>
>>>>> E.g., if it is SATA, it is 32, so you will only get at most 64
>>>>> scheduler tags. So for any of your tests with more than 64 threads,
>>>>> many of the threads will be waiting for a scheduler tag for the BIO
>>>>> before the bio_split problem you explain triggers. Given that the
>>>>> numbers you show are the same before and after the patch for a
>>>>> number of threads <= 64, I am tempted to think that the problem is
>>>>> not really BIO splitting...
>>>>>
>>>>> What about random read workloads? What kind of results do you see?
>>>>
>>>> Hi,
>>>>
>>>> Sorry about the misleading nature of this test case.
>>>>
>>>> This test case is high-concurrency huge randwrite; it is just for
>>>> the problem that split bios won't be issued continuously, which is
>>>> the root cause of the performance degradation as numjobs increases.
>>>>
>>>> queue_depth is 32, and there are 64 scheduler tags, thus when
>>>> numjobs is not greater than 8, performance is fine, because the
>>>> ratio of sequential io should be 7/8. However, as numjobs increases,
>>>> performance gets worse because the ratio is lower. For example, when
>>>> numjobs is 512, the ratio of sequential io is about 20%.
>>>
>>> But with 512 jobs, you will only get 64 jobs with IOs in the queue.
>>> All the other jobs will be waiting for a scheduler tag before being
>>> able to issue their large BIO. No?
>>
>> Hi,
>>
>> That's right.
>>
>> In fact, after this patchset, since each large io needs 8 tags in
>> total, only 8 jobs can be in the queue while the others are waiting
>> for a scheduler tag.
>>
>>>
>>> It sounds like the set of scheduler tags should be a bit more
>>> elastic: always allow BIOs from a split of a large BIO to be
>>> submitted (that is, to get a scheduler tag), even if that causes a
>>> temporary excess of the number of requests beyond the default number
>>> of scheduler tags. Doing so, all fragments of a large BIO can be
>>> queued immediately. From there, if the scheduler operates correctly,
>>> all the requests from the large BIO's split would be issued in
>>> sequence to the device.
>>
>> This solution sounds feasible in theory; however, I'm not sure yet
>> how to implement that 'temporary excess'.
>
> It should not be too hard.

I'll try to figure out a proper way; in the meantime, any suggestions
would be appreciated.

>
> By the way, did you check that doing something like:
>
> echo 2048 > /sys/block/sdX/queue/nr_requests
>
> improves performance for your high number of jobs test case?

Yes, performance will not degrade when numjobs is not greater than 256
in this case.
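
Concretely, with sdX standing for the test disk:

# nr_requests caps the number of scheduler tags; the default here
# is 64 (twice the queue depth of 32).
cat /sys/block/sdX/queue/nr_requests
echo 2048 > /sys/block/sdX/queue/nr_requests

That also matches the arithmetic: with 2048 scheduler tags and 8 tags
per 1m write, up to 2048 / 8 = 256 jobs can hold all of their
fragments in the queue at once, which is where the degradation starts
again.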
>
>>
>> Thanks,
>> Kuai
>>>
>>>
>>>>
>>>> patch 6-8 will let split bios still be issued continuously under
>>>> high pressure.
>>>>
>>>> Thanks,
>>>> Kuai
>>>>
>>>
>>>
>
>