From: Marcin Wanat <marcin.wanat@gmail.com>
To: Song Liu <song@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>, linux-raid <linux-raid@vger.kernel.org>
Subject: Re: Slow initial resync in RAID6 with 36 SAS drives
Date: Sat, 4 Sep 2021 17:24:51 +0200	[thread overview]
Message-ID: <CAFDAVznS71BXW8Jxv6k9dXc2iR3ysX3iZRBww_rzA8WifBFxGg@mail.gmail.com> (raw)
In-Reply-To: <CAPhsuW7R=8XyU5wU1-NT-Eo=Hir7gV_b7+_MB+bUFx3bEecbDw@mail.gmail.com>

On 03.09.2021 02:58, Song Liu wrote:
> On Tue, Aug 31, 2021 at 10:19 PM Song Liu <song@kernel.org> wrote:
>>
>> On Wed, Aug 25, 2021 at 3:06 AM Marcin Wanat <marcin.wanat@gmail.com> wrote:
>>>
>>> On Thu, Aug 19, 2021 at 11:28 AM Marcin Wanat <marcin.wanat@gmail.com> wrote:
>>>>
>>>> Sorry, this will be a long email with everything I find to be relevant.
>>>> I have a mdraid6 array with 36 hdd SAS drives each able to do
>>>> >200MB/s, but I am unable to get more than 38MB/s resync speed on a
>>>> fast system (48cores/96GB ram) with no other load.
>>>
>>> I have done a bit more research on a 24 NVMe drive server and found
>>> that the resync speed bottleneck affects RAID6 with >16 drives:
>>
>> Sorry for the late response.
>>
>> This is interesting behavior. I don't really know why this is the case at the
>> moment. Let me try to reproduce this first.
>>
>> Thanks,
>> Song
>
> The issue is caused by blk_plug logic. Something like the following should
> fix it.
>
> Marcin, could you please give it a try?
>

I tested the patch against the latest longterm kernel (as this is a
semi-production server) on the 36 HDD RAID6 array, and it is working fine.

Here are the numbers. Without the patch:

# cat /proc/mdstat
Personalities : [raid1] [raid6] [raid5] [raid4]
md0 : active raid6 sdak1[23] sdag1[34] sdaj1[22] sdai1[21] sdah1[35]
sdal1[24] sdae1[20] sdaa1[29] sdx1[19] sdaf1[33] sdad1[32] sdac1[31]
sdab1[30] sdu1[14] sdz1[28] sdv1[18] sdw1[26] sdy1[27] sdt1[13]
sdq1[17] sdr1[25] sds1[12] sdp1[11] sdo1[9] sdm1[15] sdn1[16] sdl1[10]
sdk1[8] sda1[2] sdh1[6] sdc1[1] sdj1[7] sde1[3] sdb1[0] sdi1[4]
sdg1[5]
464838634496 blocks super 1.2 level 6, 512k chunk, algorithm 2 [36/36] [UUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUU]
[>....................]  check =  0.0% (651600/13671724544) finish=5944.4min speed=38329K/sec
bitmap: 1/102 pages [4KB], 65536KB chunk

# iostat -dx 5
Linux 5.10.62 (large1) 09/04/2021 _x86_64_ (48 CPU)

Device r/s w/s rkB/s wkB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
sdc 9998.40 1.40 41277.60 3.40 321.00 0.00 3.11 0.00 1.42 2.57 14.19 4.13 2.43 0.08 80.54
sdb 10154.80 1.40 41277.60 3.40 164.60 0.00 1.60 0.00 0.83 5.86 8.47 4.06 2.43 0.08 82.02
sdi 10182.80 1.40 41277.60 3.40 136.60 0.00 1.32 0.00 0.78 9.29 7.98 4.05 2.43 0.08 81.86
sdh 10002.00 1.40 41278.40 3.40 317.60 0.00 3.08 0.00 1.11 7.71 11.14 4.13 2.43 0.08 80.56
sde 9779.40 1.40 41278.40 3.40 540.20 0.00 5.23 0.00 2.10 2.43 20.50 4.22 2.43 0.08 80.10
sdj 9861.80 1.40 41277.60 3.40 457.60 0.00 4.43 0.00 1.72 15.14 16.98 4.19 2.43 0.08 81.36
sdg 10167.80 1.40 41277.60 3.40 151.60 0.00 1.47 0.00 0.95 3.43 9.68 4.06 2.43 0.08 80.06
[...]

With the patch:
# cat /proc/mdstat
Personalities : [raid1] [raid6] [raid5] [raid4]
md0 : active raid6 sdak1[34] sdal1[35] sdaj1[33] sdah1[31] sdai1[32]
sdae1[28] sdag1[30] sdaf1[29] sdad1[27] sdac1[26] sdab1[25] sdaa1[17]
sdz1[16] sdy1[15] sdx1[10] sdv1[23] sds1[20] sdw1[24] sdq1[18]
sdr1[19] sdt1[21] sdu1[22] sdm1[11] sdp1[14] sdl1[9] sdn1[12] sdo1[13]
sdk1[8] sdh1[5] sdj1[7] sdc1[0] sdg1[4] sdf1[3] sde1[2] sdd1[1]
sdi1[6]
464838634496 blocks super 1.2 level 6, 512k chunk, algorithm 2 [36/36] [UUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUU]
[>....................]  check =  0.0% (488284/13671724544) finish=2799.8min speed=81380K/sec
bitmap: 2/102 pages [8KB], 65536KB chunk

# iostat -dx 5
Linux 5.10.62 (large1) 09/04/2021 _x86_64_ (48 CPU)

Device r/s w/s rkB/s wkB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
sdd 168.00 1.80 80596.80 4.40 19971.00 0.00 99.17 0.00 6.67 7.44 1.14 479.74 2.44 3.22 54.72
sdc 168.20 2.00 80598.40 6.00 19971.00 0.20 99.16 9.09 6.55 4.60 1.11 479.18 3.00 3.18 54.08
sde 168.20 2.00 80598.40 6.00 19964.60 0.20 99.16 9.09 6.69 4.90 1.14 479.18 3.00 3.22 54.80
sdg 168.00 1.80 80596.80 4.40 19964.20 0.00 99.17 0.00 6.66 3.11 1.13 479.74 2.44 3.25 55.20
sdf 168.00 1.80 80596.80 4.40 19964.20 0.00 99.17 0.00 6.62 3.11 1.12 479.74 2.44 3.23 54.84
sdk 168.20 1.80 80596.80 4.40 19963.80 0.00 99.16 0.00 6.82 3.56 1.16 479.17 2.44 3.33 56.58
sdi 168.00 1.80 80596.80 4.40 19964.00 0.00 99.17 0.00 6.67 3.00 1.13 479.74 2.44 3.29 55.78
sdj 168.00 1.80 80596.80 4.40 19963.80 0.00 99.17 0.00 6.75 3.33 1.14 479.74 2.44 3.31 56.14
sdh 168.00 1.80 80596.80 4.40 19962.60 0.00 99.17 0.00 6.72 3.33 1.14 479.74 2.44 3.29 55.86
[...]


With this patch, per-drive read IOPS dropped from ~10000 to ~168 and
rareq-sz grew from ~4 KB to ~480 KB, i.e. the requests are finally being
merged (168 r/s x ~480 KB ≈ 80 MB/s per member drive, which matches the
reported resync rate; before the patch, ~10000 r/s x ~4 KB ≈ 40 MB/s).
Overall resync speed increased from 38 MB/s to 81 MB/s.
These numbers were obtained with default kernel settings, and I am fairly
confident I can now fine-tune things further using
group_thread_cnt/stripe_cache_size to get even better results; a rough
sketch of what I have in mind is below.
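
For reference, these are the md tunables I mean (only a sketch; the values
are illustrative starting points for a 36-drive array, not something I have
tested yet on this box):

# grow the raid5/6 stripe cache (number of cache entries; default is 256)
echo 8192 > /sys/block/md0/md/stripe_cache_size
# let several worker threads handle stripe processing (default 0 = single raid5d thread)
echo 4 > /sys/block/md0/md/group_thread_cnt
# optionally raise the md resync throttling limits (values in KB/s)
echo 200000 > /proc/sys/dev/raid/speed_limit_min
echo 2000000 > /proc/sys/dev/raid/speed_limit_max

The stripe cache costs roughly 4 KB per entry per member device, so 8192
entries across 36 drives is on the order of 1.1 GB of RAM, which should be
acceptable on this 96 GB machine.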

Thank you!


-- 
Marcin Wanat

On Fri, Sep 3, 2021 at 2:58 AM Song Liu <song@kernel.org> wrote:
>
> The issue is caused by blk_plug logic. Something like the following should
> fix it.
>
> Marcin, could you please give it a try?
>
> Thanks,
> Song
>
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 2c4ac51e54eba..fdb945be85753 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -2251,7 +2251,7 @@ blk_qc_t blk_mq_submit_bio(struct bio *bio)
>                 else
>                         last = list_entry_rq(plug->mq_list.prev);
>
> -               if (request_count >= BLK_MAX_REQUEST_COUNT || (last &&
> +               if (request_count >= blk_plug_max_rq_count(plug) || (last &&
>                     blk_rq_bytes(last) >= BLK_PLUG_FLUSH_SIZE)) {
>                         blk_flush_plug_list(plug, false);
>                         trace_block_plug(q);
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index b5c033cf5f26f..2e3c07e959c14 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -1239,6 +1239,13 @@ extern void blk_start_plug(struct blk_plug *);
>  extern void blk_finish_plug(struct blk_plug *);
>  extern void blk_flush_plug_list(struct blk_plug *, bool);
>
> +static inline unsigned short blk_plug_max_rq_count(struct blk_plug *plug)
> +{
> +       if (plug->multiple_queues)
> +               return BLK_MAX_REQUEST_COUNT * 4;
> +       return BLK_MAX_REQUEST_COUNT;
> +}
> +
>  static inline void blk_flush_plug(struct task_struct *tsk)
>  {
>         struct blk_plug *plug = tsk->plug;

