From: Logan Gunthorpe <logang@deltatee.com>
To: linux-raid@vger.kernel.org
Cc: Shaohua Li <shli@kernel.org>, Shaohua Li <shli@fb.com>,
	Song Liu <song@kernel.org>
Subject: Re: Raid5 Batching Question
Date: Wed, 2 Mar 2022 16:24:55 -0700	[thread overview]
Message-ID: <34a8b64a-37d3-9966-1fe8-d57c432600d7@deltatee.com> (raw)
In-Reply-To: <11cfe3aa-b778-b3e5-a152-50abc6c054ac@deltatee.com>



On 2022-02-24 5:17 p.m., Logan Gunthorpe wrote:
> Hello,
> 
> We've been looking at trying to improve sequential write performance out
> of Raid5 on modern hardware. Our profiling so far seems to indicate that
> one of the issues is high CPU usage due to handling all the stripe
> heads, one for each page. Some investigation shows that Shaohua already
> added a batching feature back in 2015, which seems to be exactly what
> we need.
> 
> However, after adding some additional debug prints we're not seeing any
> batching occurring in our basic testing and I find myself rather
> confused by the code.
> 
> I see that batches are supposed to be created at the end of
> add_stripe_bio() with a call to stripe_add_to_batch_list(). But in our
> testing stripe_can_batch() never returns true.
> 
> stripe_can_batch() calls is_full_stripe_write(), which evaluates the
> following condition:
>   overwrite_disks == (disks - max_degraded)
> 
> In our simple tests, disks is 3 and this is raid5 so max_degraded is 1.
> However, overwrite_disks is also always 1. So, 1 != (3-1) and
> is_full_stripe_write() always seems to return false.
> 
> overwrite_disks appears to be incremented on the stripe only once,
> earlier in add_stripe_bio(), after what looks like a check that all
> sectors in the page are being written. But I don't see how
> overwrite_disks could ever reach 2 for a single stripe.
> 
> What am I missing? How can I ensure batches are being used with large
> sequential writes?
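(As an aside, the check in question boils down to a condition that can be sketched as below; the struct is a simplified stand-in for the kernel's stripe_head and r5conf, not the real definitions.)

```c
#include <stdbool.h>

/*
 * Simplified model of the full-stripe-write check discussed above.
 * Field names illustrate the idea only; they do not match the kernel's
 * actual struct layouts.
 */
struct stripe_model {
	int disks;           /* total member disks (3 in the test above) */
	int max_degraded;    /* parity disks that may fail: 1 for raid5 */
	int overwrite_disks; /* data disks whose pages are fully overwritten */
};

static bool is_full_stripe_write_model(const struct stripe_model *sh)
{
	/* batching is only possible when every data disk is overwritten */
	return sh->overwrite_disks == (sh->disks - sh->max_degraded);
}
```

With disks = 3 and max_degraded = 1, the write must fully cover both data disks (overwrite_disks == 2) before the condition holds, which matches the behaviour described above.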

Replying to myself:

It looks like batching is happening for me after all. When I first
tried to trace it, the chunk size was too large and the amount of data
too small, so there was never enough I/O to fully overwrite the other
disks in the chunk.
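Concretely, the amount of aligned sequential I/O needed before a stripe becomes a full-stripe write is roughly the chunk size times the number of data disks. A quick sketch (the 512 KiB chunk is just an example value, not what my test array used):

```c
/*
 * Rough arithmetic only: bytes of aligned sequential I/O needed to
 * produce one full-stripe write.
 */
static unsigned long full_stripe_bytes(unsigned long chunk_bytes,
				       int disks, int max_degraded)
{
	return chunk_bytes * (unsigned long)(disks - max_degraded);
}

/* e.g. 3-disk raid5 with a 512 KiB chunk -> 1 MiB per full stripe */
```

So a small test write can easily fall short of one full stripe and never trigger the batching path.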

When running larger jobs, batching seems to work correctly, but with
larger chunks the stripe heads might be handled before the rest of the
batch is added.

So I'm back to square one trying to find some performance improvements
on the write path.

Thanks,

Logan



Thread overview: 5+ messages
2022-02-25  0:17 Raid5 Batching Question Logan Gunthorpe
2022-03-02 23:24 ` Logan Gunthorpe [this message]
2022-03-03  1:40   ` Guoqing Jiang
2022-03-03 16:20     ` Logan Gunthorpe
2022-03-04 14:21       ` o1bigtenor
