linux-btrfs.vger.kernel.org archive mirror
From: Qu Wenruo <quwenruo.btrfs@gmx.com>
To: DanglingPointer <danglingpointerexception@gmail.com>,
	linux-btrfs@vger.kernel.org
Subject: Re: Enhancement Idea - Optional PGO+LTO build for btrfs-progs
Date: Wed, 14 Jul 2021 15:57:29 +0800	[thread overview]
Message-ID: <2a29adba-8451-7550-a6f1-835be431953b@gmx.com> (raw)
In-Reply-To: <db80b801-9e7d-ce2b-15dd-84b30faf19cd@gmail.com>



On 2021/7/14 3:34 PM, DanglingPointer wrote:
> "Why would you think btrfs-progs is the one that needs optimization?"
>
> Perhaps I should have written more context.  When the data migration was
> taking a very long time (days), and pauses due to "btrfs-transacti" were
> blocking all IO including nfsd, we thought, "should we run '$ btrfs scrub
> <mount>' to make sure nothing had gone wrong?"
>
> Problem is, scrubbing on the whole RAID5 takes ages!

First things first: if your system may hit unexpected power loss or disk
corruption, it's highly recommended not to use btrfs RAID5.

(It's OK if you build btrfs on top of software/hardware RAID5.)

Btrfs RAID5 has the write-hole problem, meaning each power loss or disk
corruption slightly degrades the robustness of the RAID5 array.

With enough such small degradations, some corruption will no longer be
repairable.

Scrub is the way to rebuild that robustness, so it's great you're
already doing it, but I still wouldn't recommend btrfs RAID5 for now,
just as the btrfs(5) man page says.



Another reason btrfs RAID5 scrub takes so long is how we do a full fs
scrub.

We initiate scrub for *each* device.

That means, if we have 4 disks, we scrub all 4 disks separately.

For each device scrub, we need to read all 4 stripes of every full
stripe, to make sure the parity stripe is also fine.

But in theory, we only need to read each RAID5 full stripe once, and
then all devices are covered.

So you can see we waste quite a lot of disk IO on duplicated checks.

That's also another reason we recommend RAID1/RAID10: when scrubbing
each device, we really only need to read from that device, and only
when its data is corrupted do we try to read the other copy.

No extra IO is wasted there.
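To make the wasted IO concrete, here is a rough back-of-the-envelope
sketch.  The 4-disk array and per-device capacity are made-up numbers,
purely illustrative:

```python
# Rough IO estimate for scrubbing a hypothetical 4-disk btrfs RAID5.
# The per-device scrub described above reads every stripe of each full
# stripe (data + parity) to verify parity, so each device's scrub
# touches roughly the whole array.

n_devices = 4
data_per_device_tb = 4  # TB of allocated chunks per device (assumed)
array_tb = n_devices * data_per_device_tb

# Current scheme: one scrub per device, each reading ~the whole array.
raid5_scrub_read_tb = n_devices * array_tb

# Ideal scheme: read each RAID5 full stripe exactly once.
ideal_scrub_read_tb = array_tb

# RAID1/RAID10: scrubbing a device only reads that device (ignoring
# the rare corrupted-block case where the other copy is fetched).
raid1_scrub_read_tb = n_devices * data_per_device_tb

print(raid5_scrub_read_tb, ideal_scrub_read_tb, raid1_scrub_read_tb)
# With these numbers, the per-device RAID5 scheme reads 4x more data.
```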

Thanks,
Qu

>  If we did one disk
> of the array only, it would at least sample a quarter of the array, with
> a quarter chance of detecting if something/anything had gone wrong, and
> hopefully wouldn't massively slow down the ongoing migration.
>
> We tried it for a while on a single drive and it did indeed have 2x
> the scrubbing throughput, but it was still very slow since we're talking
> multi-terabytes on a single disk!  I believe the ETA forecast was ~3
> days.
>
> Interestingly, scrubbing the whole lot (the whole RAID5 array) in one go
> by just scrubbing the mount point gives a 4-day ETA, which we do
> regularly every 3 months.  So even though it is slower on each disk, it
> finishes the whole lot faster than doing one disk at a time sequentially.
>
> Anyways, thanks for informing us about what btrfs-progs does and how
> 'scrub speed' is independent of btrfs-progs and handled by the kernel
> ioctls (on the other email thread).
>
> regards,
>
> DP
>
> I thought btrfs scrub was part of btrfs-progs.  Pardon my ignorance if
> it isn't.
>
>
> On 14/7/21 3:00 pm, Qu Wenruo wrote:
>>
>>
>> On 2021/7/14 10:51 AM, DanglingPointer wrote:
>>> Recently we have been impacted by some performance issues with the
>>> workstations in my lab with large multi-terabyte btrfs arrays.  I
>>> have detailed this in a separate thread.  It got me thinking, however:
>>> why not have an optional configure option for btrfs-progs to use PGO
>>> against the entire suite of regression tests?
>>>
>>> Idea is:
>>>
>>> 1. configure with an optional "-pgo" or "-fdo" option which will set
>>>     a relative path from the source root where instrumentation files go
>>>     (let's start with gcc only for now, so *.gcda files into a folder).
>>>     We then add the instrumentation compiler option
>>> 2. build btrfs-progs
>>> 3. run every single test available ( make test &&  make test-fsck &&
>>>     make test-convert)
>>> 4. clean up everything except the instrumentation files
>>> 5. re-build without the instrumentation flag from point 1, and use the
>>>     instrumentation files for feedback-directed optimisation (FDO) (for
>>>     gcc, add the partial-training flag); add LTO.
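For anyone who wants to try this by hand today, the five steps above
might look roughly like the following.  The GCC flags are real; the
profile directory name and CFLAGS layout are my assumptions, not
anything btrfs-progs currently supports:

```shell
# Hypothetical manual PGO+LTO build of btrfs-progs (sketch, untested).
PGO_DIR="$PWD/pgo-data"

# Steps 1-2: configure and build with instrumentation.
./configure CFLAGS="-O2 -fprofile-generate=$PGO_DIR"
make

# Step 3: run the test suites to collect *.gcda profiles.
make test && make test-fsck && make test-convert

# Step 4: clean build artifacts but keep the profiles in $PGO_DIR.
make clean

# Step 5: rebuild using the profiles, with partial training and LTO.
./configure CFLAGS="-O2 -fprofile-use=$PGO_DIR -fprofile-partial-training -flto"
make
```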
>>
>> Why would you think btrfs-progs is the one that needs optimization?
>>
>> From your original report, there is nothing btrfs-progs related at all.
>>
>> All your workload, from scrub to IO, is handled by the kernel btrfs
>> module.
>>
>> Thus optimizing btrfs-progs would bring no impact.
>>
>> Thanks,
>> Qu
>>>
>>> I know btrfs is primarily IO bound, not CPU bound.  But I'm just
>>> thinking of squeezing every last bit of efficiency out of whatever is
>>> running in the CPU, cache and memory.
>>>
>>> I suppose people can do the above on their own, but I was thinking that
>>> if it was provided as an optional configure option, it would make it
>>> easier for people to do without more hacking.  We'd just need to add a
>>> warning that it will take a long time; have a coffee.
>>>
>>> The python3 configure process has the above as an optional step, and
>>> caters for gcc and clang (it might even cater for icc).
>>>
>>> Anyways, that's my idea for an enhancement above.
>>>
>>> Would like to know your thoughts.  cheers...
>>>

Thread overview: 7+ messages
2021-07-14  2:51 Enhancement Idea - Optional PGO+LTO build for btrfs-progs DanglingPointer
2021-07-14  5:00 ` Qu Wenruo
2021-07-14  7:34   ` DanglingPointer
2021-07-14  7:57     ` Qu Wenruo [this message]
2021-07-14  9:19       ` DanglingPointer
2021-07-14 12:35         ` Neal Gompa
2021-07-14 13:01           ` Qu Wenruo
