Re: migrating to space_cache=2 and btrfs userspace commands

From: DanglingPointer <danglingpointerexception@gmail.com>
To: Qu Wenruo <quwenruo.btrfs@gmx.com>, linux-btrfs@vger.kernel.org
Cc: danglingpointerexception@gmail.com
Subject: Re: migrating to space_cache=2 and btrfs userspace commands
Date: Fri, 16 Jul 2021 02:40:23 +1000
Message-ID: <a4ef513e-c7a4-99e0-c957-206a3763d9d1@gmail.com>
In-Reply-To: <ec9e92d8-ddfd-a103-6175-5176827ce9aa@gmx.com>

Hi Qu,

Just updating here that setting the mount option "space_cache=v2" and 
"noatime" completely SOLVED the performance problem!
Basically like night and day!

These are my full fstab mount options...

btrfs defaults,autodefrag,space_cache=v2,noatime 0 2
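
To confirm that options like these are actually in effect on the live
mount, a minimal sketch follows. It is not from the thread: it assumes
findmnt from util-linux is installed, and the mount point /mount/point
and the helper name has_mount_opt are placeholders for illustration.

```shell
#!/bin/sh
# has_mount_opt OPTIONS NAME
# Succeeds if the comma-separated OPTIONS string contains the exact
# option NAME (hypothetical helper, pure string matching).
has_mount_opt() {
    case ",$1," in
        *",$2,"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Placeholder mount point; override with MNT=/your/array.
MNT="${MNT:-/mount/point}"

# Ask findmnt for the options of the live mount (empty if not mounted).
opts=$(findmnt -n -o OPTIONS "$MNT" 2>/dev/null || true)

for want in space_cache=v2 noatime; do
    if has_mount_opt "$opts" "$want"; then
        echo "$want: active on $MNT"
    else
        echo "$want: NOT active on $MNT"
    fi
done
```

The match is on whole options, so a v1 mount, which lists plain
"space_cache" in its options, is not mistaken for v2.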

Perhaps making space_cache=v2 the default should be considered?  Why 
default to v1; what is the value of v1?

So in conclusion, for large multi-terabyte arrays (in my case RAID5s), 
setting space_cache=v2 and noatime massively increases performance and 
eliminates the long pauses, recurring at frequent intervals, caused by 
"btrfs-transacti" blocking all IO.

Thanks Qu for your help!

On 14/7/21 5:45 pm, Qu Wenruo wrote:
>
>
> On 2021/7/14 3:18 PM, DanglingPointer wrote:
>> a) "echo l > /proc/sysrq-trigger"
>>
>> The backup finished today already unfortunately and we are unlikely to
>> run it again until we get an outage to remount the array with the
>> space_cache=v2 and noatime mount options.
>> Thanks for the command, we'll definitely use it if/when it happens again
>> on the next large migration of data.
>
> Just to avoid confusion, after that command, "dmesg" output is still
> needed, as that's where sysrq puts its output.
>>
>>
>> b) "sudo btrfs qgroup show -prce" ........
>>
>> $ ERROR: can't list qgroups: quotas not enabled
>>
>> So looks like it isn't enabled.
>
> One less thing to worry about.
>>
>> File sizes are between: 1,048,576 bytes and 16,777,216 bytes (Duplicacy
>> backup defaults)
>
> Between 1 and 16 MiB, thus tons of small files.
>
> Btrfs is not really good at handling tons of small files, as they
> generate a lot of metadata.
>
> That may contribute to the hang.
>
>>
>> What classifies as a transaction?
>
> It's a little complex.
>
> Technically it's a check point where before the checkpoint, all you see
> is old data, after the checkpoint, all you see is new data.
>
> To end users, any data and metadata write will be included into one
> transaction (with proper dependency handled).
>
> One way to finish (or commit) the current transaction is to sync the fs,
> using the "sync" command (which syncs all filesystems).
>
>> Any/All writes done in a 30sec
>> interval?
>
> This is the default commit interval. Almost all filesystems will try to
> commit their data/metadata to disk after a configurable interval.
>
> The default one is 30s. That's also one way to commit the current
> transaction.
>
>>   If 100 unique files were written in 30secs, is that 1
>> transaction or 100 transactions?
>
> It depends, as things like syncfs() and subvolume/snapshot creation may
> try to commit the transaction.
>
> But without those special operations, just writing 100 unique files
> using buffered writes would only start one transaction, and when the
> 30s interval is hit, the transaction will be committed to disk.
>
>>   Millions of files of the size range
>> above were backed up.
>
> The number of files may not force a transaction commit, if it doesn't
> trigger enough memory pressure or free space pressure.
>
> Anyway, the "echo l" sysrq would help us locate what's taking so
> long.
>
>>
>>
>> c) "Just mount with "space_cache=v2""
>>
>> Ok so no need to "clear_cache" the v1 cache, right?
>
> Yes, and "clear_cache" won't really remove all the v1 cache anyway.
>
> Thus it doesn't help much.
>
> The only way to fully clear the v1 cache is by using "btrfs check
> --clear-space-cache v1" on an *unmounted* btrfs.
>
>> I wrote this in the fstab but hadn't remounted yet until I can get an
>> outage....
>
> IMHO if you really want to test if v2 would help, you can just remount,
> no need to wait for a break.
>
> Thanks,
> Qu
>>
>> ..."btrfs defaults,autodefrag,clear_cache,space_cache=v2,noatime  0  2"
>> Thanks again for your help Qu!
>>
>> On 14/7/21 2:59 pm, Qu Wenruo wrote:
>>>
>>>
>>> On 2021/7/13 11:38 PM, DanglingPointer wrote:
>>>> We're currently considering switching to "space_cache=v2" with noatime
>>>> mount options for my lab server-workstations running RAID5.
>>>
>>> Btrfs RAID5 is unsafe due to its write-hole problem.
>>>
>>>>
>>>>   * One has 13TB of data/metadata in a bunch of 6TB and 2TB disks
>>>>     totalling 26TB.
>>>>   * Another has about 12TB data/metadata in uniformly sized 6TB disks
>>>>     totalling 24TB.
>>>>   * Both of the arrays are on individually luks encrypted disks with
>>>>     btrfs on top of the luks.
>>>>   * Both have "defaults,autodefrag" turned on in fstab.
>>>>
>>>> We're starting to see large pauses during constant backups of millions
>>>> of chunk files (using duplicacy backup) in the 24TB array.
>>>>
>>>> Pauses sometimes last 20+ seconds and recur roughly 30 seconds after
>>>> the end of the previous pause.  The "btrfs-transacti" process
>>>> consistently shows up as the blocking process/thread locking up
>>>> filesystem IO.  IO gets into the RAID5 array via nfsd. There are no disk
>>>> or btrfs errors recorded.  The last scrub finished successfully yesterday.
>>>
>>> Please provide the "echo l > /proc/sysrq-trigger" output when such a
>>> pause happens.
>>>
>>> If you're using qgroup (may be enabled by things like snapper), it may
>>> be the cause, as qgroup does its accounting when committing a
>>> transaction.
>>>
>>> If one transaction is super large, it can cause such a problem.
>>>
>>> You can test if qgroup is enabled by:
>>>
>>> # btrfs qgroup show -prce <mnt>
>>>
>>>>
>>>> After doing some research around the internet, we arrived at the
>>>> plan described above.  Unfortunately the official
>>>> documentation isn't clear on the following.
>>>>
>>>> Official documentation URL -
>>>> https://btrfs.wiki.kernel.org/index.php/Manpage/btrfs(5)
>>>>
>>>> 1. How to migrate from default space_cache=v1 to space_cache=v2? It
>>>>     talks about the reverse, from v2 to v1!
>>>
>>> Just mount with "space_cache=v2".
>>>
>>>> 2. If we use space_cache=v2, is it indeed still the case that the
>>>>     "btrfs" command will NOT work with the filesystem?
>>>
>>> Why would you think "btrfs" won't work on a btrfs?
>>>
>>> Thanks,
>>> Qu
>>>
>>>>   So will our
>>>>     "btrfs scrub start /mount/point/..." cron jobs FAIL? I'm guessing
>>>>     the btrfs command comes from btrfs-progs which is currently
>>>>     v5.4.1-2 amd64, is that correct?
>>>> 3. Any other ideas on how we can get rid of those annoying pauses with
>>>>     large backups into the array?
>>>>
>>>> Thanks in advance!
>>>>
>>>> DP
>>>>