From: Joshua Villwock <joshua@mailmag.net>
To: Qu Wenruo <quwenruo.btrfs@gmx.com>
Cc: linux-btrfs@vger.kernel.org
Subject: Re: migrating to space_cache=2 and btrfs userspace commands
Date: Fri, 16 Jul 2021 13:33:24 -0700	[thread overview]
Message-ID: <4C74CCF4-95CD-446D-A01B-F61AB61550A4@mailmag.net> (raw)
In-Reply-To: <40c94987-936f-e6ac-bcec-2051284e1821@gmx.com>

Qu, thanks indeed for your responses!

> On Jul 16, 2021, at 5:59 AM, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
> 
> 
> 
>> On 2021/7/16 8:42 PM, DanglingPointer wrote:
>> Hi Joshua, on that system where you tried to run the
>> "--clear-space-cache v1", when you gave up, did you continue using
>> "space_cache=v2" on it?
>> 
>> 
>> Here are some more questions to add for anyone who can help educate us:...
>> 
>> Why would someone want to clear the space_cache v1?
> 
> Because they take up space, and otherwise there is no way to delete them.
> 
>> 
>> What's the value of clearing the previous version space_cache before
>> using the new version?
> 
> AFAIK it just saves some space and reduces the root tree size.
> 
> The v1 cache exists as special files inside the root tree (not accessible by
> end users).
> 
> Their existence takes up space and fragments the file system (each cache
> file is normally around 64K, and there is one such v1 file for every
> block group, so you can see how many small files that adds up to).
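> 
> As a rough way to see the scale, you can count the block groups; this is
> only a sketch (/dev/sdX is a placeholder, it assumes the dump-tree output
> format of recent btrfs-progs, and is best run on an unmounted filesystem):
> 
>   # btrfs inspect-internal dump-tree -t extent /dev/sdX | grep -c BLOCK_GROUP_ITEM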
> 
>> 
>> Why "clear" and not just "delete"?  Won't "deleting" the whole previous
>> space_cache files, blocks, whatever in the filesystem be faster than
>> doing whatever "clear" does?
> 
> Just bad naming, and probably from me.
> 
> Indeed "delete" would be more proper here.
> 
> And we're indeed deleting them in "btrfs check --clear-space-cache v1",
> that's also why it's so slow.
> 
> If you have 20T of used space, that would be around 20,000 block groups
> (data block groups are typically 1GiB each), meaning 20,000 64K files
> inside the root tree. They are deleted one by one, and each deletion
> causes a new transaction, so no wonder it is painfully slow.

If the v1 space cache exists as special files in the root tree, could those files be a contributing factor to the issue I was having in the past, where mounts took so long that the system dropped to recovery mode?

I discovered that running btrfs fi defrag on each of my subvolumes (for the subvolume tree/extent tree, not the files) every couple of weeks has reduced the mount time enough that I don’t run into that issue on my massive FS anymore.
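
For reference, the invocation is along these lines (the subvolume path is a
placeholder; note that without -r, defragment operates on the directory or
subvolume metadata rather than on file contents):

  # btrfs filesystem defragment /mnt/pool/mysubvol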

Could actually getting rid of those thousands of now-useless v1 entries help reduce mount times on such a massive FS, or would that be completely unrelated?

Thanks,
-Joshua

>> 
>> Am I missing out on something by not attempting to clear the previous
>> version space_cache before using the new v2 version?
> 
> Apart from some wasted space, you're completely fine skipping the slow
> deletion.
> 
> This also means I should enhance the deletion process to avoid creating
> so many transactions...
> 
> Thanks,
> Qu
> 
>> 
>> 
>>> On 16/7/21 3:51 am, Joshua wrote:
>>> Just as a point of data, I have a 96 TB array with RAID1 data, and
>>> RAID1C3 metadata.
>>> 
>>> I made the switch to space_cache=v2 some time ago, and I remember it
>>> made a huge difference when I did so!
>>> (It was RAID1 metadata back then, as RAID1C3 was not yet available.)
>>> 
>>> 
>>> However, I also tried a check with '--clear-space-cache v1' at the
>>> time, and after waiting a literal whole day without it completing, I
>>> gave up, canceled it, and put it back into production.  Is a
>>> --clear-space-cache v1 operation expected to take so long on such a
>>> large file system?
>>> 
>>> Thanks!
>>> --Joshua Villwock
>>> 
>>> 
>>> 
>>> July 15, 2021 9:40 AM, "DanglingPointer"
>>> <danglingpointerexception@gmail.com> wrote:
>>> 
>>>> Hi Qu,
>>>> 
>>>> Just updating here that setting the mount options "space_cache=v2" and
>>>> "noatime" completely SOLVED
>>>> the performance problem!
>>>> Basically like night and day!
>>>> 
>>>> These are my full fstab mount options...
>>>> 
>>>> btrfs defaults,autodefrag,space_cache=v2,noatime 0 2
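>>>> 
>>>> A complete fstab entry with those options would look something like this
>>>> (the UUID and mount point are placeholders):
>>>> 
>>>> UUID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx /mnt/array btrfs defaults,autodefrag,space_cache=v2,noatime 0 2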
>>>> 
>>>> Perhaps making space_cache=v2 the default should be considered?  Why
>>>> default to v1; what's the value of v1?
>>>> 
>>>> So in conclusion: for large multi-terabyte arrays (in my case RAID5s),
>>>> setting space_cache=v2 and noatime massively increases performance and
>>>> eliminates the long, frequent pauses caused by "btrfs-transacti"
>>>> blocking all IO.
>>>> 
>>>> Thanks Qu for your help!
>>>> 
>>>> On 14/7/21 5:45 pm, Qu Wenruo wrote:
>>>> 
>>>>> On 2021/7/14 3:18 PM, DanglingPointer wrote:
>>>>>> a) "echo l > /proc/sysrq-trigger"
>>>>>> 
>>>>>> Unfortunately the backup already finished today, and we are unlikely
>>>>>> to run it again until we get an outage window to remount the array with
>>>>>> the space_cache=v2 and noatime mount options.
>>>>>> Thanks for the command, we'll definitely use it if/when it happens
>>>>>> again
>>>>>> on the next large migration of data.
>>>>> Just to avoid confusion: after that command, the "dmesg" output is
>>>>> still needed, as that's where sysrq puts its output.
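>>>>> 
>>>>> Roughly (a sketch; this assumes sysrq is enabled, and the output file
>>>>> name is arbitrary):
>>>>> 
>>>>>   # echo l > /proc/sysrq-trigger
>>>>>   # dmesg > sysrq-backtraces.txt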
>>>>>> b) "sudo btrfs qgroup show -prce" ........
>>>>>> 
>>>>>> $ ERROR: can't list qgroups: quotas not enabled
>>>>>> 
>>>>>> So looks like it isn't enabled.
>>>>> One less thing to worry about.
>>>>>> File sizes are between: 1,048,576 bytes and 16,777,216 bytes
>>>>>> (Duplicacy
>>>>>> backup defaults)
>>>>> Between 1~16MiB, thus tons of small files.
>>>>> 
>>>>> Btrfs is not really good at handling tons of small files, as they
>>>>> generate a lot of metadata.
>>>>> 
>>>>> That may contribute to the hang.
>>>>> 
>>>>>> What classifies as a transaction?
>>>>> It's a little complex.
>>>>> 
>>>>> Technically it's a checkpoint: before the checkpoint, all you see is
>>>>> old data; after the checkpoint, all you see is new data.
>>>>> 
>>>>> To end users, any data and metadata write will be included in the
>>>>> current transaction (with proper dependencies handled).
>>>>> 
>>>>> One way to finish (or commit) the current transaction is to sync the
>>>>> fs, using the "sync" command (which syncs all filesystems).
>>>>> 
>>>>>> Any/All writes done in a 30sec
>>>>>> interval?
>>>>> This is the default commit interval. Almost all filesystems will try
>>>>> to commit their data/metadata to disk after a configurable interval.
>>>>> 
>>>>> The default is 30s. Reaching that interval is another way the current
>>>>> transaction gets committed.
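>>>>> 
>>>>> On btrfs the interval can be tuned with the "commit" mount option, e.g.
>>>>> (the mount point is a placeholder):
>>>>> 
>>>>>   # mount -o remount,commit=60 /mnt/array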
>>>>> 
>>>>>> If 100 unique files were written in 30secs, is that 1
>>>>>> transaction or 100 transactions?
>>>>> It depends, as things like syncfs() and subvolume/snapshot creation may
>>>>> trigger a transaction commit.
>>>>> 
>>>>> But without those special operations, just writing 100 unique files
>>>>> using buffered writes will only start one transaction, and when the 30s
>>>>> interval is hit, that transaction will be committed to disk.
>>>>> 
>>>>>> Millions of files of the size range
>>>>>> above were backed up.
>>>>> The number of files alone may not force a transaction commit, if it
>>>>> doesn't create enough memory pressure or free space pressure.
>>>>> 
>>>>> Anyway, the "echo l" sysrq would help us locate what's taking so long.
>>>>> 
>>>>>> c) "Just mount with "space_cache=v2""
>>>>>> 
>>>>>> Ok so no need to "clear_cache" the v1 cache, right?
>>>>> Yes, and "clear_cache" won't really remove all the v1 cache anyway.
>>>>> 
>>>>> Thus it doesn't help much.
>>>>> 
>>>>> The only way to fully clear the v1 cache is by running "btrfs check
>>>>> --clear-space-cache v1" on an *unmounted* btrfs.
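>>>>> 
>>>>> The full sequence would be roughly as follows (the device and mount
>>>>> point are placeholders):
>>>>> 
>>>>>   # umount /mnt/array
>>>>>   # btrfs check --clear-space-cache v1 /dev/sdX
>>>>>   # mount /dev/sdX /mnt/array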
>>>>> 
>>>>>> I put this in the fstab but haven't remounted yet; I'm waiting until I
>>>>>> can get an outage....
>>>>> IMHO if you really want to test whether v2 would help, you can just
>>>>> remount; no need to wait for an outage.
>>>>> 
>>>>> Thanks,
>>>>> Qu
>>>>>> ..."btrfs defaults,autodefrag,clear_cache,space_cache=v2,noatime
>>>>>> 0  2 >
>>>>>> Thanks again for your help Qu!
>>>>>> 
>>>>>> On 14/7/21 2:59 pm, Qu Wenruo wrote:
>>>>>>> On 2021/7/13 11:38 PM, DanglingPointer wrote:
>>>>>>>> We're currently considering switching to "space_cache=v2" with
>>>>>>>> noatime mount options for my lab server-workstations running RAID5.
>>>>>>> 
>>>>>>> Btrfs RAID5 is unsafe due to its write-hole problem.
>>>>>>> 
>>>>>>>> * One has 13TB of data/metadata in a bunch of 6TB and 2TB disks
>>>>>>>> totalling 26TB.
>>>>>>>> * Another has about 12TB data/metadata in uniformly sized 6TB disks
>>>>>>>> totalling 24TB.
>>>>>>>> * Both of the arrays are on individually LUKS-encrypted disks with
>>>>>>>> btrfs on top of the LUKS layer.
>>>>>>>> * Both have "defaults,autodefrag" turned on in fstab.
>>>>>>>> 
>>>>>>>> We're starting to see large pauses during constant backups of millions
>>>>>>>> of chunk files (using duplicacy backup) in the 24TB array.
>>>>>>>> 
>>>>>>>> Pauses sometimes take 20+ seconds and recur roughly every ~30secs
>>>>>>>> after the end of the last pause.  The "btrfs-transacti" process
>>>>>>>> consistently shows up as the blocking process/thread locking up
>>>>>>>> filesystem IO.  IO gets into the RAID5 array via nfsd.  There are no
>>>>>>>> disk or btrfs errors recorded.  The last scrub finished successfully
>>>>>>>> yesterday.
>>>>>>> 
>>>>>>> Please provide the "echo l > /proc/sysrq-trigger" output when such a
>>>>>>> pause happens.
>>>>>>> 
>>>>>>> If you're using qgroup (it may be enabled by things like snapper), it
>>>>>>> may be the cause, as qgroup does its accounting when committing a
>>>>>>> transaction.
>>>>>>> 
>>>>>>> If one transaction is super large, it can cause such a problem.
>>>>>>> 
>>>>>>> You can test whether qgroup is enabled with:
>>>>>>> 
>>>>>>> # btrfs qgroup show -prce <mnt>
>>>>>>> 
>>>>>>>> After doing some research around the internet, we've come to the
>>>>>>>> considerations described above.  Unfortunately the official
>>>>>>>> documentation isn't clear on the following.
>>>>>>>> 
>>>>>>>> Official documentation URL -
>>>>>>>> https://btrfs.wiki.kernel.org/index.php/Manpage/btrfs(5)
>>>>>>>> 
>>>>>>>> 1. How to migrate from the default space_cache=v1 to space_cache=v2?
>>>>>>>> It talks about the reverse, from v2 to v1!
>>>>>>> 
>>>>>>> Just mount with "space_cache=v2".
>>>>>>> 
>>>>>>>> 2. If we use space_cache=v2, is it indeed still the case that the
>>>>>>>> "btrfs" command will NOT work with the filesystem?
>>>>>>> 
>>>>>>> Why would you think "btrfs" won't work on a btrfs?
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> Qu
>>>>>>> 
>>>>>>>> So will our "btrfs scrub start /mount/point/..." cron jobs FAIL?  I'm
>>>>>>>> guessing the btrfs command comes from btrfs-progs, which is currently
>>>>>>>> v5.4.1-2 amd64, is that correct?
>>>>>>>> 3. Any other ideas on how we can get rid of those annoying pauses with
>>>>>>>> large backups into the array?
>>>>>>>> 
>>>>>>>> Thanks in advance!
>>>>>>>> 
>>>>>>>> DP
