* Salvaging the performance of a high-metadata filesystem
@ 2023-03-03 4:34 Matt Corallo
2023-03-03 5:22 ` Roman Mamedov
2023-03-05 9:36 ` Lukas Straub
0 siblings, 2 replies; 10+ messages in thread
From: Matt Corallo @ 2023-03-03 4:34 UTC (permalink / raw)
To: Btrfs BTRFS
I have a ~seven-year-old BTRFS filesystem whose performance has slowly degraded to unusability.
It's a mix of eight 6-16TB 7200 RPM NAS spinning-rust drives which have slowly been upgraded over
the years as drives failed. It was built back when raid1 was the only option, but metadata has
since been converted to raid1c3. That process took a month or two, but was relatively painless.
The problem is there's one folder that has backups of a workstation, which were done by `cp
--reflink=always`ing the previous backup and then rsync'ing over it. The latest backup has about
3 million files, so each backup folder hovers around that number, but there are fewer than 100 backups.
This has led to a lot of metadata:
Metadata,RAID1C3: Size:1.48TiB, Used:1.46TiB (98.73%)
Sufficiently slow that, when I tried to convert data from raid1 to raid1c3, I gave up about six
months in, once it was clear the finish date was still years out:
Data,RAID1: Size:21.13TiB, Used:21.07TiB (99.71%)
Data,RAID1C3: Size:5.94TiB, Used:5.46TiB (91.86%)
I recently started adding some I/O to the machine - a MB/s or two of writes from OpenStack
Swift, which has now racked up a million or three files itself (in a directory tree two layers of
~1000-folder directories deep). This has made the filesystem largely unusable.
The usual every-30-second commit takes upwards of ten minutes and locks the entire filesystem for
much of that commit time. The actual bandwidth of writes is trivially manageable, and if I set the
commit time to something absurd like an hour, the filesystem is very usable.
I assume there's not much to be done here - the volume needs to move off of BTRFS onto something
that can better handle this many files? The metadata-device-preference patches don't seem to be
making any progress (though from what I understand they would very likely solve this issue
trivially?).
Thanks,
Matt
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: Salvaging the performance of a high-metadata filesystem
2023-03-03 4:34 Salvaging the performance of a high-metadata filesystem Matt Corallo
@ 2023-03-03 5:22 ` Roman Mamedov
2023-03-03 9:30 ` Forza
2023-03-05 9:36 ` Lukas Straub
1 sibling, 1 reply; 10+ messages in thread
From: Roman Mamedov @ 2023-03-03 5:22 UTC (permalink / raw)
To: Matt Corallo; +Cc: Btrfs BTRFS
On Thu, 2 Mar 2023 20:34:27 -0800
Matt Corallo <blnxfsl@bluematt.me> wrote:
> The problem is there's one folder that has backups of workstation, which were done by `cp
> --reflink=always`ing the previous backup followed by rsync'ing over it.
I believe this is what might be causing the metadata inflation. Each time, cp
creates another whole copy of all 3 million files in the metadata, just
pointing to the old extents for the data.
Could you instead make this backup destination a subvolume, so that during each
backup you create a snapshot of it for historical storage, and then rsync over
the current version?
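Sketched out, that workflow would look something like this (the subvolume layout and paths
are hypothetical, and the commands need a real btrfs mount and root):

```shell
# Hypothetical layout: the live backup tree is the subvolume "current".
# A read-only snapshot is O(1) in metadata; nothing is duplicated until
# rsync actually rewrites a file in "current".
btrfs subvolume snapshot -r /bigraid/backups/current \
      /bigraid/backups/snap-$(date +%F)
rsync -a --delete workstation:/home/ /bigraid/backups/current/
```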
--
With respect,
Roman
* Re: Salvaging the performance of a high-metadata filesystem
2023-03-03 5:22 ` Roman Mamedov
@ 2023-03-03 9:30 ` Forza
2023-03-03 19:04 ` Matt Corallo
0 siblings, 1 reply; 10+ messages in thread
From: Forza @ 2023-03-03 9:30 UTC (permalink / raw)
To: Roman Mamedov, Matt Corallo; +Cc: Btrfs BTRFS
On 2023-03-03 06:22, Roman Mamedov wrote:
> On Thu, 2 Mar 2023 20:34:27 -0800
> Matt Corallo <blnxfsl@bluematt.me> wrote:
>
>> The problem is there's one folder that has backups of workstation, which were done by `cp
>> --reflink=always`ing the previous backup followed by rsync'ing over it.
>
> I believe this is what might cause the metadata inflation. Each time cp
> creates a whole another copy of all 3 million files in the metadata, just
> pointing to old extents for data.
>
> Could you instead make this backup destination a subvolume, so that during each
> backup you create a snapshot of it for historical storage, and then rsync over
> the current version?
>
I agree. If you make a snapshot of a subvolume, the additional metadata
is effectively zero. Then you rsync into the source subvolume, which
adds metadata only for the changed files.
Make sure you mount with `-o noatime` to prevent metadata updates when
rsync checks all the files.
Matt, what are the mount options for your filesystem (output of
`mount`)? Can you also provide the output of `btrfs fi us -T
/your/mountpoint`?
Forza
* Re: Salvaging the performance of a high-metadata filesystem
2023-03-03 9:30 ` Forza
@ 2023-03-03 19:04 ` Matt Corallo
2023-03-03 19:05 ` Matt Corallo
0 siblings, 1 reply; 10+ messages in thread
From: Matt Corallo @ 2023-03-03 19:04 UTC (permalink / raw)
To: Forza, Roman Mamedov; +Cc: Btrfs BTRFS
On 3/3/23 1:30 AM, Forza wrote:
>
>
> On 2023-03-03 06:22, Roman Mamedov wrote:
>> On Thu, 2 Mar 2023 20:34:27 -0800
>> Matt Corallo <blnxfsl@bluematt.me> wrote:
>>
>>> The problem is there's one folder that has backups of workstation, which were done by `cp
>>> --reflink=always`ing the previous backup followed by rsync'ing over it.
>>
>> I believe this is what might cause the metadata inflation. Each time cp
>> creates a whole another copy of all 3 million files in the metadata, just
>> pointing to old extents for data.
>>
>> Could you instead make this backup destination a subvolume, so that during each
>> backup you create a snapshot of it for historical storage, and then rsync over
>> the current version?
>>
>
> I agree. If you make a snapshot of a subvolume, the additional metadata is effectively 0. Then you
> rsync into the source subvolume. This would add metadata for all changed files,
Ah, good point, I hadn't considered that as an option, to be honest. I'll convert the backup
copies to subvolume snapshots and see how much metadata is reduced... may take a month or two to
run, though :/
> Make sure you use `mount -o noatime` to prevent metadata updates when rsync checks all files.
Ah, that's quite the footgun. Shame noatime was never made the default :(
> Matt, what are your mount options for your filesystem (output of `mount`). Can you also provide the
> output of `btrfs fi us -T /your/mountpoint`
Sure:
btrfs filesystem usage -T /bigraid
Overall:
Device size: 85.50TiB
Device allocated: 64.67TiB
Device unallocated: 20.83TiB
Device missing: 0.00B
Used: 63.03TiB
Free (estimated): 10.10TiB (min: 5.92TiB)
Free (statfs, df): 6.30TiB
Data ratio: 2.22
Metadata ratio: 3.00
Global reserve: 512.00MiB (used: 48.00KiB)
Multiple profiles: yes (data)
Data Data Metadata System
Id Path RAID1 RAID1C3 RAID1C3 RAID1C4 Unallocated
-- --------------------------- -------- --------- --------- -------- -----------
1 /dev/mapper/bigraid33_crypt 7.48TiB 3.73TiB 808.00GiB 32.00MiB 2.56TiB
2 /dev/mapper/bigraid36_crypt 6.22TiB 4.00GiB 689.00GiB - 2.20TiB
3 /dev/mapper/bigraid39_crypt 8.20TiB 3.36TiB 443.00GiB 32.00MiB 2.56TiB
4 /dev/mapper/bigraid37_crypt 3.64TiB 4.57TiB 152.00GiB 32.00MiB 2.56TiB
5 /dev/mapper/bigraid35_crypt 3.46TiB 367.00GiB 310.00GiB - 1.33TiB
6 /dev/mapper/bigraid38_crypt 3.71TiB 3.24TiB 1.40TiB 32.00MiB 2.56TiB
7 /dev/mapper/bigraid41_crypt 3.05TiB 25.00GiB 377.00GiB - 2.02TiB
8 /dev/mapper/bigraid20_crypt 6.66TiB 2.54TiB 322.00GiB - 5.03TiB
-- --------------------------- -------- --------- --------- -------- -----------
Total 21.21TiB 5.94TiB 1.48TiB 32.00MiB 20.83TiB
Used 21.14TiB 5.46TiB 1.46TiB 4.70MiB
* Re: Salvaging the performance of a high-metadata filesystem
2023-03-03 19:04 ` Matt Corallo
@ 2023-03-03 19:05 ` Matt Corallo
2023-03-04 8:24 ` Forza
0 siblings, 1 reply; 10+ messages in thread
From: Matt Corallo @ 2023-03-03 19:05 UTC (permalink / raw)
To: Forza, Roman Mamedov; +Cc: Btrfs BTRFS
On 3/3/23 11:04 AM, Matt Corallo wrote:
>
>
> On 3/3/23 1:30 AM, Forza wrote:
>>
>>
>> On 2023-03-03 06:22, Roman Mamedov wrote:
>>> On Thu, 2 Mar 2023 20:34:27 -0800
>>> Matt Corallo <blnxfsl@bluematt.me> wrote:
>>>
>>>> The problem is there's one folder that has backups of workstation, which were done by `cp
>>>> --reflink=always`ing the previous backup followed by rsync'ing over it.
>>>
>>> I believe this is what might cause the metadata inflation. Each time cp
>>> creates a whole another copy of all 3 million files in the metadata, just
>>> pointing to old extents for data.
>>>
>>> Could you instead make this backup destination a subvolume, so that during each
>>> backup you create a snapshot of it for historical storage, and then rsync over
>>> the current version?
>>>
>>
>> I agree. If you make a snapshot of a subvolume, the additional metadata is effectively 0. Then you
>> rsync into the source subvolume. This would add metadata for all changed files,
>
> Ah, good point, I hadn't considered that as an option, to be honest. I'll convert the snapshots to
> subvolumes and see how much metadata is reduced...may take a month or two to run, though :/
>
>> Make sure you use `mount -o noatime` to prevent metadata updates when rsync checks all files.
>
> Ah, that's quite the footgun. Shame noatime was never made default :(
>
>> Matt, what are your mount options for your filesystem (output of `mount`). Can you also provide
>> the output of `btrfs fi us -T /your/mountpoint`
Oops, sorry, mount options are default with a long commit:
/dev/mapper/bigraid33_crypt on /bigraid type btrfs
(rw,relatime,space_cache=v2,commit=3600,subvolid=5,subvol=/)
> Sure:
>
> btrfs filesystem usage -T /bigraid
> Overall:
> Device size: 85.50TiB
> Device allocated: 64.67TiB
> Device unallocated: 20.83TiB
> Device missing: 0.00B
> Used: 63.03TiB
> Free (estimated): 10.10TiB (min: 5.92TiB)
> Free (statfs, df): 6.30TiB
> Data ratio: 2.22
> Metadata ratio: 3.00
> Global reserve: 512.00MiB (used: 48.00KiB)
> Multiple profiles: yes (data)
>
> Data Data Metadata System
> Id Path RAID1 RAID1C3 RAID1C3 RAID1C4 Unallocated
> -- --------------------------- -------- --------- --------- -------- -----------
> 1 /dev/mapper/bigraid33_crypt 7.48TiB 3.73TiB 808.00GiB 32.00MiB 2.56TiB
> 2 /dev/mapper/bigraid36_crypt 6.22TiB 4.00GiB 689.00GiB - 2.20TiB
> 3 /dev/mapper/bigraid39_crypt 8.20TiB 3.36TiB 443.00GiB 32.00MiB 2.56TiB
> 4 /dev/mapper/bigraid37_crypt 3.64TiB 4.57TiB 152.00GiB 32.00MiB 2.56TiB
> 5 /dev/mapper/bigraid35_crypt 3.46TiB 367.00GiB 310.00GiB - 1.33TiB
> 6 /dev/mapper/bigraid38_crypt 3.71TiB 3.24TiB 1.40TiB 32.00MiB 2.56TiB
> 7 /dev/mapper/bigraid41_crypt 3.05TiB 25.00GiB 377.00GiB - 2.02TiB
> 8 /dev/mapper/bigraid20_crypt 6.66TiB 2.54TiB 322.00GiB - 5.03TiB
> -- --------------------------- -------- --------- --------- -------- -----------
> Total 21.21TiB 5.94TiB 1.48TiB 32.00MiB 20.83TiB
> Used 21.14TiB 5.46TiB 1.46TiB 4.70MiB
* Re: Salvaging the performance of a high-metadata filesystem
2023-03-03 19:05 ` Matt Corallo
@ 2023-03-04 8:24 ` Forza
2023-03-04 17:25 ` Goffredo Baroncelli
2023-03-05 1:22 ` Matt Corallo
0 siblings, 2 replies; 10+ messages in thread
From: Forza @ 2023-03-04 8:24 UTC (permalink / raw)
To: Matt Corallo, Roman Mamedov, Btrfs BTRFS
On 2023-03-03 20:05, Matt Corallo wrote:
>
>
> On 3/3/23 11:04 AM, Matt Corallo wrote:
>>
>>
>> On 3/3/23 1:30 AM, Forza wrote:
>>>
>>>
>>> On 2023-03-03 06:22, Roman Mamedov wrote:
>>>> On Thu, 2 Mar 2023 20:34:27 -0800
>>>> Matt Corallo <blnxfsl@bluematt.me> wrote:
>>>>
>>>>> The problem is there's one folder that has backups of workstation,
>>>>> which were done by `cp
>>>>> --reflink=always`ing the previous backup followed by rsync'ing over
>>>>> it.
>>>>
>>>> I believe this is what might cause the metadata inflation. Each time cp
>>>> creates a whole another copy of all 3 million files in the metadata,
>>>> just
>>>> pointing to old extents for data.
>>>>
>>>> Could you instead make this backup destination a subvolume, so that
>>>> during each
>>>> backup you create a snapshot of it for historical storage, and then
>>>> rsync over
>>>> the current version?
>>>>
>>>
>>> I agree. If you make a snapshot of a subvolume, the additional
>>> metadata is effectively 0. Then you rsync into the source subvolume.
>>> This would add metadata for all changed files,
>>
>> Ah, good point, I hadn't considered that as an option, to be honest.
>> I'll convert the snapshots to subvolumes and see how much metadata is
>> reduced...may take a month or two to run, though :/
>>
>>> Make sure you use `mount -o noatime` to prevent metadata updates when
>>> rsync checks all files.
>>
>> Ah, that's quite the footgun. Shame noatime was never made default :(
>>
>>> Matt, what are your mount options for your filesystem (output of
>>> `mount`). Can you also provide the output of `btrfs fi us -T
>>> /your/mountpoint`
>
> Oops, sorry, mount options are default with a long commit:
>
> /dev/mapper/bigraid33_crypt on /bigraid type btrfs
> (rw,relatime,space_cache=v2,commit=3600,subvolid=5,subvol=/)
Unless you need atimes, replace relatime with noatime. This makes a big
difference when you have lots of reflinks or snapshots, as it avoids
un-sharing of metadata when the atimes are updated.
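For example (a sketch; this applies at runtime only, and the option should also be added
to fstab to persist across reboots):

```shell
# Re-mount the filesystem from this thread with atime updates disabled.
mount -o remount,noatime /bigraid
```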
>
>> Sure:
>>
>> btrfs filesystem usage -T /bigraid
>> Overall:
>> Device size: 85.50TiB
>> Device allocated: 64.67TiB
>> Device unallocated: 20.83TiB
>> Device missing: 0.00B
>> Used: 63.03TiB
>> Free (estimated): 10.10TiB (min: 5.92TiB)
>> Free (statfs, df): 6.30TiB
>> Data ratio: 2.22
>> Metadata ratio: 3.00
>> Global reserve: 512.00MiB (used: 48.00KiB)
>> Multiple profiles: yes (data)
>>
>> Data Data Metadata System
>> Id Path RAID1 RAID1C3 RAID1C3 RAID1C4
>> Unallocated
>> -- --------------------------- -------- --------- --------- --------
>> -----------
>> 1 /dev/mapper/bigraid33_crypt 7.48TiB 3.73TiB 808.00GiB
>> 32.00MiB 2.56TiB
>> 2 /dev/mapper/bigraid36_crypt 6.22TiB 4.00GiB 689.00GiB
>> - 2.20TiB
>> 3 /dev/mapper/bigraid39_crypt 8.20TiB 3.36TiB 443.00GiB
>> 32.00MiB 2.56TiB
>> 4 /dev/mapper/bigraid37_crypt 3.64TiB 4.57TiB 152.00GiB
>> 32.00MiB 2.56TiB
>> 5 /dev/mapper/bigraid35_crypt 3.46TiB 367.00GiB 310.00GiB
>> - 1.33TiB
>> 6 /dev/mapper/bigraid38_crypt 3.71TiB 3.24TiB 1.40TiB
>> 32.00MiB 2.56TiB
>> 7 /dev/mapper/bigraid41_crypt 3.05TiB 25.00GiB 377.00GiB
>> - 2.02TiB
>> 8 /dev/mapper/bigraid20_crypt 6.66TiB 2.54TiB 322.00GiB
>> - 5.03TiB
>> -- --------------------------- -------- --------- --------- --------
>> -----------
>> Total 21.21TiB 5.94TiB 1.48TiB
>> 32.00MiB 20.83TiB
>> Used 21.14TiB 5.46TiB 1.46TiB 4.70MiB
Not sure if running with multiple profiles will cause issues or
slowness, but it might be good to try to convert the old raid1c3 data
chunks back into raid1 over time. You can use balance filters to
minimise the work done in each run.
# btrfs balance start -dconvert=raid1,soft,limit=10 /bigraid
This avoids balancing block groups already in RAID1 (the `soft` option) and
limits each run to 10 block groups, so you can schedule it during
times with less active I/O.
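One way to drive that incrementally (a sketch, not a tested script - in particular the
parsing of the balance output is an assumption about its "Done, had to relocate N out of
M chunks" summary line):

```shell
#!/bin/sh
# Convert up to 10 raid1c3 data block groups back to raid1 per pass,
# stopping once a pass finds nothing left to relocate.
while :; do
      out=$(btrfs balance start -dconvert=raid1,soft,limit=10 /bigraid)
      echo "$out"
      # "Done, had to relocate 0 out of N chunks" => conversion finished.
      case "$out" in *"relocate 0 out of"*) break ;; esac
      sleep 60   # give foreground I/O some breathing room between passes
done
```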
It is also possible to defragment the subvolume and extent trees[*].
This could help a little, though if the filesystem changes frequently
the benefit might only be temporary. It can also take a long time to
complete.
# btrfs filesystem defragment /path/to/subvolume-root
[*]
https://wiki.tnonline.net/w/Btrfs/Defrag#Defragmenting_the_subvolume_and_extent_trees
* Re: Salvaging the performance of a high-metadata filesystem
2023-03-04 8:24 ` Forza
@ 2023-03-04 17:25 ` Goffredo Baroncelli
2023-03-05 1:22 ` Matt Corallo
1 sibling, 0 replies; 10+ messages in thread
From: Goffredo Baroncelli @ 2023-03-04 17:25 UTC (permalink / raw)
To: Forza, Matt Corallo, Roman Mamedov, Btrfs BTRFS
On 04/03/2023 09.24, Forza wrote:
>
>
> On 2023-03-03 20:05, Matt Corallo wrote:
>>> btrfs filesystem usage -T /bigraid
>>> Overall:
>>> Device size: 85.50TiB
>>> Device allocated: 64.67TiB
>>> Device unallocated: 20.83TiB
>>> Device missing: 0.00B
>>> Used: 63.03TiB
>>> Free (estimated): 10.10TiB (min: 5.92TiB)
>>> Free (statfs, df): 6.30TiB
>>> Data ratio: 2.22
>>> Metadata ratio: 3.00
>>> Global reserve: 512.00MiB (used: 48.00KiB)
>>> Multiple profiles: yes (data)
>>>
>>> Data Data Metadata System
>>> Id Path RAID1 RAID1C3 RAID1C3 RAID1C4 Unallocated
>>> -- --------------------------- -------- --------- --------- -------- -----------
>>> 1 /dev/mapper/bigraid33_crypt 7.48TiB 3.73TiB 808.00GiB 32.00MiB 2.56TiB
>>> 2 /dev/mapper/bigraid36_crypt 6.22TiB 4.00GiB 689.00GiB - 2.20TiB
>>> 3 /dev/mapper/bigraid39_crypt 8.20TiB 3.36TiB 443.00GiB 32.00MiB 2.56TiB
>>> 4 /dev/mapper/bigraid37_crypt 3.64TiB 4.57TiB 152.00GiB 32.00MiB 2.56TiB
>>> 5 /dev/mapper/bigraid35_crypt 3.46TiB 367.00GiB 310.00GiB - 1.33TiB
>>> 6 /dev/mapper/bigraid38_crypt 3.71TiB 3.24TiB 1.40TiB 32.00MiB 2.56TiB
>>> 7 /dev/mapper/bigraid41_crypt 3.05TiB 25.00GiB 377.00GiB - 2.02TiB
>>> 8 /dev/mapper/bigraid20_crypt 6.66TiB 2.54TiB 322.00GiB - 5.03TiB
>>> -- --------------------------- -------- --------- --------- -------- -----------
>>> Total 21.21TiB 5.94TiB 1.48TiB 32.00MiB 20.83TiB
>>> Used 21.14TiB 5.46TiB 1.46TiB 4.70MiB
>
> Not sure if running with multiple profiles will cause issues or slowness, but it might be good to try to convert the old raid1c3 data chunks into raid1 over time. You can use balance filters to minimise the work each run.
>
> # btrfs balance start -dconvert=raid1,soft,limit=10 /bigraid
If I remember correctly, BTRFS considers the highest-redundancy profile the default one, so having both raid1c3 and raid1 means that new data is written as raid1c3.
--
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
* Re: Salvaging the performance of a high-metadata filesystem
2023-03-04 8:24 ` Forza
2023-03-04 17:25 ` Goffredo Baroncelli
@ 2023-03-05 1:22 ` Matt Corallo
2023-03-05 8:23 ` Forza
1 sibling, 1 reply; 10+ messages in thread
From: Matt Corallo @ 2023-03-05 1:22 UTC (permalink / raw)
To: Forza, Roman Mamedov, Btrfs BTRFS
On 3/4/23 12:24 AM, Forza wrote:
> Unless you need to, replace relatime with noatime. This makes a big difference when you have lots of
> reflinks or snapshots as it avoids de-duplication of metadata when the atimes are updated.
Yeah, I've done that now, thanks. I'm vaguely surprised that this big a footgun is the default
and isn't called out much more aggressively in the btrfs manpages, at least.
> Not sure if running with multiple profiles will cause issues or slowness, but it might be good to
> try to convert the old raid1c3 data chunks into raid1 over time. You can use balance filters to
> minimise the work each run.
I don't think that's really an option. It took something like six months to a year to get as much
raid1c3 as there is, and the filesystem has slowed down considerably since. Trying to rate-limit
the conversion back just means it'll take forever instead.
> # btrfs balance start -dconvert=raid1,soft,limit=10 /bigraid
>
> This will avoid balancing blockgroups already in RAID1 (soft option) and limit to only balance 10
> block groups. You can then schedule this during times with less active I/O.
>
> It is also possible to defragment the subvolume and extent trees[*]. This could help a little,
> though if the filesystem is frequently changing it might only be a temporary thing. It can also take
> a long time to complete.
IIUC that can un-share the metadata between subvolumes though, no? Which is a big part of the
(presumed) problem currently.
> # btrfs filesystem defragment /path/to/subvolume-root
>
> [*] https://wiki.tnonline.net/w/Btrfs/Defrag#Defragmenting_the_subvolume_and_extent_trees
>
* Re: Salvaging the performance of a high-metadata filesystem
2023-03-05 1:22 ` Matt Corallo
@ 2023-03-05 8:23 ` Forza
0 siblings, 0 replies; 10+ messages in thread
From: Forza @ 2023-03-05 8:23 UTC (permalink / raw)
To: Matt Corallo, Roman Mamedov, Btrfs BTRFS
On 2023-03-05 02:22, Matt Corallo wrote:
>
>
> On 3/4/23 12:24 AM, Forza wrote:
>> Unless you need to, replace relatime with noatime. This makes a big
>> difference when you have lots of reflinks or snapshots as it avoids
>> de-duplication of metadata when the atimes are updated.
>
> Yea, I've done that now, thanks. I'm vaguely surprised this big a
> footgun is the default, and not much more aggressively in the subvolume
> manpage, at least.
It is the Linux default, AFAIK, and many distros don't want to change it. Some
(very few) programs do use atimes, which is why relatime is still the
default. But now you are aware, and it should start improving the
situation for you.
>
>> Not sure if running with multiple profiles will cause issues or
>> slowness, but it might be good to try to convert the old raid1c3 data
>> chunks into raid1 over time. You can use balance filters to minimise
>> the work each run.
>
> I don't think that's really an option. It took something like six months
> or a year to get as much raid1c3 as there is, and the filesystem has
> slowed down considerably since. Trying to rate-limit going back just
> means it'll take forever instead.
Your current metadata allocation is ~7% of the filesystem. On HDDs this
is going to be slow no matter what you do. But if you can change your
`cp --reflink` into `btrfs sub snap src dst` and rsync into `src`
instead, it could perhaps reduce the amount of metadata over time. How
many of the files that you back up change on each backup?
>
>> # btrfs balance start -dconvert=raid1,soft,limit=10 /bigraid
>>
>> This will avoid balancing blockgroups already in RAID1 (soft option)
>> and limit to only balance 10 block groups. You can then schedule this
>> during times with less active I/O.
>>
>> It is also possible to defragment the subvolume and extent trees[*].
>> This could help a little, though if the filesystem is frequently
>> changing it might only be a temporary thing. It can also take a long
>> time to complete.
>
> IIUC that can de-share the metadata from subvolumes though, no? Which is
> a big part of the (presumed) problem currently.
It can, but it also reduces metadata seeks, which could be an
improvement. Since this could take a long time, though, maybe it is
something to try another time.
>
>> # btrfs filesystem defragment /path/to/subvolume-root
>>
>> [*]
>> https://wiki.tnonline.net/w/Btrfs/Defrag#Defragmenting_the_subvolume_and_extent_trees
>>
What I/O scheduler do you use? Have you tried different schedulers to see
if they make any difference - for example mq-deadline, kyber, and BFQ?
BFQ is sometimes friendlier to HDDs than the others, but it seems to
vary greatly depending on the use-case.
* Re: Salvaging the performance of a high-metadata filesystem
2023-03-03 4:34 Salvaging the performance of a high-metadata filesystem Matt Corallo
2023-03-03 5:22 ` Roman Mamedov
@ 2023-03-05 9:36 ` Lukas Straub
1 sibling, 0 replies; 10+ messages in thread
From: Lukas Straub @ 2023-03-05 9:36 UTC (permalink / raw)
To: Matt Corallo; +Cc: Btrfs BTRFS
On Thu, 2 Mar 2023 20:34:27 -0800
Matt Corallo <blnxfsl@bluematt.me> wrote:
> I have a ~seven year old BTRFS filesystem who's performance has slowly degraded to unusability.
>
> ...
>
> This has led to a lot of metadata:
> Metadata,RAID1C3: Size:1.48TiB, Used:1.46TiB (98.73%)
>
> ...
>
> I recently started adding some I/O to the machine, writing 1MB/s or two of writes from openstack
> swift, which has now racked up a million or three files itself (in a directory tree two layers of
> ~1000-folder directories deep). This has made the filesystem largely unusable.
>
> ...
>
> Thanks,
> Matt
Hi,
I suspect lots of inline files are bloating your metadata, especially
from openstack swift, given that each object is stored as its own file:
https://docs.openstack.org/swift/latest/overview_architecture.html#object-server
By default, btrfs will store all files smaller than 2048 bytes inline
(i.e. directly in the metadata). You can change that with the
"max_inline" mount option.
You can count the number of inline files with something like:
find /mnt/hdd -type f -print0 | xargs -0 filefrag -v | grep inline | wc -l
Regards,
Lukas Straub
Thread overview: 10+ messages
2023-03-03 4:34 Salvaging the performance of a high-metadata filesystem Matt Corallo
2023-03-03 5:22 ` Roman Mamedov
2023-03-03 9:30 ` Forza
2023-03-03 19:04 ` Matt Corallo
2023-03-03 19:05 ` Matt Corallo
2023-03-04 8:24 ` Forza
2023-03-04 17:25 ` Goffredo Baroncelli
2023-03-05 1:22 ` Matt Corallo
2023-03-05 8:23 ` Forza
2023-03-05 9:36 ` Lukas Straub