* Salvaging the performance of a high-metadata filesystem
@ 2023-03-03  4:34 Matt Corallo
  2023-03-03  5:22 ` Roman Mamedov
  2023-03-05  9:36 ` Lukas Straub
  0 siblings, 2 replies; 10+ messages in thread
From: Matt Corallo @ 2023-03-03  4:34 UTC (permalink / raw)
  To: Btrfs BTRFS

I have a ~seven year old BTRFS filesystem whose performance has slowly degraded to unusability.

It's a mix of eight 6-16TB 7200 RPM NAS spinning-rust drives, slowly upgraded over the years as 
drives failed. It was built back when raid1 was the only option, but metadata has since been 
converted to raid1c3. That process took a month or two, but was relatively painless.

The problem is there's one folder that has backups of a workstation, which were done by `cp 
--reflink=always`ing the previous backup followed by rsync'ing over it. The latest backup has about 
3 million files, and each backup folder holds roughly that many, but there are fewer than 100 backups.
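
Roughly, each backup was created along these lines (paths here are just illustrative):

# cp -a --reflink=always /bigraid/backups/2023-02 /bigraid/backups/2023-03
# rsync -a --delete workstation:/home/ /bigraid/backups/2023-03/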

This has led to a lot of metadata:
Metadata,RAID1C3: Size:1.48TiB, Used:1.46TiB (98.73%)

Sufficiently slow that when I tried to convert data to raid1c3 from raid1 I gave up about six months 
in when it was clear the finish date was still years out:
Data,RAID1: Size:21.13TiB, Used:21.07TiB (99.71%)
Data,RAID1C3: Size:5.94TiB, Used:5.46TiB (91.86%)

I recently started adding some I/O to the machine, writing 1-2 MB/s from OpenStack Swift, which has 
now racked up a few million files itself (in a directory tree two layers of ~1000-folder 
directories deep). This has made the filesystem largely unusable.

The usual every-30-second commit takes upwards of ten minutes and locks the entire filesystem for 
much of that commit time. The actual bandwidth of writes is trivially manageable, and if I set the 
commit time to something absurd like an hour, the filesystem is very usable.
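
For reference, that stop-gap is just a remount with a longer commit interval (in seconds):

# mount -o remount,commit=3600 /bigraid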

I assume there's not much to be done here - the volume needs to move off of BTRFS onto something 
that can better handle lots of files? The metadata-device-preference patches don't seem to be making 
any progress (but from what I understand they would very likely trivially solve this issue?).


Thanks,
Matt

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Salvaging the performance of a high-metadata filesystem
  2023-03-03  4:34 Salvaging the performance of a high-metadata filesystem Matt Corallo
@ 2023-03-03  5:22 ` Roman Mamedov
  2023-03-03  9:30   ` Forza
  2023-03-05  9:36 ` Lukas Straub
  1 sibling, 1 reply; 10+ messages in thread
From: Roman Mamedov @ 2023-03-03  5:22 UTC (permalink / raw)
  To: Matt Corallo; +Cc: Btrfs BTRFS

On Thu, 2 Mar 2023 20:34:27 -0800
Matt Corallo <blnxfsl@bluematt.me> wrote:

> The problem is there's one folder that has backups of workstation, which were done by `cp 
> --reflink=always`ing the previous backup followed by rsync'ing over it.

I believe this is what might cause the metadata inflation. Each time, cp
creates another whole copy of the metadata for all 3 million files, just
pointing to the old extents for data.

Could you instead make this backup destination a subvolume, so that during each
backup you create a snapshot of it for historical storage, and then rsync over
the current version?
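
Something along these lines, with the paths just as an example:

# btrfs subvolume snapshot -r /bigraid/backups/current /bigraid/backups/$(date +%F)
# rsync -a --delete workstation:/home/ /bigraid/backups/current/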

-- 
With respect,
Roman

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Salvaging the performance of a high-metadata filesystem
  2023-03-03  5:22 ` Roman Mamedov
@ 2023-03-03  9:30   ` Forza
  2023-03-03 19:04     ` Matt Corallo
  0 siblings, 1 reply; 10+ messages in thread
From: Forza @ 2023-03-03  9:30 UTC (permalink / raw)
  To: Roman Mamedov, Matt Corallo; +Cc: Btrfs BTRFS



On 2023-03-03 06:22, Roman Mamedov wrote:
> On Thu, 2 Mar 2023 20:34:27 -0800
> Matt Corallo <blnxfsl@bluematt.me> wrote:
> 
>> The problem is there's one folder that has backups of workstation, which were done by `cp
>> --reflink=always`ing the previous backup followed by rsync'ing over it.
> 
> I believe this is what might cause the metadata inflation. Each time cp
> creates a whole another copy of all 3 million files in the metadata, just
> pointing to old extents for data.
> 
> Could you instead make this backup destination a subvolume, so that during each
> backup you create a snapshot of it for historical storage, and then rsync over
> the current version?
> 

I agree. If you make a snapshot of a subvolume, the additional metadata 
is effectively 0. Then you rsync into the source subvolume. This would 
only add metadata for the changed files.

Make sure you use `mount -o noatime` to prevent metadata updates when 
rsync checks all files.
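
For example, it can be changed on the fly with a remount:

# mount -o remount,noatime /your/mountpoint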

Matt, what are the mount options for your filesystem (output of 
`mount`)? Can you also provide the output of `btrfs fi us -T 
/your/mountpoint`?

Forza

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Salvaging the performance of a high-metadata filesystem
  2023-03-03  9:30   ` Forza
@ 2023-03-03 19:04     ` Matt Corallo
  2023-03-03 19:05       ` Matt Corallo
  0 siblings, 1 reply; 10+ messages in thread
From: Matt Corallo @ 2023-03-03 19:04 UTC (permalink / raw)
  To: Forza, Roman Mamedov; +Cc: Btrfs BTRFS



On 3/3/23 1:30 AM, Forza wrote:
> 
> 
> On 2023-03-03 06:22, Roman Mamedov wrote:
>> On Thu, 2 Mar 2023 20:34:27 -0800
>> Matt Corallo <blnxfsl@bluematt.me> wrote:
>>
>>> The problem is there's one folder that has backups of workstation, which were done by `cp
>>> --reflink=always`ing the previous backup followed by rsync'ing over it.
>>
>> I believe this is what might cause the metadata inflation. Each time cp
>> creates a whole another copy of all 3 million files in the metadata, just
>> pointing to old extents for data.
>>
>> Could you instead make this backup destination a subvolume, so that during each
>> backup you create a snapshot of it for historical storage, and then rsync over
>> the current version?
>>
> 
> I agree. If you make a snapshot of a subvolume, the additional metadata is effectively 0. Then you 
> rsync into the source subvolume. This would add metadata for all changed files,

Ah, good point, I hadn't considered that as an option, to be honest. I'll convert the existing backup 
copies into subvolume snapshots and see how much metadata is reduced...may take a month or two to 
run, though :/

> Make sure you use `mount -o noatime` to prevent metadata updates when rsync checks all files.

Ah, that's quite the footgun. Shame noatime was never made default :(

> Matt, what are your mount options for your filesystem (output of `mount`). Can you also provide the 
> output of `btrfs fi us -T /your/mountpoint`

Sure:

btrfs filesystem usage -T /bigraid
Overall:
     Device size:		 85.50TiB
     Device allocated:		 64.67TiB
     Device unallocated:		 20.83TiB
     Device missing:		    0.00B
     Used:			 63.03TiB
     Free (estimated):		 10.10TiB	(min: 5.92TiB)
     Free (statfs, df):		  6.30TiB
     Data ratio:			     2.22
     Metadata ratio:		     3.00
     Global reserve:		512.00MiB	(used: 48.00KiB)
     Multiple profiles:		      yes	(data)

                                Data     Data      Metadata  System
Id Path                        RAID1    RAID1C3   RAID1C3   RAID1C4  Unallocated
-- --------------------------- -------- --------- --------- -------- -----------
  1 /dev/mapper/bigraid33_crypt  7.48TiB   3.73TiB 808.00GiB 32.00MiB     2.56TiB
  2 /dev/mapper/bigraid36_crypt  6.22TiB   4.00GiB 689.00GiB        -     2.20TiB
  3 /dev/mapper/bigraid39_crypt  8.20TiB   3.36TiB 443.00GiB 32.00MiB     2.56TiB
  4 /dev/mapper/bigraid37_crypt  3.64TiB   4.57TiB 152.00GiB 32.00MiB     2.56TiB
  5 /dev/mapper/bigraid35_crypt  3.46TiB 367.00GiB 310.00GiB        -     1.33TiB
  6 /dev/mapper/bigraid38_crypt  3.71TiB   3.24TiB   1.40TiB 32.00MiB     2.56TiB
  7 /dev/mapper/bigraid41_crypt  3.05TiB  25.00GiB 377.00GiB        -     2.02TiB
  8 /dev/mapper/bigraid20_crypt  6.66TiB   2.54TiB 322.00GiB        -     5.03TiB
-- --------------------------- -------- --------- --------- -------- -----------
    Total                       21.21TiB   5.94TiB   1.48TiB 32.00MiB    20.83TiB
    Used                        21.14TiB   5.46TiB   1.46TiB  4.70MiB

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Salvaging the performance of a high-metadata filesystem
  2023-03-03 19:04     ` Matt Corallo
@ 2023-03-03 19:05       ` Matt Corallo
  2023-03-04  8:24         ` Forza
  0 siblings, 1 reply; 10+ messages in thread
From: Matt Corallo @ 2023-03-03 19:05 UTC (permalink / raw)
  To: Forza, Roman Mamedov; +Cc: Btrfs BTRFS



On 3/3/23 11:04 AM, Matt Corallo wrote:
> 
> 
> On 3/3/23 1:30 AM, Forza wrote:
>>
>>
>> On 2023-03-03 06:22, Roman Mamedov wrote:
>>> On Thu, 2 Mar 2023 20:34:27 -0800
>>> Matt Corallo <blnxfsl@bluematt.me> wrote:
>>>
>>>> The problem is there's one folder that has backups of workstation, which were done by `cp
>>>> --reflink=always`ing the previous backup followed by rsync'ing over it.
>>>
>>> I believe this is what might cause the metadata inflation. Each time cp
>>> creates a whole another copy of all 3 million files in the metadata, just
>>> pointing to old extents for data.
>>>
>>> Could you instead make this backup destination a subvolume, so that during each
>>> backup you create a snapshot of it for historical storage, and then rsync over
>>> the current version?
>>>
>>
>> I agree. If you make a snapshot of a subvolume, the additional metadata is effectively 0. Then you 
>> rsync into the source subvolume. This would add metadata for all changed files,
> 
> Ah, good point, I hadn't considered that as an option, to be honest. I'll convert the snapshots to 
> subvolumes and see how much metadata is reduced...may take a month or two to run, though :/
> 
>> Make sure you use `mount -o noatime` to prevent metadata updates when rsync checks all files.
> 
> Ah, that's quite the footgun. Shame noatime was never made default :(
> 
>> Matt, what are your mount options for your filesystem (output of `mount`). Can you also provide 
>> the output of `btrfs fi us -T /your/mountpoint`

Oops, sorry, mount options are default with a long commit:

/dev/mapper/bigraid33_crypt on /bigraid type btrfs 
(rw,relatime,space_cache=v2,commit=3600,subvolid=5,subvol=/)

> Sure:
> 
> btrfs filesystem usage -T /bigraid
> Overall:
>      Device size:         85.50TiB
>      Device allocated:         64.67TiB
>      Device unallocated:         20.83TiB
>      Device missing:            0.00B
>      Used:             63.03TiB
>      Free (estimated):         10.10TiB    (min: 5.92TiB)
>      Free (statfs, df):          6.30TiB
>      Data ratio:                 2.22
>      Metadata ratio:             3.00
>      Global reserve:        512.00MiB    (used: 48.00KiB)
>      Multiple profiles:              yes    (data)
> 
>                                 Data     Data      Metadata  System
> Id Path                        RAID1    RAID1C3   RAID1C3   RAID1C4  Unallocated
> -- --------------------------- -------- --------- --------- -------- -----------
>   1 /dev/mapper/bigraid33_crypt  7.48TiB   3.73TiB 808.00GiB 32.00MiB     2.56TiB
>   2 /dev/mapper/bigraid36_crypt  6.22TiB   4.00GiB 689.00GiB        -     2.20TiB
>   3 /dev/mapper/bigraid39_crypt  8.20TiB   3.36TiB 443.00GiB 32.00MiB     2.56TiB
>   4 /dev/mapper/bigraid37_crypt  3.64TiB   4.57TiB 152.00GiB 32.00MiB     2.56TiB
>   5 /dev/mapper/bigraid35_crypt  3.46TiB 367.00GiB 310.00GiB        -     1.33TiB
>   6 /dev/mapper/bigraid38_crypt  3.71TiB   3.24TiB   1.40TiB 32.00MiB     2.56TiB
>   7 /dev/mapper/bigraid41_crypt  3.05TiB  25.00GiB 377.00GiB        -     2.02TiB
>   8 /dev/mapper/bigraid20_crypt  6.66TiB   2.54TiB 322.00GiB        -     5.03TiB
> -- --------------------------- -------- --------- --------- -------- -----------
>     Total                       21.21TiB   5.94TiB   1.48TiB 32.00MiB    20.83TiB
>     Used                        21.14TiB   5.46TiB   1.46TiB  4.70MiB

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Salvaging the performance of a high-metadata filesystem
  2023-03-03 19:05       ` Matt Corallo
@ 2023-03-04  8:24         ` Forza
  2023-03-04 17:25           ` Goffredo Baroncelli
  2023-03-05  1:22           ` Matt Corallo
  0 siblings, 2 replies; 10+ messages in thread
From: Forza @ 2023-03-04  8:24 UTC (permalink / raw)
  To: Matt Corallo, Roman Mamedov, Btrfs BTRFS



On 2023-03-03 20:05, Matt Corallo wrote:
> 
> 
> On 3/3/23 11:04 AM, Matt Corallo wrote:
>>
>>
>> On 3/3/23 1:30 AM, Forza wrote:
>>>
>>>
>>> On 2023-03-03 06:22, Roman Mamedov wrote:
>>>> On Thu, 2 Mar 2023 20:34:27 -0800
>>>> Matt Corallo <blnxfsl@bluematt.me> wrote:
>>>>
>>>>> The problem is there's one folder that has backups of workstation, 
>>>>> which were done by `cp
>>>>> --reflink=always`ing the previous backup followed by rsync'ing over 
>>>>> it.
>>>>
>>>> I believe this is what might cause the metadata inflation. Each time cp
>>>> creates a whole another copy of all 3 million files in the metadata, 
>>>> just
>>>> pointing to old extents for data.
>>>>
>>>> Could you instead make this backup destination a subvolume, so that 
>>>> during each
>>>> backup you create a snapshot of it for historical storage, and then 
>>>> rsync over
>>>> the current version?
>>>>
>>>
>>> I agree. If you make a snapshot of a subvolume, the additional 
>>> metadata is effectively 0. Then you rsync into the source subvolume. 
>>> This would add metadata for all changed files,
>>
>> Ah, good point, I hadn't considered that as an option, to be honest. 
>> I'll convert the snapshots to subvolumes and see how much metadata is 
>> reduced...may take a month or two to run, though :/
>>
>>> Make sure you use `mount -o noatime` to prevent metadata updates when 
>>> rsync checks all files.
>>
>> Ah, that's quite the footgun. Shame noatime was never made default :(
>>
>>> Matt, what are your mount options for your filesystem (output of 
>>> `mount`). Can you also provide the output of `btrfs fi us -T 
>>> /your/mountpoint`
> 
> Oops, sorry, mount options are default with a long commit:
> 
> /dev/mapper/bigraid33_crypt on /bigraid type btrfs 
> (rw,relatime,space_cache=v2,commit=3600,subvolid=5,subvol=/)

Unless you actually need atimes, replace relatime with noatime. This makes a big 
difference when you have lots of reflinks or snapshots, as it avoids 
un-sharing (and thus duplicating) metadata whenever the atimes are updated.
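
To make it stick across reboots, something like this fstab entry (the UUID is a placeholder; 
keep whatever other options you already use):

UUID=<your-fs-uuid>  /bigraid  btrfs  noatime,commit=3600  0  0
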
> 
>> Sure:
>>
>> btrfs filesystem usage -T /bigraid
>> Overall:
>>      Device size:         85.50TiB
>>      Device allocated:         64.67TiB
>>      Device unallocated:         20.83TiB
>>      Device missing:            0.00B
>>      Used:             63.03TiB
>>      Free (estimated):         10.10TiB    (min: 5.92TiB)
>>      Free (statfs, df):          6.30TiB
>>      Data ratio:                 2.22
>>      Metadata ratio:             3.00
>>      Global reserve:        512.00MiB    (used: 48.00KiB)
>>      Multiple profiles:              yes    (data)
>>
>>                                 Data     Data      Metadata  System
>> Id Path                        RAID1    RAID1C3   RAID1C3   RAID1C4  Unallocated
>> -- --------------------------- -------- --------- --------- -------- -----------
>>   1 /dev/mapper/bigraid33_crypt  7.48TiB   3.73TiB 808.00GiB 32.00MiB     2.56TiB
>>   2 /dev/mapper/bigraid36_crypt  6.22TiB   4.00GiB 689.00GiB        -     2.20TiB
>>   3 /dev/mapper/bigraid39_crypt  8.20TiB   3.36TiB 443.00GiB 32.00MiB     2.56TiB
>>   4 /dev/mapper/bigraid37_crypt  3.64TiB   4.57TiB 152.00GiB 32.00MiB     2.56TiB
>>   5 /dev/mapper/bigraid35_crypt  3.46TiB 367.00GiB 310.00GiB        -     1.33TiB
>>   6 /dev/mapper/bigraid38_crypt  3.71TiB   3.24TiB   1.40TiB 32.00MiB     2.56TiB
>>   7 /dev/mapper/bigraid41_crypt  3.05TiB  25.00GiB 377.00GiB        -     2.02TiB
>>   8 /dev/mapper/bigraid20_crypt  6.66TiB   2.54TiB 322.00GiB        -     5.03TiB
>> -- --------------------------- -------- --------- --------- -------- -----------
>>     Total                       21.21TiB   5.94TiB   1.48TiB 32.00MiB    20.83TiB
>>     Used                        21.14TiB   5.46TiB   1.46TiB  4.70MiB

Not sure if running with multiple profiles will cause issues or 
slowness, but it might be good to try to convert the old raid1c3 data 
chunks into raid1 over time. You can use balance filters to minimise the 
work each run.

# btrfs balance start -dconvert=raid1,soft,limit=10 /bigraid

This will skip block groups already in RAID1 (the soft option) and 
limit each run to 10 block groups. You can then schedule runs during 
times with less active I/O.
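
You can check whether a run is still active and how far it got with:

# btrfs balance status /bigraid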

It is also possible to defragment the subvolume and extent trees[*]. 
This could help a little, though if the filesystem is frequently 
changing it might only be a temporary thing. It can also take a long 
time to complete.

# btrfs filesystem defragment /path/to/subvolume-root

[*] 
https://wiki.tnonline.net/w/Btrfs/Defrag#Defragmenting_the_subvolume_and_extent_trees
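
A rough sketch for walking all subvolumes from the top-level mount (assuming subvol=/ as in 
your mount output; subvolume paths containing spaces would need more care):

# btrfs subvolume list /bigraid | awk '{print $NF}' | while read -r p; do
      btrfs filesystem defragment "/bigraid/$p"
  done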


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Salvaging the performance of a high-metadata filesystem
  2023-03-04  8:24         ` Forza
@ 2023-03-04 17:25           ` Goffredo Baroncelli
  2023-03-05  1:22           ` Matt Corallo
  1 sibling, 0 replies; 10+ messages in thread
From: Goffredo Baroncelli @ 2023-03-04 17:25 UTC (permalink / raw)
  To: Forza, Matt Corallo, Roman Mamedov, Btrfs BTRFS

On 04/03/2023 09.24, Forza wrote:
> 
> 
> On 2023-03-03 20:05, Matt Corallo wrote:

>>> btrfs filesystem usage -T /bigraid
>>> Overall:
>>>      Device size:         85.50TiB
>>>      Device allocated:         64.67TiB
>>>      Device unallocated:         20.83TiB
>>>      Device missing:            0.00B
>>>      Used:             63.03TiB
>>>      Free (estimated):         10.10TiB    (min: 5.92TiB)
>>>      Free (statfs, df):          6.30TiB
>>>      Data ratio:                 2.22
>>>      Metadata ratio:             3.00
>>>      Global reserve:        512.00MiB    (used: 48.00KiB)
>>>      Multiple profiles:              yes    (data)
>>>
>>>                                 Data     Data      Metadata  System
>>> Id Path                        RAID1    RAID1C3   RAID1C3   RAID1C4 Unallocated
>>> -- --------------------------- -------- --------- --------- -------- -----------
>>>   1 /dev/mapper/bigraid33_crypt  7.48TiB   3.73TiB 808.00GiB 32.00MiB     2.56TiB
>>>   2 /dev/mapper/bigraid36_crypt  6.22TiB   4.00GiB 689.00GiB        -     2.20TiB
>>>   3 /dev/mapper/bigraid39_crypt  8.20TiB   3.36TiB 443.00GiB 32.00MiB     2.56TiB
>>>   4 /dev/mapper/bigraid37_crypt  3.64TiB   4.57TiB 152.00GiB 32.00MiB     2.56TiB
>>>   5 /dev/mapper/bigraid35_crypt  3.46TiB 367.00GiB 310.00GiB        -     1.33TiB
>>>   6 /dev/mapper/bigraid38_crypt  3.71TiB   3.24TiB   1.40TiB 32.00MiB     2.56TiB
>>>   7 /dev/mapper/bigraid41_crypt  3.05TiB  25.00GiB 377.00GiB        -     2.02TiB
>>>   8 /dev/mapper/bigraid20_crypt  6.66TiB   2.54TiB 322.00GiB        -     5.03TiB
>>> -- --------------------------- -------- --------- --------- -------- -----------
>>>     Total                       21.21TiB   5.94TiB   1.48TiB 32.00MiB    20.83TiB
>>>     Used                        21.14TiB   5.46TiB   1.46TiB  4.70MiB
> 
> Not sure if running with multiple profiles will cause issues or slowness, but it might be good to try to convert the old raid1c3 data chunks into raid1 over time. You can use balance filters to minimise the work each run.
> 
> # btrfs balance start -dconvert=raid1,soft,limit=10 /bigraid

If I remember correctly, BTRFS considers the highest-redundancy profile to be the default one. So having both raid1c3 and raid1 means that new data is written as raid1c3.




-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Salvaging the performance of a high-metadata filesystem
  2023-03-04  8:24         ` Forza
  2023-03-04 17:25           ` Goffredo Baroncelli
@ 2023-03-05  1:22           ` Matt Corallo
  2023-03-05  8:23             ` Forza
  1 sibling, 1 reply; 10+ messages in thread
From: Matt Corallo @ 2023-03-05  1:22 UTC (permalink / raw)
  To: Forza, Roman Mamedov, Btrfs BTRFS



On 3/4/23 12:24 AM, Forza wrote:
> Unless you need to, replace relatime with noatime. This makes a big difference when you have lots of 
> reflinks or snapshots as it avoids de-duplication of metadata when the atimes are updated.

Yea, I've done that now, thanks. I'm vaguely surprised this big a footgun is the default, and not 
called out much more aggressively in the subvolume manpage, at least.

> Not sure if running with multiple profiles will cause issues or slowness, but it might be good to 
> try to convert the old raid1c3 data chunks into raid1 over time. You can use balance filters to 
> minimise the work each run.

I don't think that's really an option. It took something like six months or a year to get as much 
raid1c3 as there is, and the filesystem has slowed down considerably since. Trying to rate-limit 
going back just means it'll take forever instead.

> # btrfs balance start -dconvert=raid1,soft,limit=10 /bigraid
> 
> This will avoid balancing blockgroups already in RAID1 (soft option) and limit to only balance 10 
> block groups. You can then schedule this during times with less active I/O.
> 
> It is also possible to defragment the subvolume and extent trees[*]. This could help a little, 
> though if the filesystem is frequently changing it might only be a temporary thing. It can also take 
> a long time to complete.

IIUC that can de-share the metadata from subvolumes though, no? Which is a big part of the 
(presumed) problem currently.

> # btrfs filesystem defragment /path/to/subvolume-root
> 
> [*] https://wiki.tnonline.net/w/Btrfs/Defrag#Defragmenting_the_subvolume_and_extent_trees
> 

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Salvaging the performance of a high-metadata filesystem
  2023-03-05  1:22           ` Matt Corallo
@ 2023-03-05  8:23             ` Forza
  0 siblings, 0 replies; 10+ messages in thread
From: Forza @ 2023-03-05  8:23 UTC (permalink / raw)
  To: Matt Corallo, Roman Mamedov, Btrfs BTRFS



On 2023-03-05 02:22, Matt Corallo wrote:
> 
> 
> On 3/4/23 12:24 AM, Forza wrote:
>> Unless you need to, replace relatime with noatime. This makes a big 
>> difference when you have lots of reflinks or snapshots as it avoids 
>> de-duplication of metadata when the atimes are updated.
> 
> Yea, I've done that now, thanks. I'm vaguely surprised this big a 
> footgun is the default, and not much more aggressively in the subvolume 
> manpage, at least.

It is the Linux default AFAIK, and many distros don't want to change it. 
Some (very few) pieces of software do use atimes, which is why relatime 
is still the default. But now you are aware, and it should start 
improving the situation for you.

> 
>> Not sure if running with multiple profiles will cause issues or 
>> slowness, but it might be good to try to convert the old raid1c3 data 
>> chunks into raid1 over time. You can use balance filters to minimise 
>> the work each run.
> 
> I don't think that's really an option. It took something like six months 
> or a year to get as much raid1c3 as there is, and the filesystem has 
> slowed down considerably since. Trying to rate-limit going back just 
> means it'll take forever instead.

Your current metadata allocation is ~7% of the filesystem. On HDDs this 
is going to be slow no matter what you do. But if you can change your 
`cp --reflink` into `btrfs sub snap src dst` and rsync into `src` 
instead, it could perhaps reduce the amount of metadata over time. How 
many of the files that you back up change on each backup?
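
A dry run against the latest backup would give a rough number, e.g. (paths illustrative):

# rsync -a -n --stats workstation:/home/ /bigraid/backups/current/

and look at the "files transferred" line in the summary.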

> 
>> # btrfs balance start -dconvert=raid1,soft,limit=10 /bigraid
>>
>> This will avoid balancing blockgroups already in RAID1 (soft option) 
>> and limit to only balance 10 block groups. You can then schedule this 
>> during times with less active I/O.
>>
>> It is also possible to defragment the subvolume and extent trees[*]. 
>> This could help a little, though if the filesystem is frequently 
>> changing it might only be a temporary thing. It can also take a long 
>> time to complete.
> 
> IIUC that can de-share the metadata from subvolumes though, no? Which is 
> a big part of the (presumed) problem currently.

It can, but it also reduces metadata seeks, which could be an 
improvement. Since this could take a long time, maybe it is something 
to try later.
> 
>> # btrfs filesystem defragment /path/to/subvolume-root
>>
>> [*] 
>> https://wiki.tnonline.net/w/Btrfs/Defrag#Defragmenting_the_subvolume_and_extent_trees
>>


What IO scheduler do you use? Have you tried different schedulers to see 
if that makes any difference? For example mq-deadline, kyber and BFQ. 
BFQ is sometimes friendlier to HDDs than the others, but it seems to 
vary greatly depending on the use-case.
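
You can check and switch them per device at runtime, e.g. (device names are examples, repeat 
for each member disk; bfq needs its module available):

# cat /sys/block/sda/queue/scheduler
# echo bfq > /sys/block/sda/queue/scheduler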

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Salvaging the performance of a high-metadata filesystem
  2023-03-03  4:34 Salvaging the performance of a high-metadata filesystem Matt Corallo
  2023-03-03  5:22 ` Roman Mamedov
@ 2023-03-05  9:36 ` Lukas Straub
  1 sibling, 0 replies; 10+ messages in thread
From: Lukas Straub @ 2023-03-05  9:36 UTC (permalink / raw)
  To: Matt Corallo; +Cc: Btrfs BTRFS

[-- Attachment #1: Type: text/plain, Size: 1194 bytes --]

On Thu, 2 Mar 2023 20:34:27 -0800
Matt Corallo <blnxfsl@bluematt.me> wrote:

> I have a ~seven year old BTRFS filesystem who's performance has slowly degraded to unusability.
> 
> ...
>
> This has led to a lot of metadata:
> Metadata,RAID1C3: Size:1.48TiB, Used:1.46TiB (98.73%)
> 
> ...
>
> I recently started adding some I/O to the machine, writing 1MB/s or two of writes from openstack 
> swift, which has now racked up a million or three files itself (in a directory tree two layers of 
> ~1000-folder directories deep). This has made the filesystem largely unusable.
> 
> ...
> 
> Thanks,
> Matt

Hi,
I suspect lots of inline files are bloating your metadata, especially
from OpenStack Swift, given that each object is stored as its own file:
https://docs.openstack.org/swift/latest/overview_architecture.html#object-server
By default, btrfs will store all files smaller than 2048 bytes inline
(i.e. directly in the metadata). You can change that with the
"max_inline" mount option.

You can count the number of inline files with something like:

find /mnt/hdd -type f -print0 | xargs -0 filefrag -v | grep inline | wc -l

Regards,
Lukas Straub

-- 


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2023-03-05  9:49 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-03-03  4:34 Salvaging the performance of a high-metadata filesystem Matt Corallo
2023-03-03  5:22 ` Roman Mamedov
2023-03-03  9:30   ` Forza
2023-03-03 19:04     ` Matt Corallo
2023-03-03 19:05       ` Matt Corallo
2023-03-04  8:24         ` Forza
2023-03-04 17:25           ` Goffredo Baroncelli
2023-03-05  1:22           ` Matt Corallo
2023-03-05  8:23             ` Forza
2023-03-05  9:36 ` Lukas Straub
