* mount time for big filesystems
@ 2017-08-31 10:43 Marco Lorenzo Crociani
2017-08-31 11:00 ` Hans van Kranenburg
2017-08-31 11:36 ` Roman Mamedov
0 siblings, 2 replies; 11+ messages in thread
From: Marco Lorenzo Crociani @ 2017-08-31 10:43 UTC (permalink / raw)
To: linux-btrfs
Hi,
this 37T filesystem took some time to mount. It has 47
subvolumes/snapshots and is mounted with
noatime,compress=zlib,space_cache. Is this normal, given its size?
# time mount /data/R6HW
real 1m32.383s
user 0m0.000s
sys 0m1.348s
# time umount /data/R6HW
real 0m2.562s
user 0m0.000s
sys 0m0.466s
# df -h
/dev/sdo 37T 20T 18T 53% /data/R6HW
# btrfs fi df /data/R6HW/
Data, single: total=19.12TiB, used=19.12TiB
System, DUP: total=8.00MiB, used=2.05MiB
Metadata, DUP: total=37.50GiB, used=36.52GiB
GlobalReserve, single: total=512.00MiB, used=0.00B
# btrfs device usage /data/R6HW/
/dev/sdo, ID: 1
Device size: 36.38TiB
Device slack: 0.00B
Data,single: 19.12TiB
Metadata,DUP: 75.00GiB
System,DUP: 16.00MiB
Unallocated: 17.18TiB
Regards,
--
Marco Crociani
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: mount time for big filesystems
2017-08-31 10:43 mount time for big filesystems Marco Lorenzo Crociani
@ 2017-08-31 11:00 ` Hans van Kranenburg
2017-08-31 11:22 ` Austin S. Hemmelgarn
2017-08-31 11:36 ` Roman Mamedov
1 sibling, 1 reply; 11+ messages in thread
From: Hans van Kranenburg @ 2017-08-31 11:00 UTC (permalink / raw)
To: Marco Lorenzo Crociani, linux-btrfs
On 08/31/2017 12:43 PM, Marco Lorenzo Crociani wrote:
> Hi,
> this 37T filesystem took some time to mount. It has 47
> subvolumes/snapshots and is mounted with
> noatime,compress=zlib,space_cache. Is this normal, given its size?
Yes, unfortunately it is. It depends on the size of the metadata extent
tree. During mount, the BLOCK_GROUP_ITEM objects are loaded from that
tree. They're scattered all around, causing a lot of random reads when
your disk cache is still ice cold.
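For the curious, the number of block group items mount has to read can be
counted from the extent tree with btrfs-progs. This is a sketch, not from the
thread; it uses the device path from the example above and is best run while
the filesystem is unmounted (dump-tree output on a live filesystem may be
inconsistent):

```shell
# Count BLOCK_GROUP_ITEM entries in the extent tree. Each one is a
# chunk descriptor that mount must locate, and on a cold cache each
# lookup tends to be a random read on rotational storage.
btrfs inspect-internal dump-tree -t extent /dev/sdo \
    | grep -c 'BLOCK_GROUP_ITEM'
```

On a ~20 TiB filesystem like this one, expect tens of thousands of entries,
which is roughly the number of scattered metadata reads a cold mount performs.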
> # time mount /data/R6HW
>
> real 1m32.383s
> user 0m0.000s
> sys 0m1.348s
>
> # time umount /data/R6HW
>
> real 0m2.562s
> user 0m0.000s
> sys 0m0.466s
>
>
> # df -h
> /dev/sdo 37T 20T 18T 53% /data/R6HW
>
> # btrfs fi df /data/R6HW/
> Data, single: total=19.12TiB, used=19.12TiB
> System, DUP: total=8.00MiB, used=2.05MiB
> Metadata, DUP: total=37.50GiB, used=36.52GiB
> GlobalReserve, single: total=512.00MiB, used=0.00B
>
> # btrfs device usage /data/R6HW/
> /dev/sdo, ID: 1
> Device size: 36.38TiB
> Device slack: 0.00B
> Data,single: 19.12TiB
> Metadata,DUP: 75.00GiB
> System,DUP: 16.00MiB
> Unallocated: 17.18TiB
--
Hans van Kranenburg
* Re: mount time for big filesystems
2017-08-31 11:00 ` Hans van Kranenburg
@ 2017-08-31 11:22 ` Austin S. Hemmelgarn
0 siblings, 0 replies; 11+ messages in thread
From: Austin S. Hemmelgarn @ 2017-08-31 11:22 UTC (permalink / raw)
To: Hans van Kranenburg, Marco Lorenzo Crociani, linux-btrfs
On 2017-08-31 07:00, Hans van Kranenburg wrote:
> On 08/31/2017 12:43 PM, Marco Lorenzo Crociani wrote:
>> Hi,
>> this 37T filesystem took some time to mount. It has 47
>> subvolumes/snapshots and is mounted with
>> noatime,compress=zlib,space_cache. Is this normal, given its size?
>
> Yes, unfortunately it is. It depends on the size of the metadata extent
> tree. During mount, the BLOCK_GROUP_ITEM objects are loaded from that
> tree. They're scattered all around, causing a lot of random reads when
> your disk cache is still ice cold.
FWIW, you can (sometimes) improve things by running a full balance.
Other than that, there's not much that can be done unless BTRFS gets
changed to actively group certain data close to each other on disk
(which would be nice for other reasons too TBH).
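A filtered balance is usually enough in practice; the following is a sketch of
the kind of invocation meant here (the usage thresholds are illustrative, not
from the thread):

```shell
# Rewrite chunks that are at most half full. This compacts allocation
# and rewrites extent tree metadata, which can bring BLOCK_GROUP_ITEMs
# closer together on disk.
btrfs balance start -dusage=50 -musage=50 /data/R6HW

# Balance on a filesystem this size can take a long time; check
# progress from another terminal:
btrfs balance status /data/R6HW
```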
* Re: mount time for big filesystems
2017-08-31 10:43 mount time for big filesystems Marco Lorenzo Crociani
2017-08-31 11:00 ` Hans van Kranenburg
@ 2017-08-31 11:36 ` Roman Mamedov
2017-08-31 11:45 ` Austin S. Hemmelgarn
` (2 more replies)
1 sibling, 3 replies; 11+ messages in thread
From: Roman Mamedov @ 2017-08-31 11:36 UTC (permalink / raw)
To: Marco Lorenzo Crociani; +Cc: linux-btrfs
On Thu, 31 Aug 2017 12:43:19 +0200
Marco Lorenzo Crociani <marcoc@prismatelecomtesting.com> wrote:
> Hi,
> this 37T filesystem took some time to mount. It has 47
> subvolumes/snapshots and is mounted with
> noatime,compress=zlib,space_cache. Is this normal, given its size?
If you could implement SSD caching in front of your FS (such as lvmcache or
bcache), that would work wonders for performance in general, and especially
for mount times. I have seen amazing results with lvmcache (of just 32 GB) for
a 14 TB FS.
More generally, at your FS size you should perhaps be using
"space_cache=v2" for better performance, but I'm not sure if that will have
any effect on mount time (aside from slowing down the first mount with it).
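For reference, switching to the free space tree is done at mount time; a
sketch using the mount point from this thread (the clear_cache step drops the
old v1 cache and is what makes the first mount slower):

```shell
# One-time conversion: clear the v1 space cache and build the
# free space tree (space_cache=v2). Requires kernel >= 4.5.
mount -o clear_cache,space_cache=v2 /dev/sdo /data/R6HW

# The free space tree persists on disk, so later mounts just use
# the usual options:
mount -o noatime,compress=zlib /dev/sdo /data/R6HW
```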
--
With respect,
Roman
* Re: mount time for big filesystems
2017-08-31 11:36 ` Roman Mamedov
@ 2017-08-31 11:45 ` Austin S. Hemmelgarn
2017-08-31 12:16 ` Roman Mamedov
2017-08-31 14:13 ` Qu Wenruo
2017-09-01 13:52 ` Juan Orti Alcaine
2 siblings, 1 reply; 11+ messages in thread
From: Austin S. Hemmelgarn @ 2017-08-31 11:45 UTC (permalink / raw)
To: Roman Mamedov, Marco Lorenzo Crociani; +Cc: linux-btrfs
On 2017-08-31 07:36, Roman Mamedov wrote:
> On Thu, 31 Aug 2017 12:43:19 +0200
> Marco Lorenzo Crociani <marcoc@prismatelecomtesting.com> wrote:
>
>> Hi,
>> this 37T filesystem took some time to mount. It has 47
>> subvolumes/snapshots and is mounted with
>> noatime,compress=zlib,space_cache. Is this normal, given its size?
>
> If you could implement SSD caching in front of your FS (such as lvmcache or
> bcache), that would work wonders for performance in general, and especially
> for mount times. I have seen amazing results with lvmcache (of just 32 GB) for
> a 14 TB FS.
If you use dm-cache (what LVM uses), you need to be _VERY_ careful and
can't use it safely at all with multi-device volumes because it leaves
the underlying block device exposed.
>
> More generally, at your FS size you should perhaps be using
> "space_cache=v2" for better performance, but I'm not sure if that will have
> any effect on mount time (aside from slowing down the first mount with it).
It shouldn't have any other impact on mount time, but it may speed up
other operations.
* Re: mount time for big filesystems
2017-08-31 11:45 ` Austin S. Hemmelgarn
@ 2017-08-31 12:16 ` Roman Mamedov
0 siblings, 0 replies; 11+ messages in thread
From: Roman Mamedov @ 2017-08-31 12:16 UTC (permalink / raw)
To: Austin S. Hemmelgarn; +Cc: Marco Lorenzo Crociani, linux-btrfs
On Thu, 31 Aug 2017 07:45:55 -0400
"Austin S. Hemmelgarn" <ahferroin7@gmail.com> wrote:
> If you use dm-cache (what LVM uses), you need to be _VERY_ careful and
> can't use it safely at all with multi-device volumes because it leaves
> the underlying block device exposed.
It locks the underlying device so it can't be seen by Btrfs and cause problems.
# btrfs dev scan
Scanning for Btrfs filesystems
# btrfs fi show
Label: none uuid: 62ff7619-8202-47f6-8c7e-cef6f082530e
Total devices 1 FS bytes used 112.00KiB
devid 1 size 16.00GiB used 2.02GiB path /dev/mapper/vg-OriginLV
# ls -la /dev/mapper/
total 0
drwxr-xr-x 2 root root 140 Aug 31 12:01 .
drwxr-xr-x 16 root root 2980 Aug 31 12:01 ..
crw------- 1 root root 10, 236 Aug 31 11:59 control
lrwxrwxrwx 1 root root 7 Aug 31 12:01 vg-CacheDataLV_cdata -> ../dm-1
lrwxrwxrwx 1 root root 7 Aug 31 12:01 vg-CacheDataLV_cmeta -> ../dm-2
lrwxrwxrwx 1 root root 7 Aug 31 12:06 vg-OriginLV -> ../dm-0
lrwxrwxrwx 1 root root 7 Aug 31 12:01 vg-OriginLV_corig -> ../dm-3
# btrfs dev scan /dev/dm-0
Scanning for Btrfs filesystems in '/dev/mapper/vg-OriginLV'
# btrfs dev scan /dev/dm-3
Scanning for Btrfs filesystems in '/dev/mapper/vg-OriginLV_corig'
ERROR: device scan failed on '/dev/mapper/vg-OriginLV_corig': Device or resource busy
--
With respect,
Roman
* Re: mount time for big filesystems
2017-08-31 11:36 ` Roman Mamedov
2017-08-31 11:45 ` Austin S. Hemmelgarn
@ 2017-08-31 14:13 ` Qu Wenruo
2017-09-01 13:52 ` Juan Orti Alcaine
2 siblings, 0 replies; 11+ messages in thread
From: Qu Wenruo @ 2017-08-31 14:13 UTC (permalink / raw)
To: Roman Mamedov, Marco Lorenzo Crociani; +Cc: linux-btrfs
On 2017年08月31日 19:36, Roman Mamedov wrote:
> On Thu, 31 Aug 2017 12:43:19 +0200
> Marco Lorenzo Crociani <marcoc@prismatelecomtesting.com> wrote:
>
>> Hi,
>> this 37T filesystem took some time to mount. It has 47
>> subvolumes/snapshots and is mounted with
>> noatime,compress=zlib,space_cache. Is this normal, given its size?
As Hans said, this is caused by BLOCK_GROUP_ITEMs being scattered across
the large extent tree, so it is hard to improve in the short term.
Some ideas, like delaying BLOCK_GROUP_ITEM loading, could greatly improve
mount speed.
But such an enhancement may affect the extent allocator (that is, we can't
do any writes before at least some BLOCK_GROUP_ITEMs are loaded) and may
introduce more bugs.
Other ideas, like a per-chunk extent tree, could also greatly reduce mount
time, but would need an on-disk format change.
(Well, in fact the btrfs on-disk format was never well designed anyway, so
if anyone really does this, please write a comprehensive wiki/white paper
for it.)
>
> If you could implement SSD caching in front of your FS (such as lvmcache or
> bcache), that would work wonders for performance in general, and especially
> for mount times. I have seen amazing results with lvmcache (of just 32 GB) for
> a 14 TB FS.
That's impressive.
Since the extent tree is a super hot tree (any CoW modifies the extent
tree), it makes sense.
>
> More generally, at your FS size you should perhaps be using
> "space_cache=v2" for better performance, but I'm not sure if that will have
> any effect on mount time (aside from slowing down the first mount with it).
>
Unfortunately, the space cache is not loaded until it is used (at least
for v1), so space_cache may not help much here.
Thanks,
Qu
* Re: mount time for big filesystems
2017-08-31 11:36 ` Roman Mamedov
2017-08-31 11:45 ` Austin S. Hemmelgarn
2017-08-31 14:13 ` Qu Wenruo
@ 2017-09-01 13:52 ` Juan Orti Alcaine
2017-09-01 13:59 ` Austin S. Hemmelgarn
2 siblings, 1 reply; 11+ messages in thread
From: Juan Orti Alcaine @ 2017-09-01 13:52 UTC (permalink / raw)
To: Roman Mamedov; +Cc: Marco Lorenzo Crociani, Btrfs BTRFS
2017-08-31 13:36 GMT+02:00 Roman Mamedov <rm@romanrm.net>:
> If you could implement SSD caching in front of your FS (such as lvmcache or
> bcache), that would work wonders for performance in general, and especially
> for mount times. I have seen amazing results with lvmcache (of just 32 GB) for
> a 14 TB FS.
I'm thinking about adding an SSD for my 4-disk RAID1 filesystem, but I
have doubts about how to do it correctly on a multi-device filesystem.
I guess I should make 4 partitions on the SSD and pair them with my
backing devices, then create the btrfs on top of bcache0, bcache1, ...
Is this the right way to do it?
* Re: mount time for big filesystems
2017-09-01 13:52 ` Juan Orti Alcaine
@ 2017-09-01 13:59 ` Austin S. Hemmelgarn
[not found] ` <CAC+fKQWFbdF6b3jGO_6hG_pNNzKobBYMeSNyEi5XRCf5YKa81Q@mail.gmail.com>
0 siblings, 1 reply; 11+ messages in thread
From: Austin S. Hemmelgarn @ 2017-09-01 13:59 UTC (permalink / raw)
To: Juan Orti Alcaine, Roman Mamedov; +Cc: Marco Lorenzo Crociani, Btrfs BTRFS
On 2017-09-01 09:52, Juan Orti Alcaine wrote:
> 2017-08-31 13:36 GMT+02:00 Roman Mamedov <rm@romanrm.net>:
>> If you could implement SSD caching in front of your FS (such as lvmcache or
>> bcache), that would work wonders for performance in general, and especially
>> for mount times. I have seen amazing results with lvmcache (of just 32 GB) for
>> a 14 TB FS.
>
> I'm thinking about adding a SSD for my 4 disks RAID1 filesystem, but I
> have doubts about how to correctly do it in a multidevice filesystem.
>
> I guess I should make 4 partitions on the SSD and pair them with my
> backing devices, then create the btrfs on top of bcache0, bcache1,...
> is this the right way to do it?
If you are going to use bcache, you don't need separate caches for each
device (and in fact, you're probably better off sharing a cache across
devices).
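A sketch of the shared-cache bcache layout described above (device names and
the cache set UUID are hypothetical placeholders, not from the thread):

```shell
# Register the four backing HDDs and one shared SSD cache set.
make-bcache -B /dev/sda /dev/sdb /dev/sdc /dev/sdd
make-bcache -C /dev/nvme0n1

# Attach every backing device to the single cache set, using the
# cache set UUID printed by "make-bcache -C" above.
for dev in /sys/block/bcache[0-3]/bcache; do
    echo <cache-set-uuid> > "$dev/attach"
done

# Then create the filesystem on the cached devices, not the raw disks.
mkfs.btrfs -d raid1 -m raid1 \
    /dev/bcache0 /dev/bcache1 /dev/bcache2 /dev/bcache3
```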
If instead you're going to use dm-cache/LVM, you will need two logical
volumes per device for the cache: one big one (for the actual cache) and
one small one (for metadata; usually a few hundred MB is fine).
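A sketch of the LVM commands implied here, using LV names matching Roman's
example output earlier in the thread (the VG name, SSD path, and sizes are
made up; lvconvert merges the data and metadata LVs into one cache pool):

```shell
# One big LV for cached blocks, one small LV for cache metadata,
# both placed on the SSD.
lvcreate -L 32G  -n CacheDataLV vg /dev/ssd
lvcreate -L 512M -n CacheMetaLV vg /dev/ssd

# Combine them into a cache pool, then attach the pool to the
# existing origin LV that holds the filesystem.
lvconvert --type cache-pool --poolmetadata vg/CacheMetaLV vg/CacheDataLV
lvconvert --type cache --cachepool vg/CacheDataLV vg/OriginLV
```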
In general, though, you're correct: it is preferred to do things in the
order you suggested. It is sometimes technically possible to convert an
existing device to being cached in place, but it's risky, and restoring
from a backup onto a clean filesystem has other benefits too.
Thread overview: 11+ messages
2017-08-31 10:43 mount time for big filesystems Marco Lorenzo Crociani
2017-08-31 11:00 ` Hans van Kranenburg
2017-08-31 11:22 ` Austin S. Hemmelgarn
2017-08-31 11:36 ` Roman Mamedov
2017-08-31 11:45 ` Austin S. Hemmelgarn
2017-08-31 12:16 ` Roman Mamedov
2017-08-31 14:13 ` Qu Wenruo
2017-09-01 13:52 ` Juan Orti Alcaine
2017-09-01 13:59 ` Austin S. Hemmelgarn
[not found] ` <CAC+fKQWFbdF6b3jGO_6hG_pNNzKobBYMeSNyEi5XRCf5YKa81Q@mail.gmail.com>
2017-09-01 15:20 ` Austin S. Hemmelgarn
2017-09-01 22:41 ` Dan Merillat