* mount time for big filesystems
From: Marco Lorenzo Crociani @ 2017-08-31 10:43 UTC (permalink / raw)
  To: linux-btrfs

Hi,
this 37T filesystem took some time to mount. It has 47
subvolumes/snapshots and is mounted with
noatime,compress=zlib,space_cache. Is this normal, given its size?

# time mount /data/R6HW

real	1m32.383s
user	0m0.000s
sys	0m1.348s

# time umount /data/R6HW

real	0m2.562s
user	0m0.000s
sys	0m0.466s


# df -h
/dev/sdo                          37T   20T     18T  53% /data/R6HW

# btrfs fi df /data/R6HW/
Data, single: total=19.12TiB, used=19.12TiB
System, DUP: total=8.00MiB, used=2.05MiB
Metadata, DUP: total=37.50GiB, used=36.52GiB
GlobalReserve, single: total=512.00MiB, used=0.00B

# btrfs device usage /data/R6HW/
/dev/sdo, ID: 1
    Device size:            36.38TiB
    Device slack:              0.00B
    Data,single:            19.12TiB
    Metadata,DUP:           75.00GiB
    System,DUP:             16.00MiB
    Unallocated:            17.18TiB

Regards,

-- 
Marco Crociani


* Re: mount time for big filesystems
From: Hans van Kranenburg @ 2017-08-31 11:00 UTC (permalink / raw)
  To: Marco Lorenzo Crociani, linux-btrfs

On 08/31/2017 12:43 PM, Marco Lorenzo Crociani wrote:
> Hi,
> this 37T filesystem took some time to mount. It has 47
> subvolumes/snapshots and is mounted with
> noatime,compress=zlib,space_cache. Is this normal, given its size?

Yes, unfortunately it is. It depends on the size of the metadata extent
tree. During mount, the BLOCK_GROUP_ITEM objects are loaded from that
tree. They're scattered all around, causing a lot of random reads when
your disk cache is still ice cold.
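
For a rough sense of how many block group items your extent tree holds,
something like this should work (assuming a btrfs-progs recent enough to have
inspect-internal dump-tree; tree 2 is the extent tree, the dump is huge, and
the count is only approximate on a mounted filesystem):

# btrfs inspect-internal dump-tree -t 2 /dev/sdo | grep -c BLOCK_GROUP_ITEM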

> # time mount /data/R6HW
> 
> real    1m32.383s
> user    0m0.000s
> sys    0m1.348s
> 
> # time umount /data/R6HW
> 
> real    0m2.562s
> user    0m0.000s
> sys    0m0.466s
> 
> 
> # df -h
> /dev/sdo                          37T   20T     18T  53% /data/R6HW
> 
> # btrfs fi df /data/R6HW/
> Data, single: total=19.12TiB, used=19.12TiB
> System, DUP: total=8.00MiB, used=2.05MiB
> Metadata, DUP: total=37.50GiB, used=36.52GiB
> GlobalReserve, single: total=512.00MiB, used=0.00B
> 
> # btrfs device usage /data/R6HW/
> /dev/sdo, ID: 1
>    Device size:            36.38TiB
>    Device slack:              0.00B
>    Data,single:            19.12TiB
>    Metadata,DUP:           75.00GiB
>    System,DUP:             16.00MiB
>    Unallocated:            17.18TiB


-- 
Hans van Kranenburg



* Re: mount time for big filesystems
From: Austin S. Hemmelgarn @ 2017-08-31 11:22 UTC (permalink / raw)
  To: Hans van Kranenburg, Marco Lorenzo Crociani, linux-btrfs

On 2017-08-31 07:00, Hans van Kranenburg wrote:
> On 08/31/2017 12:43 PM, Marco Lorenzo Crociani wrote:
>> Hi,
>> this 37T filesystem took some time to mount. It has 47
>> subvolumes/snapshots and is mounted with
>> noatime,compress=zlib,space_cache. Is this normal, given its size?
> 
> Yes, unfortunately it is. It depends on the size of the metadata extent
> tree. During mount, the BLOCK_GROUP_ITEM objects are loaded from that
> tree. They're scattered all around, causing a lot of random reads when
> your disk cache is still ice cold.
FWIW, you can (sometimes) improve things by running a full balance.
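
For example, something along these lines (a full balance rewrites every
chunk, so on a filesystem this size expect it to run for a very long time;
--full-balance just skips the safety delay in recent btrfs-progs):

# btrfs balance start --full-balance /data/R6HW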

Other than that, there's not much that can be done unless BTRFS gets 
changed to actively group certain data close to each other on disk 
(which would be nice for other reasons too TBH).



* Re: mount time for big filesystems
From: Roman Mamedov @ 2017-08-31 11:36 UTC (permalink / raw)
  To: Marco Lorenzo Crociani; +Cc: linux-btrfs

On Thu, 31 Aug 2017 12:43:19 +0200
Marco Lorenzo Crociani <marcoc@prismatelecomtesting.com> wrote:

> Hi,
> this 37T filesystem took some time to mount. It has 47
> subvolumes/snapshots and is mounted with
> noatime,compress=zlib,space_cache. Is this normal, given its size?

If you could implement SSD caching in front of your FS (such as lvmcache or
bcache), that would work wonders for performance in general, and especially
for mount times. I have seen amazing results with lvmcache (of just 32 GB) for
a 14 TB FS.

More generally, with an FS of your size you should perhaps be using
"space_cache=v2" for better performance, but I'm not sure whether that will
have any effect on mount time (aside from slowing down the first mount that
uses it).
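
If you want to try it, a minimal sketch would be (this needs a kernel with
free space tree support, 4.5 or newer, and the first mount with it has to
build the free space tree, so it will be noticeably slower):

# umount /data/R6HW
# mount -o noatime,compress=zlib,space_cache=v2 /dev/sdo /data/R6HW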

-- 
With respect,
Roman


* Re: mount time for big filesystems
From: Austin S. Hemmelgarn @ 2017-08-31 11:45 UTC (permalink / raw)
  To: Roman Mamedov, Marco Lorenzo Crociani; +Cc: linux-btrfs

On 2017-08-31 07:36, Roman Mamedov wrote:
> On Thu, 31 Aug 2017 12:43:19 +0200
> Marco Lorenzo Crociani <marcoc@prismatelecomtesting.com> wrote:
> 
>> Hi,
>> this 37T filesystem took some time to mount. It has 47
>> subvolumes/snapshots and is mounted with
>> noatime,compress=zlib,space_cache. Is this normal, given its size?
> 
> If you could implement SSD caching in front of your FS (such as lvmcache or
> bcache), that would work wonders for performance in general, and especially
> for mount times. I have seen amazing results with lvmcache (of just 32 GB) for
> a 14 TB FS.
If you use dm-cache (what LVM uses), you need to be _VERY_ careful and 
can't use it safely at all with multi-device volumes because it leaves 
the underlying block device exposed.
> 
> More generally, with an FS of your size you should perhaps be using
> "space_cache=v2" for better performance, but I'm not sure whether that will
> have any effect on mount time (aside from slowing down the first mount that
> uses it).
It shouldn't have any other impact on mount time, but it may speed up 
other operations.


* Re: mount time for big filesystems
From: Roman Mamedov @ 2017-08-31 12:16 UTC (permalink / raw)
  To: Austin S. Hemmelgarn; +Cc: Marco Lorenzo Crociani, linux-btrfs

On Thu, 31 Aug 2017 07:45:55 -0400
"Austin S. Hemmelgarn" <ahferroin7@gmail.com> wrote:

> If you use dm-cache (what LVM uses), you need to be _VERY_ careful and 
> can't use it safely at all with multi-device volumes because it leaves 
> the underlying block device exposed.

It locks the underlying device so that it can't be seen by Btrfs and cause problems:

# btrfs dev scan
Scanning for Btrfs filesystems

# btrfs fi show
Label: none  uuid: 62ff7619-8202-47f6-8c7e-cef6f082530e
	Total devices 1 FS bytes used 112.00KiB
	devid    1 size 16.00GiB used 2.02GiB path /dev/mapper/vg-OriginLV

# ls -la /dev/mapper/
total 0
drwxr-xr-x  2 root root     140 Aug 31 12:01 .
drwxr-xr-x 16 root root    2980 Aug 31 12:01 ..
crw-------  1 root root 10, 236 Aug 31 11:59 control
lrwxrwxrwx  1 root root       7 Aug 31 12:01 vg-CacheDataLV_cdata -> ../dm-1
lrwxrwxrwx  1 root root       7 Aug 31 12:01 vg-CacheDataLV_cmeta -> ../dm-2
lrwxrwxrwx  1 root root       7 Aug 31 12:06 vg-OriginLV -> ../dm-0
lrwxrwxrwx  1 root root       7 Aug 31 12:01 vg-OriginLV_corig -> ../dm-3

# btrfs dev scan /dev/dm-0
Scanning for Btrfs filesystems in '/dev/mapper/vg-OriginLV'

# btrfs dev scan /dev/dm-3
Scanning for Btrfs filesystems in '/dev/mapper/vg-OriginLV_corig'
ERROR: device scan failed on '/dev/mapper/vg-OriginLV_corig': Device or resource busy

-- 
With respect,
Roman


* Re: mount time for big filesystems
From: Qu Wenruo @ 2017-08-31 14:13 UTC (permalink / raw)
  To: Roman Mamedov, Marco Lorenzo Crociani; +Cc: linux-btrfs



On 2017-08-31 19:36, Roman Mamedov wrote:
> On Thu, 31 Aug 2017 12:43:19 +0200
> Marco Lorenzo Crociani <marcoc@prismatelecomtesting.com> wrote:
> 
>> Hi,
>> this 37T filesystem took some time to mount. It has 47
>> subvolumes/snapshots and is mounted with
>> noatime,compress=zlib,space_cache. Is this normal, given its size?

Just like Hans said, this is caused by the BLOCK_GROUP_ITEMs being scattered
through the large extent tree, so it's hard to improve in the short term.

Some ideas, such as delaying BLOCK_GROUP_ITEM loading, could greatly improve
mount speed.
But such an enhancement may affect the extent allocator (that is to say, we
can't do any writes before at least some BLOCK_GROUP_ITEMs are loaded) and
may introduce more bugs.

Other ideas, such as a per-chunk extent tree, may also greatly reduce mount
time, but they need an on-disk format change.
(Well, in fact the btrfs on-disk format was never well designed anyway, so if
anyone really does take this on, please write a comprehensive wiki/white
paper for it.)

> 
> If you could implement SSD caching in front of your FS (such as lvmcache or
> bcache), that would work wonders for performance in general, and especially
> for mount times. I have seen amazing results with lvmcache (of just 32 GB) for
> a 14 TB FS.

That's impressive.
Since the extent tree is a super hot tree (any CoW will modify it), it makes
sense.

> 
> More generally, with an FS of your size you should perhaps be using
> "space_cache=v2" for better performance, but I'm not sure whether that will
> have any effect on mount time (aside from slowing down the first mount that
> uses it).
> 
Unfortunately, the space cache is not loaded until it is used (at least for
v1), so space_cache may not help much.

Thanks,
Qu


* Re: mount time for big filesystems
From: Juan Orti Alcaine @ 2017-09-01 13:52 UTC (permalink / raw)
  To: Roman Mamedov; +Cc: Marco Lorenzo Crociani, Btrfs BTRFS

2017-08-31 13:36 GMT+02:00 Roman Mamedov <rm@romanrm.net>:
> If you could implement SSD caching in front of your FS (such as lvmcache or
> bcache), that would work wonders for performance in general, and especially
> for mount times. I have seen amazing results with lvmcache (of just 32 GB) for
> a 14 TB FS.

I'm thinking about adding an SSD for my 4-disk RAID1 filesystem, but I
have doubts about how to do it correctly in a multi-device filesystem.

I guess I should make 4 partitions on the SSD and pair them with my
backing devices, then create the btrfs on top of bcache0, bcache1, ...
Is this the right way to do it?


* Re: mount time for big filesystems
From: Austin S. Hemmelgarn @ 2017-09-01 13:59 UTC (permalink / raw)
  To: Juan Orti Alcaine, Roman Mamedov; +Cc: Marco Lorenzo Crociani, Btrfs BTRFS

On 2017-09-01 09:52, Juan Orti Alcaine wrote:
> 2017-08-31 13:36 GMT+02:00 Roman Mamedov <rm@romanrm.net>:
>> If you could implement SSD caching in front of your FS (such as lvmcache or
>> bcache), that would work wonders for performance in general, and especially
>> for mount times. I have seen amazing results with lvmcache (of just 32 GB) for
>> a 14 TB FS.
> 
> I'm thinking about adding an SSD for my 4-disk RAID1 filesystem, but I
> have doubts about how to do it correctly in a multi-device filesystem.
> 
> I guess I should make 4 partitions on the SSD and pair them with my
> backing devices, then create the btrfs on top of bcache0, bcache1, ...
> Is this the right way to do it?
If you are going to use bcache, you don't need separate caches for each 
device (and in fact, you're probably better off sharing a cache across 
devices).

If instead you're going to use dm-cache/LVM, you will need two logical
volumes per device for the cache: one big one (the actual cache itself) and
one small one (for metadata; usually a few hundred MB is fine).
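
A rough lvmcache sketch of that layout, borrowing the LV names from Roman's
earlier example and using /dev/sde1 as a stand-in for the SSD PV (sizes are
placeholders; repeat per backing device if you go this route):

# lvcreate -L 30G -n CacheDataLV vg /dev/sde1
# lvcreate -L 512M -n CacheMetaLV vg /dev/sde1
# lvconvert --type cache-pool --poolmetadata vg/CacheMetaLV vg/CacheDataLV
# lvconvert --type cache --cachepool vg/CacheDataLV vg/OriginLV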

In general though, you're correct: it is preferred to do things in the
order you suggested.  It is sometimes technically possible to convert an
existing device to being cached in place, but it's risky, and restoring
from a backup onto a clean filesystem has other benefits too.


* Re: mount time for big filesystems
From: Austin S. Hemmelgarn @ 2017-09-01 15:20 UTC (permalink / raw)
  To: Juan Orti Alcaine; +Cc: Btrfs BTRFS, Marco Lorenzo Crociani, Roman Mamedov

On 2017-09-01 11:00, Juan Orti Alcaine wrote:
> 
> 
> El 1 sept. 2017 15:59, "Austin S. Hemmelgarn" <ahferroin7@gmail.com 
> <mailto:ahferroin7@gmail.com>> escribió:
> 
>     If you are going to use bcache, you don't need separate caches for
>     each device (and in fact, you're probably better off sharing a cache
>     across devices).
> 
> 
> But, if I mix all the backing devices, I'll only get one bcache device, 
> so I won't be able to do btrfs RAID1 on that.
No, that's not what I'm talking about.  You always get one bcache device 
per backing device, but multiple bcache devices can use the same 
physical cache device (that is, backing devices map 1:1 to bcache 
devices, but cache devices can map 1:N to bcache devices).  So, in other 
words, the layout I'm suggesting looks like this:

/dev/sda1: Backing device.
/dev/sdb1: Backing device.
/dev/sdc1: Backing device.
/dev/sdd1: Backing device.
/dev/sde1: SSD cache device.
/dev/bcache0: Corresponds to /dev/sda1, uses /dev/sde1 as cache
/dev/bcache1: Corresponds to /dev/sdb1, uses /dev/sde1 as cache
/dev/bcache2: Corresponds to /dev/sdc1, uses /dev/sde1 as cache
/dev/bcache3: Corresponds to /dev/sdd1, uses /dev/sde1 as cache

This is actually simpler to manage for multiple reasons, and will avoid 
wasting space on the cache device because of random choices made by 
BTRFS when deciding where to read data.
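
A sketch of how that could be set up with bcache-tools, using the device
names above (<cset.uuid> is the cache set UUID printed by bcache-super-show;
make-bcache reformats the devices, so this is for a fresh setup only):

# make-bcache -C /dev/sde1
# make-bcache -B /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1
# bcache-super-show /dev/sde1 | grep cset.uuid
# echo <cset.uuid> > /sys/block/bcache0/bcache/attach
# echo <cset.uuid> > /sys/block/bcache1/bcache/attach
# echo <cset.uuid> > /sys/block/bcache2/bcache/attach
# echo <cset.uuid> > /sys/block/bcache3/bcache/attach
# mkfs.btrfs -m raid1 -d raid1 /dev/bcache0 /dev/bcache1 /dev/bcache2 /dev/bcache3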


* Re: mount time for big filesystems
From: Dan Merillat @ 2017-09-01 22:41 UTC (permalink / raw)
  To: Austin S. Hemmelgarn
  Cc: Juan Orti Alcaine, Btrfs BTRFS, Marco Lorenzo Crociani, Roman Mamedov

On Fri, Sep 1, 2017 at 11:20 AM, Austin S. Hemmelgarn
<ahferroin7@gmail.com> wrote:
> No, that's not what I'm talking about.  You always get one bcache device per
> backing device, but multiple bcache devices can use the same physical cache
> device (that is, backing devices map 1:1 to bcache devices, but cache
> devices can map 1:N to bcache devices).  So, in other words, the layout I'm
> suggesting looks like this:
>
> This is actually simpler to manage for multiple reasons, and will avoid
> wasting space on the cache device because of random choices made by BTRFS
> when deciding where to read data.

Be careful with bcache - if you lose the SSD while it has dirty data on
it, your entire FS is gone.  I ended up contributing a number of
patches to the recovery tools while digging my array out from that.  Even
if only a single file is dirty, the new metadata tree will exist only on
the cache device, which doesn't honor barriers when writing back to the
underlying storage.  That means the backing device is likely to have a root
pointing at a metadata tree that's no longer there.  The recovery method is
finding an older root that still has a complete tree and recovery-walking
the entire FS from that.

I don't know whether dm-cache honors write barriers from the cache to the
backing storage, but I would still recommend using both of them in
write-through mode, not write-back.
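
With bcache the write mode is a per-device sysfs switch, something like this
for the bcache0..bcache3 layout sketched earlier (for lvmcache the rough
equivalent would be lvchange --cachemode writethrough on the cached LV):

# echo writethrough > /sys/block/bcache0/bcache/cache_mode
# cat /sys/block/bcache0/bcache/cache_mode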
