* mount time of multi-disk arrays
@ 2014-07-07 13:38 André-Sebastian Liebe
  2014-07-07 13:54 ` Konstantinos Skarlatos
  0 siblings, 1 reply; 9+ messages in thread
From: André-Sebastian Liebe @ 2014-07-07 13:38 UTC (permalink / raw)
  To: linux-btrfs

Hello List,

can anyone tell me how much time is acceptable and assumable for a
multi-disk btrfs array with classical hard disk drives to mount?

I'm having a bit of trouble with my current systemd setup, because it
couldn't mount my btrfs raid anymore after adding the 5th drive. With
the 4 drive setup it failed to mount once in a few times. Now it fails
every time because the default timeout of 1m 30s is reached and mount is
aborted.
My last 10 manual mounts took between 1m57s and 2m12s to finish.

My hardware setup contains a
- Intel Core i7 4770
- Kernel 3.15.2-1-ARCH
- 32GB RAM
- dev 1-4 are 4TB Seagate ST4000DM000 (5900rpm)
- dev 5 is a 4TB Western Digital WDC WD40EFRX (5400rpm)

Thanks in advance

André-Sebastian Liebe
--------------------------------------------------------------------------------------------------

# btrfs fi sh
Label: 'apc01_pool0'  uuid: 066141c6-16ca-4a30-b55c-e606b90ad0fb
        Total devices 5 FS bytes used 14.21TiB
        devid    1 size 3.64TiB used 2.86TiB path /dev/sdd
        devid    2 size 3.64TiB used 2.86TiB path /dev/sdc
        devid    3 size 3.64TiB used 2.86TiB path /dev/sdf
        devid    4 size 3.64TiB used 2.86TiB path /dev/sde
        devid    5 size 3.64TiB used 2.88TiB path /dev/sdb

Btrfs v3.14.2-dirty

# btrfs fi df /data/pool0/
Data, single: total=14.28TiB, used=14.19TiB
System, RAID1: total=8.00MiB, used=1.54MiB
Metadata, RAID1: total=26.00GiB, used=20.20GiB
unknown, single: total=512.00MiB, used=0.00



^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: mount time of multi-disk arrays
  2014-07-07 13:38 mount time of multi-disk arrays André-Sebastian Liebe
@ 2014-07-07 13:54 ` Konstantinos Skarlatos
  2014-07-07 14:14   ` Austin S Hemmelgarn
                     ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: Konstantinos Skarlatos @ 2014-07-07 13:54 UTC (permalink / raw)
  To: André-Sebastian Liebe, linux-btrfs

On 7/7/2014 4:38 μμ, André-Sebastian Liebe wrote:
> Hello List,
>
> can anyone tell me how much time is acceptable and assumable for a
> multi-disk btrfs array with classical hard disk drives to mount?
>
> I'm having a bit of trouble with my current systemd setup, because it
> couldn't mount my btrfs raid anymore after adding the 5th drive. With
> the 4 drive setup it failed to mount once in a few times. Now it fails
> every time because the default timeout of 1m 30s is reached and mount is
> aborted.
> My last 10 manual mounts took between 1m57s and 2m12s to finish.
I have the exact same problem, and have to manually mount my large 
multi-disk btrfs filesystems, so I would be interested in a solution as 
well.

>
> My hardware setup contains a
> - Intel Core i7 4770
> - Kernel 3.15.2-1-ARCH
> - 32GB RAM
> - dev 1-4 are 4TB Seagate ST4000DM000 (5900rpm)
> - dev 5 is a 4TB Western Digital WDC WD40EFRX (5400rpm)
>
> Thanks in advance
>
> André-Sebastian Liebe
> --------------------------------------------------------------------------------------------------
>
> # btrfs fi sh
> Label: 'apc01_pool0'  uuid: 066141c6-16ca-4a30-b55c-e606b90ad0fb
>          Total devices 5 FS bytes used 14.21TiB
>          devid    1 size 3.64TiB used 2.86TiB path /dev/sdd
>          devid    2 size 3.64TiB used 2.86TiB path /dev/sdc
>          devid    3 size 3.64TiB used 2.86TiB path /dev/sdf
>          devid    4 size 3.64TiB used 2.86TiB path /dev/sde
>          devid    5 size 3.64TiB used 2.88TiB path /dev/sdb
>
> Btrfs v3.14.2-dirty
>
> # btrfs fi df /data/pool0/
> Data, single: total=14.28TiB, used=14.19TiB
> System, RAID1: total=8.00MiB, used=1.54MiB
> Metadata, RAID1: total=26.00GiB, used=20.20GiB
> unknown, single: total=512.00MiB, used=0.00
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


-- 
Konstantinos Skarlatos


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: mount time of multi-disk arrays
  2014-07-07 13:54 ` Konstantinos Skarlatos
@ 2014-07-07 14:14   ` Austin S Hemmelgarn
  2014-07-07 16:57     ` André-Sebastian Liebe
  2014-07-07 14:24   ` André-Sebastian Liebe
  2014-07-07 15:48   ` Duncan
  2 siblings, 1 reply; 9+ messages in thread
From: Austin S Hemmelgarn @ 2014-07-07 14:14 UTC (permalink / raw)
  To: Konstantinos Skarlatos, André-Sebastian Liebe, linux-btrfs

On 2014-07-07 09:54, Konstantinos Skarlatos wrote:
> On 7/7/2014 4:38 μμ, André-Sebastian Liebe wrote:
>> Hello List,
>>
>> can anyone tell me how much time is acceptable and assumable for a
>> multi-disk btrfs array with classical hard disk drives to mount?
>>
>> I'm having a bit of trouble with my current systemd setup, because it
>> couldn't mount my btrfs raid anymore after adding the 5th drive. With
>> the 4 drive setup it failed to mount once in a few times. Now it fails
>> every time because the default timeout of 1m 30s is reached and mount is
>> aborted.
>> My last 10 manual mounts took between 1m57s and 2m12s to finish.
> I have the exact same problem, and have to manually mount my large
> multi-disk btrfs filesystems, so I would be interested in a solution as
> well.
> 
>>
>> My hardware setup contains a
>> - Intel Core i7 4770
>> - Kernel 3.15.2-1-ARCH
>> - 32GB RAM
>> - dev 1-4 are 4TB Seagate ST4000DM000 (5900rpm)
>> - dev 5 is a 4TB Western Digital WDC WD40EFRX (5400rpm)
>>
>> Thanks in advance
>>
>> André-Sebastian Liebe
>> --------------------------------------------------------------------------------------------------
>>
>>
>> # btrfs fi sh
>> Label: 'apc01_pool0'  uuid: 066141c6-16ca-4a30-b55c-e606b90ad0fb
>>          Total devices 5 FS bytes used 14.21TiB
>>          devid    1 size 3.64TiB used 2.86TiB path /dev/sdd
>>          devid    2 size 3.64TiB used 2.86TiB path /dev/sdc
>>          devid    3 size 3.64TiB used 2.86TiB path /dev/sdf
>>          devid    4 size 3.64TiB used 2.86TiB path /dev/sde
>>          devid    5 size 3.64TiB used 2.88TiB path /dev/sdb
>>
>> Btrfs v3.14.2-dirty
>>
>> # btrfs fi df /data/pool0/
>> Data, single: total=14.28TiB, used=14.19TiB
>> System, RAID1: total=8.00MiB, used=1.54MiB
>> Metadata, RAID1: total=26.00GiB, used=20.20GiB
>> unknown, single: total=512.00MiB, used=0.00

This is interesting, I actually did some profiling of the mount timings
for a bunch of different configurations of 4 (identical other than
hardware age) 1TB Seagate disks.  One of the arrangements I tested was
Data using single profile and Metadata/System using RAID1.  Based on the
results I got, and what you are reporting, the mount time doesn't scale
linearly in proportion to the amount of storage space.

You might want to try the RAID10 profile for Metadata, of the
configurations I tested, the fastest used Single for Data and RAID10 for
Metadata/System.
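
If you want to try that on the existing filesystem, the conversion would
be roughly the following (a sketch only, assuming a btrfs-progs version
with balance convert filters; -f is needed because the -s filter acts on
the System chunks, and RAID10 needs at least 4 devices):

# btrfs balance start -f -mconvert=raid10 -sconvert=raid10 /data/pool0   # sketch, verify against your btrfs-progs

Only the Metadata and System chunks get rewritten by this, so it should
finish far faster than a full balance.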

Also, based on the System chunk usage, I'm guessing that you have a LOT
of subvolumes/snapshots, and I do know that having very large (100+)
numbers of either does slow down the mount command (I don't think that
we cache subvolume information between mount invocations, so it has to
re-parse the system chunks for each individual mount).
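
An easy way to check that guess (assuming the filesystem is mounted at
/data/pool0):

# btrfs subvolume list /data/pool0 | wc -l   # counts subvolumes, snapshots included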



^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: mount time of multi-disk arrays
  2014-07-07 13:54 ` Konstantinos Skarlatos
  2014-07-07 14:14   ` Austin S Hemmelgarn
@ 2014-07-07 14:24   ` André-Sebastian Liebe
  2014-07-07 22:34     ` Konstantinos Skarlatos
  2014-07-07 15:48   ` Duncan
  2 siblings, 1 reply; 9+ messages in thread
From: André-Sebastian Liebe @ 2014-07-07 14:24 UTC (permalink / raw)
  To: Konstantinos Skarlatos, linux-btrfs

On 07/07/2014 03:54 PM, Konstantinos Skarlatos wrote:
> On 7/7/2014 4:38 μμ, André-Sebastian Liebe wrote:
>> Hello List,
>>
>> can anyone tell me how much time is acceptable and assumable for a
>> multi-disk btrfs array with classical hard disk drives to mount?
>>
>> I'm having a bit of trouble with my current systemd setup, because it
>> couldn't mount my btrfs raid anymore after adding the 5th drive. With
>> the 4 drive setup it failed to mount once in a few times. Now it fails
>> every time because the default timeout of 1m 30s is reached and mount is
>> aborted.
>> My last 10 manual mounts took between 1m57s and 2m12s to finish.
> I have the exact same problem, and have to manually mount my large
> multi-disk btrfs filesystems, so I would be interested in a solution
> as well.
Hi Konstantinos, you can work around this by manually creating a systemd
mount unit.

- First review the autogenerated systemd mount unit (systemctl show
<your-mount-unit>.mount). You can get the unit name by running
'systemctl' and looking for your failed mount.
- Then take the needed values (After, Before, Conflicts,
RequiresMountsFor, Where, What, Options, Type, WantedBy) and put them
into a new systemd mount unit file (for example under
/usr/lib/systemd/system/<your-mount-unit>.mount).
- Now add TimeoutSec= with a large enough value below [Mount].
- If you later want to automount your raid, add the WantedBy= line under
[Install].
- Now issue a 'systemctl daemon-reload' and look for error messages in
syslog.
- If there are no errors, you can enable your manual mount entry with
'systemctl enable <your-mount-unit>.mount' and safely comment out your
old fstab entry (so systemd no longer autogenerates a unit for it).

-- 8< ----------- 8< ----------- 8< ----------- 8< ----------- 8<
----------- 8< ----------- 8< -----------
[Unit]
Description=Mount /data/pool0
After=dev-disk-by\x2duuid-066141c6\x2d16ca\x2d4a30\x2db55c\x2de606b90ad0fb.device
systemd-journald.socket local-fs-pre.target system.slice -.mount
Before=umount.target
Conflicts=umount.target
RequiresMountsFor=/data
/dev/disk/by-uuid/066141c6-16ca-4a30-b55c-e606b90ad0fb

[Mount]
Where=/data/pool0
What=/dev/disk/by-uuid/066141c6-16ca-4a30-b55c-e606b90ad0fb
Options=rw,relatime,skip_balance,compress
Type=btrfs
TimeoutSec=3min

[Install]
WantedBy=dev-disk-by\x2duuid-066141c6\x2d16ca\x2d4a30\x2db55c\x2de606b90ad0fb.device
-- 8< ----------- 8< ----------- 8< ----------- 8< ----------- 8<
----------- 8< ----------- 8< -----------
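
For completeness, activating it boils down to roughly this (note that a
mount unit has to be named after the escaped mount path, so for
/data/pool0 the file is data-pool0.mount):

# cp data-pool0.mount /usr/lib/systemd/system/   # same path as in the steps above
# systemctl daemon-reload
# systemctl enable data-pool0.mount
# systemctl start data-pool0.mount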


>
>>
>> My hardware setup contains a
>> - Intel Core i7 4770
>> - Kernel 3.15.2-1-ARCH
>> - 32GB RAM
>> - dev 1-4 are 4TB Seagate ST4000DM000 (5900rpm)
>> - dev 5 is a 4TB Western Digital WDC WD40EFRX (5400rpm)
>>
>> Thanks in advance
>>
>> André-Sebastian Liebe
>> --------------------------------------------------------------------------------------------------
>>
>>
>> # btrfs fi sh
>> Label: 'apc01_pool0'  uuid: 066141c6-16ca-4a30-b55c-e606b90ad0fb
>>          Total devices 5 FS bytes used 14.21TiB
>>          devid    1 size 3.64TiB used 2.86TiB path /dev/sdd
>>          devid    2 size 3.64TiB used 2.86TiB path /dev/sdc
>>          devid    3 size 3.64TiB used 2.86TiB path /dev/sdf
>>          devid    4 size 3.64TiB used 2.86TiB path /dev/sde
>>          devid    5 size 3.64TiB used 2.88TiB path /dev/sdb
>>
>> Btrfs v3.14.2-dirty
>>
>> # btrfs fi df /data/pool0/
>> Data, single: total=14.28TiB, used=14.19TiB
>> System, RAID1: total=8.00MiB, used=1.54MiB
>> Metadata, RAID1: total=26.00GiB, used=20.20GiB
>> unknown, single: total=512.00MiB, used=0.00
>>
>>
>> -- 
>> To unsubscribe from this list: send the line "unsubscribe
>> linux-btrfs" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
> -- 
> Konstantinos Skarlatos
--
André-Sebastian Liebe


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: mount time of multi-disk arrays
  2014-07-07 13:54 ` Konstantinos Skarlatos
  2014-07-07 14:14   ` Austin S Hemmelgarn
  2014-07-07 14:24   ` André-Sebastian Liebe
@ 2014-07-07 15:48   ` Duncan
  2014-07-07 16:40     ` Benjamin O'Connor
  2014-07-07 22:31     ` Konstantinos Skarlatos
  2 siblings, 2 replies; 9+ messages in thread
From: Duncan @ 2014-07-07 15:48 UTC (permalink / raw)
  To: linux-btrfs

Konstantinos Skarlatos posted on Mon, 07 Jul 2014 16:54:05 +0300 as
excerpted:

> On 7/7/2014 4:38 μμ, André-Sebastian Liebe wrote:
>>
>> can anyone tell me how much time is acceptable and assumable for a
>> multi-disk btrfs array with classical hard disk drives to mount?
>>
>> I'm having a bit of trouble with my current systemd setup, because it
>> couldn't mount my btrfs raid anymore after adding the 5th drive. With
>> the 4 drive setup it failed to mount once in a few times. Now it fails
>> every time because the default timeout of 1m 30s is reached and mount is
>> aborted.
>> My last 10 manual mounts took between 1m57s and 2m12s to finish.

> I have the exact same problem, and have to manually mount my large
> multi-disk btrfs filesystems, so I would be interested in a solution as
> well.

I don't have a direct answer, as my btrfs devices are all SSD, but...

a) Btrfs, like some other filesystems, is designed not to need a
pre-mount (or pre-rw-mount) fsck, because it does what /should/ be a 
quick-scan at mount-time.  However, that isn't always as quick as it 
might be for a number of reasons:

a1) Btrfs is still a relatively immature filesystem and certain 
operations are not yet optimized.  In particular, multi-device btrfs 
operations tend to still be using a first-working-implementation type of 
algorithm instead of a well optimized for parallel operation algorithm, 
and thus often serialize access to multiple devices where a more 
optimized algorithm would parallelize operations across multiple devices 
at the same time.  That will come, but it's not there yet.

a2) Certain operations such as orphan cleanup ("orphans" are files that 
were deleted while they were in use and thus weren't fully deleted at the 
time; if they were still in use at unmount (remount-read-only), cleanup 
is done at mount-time) can delay mount as well.

a3) Inode_cache mount option:  Don't use this unless you can explain 
exactly WHY you are using it, preferably backed up with benchmark 
numbers, etc.  It's useful only on 32-bit, generally high-file-activity 
server systems and has general-case problems, including long mount times 
and possible overflow issues that make it inappropriate for normal use.  
Unfortunately there's a lot of people out there using it that shouldn't 
be, and I even saw it listed on at least one distro (not mine!) wiki. =:^(

a4) The space_cache mount option OTOH *IS* appropriate for normal use 
(and is in fact enabled by default these days), but particularly in 
improper shutdown cases can require rebuilding at mount time -- altho 
this should happen /after/ mount, the system will just be busy for some 
minutes, until the space-cache is rebuilt.  But the IO from a space_cache 
rebuild on one filesystem could slow down the mounting of filesystems 
that mount after it, as well as the boot-time launching of other post-
mount launched services.

If you're seeing the time go up dramatically with the addition of more 
filesystem devices, however, and you do /not/ have inode_cache active, 
I'd guess it's mainly the not-yet-optimized multi-device operations.


b) As with any systemd launched unit, however, there's systemd 
configuration mechanisms for working around specific unit issues, 
including timeout issues.  Of course most systems continue to use fstab 
and let systemd auto-generate the mount units, and in fact that is 
recommended, but either with fstab or directly created mount units, 
there's a timeout configuration option that can be set.

b1) The general systemd *.mount unit [Mount] section option appears to be 
TimeoutSec=.  As is usual with systemd times, the default is seconds, or 
pass the unit(s), like "5min 20s".

b2) I don't see it /specifically/ stated, but with a bit of reading 
between the lines, the corresponding fstab option appears to be either
x-systemd.timeoutsec= or x-systemd.TimeoutSec= (IOW I'm not sure of the 
case).  You may also want to try x-systemd.device-timeout=, which /is/ 
specifically mentioned, altho that appears to be specifically the timeout 
for the device to appear, NOT for the filesystem to mount after it does.

b3) See the systemd.mount (5) and systemd-fstab-generator (8) manpages 
for more, that being what the above is based on.
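
In fstab form that would be something like the line below (a sketch
only; <fs-uuid> is a placeholder, x-systemd.device-timeout= is the
documented option, and the per-mount timeout spelling from b2 still
needs verifying before relying on it):

# sketch only -- x-systemd.device-timeout= covers device detection; add the
# per-mount timeout option from b2 once its exact name is confirmed
UUID=<fs-uuid>  /data/pool0  btrfs  defaults,x-systemd.device-timeout=5min  0  0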

So it might take a bit of experimentation to find the exact command, but 
based on the above anyway, it /should/ be pretty easy to tell systemd to 
wait a bit longer for that filesystem.

When you find the right invocation, please reply with it here, as I'm 
sure there's others who will benefit as well.  FWIW, I'm still on 
reiserfs for my spinning rust (only btrfs on my ssds), but I expect I'll 
switch them to btrfs at some point, so I may well use the information 
myself.  =:^)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: mount time of multi-disk arrays
  2014-07-07 15:48   ` Duncan
@ 2014-07-07 16:40     ` Benjamin O'Connor
  2014-07-07 22:31     ` Konstantinos Skarlatos
  1 sibling, 0 replies; 9+ messages in thread
From: Benjamin O'Connor @ 2014-07-07 16:40 UTC (permalink / raw)
  To: Duncan; +Cc: linux-btrfs

As a point of reference, my BTRFS filesystem with 11 x 21TB devices in 
RAID0 with space cache enabled takes about 4 minutes to mount after a 
clean unmount.

There is a decent amount of variation in the time (it has been as low as 
3 minutes and has taken 5 minutes or longer).  These devices are all 
connected via 10Gb iSCSI.

Mount time seems to have not increased relative to the number of devices 
(so far).  I think that back when we had only 6 devices, it still took 
roughly that amount of time.
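
For anyone who wants to collect comparable numbers, timing the mount by
hand is enough (sketch; device path and mount point are placeholders):

# time mount /dev/disk/by-uuid/<fs-uuid> /mnt/array   # wall-clock time of a mount after a clean unmount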

-ben

-- 
-----------------------------
Benjamin O'Connor
TechOps Systems Administrator
TripAdvisor Media Group

benoc@tripadvisor.com
c. 617-312-9072
-----------------------------


Duncan wrote:
> Konstantinos Skarlatos posted on Mon, 07 Jul 2014 16:54:05 +0300 as
> excerpted:
>
>> On 7/7/2014 4:38 μμ, André-Sebastian Liebe wrote:
>>> can anyone tell me how much time is acceptable and assumable for a
>>> multi-disk btrfs array with classical hard disk drives to mount?
>>>
>>> I'm having a bit of trouble with my current systemd setup, because it
>>> couldn't mount my btrfs raid anymore after adding the 5th drive. With
>>> the 4 drive setup it failed to mount once in a few times. Now it fails
>>> every time because the default timeout of 1m 30s is reached and mount is
>>> aborted.
>>> My last 10 manual mounts took between 1m57s and 2m12s to finish.
>
>> I have the exact same problem, and have to manually mount my large
>> multi-disk btrfs filesystems, so I would be interested in a solution as
>> well.
>
> I don't have a direct answer, as my btrfs devices are all SSD, but...
>
> a) Btrfs, like some other filesystems, is designed not to need a
> pre-mount (or pre-rw-mount) fsck, because it does what /should/ be a
> quick-scan at mount-time.  However, that isn't always as quick as it
> might be for a number of reasons:
>
> a1) Btrfs is still a relatively immature filesystem and certain
> operations are not yet optimized.  In particular, multi-device btrfs
> operations tend to still be using a first-working-implementation type of
> algorithm instead of a well optimized for parallel operation algorithm,
> and thus often serialize access to multiple devices where a more
> optimized algorithm would parallelize operations across multiple devices
> at the same time.  That will come, but it's not there yet.
>
> a2) Certain operations such as orphan cleanup ("orphans" are files that
> were deleted while they were in use and thus weren't fully deleted at the
> time; if they were still in use at unmount (remount-read-only), cleanup
> is done at mount-time) can delay mount as well.
>
> a3) Inode_cache mount option:  Don't use this unless you can explain
> exactly WHY you are using it, preferably backed up with benchmark
> numbers, etc.  It's useful only on 32-bit, generally high-file-activity
> server systems and has general-case problems, including long mount times
> and possible overflow issues that make it inappropriate for normal use.
> Unfortunately there's a lot of people out there using it that shouldn't
> be, and I even saw it listed on at least one distro (not mine!) wiki. =:^(
>
> a4) The space_cache mount option OTOH *IS* appropriate for normal use
> (and is in fact enabled by default these days), but particularly in
> improper shutdown cases can require rebuilding at mount time -- altho
> this should happen /after/ mount, the system will just be busy for some
> minutes, until the space-cache is rebuilt.  But the IO from a space_cache
> rebuild on one filesystem could slow down the mounting of filesystems
> that mount after it, as well as the boot-time launching of other post-
> mount launched services.
>
> If you're seeing the time go up dramatically with the addition of more
> filesystem devices, however, and you do /not/ have inode_cache active,
> I'd guess it's mainly the not-yet-optimized multi-device operations.
>
>
> b) As with any systemd launched unit, however, there's systemd
> configuration mechanisms for working around specific unit issues,
> including timeout issues.  Of course most systems continue to use fstab
> and let systemd auto-generate the mount units, and in fact that is
> recommended, but either with fstab or directly created mount units,
> there's a timeout configuration option that can be set.
>
> b1) The general systemd *.mount unit [Mount] section option appears to be
> TimeoutSec=.  As is usual with systemd times, the default is seconds, or
> pass the unit(s), like "5min 20s".
>
> b2) I don't see it /specifically/ stated, but with a bit of reading
> between the lines, the corresponding fstab option appears to be either
> x-systemd.timeoutsec= or x-systemd.TimeoutSec= (IOW I'm not sure of the
> case).  You may also want to try x-systemd.device-timeout=, which /is/
> specifically mentioned, altho that appears to be specifically the timeout
> for the device to appear, NOT for the filesystem to mount after it does.
>
> b3) See the systemd.mount (5) and systemd-fstab-generator (8) manpages
> for more, that being what the above is based on.
>
> So it might take a bit of experimentation to find the exact command, but
> based on the above anyway, it /should/ be pretty easy to tell systemd to
> wait a bit longer for that filesystem.
>
> When you find the right invocation, please reply with it here, as I'm
> sure there's others who will benefit as well.  FWIW, I'm still on
> reiserfs for my spinning rust (only btrfs on my ssds), but I expect I'll
> switch them to btrfs at some point, so I may well use the information
> myself.  =:^)
>

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: mount time of multi-disk arrays
  2014-07-07 14:14   ` Austin S Hemmelgarn
@ 2014-07-07 16:57     ` André-Sebastian Liebe
  0 siblings, 0 replies; 9+ messages in thread
From: André-Sebastian Liebe @ 2014-07-07 16:57 UTC (permalink / raw)
  To: Austin S Hemmelgarn, Konstantinos Skarlatos, linux-btrfs

On 07/07/2014 04:14 PM, Austin S Hemmelgarn wrote:
> On 2014-07-07 09:54, Konstantinos Skarlatos wrote:
>> On 7/7/2014 4:38 μμ, André-Sebastian Liebe wrote:
>>> Hello List,
>>>
>>> can anyone tell me how much time is acceptable and assumable for a
>>> multi-disk btrfs array with classical hard disk drives to mount?
>>>
>>> I'm having a bit of trouble with my current systemd setup, because it
>>> couldn't mount my btrfs raid anymore after adding the 5th drive. With
>>> the 4 drive setup it failed to mount once in a few times. Now it fails
>>> every time because the default timeout of 1m 30s is reached and mount is
>>> aborted.
>>> My last 10 manual mounts took between 1m57s and 2m12s to finish.
>> I have the exact same problem, and have to manually mount my large
>> multi-disk btrfs filesystems, so I would be interested in a solution as
>> well.
>>
>>> My hardware setup contains a
>>> - Intel Core i7 4770
>>> - Kernel 3.15.2-1-ARCH
>>> - 32GB RAM
>>> - dev 1-4 are 4TB Seagate ST4000DM000 (5900rpm)
>>> - dev 5 is a 4TB Western Digital WDC WD40EFRX (5400rpm)
>>>
>>> Thanks in advance
>>>
>>> André-Sebastian Liebe
>>> --------------------------------------------------------------------------------------------------
>>>
>>>
>>> # btrfs fi sh
>>> Label: 'apc01_pool0'  uuid: 066141c6-16ca-4a30-b55c-e606b90ad0fb
>>>          Total devices 5 FS bytes used 14.21TiB
>>>          devid    1 size 3.64TiB used 2.86TiB path /dev/sdd
>>>          devid    2 size 3.64TiB used 2.86TiB path /dev/sdc
>>>          devid    3 size 3.64TiB used 2.86TiB path /dev/sdf
>>>          devid    4 size 3.64TiB used 2.86TiB path /dev/sde
>>>          devid    5 size 3.64TiB used 2.88TiB path /dev/sdb
>>>
>>> Btrfs v3.14.2-dirty
>>>
>>> # btrfs fi df /data/pool0/
>>> Data, single: total=14.28TiB, used=14.19TiB
>>> System, RAID1: total=8.00MiB, used=1.54MiB
>>> Metadata, RAID1: total=26.00GiB, used=20.20GiB
>>> unknown, single: total=512.00MiB, used=0.00
> This is interesting, I actually did some profiling of the mount timings
> for a bunch of different configurations of 4 (identical other than
> hardware age) 1TB Seagate disks.  One of the arrangements I tested was
> Data using single profile and Metadata/System using RAID1.  Based on the
> results I got, and what you are reporting, the mount time doesn't scale
> linearly in proportion to the amount of storage space.
>
> You might want to try the RAID10 profile for Metadata, of the
> configurations I tested, the fastest used Single for Data and RAID10 for
> Metadata/System.
Switching Metadata from raid1 to raid10 reduced mount times from roughly
120s to 38s!
>
> Also, based on the System chunk usage, I'm guessing that you have a LOT
> of subvolumes/snapshots, and I do know that having very large (100+)
> numbers of either does slow down the mount command (I don't think that
> we cache subvolume information between mount invocations, so it has to
> re-parse the system chunks for each individual mount).
No, I had to remove the one and only snapshot to recover from a 'no
space left on device' error and regain metadata space
(http://marc.merlins.org/perso/btrfs/post_2014-05-04_Fixing-Btrfs-Filesystem-Full-Problems.html)

-- 
André-Sebastian Liebe


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: mount time of multi-disk arrays
  2014-07-07 15:48   ` Duncan
  2014-07-07 16:40     ` Benjamin O'Connor
@ 2014-07-07 22:31     ` Konstantinos Skarlatos
  1 sibling, 0 replies; 9+ messages in thread
From: Konstantinos Skarlatos @ 2014-07-07 22:31 UTC (permalink / raw)
  To: Duncan, linux-btrfs

On 7/7/2014 6:48 μμ, Duncan wrote:
> Konstantinos Skarlatos posted on Mon, 07 Jul 2014 16:54:05 +0300 as
> excerpted:
>
>> On 7/7/2014 4:38 μμ, André-Sebastian Liebe wrote:
>>> can anyone tell me how much time is acceptable and assumable for a
>>> multi-disk btrfs array with classical hard disk drives to mount?
>>>
>>> I'm having a bit of trouble with my current systemd setup, because it
>>> couldn't mount my btrfs raid anymore after adding the 5th drive. With
>>> the 4 drive setup it failed to mount once in a few times. Now it fails
>>> every time because the default timeout of 1m 30s is reached and mount is
>>> aborted.
>>> My last 10 manual mounts took between 1m57s and 2m12s to finish.
>> I have the exact same problem, and have to manually mount my large
>> multi-disk btrfs filesystems, so I would be interested in a solution as
>> well.
> I don't have a direct answer, as my btrfs devices are all SSD, but...
>
> a) Btrfs, like some other filesystems, is designed not to need a
> pre-mount (or pre-rw-mount) fsck, because it does what /should/ be a
> quick-scan at mount-time.  However, that isn't always as quick as it
> might be for a number of reasons:
>
> a1) Btrfs is still a relatively immature filesystem and certain
> operations are not yet optimized.  In particular, multi-device btrfs
> operations tend to still be using a first-working-implementation type of
> algorithm instead of a well optimized for parallel operation algorithm,
> and thus often serialize access to multiple devices where a more
> optimized algorithm would parallelize operations across multiple devices
> at the same time.  That will come, but it's not there yet.
>
> a2) Certain operations such as orphan cleanup ("orphans" are files that
> were deleted while they were in use and thus weren't fully deleted at the
> time; if they were still in use at unmount (remount-read-only), cleanup
> is done at mount-time) can delay mount as well.
>
> a3) Inode_cache mount option:  Don't use this unless you can explain
> exactly WHY you are using it, preferably backed up with benchmark
> numbers, etc.  It's useful only on 32-bit, generally high-file-activity
> server systems and has general-case problems, including long mount times
> and possible overflow issues that make it inappropriate for normal use.
> Unfortunately there's a lot of people out there using it that shouldn't
> be, and I even saw it listed on at least one distro (not mine!) wiki. =:^(
>
> a4) The space_cache mount option OTOH *IS* appropriate for normal use
> (and is in fact enabled by default these days), but particularly in
> improper shutdown cases can require rebuilding at mount time -- altho
> this should happen /after/ mount, the system will just be busy for some
> minutes, until the space-cache is rebuilt.  But the IO from a space_cache
> rebuild on one filesystem could slow down the mounting of filesystems
> that mount after it, as well as the boot-time launching of other post-
> mount launched services.
>
> If you're seeing the time go up dramatically with the addition of more
> filesystem devices, however, and you do /not/ have inode_cache active,
> I'd guess it's mainly the not-yet-optimized multi-device operations.
>
>
> b) As with any systemd launched unit, however, there's systemd
> configuration mechanisms for working around specific unit issues,
> including timeout issues.  Of course most systems continue to use fstab
> and let systemd auto-generate the mount units, and in fact that is
> recommended, but either with fstab or directly created mount units,
> there's a timeout configuration option that can be set.
>
> b1) The general systemd *.mount unit [Mount] section option appears to be
> TimeoutSec=.  As is usual with systemd times, the default is seconds, or
> pass the unit(s), like "5min 20s".
>
> b2) I don't see it /specifically/ stated, but with a bit of reading
> between the lines, the corresponding fstab option appears to be either
> x-systemd.timeoutsec= or x-systemd.TimeoutSec= (IOW I'm not sure of the
> case).  You may also want to try x-systemd.device-timeout=, which /is/
> specifically mentioned, altho that appears to be specifically the timeout
> for the device to appear, NOT for the filesystem to mount after it does.
>
> b3) See the systemd.mount (5) and systemd-fstab-generator (8) manpages
> for more, that being what the above is based on.
Thanks for your detailed answer. A mount unit with a larger timeout 
works fine; maybe we should tell distro maintainers to up the limit for 
btrfs to 5 minutes or so?

In my experience, mount time definitely grows as the filesystem grows 
older, and mounting times out once the snapshot count gets above 
500-1000. I guess that's something that can be optimized in the future, 
but I believe stability is a much more urgent need now...

>
> So it might take a bit of experimentation to find the exact command, but
> based on the above anyway, it /should/ be pretty easy to tell systemd to
> wait a bit longer for that filesystem.
>
> When you find the right invocation, please reply with it here, as I'm
> sure there's others who will benefit as well.  FWIW, I'm still on
> reiserfs for my spinning rust (only btrfs on my ssds), but I expect I'll
> switch them to btrfs at some point, so I may well use the information
> myself.  =:^)
>


-- 
Konstantinos Skarlatos


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: mount time of multi-disk arrays
  2014-07-07 14:24   ` André-Sebastian Liebe
@ 2014-07-07 22:34     ` Konstantinos Skarlatos
  0 siblings, 0 replies; 9+ messages in thread
From: Konstantinos Skarlatos @ 2014-07-07 22:34 UTC (permalink / raw)
  To: André-Sebastian Liebe, linux-btrfs

On 7/7/2014 5:24 μμ, André-Sebastian Liebe wrote:
> On 07/07/2014 03:54 PM, Konstantinos Skarlatos wrote:
>> On 7/7/2014 4:38 μμ, André-Sebastian Liebe wrote:
>>> Hello List,
>>>
>>> can anyone tell me how much time is acceptable and assumable for a
>>> multi-disk btrfs array with classical hard disk drives to mount?
>>>
>>> I'm having a bit of trouble with my current systemd setup, because it
>>> couldn't mount my btrfs raid anymore after adding the 5th drive. With
>>> the 4 drive setup it failed to mount once in a few times. Now it fails
>>> every time because the default timeout of 1m 30s is reached and mount is
>>> aborted.
>>> My last 10 manual mounts took between 1m57s and 2m12s to finish.
>> I have the exact same problem, and have to manually mount my large
>> multi-disk btrfs filesystems, so I would be interested in a solution
>> as well.
> Hi Konstantinos, you can work around this by manually creating a systemd
> mount unit.
>
> - First review the autogenerated systemd mount unit (systemctl show
> <your-mount-unit>.mount). You can get the unit name by running
> 'systemctl' and looking for your failed mount.
> - Then take the needed values (After, Before, Conflicts,
> RequiresMountsFor, Where, What, Options, Type, WantedBy) and put them
> into a new systemd mount unit file (for example under
> /usr/lib/systemd/system/<your-mount-unit>.mount).
> - Now add TimeoutSec= with a large enough value below [Mount].
> - If you later want to automount your raid, add the WantedBy= line under
> [Install].
> - Now issue a 'systemctl daemon-reload' and look for error messages in
> syslog.
> - If there are no errors, you can enable your manual mount entry with
> 'systemctl enable <your-mount-unit>.mount' and safely comment out your
> old fstab entry (so systemd no longer autogenerates a unit for it).
>
> -- 8< ----------- 8< ----------- 8< ----------- 8< ----------- 8<
> ----------- 8< ----------- 8< -----------
> [Unit]
> Description=Mount /data/pool0
> After=dev-disk-by\x2duuid-066141c6\x2d16ca\x2d4a30\x2db55c\x2de606b90ad0fb.device
> systemd-journald.socket local-fs-pre.target system.slice -.mount
> Before=umount.target
> Conflicts=umount.target
> RequiresMountsFor=/data
> /dev/disk/by-uuid/066141c6-16ca-4a30-b55c-e606b90ad0fb
>
> [Mount]
> Where=/data/pool0
> What=/dev/disk/by-uuid/066141c6-16ca-4a30-b55c-e606b90ad0fb
> Options=rw,relatime,skip_balance,compress
> Type=btrfs
> TimeoutSec=3min
>
> [Install]
> WantedBy=dev-disk-by\x2duuid-066141c6\x2d16ca\x2d4a30\x2db55c\x2de606b90ad0fb.device
> -- 8< ----------- 8< ----------- 8< ----------- 8< ----------- 8<
> ----------- 8< ----------- 8< -----------

Hi André,
This unit file works for me, thank you for creating it! Can somebody put 
it on the wiki?




>
>
>>> My hardware setup contains a
>>> - Intel Core i7 4770
>>> - Kernel 3.15.2-1-ARCH
>>> - 32GB RAM
>>> - dev 1-4 are 4TB Seagate ST4000DM000 (5900rpm)
>>> - dev 5 is a 4TB Western Digital WDC WD40EFRX (5400rpm)
>>>
>>> Thanks in advance
>>>
>>> André-Sebastian Liebe
>>> --------------------------------------------------------------------------------------------------
>>>
>>>
>>> # btrfs fi sh
>>> Label: 'apc01_pool0'  uuid: 066141c6-16ca-4a30-b55c-e606b90ad0fb
>>>           Total devices 5 FS bytes used 14.21TiB
>>>           devid    1 size 3.64TiB used 2.86TiB path /dev/sdd
>>>           devid    2 size 3.64TiB used 2.86TiB path /dev/sdc
>>>           devid    3 size 3.64TiB used 2.86TiB path /dev/sdf
>>>           devid    4 size 3.64TiB used 2.86TiB path /dev/sde
>>>           devid    5 size 3.64TiB used 2.88TiB path /dev/sdb
>>>
>>> Btrfs v3.14.2-dirty
>>>
>>> # btrfs fi df /data/pool0/
>>> Data, single: total=14.28TiB, used=14.19TiB
>>> System, RAID1: total=8.00MiB, used=1.54MiB
>>> Metadata, RAID1: total=26.00GiB, used=20.20GiB
>>> unknown, single: total=512.00MiB, used=0.00
>>>
>>>
>>> -- 
>>> To unsubscribe from this list: send the line "unsubscribe
>>> linux-btrfs" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> -- 
>> Konstantinos Skarlatos
> --
> André-Sebastian Liebe
>


-- 
Konstantinos Skarlatos


^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, newest: ~2014-07-07 22:34 UTC

Thread overview: 9+ messages
2014-07-07 13:38 mount time of multi-disk arrays André-Sebastian Liebe
2014-07-07 13:54 ` Konstantinos Skarlatos
2014-07-07 14:14   ` Austin S Hemmelgarn
2014-07-07 16:57     ` André-Sebastian Liebe
2014-07-07 14:24   ` André-Sebastian Liebe
2014-07-07 22:34     ` Konstantinos Skarlatos
2014-07-07 15:48   ` Duncan
2014-07-07 16:40     ` Benjamin O'Connor
2014-07-07 22:31     ` Konstantinos Skarlatos
