* Volume appears full but TB's of space available
@ 2017-04-07  0:47 John Petrini
  2017-04-07  1:15 ` John Petrini
  2017-04-07  1:17 ` Chris Murphy
  0 siblings, 2 replies; 20+ messages in thread
From: John Petrini @ 2017-04-07  0:47 UTC (permalink / raw)
  To: linux-btrfs

Hello List,

I have a volume that appears to be full despite having multiple
terabytes of free space available. Just yesterday I ran a rebalance,
but it didn't change anything. I've just added two more disks to the
array and am currently in the process of another rebalance, but the
available space has not increased.

Currently I can still write to the volume (I haven't tried any large
writes) so I'm not sure if this is just a reporting issue or if writes
will eventually fail.

Any help is appreciated. Here are the details:

uname -a
Linux yuengling.johnpetrini.com 4.4.0-66-generic #87-Ubuntu SMP Fri
Mar 3 15:29:05 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

btrfs --version
btrfs-progs v4.4

sudo btrfs fi df /mnt/storage-array/
Data, RAID10: total=10.72TiB, used=10.72TiB
System, RAID0: total=128.00MiB, used=944.00KiB
Metadata, RAID10: total=14.00GiB, used=12.63GiB
GlobalReserve, single: total=512.00MiB, used=0.00B

sudo btrfs fi show /mnt/storage-array/
Label: none  uuid: e113ab87-7869-4ec7-9508-95691f455018
Total devices 10 FS bytes used 10.73TiB
devid    1 size 4.55TiB used 2.65TiB path /dev/sdj
devid    2 size 4.55TiB used 2.65TiB path /dev/sdk
devid    3 size 3.64TiB used 2.65TiB path /dev/sdd
devid    4 size 3.64TiB used 2.65TiB path /dev/sdf
devid    5 size 3.64TiB used 2.65TiB path /dev/sdg
devid    6 size 3.64TiB used 2.65TiB path /dev/sde
devid    7 size 3.64TiB used 2.65TiB path /dev/sdb
devid    8 size 3.64TiB used 2.65TiB path /dev/sdc
devid    9 size 9.10TiB used 149.00GiB path /dev/sdh
devid   10 size 9.10TiB used 149.00GiB path /dev/sdi

sudo btrfs fi usage /mnt/storage-array/
Overall:
    Device size:   49.12TiB
    Device allocated:   21.47TiB
    Device unallocated:   27.65TiB
    Device missing:      0.00B
    Used:   21.45TiB
    Free (estimated):   13.83TiB (min: 13.83TiB)
    Data ratio:       2.00
    Metadata ratio:       2.00
    Global reserve:  512.00MiB (used: 0.00B)

Data,RAID10: Size:10.72TiB, Used:10.71TiB
   /dev/sdb    1.32TiB
   /dev/sdc    1.32TiB
   /dev/sdd    1.32TiB
   /dev/sde    1.32TiB
   /dev/sdf    1.32TiB
   /dev/sdg    1.32TiB
   /dev/sdh   72.00GiB
   /dev/sdi   72.00GiB
   /dev/sdj    1.32TiB
   /dev/sdk    1.32TiB

Metadata,RAID10: Size:14.00GiB, Used:12.63GiB
   /dev/sdb    1.75GiB
   /dev/sdc    1.75GiB
   /dev/sdd    1.75GiB
   /dev/sde    1.75GiB
   /dev/sdf    1.75GiB
   /dev/sdg    1.75GiB
   /dev/sdj    1.75GiB
   /dev/sdk    1.75GiB

System,RAID0: Size:128.00MiB, Used:944.00KiB
   /dev/sdb   16.00MiB
   /dev/sdc   16.00MiB
   /dev/sdd   16.00MiB
   /dev/sde   16.00MiB
   /dev/sdf   16.00MiB
   /dev/sdg   16.00MiB
   /dev/sdj   16.00MiB
   /dev/sdk   16.00MiB

Unallocated:
   /dev/sdb    2.31TiB
   /dev/sdc    2.31TiB
   /dev/sdd    2.31TiB
   /dev/sde    2.31TiB
   /dev/sdf    2.31TiB
   /dev/sdg    2.31TiB
   /dev/sdh    9.03TiB
   /dev/sdi    9.03TiB
   /dev/sdj    3.22TiB
   /dev/sdk    3.22TiB

Thank You,

John Petrini

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Volume appears full but TB's of space available
  2017-04-07  0:47 Volume appears full but TB's of space available John Petrini
@ 2017-04-07  1:15 ` John Petrini
  2017-04-07  1:21   ` Chris Murphy
  2017-04-07  1:17 ` Chris Murphy
  1 sibling, 1 reply; 20+ messages in thread
From: John Petrini @ 2017-04-07  1:15 UTC (permalink / raw)
  To: linux-btrfs

Okay so I came across this bug report:
https://bugzilla.redhat.com/show_bug.cgi?id=1243986

It looks like I'm just misinterpreting the output of btrfs fi df. What
should I be looking at to determine the actual free space? Is Free
(estimated):   13.83TiB (min: 13.83TiB) the proper metric?

Simply running df does not seem to report the usage properly

/dev/sdj                      25T   11T  5.9T  65% /mnt/storage-array

Thank you,

John Petrini

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Volume appears full but TB's of space available
  2017-04-07  0:47 Volume appears full but TB's of space available John Petrini
  2017-04-07  1:15 ` John Petrini
@ 2017-04-07  1:17 ` Chris Murphy
  1 sibling, 0 replies; 20+ messages in thread
From: Chris Murphy @ 2017-04-07  1:17 UTC (permalink / raw)
  To: John Petrini; +Cc: Btrfs BTRFS

On Thu, Apr 6, 2017 at 6:47 PM, John Petrini <jpetrini@coredial.com> wrote:

> sudo btrfs fi df /mnt/storage-array/
> Data, RAID10: total=10.72TiB, used=10.72TiB
> System, RAID0: total=128.00MiB, used=944.00KiB
> Metadata, RAID10: total=14.00GiB, used=12.63GiB
> GlobalReserve, single: total=512.00MiB, used=0.00B

The "System, RAID0" line is kinda scary. The system chunk is raid0, so
ostensibly a single device failure means the entire array is lost.

The fastest way to fix it is:

btrfs balance start -mconvert=raid10,soft <mountpoint>

That will make the system chunk raid10.
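
For example, with the mount point from the original report (and
re-running fi df afterwards to confirm the System line now shows RAID10):

sudo btrfs balance start -mconvert=raid10,soft /mnt/storage-array/
sudo btrfs fi df /mnt/storage-array/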


>
> sudo btrfs fi usage /mnt/storage-array/
> Overall:
>     Device size:   49.12TiB
>     Device allocated:   21.47TiB
>     Device unallocated:   27.65TiB
>     Device missing:      0.00B
>     Used:   21.45TiB
>     Free (estimated):   13.83TiB (min: 13.83TiB)
>     Data ratio:       2.00
>     Metadata ratio:       2.00
>     Global reserve:  512.00MiB (used: 0.00B)
>
> Data,RAID10: Size:10.72TiB, Used:10.71TiB

This is saying you have 10.72T of data. But because it's raid10, it
takes up 2x that much raw space. That's what's reflected by the
Overall "Used:" value of 21.45T, plus some extra for metadata, which is
also 2x.
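
As a quick cross-check against the numbers above:

  10.71 TiB data      x 2 (raid10) ~= 21.43 TiB
  12.63 GiB metadata  x 2 (raid10) ~=  0.02 TiB
                              total ~= 21.45 TiB

which is the Overall "Used:" figure.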




-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Volume appears full but TB's of space available
  2017-04-07  1:15 ` John Petrini
@ 2017-04-07  1:21   ` Chris Murphy
  2017-04-07  1:31     ` John Petrini
  0 siblings, 1 reply; 20+ messages in thread
From: Chris Murphy @ 2017-04-07  1:21 UTC (permalink / raw)
  To: John Petrini; +Cc: Btrfs BTRFS

On Thu, Apr 6, 2017 at 7:15 PM, John Petrini <jpetrini@coredial.com> wrote:
> Okay so I came across this bug report:
> https://bugzilla.redhat.com/show_bug.cgi?id=1243986
>
> It looks like I'm just misinterpreting the output of btrfs fi df. What
> should I be looking at to determine the actual free space? Is Free
> (estimated):   13.83TiB (min: 13.83TiB) the proper metric?
>
> Simply running df does not seem to report the usage properly
>
> /dev/sdj                      25T   11T  5.9T  65% /mnt/storage-array


Free (estimated) should be correct. And df -h uses IEC units, so I'd
expect it to be closer to the value from btrfs fi usage than this. But
the code has changed over time; I'm not sure when the last adjustment
was made.
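
Roughly, that estimate follows from the unallocated space at the raid10
data ratio: 27.65 TiB unallocated / 2 ~= 13.83 TiB, plus the ~10 GiB
still free inside the already-allocated data chunks, which matches the
"Free (estimated)" line above.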

-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Volume appears full but TB's of space available
  2017-04-07  1:21   ` Chris Murphy
@ 2017-04-07  1:31     ` John Petrini
  2017-04-07  2:42       ` Chris Murphy
  0 siblings, 1 reply; 20+ messages in thread
From: John Petrini @ 2017-04-07  1:31 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Btrfs BTRFS

Hi Chris,

I've followed your advice and converted the system chunk to raid10. I
hadn't noticed it was raid0 and it's scary to think that I've been
running this array for three months like that. Thank you for saving me
a lot of pain down the road!

Also thank you for the clarification on the output - this is making a
lot more sense.

Regards,

John Petrini

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Volume appears full but TB's of space available
  2017-04-07  1:31     ` John Petrini
@ 2017-04-07  2:42       ` Chris Murphy
  2017-04-07  3:25         ` John Petrini
  0 siblings, 1 reply; 20+ messages in thread
From: Chris Murphy @ 2017-04-07  2:42 UTC (permalink / raw)
  To: John Petrini; +Cc: Chris Murphy, Btrfs BTRFS

On Thu, Apr 6, 2017 at 7:31 PM, John Petrini <jpetrini@coredial.com> wrote:
> Hi Chris,
>
> I've followed your advice and converted the system chunk to raid10. I
> hadn't noticed it was raid0 and it's scary to think that I've been
> running this array for three months like that. Thank you for saving me
> a lot of pain down the road!

For what it's worth, it is imperative to keep frequent backups with
Btrfs raid10; in some ways it is more like raid0+1. It can only
tolerate the loss of a single device. It will continue to function
with 2+ devices missing, in a very deceptive degraded state, until it
inevitably hits metadata or data chunks with both copies missing, and
then it will faceplant. At that point you'll be looking at a scrape
operation.

It's not like conventional raid10, where you can lose one drive from
each mirrored pair. Btrfs raid10 mirrors chunks, not drives. So your
metadata and data are distributed across all of the drives, and that
in effect means you can only lose 1 drive. If you lose a 2nd drive,
some amount of metadata and data will have been lost.


-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Volume appears full but TB's of space available
  2017-04-07  2:42       ` Chris Murphy
@ 2017-04-07  3:25         ` John Petrini
  2017-04-07 11:41           ` Austin S. Hemmelgarn
  0 siblings, 1 reply; 20+ messages in thread
From: John Petrini @ 2017-04-07  3:25 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Btrfs BTRFS

Interesting. That's the first time I'm hearing this. If that's the
case, I feel like it's a stretch to call it RAID10 at all. It sounds a
lot more like basic replication, similar to Ceph, only Ceph understands
failure domains and can therefore be configured to handle device
failure (albeit at a higher level).

I do of course keep backups, but I chose RAID10 for the mix of
performance and reliability. It doesn't seem worth losing 50% of
my usable space for the performance gain alone.

Thank you for letting me know about this. Knowing that, I think I may
have to reconsider my choice here. I've really been enjoying the
flexibility of BTRFS, which is why I switched to it in the first place,
but with RAID5/6 still experimental and what you've just told me, I'm
beginning to doubt that it's the right choice.

What's more concerning is that I haven't found a good way to monitor
BTRFS. I might be able to accept that the array can only handle a
single drive failure if I were confident that I could detect it, but so
far I haven't found a good solution for this.
___

John Petrini



On Thu, Apr 6, 2017 at 10:42 PM, Chris Murphy <lists@colorremedies.com> wrote:
> On Thu, Apr 6, 2017 at 7:31 PM, John Petrini <jpetrini@coredial.com> wrote:
>> Hi Chris,
>>
>> I've followed your advice and converted the system chunk to raid10. I
>> hadn't noticed it was raid0 and it's scary to think that I've been
>> running this array for three months like that. Thank you for saving me
>> a lot of pain down the road!
>
> For what it's worth, it is imperative to keep frequent backups with
> Btrfs raid10; in some ways it is more like raid0+1. It can only
> tolerate the loss of a single device. It will continue to function
> with 2+ devices missing, in a very deceptive degraded state, until it
> inevitably hits metadata or data chunks with both copies missing, and
> then it will faceplant. At that point you'll be looking at a scrape
> operation.
>
> It's not like conventional raid10, where you can lose one drive from
> each mirrored pair. Btrfs raid10 mirrors chunks, not drives. So your
> metadata and data are distributed across all of the drives, and that
> in effect means you can only lose 1 drive. If you lose a 2nd drive,
> some amount of metadata and data will have been lost.
>
>
> --
> Chris Murphy

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Volume appears full but TB's of space available
  2017-04-07  3:25         ` John Petrini
@ 2017-04-07 11:41           ` Austin S. Hemmelgarn
  2017-04-07 13:28             ` John Petrini
                               ` (2 more replies)
  0 siblings, 3 replies; 20+ messages in thread
From: Austin S. Hemmelgarn @ 2017-04-07 11:41 UTC (permalink / raw)
  To: John Petrini; +Cc: Chris Murphy, Btrfs BTRFS

On 2017-04-06 23:25, John Petrini wrote:
> Interesting. That's the first time I'm hearing this. If that's the
> case, I feel like it's a stretch to call it RAID10 at all. It sounds a
> lot more like basic replication, similar to Ceph, only Ceph understands
> failure domains and can therefore be configured to handle device
> failure (albeit at a higher level).
Yeah, the stacking is a bit odd, and there are some rather annoying 
caveats that make most of the names other than raid5/raid6 misleading. 
In fact, when run on more than 2 disks, raid1 mode in BTRFS is closer to 
what most people think of as RAID10 than BTRFS raid10 mode is, although 
it "stripes" at a much higher level.
>
> I do of course keep backups, but I chose RAID10 for the mix of
> performance and reliability. It doesn't seem worth losing 50% of
> my usable space for the performance gain alone.
>
> Thank you for letting me know about this. Knowing that, I think I may
> have to reconsider my choice here. I've really been enjoying the
> flexibility of BTRFS, which is why I switched to it in the first place,
> but with RAID5/6 still experimental and what you've just told me, I'm
> beginning to doubt that it's the right choice.
There are some other options in how you configure it.  Most of the more 
useful operational modes actually require stacking BTRFS on top of LVM 
or MD.  I'm rather fond of running BTRFS raid1 on top of LVM RAID0 
volumes, which while it provides no better data safety than BTRFS raid10 
mode, gets noticeably better performance.  You can also reverse that to 
get something more like traditional RAID10, but you lose the 
self-correcting aspect of BTRFS.
>
> What's more concerning is that I haven't found a good way to monitor
> BTRFS. I might be able to accept that the array can only handle a
> single drive failure if I were confident that I could detect it, but so
> far I haven't found a good solution for this.
This I can actually give some advice on.  There are a couple of options, 
but the easiest is to find a piece of generic monitoring software that 
can check the return code of external programs, and then write some 
simple scripts to perform the checks on BTRFS.  The things you want to 
keep an eye on are:

1. Output of 'btrfs dev stats'.  If you've got a new enough copy of 
btrfs-progs, you can pass '--check' and the return code will be non-zero 
if any of the error counters isn't zero.  If you've got to use an older 
version, you'll instead have to write a script to parse the output (I 
will comment that this is much easier in a language like Perl or Python 
than it is in bash).  You want to watch for steady increases in error 
counts or sudden large jumps.  Single intermittent errors are worth 
tracking, but they tend to happen more frequently the larger the array 
is (a small cron-able sketch of this check, together with item 2, 
follows below).

2. Results from 'btrfs scrub'.  This is somewhat tricky because scrub is 
either asynchronous or blocks for a _long_ time.  The simplest option 
I've found is to fire off an asynchronous scrub to run during down-time, 
and then schedule recurring checks with 'btrfs scrub status'.  On the 
plus side, 'btrfs scrub status' already returns non-zero if the scrub 
found errors.

3. Watch the filesystem flags.  Some monitoring software can easily do 
this for you (Monit for example can watch for changes in the flags). 
The general idea here is that BTRFS will go read-only if it hits certain 
serious errors, so you can watch for that transition and send a 
notification when it happens.  This is also worth watching since the 
filesystem flags should not change during normal operation of any 
filesystem.

4. Watch SMART status on the drives and run regular self-tests.  Most of 
the time, issues will show up here before they show up in the FS, so by 
watching this, you may have an opportunity to replace devices before the 
filesystem ends up completely broken.

5. If you're feeling really ambitious, watch the kernel logs for errors 
from BTRFS and whatever storage drivers you use.  This is the least 
reliable thing out of this list to automate,  so I'd not suggest just 
doing this by itself.

The first two items are BTRFS-specific.  The rest, however, are standard 
things you should be monitoring regardless of what type of storage stack 
you have.  Of these, item 3 will immediately trigger in the event of a 
catastrophic device failure, while 1, 2, and 5 will provide better 
coverage of slow failures, and 4 will cover both aspects.
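
To make the first two items concrete, here is a minimal cron-able sketch 
of the kind of script I mean (it assumes a btrfs-progs new enough for 
'dev stats --check'; the mount point and the mail command are just 
placeholders to adapt):

#!/bin/sh
# Hypothetical example - adjust MNT and the notification command.
MNT=/mnt/storage-array

# Item 1: device error counters; --check exits non-zero if any counter
# is non-zero.
if ! btrfs device stats --check "$MNT" >/dev/null; then
    btrfs device stats "$MNT" | mail -s "btrfs dev stats errors on $MNT" root
fi

# Item 2: result of the last scrub (started separately during down-time
# with 'btrfs scrub start'); scrub status exits non-zero if errors were found.
if ! btrfs scrub status "$MNT" >/dev/null; then
    btrfs scrub status "$MNT" | mail -s "btrfs scrub errors on $MNT" root
fi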

As far as what to use to actually track these, that really depends on 
your use case.  For tracking on a per-system basis, I'd suggest 
Monit: it's efficient, easy to configure, provides some degree of error 
resilience, and can actually cover a lot of monitoring tasks beyond 
stuff like this.  If you want some kind of centralized monitoring, I'd 
probably go with Nagios, but that's more because that's the standard for 
that type of thing, not because I've used it myself (I much prefer 
per-system decentralized monitoring, with only the checks that systems 
are online centralized).

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Volume appears full but TB's of space available
  2017-04-07 11:41           ` Austin S. Hemmelgarn
@ 2017-04-07 13:28             ` John Petrini
  2017-04-07 13:50               ` Austin S. Hemmelgarn
  2017-04-07 16:04             ` Chris Murphy
  2017-04-08  5:12             ` Duncan
  2 siblings, 1 reply; 20+ messages in thread
From: John Petrini @ 2017-04-07 13:28 UTC (permalink / raw)
  To: Austin S. Hemmelgarn; +Cc: Chris Murphy, Btrfs BTRFS

Hi Austin,

Thanks for taking to time to provide all of this great information!

You've got me curious about RAID1. If I were to convert the array to
RAID1 could it then sustain a multi drive failure? Or in other words
do I actually end up with mirrored pairs or can a chunk still be
mirrored to any disk in the array? Are there performance implications
to using RAID1 vs RAID10?

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Volume appears full but TB's of space available
  2017-04-07 13:28             ` John Petrini
@ 2017-04-07 13:50               ` Austin S. Hemmelgarn
  2017-04-07 16:28                 ` Chris Murphy
  0 siblings, 1 reply; 20+ messages in thread
From: Austin S. Hemmelgarn @ 2017-04-07 13:50 UTC (permalink / raw)
  To: John Petrini; +Cc: Chris Murphy, Btrfs BTRFS

On 2017-04-07 09:28, John Petrini wrote:
> Hi Austin,
>
> Thanks for taking to time to provide all of this great information!
Glad I could help.
>
> You've got me curious about RAID1. If I were to convert the array to
> RAID1 could it then sustain a multi drive failure? Or in other words
> do I actually end up with mirrored pairs or can a chunk still be
> mirrored to any disk in the array? Are there performance implications
> to using RAID1 vs RAID10?
>
For raid10, your data is stored as 2 replicas striped at or below the 
filesystem-block level across all the disks in the array.  Because of 
how the data striping is done currently, you're functionally guaranteed 
to lose data if you lose more than one disk in raid10 mode.  This 
theoretically could be improved so that partial losses could be 
recovered, but doing so with the current implementation would be 
extremely complicated, and as such is not a high priority (although 
patches would almost certainly be welcome).

For raid1, your data is stored as 2 replicas with each entirely on one 
disk, but individual chunks (the higher level allocation in BTRFS) are 
distributed in a round-robin fashion among the disks, so any given 
filesystem block is on exactly 2 disks.  With the current 
implementation, for any reasonably utilized filesystem, you will lose 
data if you lose 2 or more disks in raid1 mode.  That said, there are 
plans (still currently vaporware in favor of getting raid5/6 working) to 
add arbitrary replication levels to BTRFS, so once that hits, you could 
set things to have as many replicas as you want.

In effect, both can currently only sustain one disk failure.  Losing 
2 disks in raid10 will probably corrupt files (currently, it will 
functionally kill the FS, although with a bit of theoretically simple 
work this could be changed), while losing 2 disks in raid1 mode will 
usually just make files disappear unless they are larger than the data 
chunk size (usually between 1-5GB depending on the size of the FS).  So 
if you're just storing small files, you'll have an easier time 
quantifying data loss with raid1 than with raid10.  Both modes have the 
possibility of completely losing the FS if the lost disks happen to take 
out the System chunk.

As for performance, raid10 mode in BTRFS performs better than raid1, but 
you can get even better performance than that by running BTRFS in raid1 
mode on top of 2 LVM or MD raid0 volumes.  Such a configuration provides 
the same effective data safety as BTRFS raid10, but can get anywhere 
from 5-30% better performance depending on the workload.

If you care about both performance and data safety, I would suggest 
using BTRFS raid1 mode on top of LVM or MD RAID0 together with having 
good backups and good monitoring.  Statistically speaking, catastrophic 
hardware failures are rare, and you'll usually have more than enough 
warning that a device is failing before it actually does, so provided 
you keep on top of monitoring and replace disks that are showing signs 
of impending failure as soon as possible, you will be no worse off in 
terms of data integrity than running ext4 or XFS on top of a LVM or MD 
RAID10 volume.
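
As a concrete sketch of that kind of stack (device names purely 
illustrative; two MD RAID0 legs with BTRFS raid1 mirroring between them):

mdadm --create /dev/md0 --level=0 --raid-devices=5 /dev/sd[b-f]
mdadm --create /dev/md1 --level=0 --raid-devices=5 /dev/sd[g-k]
mkfs.btrfs -d raid1 -m raid1 /dev/md0 /dev/md1

Each leg stripes across half the drives, and BTRFS keeps one copy of 
every block on each leg.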

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Volume appears full but TB's of space available
  2017-04-07 11:41           ` Austin S. Hemmelgarn
  2017-04-07 13:28             ` John Petrini
@ 2017-04-07 16:04             ` Chris Murphy
  2017-04-07 16:51               ` Austin S. Hemmelgarn
  2017-04-08  5:12             ` Duncan
  2 siblings, 1 reply; 20+ messages in thread
From: Chris Murphy @ 2017-04-07 16:04 UTC (permalink / raw)
  To: Austin S. Hemmelgarn; +Cc: John Petrini, Chris Murphy, Btrfs BTRFS

On Fri, Apr 7, 2017 at 5:41 AM, Austin S. Hemmelgarn
<ahferroin7@gmail.com> wrote:

> I'm rather fond of running BTRFS raid1 on top of LVM RAID0 volumes,
> which while it provides no better data safety than BTRFS raid10 mode, gets
> noticeably better performance.

This does in fact have better data safety than Btrfs raid10 because it
is possible to lose more than one drive without data loss. You can
only lose drives on one side of the mirroring, however. This is a
conventional raid0+1, so it's not as scalable as raid10 when it comes
to rebuild time.



-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Volume appears full but TB's of space available
  2017-04-07 13:50               ` Austin S. Hemmelgarn
@ 2017-04-07 16:28                 ` Chris Murphy
  2017-04-07 16:58                   ` Austin S. Hemmelgarn
  0 siblings, 1 reply; 20+ messages in thread
From: Chris Murphy @ 2017-04-07 16:28 UTC (permalink / raw)
  To: Austin S. Hemmelgarn; +Cc: John Petrini, Chris Murphy, Btrfs BTRFS

On Fri, Apr 7, 2017 at 7:50 AM, Austin S. Hemmelgarn
<ahferroin7@gmail.com> wrote:

> If you care about both performance and data safety, I would suggest using
> BTRFS raid1 mode on top of LVM or MD RAID0 together with having good backups
> and good monitoring.  Statistically speaking, catastrophic hardware failures
> are rare, and you'll usually have more than enough warning that a device is
> failing before it actually does, so provided you keep on top of monitoring
> and replace disks that are showing signs of impending failure as soon as
> possible, you will be no worse off in terms of data integrity than running
> ext4 or XFS on top of a LVM or MD RAID10 volume.


Depending on the workload, and what replication is being used by Ceph
above this storage stack, it might make more sense to do
something like three lvm/md raid5 arrays, and then Btrfs single data,
raid1 metadata, across those three raid5s. That gives up only three
drives to parity rather than half the drives, and the rebuild after a
single drive failure is shorter than it would be after losing one drive
in a raid0 array.
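
A sketch of that layout, with made-up device names (each raid5 gives up
one drive to parity):

mdadm --create /dev/md0 --level=5 --raid-devices=4 /dev/sd[b-e]
mdadm --create /dev/md1 --level=5 --raid-devices=4 /dev/sd[f-i]
mdadm --create /dev/md2 --level=5 --raid-devices=4 /dev/sd[j-m]
mkfs.btrfs -d single -m raid1 /dev/md0 /dev/md1 /dev/md2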

If this is one ceph host, then it might make sense to split the drives
up so there are two storage bricks using ceph replication between them
for the equivalent of raid1. One brick can do Btrfs on LVM/md raid5,
call it brick A. The other brick can do XFS on LVM/md linear, call it
brick B. The advantage there is that the different bricks are going to
have faster commit-to-stable-media times with a mixed workload. The
Btrfs on raid5 brick will do better with sequential reads and writes.
The XFS on linear will do better with metadata-heavy reads and writes.
There's probably some Ceph tuning where you can point certain
workloads to particular volumes, where those volumes are backed by
different priorities to the underlying storage. So you'd set up ceph
volume "mail" to be backed in order by brick B then A.

It's not very well known, but XFS will parallelize across drives in a
linear/concat arrangement; it's quite useful for, e.g., busy mail
servers.
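
A sketch of the linear/concat idea, again with made-up device names;
roughly speaking, the parallelism comes from XFS allocation groups
ending up on different underlying drives:

mdadm --create /dev/md3 --level=linear --raid-devices=4 /dev/sd[b-e]
mkfs.xfs -d agcount=8 /dev/md3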


-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Volume appears full but TB's of space available
  2017-04-07 16:04             ` Chris Murphy
@ 2017-04-07 16:51               ` Austin S. Hemmelgarn
  2017-04-07 16:58                 ` John Petrini
  0 siblings, 1 reply; 20+ messages in thread
From: Austin S. Hemmelgarn @ 2017-04-07 16:51 UTC (permalink / raw)
  To: Chris Murphy; +Cc: John Petrini, Btrfs BTRFS

On 2017-04-07 12:04, Chris Murphy wrote:
> On Fri, Apr 7, 2017 at 5:41 AM, Austin S. Hemmelgarn
> <ahferroin7@gmail.com> wrote:
>
>> I'm rather fond of running BTRFS raid1 on top of LVM RAID0 volumes,
>> which while it provides no better data safety than BTRFS raid10 mode, gets
>> noticeably better performance.
>
> This does in fact have better data safety than Btrfs raid10 because it
> is possible to lose more than one drive without data loss. You can
> only lose drives on one side of the mirroring, however. This is a
> conventional raid0+1, so it's not as scalable as raid10 when it comes
> to rebuild time.
>
That's a good point that I don't often remember, and I'm pretty sure 
that such an array will rebuild slower from a single device loss than 
BTRFS raid10 would, but most of that should be that BTRFS is smart 
enough to only rewrite what it has to.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Volume appears full but TB's of space available
  2017-04-07 16:28                 ` Chris Murphy
@ 2017-04-07 16:58                   ` Austin S. Hemmelgarn
  2017-04-07 17:05                     ` John Petrini
  0 siblings, 1 reply; 20+ messages in thread
From: Austin S. Hemmelgarn @ 2017-04-07 16:58 UTC (permalink / raw)
  To: Chris Murphy, John Petrini; +Cc: Btrfs BTRFS

On 2017-04-07 12:28, Chris Murphy wrote:
> On Fri, Apr 7, 2017 at 7:50 AM, Austin S. Hemmelgarn
> <ahferroin7@gmail.com> wrote:
>
>> If you care about both performance and data safety, I would suggest using
>> BTRFS raid1 mode on top of LVM or MD RAID0 together with having good backups
>> and good monitoring.  Statistically speaking, catastrophic hardware failures
>> are rare, and you'll usually have more than enough warning that a device is
>> failing before it actually does, so provided you keep on top of monitoring
>> and replace disks that are showing signs of impending failure as soon as
>> possible, you will be no worse off in terms of data integrity than running
>> ext4 or XFS on top of a LVM or MD RAID10 volume.
>
>
> Depending on the workload, and what replication is being used by Ceph
> above this storage stack, it might make more sense to do
> something like three lvm/md raid5 arrays, and then Btrfs single data,
> raid1 metadata, across those three raid5s. That gives up only three
> drives to parity rather than half the drives, and the rebuild after a
> single drive failure is shorter than it would be after losing one drive
> in a raid0 array.
Ah, I had forgotten it was a Ceph back-end system.  In that case, I 
would actually suggest essentially the same setup that Chris did, 
although I would personally be a bit more conservative and use RAID6 
instead of RAID5 for the LVM/MD arrays.  As he said though, it really 
depends on what higher-level replication you're doing.  In particular, 
if you're running erasure coding instead of replication at the Ceph 
level, I would probably still go with BTRFS raid1 on top of LVM/MD RAID0 
just to balance out the performance hit from the erasure coding.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Volume appears full but TB's of space available
  2017-04-07 16:51               ` Austin S. Hemmelgarn
@ 2017-04-07 16:58                 ` John Petrini
  2017-04-07 17:04                   ` Austin S. Hemmelgarn
  0 siblings, 1 reply; 20+ messages in thread
From: John Petrini @ 2017-04-07 16:58 UTC (permalink / raw)
  To: Austin S. Hemmelgarn; +Cc: Chris Murphy, Btrfs BTRFS

When you say "running BTRFS raid1 on top of LVM RAID0 volumes" do you
mean creating two LVM RAID-0 volumes and then putting BTRFS RAID1 on
the two resulting logical volumes?
___

John Petrini



On Fri, Apr 7, 2017 at 12:51 PM, Austin S. Hemmelgarn
<ahferroin7@gmail.com> wrote:
> On 2017-04-07 12:04, Chris Murphy wrote:
>>
>> On Fri, Apr 7, 2017 at 5:41 AM, Austin S. Hemmelgarn
>> <ahferroin7@gmail.com> wrote:
>>
>>> I'm rather fond of running BTRFS raid1 on top of LVM RAID0 volumes,
>>> which while it provides no better data safety than BTRFS raid10 mode,
>>> gets
>>> noticeably better performance.
>>
>>
>> This does in fact have better data safety than Btrfs raid10 because it
>> is possible to lose more than one drive without data loss. You can
>> only lose drives on one side of the mirroring, however. This is a
>> conventional raid0+1, so it's not as scalable as raid10 when it comes
>> to rebuild time.
>>
> That's a good point that I don't often remember, and I'm pretty sure that
> such an array will rebuild slower from a single device loss than BTRFS
> raid10 would, but most of that should be that BTRFS is smart enough to only
> rewrite what it has to.
>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Volume appears full but TB's of space available
  2017-04-07 16:58                 ` John Petrini
@ 2017-04-07 17:04                   ` Austin S. Hemmelgarn
  0 siblings, 0 replies; 20+ messages in thread
From: Austin S. Hemmelgarn @ 2017-04-07 17:04 UTC (permalink / raw)
  To: John Petrini; +Cc: Chris Murphy, Btrfs BTRFS

On 2017-04-07 12:58, John Petrini wrote:
> When you say "running BTRFS raid1 on top of LVM RAID0 volumes" do you
> mean creating two LVM RAID-0 volumes and then putting BTRFS RAID1 on
> the two resulting logical volumes?
Yes, although it doesn't have to be LVM; it could just as easily be MD 
or even hardware RAID (I just prefer LVM for the flexibility it offers).

A quick tip regarding this: you seem to get the best performance if the 
stripe size (the -I option for lvcreate) is chosen so that it either 
matches the BTRFS block size, or is small enough that each BTRFS block 
gets striped across all the disks.
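
For example (volume group, LV names, and sizes are made up; -i is the 
number of stripes, and -I 16k here matches the default 16 KiB BTRFS node 
size as one reading of the rule of thumb above):

lvcreate --type striped -i 5 -I 16k -L 9T -n r0_a vg_storage /dev/sd[b-f]
lvcreate --type striped -i 5 -I 16k -L 9T -n r0_b vg_storage /dev/sd[g-k]
mkfs.btrfs -d raid1 -m raid1 /dev/vg_storage/r0_a /dev/vg_storage/r0_b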


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Volume appears full but TB's of space available
  2017-04-07 16:58                   ` Austin S. Hemmelgarn
@ 2017-04-07 17:05                     ` John Petrini
  2017-04-07 17:11                       ` Austin S. Hemmelgarn
  0 siblings, 1 reply; 20+ messages in thread
From: John Petrini @ 2017-04-07 17:05 UTC (permalink / raw)
  To: Austin S. Hemmelgarn; +Cc: Chris Murphy, Btrfs BTRFS

The use case actually is not Ceph; I was just drawing a comparison
between Ceph's object replication strategy and BTRFS's chunk mirroring.

I do find the conversation interesting, however, as I work with Ceph
quite a lot but have always gone with the default XFS filesystem
on the OSDs.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Volume appears full but TB's of space available
  2017-04-07 17:05                     ` John Petrini
@ 2017-04-07 17:11                       ` Austin S. Hemmelgarn
  0 siblings, 0 replies; 20+ messages in thread
From: Austin S. Hemmelgarn @ 2017-04-07 17:11 UTC (permalink / raw)
  To: John Petrini; +Cc: Chris Murphy, Btrfs BTRFS

On 2017-04-07 13:05, John Petrini wrote:
> The use case actually is not Ceph; I was just drawing a comparison
> between Ceph's object replication strategy and BTRFS's chunk mirroring.
That's actually a really good comparison that I hadn't thought of 
before.  From what I can tell from my limited understanding of how Ceph 
works, the general principles are pretty similar, except that BTRFS 
doesn't understand or implement failure domains (although having CRUSH 
implemented in BTRFS for chunk placement would be a killer feature IMO).
>
> I do find the conversation interesting, however, as I work with Ceph
> quite a lot but have always gone with the default XFS filesystem
> on the OSDs.
>
From a stability perspective, I would normally go with XFS still for 
the OSDs.  Most of the data integrity features provided by BTRFS are 
also implemented in Ceph, so you don't gain much other than flexibility 
currently by using BTRFS instead of XFS.  The one advantage BTRFS has in 
my experience over XFS for something like this is that it seems (with 
recent versions at least) to be more likely to survive a power-failure 
without any serious data loss than XFS is, but that's not really a 
common concern in Ceph's primary use case.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Volume appears full but TB's of space available
  2017-04-07 11:41           ` Austin S. Hemmelgarn
  2017-04-07 13:28             ` John Petrini
  2017-04-07 16:04             ` Chris Murphy
@ 2017-04-08  5:12             ` Duncan
  2017-04-10 11:31               ` Austin S. Hemmelgarn
  2 siblings, 1 reply; 20+ messages in thread
From: Duncan @ 2017-04-08  5:12 UTC (permalink / raw)
  To: linux-btrfs

Austin S. Hemmelgarn posted on Fri, 07 Apr 2017 07:41:22 -0400 as
excerpted:

> 2. Results from 'btrfs scrub'.  This is somewhat tricky because scrub is
> either asynchronous or blocks for a _long_ time.  The simplest option
> I've found is to fire off an asynchronous scrub to run during down-time,
> and then schedule recurring checks with 'btrfs scrub status'.  On the
> plus side, 'btrfs scrub status' already returns non-zero if the scrub
> found errors.

This is (one place) where my "keep it small enough to be in-practice-
manageable" approach comes in.

I always run my scrubs with -B (don't background, always, because I've 
scripted it), and they normally come back within a minute. =:^)

But that's because I'm running multiple btrfs pair-device raid1 on a pair 
of partitioned SSDs, with each independent btrfs built on a partition 
from each ssd, with all partitions under 50 GiB.  So scrubs take less 
than a minute to run (on the under 1 GiB /var/log, it returns effectively 
immediately, as soon as I hit enter on the command), but that's not 
entirely surprising at the sizes of the ssd-based btrfs' I am running.

When scrubs (and balances, and checks) come back in a minute or so, it 
makes maintenance /so/ much less of a hassle. =:^)

And the generally single-purpose and relatively small size of each 
filesystem means I can, for instance, keep / (with all the system libs, 
bins, manpages, and the installed-package database, among other things) 
mounted read-only by default, and keep the updates partition (gentoo so 
that's the gentoo and overlay trees, the sources and binpkg cache, ccache 
cache, etc) and (large non-ssd/non-btrfs) media partitions unmounted by 
default.

Which in turn means when something /does/ go wrong, as long as it wasn't 
a physical device, there's much less data at risk, because most of it was 
probably either unmounted, or mounted read-only.

Which in turn means I don't have to worry about scrub/check or other 
repair on those filesystems at all, only the ones that were actually 
mounted writable.  And as mentioned, those scrub and check fast enough 
that I can literally wait at the terminal for command completion. =:^)

Of course my setup's what most would call partitioned to the extreme, but 
it does have its advantages, and it works well for me, which after all is 
the important thing for /my/ setup.

But the more generic point remains: if you set up multi-TB filesystems 
that take days or weeks for a maintenance command to complete, running 
those maintenance commands isn't going to be something done as often as 
one arguably should, and rebuilding from a filesystem or device failure 
is going to take far longer than one would like, as well.  We've seen the 
reports here.  If that's what you're doing, strongly consider breaking 
your filesystems down to something rather more manageable, say a couple 
TiB each.  Broken along natural usage lines, it can save a lot on the  
caffeine and headache pills when something does go wrong.

Unless of course like one poster here, you're handling double-digit-TB 
super-collider data files.  Those tend to be a bit difficult to store on 
sub-double-digit-TB filesystems.  =:^)  But that's the other extreme from 
what I've done here, and he actually has a good /reason/ for /his/
double-digit- or even triple-digit-TB filesystems.  There's not much to 
be done about his use-case, and indeed, AFAIK he decided btrfs simply 
isn't stable and mature enough for that use-case yet, tho I believe he's 
using it for some other, more minor and less gargantuan use-cases.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Volume appears full but TB's of space available
  2017-04-08  5:12             ` Duncan
@ 2017-04-10 11:31               ` Austin S. Hemmelgarn
  0 siblings, 0 replies; 20+ messages in thread
From: Austin S. Hemmelgarn @ 2017-04-10 11:31 UTC (permalink / raw)
  To: linux-btrfs

On 2017-04-08 01:12, Duncan wrote:
> Austin S. Hemmelgarn posted on Fri, 07 Apr 2017 07:41:22 -0400 as
> excerpted:
>
>> 2. Results from 'btrfs scrub'.  This is somewhat tricky because scrub is
>> either asynchronous or blocks for a _long_ time.  The simplest option
>> I've found is to fire off an asynchronous scrub to run during down-time,
>> and then schedule recurring checks with 'btrfs scrub status'.  On the
>> plus side, 'btrfs scrub status' already returns non-zero if the scrub
>> found errors.
>
> This is (one place) where my "keep it small enough to be in-practice-
> manageable" approach comes in.
>
> I always run my scrubs with -B (don't background, always, because I've
> scripted it), and they normally come back within a minute. =:^)
>
> But that's because I'm running multiple btrfs pair-device raid1 on a pair
> of partitioned SSDs, with each independent btrfs built on a partition
> from each ssd, with all partitions under 50 GiB.  So scrubs take less
> than a minute to run (on the under 1 GiB /var/log, it returns effectively
> immediately, as soon as I hit enter on the command), but that's not
> entirely surprising at the sizes of the ssd-based btrfs' I am running.
>
> When scrubs (and balances, and checks) come back in a minute or so, it
> makes maintenance /so/ much less of a hassle. =:^)
>
> And the generally single-purpose and relatively small size of each
> filesystem means I can, for instance, keep / (with all the system libs,
> bins, manpages, and the installed-package database, among other things)
> mounted read-only by default, and keep the updates partition (gentoo so
> that's the gentoo and overlay trees, the sources and binpkg cache, ccache
> cache, etc) and (large non-ssd/non-btrfs) media partitions unmounted by
> default.
>
> Which in turn means when something /does/ go wrong, as long as it wasn't
> a physical device, there's much less data at risk, because most of it was
> probably either unmounted, or mounted read-only.
>
> Which in turn means I don't have to worry about scrub/check or other
> repair on those filesystems at all, only the ones that were actually
> mounted writable.  And as mentioned, those scrub and check fast enough
> that I can literally wait at the terminal for command completion. =:^)
>
> Of course my setup's what most would call partitioned to the extreme, but
> it does have its advantages, and it works well for me, which after all is
> the important thing for /my/ setup.
Eh, maybe to most people who never dealt with disks with capacities on the 
order of triple-digit _megabytes_.  TBH, most of my systems look pretty 
similar, although I split at places that most people think are odd until 
I explain the reasoning (like /var/cache or the RRD storage for 
collectd).  With the exception of the backing storage for the storage 
micro-cluster I have on my home network and the VM storage, all my 
filesystems are 32GB or less (and usually some multiple of 8G), although 
I'm not lucky enough to have a good enough system to run maintenance 
that fast (although part of that might be that I don't heavily 
over-provision space in most of the filesystems, but instead leave a 
reasonable amount of slack-space at the LVM level, so if a filesystem 
gets wedged, I just temporarily resize the LV it's on so I can fix it).
>
> But the more generic point remains: if you set up multi-TB filesystems
> that take days or weeks for a maintenance command to complete, running
> those maintenance commands isn't going to be something done as often as
> one arguably should, and rebuilding from a filesystem or device failure
> is going to take far longer than one would like, as well.  We've seen the
> reports here.  If that's what you're doing, strongly consider breaking
> your filesystems down to something rather more manageable, say a couple
> TiB each.  Broken along natural usage lines, it can save a lot on the
> caffeine and headache pills when something does go wrong.
>
> Unless of course like one poster here, you're handling double-digit-TB
> super-collider data files.  Those tend to be a bit difficult to store on
> sub-double-digit-TB filesystems.  =:^)  But that's the other extreme from
> what I've done here, and he actually has a good /reason/ for /his/
> double-digit- or even triple-digit-TB filesystems.  There's not much to
> be done about his use-case, and indeed, AFAIK he decided btrfs simply
> isn't stable and mature enough for that use-case yet, tho I believe he's
> using it for some other, more minor and less gargantuan use-cases.
Even aside from that, there are cases where you essentially need large 
filesystems.  One good example is NAS usage.  In that case, it's a lot 
simpler to provision one filesystem and then share out subsets of it 
than it is to provision one for each share.  Clustering is another good 
example (the micro-cluster I mentioned above being a case in point): by 
just using one filesystem for each back-end system, I end up saving a 
very large amount of resources without compromising performance 
(although the 200GB back-end filesystems are nowhere near the multi-TB 
filesystems that are usually the issue).

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread

Thread overview: 20+ messages
2017-04-07  0:47 Volume appears full but TB's of space available John Petrini
2017-04-07  1:15 ` John Petrini
2017-04-07  1:21   ` Chris Murphy
2017-04-07  1:31     ` John Petrini
2017-04-07  2:42       ` Chris Murphy
2017-04-07  3:25         ` John Petrini
2017-04-07 11:41           ` Austin S. Hemmelgarn
2017-04-07 13:28             ` John Petrini
2017-04-07 13:50               ` Austin S. Hemmelgarn
2017-04-07 16:28                 ` Chris Murphy
2017-04-07 16:58                   ` Austin S. Hemmelgarn
2017-04-07 17:05                     ` John Petrini
2017-04-07 17:11                       ` Austin S. Hemmelgarn
2017-04-07 16:04             ` Chris Murphy
2017-04-07 16:51               ` Austin S. Hemmelgarn
2017-04-07 16:58                 ` John Petrini
2017-04-07 17:04                   ` Austin S. Hemmelgarn
2017-04-08  5:12             ` Duncan
2017-04-10 11:31               ` Austin S. Hemmelgarn
2017-04-07  1:17 ` Chris Murphy
