* Add device while rebalancing
@ 2016-04-22 20:36 Juan Alberto Cirez
  2016-04-23  5:38 ` Duncan
  0 siblings, 1 reply; 21+ messages in thread
From: Juan Alberto Cirez @ 2016-04-22 20:36 UTC (permalink / raw)
  To: linux-btrfs

Good morning,
I am new to this list and to btrfs in general. I have a quick
question: Can I add a new device to the pool while the btrfs
filesystem balance command is running on the drive pool?

Thanks

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Add device while rebalancing
  2016-04-22 20:36 Add device while rebalancing Juan Alberto Cirez
@ 2016-04-23  5:38 ` Duncan
  2016-04-25 11:18   ` Austin S. Hemmelgarn
  0 siblings, 1 reply; 21+ messages in thread
From: Duncan @ 2016-04-23  5:38 UTC (permalink / raw)
  To: linux-btrfs

Juan Alberto Cirez posted on Fri, 22 Apr 2016 14:36:44 -0600 as excerpted:

> Good morning,
> I am new to this list and to btrfs in general. I have a quick question:
> Can I add a new device to the pool while the btrfs filesystem balance
> command is running on the drive pool?

Adding a device while balancing shouldn't be a problem.  However, 
depending on your redundancy mode, you may wish to cancel the balance and 
start a new one after the device add, so the balance will take account of 
it as well and balance it into the mix.
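
For example, as a rough sketch (assuming the filesystem is mounted at
/mnt and the new disk is /dev/sdf; substitute your own paths):

  # stop the running balance, add the new device, then restart the
  # balance so the new device gets filled along with the rest
  btrfs balance cancel /mnt
  btrfs device add /dev/sdf /mnt
  btrfs balance start /mnt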

Note that while device add doesn't do more than that on its own, device 
delete/remove effectively initiates its own balance, moving the chunks on 
the device being removed to the other devices.  So you wouldn't want to 
be running a balance and then do a device remove at the same time.

Similarly with btrfs replace, altho in that case, it's more directly 
moving data from the device being replaced (if it's still there, or using 
redundancy or parity to recover it if not) to the replacement device, a 
more limited and often faster operation.  But you probably still don't 
want to do a balance at the same time as it places unnecessary stress on 
both the filesystem and the hardware, and even if the filesystem and 
devices handle the stress fine, the result is going to be that both 
operations take longer as they're both intensive operations that will 
interfere with each other to some extent.
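
For reference, a replace is just a start/status pair, something like
this (device names and mountpoint are only examples):

  # start the replace, then check on its progress
  btrfs replace start /dev/sdc /dev/sdg /mnt
  btrfs replace status /mnt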

Similarly with btrfs scrub.  The operations are logically different 
enough that they shouldn't really interfere with each other logically, 
but they're both hardware intensive operations that will put unnecessary 
stress on the system if you're doing more than one at a time, and will 
result in both going slower than they normally would.
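
Scrub itself is the same sort of start/status pair (again with /mnt
standing in for your mountpoint):

  # run and monitor a scrub on its own, not alongside a balance
  btrfs scrub start /mnt
  btrfs scrub status /mnt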

And again with snapshotting operations.  Making a snapshot is normally 
nearly instantaneous, but there's a scaling issue if you have too many 
per filesystem (try to keep it under 2000 snapshots per filesystem total, 
if possible, and definitely keep it under 10K or some operations will 
slow down substantially), and deleting snapshots is more work, so while 
you should ordinarily automatically thin down snapshots if you're 
automatically making them quite frequently (say daily or more 
frequently), you may want to put the snapshot deletion, at least, on hold 
while you scrub or balance or device delete or replace.
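
A sketch of that kind of snapshot routine (the paths and naming scheme
here are only examples):

  # frequent read-only snapshots are cheap...
  btrfs subvolume snapshot -r /mnt/data /mnt/snapshots/data-$(date +%Y%m%d-%H%M)
  # ...thinning them out is the expensive part, so pause this bit while
  # a balance, scrub, device delete or replace is running
  btrfs subvolume delete /mnt/snapshots/data-20160301-0000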

Meanwhile, you mentioned being new to btrfs.  If you haven't discovered 
the wiki yet, please spend some time reading the user documentation 
there, as it's likely to clear up a lot of questions you may have, and 
you'll better understand how to effectively work with the filesystem when 
you're done.  It's well worth the time invested! =:^)

https://btrfs.wiki.kernel.org


-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Add device while rebalancing
  2016-04-23  5:38 ` Duncan
@ 2016-04-25 11:18   ` Austin S. Hemmelgarn
  2016-04-25 12:43     ` Duncan
  0 siblings, 1 reply; 21+ messages in thread
From: Austin S. Hemmelgarn @ 2016-04-25 11:18 UTC (permalink / raw)
  To: linux-btrfs

On 2016-04-23 01:38, Duncan wrote:
> Juan Alberto Cirez posted on Fri, 22 Apr 2016 14:36:44 -0600 as excerpted:
>
>> Good morning,
>> I am new to this list and to btrfs in general. I have a quick question:
>> Can I add a new device to the pool while the btrfs filesystem balance
>> command is running on the drive pool?
>
> Adding a device while balancing shouldn't be a problem.  However,
> depending on your redundancy mode, you may wish to cancel the balance and
> start a new one after the device add, so the balance will take account of
> it as well and balance it into the mix.
I'm not 100% certain about how balance will handle this, except that 
nothing should break.  I believe that it picks a device each time it 
goes to move a chunk, so it should evaluate any chunks operated on after 
the addition of the device for possible placement on that device (and it 
will probably end up putting a lot of them there because that device 
will almost certainly be less full than any of the others).  That said, 
you probably do want to cancel the balance, add the device, and re-run 
the balance so that things end up more evenly distributed.
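
Once the balance finishes, an easy way to see how evenly things ended
up is the per-device output of the usage commands (assuming /mnt is the
mountpoint):

  # the per-device 'used' numbers should be roughly even after a full balance
  btrfs filesystem show /mnt
  btrfs filesystem usage /mnt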
>
> Note that while device add doesn't do more than that on its own, device
> delete/remove effectively initiates its own balance, moving the chunks on
> the device being removed to the other devices.  So you wouldn't want to
> be running a balance and then do a device remove at the same time.
IIRC, trying to delete a device while running a balance will fail, and 
return an error, because only one balance can be running at a given moment.
>
> Similarly with btrfs replace, altho in that case, it's more directly
> moving data from the device being replaced (if it's still there, or using
> redundancy or parity to recover it if not) to the replacement device, a
> more limited and often faster operation.  But you probably still don't
> want to do a balance at the same time as it places unnecessary stress on
> both the filesystem and the hardware, and even if the filesystem and
> devices handle the stress fine, the result is going to be that both
> operations take longer as they're both intensive operations that will
> interfere with each other to some extent.
Agreed, this is generally not a good idea because of the stress it puts 
on the devices (and because it probably isn't well tested).
>
> Similarly with btrfs scrub.  The operations are logically different
> enough that they shouldn't really interfere with each other logically,
> but they're both hardware intensive operations that will put unnecessary
> stress on the system if you're doing more than one at a time, and will
> result in both going slower than they normally would.
Actually, depending on a number of factors, scrubbing while balancing 
can finish faster than running one then the other in sequence.  It's 
really dependent on how both decide to pick chunks, and how your 
underlying devices handle read and write caching, but it can happen. 
Most of the time though, it should take around the same amount of time 
as running one then the other, or a little bit longer if you're on 
traditional disks.
>
> And again with snapshotting operations.  Making a snapshot is normally
> nearly instantaneous, but there's a scaling issue if you have too many
> per filesystem (try to keep it under 2000 snapshots per filesystem total,
> if possible, and definitely keep it under 10K or some operations will
> slow down substantially), and deleting snapshots is more work, so while
> you should ordinarily automatically thin down snapshots if you're
> automatically making them quite frequently (say daily or more
> frequently), you may want to put the snapshot deletion, at least, on hold
> while you scrub or balance or device delete or replace.
I would actually recommend putting all snapshot operations on hold, as 
well as most writes to the filesystem, while doing a balance or device 
deletion.  The more writes you have while doing those, the longer they 
take, and the less likely that you end up with a good on-disk layout of 
the data.


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Add device while rebalancing
  2016-04-25 11:18   ` Austin S. Hemmelgarn
@ 2016-04-25 12:43     ` Duncan
  2016-04-25 13:02       ` Austin S. Hemmelgarn
  0 siblings, 1 reply; 21+ messages in thread
From: Duncan @ 2016-04-25 12:43 UTC (permalink / raw)
  To: linux-btrfs

Austin S. Hemmelgarn posted on Mon, 25 Apr 2016 07:18:10 -0400 as
excerpted:

> On 2016-04-23 01:38, Duncan wrote:
>>
>> And again with snapshotting operations.  Making a snapshot is normally
>> nearly instantaneous, but there's a scaling issue if you have too many
>> per filesystem (try to keep it under 2000 snapshots per filesystem
>> total, if possible, and definitely keep it under 10K or some operations
>> will slow down substantially), and deleting snapshots is more work, so
>> while you should ordinarily automatically thin down snapshots if you're
>> automatically making them quite frequently (say daily or more
>> frequently), you may want to put the snapshot deletion, at least, on
>> hold while you scrub or balance or device delete or replace.

> I would actually recommend putting all snapshot operations on hold, as
> well as most writes to the filesystem, while doing a balance or device
> deletion.  The more writes you have while doing those, the longer they
> take, and the less likely that you end up with a good on-disk layout of
> the data.

The thing with snapshot writing is that all snapshot creation effectively 
does is a bit of metadata writing.  What snapshots primarily do is lock 
existing extents in place (down within their chunk, with the higher chunk 
level being the scope at which balance works), that would otherwise be 
COWed elsewhere with the existing extent deleted on change, or simply 
deleted on file delete.  A snapshot simply adds a reference to the 
current version, so that deletion, either directly or from the COW, never 
happens, and to do that simply requires a relatively small metadata write.

So while I agree in general that more writes means balances taking 
longer, snapshot creation writes are pretty tiny in the scheme of things, 
and won't affect the balance much, compared to larger writes you'll very 
possibly still be doing unless you really do suspend pretty much all 
write operations to that filesystem during the balance.

But as I said, snapshot deletions are an entirely different story, as 
then all those previously locked in place extents are potentially freed, 
and the filesystem must do a lot of work to figure out which ones it can 
actually free and free them, vs. ones that still have other references 
which therefore cannot yet be freed.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Add device while rebalancing
  2016-04-25 12:43     ` Duncan
@ 2016-04-25 13:02       ` Austin S. Hemmelgarn
  2016-04-26 10:50         ` Juan Alberto Cirez
  0 siblings, 1 reply; 21+ messages in thread
From: Austin S. Hemmelgarn @ 2016-04-25 13:02 UTC (permalink / raw)
  To: linux-btrfs

On 2016-04-25 08:43, Duncan wrote:
> Austin S. Hemmelgarn posted on Mon, 25 Apr 2016 07:18:10 -0400 as
> excerpted:
>
>> On 2016-04-23 01:38, Duncan wrote:
>>>
>>> And again with snapshotting operations.  Making a snapshot is normally
>>> nearly instantaneous, but there's a scaling issue if you have too many
>>> per filesystem (try to keep it under 2000 snapshots per filesystem
>>> total, if possible, and definitely keep it under 10K or some operations
>>> will slow down substantially), and deleting snapshots is more work, so
>>> while you should ordinarily automatically thin down snapshots if you're
>>> automatically making them quite frequently (say daily or more
>>> frequently), you may want to put the snapshot deletion, at least, on
>>> hold while you scrub or balance or device delete or replace.
>
>> I would actually recommend putting all snapshot operations on hold, as
>> well as most writes to the filesystem, while doing a balance or device
>> deletion.  The more writes you have while doing those, the longer they
>> take, and the less likely that you end up with a good on-disk layout of
>> the data.
>
> The thing with snapshot writing is that all snapshot creation effectively
> does is a bit of metadata writing.  What snapshots primarily do is lock
> existing extents in place (down within their chunk, with the higher chunk
> level being the scope at which balance works), that would otherwise be
> COWed elsewhere with the existing extent deleted on change, or simply
> deleted on on file delete.  A snapshot simply adds a reference to the
> current version, so that deletion, either directly or from the COW, never
> happens, and to do that simply requires a relatively small metadata write.
Unless I'm mistaken about the internals of BTRFS (which might be the 
case), creating a snapshot has to update reference counts on every 
single extent in every single file in the snapshot.  For something small 
this isn't much, but if you are snapshotting something big (say, 
snapshotting an entire system with all the data in one subvolume), it 
can amount to multiple MB of writes, and it gets even worse if you have 
no shared extents to begin with (which is still pretty typical).  On 
some of the systems I work with at work, snapshotting a terabyte of data 
can end up resulting in 10-20 MB of writes to disk (in this case, that 
figure came from a partition containing mostly small files that were 
just big enough that they didn't fit in-line in the metadata blocks).

This is of course still significantly faster than copying everything, 
but it's not free either.
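
If you want a rough feel for this on your own filesystem, something
like the following works (only a sketch: it assumes the filesystem sits
on sdb and is mounted at /mnt, and any other writer active during the
window gets counted too):

  # sectors-written counter for sdb before and after a read-only snapshot
  before=$(awk '$3 == "sdb" { print $10 }' /proc/diskstats)
  btrfs subvolume snapshot -r /mnt/data /mnt/snapshots/data-test
  sync
  after=$(awk '$3 == "sdb" { print $10 }' /proc/diskstats)
  echo "$(( (after - before) * 512 )) bytes written"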
>
> So while I agree in general that more writes means balances taking
> longer, snapshot creation writes are pretty tiny in the scheme of things,
> and won't affect the balance much, compared to larger writes you'll very
> possibly still be doing unless you really do suspend pretty much all
> write operations to that filesystem during the balance.
In general, yes, except that there's the case of running with mostly 
full metadata chunks, where it might result in a further chunk 
allocation, which in turn can throw off the balanced layout.  Balance 
always allocates new chunks, and doesn't write into existing ones, so if 
you're writing enough to allocate a new chunk while a balance is happening:
1. That chunk may or may not get considered by the balance code (I'm not 
100% certain about this, but I believe it will be ignored by any balance 
running at the time it gets allocated).
2. You run the risk of ending up with a chunk with almost nothing in it 
which could be packed into another existing chunk.
Snapshots are not likely to trigger this, but it is still possible, 
especially if you're taking lots of snapshots in a short period of time.
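
If that does happen, a filtered balance afterwards will repack the
nearly empty chunks without redoing everything (the 10% threshold here
is just an example):

  # rewrite only data chunks that are less than 10% full
  btrfs balance start -dusage=10 /mnt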
>
> But as I said, snapshot deletions are an entirely different story, as
> then all those previously locked in place extents are potentially freed,
> and the filesystem must do a lot of work to figure out which ones it can
> actually free and free them, vs. ones that still have other references
> which therefore cannot yet be freed.
Most of the issue here with balance is that you end up potentially doing 
an amount of unnecessary work which is unquantifiable before it's done.


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Add device while rebalancing
  2016-04-25 13:02       ` Austin S. Hemmelgarn
@ 2016-04-26 10:50         ` Juan Alberto Cirez
  2016-04-26 11:11           ` Austin S. Hemmelgarn
  0 siblings, 1 reply; 21+ messages in thread
From: Juan Alberto Cirez @ 2016-04-26 10:50 UTC (permalink / raw)
  To: Austin S. Hemmelgarn; +Cc: linux-btrfs

Thank you guys so very kindly for all your help and taking the time to
answer my question. I have been reading the wiki and online use cases
and otherwise delving deeper into the btrfs architecture.

I am managing a 520TB storage pool spread across 16 server pods and
have tried several methods of distributed storage. The last attempt was
using ZFS as a base for the physical bricks and GlusterFS as the glue to
string together the storage pool. I was not satisfied with the results
(mainly ZFS). Once I have run btrfs on the test server
(32TB, 8x 4TB HDD RAID10) for a while, I will try btrfs/Ceph.

On Mon, Apr 25, 2016 at 7:02 AM, Austin S. Hemmelgarn
<ahferroin7@gmail.com> wrote:
> On 2016-04-25 08:43, Duncan wrote:
>>
>> Austin S. Hemmelgarn posted on Mon, 25 Apr 2016 07:18:10 -0400 as
>> excerpted:
>>
>>> On 2016-04-23 01:38, Duncan wrote:
>>>>
>>>>
>>>> And again with snapshotting operations.  Making a snapshot is normally
>>>> nearly instantaneous, but there's a scaling issue if you have too many
>>>> per filesystem (try to keep it under 2000 snapshots per filesystem
>>>> total, if possible, and definitely keep it under 10K or some operations
>>>> will slow down substantially), and deleting snapshots is more work, so
>>>> while you should ordinarily automatically thin down snapshots if you're
>>>> automatically making them quite frequently (say daily or more
>>>> frequently), you may want to put the snapshot deletion, at least, on
>>>> hold while you scrub or balance or device delete or replace.
>>
>>
>>> I would actually recommend putting all snapshot operations on hold, as
>>> well as most writes to the filesystem, while doing a balance or device
>>> deletion.  The more writes you have while doing those, the longer they
>>> take, and the less likely that you end up with a good on-disk layout of
>>> the data.
>>
>>
>> The thing with snapshot writing is that all snapshot creation effectively
>> does is a bit of metadata writing.  What snapshots primarily do is lock
>> existing extents in place (down within their chunk, with the higher chunk
>> level being the scope at which balance works), that would otherwise be
>> COWed elsewhere with the existing extent deleted on change, or simply
>> deleted on on file delete.  A snapshot simply adds a reference to the
>> current version, so that deletion, either directly or from the COW, never
>> happens, and to do that simply requires a relatively small metadata write.
>
> Unless I'm mistaken about the internals of BTRFS (which might be the case),
> creating a snapshot has to update reference counts on every single extent in
> every single file in the snapshot.  For something small this isn't much, but
> if you are snapshotting something big (say, snapshotting an entire system
> with all the data in one subvolume), it can amount to multiple MB of writes,
> and it gets even worse if you have no shared extents to begin with (which is
> still pretty typical).  On some of the systems I work with at work,
> snapshotting a terabyte of data can end up resulting in 10-20 MB of writes
> to disk (in this case, that figure came from a partition containing mostly
> small files that were just big enough that they didn't fit in-line in the
> metadata blocks).
>
> This is of course still significantly faster than copying everything, but
> it's not free either.
>>
>>
>> So while I agree in general that more writes means balances taking
>> longer, snapshot creation writes are pretty tiny in the scheme of things,
>> and won't affect the balance much, compared to larger writes you'll very
>> possibly still be doing unless you really do suspend pretty much all
>> write operations to that filesystem during the balance.
>
> In general, yes, except that there's the case of running with mostly full
> metadata chunks, where it might result in a further chunk allocation, which
> in turn can throw off the balanced layout.  Balance always allocates new
> chunks, and doesn't write into existing ones, so if you're writing enough to
> allocate a new chunk while a balance is happening:
> 1. That chunk may or may not get considered by the balance code (I'm not
> 100% certain about this, but I believe it will be ignored by any balance
> running at the time it gets allocated).
> 2. You run the risk of ending up with a chunk with almost nothing in it
> which could be packed into another existing chunk.
> Snapshots are not likely to trigger this, but it is still possible,
> especially if you're taking lots of snapshots in a short period of time.
>>
>>
>> But as I said, snapshot deletions are an entirely different story, as
>> then all those previously locked in place extents are potentially freed,
>> and the filesystem must do a lot of work to figure out which ones it can
>> actually free and free them, vs. ones that still have other references
>> which therefore cannot yet be freed.
>
> Most of the issue here with balance is that you end up potentially doing an
> amount of unnecessary work which is unquantifiable before it's done.
>
>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Add device while rebalancing
  2016-04-26 10:50         ` Juan Alberto Cirez
@ 2016-04-26 11:11           ` Austin S. Hemmelgarn
  2016-04-26 11:44             ` Juan Alberto Cirez
  0 siblings, 1 reply; 21+ messages in thread
From: Austin S. Hemmelgarn @ 2016-04-26 11:11 UTC (permalink / raw)
  To: Juan Alberto Cirez; +Cc: linux-btrfs

On 2016-04-26 06:50, Juan Alberto Cirez wrote:
> Thank you guys so very kindly for all your help and taking the time to
> answer my question. I have been reading the wiki and online use cases
> and otherwise delving deeper into the btrfs architecture.
>
> I am managing a 520TB storage pool spread across 16 server pods and
> have tried several methods of distributed storage. Last attempt was
> using Zfs as a base for the physical bricks and GlusterFS as a glue to
> string together the storage pool. I was not satisfied with the results
> (mainly Zfs). Once I have run btrfs for a while on the test server
> (32TB, 8x 4TB HDD RAID10) for a while I will try btrfs/ceph
For what it's worth, GlusterFS works great on top of BTRFS.  I don't 
have any claims to usage in production, but I've done _a lot_ of testing 
with it because we're replacing one of our critical file servers at work 
with a couple of systems set up with Gluster on top of BTRFS, and I've 
been looking at setting up a small storage cluster at home using it on a 
couple of laptops I have which have non-functional displays.  Based on 
what I've seen, it appears to be rock solid with respect to the common 
failure modes, provided you use something like raid1 mode on the BTRFS 
side of things.
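
For a rough idea of the shape of that setup (host names, brick paths
and devices here are made up, and replica 2 is just an example):

  # on each node: a two-disk btrfs raid1 brick
  mkfs.btrfs -d raid1 -m raid1 /dev/sdb /dev/sdc
  mkdir -p /bricks/brick1
  mount /dev/sdb /bricks/brick1
  mkdir -p /bricks/brick1/gv0
  # on one node (after 'gluster peer probe' has joined the others):
  # a 2-way replicated volume spanning two nodes
  gluster volume create gv0 replica 2 node1:/bricks/brick1/gv0 node2:/bricks/brick1/gv0
  gluster volume start gv0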

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Add device while rebalancing
  2016-04-26 11:11           ` Austin S. Hemmelgarn
@ 2016-04-26 11:44             ` Juan Alberto Cirez
  2016-04-26 12:04               ` Austin S. Hemmelgarn
  2016-04-27  0:58               ` Chris Murphy
  0 siblings, 2 replies; 21+ messages in thread
From: Juan Alberto Cirez @ 2016-04-26 11:44 UTC (permalink / raw)
  To: Austin S. Hemmelgarn; +Cc: linux-btrfs

Well,
RAID1 offers no parity, striping, or spanning of disk space across
multiple disks.

RAID10 configuration, on the other hand, requires a minimum of four
HDDs, but it stripes data across mirrored pairs. As long as one disk in
each mirrored pair is functional, data can be retrieved.

With GlusterFS as a distributed volume, the files are already spread
among the servers causing file I/O to be spread fairly evenly among
them as well, thus probably providing the benefit one might expect
from striping (RAID10).

The question I have now is: should I use RAID10 or RAID1 underneath
a GlusterFS striped (and possibly replicated) volume?

On Tue, Apr 26, 2016 at 5:11 AM, Austin S. Hemmelgarn
<ahferroin7@gmail.com> wrote:
> On 2016-04-26 06:50, Juan Alberto Cirez wrote:
>>
>> Thank you guys so very kindly for all your help and taking the time to
>> answer my question. I have been reading the wiki and online use cases
>> and otherwise delving deeper into the btrfs architecture.
>>
>> I am managing a 520TB storage pool spread across 16 server pods and
>> have tried several methods of distributed storage. Last attempt was
>> using Zfs as a base for the physical bricks and GlusterFS as a glue to
>> string together the storage pool. I was not satisfied with the results
>> (mainly Zfs). Once I have run btrfs for a while on the test server
>> (32TB, 8x 4TB HDD RAID10) for a while I will try btrfs/ceph
>
> For what it's worth, GlusterFS works great on top of BTRFS.  I don't have
> any claims to usage in production, but I've done _a lot_ of testing with it
> because we're replacing one of our critical file servers at work with a
> couple of systems set up with Gluster on top of BTRFS, and I've been looking
> at setting up a small storage cluster at home using it on a couple of
> laptops I have which have non-functional displays.  Based on what I've seen,
> it appears to be rock solid with respect to the common failure modes,
> provided you use something like raid1 mode on the BTRFS side of things.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Add device while rebalancing
  2016-04-26 11:44             ` Juan Alberto Cirez
@ 2016-04-26 12:04               ` Austin S. Hemmelgarn
  2016-04-26 12:14                 ` Juan Alberto Cirez
  2016-04-27  0:58               ` Chris Murphy
  1 sibling, 1 reply; 21+ messages in thread
From: Austin S. Hemmelgarn @ 2016-04-26 12:04 UTC (permalink / raw)
  To: Juan Alberto Cirez; +Cc: linux-btrfs

On 2016-04-26 07:44, Juan Alberto Cirez wrote:
> Well,
> RAID1 offers no parity, striping, or spanning of disk space across
> multiple disks.
>
> RAID10 configuration, on the other hand, requires a minimum of four
> HDD, but it stripes data across mirrored pairs. As long as one disk in
> each mirrored pair is functional, data can be retrieved.
>
> With GlusterFS as a distributed volume, the files are already spread
> among the servers causing file I/O to be spread fairly evenly among
> them as well, thus probably providing the benefit one might expect
> with stripe (RAID10).
>
> The question I have now is: Should I use a RAID10 or RAID1 underneath
> of a GlusterFS stripped (and possibly replicated) volume?
If you have enough systems and a new enough version of GlusterFS, I'd 
suggest using raid1 on the low level, and then either a distributed 
replicated volume or an erasure coded volume in GlusterFS.
Having more individual nodes involved will improve your scalability to 
larger numbers of clients, and you can have more nodes with the same 
number of disks if you use raid1 instead of raid10 on BTRFS.  Using 
erasure coding in Gluster will provide better resiliency with higher 
node counts for each individual file, at the cost of moderately higher 
CPU time being used.  FWIW, RAID5 and RAID6 are both specific cases of 
(mathematically) optimal erasure coding (RAID5 is n,n+1 and RAID6 is 
n,n+2 using the normal notation), but the equivalent forms in Gluster 
are somewhat risky with any decent sized cluster.
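
A dispersed (erasure coded) volume is the same idea with different
keywords; this sketch is 4+2, so any two bricks can be lost (node names
and brick paths are again made up):

  gluster volume create gv1 disperse 6 redundancy 2 \
      node{1..6}:/bricks/brick1/gv1
  gluster volume start gv1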

It is worth noting that I would not personally trust just GlusterFS or 
just BTRFS with the data replication: BTRFS is still somewhat new 
(although I haven't had a truly broken filesystem in more than a year), 
and GlusterFS has a lot more failure modes because of the networking.
>
> On Tue, Apr 26, 2016 at 5:11 AM, Austin S. Hemmelgarn
> <ahferroin7@gmail.com> wrote:
>> On 2016-04-26 06:50, Juan Alberto Cirez wrote:
>>>
>>> Thank you guys so very kindly for all your help and taking the time to
>>> answer my question. I have been reading the wiki and online use cases
>>> and otherwise delving deeper into the btrfs architecture.
>>>
>>> I am managing a 520TB storage pool spread across 16 server pods and
>>> have tried several methods of distributed storage. Last attempt was
>>> using Zfs as a base for the physical bricks and GlusterFS as a glue to
>>> string together the storage pool. I was not satisfied with the results
>>> (mainly Zfs). Once I have run btrfs for a while on the test server
>>> (32TB, 8x 4TB HDD RAID10) for a while I will try btrfs/ceph
>>
>> For what it's worth, GlusterFS works great on top of BTRFS.  I don't have
>> any claims to usage in production, but I've done _a lot_ of testing with it
>> because we're replacing one of our critical file servers at work with a
>> couple of systems set up with Gluster on top of BTRFS, and I've been looking
>> at setting up a small storage cluster at home using it on a couple of
>> laptops I have which have non-functional displays.  Based on what I've seen,
>> it appears to be rock solid with respect to the common failure modes,
>> provided you use something like raid1 mode on the BTRFS side of things.


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Add device while rebalancing
  2016-04-26 12:04               ` Austin S. Hemmelgarn
@ 2016-04-26 12:14                 ` Juan Alberto Cirez
  2016-04-26 12:44                   ` Austin S. Hemmelgarn
  0 siblings, 1 reply; 21+ messages in thread
From: Juan Alberto Cirez @ 2016-04-26 12:14 UTC (permalink / raw)
  To: Austin S. Hemmelgarn; +Cc: linux-btrfs

Thank you again, Austin.

My ideal case would be high availability coupled with reliable data
replication and integrity against accidental loss. I am willing to
cede ground on write speed, but reads have to be as optimized as
possible.
So far BTRFS RAID10 on the 32TB test server is quite good for both read
& write, and data loss/corruption has not been an issue yet. When I
introduce the network/distributed layer, I would like the same.
BTW, does Ceph provide similar functionality, reliability and performance?

On Tue, Apr 26, 2016 at 6:04 AM, Austin S. Hemmelgarn
<ahferroin7@gmail.com> wrote:
> On 2016-04-26 07:44, Juan Alberto Cirez wrote:
>>
>> Well,
>> RAID1 offers no parity, striping, or spanning of disk space across
>> multiple disks.
>>
>> RAID10 configuration, on the other hand, requires a minimum of four
>> HDD, but it stripes data across mirrored pairs. As long as one disk in
>> each mirrored pair is functional, data can be retrieved.
>>
>> With GlusterFS as a distributed volume, the files are already spread
>> among the servers causing file I/O to be spread fairly evenly among
>> them as well, thus probably providing the benefit one might expect
>> with stripe (RAID10).
>>
>> The question I have now is: Should I use a RAID10 or RAID1 underneath
>> of a GlusterFS stripped (and possibly replicated) volume?
>
> If you have enough systems and a new enough version of GlusterFS, I'd
> suggest using raid1 on the low level, and then either a distributed
> replicated volume or an erasure coded volume in GlusterFS.
> Having more individual nodes involved will improve your scalability to
> larger numbers of clients, and you can have more nodes with the same number
> of disks if you use raid1 instead of raid10 on BTRFS.  Using Erasure coding
> in Gluster will provide better resiliency with higher node counts for each
> individual file, at the cost of moderately higher CPU time being used.
> FWIW, RAID5 and RAID6 are both specific cases of (mathematically) optimal
> erasure coding (RAID5 is n,n+1 and RAID6 is n,n+2 using the normal
> notation), but the equivalent forms in Gluster are somewhat risky with any
> decent sized cluster.
>
> It is worth noting that I would not personally trust just GlusterFS or just
> BTRFS with the data replication, BTRFS is still somewhat new (although I
> haven't had a truly broken filesystem in more than a year), and GlusterFS
> has a lot more failure modes because of the networking.
>
>>
>> On Tue, Apr 26, 2016 at 5:11 AM, Austin S. Hemmelgarn
>> <ahferroin7@gmail.com> wrote:
>>>
>>> On 2016-04-26 06:50, Juan Alberto Cirez wrote:
>>>>
>>>>
>>>> Thank you guys so very kindly for all your help and taking the time to
>>>> answer my question. I have been reading the wiki and online use cases
>>>> and otherwise delving deeper into the btrfs architecture.
>>>>
>>>> I am managing a 520TB storage pool spread across 16 server pods and
>>>> have tried several methods of distributed storage. Last attempt was
>>>> using Zfs as a base for the physical bricks and GlusterFS as a glue to
>>>> string together the storage pool. I was not satisfied with the results
>>>> (mainly Zfs). Once I have run btrfs for a while on the test server
>>>> (32TB, 8x 4TB HDD RAID10) for a while I will try btrfs/ceph
>>>
>>>
>>> For what it's worth, GlusterFS works great on top of BTRFS.  I don't have
>>> any claims to usage in production, but I've done _a lot_ of testing with
>>> it
>>> because we're replacing one of our critical file servers at work with a
>>> couple of systems set up with Gluster on top of BTRFS, and I've been
>>> looking
>>> at setting up a small storage cluster at home using it on a couple of
>>> laptops I have which have non-functional displays.  Based on what I've
>>> seen,
>>> it appears to be rock solid with respect to the common failure modes,
>>> provided you use something like raid1 mode on the BTRFS side of things.
>
>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Add device while rebalancing
  2016-04-26 12:14                 ` Juan Alberto Cirez
@ 2016-04-26 12:44                   ` Austin S. Hemmelgarn
  0 siblings, 0 replies; 21+ messages in thread
From: Austin S. Hemmelgarn @ 2016-04-26 12:44 UTC (permalink / raw)
  To: Juan Alberto Cirez; +Cc: linux-btrfs

On 2016-04-26 08:14, Juan Alberto Cirez wrote:
> Thank you again, Austin.
>
> My ideal case would be high availability coupled with reliable data
> replication and integrity against accidental lost. I am willing to
> cede ground on the write speed; but the read has to be as optimized as
> possible.
> So far BTRFS, RAID10 on the 32TB test server is quite good both read &
> write and data lost/corruption has not been an issue yet. When I
> introduce the network/distributed layer, I would like the same.
> BTW does Ceph provides similar functionality, reliability and performace?
I can't give as much advice on Ceph, except to say that when I last 
tested it more than 2 years ago, the filesystem front-end had some 
serious data integrity issues, and the block device front-end had some 
sanity issues when dealing with systems going off-line (either crashing, 
or being shut down).  I don't know if they're fixed or not by now.  It's 
worth noting that while Gluster and Ceph are both intended for cluster 
storage, Ceph has a much more data-center-oriented approach (it 
appears from what I've seen to be optimized for lots of small systems 
running as OSD's with a few bigger ones running as monitors and possibly 
MDS's), while Gluster seems (again, personal perspective) to try to be 
more agnostic of what hardware is involved.  I will comment though that 
it is exponentially easier to recover data from a failed GlusterFS 
cluster than from a failed Ceph cluster: Gluster uses flat files with a 
few extended attributes for storage, whereas Ceph uses its own internal 
binary object format (partly because Ceph is first and foremost an 
object storage system, whereas Gluster is primarily intended as an 
actual filesystem).

Also, with respect to performance, you may want to compare BTRFS raid10 
mode to BTRFS raid1 on top of two LVM RAID0 volumes.  I find this tends 
to get better overall performance with no difference in data safety, 
because BTRFS still has a pretty brain-dead I/O scheduler in the 
multi-device code.
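
A sketch of that layout with four example disks (device names and the
volume group name are placeholders):

  # two 2-disk striped (raid0) LVs, then btrfs raid1 across them
  vgcreate vg0 /dev/sdb /dev/sdc /dev/sdd /dev/sde
  lvcreate -i 2 -l 100%PVS -n stripe0 vg0 /dev/sdb /dev/sdc
  lvcreate -i 2 -l 100%PVS -n stripe1 vg0 /dev/sdd /dev/sde
  mkfs.btrfs -d raid1 -m raid1 /dev/vg0/stripe0 /dev/vg0/stripe1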
> On Tue, Apr 26, 2016 at 6:04 AM, Austin S. Hemmelgarn
> <ahferroin7@gmail.com> wrote:
>> On 2016-04-26 07:44, Juan Alberto Cirez wrote:
>>>
>>> Well,
>>> RAID1 offers no parity, striping, or spanning of disk space across
>>> multiple disks.
>>>
>>> RAID10 configuration, on the other hand, requires a minimum of four
>>> HDD, but it stripes data across mirrored pairs. As long as one disk in
>>> each mirrored pair is functional, data can be retrieved.
>>>
>>> With GlusterFS as a distributed volume, the files are already spread
>>> among the servers causing file I/O to be spread fairly evenly among
>>> them as well, thus probably providing the benefit one might expect
>>> with stripe (RAID10).
>>>
>>> The question I have now is: Should I use a RAID10 or RAID1 underneath
>>> of a GlusterFS stripped (and possibly replicated) volume?
>>
>> If you have enough systems and a new enough version of GlusterFS, I'd
>> suggest using raid1 on the low level, and then either a distributed
>> replicated volume or an erasure coded volume in GlusterFS.
>> Having more individual nodes involved will improve your scalability to
>> larger numbers of clients, and you can have more nodes with the same number
>> of disks if you use raid1 instead of raid10 on BTRFS.  Using Erasure coding
>> in Gluster will provide better resiliency with higher node counts for each
>> individual file, at the cost of moderately higher CPU time being used.
>> FWIW, RAID5 and RAID6 are both specific cases of (mathematically) optimal
>> erasure coding (RAID5 is n,n+1 and RAID6 is n,n+2 using the normal
>> notation), but the equivalent forms in Gluster are somewhat risky with any
>> decent sized cluster.
>>
>> It is worth noting that I would not personally trust just GlusterFS or just
>> BTRFS with the data replication, BTRFS is still somewhat new (although I
>> haven't had a truly broken filesystem in more than a year), and GlusterFS
>> has a lot more failure modes because of the networking.
>>
>>>
>>> On Tue, Apr 26, 2016 at 5:11 AM, Austin S. Hemmelgarn
>>> <ahferroin7@gmail.com> wrote:
>>>>
>>>> On 2016-04-26 06:50, Juan Alberto Cirez wrote:
>>>>>
>>>>>
>>>>> Thank you guys so very kindly for all your help and taking the time to
>>>>> answer my question. I have been reading the wiki and online use cases
>>>>> and otherwise delving deeper into the btrfs architecture.
>>>>>
>>>>> I am managing a 520TB storage pool spread across 16 server pods and
>>>>> have tried several methods of distributed storage. Last attempt was
>>>>> using Zfs as a base for the physical bricks and GlusterFS as a glue to
>>>>> string together the storage pool. I was not satisfied with the results
>>>>> (mainly Zfs). Once I have run btrfs for a while on the test server
>>>>> (32TB, 8x 4TB HDD RAID10) for a while I will try btrfs/ceph
>>>>
>>>>
>>>> For what it's worth, GlusterFS works great on top of BTRFS.  I don't have
>>>> any claims to usage in production, but I've done _a lot_ of testing with
>>>> it
>>>> because we're replacing one of our critical file servers at work with a
>>>> couple of systems set up with Gluster on top of BTRFS, and I've been
>>>> looking
>>>> at setting up a small storage cluster at home using it on a couple of
>>>> laptops I have which have non-functional displays.  Based on what I've
>>>> seen,
>>>> it appears to be rock solid with respect to the common failure modes,
>>>> provided you use something like raid1 mode on the BTRFS side of things.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Add device while rebalancing
  2016-04-26 11:44             ` Juan Alberto Cirez
  2016-04-26 12:04               ` Austin S. Hemmelgarn
@ 2016-04-27  0:58               ` Chris Murphy
  2016-04-27 10:37                 ` Duncan
  2016-04-27 11:22                 ` Austin S. Hemmelgarn
  1 sibling, 2 replies; 21+ messages in thread
From: Chris Murphy @ 2016-04-27  0:58 UTC (permalink / raw)
  To: Juan Alberto Cirez; +Cc: Austin S. Hemmelgarn, linux-btrfs

On Tue, Apr 26, 2016 at 5:44 AM, Juan Alberto Cirez
<jacirez@rdcsafety.com> wrote:
> Well,
> RAID1 offers no parity, striping, or spanning of disk space across
> multiple disks.

Btrfs raid1 does span, although it's typically called the "volume", or
a "pool" similar to ZFS terminology. e.g. 10 2TiB disks will get you a
single volume on which you can store about 10TiB of data with two
copies (called stripes in Btrfs). In effect the way chunk replication
works, it's a concat+raid1.


> RAID10 configuration, on the other hand, requires a minimum of four
> HDD, but it stripes data across mirrored pairs. As long as one disk in
> each mirrored pair is functional, data can be retrieved.

Not Btrfs raid10. It's not the devices that are mirrored pairs, but
rather the chunks. There's no way to control or determine which
devices the pairs are on. It's certain you get at least a partial
failure (data for sure, and likely metadata if it's also using the
raid10 profile) of the volume if you lose more than one device;
planning-wise you have to assume you lose the entire array.



>
> With GlusterFS as a distributed volume, the files are already spread
> among the servers causing file I/O to be spread fairly evenly among
> them as well, thus probably providing the benefit one might expect
> with stripe (RAID10).

Yes, the raid1 of Btrfs is just so you don't have to rebuild volumes
if you lose a drive. But since raid1 is not n-way copies, and only
means two copies, you don't really want the file systems getting that
big or you increase the chances of a double failure.

I've always thought it'd be neat in a Btrfs + GlusterFS setup, if it were
possible for Btrfs to inform Gluster FS of "missing/corrupt" files,
and then for Btrfs to drop reference for those files, instead of
either rebuilding or remaining degraded. And then let GlusterFS deal
with replication of those files to maintain redundancy. i.e. the Btrfs
volumes would be single profile for data, and raid1 for metadata. When
there's n-way raid1, each drive can have a copy of the file system,
and it'd tolerate in effect n-1 drive failures and the file system
could at least still inform Gluster (or Ceph) of the missing data, the
file system still remains valid, only briefly degraded, and can still
be expanded when new drives become available.

I'm not a big fan of hot (or cold) spares. They contribute nothing,
but take up physical space and power.




-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Add device while rebalancing
  2016-04-27  0:58               ` Chris Murphy
@ 2016-04-27 10:37                 ` Duncan
  2016-04-27 11:22                 ` Austin S. Hemmelgarn
  1 sibling, 0 replies; 21+ messages in thread
From: Duncan @ 2016-04-27 10:37 UTC (permalink / raw)
  To: linux-btrfs

Chris Murphy posted on Tue, 26 Apr 2016 18:58:06 -0600 as excerpted:

> On Tue, Apr 26, 2016 at 5:44 AM, Juan Alberto Cirez
> <jacirez@rdcsafety.com> wrote:

>> RAID10 configuration, on the other hand, requires a minimum of four
>> HDD, but it stripes data across mirrored pairs. As long as one disk in
>> each mirrored pair is functional, data can be retrieved.
> 
> Not Btrfs raid10. It's not the devices that are mirrored pairs, but
> rather the chunks. There's no way to control or determine on what
> devices the pairs are on. It's certain you get at least a partial
> failure (data for sure and likely metadata if it's also using raid10
> profile) of the volume if you lose more than 1 device, planning wise you
> have to assume you lose the entire array.

Primarily quoting and restating the above (and below) to emphasize it.

Remember:

* btrfs raid is chunk-level, *NOT* device-level.  That has important 
implications in terms of recovery from degraded.

* btrfs parity-raid (raid56 mode) isn't yet mature and definitely nothing 
I'd trust in production.

* btrfs redundancy-raid (raid1 and raid10 modes, as well as dup-mode on a 
single device) are precisely pair-copy -- two copies, with the raid modes 
forcing each copy to a different device or set of devices.  More devices 
simply means more space, *NOT* more redundancy/copies.

Again, these copies are at the chunk level.  The chunks can and will be 
distributed across devices based on most space available, meaning loss of 
more than one device will in most cases kill the array.  Because mirror-
pairs happen at the chunk, not the device level, there is no such thing 
as loss of only one mirror in the mirror pair allowing more than a single 
device to fail, because statistically, the chances of both copies of some 
chunks being on those two now-failed/missing devices are pretty high.

* btrfs raid10 stripes N/2-way, while only duplicating exactly two-way.  
So a six-device raid10 will stripe three devices per mirror, while a 5-
device raid10 will stripe 2 devices per mirror, with the odd device out 
being a different device for each new chunk, due to the most-space-
left allocation algorithm.

>> With GlusterFS as a distributed volume, the files are already spread
>> among the servers causing file I/O to be spread fairly evenly among
>> them as well, thus probably providing the benefit one might expect with
>> stripe (RAID10).
> 
> Yes, the raid1 of Btrfs is just so you don't have to rebuild volumes if
> you lose a drive. But since raid1 is not n-way copies, and only means
> two copies, you don't really want the file systems getting that big or
> you increase the chances of a double failure.

Again emphasizing.  Since you're running a distributed filesystem on top, 
keep the lower level btrfs raids small and do more of them, multiple 
btrfs raid bricks per machine even, as long as your distributed level is 
specced to be able to lose the bricks of at least one entire machine, of 
course.

OTOH, unlike traditional raid, btrfs does actual checksumming and data/
metadata integrity at the block level, and can and will detect integrity 
issues and correct from the second copy when the raid level supplies one, 
assuming it's good of course.  That should fix problems at the lower 
level that other filesystems wouldn't, meaning fewer problems ever reach 
the distributed level in the first place.

Thus, also emphasizing something Austin suggested.  You may wish to 
consider btrfs raid1 on top of a pair of mdraid or dmraid raid0s.
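
As a sketch of that layout, using mdraid and four example disks (names
are placeholders only):

  # two raid0 pairs, with btrfs raid1 keeping one copy on each pair
  mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/sdb /dev/sdc
  mdadm --create /dev/md1 --level=0 --raid-devices=2 /dev/sdd /dev/sde
  mkfs.btrfs -d raid1 -m raid1 /dev/md0 /dev/md1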

As you are likely well aware, raid1 on top of raid0 is normally called 
raid01 and is discouraged in favor of raid10 (raid0 on top of raid1) for 
efficiency reasons when rebuilding from a lost device (with raid1 
underneath, the rebuild of a lost device is localized to the presumably 
two-device raid1; with raid1 on top, the whole raid0 stripe must be 
rebuilt, and that's normally at the whole-device level).

Of course putting the btrfs raid1 on top reverses this and would 
*normally* be discouraged as raid01, but btrfs raid1's operational data 
integrity handling, while not getting away from having to rebuild the 
whole raid0 stripe from the other one, does mean that gets done for an 
individual bad block -- no whole device failure necessary.

And of course you can't get that by putting btrfs raid0 on top, since 
then the underlying raid1 layer won't be doing that integrity 
verification, and if that bad block happens to be returned by the 
underlying raid1 layer, the btrfs raid0 will simply fail the verification 
and error out that read, despite another good copy on the underlying 
raid1, because btrfs won't know anything about it.

Meanwhile, as Austin says, btrfs' A/B copy read scheduling is... 
unoptimized.  Basically, it's simple even/odd PID based, so a single read 
thread will always hit the same copy, leaving the other one idle.  I've 
argued before that precisely that is a very good indication of where the 
btrfs devs themselves think btrfs is at.  It's clearly suboptimal, and 
much better scheduling examples exist in the kernel, including the 
mdraid read-scheduling code, praised for its efficiency, so the failure 
to optimize must be put down either to simply lacking the time due to 
higher priority development and bugfixing tasks, or to avoiding the 
dangers of "premature optimization".  In either case, 
that such unoptimized code remains in such a highly visible and 
performance critical place is an extremely strong indicator that btrfs 
devs themselves don't consider btrfs a stable and mature filesystem yet.

And putting a pair of md/dm raid0s below that btrfs raid1 both helps to 
make up a bit for btrfs raid1's braindead read-scheduling, and lets you 
exploit btrfs raid1's data integrity features.  Of course it also forces 
btrfs to a more deterministic distribution of those chunk copies, so you 
can lose up to all the devices in one of those raid0s, as long as the 
other one remains functional, but that's nothing to really count on, so 
you still plan for single device failure redundancy only at the 
individual brick level, and use the distributed filesystem layer to deal 
with whole brick failure above that.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Add device while rebalancing
  2016-04-27  0:58               ` Chris Murphy
  2016-04-27 10:37                 ` Duncan
@ 2016-04-27 11:22                 ` Austin S. Hemmelgarn
  2016-04-27 15:58                   ` Juan Alberto Cirez
  2016-04-27 23:19                   ` Chris Murphy
  1 sibling, 2 replies; 21+ messages in thread
From: Austin S. Hemmelgarn @ 2016-04-27 11:22 UTC (permalink / raw)
  To: Chris Murphy, Juan Alberto Cirez; +Cc: linux-btrfs

On 2016-04-26 20:58, Chris Murphy wrote:
> On Tue, Apr 26, 2016 at 5:44 AM, Juan Alberto Cirez
> <jacirez@rdcsafety.com> wrote:
>>
>> With GlusterFS as a distributed volume, the files are already spread
>> among the servers causing file I/O to be spread fairly evenly among
>> them as well, thus probably providing the benefit one might expect
>> with stripe (RAID10).
>
> Yes, the raid1 of Btrfs is just so you don't have to rebuild volumes
> if you lose a drive. But since raid1 is not n-way copies, and only
> means two copies, you don't really want the file systems getting that
> big or you increase the chances of a double failure.
>
> I've always though it'd be neat in a Btrfs + GlusterFS, if it were
> possible for Btrfs to inform Gluster FS of "missing/corrupt" files,
> and then for Btrfs to drop reference for those files, instead of
> either rebuilding or remaining degraded. And then let GlusterFS deal
> with replication of those files to maintain redundancy. i.e. the Btrfs
> volumes would be single profile for data, and raid1 for metadata. When
> there's n-way raid1, each drive can have a copy of the file system,
> and it'd tolerate in effect n-1 drive failures and the file system
> could at least still inform Gluster (or Ceph) of the missing data, the
> file system still remains valid, only briefly degraded, and can still
> be expanded when new drives become available.
FWIW, I _think_ this can be done with the scrubbing code in GlusterFS. 
It's designed to repair data mismatches, but I'm not sure how it handles 
missing copies of data.  However, in the current state, there's no way 
without external scripts to handle re-shaping of the storage bricks if 
part of them fails.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Add device while rebalancing
  2016-04-27 11:22                 ` Austin S. Hemmelgarn
@ 2016-04-27 15:58                   ` Juan Alberto Cirez
  2016-04-27 16:29                     ` Holger Hoffstätte
  2016-04-27 23:19                   ` Chris Murphy
  1 sibling, 1 reply; 21+ messages in thread
From: Juan Alberto Cirez @ 2016-04-27 15:58 UTC (permalink / raw)
  To: Austin S. Hemmelgarn; +Cc: Chris Murphy, linux-btrfs

WOW!
Correct me if I'm wrong, but the sum total of the above seems to
suggest (at first glance) that BTRFS adds several layers of complexity
for little real benefit (at least in the use case of btrfs at the
brick layer with a distributed filesystem on top)...

"...I've always though it'd be neat in a Btrfs + GlusterFS, if it were
possible for Btrfs to inform Gluster FS of "missing/corrupt" files,
and then for Btrfs to drop reference for those files, instead of
either rebuilding or remaining degraded. And then let GlusterFS deal
with replication of those files to maintain redundancy. i.e. the Btrfs
volumes would be single profile for data, and raid1 for metadata. When
there's n-way raid1, each drive can have a copy of the file system,
and it'd tolerate in effect n-1 drive failures and the file system
could at least still inform Gluster (or Ceph) of the missing data, the
file system still remains valid, only briefly degraded, and can still
be expanded when new drives become available..."

That in my n00b opinion would be brilliant in a real world use case.


On Wed, Apr 27, 2016 at 5:22 AM, Austin S. Hemmelgarn
<ahferroin7@gmail.com> wrote:
> On 2016-04-26 20:58, Chris Murphy wrote:
>>
>> On Tue, Apr 26, 2016 at 5:44 AM, Juan Alberto Cirez
>> <jacirez@rdcsafety.com> wrote:
>>>
>>>
>>> With GlusterFS as a distributed volume, the files are already spread
>>> among the servers causing file I/O to be spread fairly evenly among
>>> them as well, thus probably providing the benefit one might expect
>>> with stripe (RAID10).
>>
>>
>> Yes, the raid1 of Btrfs is just so you don't have to rebuild volumes
>> if you lose a drive. But since raid1 is not n-way copies, and only
>> means two copies, you don't really want the file systems getting that
>> big or you increase the chances of a double failure.
>>
>> I've always though it'd be neat in a Btrfs + GlusterFS, if it were
>> possible for Btrfs to inform Gluster FS of "missing/corrupt" files,
>> and then for Btrfs to drop reference for those files, instead of
>> either rebuilding or remaining degraded. And then let GlusterFS deal
>> with replication of those files to maintain redundancy. i.e. the Btrfs
>> volumes would be single profile for data, and raid1 for metadata. When
>> there's n-way raid1, each drive can have a copy of the file system,
>> and it'd tolerate in effect n-1 drive failures and the file system
>> could at least still inform Gluster (or Ceph) of the missing data, the
>> file system still remains valid, only briefly degraded, and can still
>> be expanded when new drives become available.
>
> FWIW, I _think_ this can be done with the scrubbing code in GlusterFS. It's
> designed to repair data mismatches, but I'm not sure how it handles missing
> copies of data.  However, in the current state, there's no way without
> external scripts to handle re-shaping of the storage bricks if part of them
> fails.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Add device while rebalancing
  2016-04-27 15:58                   ` Juan Alberto Cirez
@ 2016-04-27 16:29                     ` Holger Hoffstätte
  2016-04-27 16:38                       ` Juan Alberto Cirez
  0 siblings, 1 reply; 21+ messages in thread
From: Holger Hoffstätte @ 2016-04-27 16:29 UTC (permalink / raw)
  To: linux-btrfs

On 04/27/16 17:58, Juan Alberto Cirez wrote:
> Correct me if I'm wrong but the sum total of the above seems to
> suggest (at first glance) that BRTFS add several layers of complexity,
> but for little real benefit (at least in the case use of btrfs at the
> brick layer with a distributed filesystem on top)...

This may come as a surprise, but the same can be said for every other
(common) filesystem (+ device management stack) that can be used
standalone.

Jeff Darcy (of GlusterFS) just wrote a really nice blog post on why
current filesystems and their historically grown requirements (mostly
as they relate to the POSIX interface standard) are in many ways
just not a good fit for scale-out/redundant storage:
http://pl.atyp.us/2016-05-updating-posix.html

Quite a few of the capabilities & features which are useful or
necessary in standalone operation (regardless of single- or multi-
device setup) are *actively unhelpful* in a distributed context, which
is why e.g. Ceph will soon do away with the on-disk filesystem for
data, and manage metadata exclusively by itself.

cheers,
Holger


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Add device while rebalancing
  2016-04-27 16:29                     ` Holger Hoffstätte
@ 2016-04-27 16:38                       ` Juan Alberto Cirez
  2016-04-27 16:40                         ` Juan Alberto Cirez
  0 siblings, 1 reply; 21+ messages in thread
From: Juan Alberto Cirez @ 2016-04-27 16:38 UTC (permalink / raw)
  To: Holger Hoffstätte; +Cc: linux-btrfs

Holger,
If this is so, then it leaves me even more confused. I was under the
impression that the driving imperative for the creation of btrfs was
to address the shortcomings of current filesystems within the context
of distributed data. That the idea was to create a low-level
filesystem that would be the primary choice as a block/brick layer for
scale-out, distributed data storage...

On Wed, Apr 27, 2016 at 10:29 AM, Holger Hoffstätte
<holger.hoffstaette@googlemail.com> wrote:
> On 04/27/16 17:58, Juan Alberto Cirez wrote:
>> Correct me if I'm wrong, but the sum total of the above seems to
>> suggest (at first glance) that Btrfs adds several layers of complexity
>> for little real benefit (at least in the use case of btrfs at the
>> brick layer with a distributed filesystem on top)...
>
> This may come as a surprise, but the same can be said for every other
> (common) filesystem (+ device management stack) that can be used
> standalone.
>
> Jeff Darcy (of GlusterFS) just wrote a really nice blog post about why
> current filesystems and their historically grown requirements (mostly
> as they relate to the POSIX interface standard) are in many ways
> just not a good fit for scale-out/redundant storage:
> http://pl.atyp.us/2016-05-updating-posix.html
>
> Quite a few of the capabilities & features which are useful or
> necessary in standalone operation (regardless of single- or multi-
> device setup) are *actively unhelpful* in a distributed context, which
> is why e.g. Ceph will soon do away with the on-disk filesystem for
> data, and manage metadata exclusively by itself.
>
> cheers,
> Holger
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Add device while rebalancing
  2016-04-27 16:38                       ` Juan Alberto Cirez
@ 2016-04-27 16:40                         ` Juan Alberto Cirez
  2016-04-27 17:23                           ` Holger Hoffstätte
  0 siblings, 1 reply; 21+ messages in thread
From: Juan Alberto Cirez @ 2016-04-27 16:40 UTC (permalink / raw)
  To: Holger Hoffstätte; +Cc: linux-btrfs

Holger,
If this is so, then it leaves me even more confused. I was under the
impression that the driving imperative for the creation of btrfs was
to address the shortcomings of current filesystems (within the context
of distributed data). That the idea was to create a low-level
filesystem that would be the primary choice as a block/brick layer for
scale-out, distributed data storage...

On Wed, Apr 27, 2016 at 10:38 AM, Juan Alberto Cirez
<jacirez@rdcsafety.com> wrote:
> Holger,
> If this is so, then it leaves me even more confused. I was under the
> impression that the driving imperative for the creation of btrfs was
> to address the shortcomings of current filesystems within the context
> of distributed data. That the idea was to create a low-level
> filesystem that would be the primary choice as a block/brick layer for
> scale-out, distributed data storage...
>
> On Wed, Apr 27, 2016 at 10:29 AM, Holger Hoffstätte
> <holger.hoffstaette@googlemail.com> wrote:
>> On 04/27/16 17:58, Juan Alberto Cirez wrote:
>>> Correct me if I'm wrong, but the sum total of the above seems to
>>> suggest (at first glance) that Btrfs adds several layers of complexity
>>> for little real benefit (at least in the use case of btrfs at the
>>> brick layer with a distributed filesystem on top)...
>>
>> This may come as a surprise, but the same can be said for every other
>> (common) filesystem (+ device management stack) that can be used
>> standalone.
>>
>> Jeff Darcy (of GlusterFS) just wrote a really nice blog post about why
>> current filesystems and their historically grown requirements (mostly
>> as they relate to the POSIX interface standard) are in many ways
>> just not a good fit for scale-out/redundant storage:
>> http://pl.atyp.us/2016-05-updating-posix.html
>>
>> Quite a few of the capabilities & features which are useful or
>> necessary in standalone operation (regardless of single- or multi-
>> device setup) are *actively unhelpful* in a distributed context, which
>> is why e.g. Ceph will soon do away with the on-disk filesystem for
>> data, and manage metadata exclusively by itself.
>>
>> cheers,
>> Holger
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Add device while rebalancing
  2016-04-27 16:40                         ` Juan Alberto Cirez
@ 2016-04-27 17:23                           ` Holger Hoffstätte
  0 siblings, 0 replies; 21+ messages in thread
From: Holger Hoffstätte @ 2016-04-27 17:23 UTC (permalink / raw)
  To: Juan Alberto Cirez; +Cc: linux-btrfs

On 04/27/16 18:40, Juan Alberto Cirez wrote:
> If this is so, then it leaves me even more confused. I was under the
> impression that the driving imperative for the creation of btrfs was
> to address the shortcomings of current filesystems (within the context
> of distributed data). That the idea was to create a low-level
> filesystem that would be the primary choice as a block/brick layer for
> scale-out, distributed data storage...

I can't speak for who was or is motivated by what. Btrfs was a necessary
reaction to ZFS, and AFAIK this had nothing to do with distributed storage
but rather with growing concerns around reliability (checksumming),
scalability and operational ease: snapshotting, growing/shrinking etc.

It's true that some of btrfs' capabilities make it look like a good
candidate, and e.g. Ceph started out using it. For many reasons, that
didn't work out (AFAIK btrfs maturity + extensibility) - but it also
did not address a fundamental mismatch in requirements, one which
other filesystems (ext4, xfs) could not address either. btrfs simply
does "too much" because it has to; you cannot remove or turn off half
of what makes a kernel-based filesystem a usable filesystem. This is
kind of sad because at its core btrfs *is* an object store with
various trees for metadata handling and whatnot - but there's no
easy way to turn off all the "Unix is stupid" stuff.

AFAIK Gluster will soon also start managing xattrs differently,
so this is not limited to Ceph.

I've been following this saga for several years now and it's
absolutely *astounding* how many bugs and performance problems
Ceph has unearthed in existing filesystems, simply because it
stresses them in ways they have never been stressed before - only to
create the illusion of a distributed key/value store, badly.
I don't want to argue about details; you can read more about some
of the reasons in [1].

[grumble grumble exokernels and composable things in userland grumble]

cheers
Holger

[1] http://www.slideshare.net/sageweil1/ceph-and-rocksdb


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Add device while rebalancing
  2016-04-27 11:22                 ` Austin S. Hemmelgarn
  2016-04-27 15:58                   ` Juan Alberto Cirez
@ 2016-04-27 23:19                   ` Chris Murphy
  2016-04-28 11:21                     ` Austin S. Hemmelgarn
  1 sibling, 1 reply; 21+ messages in thread
From: Chris Murphy @ 2016-04-27 23:19 UTC (permalink / raw)
  To: Austin S. Hemmelgarn; +Cc: Chris Murphy, Juan Alberto Cirez, linux-btrfs

On Wed, Apr 27, 2016 at 5:22 AM, Austin S. Hemmelgarn
<ahferroin7@gmail.com> wrote:
> On 2016-04-26 20:58, Chris Murphy wrote:
>>
>> On Tue, Apr 26, 2016 at 5:44 AM, Juan Alberto Cirez
>> <jacirez@rdcsafety.com> wrote:
>>>
>>>
>>> With GlusterFS as a distributed volume, the files are already spread
>>> among the servers causing file I/O to be spread fairly evenly among
>>> them as well, thus probably providing the benefit one might expect
>>> with stripe (RAID10).
>>
>>
>> Yes, the raid1 of Btrfs is just there so you don't have to rebuild
>> volumes if you lose a drive. But since Btrfs raid1 is not n-way
>> mirroring and only means two copies, you don't really want the
>> filesystems getting that big, or you increase the chances of a double
>> failure.
>>
>> I've always thought it'd be neat, in a Btrfs + GlusterFS setup, if it
>> were possible for Btrfs to inform GlusterFS of "missing/corrupt" files
>> and then drop the references to those files, instead of either
>> rebuilding or remaining degraded, and to let GlusterFS deal with
>> replicating those files to maintain redundancy. I.e. the Btrfs volumes
>> would use the single profile for data and raid1 for metadata. Once
>> there's n-way raid1, each drive can carry a copy of the filesystem, so
>> it'd tolerate in effect n-1 drive failures: the filesystem could still
>> inform Gluster (or Ceph) of the missing data, would remain valid, only
>> briefly degraded, and could still be expanded when new drives become
>> available.
>
> FWIW, I _think_ this can be done with the scrubbing code in GlusterFS. It's
> designed to repair data mismatches, but I'm not sure how it handles missing
> copies of data.  However, in the current state, there's no way without
> external scripts to handle re-shaping of the storage bricks if part of them
> fails.

Yeah, I haven't tried doing a scrub, parsing dmesg for busted file
paths, and feeding those paths into rm to see what happens. Will they
get deleted without additional errors? If so, good; a second scrub
should then come back clean. After that, 'btrfs device delete missing'
to get rid of the broken device *and* cause the missing metadata to be
replicated again, and in theory the fs should be back to normal. But
it'd have to be tested with an umount followed by a mount to see if
-o degraded is still required.
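
Spelled out, that experiment might look roughly like the following
(untested; the dmesg matching in particular depends on the kernel
version, and /srv/brick1 is just a placeholder mount point):

  # scrub in the foreground so it finishes before we read the log
  btrfs scrub start -B /srv/brick1

  # pull the paths of unrecoverable files out of dmesg
  dmesg | grep BTRFS | grep -o 'path: .*' | sort -u

  # remove the busted files by hand, then scrub again; the second
  # scrub should in theory come back clean
  rm /srv/brick1/path/to/busted/file
  btrfs scrub start -B /srv/brick1

  # drop the dead device; with raid1 metadata this also re-replicates
  # the metadata that only had a copy on it
  btrfs device delete missing /srv/brick1

  # remount to see whether -o degraded is still needed
  umount /srv/brick1
  mount /dev/sdb /srv/brick1
  # (fall back to: mount -o degraded /dev/sdb /srv/brick1)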

-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Add device while rebalancing
  2016-04-27 23:19                   ` Chris Murphy
@ 2016-04-28 11:21                     ` Austin S. Hemmelgarn
  0 siblings, 0 replies; 21+ messages in thread
From: Austin S. Hemmelgarn @ 2016-04-28 11:21 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Juan Alberto Cirez, linux-btrfs

On 2016-04-27 19:19, Chris Murphy wrote:
> On Wed, Apr 27, 2016 at 5:22 AM, Austin S. Hemmelgarn
> <ahferroin7@gmail.com> wrote:
>> On 2016-04-26 20:58, Chris Murphy wrote:
>>>
>>> On Tue, Apr 26, 2016 at 5:44 AM, Juan Alberto Cirez
>>> <jacirez@rdcsafety.com> wrote:
>>>>
>>>>
>>>> With GlusterFS as a distributed volume, the files are already spread
>>>> among the servers causing file I/O to be spread fairly evenly among
>>>> them as well, thus probably providing the benefit one might expect
>>>> with stripe (RAID10).
>>>
>>>
>>> Yes, the raid1 of Btrfs is just there so you don't have to rebuild
>>> volumes if you lose a drive. But since Btrfs raid1 is not n-way
>>> mirroring and only means two copies, you don't really want the
>>> filesystems getting that big, or you increase the chances of a double
>>> failure.
>>>
>>> I've always thought it'd be neat, in a Btrfs + GlusterFS setup, if it
>>> were possible for Btrfs to inform GlusterFS of "missing/corrupt" files
>>> and then drop the references to those files, instead of either
>>> rebuilding or remaining degraded, and to let GlusterFS deal with
>>> replicating those files to maintain redundancy. I.e. the Btrfs volumes
>>> would use the single profile for data and raid1 for metadata. Once
>>> there's n-way raid1, each drive can carry a copy of the filesystem, so
>>> it'd tolerate in effect n-1 drive failures: the filesystem could still
>>> inform Gluster (or Ceph) of the missing data, would remain valid, only
>>> briefly degraded, and could still be expanded when new drives become
>>> available.
>>
>> FWIW, I _think_ this can be done with the scrubbing code in GlusterFS. It's
>> designed to repair data mismatches, but I'm not sure how it handles missing
>> copies of data.  However, in the current state, there's no way without
>> external scripts to handle re-shaping of the storage bricks if part of them
>> fails.
>
> Yeah, I haven't tried doing a scrub, parsing dmesg for busted file
> paths, and feeding those paths into rm to see what happens. Will they
> get deleted without additional errors? If so, good; a second scrub
> should then come back clean. After that, 'btrfs device delete missing'
> to get rid of the broken device *and* cause the missing metadata to be
> replicated again, and in theory the fs should be back to normal. But
> it'd have to be tested with an umount followed by a mount to see if
> -o degraded is still required.
>
I'm not entirely certain. I had been planning to add a test for this to
my usual testing before the system I use for it went offline; I just
haven't had the time to get it working again.  If I find the time in
the near future, I may just test it on my laptop in a VM.
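
For anyone wanting to reproduce that kind of test without dedicated
hardware, a throwaway setup on loop devices (sizes and paths here are
arbitrary) is usually enough:

  # back three small "disks" with sparse files and attach them
  truncate -s 4G /tmp/d1.img /tmp/d2.img /tmp/d3.img
  DEV1=$(losetup -f --show /tmp/d1.img)
  DEV2=$(losetup -f --show /tmp/d2.img)
  DEV3=$(losetup -f --show /tmp/d3.img)

  # single data, raid1 metadata, as discussed earlier in the thread
  mkfs.btrfs -d single -m raid1 "$DEV1" "$DEV2" "$DEV3"
  mkdir -p /mnt/test
  mount "$DEV1" /mnt/test

  # ...write some test data, then simulate losing a device:
  umount /mnt/test
  losetup -d "$DEV3"
  mount -o degraded "$DEV1" /mnt/test
  # now run the scrub / rm / device delete sequence from above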

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2016-04-28 11:21 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-04-22 20:36 Add device while rebalancing Juan Alberto Cirez
2016-04-23  5:38 ` Duncan
2016-04-25 11:18   ` Austin S. Hemmelgarn
2016-04-25 12:43     ` Duncan
2016-04-25 13:02       ` Austin S. Hemmelgarn
2016-04-26 10:50         ` Juan Alberto Cirez
2016-04-26 11:11           ` Austin S. Hemmelgarn
2016-04-26 11:44             ` Juan Alberto Cirez
2016-04-26 12:04               ` Austin S. Hemmelgarn
2016-04-26 12:14                 ` Juan Alberto Cirez
2016-04-26 12:44                   ` Austin S. Hemmelgarn
2016-04-27  0:58               ` Chris Murphy
2016-04-27 10:37                 ` Duncan
2016-04-27 11:22                 ` Austin S. Hemmelgarn
2016-04-27 15:58                   ` Juan Alberto Cirez
2016-04-27 16:29                     ` Holger Hoffstätte
2016-04-27 16:38                       ` Juan Alberto Cirez
2016-04-27 16:40                         ` Juan Alberto Cirez
2016-04-27 17:23                           ` Holger Hoffstätte
2016-04-27 23:19                   ` Chris Murphy
2016-04-28 11:21                     ` Austin S. Hemmelgarn

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.