* Add device while rebalancing
@ 2016-04-22 20:36 Juan Alberto Cirez
2016-04-23 5:38 ` Duncan
0 siblings, 1 reply; 21+ messages in thread
From: Juan Alberto Cirez @ 2016-04-22 20:36 UTC (permalink / raw)
To: linux-btrfs
Good morning,
I am new to this list and to btrfs in general. I have a quick
question: Can I add a new device to the pool while the btrfs
filesystem balance command is running on the drive pool?
Thanks
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Add device while rebalancing
2016-04-22 20:36 Add device while rebalancing Juan Alberto Cirez
@ 2016-04-23 5:38 ` Duncan
2016-04-25 11:18 ` Austin S. Hemmelgarn
0 siblings, 1 reply; 21+ messages in thread
From: Duncan @ 2016-04-23 5:38 UTC (permalink / raw)
To: linux-btrfs
Juan Alberto Cirez posted on Fri, 22 Apr 2016 14:36:44 -0600 as excerpted:
> Good morning,
> I am new to this list and to btrfs in general. I have a quick question:
> Can I add a new device to the pool while the btrfs filesystem balance
> command is running on the drive pool?
Adding a device while balancing shouldn't be a problem. However,
depending on your redundancy mode, you may wish to cancel the balance and
start a new one after the device add, so the balance will take account of
it as well and balance it into the mix.
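The cancel/add/re-balance sequence looks like this (the device and
mountpoint names are hypothetical; all of these need root):

```shell
# Hypothetical pool mounted at /mnt/pool; /dev/sdg is the new disk.
btrfs balance cancel /mnt/pool       # stops once the in-flight chunk finishes
btrfs device add /dev/sdg /mnt/pool  # new device is usable immediately
btrfs balance start /mnt/pool        # re-run so existing chunks spread onto it
```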
Note that while device add doesn't do more than that on its own, device
delete/remove effectively initiates its own balance, moving the chunks on
the device being removed to the other devices. So you wouldn't want to
be running a balance and then do a device remove at the same time.
Similarly with btrfs replace, although in that case it's more directly
moving data from the device being replaced (if it's still there, or using
redundancy or parity to recover it if not) to the replacement device, a
more limited and often faster operation. But you probably still don't
want to do a balance at the same time as it places unnecessary stress on
both the filesystem and the hardware, and even if the filesystem and
devices handle the stress fine, the result is going to be that both
operations take longer as they're both intensive operations that will
interfere with each other to some extent.
Similarly with btrfs scrub.  The operations are logically distinct
enough that they shouldn't really interfere with each other,
but they're both hardware intensive operations that will put unnecessary
stress on the system if you're doing more than one at a time, and will
result in both going slower than they normally would.
And again with snapshotting operations. Making a snapshot is normally
nearly instantaneous, but there's a scaling issue if you have too many
per filesystem (try to keep it under 2000 snapshots per filesystem total,
if possible, and definitely keep it under 10K or some operations will
slow down substantially), and deleting snapshots is more work, so while
you should ordinarily automatically thin down snapshots if you're
automatically making them quite frequently (say daily or more
frequently), you may want to put the snapshot deletion, at least, on hold
while you scrub or balance or device delete or replace.
Meanwhile, you mentioned being new to btrfs. If you haven't discovered
the wiki yet, please spend some time reading the user documentation
there, as it's likely to clear up a lot of questions you may have, and
you'll better understand how to effectively work with the filesystem when
you're done. It's well worth the time invested! =:^)
https://btrfs.wiki.kernel.org
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
* Re: Add device while rebalancing
2016-04-23 5:38 ` Duncan
@ 2016-04-25 11:18 ` Austin S. Hemmelgarn
2016-04-25 12:43 ` Duncan
0 siblings, 1 reply; 21+ messages in thread
From: Austin S. Hemmelgarn @ 2016-04-25 11:18 UTC (permalink / raw)
To: linux-btrfs
On 2016-04-23 01:38, Duncan wrote:
> Juan Alberto Cirez posted on Fri, 22 Apr 2016 14:36:44 -0600 as excerpted:
>
>> Good morning,
>> I am new to this list and to btrfs in general. I have a quick question:
>> Can I add a new device to the pool while the btrfs filesystem balance
>> command is running on the drive pool?
>
> Adding a device while balancing shouldn't be a problem. However,
> depending on your redundancy mode, you may wish to cancel the balance and
> start a new one after the device add, so the balance will take account of
> it as well and balance it into the mix.
I'm not 100% certain about how balance will handle this, except that
nothing should break. I believe that it picks a device each time it
goes to move a chunk, so it should evaluate any chunks operated on after
the addition of the device for possible placement on that device (and it
will probably end up putting a lot of them there because that device
will almost certainly be less full than any of the others). That said,
you probably do want to cancel the balance, add the device, and re-run
the balance so that things end up more evenly distributed.
>
> Note that while device add doesn't do more than that on its own, device
> delete/remove effectively initiates its own balance, moving the chunks on
> the device being removed to the other devices. So you wouldn't want to
> be running a balance and then do a device remove at the same time.
IIRC, trying to delete a device while running a balance will fail, and
return an error, because only one balance can be running at a given moment.
>
> Similarly with btrfs replace, although in that case it's more directly
> moving data from the device being replaced (if it's still there, or using
> redundancy or parity to recover it if not) to the replacement device, a
> more limited and often faster operation. But you probably still don't
> want to do a balance at the same time as it places unnecessary stress on
> both the filesystem and the hardware, and even if the filesystem and
> devices handle the stress fine, the result is going to be that both
> operations take longer as they're both intensive operations that will
> interfere with each other to some extent.
Agreed, this is generally not a good idea because of the stress it puts
on the devices (and because it probably isn't well tested).
>
> Similarly with btrfs scrub.  The operations are logically distinct
> enough that they shouldn't really interfere with each other,
> but they're both hardware intensive operations that will put unnecessary
> stress on the system if you're doing more than one at a time, and will
> result in both going slower than they normally would.
Actually, depending on a number of factors, scrubbing while balancing
can finish faster than running one then the other in sequence.
It's really dependent on how both decide to pick chunks, and how your
underlying devices handle read and write caching, but it can happen.
Most of the time though, it should take around the same amount of time
as running one then the other, or a little bit longer if you're on
traditional disks.
>
> And again with snapshotting operations. Making a snapshot is normally
> nearly instantaneous, but there's a scaling issue if you have too many
> per filesystem (try to keep it under 2000 snapshots per filesystem total,
> if possible, and definitely keep it under 10K or some operations will
> slow down substantially), and deleting snapshots is more work, so while
> you should ordinarily automatically thin down snapshots if you're
> automatically making them quite frequently (say daily or more
> frequently), you may want to put the snapshot deletion, at least, on hold
> while you scrub or balance or device delete or replace.
I would actually recommend putting all snapshot operations on hold, as
well as most writes to the filesystem, while doing a balance or device
deletion. The more writes you have while doing those, the longer they
take, and the less likely that you end up with a good on-disk layout of
the data.
* Re: Add device while rebalancing
2016-04-25 11:18 ` Austin S. Hemmelgarn
@ 2016-04-25 12:43 ` Duncan
2016-04-25 13:02 ` Austin S. Hemmelgarn
0 siblings, 1 reply; 21+ messages in thread
From: Duncan @ 2016-04-25 12:43 UTC (permalink / raw)
To: linux-btrfs
Austin S. Hemmelgarn posted on Mon, 25 Apr 2016 07:18:10 -0400 as
excerpted:
> On 2016-04-23 01:38, Duncan wrote:
>>
>> And again with snapshotting operations. Making a snapshot is normally
>> nearly instantaneous, but there's a scaling issue if you have too many
>> per filesystem (try to keep it under 2000 snapshots per filesystem
>> total, if possible, and definitely keep it under 10K or some operations
>> will slow down substantially), and deleting snapshots is more work, so
>> while you should ordinarily automatically thin down snapshots if you're
>> automatically making them quite frequently (say daily or more
>> frequently), you may want to put the snapshot deletion, at least, on
>> hold while you scrub or balance or device delete or replace.
> I would actually recommend putting all snapshot operations on hold, as
> well as most writes to the filesystem, while doing a balance or device
> deletion. The more writes you have while doing those, the longer they
> take, and the less likely that you end up with a good on-disk layout of
> the data.
The thing with snapshot writing is that all snapshot creation effectively
does is a bit of metadata writing. What snapshots primarily do is lock
existing extents in place (down within their chunk, with the higher chunk
level being the scope at which balance works), that would otherwise be
COWed elsewhere with the existing extent deleted on change, or simply
deleted on file delete.  A snapshot simply adds a reference to the
current version, so that deletion, either directly or from the COW, never
happens, and to do that simply requires a relatively small metadata write.
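For concreteness, the near-instant creation being described is a single
command (paths hypothetical):

```shell
# A read-only snapshot of a subvolume: one small metadata write that
# pins the current extents in place, regardless of subvolume size.
btrfs subvolume snapshot -r /mnt/pool/data /mnt/pool/snaps/data-20160425
```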
So while I agree in general that more writes means balances taking
longer, snapshot creation writes are pretty tiny in the scheme of things,
and won't affect the balance much, compared to larger writes you'll very
possibly still be doing unless you really do suspend pretty much all
write operations to that filesystem during the balance.
But as I said, snapshot deletions are an entirely different story, as
then all those previously locked in place extents are potentially freed,
and the filesystem must do a lot of work to figure out which ones it can
actually free and free them, vs. ones that still have other references
which therefore cannot yet be freed.
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
* Re: Add device while rebalancing
2016-04-25 12:43 ` Duncan
@ 2016-04-25 13:02 ` Austin S. Hemmelgarn
2016-04-26 10:50 ` Juan Alberto Cirez
0 siblings, 1 reply; 21+ messages in thread
From: Austin S. Hemmelgarn @ 2016-04-25 13:02 UTC (permalink / raw)
To: linux-btrfs
On 2016-04-25 08:43, Duncan wrote:
> Austin S. Hemmelgarn posted on Mon, 25 Apr 2016 07:18:10 -0400 as
> excerpted:
>
>> On 2016-04-23 01:38, Duncan wrote:
>>>
>>> And again with snapshotting operations. Making a snapshot is normally
>>> nearly instantaneous, but there's a scaling issue if you have too many
>>> per filesystem (try to keep it under 2000 snapshots per filesystem
>>> total, if possible, and definitely keep it under 10K or some operations
>>> will slow down substantially), and deleting snapshots is more work, so
>>> while you should ordinarily automatically thin down snapshots if you're
>>> automatically making them quite frequently (say daily or more
>>> frequently), you may want to put the snapshot deletion, at least, on
>>> hold while you scrub or balance or device delete or replace.
>
>> I would actually recommend putting all snapshot operations on hold, as
>> well as most writes to the filesystem, while doing a balance or device
>> deletion. The more writes you have while doing those, the longer they
>> take, and the less likely that you end up with a good on-disk layout of
>> the data.
>
> The thing with snapshot writing is that all snapshot creation effectively
> does is a bit of metadata writing. What snapshots primarily do is lock
> existing extents in place (down within their chunk, with the higher chunk
> level being the scope at which balance works), that would otherwise be
> COWed elsewhere with the existing extent deleted on change, or simply
> deleted on file delete.  A snapshot simply adds a reference to the
> current version, so that deletion, either directly or from the COW, never
> happens, and to do that simply requires a relatively small metadata write.
Unless I'm mistaken about the internals of BTRFS (which might be the
case), creating a snapshot has to update reference counts on every
single extent in every single file in the snapshot. For something small
this isn't much, but if you are snapshotting something big (say,
snapshotting an entire system with all the data in one subvolume), it
can amount to multiple MB of writes, and it gets even worse if you have
no shared extents to begin with (which is still pretty typical). On
some of the systems I work with at work, snapshotting a terabyte of data
can end up resulting in 10-20 MB of writes to disk (in this case, that
figure came from a partition containing mostly small files that were
just big enough that they didn't fit in-line in the metadata blocks).
This is of course still significantly faster than copying everything,
but it's not free either.
>
> So while I agree in general that more writes means balances taking
> longer, snapshot creation writes are pretty tiny in the scheme of things,
> and won't affect the balance much, compared to larger writes you'll very
> possibly still be doing unless you really do suspend pretty much all
> write operations to that filesystem during the balance.
In general, yes, except that there's the case of running with mostly
full metadata chunks, where it might result in a further chunk
allocation, which in turn can throw off the balanced layout. Balance
always allocates new chunks, and doesn't write into existing ones, so if
you're writing enough to allocate a new chunk while a balance is happening:
1. That chunk may or may not get considered by the balance code (I'm not
100% certain about this, but I believe it will be ignored by any balance
running at the time it gets allocated).
2. You run the risk of ending up with a chunk with almost nothing in it
which could be packed into another existing chunk.
Snapshots are not likely to trigger this, but it is still possible,
especially if you're taking lots of snapshots in a short period of time.
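If you do end up with near-empty chunks like that, a filtered balance
after the fact can repack them (mountpoint hypothetical):

```shell
# Rewrite only data chunks under 10% full; their contents get packed
# into existing chunks instead of lingering in mostly-empty ones.
btrfs balance start -dusage=10 /mnt/pool
```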
>
> But as I said, snapshot deletions are an entirely different story, as
> then all those previously locked in place extents are potentially freed,
> and the filesystem must do a lot of work to figure out which ones it can
> actually free and free them, vs. ones that still have other references
> which therefore cannot yet be freed.
Most of the issue here with balance is that you end up potentially doing
an amount of unnecessary work which is unquantifiable before it's done.
* Re: Add device while rebalancing
2016-04-25 13:02 ` Austin S. Hemmelgarn
@ 2016-04-26 10:50 ` Juan Alberto Cirez
2016-04-26 11:11 ` Austin S. Hemmelgarn
0 siblings, 1 reply; 21+ messages in thread
From: Juan Alberto Cirez @ 2016-04-26 10:50 UTC (permalink / raw)
To: Austin S. Hemmelgarn; +Cc: linux-btrfs
Thank you guys so very kindly for all your help and taking the time to
answer my question. I have been reading the wiki and online use cases
and otherwise delving deeper into the btrfs architecture.
I am managing a 520TB storage pool spread across 16 server pods and
have tried several methods of distributed storage. Last attempt was
using Zfs as a base for the physical bricks and GlusterFS as a glue to
string together the storage pool. I was not satisfied with the results
(mainly Zfs). Once I have run btrfs for a while on the test server
(32TB, 8x 4TB HDD RAID10) I will try btrfs/ceph.
On Mon, Apr 25, 2016 at 7:02 AM, Austin S. Hemmelgarn
<ahferroin7@gmail.com> wrote:
> On 2016-04-25 08:43, Duncan wrote:
>>
>> Austin S. Hemmelgarn posted on Mon, 25 Apr 2016 07:18:10 -0400 as
>> excerpted:
>>
>>> On 2016-04-23 01:38, Duncan wrote:
>>>>
>>>>
>>>> And again with snapshotting operations. Making a snapshot is normally
>>>> nearly instantaneous, but there's a scaling issue if you have too many
>>>> per filesystem (try to keep it under 2000 snapshots per filesystem
>>>> total, if possible, and definitely keep it under 10K or some operations
>>>> will slow down substantially), and deleting snapshots is more work, so
>>>> while you should ordinarily automatically thin down snapshots if you're
>>>> automatically making them quite frequently (say daily or more
>>>> frequently), you may want to put the snapshot deletion, at least, on
>>>> hold while you scrub or balance or device delete or replace.
>>
>>
>>> I would actually recommend putting all snapshot operations on hold, as
>>> well as most writes to the filesystem, while doing a balance or device
>>> deletion. The more writes you have while doing those, the longer they
>>> take, and the less likely that you end up with a good on-disk layout of
>>> the data.
>>
>>
>> The thing with snapshot writing is that all snapshot creation effectively
>> does is a bit of metadata writing. What snapshots primarily do is lock
>> existing extents in place (down within their chunk, with the higher chunk
>> level being the scope at which balance works), that would otherwise be
>> COWed elsewhere with the existing extent deleted on change, or simply
>> deleted on file delete.  A snapshot simply adds a reference to the
>> current version, so that deletion, either directly or from the COW, never
>> happens, and to do that simply requires a relatively small metadata write.
>
> Unless I'm mistaken about the internals of BTRFS (which might be the case),
> creating a snapshot has to update reference counts on every single extent in
> every single file in the snapshot. For something small this isn't much, but
> if you are snapshotting something big (say, snapshotting an entire system
> with all the data in one subvolume), it can amount to multiple MB of writes,
> and it gets even worse if you have no shared extents to begin with (which is
> still pretty typical). On some of the systems I work with at work,
> snapshotting a terabyte of data can end up resulting in 10-20 MB of writes
> to disk (in this case, that figure came from a partition containing mostly
> small files that were just big enough that they didn't fit in-line in the
> metadata blocks).
>
> This is of course still significantly faster than copying everything, but
> it's not free either.
>>
>>
>> So while I agree in general that more writes means balances taking
>> longer, snapshot creation writes are pretty tiny in the scheme of things,
>> and won't affect the balance much, compared to larger writes you'll very
>> possibly still be doing unless you really do suspend pretty much all
>> write operations to that filesystem during the balance.
>
> In general, yes, except that there's the case of running with mostly full
> metadata chunks, where it might result in a further chunk allocation, which
> in turn can throw off the balanced layout. Balance always allocates new
> chunks, and doesn't write into existing ones, so if you're writing enough to
> allocate a new chunk while a balance is happening:
> 1. That chunk may or may not get considered by the balance code (I'm not
> 100% certain about this, but I believe it will be ignored by any balance
> running at the time it gets allocated).
> 2. You run the risk of ending up with a chunk with almost nothing in it
> which could be packed into another existing chunk.
> Snapshots are not likely to trigger this, but it is still possible,
> especially if you're taking lots of snapshots in a short period of time.
>>
>>
>> But as I said, snapshot deletions are an entirely different story, as
>> then all those previously locked in place extents are potentially freed,
>> and the filesystem must do a lot of work to figure out which ones it can
>> actually free and free them, vs. ones that still have other references
>> which therefore cannot yet be freed.
>
> Most of the issue here with balance is that you end up potentially doing an
> amount of unnecessary work which is unquantifiable before it's done.
>
>
> --
* Re: Add device while rebalancing
2016-04-26 10:50 ` Juan Alberto Cirez
@ 2016-04-26 11:11 ` Austin S. Hemmelgarn
2016-04-26 11:44 ` Juan Alberto Cirez
0 siblings, 1 reply; 21+ messages in thread
From: Austin S. Hemmelgarn @ 2016-04-26 11:11 UTC (permalink / raw)
To: Juan Alberto Cirez; +Cc: linux-btrfs
On 2016-04-26 06:50, Juan Alberto Cirez wrote:
> Thank you guys so very kindly for all your help and taking the time to
> answer my question. I have been reading the wiki and online use cases
> and otherwise delving deeper into the btrfs architecture.
>
> I am managing a 520TB storage pool spread across 16 server pods and
> have tried several methods of distributed storage. Last attempt was
> using Zfs as a base for the physical bricks and GlusterFS as a glue to
> string together the storage pool. I was not satisfied with the results
> (mainly Zfs). Once I have run btrfs for a while on the test server
> (32TB, 8x 4TB HDD RAID10) I will try btrfs/ceph.
For what it's worth, GlusterFS works great on top of BTRFS. I don't
have any claims to usage in production, but I've done _a lot_ of testing
with it because we're replacing one of our critical file servers at work
with a couple of systems set up with Gluster on top of BTRFS, and I've
been looking at setting up a small storage cluster at home using it on a
couple of laptops I have which have non-functional displays. Based on
what I've seen, it appears to be rock solid with respect to the common
failure modes, provided you use something like raid1 mode on the BTRFS
side of things.
* Re: Add device while rebalancing
2016-04-26 11:11 ` Austin S. Hemmelgarn
@ 2016-04-26 11:44 ` Juan Alberto Cirez
2016-04-26 12:04 ` Austin S. Hemmelgarn
2016-04-27 0:58 ` Chris Murphy
0 siblings, 2 replies; 21+ messages in thread
From: Juan Alberto Cirez @ 2016-04-26 11:44 UTC (permalink / raw)
To: Austin S. Hemmelgarn; +Cc: linux-btrfs
Well,
RAID1 offers no parity, striping, or spanning of disk space across
multiple disks.
RAID10 configuration, on the other hand, requires a minimum of four
HDD, but it stripes data across mirrored pairs. As long as one disk in
each mirrored pair is functional, data can be retrieved.
With GlusterFS as a distributed volume, the files are already spread
among the servers causing file I/O to be spread fairly evenly among
them as well, thus probably providing the benefit one might expect
with stripe (RAID10).
The question I have now is: Should I use RAID10 or RAID1 underneath
a GlusterFS striped (and possibly replicated) volume?
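One capacity note while weighing the two (a sketch, with the copy-count
assumption stated in the code): btrfs raid1 and raid10 both keep exactly
two copies of every chunk, so on equal-sized devices they yield the same
usable space; raid10 adds striping for throughput, not extra capacity.

```python
def usable_tb(device_sizes_tb, copies=2):
    """Rough usable capacity for a two-copy btrfs profile (raid1 or
    raid10) on equal-sized devices: raw total divided by copy count."""
    return sum(device_sizes_tb) / copies

pool = [4] * 8          # the 8x 4TB test server mentioned in this thread
print(usable_tb(pool))  # 16.0 TB usable out of 32 TB raw
```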
On Tue, Apr 26, 2016 at 5:11 AM, Austin S. Hemmelgarn
<ahferroin7@gmail.com> wrote:
> On 2016-04-26 06:50, Juan Alberto Cirez wrote:
>>
>> Thank you guys so very kindly for all your help and taking the time to
>> answer my question. I have been reading the wiki and online use cases
>> and otherwise delving deeper into the btrfs architecture.
>>
>> I am managing a 520TB storage pool spread across 16 server pods and
>> have tried several methods of distributed storage. Last attempt was
>> using Zfs as a base for the physical bricks and GlusterFS as a glue to
>> string together the storage pool. I was not satisfied with the results
>> (mainly Zfs). Once I have run btrfs for a while on the test server
>> (32TB, 8x 4TB HDD RAID10) I will try btrfs/ceph.
>
> For what it's worth, GlusterFS works great on top of BTRFS. I don't have
> any claims to usage in production, but I've done _a lot_ of testing with it
> because we're replacing one of our critical file servers at work with a
> couple of systems set up with Gluster on top of BTRFS, and I've been looking
> at setting up a small storage cluster at home using it on a couple of
> laptops I have which have non-functional displays. Based on what I've seen,
> it appears to be rock solid with respect to the common failure modes,
> provided you use something like raid1 mode on the BTRFS side of things.
* Re: Add device while rebalancing
2016-04-26 11:44 ` Juan Alberto Cirez
@ 2016-04-26 12:04 ` Austin S. Hemmelgarn
2016-04-26 12:14 ` Juan Alberto Cirez
2016-04-27 0:58 ` Chris Murphy
1 sibling, 1 reply; 21+ messages in thread
From: Austin S. Hemmelgarn @ 2016-04-26 12:04 UTC (permalink / raw)
To: Juan Alberto Cirez; +Cc: linux-btrfs
On 2016-04-26 07:44, Juan Alberto Cirez wrote:
> Well,
> RAID1 offers no parity, striping, or spanning of disk space across
> multiple disks.
>
> RAID10 configuration, on the other hand, requires a minimum of four
> HDD, but it stripes data across mirrored pairs. As long as one disk in
> each mirrored pair is functional, data can be retrieved.
>
> With GlusterFS as a distributed volume, the files are already spread
> among the servers causing file I/O to be spread fairly evenly among
> them as well, thus probably providing the benefit one might expect
> with stripe (RAID10).
>
> The question I have now is: Should I use RAID10 or RAID1 underneath
> a GlusterFS striped (and possibly replicated) volume?
If you have enough systems and a new enough version of GlusterFS, I'd
suggest using raid1 on the low level, and then either a distributed
replicated volume or an erasure coded volume in GlusterFS.
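As a sketch of the second option (volume name, brick paths, and counts
here are all hypothetical), a dispersed volume over six raid1 bricks
where any two bricks can fail would be created roughly like:

```shell
# 6 bricks total, 2 of which are redundancy (k=4 data + m=2 coding).
gluster volume create gv0 disperse 6 redundancy 2 \
    server{1..6}:/bricks/brick0/gv0
gluster volume start gv0
```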
Having more individual nodes involved will improve your scalability to
larger numbers of clients, and you can have more nodes with the same
number of disks if you use raid1 instead of raid10 on BTRFS.  Using
erasure coding in Gluster will provide better resiliency with higher
node counts for each individual file, at the cost of moderately higher
CPU time being used. FWIW, RAID5 and RAID6 are both specific cases of
(mathematically) optimal erasure coding (RAID5 is n,n+1 and RAID6 is
n,n+2 using the normal notation), but the equivalent forms in Gluster
are somewhat risky with any decent sized cluster.
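That (n,n+m) framing can be made concrete with a tiny sketch (the k and
m values below are mine, purely illustrative):

```python
def dispersal(k, m):
    """k data fragments plus m redundancy fragments per file:
    storage efficiency and how many bricks may fail without loss."""
    return {"efficiency": k / (k + m), "tolerated_failures": m}

# RAID5 is the m=1 case and RAID6 the m=2 case of the same scheme;
# Gluster disperse volumes let you choose k and m per volume.
print(dispersal(4, 1))  # 80% efficient, survives any single failure
print(dispersal(4, 2))  # ~67% efficient, survives any two failures
```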
It is worth noting that I would not personally trust just GlusterFS or
just BTRFS with the data replication, BTRFS is still somewhat new
(although I haven't had a truly broken filesystem in more than a year),
and GlusterFS has a lot more failure modes because of the networking.
>
> On Tue, Apr 26, 2016 at 5:11 AM, Austin S. Hemmelgarn
> <ahferroin7@gmail.com> wrote:
>> On 2016-04-26 06:50, Juan Alberto Cirez wrote:
>>>
>>> Thank you guys so very kindly for all your help and taking the time to
>>> answer my question. I have been reading the wiki and online use cases
>>> and otherwise delving deeper into the btrfs architecture.
>>>
>>> I am managing a 520TB storage pool spread across 16 server pods and
>>> have tried several methods of distributed storage. Last attempt was
>>> using Zfs as a base for the physical bricks and GlusterFS as a glue to
>>> string together the storage pool. I was not satisfied with the results
>> (mainly Zfs). Once I have run btrfs for a while on the test server
>> (32TB, 8x 4TB HDD RAID10) I will try btrfs/ceph.
>>
>> For what it's worth, GlusterFS works great on top of BTRFS. I don't have
>> any claims to usage in production, but I've done _a lot_ of testing with it
>> because we're replacing one of our critical file servers at work with a
>> couple of systems set up with Gluster on top of BTRFS, and I've been looking
>> at setting up a small storage cluster at home using it on a couple of
>> laptops I have which have non-functional displays. Based on what I've seen,
>> it appears to be rock solid with respect to the common failure modes,
>> provided you use something like raid1 mode on the BTRFS side of things.
* Re: Add device while rebalancing
2016-04-26 12:04 ` Austin S. Hemmelgarn
@ 2016-04-26 12:14 ` Juan Alberto Cirez
2016-04-26 12:44 ` Austin S. Hemmelgarn
0 siblings, 1 reply; 21+ messages in thread
From: Juan Alberto Cirez @ 2016-04-26 12:14 UTC (permalink / raw)
To: Austin S. Hemmelgarn; +Cc: linux-btrfs
Thank you again, Austin.
My ideal case would be high availability coupled with reliable data
replication and integrity against accidental loss. I am willing to
cede ground on write speed, but reads have to be as optimized as
possible.
So far BTRFS RAID10 on the 32TB test server is quite good for both
read & write, and data loss/corruption has not been an issue yet. When
I introduce the network/distributed layer, I would like the same.
BTW, does Ceph provide similar functionality, reliability and performance?
On Tue, Apr 26, 2016 at 6:04 AM, Austin S. Hemmelgarn
<ahferroin7@gmail.com> wrote:
> On 2016-04-26 07:44, Juan Alberto Cirez wrote:
>>
>> Well,
>> RAID1 offers no parity, striping, or spanning of disk space across
>> multiple disks.
>>
>> RAID10 configuration, on the other hand, requires a minimum of four
>> HDD, but it stripes data across mirrored pairs. As long as one disk in
>> each mirrored pair is functional, data can be retrieved.
>>
>> With GlusterFS as a distributed volume, the files are already spread
>> among the servers causing file I/O to be spread fairly evenly among
>> them as well, thus probably providing the benefit one might expect
>> with stripe (RAID10).
>>
>> The question I have now is: Should I use RAID10 or RAID1 underneath
>> a GlusterFS striped (and possibly replicated) volume?
>
> If you have enough systems and a new enough version of GlusterFS, I'd
> suggest using raid1 on the low level, and then either a distributed
> replicated volume or an erasure coded volume in GlusterFS.
> Having more individual nodes involved will improve your scalability to
> larger numbers of clients, and you can have more nodes with the same number
> of disks if you use raid1 instead of raid10 on BTRFS.  Using erasure coding
> in Gluster will provide better resiliency with higher node counts for each
> individual file, at the cost of moderately higher CPU time being used.
> FWIW, RAID5 and RAID6 are both specific cases of (mathematically) optimal
> erasure coding (RAID5 is n,n+1 and RAID6 is n,n+2 using the normal
> notation), but the equivalent forms in Gluster are somewhat risky with any
> decent sized cluster.
>
> It is worth noting that I would not personally trust just GlusterFS or just
> BTRFS with the data replication, BTRFS is still somewhat new (although I
> haven't had a truly broken filesystem in more than a year), and GlusterFS
> has a lot more failure modes because of the networking.
>
>>
>> On Tue, Apr 26, 2016 at 5:11 AM, Austin S. Hemmelgarn
>> <ahferroin7@gmail.com> wrote:
>>>
>>> On 2016-04-26 06:50, Juan Alberto Cirez wrote:
>>>>
>>>>
>>>> Thank you guys so very kindly for all your help and taking the time to
>>>> answer my question. I have been reading the wiki and online use cases
>>>> and otherwise delving deeper into the btrfs architecture.
>>>>
>>>> I am managing a 520TB storage pool spread across 16 server pods and
>>>> have tried several methods of distributed storage. Last attempt was
>>>> using Zfs as a base for the physical bricks and GlusterFS as a glue to
>>>> string together the storage pool. I was not satisfied with the results
>>>> (mainly Zfs). Once I have run btrfs for a while on the test server
>>>> (32TB, 8x 4TB HDD RAID10) I will try btrfs/ceph
>>>
>>>
>>> For what it's worth, GlusterFS works great on top of BTRFS. I don't have
>>> any claims to usage in production, but I've done _a lot_ of testing with
>>> it
>>> because we're replacing one of our critical file servers at work with a
>>> couple of systems set up with Gluster on top of BTRFS, and I've been
>>> looking
>>> at setting up a small storage cluster at home using it on a couple of
>>> laptops I have which have non-functional displays. Based on what I've
>>> seen,
>>> it appears to be rock solid with respect to the common failure modes,
>>> provided you use something like raid1 mode on the BTRFS side of things.
>
>
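[Editorial note: the (n, n+1) erasure-coding view of RAID5 quoted above can be sketched in a few lines. The parity block is just the XOR of the n data blocks, so any single missing block is recoverable from the survivors. This is an illustrative sketch only, not Gluster or btrfs code.]

```python
def xor_blocks(blocks):
    """XOR equal-length blocks together byte by byte."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)

data = [b"AAAA", b"BBBB", b"CCCC"]  # n data blocks
parity = xor_blocks(data)           # the "+1" parity block

# Lose any one block and rebuild it from the survivors plus parity:
rebuilt = xor_blocks([data[0], data[2], parity])
assert rebuilt == data[1]
```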
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Add device while rebalancing
2016-04-26 12:14 ` Juan Alberto Cirez
@ 2016-04-26 12:44 ` Austin S. Hemmelgarn
0 siblings, 0 replies; 21+ messages in thread
From: Austin S. Hemmelgarn @ 2016-04-26 12:44 UTC (permalink / raw)
To: Juan Alberto Cirez; +Cc: linux-btrfs
On 2016-04-26 08:14, Juan Alberto Cirez wrote:
> Thank you again, Austin.
>
> My ideal case would be high availability coupled with reliable data
> replication and integrity against accidental lost. I am willing to
> cede ground on the write speed; but the read has to be as optimized as
> possible.
> So far BTRFS, RAID10 on the 32TB test server is quite good both read &
> write and data lost/corruption has not been an issue yet. When I
> introduce the network/distributed layer, I would like the same.
> BTW does Ceph provide similar functionality, reliability and performance?
I can't give as much advice on Ceph, except to say that when I last
tested it more than 2 years ago, the filesystem front-end had some
serious data integrity issues, and the block device front-end had some
sanity issues when dealing with systems going off-line (either crashing,
or being shut down). I don't know if they're fixed or not by now. It's
worth noting that while Gluster and Ceph are both intended for cluster
storage, Ceph has a very much more data-center oriented approach (it
appears from what I've seen to be optimized for lots of small systems
running as OSD's with a few bigger ones running as monitors and possibly
MDS's), while Gluster seems (again, personal perspective) to try to be
more agnostic of what hardware is involved. I will comment though that
it is exponentially easier to recover data from a failed GlusterFS
cluster than it is a failed Ceph cluster, Gluster uses flat files with a
few extended attributes for storage, whereas Ceph uses its own internal
binary object format (partly because Ceph is first and foremost an
object storage system, whereas Gluster is primarily intended as an
actual filesystem).
Also, with respect to performance, you may want to compare BTRFS raid10
mode to BTRFS raid1 on top of two LVM RAID0 volumes. I find this tends
to get better overall performance with no difference in data safety,
because BTRFS still has a pretty brain-dead I/O scheduler in the
multi-device code.
> On Tue, Apr 26, 2016 at 6:04 AM, Austin S. Hemmelgarn
> <ahferroin7@gmail.com> wrote:
>> On 2016-04-26 07:44, Juan Alberto Cirez wrote:
>>>
>>> Well,
>>> RAID1 offers no parity, striping, or spanning of disk space across
>>> multiple disks.
>>>
>>> RAID10 configuration, on the other hand, requires a minimum of four
>>> HDD, but it stripes data across mirrored pairs. As long as one disk in
>>> each mirrored pair is functional, data can be retrieved.
>>>
>>> With GlusterFS as a distributed volume, the files are already spread
>>> among the servers causing file I/O to be spread fairly evenly among
>>> them as well, thus probably providing the benefit one might expect
>>> with stripe (RAID10).
>>>
>>> The question I have now is: Should I use a RAID10 or RAID1 underneath
>> of a GlusterFS striped (and possibly replicated) volume?
>>
>> If you have enough systems and a new enough version of GlusterFS, I'd
>> suggest using raid1 on the low level, and then either a distributed
>> replicated volume or an erasure coded volume in GlusterFS.
>> Having more individual nodes involved will improve your scalability to
>> larger numbers of clients, and you can have more nodes with the same number
>> of disks if you use raid1 instead of raid10 on BTRFS. Using Erasure coding
>> in Gluster will provide better resiliency with higher node counts for each
>> individual file, at the cost of moderately higher CPU time being used.
>> FWIW, RAID5 and RAID6 are both specific cases of (mathematically) optimal
>> erasure coding (RAID5 is n,n+1 and RAID6 is n,n+2 using the normal
>> notation), but the equivalent forms in Gluster are somewhat risky with any
>> decent sized cluster.
>>
>> It is worth noting that I would not personally trust just GlusterFS or just
>> BTRFS with the data replication, BTRFS is still somewhat new (although I
>> haven't had a truly broken filesystem in more than a year), and GlusterFS
>> has a lot more failure modes because of the networking.
>>
>>>
>>> On Tue, Apr 26, 2016 at 5:11 AM, Austin S. Hemmelgarn
>>> <ahferroin7@gmail.com> wrote:
>>>>
>>>> On 2016-04-26 06:50, Juan Alberto Cirez wrote:
>>>>>
>>>>>
>>>>> Thank you guys so very kindly for all your help and taking the time to
>>>>> answer my question. I have been reading the wiki and online use cases
>>>>> and otherwise delving deeper into the btrfs architecture.
>>>>>
>>>>> I am managing a 520TB storage pool spread across 16 server pods and
>>>>> have tried several methods of distributed storage. Last attempt was
>>>>> using Zfs as a base for the physical bricks and GlusterFS as a glue to
>>>>> string together the storage pool. I was not satisfied with the results
>>>>> (mainly Zfs). Once I have run btrfs for a while on the test server
>>>>> (32TB, 8x 4TB HDD RAID10) I will try btrfs/ceph
>>>>
>>>>
>>>> For what it's worth, GlusterFS works great on top of BTRFS. I don't have
>>>> any claims to usage in production, but I've done _a lot_ of testing with
>>>> it
>>>> because we're replacing one of our critical file servers at work with a
>>>> couple of systems set up with Gluster on top of BTRFS, and I've been
>>>> looking
>>>> at setting up a small storage cluster at home using it on a couple of
>>>> laptops I have which have non-functional displays. Based on what I've
>>>> seen,
>>>> it appears to be rock solid with respect to the common failure modes,
>>>> provided you use something like raid1 mode on the BTRFS side of things.
* Re: Add device while rebalancing
2016-04-26 11:44 ` Juan Alberto Cirez
2016-04-26 12:04 ` Austin S. Hemmelgarn
@ 2016-04-27 0:58 ` Chris Murphy
2016-04-27 10:37 ` Duncan
2016-04-27 11:22 ` Austin S. Hemmelgarn
1 sibling, 2 replies; 21+ messages in thread
From: Chris Murphy @ 2016-04-27 0:58 UTC (permalink / raw)
To: Juan Alberto Cirez; +Cc: Austin S. Hemmelgarn, linux-btrfs
On Tue, Apr 26, 2016 at 5:44 AM, Juan Alberto Cirez
<jacirez@rdcsafety.com> wrote:
> Well,
> RAID1 offers no parity, striping, or spanning of disk space across
> multiple disks.
Btrfs raid1 does span, although the result is typically called the
"volume", or a "pool", similar to ZFS terminology. E.g. ten 2TiB disks
will get you a single volume on which you can store about 10TiB of data
with two copies (called stripes in Btrfs). In effect, the way chunk
replication works, it's a concat+raid1.
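[Editorial sketch of the concat+raid1 capacity arithmetic above. The function is an illustrative approximation of what the most-space-available allocator ends up providing, not actual btrfs code.]

```python
def raid1_usable(sizes):
    """Approximate usable capacity of a btrfs raid1 'concat+raid1'
    pool: every chunk gets two copies on two different devices,
    allocated most-free-space-first, so capacity is half the total
    unless one device dominates all the others combined."""
    total, largest = sum(sizes), max(sizes)
    rest = total - largest
    return total // 2 if largest <= rest else rest

assert raid1_usable([2] * 10) == 10   # ten 2TiB disks -> ~10TiB usable
assert raid1_usable([10, 1, 1]) == 2  # one huge disk can't mirror itself
```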
> RAID10 configuration, on the other hand, requires a minimum of four
> HDD, but it stripes data across mirrored pairs. As long as one disk in
> each mirrored pair is functional, data can be retrieved.
Not Btrfs raid10. It's not the devices that are mirrored pairs, but
rather the chunks. There's no way to control or determine on what
devices the pairs are on. It's certain you get at least a partial
failure (data for sure and likely metadata if it's also using raid10
profile) of the volume if you lose more than one device; planning-wise,
you have to assume you lose the entire array.
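[Editorial sketch of why a second device loss must be assumed fatal: chunk copies land on pairs of devices the administrator doesn't control, so with enough chunks nearly every possible two-device failure destroys both copies of something. The random pairing below is a stand-in for the allocator's behavior over a realistic allocation history, not the actual algorithm.]

```python
import itertools
import random

random.seed(42)  # deterministic for the example

def place_chunks(n_dev, n_chunks):
    """Place two copies of each chunk on a pair of distinct devices.

    Stand-in for btrfs's chunk allocator: over a realistic allocation
    history the mirror pairs end up spread across many different
    device combinations, with no way to control which."""
    return [tuple(random.sample(range(n_dev), 2)) for _ in range(n_chunks)]

chunks = place_chunks(6, 100)
pairs_used = {frozenset(c) for c in chunks}

# Of the 15 possible two-device failures on 6 devices, count how many
# would destroy both copies of at least one chunk:
fatal = sum(1 for p in itertools.combinations(range(6), 2)
            if frozenset(p) in pairs_used)
print(f"{fatal} of 15 possible two-device failures lose data")
```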
>
> With GlusterFS as a distributed volume, the files are already spread
> among the servers causing file I/O to be spread fairly evenly among
> them as well, thus probably providing the benefit one might expect
> with stripe (RAID10).
Yes, the raid1 of Btrfs is just so you don't have to rebuild volumes
if you lose a drive. But since raid1 is not n-way copies, and only
means two copies, you don't really want the file systems getting that
big or you increase the chances of a double failure.
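[The double-failure point can be quantified with simple binomial arithmetic; the 3% per-disk annual failure rate below is an assumed figure for illustration, not a measured one.]

```python
def p_at_least_two_failures(n_disks, p):
    """Chance that two or more of n_disks fail, given an independent
    per-disk failure probability p over some time window."""
    p_none = (1 - p) ** n_disks
    p_one = n_disks * p * (1 - p) ** (n_disks - 1)
    return 1 - p_none - p_one

# The chance of a double failure grows quickly as the pool gets bigger:
for n in (4, 10, 20, 40):
    print(n, "disks:", round(p_at_least_two_failures(n, 0.03), 4))
```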
I've always thought it'd be neat, in a Btrfs + GlusterFS setup, if it
were possible for Btrfs to inform GlusterFS of "missing/corrupt" files,
and then for Btrfs to drop reference for those files, instead of
either rebuilding or remaining degraded. And then let GlusterFS deal
with replication of those files to maintain redundancy. i.e. the Btrfs
volumes would be single profile for data, and raid1 for metadata. When
there's n-way raid1, each drive can have a copy of the file system,
and it'd tolerate in effect n-1 drive failures and the file system
could at least still inform Gluster (or Ceph) of the missing data, the
file system still remains valid, only briefly degraded, and can still
be expanded when new drives become available.
I'm not a big fan of hot (or cold) spares. They contribute nothing,
but take up physical space and power.
--
Chris Murphy
* Re: Add device while rebalancing
2016-04-27 0:58 ` Chris Murphy
@ 2016-04-27 10:37 ` Duncan
2016-04-27 11:22 ` Austin S. Hemmelgarn
1 sibling, 0 replies; 21+ messages in thread
From: Duncan @ 2016-04-27 10:37 UTC (permalink / raw)
To: linux-btrfs
Chris Murphy posted on Tue, 26 Apr 2016 18:58:06 -0600 as excerpted:
> On Tue, Apr 26, 2016 at 5:44 AM, Juan Alberto Cirez
> <jacirez@rdcsafety.com> wrote:
>> RAID10 configuration, on the other hand, requires a minimum of four
>> HDD, but it stripes data across mirrored pairs. As long as one disk in
>> each mirrored pair is functional, data can be retrieved.
>
> Not Btrfs raid10. It's not the devices that are mirrored pairs, but
> rather the chunks. There's no way to control or determine on what
> devices the pairs are on. It's certain you get at least a partial
> failure (data for sure and likely metadata if it's also using raid10
> profile) of the volume if you lose more than 1 device, planning wise you
> have to assume you lose the entire array.
Primarily quoting and restating the above (and below) to emphasize it.
Remember:
* btrfs raid is chunk-level, *NOT* device-level. That has important
implications in terms of recovery from degraded.
* btrfs parity-raid (raid56 mode) isn't yet mature and definitely nothing
I'd trust in production.
* btrfs redundancy-raid (raid1 and raid10 modes, as well as dup-mode on a
single device) are precisely pair-copy -- two copies, with the raid modes
forcing each copy to a different device or set of devices. More devices
simply means more space, *NOT* more redundancy/copies.
Again, these copies are at the chunk level. The chunks can and will be
distributed across devices based on most space available, meaning loss of
more than one device will in most cases kill the array. Because mirror-
pairs happen at the chunk, not the device level, there is no such thing
as loss of only one mirror in the mirror pair allowing more than a single
device to fail, because statistically, the chances of both copies of some
chunks being on those two now failed/missing devices is pretty high.
* btrfs raid10 stripes N/2-way, while only duplicating exactly two-way.
So a six-device raid10 will stripe three devices per mirror, while a 5-
device raid10 will stripe 2 devices per mirror, with the odd device out
being on a different device for each new chunk, due to the most-space-
left allocation algorithm.
>> With GlusterFS as a distributed volume, the files are already spread
>> among the servers causing file I/O to be spread fairly evenly among
>> them as well, thus probably providing the benefit one might expect with
>> stripe (RAID10).
>
> Yes, the raid1 of Btrfs is just so you don't have to rebuild volumes if
> you lose a drive. But since raid1 is not n-way copies, and only means
> two copies, you don't really want the file systems getting that big or
> you increase the chances of a double failure.
Again emphasizing. Since you're running a distributed filesystem on top,
keep the lower level btrfs raids small and do more of them, multiple
btrfs raid bricks per machine even, as long as your distributed level is
specced to be able to lose the bricks of at least one entire machine, of
course.
OTOH, unlike traditional raid, btrfs does actual checksumming and data/
metadata integrity at the block level, and can and will detect integrity
issues and correct from the second copy when the raid level supplies one,
assuming it's good of course. That should fix problems at the lower
level that other filesystems wouldn't, meaning less problems ever reach
the distributed level in the first place.
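[Editorial sketch of the verify-and-fix-from-the-second-copy path described above, using zlib's crc32 as a stand-in for btrfs's actual on-disk checksum (crc32c) and plain tuples as a stand-in for blocks.]

```python
import zlib

def read_with_repair(copies):
    """Return verified data from a list of (data, checksum) mirror
    copies, rewriting any corrupt copy from a good one.

    Conceptual sketch of btrfs raid1 self-healing; real btrfs
    checksums on-disk blocks with crc32c, not Python tuples."""
    for data, crc in copies:
        if zlib.crc32(data) == crc:
            for j, (d, c) in enumerate(copies):
                if zlib.crc32(d) != c:
                    copies[j] = (data, crc)  # heal the bad mirror
            return data
    raise IOError("checksum failure on every copy")

good = b"hello"
crc = zlib.crc32(good)
mirrors = [(b"hEllo", crc), (good, crc)]  # first copy silently corrupted
assert read_with_repair(mirrors) == good
assert mirrors[0] == (good, crc)          # the bad copy was rewritten
```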
Thus, also emphasizing something Austin suggested. You may wish to
consider btrfs raid1 on top of a pair of mdraid or dmraid raid0s.
As you are likely well aware, normally, raid1 on top of raid0 is called
raid01 and is discouraged in favor of raid10 (raid0 on top of raid1) for
rebuild-efficiency reasons when a device is lost (with raid1 underneath,
the rebuild of a lost device is localized to the presumably two-device
raid1; with raid1 on top, the whole raid0 stripe must be rebuilt, and
that's normally at the whole-device level).
Of course putting the btrfs raid1 on top reverses this and would
*normally* be discouraged as raid01, but btrfs raid1's operational data
integrity handling, while not getting away from having to rebuild the
whole raid0 stripe from the other one, does mean that gets done for an
individual bad block -- no whole device failure necessary.
And of course you can't get that by putting btrfs raid0 on top, since
then the underlying raid1 layer won't be doing that integrity
verification, and if a bad block happens to be returned by the
underlying raid1 layer, the btrfs raid0 will simply fail the
verification and error out that read, despite another good copy on the
underlying raid1, because btrfs won't know anything about it.
Meanwhile, as Austin says, btrfs' A/B copy read scheduling is...
unoptimized. Basically, it's simple even/odd PID based, so a single read
thread will always hit the same copy, leaving the other one idle. I've
argued before that precisely that is a very good indication of where the
btrfs devs themselves think btrfs is at, as it's clearly suboptimal,
while there are much better scheduling examples, including the mdraid
read-scheduling code, praised for its efficiency, in the kernel, and
failure to optimize must then be considered either simply lacking the
time due to higher priority development and bugfixing tasks, or an
avoidance of the dangers of "premature optimization". In either case,
that such unoptimized code remains in such a highly visible and
performance critical place is an extremely strong indicator that btrfs
devs themselves don't consider btrfs a stable and mature filesystem yet.
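[The even/odd-PID read scheduling described above amounts to something like this editorial sketch:]

```python
def pick_mirror(pid):
    """Roughly how btrfs raid1 picks which copy to read: even PIDs get
    one copy, odd PIDs the other, so a single-threaded reader always
    hits the same device and leaves its mirror idle."""
    return pid % 2

reads = [pick_mirror(12345) for _ in range(8)]  # one single-threaded reader
assert set(reads) == {1}  # every read lands on the same copy
```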
And putting a pair of md/dm raid0s below that btrfs raid1, both helps to
make up a bit for the btrfs raid1 braindead read-scheduling, and lets you
exploit btrfs raid1's data integrity features. Of course it also forces
btrfs to a more deterministic distribution of those chunk copies, so you
can lose up to all the devices in one of those raid0s, as long as the
other one remains functional, but that's nothing to really count on, so
you still plan for single device failure redundancy only at the
individual brick level, and use the distributed filesystem layer to deal
with whole brick failure above that.
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
* Re: Add device while rebalancing
2016-04-27 0:58 ` Chris Murphy
2016-04-27 10:37 ` Duncan
@ 2016-04-27 11:22 ` Austin S. Hemmelgarn
2016-04-27 15:58 ` Juan Alberto Cirez
2016-04-27 23:19 ` Chris Murphy
1 sibling, 2 replies; 21+ messages in thread
From: Austin S. Hemmelgarn @ 2016-04-27 11:22 UTC (permalink / raw)
To: Chris Murphy, Juan Alberto Cirez; +Cc: linux-btrfs
On 2016-04-26 20:58, Chris Murphy wrote:
> On Tue, Apr 26, 2016 at 5:44 AM, Juan Alberto Cirez
> <jacirez@rdcsafety.com> wrote:
>>
>> With GlusterFS as a distributed volume, the files are already spread
>> among the servers causing file I/O to be spread fairly evenly among
>> them as well, thus probably providing the benefit one might expect
>> with stripe (RAID10).
>
> Yes, the raid1 of Btrfs is just so you don't have to rebuild volumes
> if you lose a drive. But since raid1 is not n-way copies, and only
> means two copies, you don't really want the file systems getting that
> big or you increase the chances of a double failure.
>
> I've always thought it'd be neat in a Btrfs + GlusterFS, if it were
> possible for Btrfs to inform Gluster FS of "missing/corrupt" files,
> and then for Btrfs to drop reference for those files, instead of
> either rebuilding or remaining degraded. And then let GlusterFS deal
> with replication of those files to maintain redundancy. i.e. the Btrfs
> volumes would be single profile for data, and raid1 for metadata. When
> there's n-way raid1, each drive can have a copy of the file system,
> and it'd tolerate in effect n-1 drive failures and the file system
> could at least still inform Gluster (or Ceph) of the missing data, the
> file system still remains valid, only briefly degraded, and can still
> be expanded when new drives become available.
FWIW, I _think_ this can be done with the scrubbing code in GlusterFS.
It's designed to repair data mismatches, but I'm not sure how it handles
missing copies of data. However, in the current state, there's no way
without external scripts to handle re-shaping of the storage bricks if
part of them fails.
* Re: Add device while rebalancing
2016-04-27 11:22 ` Austin S. Hemmelgarn
@ 2016-04-27 15:58 ` Juan Alberto Cirez
2016-04-27 16:29 ` Holger Hoffstätte
2016-04-27 23:19 ` Chris Murphy
1 sibling, 1 reply; 21+ messages in thread
From: Juan Alberto Cirez @ 2016-04-27 15:58 UTC (permalink / raw)
To: Austin S. Hemmelgarn; +Cc: Chris Murphy, linux-btrfs
WOW!
Correct me if I'm wrong, but the sum total of the above seems to
suggest (at first glance) that BTRFS adds several layers of complexity
for little real benefit (at least in the use case of btrfs at the
brick layer with a distributed filesystem on top)...
"...I've always thought it'd be neat in a Btrfs + GlusterFS, if it were
possible for Btrfs to inform Gluster FS of "missing/corrupt" files,
and then for Btrfs to drop reference for those files, instead of
either rebuilding or remaining degraded. And then let GlusterFS deal
with replication of those files to maintain redundancy. i.e. the Btrfs
volumes would be single profile for data, and raid1 for metadata. When
there's n-way raid1, each drive can have a copy of the file system,
and it'd tolerate in effect n-1 drive failures and the file system
could at least still inform Gluster (or Ceph) of the missing data, the
file system still remains valid, only briefly degraded, and can still
be expanded when new drives become available..."
That in my n00b opinion would be brilliant in a real world use case.
On Wed, Apr 27, 2016 at 5:22 AM, Austin S. Hemmelgarn
<ahferroin7@gmail.com> wrote:
> On 2016-04-26 20:58, Chris Murphy wrote:
>>
>> On Tue, Apr 26, 2016 at 5:44 AM, Juan Alberto Cirez
>> <jacirez@rdcsafety.com> wrote:
>>>
>>>
>>> With GlusterFS as a distributed volume, the files are already spread
>>> among the servers causing file I/O to be spread fairly evenly among
>>> them as well, thus probably providing the benefit one might expect
>>> with stripe (RAID10).
>>
>>
>> Yes, the raid1 of Btrfs is just so you don't have to rebuild volumes
>> if you lose a drive. But since raid1 is not n-way copies, and only
>> means two copies, you don't really want the file systems getting that
>> big or you increase the chances of a double failure.
>>
>> I've always thought it'd be neat in a Btrfs + GlusterFS, if it were
>> possible for Btrfs to inform Gluster FS of "missing/corrupt" files,
>> and then for Btrfs to drop reference for those files, instead of
>> either rebuilding or remaining degraded. And then let GlusterFS deal
>> with replication of those files to maintain redundancy. i.e. the Btrfs
>> volumes would be single profile for data, and raid1 for metadata. When
>> there's n-way raid1, each drive can have a copy of the file system,
>> and it'd tolerate in effect n-1 drive failures and the file system
>> could at least still inform Gluster (or Ceph) of the missing data, the
>> file system still remains valid, only briefly degraded, and can still
>> be expanded when new drives become available.
>
> FWIW, I _think_ this can be done with the scrubbing code in GlusterFS. It's
> designed to repair data mismatches, but I'm not sure how it handles missing
> copies of data. However, in the current state, there's no way without
> external scripts to handle re-shaping of the storage bricks if part of them
> fails.
* Re: Add device while rebalancing
2016-04-27 15:58 ` Juan Alberto Cirez
@ 2016-04-27 16:29 ` Holger Hoffstätte
2016-04-27 16:38 ` Juan Alberto Cirez
0 siblings, 1 reply; 21+ messages in thread
From: Holger Hoffstätte @ 2016-04-27 16:29 UTC (permalink / raw)
To: linux-btrfs
On 04/27/16 17:58, Juan Alberto Cirez wrote:
> Correct me if I'm wrong, but the sum total of the above seems to
> suggest (at first glance) that BTRFS adds several layers of complexity
> for little real benefit (at least in the use case of btrfs at the
> brick layer with a distributed filesystem on top)...
This may come as a surprise, but the same can be said for every other
(common) filesystem (+ device management stack) that can be used
standalone.
Jeff Darcy (of GlusterFS) just wrote a really nice blog post why
current filesystems and their historically grown requirements (mostly
as they relate to the POSIX interface standard) are in many ways
just not a good fit for scale-out/redundant storage:
http://pl.atyp.us/2016-05-updating-posix.html
Quite a few of the capabilities & features which are useful or
necessary in standalone operation (regardless of single- or multi-
device setup) are *actively unhelpful* in a distributed context, which
is why e.g. Ceph will soon do away with the on-disk filesystem for
data, and manage metadata exclusively by itself.
cheers,
Holger
* Re: Add device while rebalancing
2016-04-27 16:29 ` Holger Hoffstätte
@ 2016-04-27 16:38 ` Juan Alberto Cirez
2016-04-27 16:40 ` Juan Alberto Cirez
0 siblings, 1 reply; 21+ messages in thread
From: Juan Alberto Cirez @ 2016-04-27 16:38 UTC (permalink / raw)
To: Holger Hoffstätte; +Cc: linux-btrfs
Holger,
If this is so, then it leaves me even more confused. I was under the
impression that the driving imperative for the creation of btrfs was
to address the shortcomings of current filesystems within the context
of distributed data. That the idea was to create a low-level
filesystem that would be the primary choice as a block/brick layer for
scale-out, distributed data storage...
On Wed, Apr 27, 2016 at 10:29 AM, Holger Hoffstätte
<holger.hoffstaette@googlemail.com> wrote:
> On 04/27/16 17:58, Juan Alberto Cirez wrote:
>> Correct me if I'm wrong, but the sum total of the above seems to
>> suggest (at first glance) that BTRFS adds several layers of complexity
>> for little real benefit (at least in the use case of btrfs at the
>> brick layer with a distributed filesystem on top)...
>
> This may come as a surprise, but the same can be said for every other
> (common) filesystem (+ device management stack) that can be used
> standalone.
>
> Jeff Darcy (of GlusterFS) just wrote a really nice blog post why
> current filesystems and their historically grown requirements (mostly
> as they relate to the POSIX interface standard) are in many ways
> just not a good fit for scale-out/redundant storage:
> http://pl.atyp.us/2016-05-updating-posix.html
>
> Quite a few of the capabilities & features which are useful or
> necessary in standalone operation (regardless of single- or multi-
> device setup) are *actively unhelpful* in a distributed context, which
> is why e.g. Ceph will soon do away with the on-disk filesystem for
> data, and manage metadata exclusively by itself.
>
> cheers,
> Holger
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
* Re: Add device while rebalancing
2016-04-27 16:38 ` Juan Alberto Cirez
@ 2016-04-27 16:40 ` Juan Alberto Cirez
2016-04-27 17:23 ` Holger Hoffstätte
0 siblings, 1 reply; 21+ messages in thread
From: Juan Alberto Cirez @ 2016-04-27 16:40 UTC (permalink / raw)
To: Holger Hoffstätte; +Cc: linux-btrfs
Holger,
If this is so, then it leaves me even more confused. I was under the
impression that the driving imperative for the creation of btrfs was
to address the shortcomings of current filesystems (within the context
of distributed data). That the idea was to create a low-level
filesystem that would be the primary choice as a block/brick layer for
scale-out, distributed data storage...
On Wed, Apr 27, 2016 at 10:38 AM, Juan Alberto Cirez
<jacirez@rdcsafety.com> wrote:
> Holger,
> If this is so, then it leaves me even more confused. I was under the
> impression that the driving imperative for the creation of btrfs was
> to address the shortcomings of current filesystems within the context
> of distributed data. That the idea was to create a low-level
> filesystem that would be the primary choice as a block/brick layer for
> scale-out, distributed data storage...
>
> On Wed, Apr 27, 2016 at 10:29 AM, Holger Hoffstätte
> <holger.hoffstaette@googlemail.com> wrote:
>> On 04/27/16 17:58, Juan Alberto Cirez wrote:
>>> Correct me if I'm wrong, but the sum total of the above seems to
>>> suggest (at first glance) that BTRFS adds several layers of complexity
>>> for little real benefit (at least in the use case of btrfs at the
>>> brick layer with a distributed filesystem on top)...
>>
>> This may come as a surprise, but the same can be said for every other
>> (common) filesystem (+ device management stack) that can be used
>> standalone.
>>
>> Jeff Darcy (of GlusterFS) just wrote a really nice blog post why
>> current filesystems and their historically grown requirements (mostly
>> as they relate to the POSIX interface standard) are in many ways
>> just not a good fit for scale-out/redundant storage:
>> http://pl.atyp.us/2016-05-updating-posix.html
>>
>> Quite a few of the capabilities & features which are useful or
>> necessary in standalone operation (regardless of single- or multi-
>> device setup) are *actively unhelpful* in a distributed context, which
>> is why e.g. Ceph will soon do away with the on-disk filesystem for
>> data, and manage metadata exclusively by itself.
>>
>> cheers,
>> Holger
>>
* Re: Add device while rebalancing
2016-04-27 16:40 ` Juan Alberto Cirez
@ 2016-04-27 17:23 ` Holger Hoffstätte
0 siblings, 0 replies; 21+ messages in thread
From: Holger Hoffstätte @ 2016-04-27 17:23 UTC (permalink / raw)
To: Juan Alberto Cirez; +Cc: linux-btrfs
On 04/27/16 18:40, Juan Alberto Cirez wrote:
> If this is so, then it leaves me even more confused. I was under the
> impression that the driving imperative for the creation of btrfs was
> to address the shortcomings of current filesystems (within the context
> of distributed data). That the idea was to create a low level
> filesystem that would be the primary choice as a block/brick layer for a
> scale-out, distributed data storage...
I can't speak for who was or is motivated by what. Btrfs was a necessary
reaction to ZFS, and AFAIK this had nothing to do with distributed storage
but rather growing concerns around reliability (checksumming), scalability
and operational ease: snapshotting, growing/shrinking etc.
It's true that some of btrfs' capabilities make it look like a good
candidate, and e.g. Ceph started out using it. For many reasons that
didn't work out (AFAIK btrfs maturity + extensibility) - but it also
did not address a fundamental mismatch in requirements, which other
filesystems (ext4, xfs) could not address either. btrfs simply
does "too much" because it has to; you cannot remove or turn off half
of what makes a kernel-based filesystem a usable filesystem. This is
kind of sad because at its core btrfs *is* an object store with
various trees for metadata handling and whatnot - but there's no
easy way to turn off all the "Unix is stupid" stuff.
AFAIK Gluster will soon also start managing xattrs differently,
so this is not limited to Ceph.
I've been following this saga for several years now and it's
absolutely *astounding* how many bugs and performance problems
Ceph has unearthed in existing filesystems, simply because it
stresses them in ways they never have been stressed before... only to
create the illusion of a distributed key/value store, badly.
I don't want to argue about details, you can read more about some
of the reasons in [1].
[grumble grumble exokernels and composable things in userland grumble]
cheers
Holger
[1] http://www.slideshare.net/sageweil1/ceph-and-rocksdb
* Re: Add device while rebalancing
2016-04-27 11:22 ` Austin S. Hemmelgarn
2016-04-27 15:58 ` Juan Alberto Cirez
@ 2016-04-27 23:19 ` Chris Murphy
2016-04-28 11:21 ` Austin S. Hemmelgarn
1 sibling, 1 reply; 21+ messages in thread
From: Chris Murphy @ 2016-04-27 23:19 UTC (permalink / raw)
To: Austin S. Hemmelgarn; +Cc: Chris Murphy, Juan Alberto Cirez, linux-btrfs
On Wed, Apr 27, 2016 at 5:22 AM, Austin S. Hemmelgarn
<ahferroin7@gmail.com> wrote:
> On 2016-04-26 20:58, Chris Murphy wrote:
>>
>> On Tue, Apr 26, 2016 at 5:44 AM, Juan Alberto Cirez
>> <jacirez@rdcsafety.com> wrote:
>>>
>>>
>>> With GlusterFS as a distributed volume, the files are already spread
>>> among the servers causing file I/O to be spread fairly evenly among
>>> them as well, thus probably providing the benefit one might expect
>>> with stripe (RAID10).
>>
>>
>> Yes, the raid1 of Btrfs is just so you don't have to rebuild volumes
>> if you lose a drive. But since raid1 is not n-way copies, and only
>> means two copies, you don't really want the file systems getting that
>> big or you increase the chances of a double failure.
>>
>> I've always thought it'd be neat in a Btrfs + GlusterFS, if it were
>> possible for Btrfs to inform Gluster FS of "missing/corrupt" files,
>> and then for Btrfs to drop reference for those files, instead of
>> either rebuilding or remaining degraded. And then let GlusterFS deal
>> with replication of those files to maintain redundancy. i.e. the Btrfs
>> volumes would be single profile for data, and raid1 for metadata. When
>> there's n-way raid1, each drive can have a copy of the file system,
>> and it'd tolerate in effect n-1 drive failures and the file system
>> could at least still inform Gluster (or Ceph) of the missing data, the
>> file system still remains valid, only briefly degraded, and can still
>> be expanded when new drives become available.
>
> FWIW, I _think_ this can be done with the scrubbing code in GlusterFS. It's
> designed to repair data mismatches, but I'm not sure how it handles missing
> copies of data. However, in the current state, there's no way without
> external scripts to handle re-shaping of the storage bricks if part of them
> fails.
Yeah, I haven't tried doing a scrub, parsing dmesg for busted file
paths, and feeding those paths into rm to see what happens. Will they
get deleted without additional errors? If so, good; then a second scrub
should come back clean. And then btrfs device delete missing to get rid
of the broken device *and* cause missing metadata to be replicated
again, and now in theory the fs should be back to normal. But it'd have
to be tested with a umount followed by mount to see if -o degraded is
still required.
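For reference, the "parse dmesg for busted file paths" step above could be
sketched roughly as follows. Note the dmesg line format shown here is an
assumption (the exact wording of btrfs checksum-error messages varies across
kernel versions), so the regex would need adjusting against real output
before feeding anything to rm:

```python
import re

# Hypothetical dmesg excerpt; real btrfs checksum-error lines generally
# end with "(path: <file>)", but the surrounding text varies by kernel.
dmesg = """\
BTRFS warning (device sdb1): checksum error at logical 12845056 on dev /dev/sdb1, physical 12845056, root 5, inode 257, offset 0, length 4096, links 1 (path: photos/img001.jpg)
BTRFS error (device sdb1): unable to fixup (regular) error at logical 12845056 on dev /dev/sdb1
BTRFS warning (device sdb1): checksum error at logical 12849152 on dev /dev/sdb1, physical 12849152, root 5, inode 258, offset 0, length 4096, links 1 (path: photos/img002.jpg)
"""

PATH_RE = re.compile(r"\(path: (.+)\)\s*$")

def busted_paths(log: str) -> list[str]:
    """Collect the unique file paths named in btrfs checksum-error lines."""
    seen = []
    for line in log.splitlines():
        m = PATH_RE.search(line)
        if m and m.group(1) not in seen:
            seen.append(m.group(1))
    return seen

# Paths are relative to the mounted filesystem; these are the candidates
# one would then remove before re-running the scrub.
print(busted_paths(dmesg))
```

The removal itself, and the subsequent `btrfs device delete missing` plus the
umount/mount test for -o degraded, would still have to be exercised on a
throwaway filesystem as described above.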
--
Chris Murphy
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Add device while rebalancing
2016-04-27 23:19 ` Chris Murphy
@ 2016-04-28 11:21 ` Austin S. Hemmelgarn
0 siblings, 0 replies; 21+ messages in thread
From: Austin S. Hemmelgarn @ 2016-04-28 11:21 UTC (permalink / raw)
To: Chris Murphy; +Cc: Juan Alberto Cirez, linux-btrfs
On 2016-04-27 19:19, Chris Murphy wrote:
> On Wed, Apr 27, 2016 at 5:22 AM, Austin S. Hemmelgarn
> <ahferroin7@gmail.com> wrote:
>> On 2016-04-26 20:58, Chris Murphy wrote:
>>>
>>> On Tue, Apr 26, 2016 at 5:44 AM, Juan Alberto Cirez
>>> <jacirez@rdcsafety.com> wrote:
>>>>
>>>>
>>>> With GlusterFS as a distributed volume, the files are already spread
>>>> among the servers causing file I/O to be spread fairly evenly among
>>>> them as well, thus probably providing the benefit one might expect
>>>> with stripe (RAID10).
>>>
>>>
>>> Yes, the raid1 of Btrfs is just so you don't have to rebuild volumes
>>> if you lose a drive. But since raid1 is not n-way copies, and only
>>> means two copies, you don't really want the file systems getting that
>>> big or you increase the chances of a double failure.
>>>
>>> I've always thought it'd be neat in a Btrfs + GlusterFS, if it were
>>> possible for Btrfs to inform GlusterFS of "missing/corrupt" files,
>>> and then for Btrfs to drop reference for those files, instead of
>>> either rebuilding or remaining degraded. And then let GlusterFS deal
>>> with replication of those files to maintain redundancy. i.e. the Btrfs
>>> volumes would be single profile for data, and raid1 for metadata. When
>>> there's n-way raid1, each drive can have a copy of the file system,
>>> and it'd tolerate in effect n-1 drive failures and the file system
>>> could at least still inform Gluster (or Ceph) of the missing data, the
>>> file system still remains valid, only briefly degraded, and can still
>>> be expanded when new drives become available.
>>
>> FWIW, I _think_ this can be done with the scrubbing code in GlusterFS. It's
>> designed to repair data mismatches, but I'm not sure how it handles missing
>> copies of data. However, in the current state, there's no way without
>> external scripts to handle re-shaping of the storage bricks if part of them
>> fails.
>
> Yeah, I haven't tried doing a scrub, parsing dmesg for busted file
> paths, and feeding those paths into rm to see what happens. Will they
> get deleted without additional errors? If so, good; then a second scrub
> should come back clean. And then btrfs device delete missing to get rid
> of the broken device *and* cause missing metadata to be replicated
> again, and now in theory the fs should be back to normal. But it'd have
> to be tested with a umount followed by mount to see if -o degraded is
> still required.
>
I'm not entirely certain, although I had been planning on adding a test
to check this to my usual testing before the system I use for it went
offline, I just haven't had the time to get it working again. If I find
the time in the near future, I may just test it on my laptop in a VM.
^ permalink raw reply [flat|nested] 21+ messages in thread
end of thread, other threads:[~2016-04-28 11:21 UTC | newest]
Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-04-22 20:36 Add device while rebalancing Juan Alberto Cirez
2016-04-23 5:38 ` Duncan
2016-04-25 11:18 ` Austin S. Hemmelgarn
2016-04-25 12:43 ` Duncan
2016-04-25 13:02 ` Austin S. Hemmelgarn
2016-04-26 10:50 ` Juan Alberto Cirez
2016-04-26 11:11 ` Austin S. Hemmelgarn
2016-04-26 11:44 ` Juan Alberto Cirez
2016-04-26 12:04 ` Austin S. Hemmelgarn
2016-04-26 12:14 ` Juan Alberto Cirez
2016-04-26 12:44 ` Austin S. Hemmelgarn
2016-04-27 0:58 ` Chris Murphy
2016-04-27 10:37 ` Duncan
2016-04-27 11:22 ` Austin S. Hemmelgarn
2016-04-27 15:58 ` Juan Alberto Cirez
2016-04-27 16:29 ` Holger Hoffstätte
2016-04-27 16:38 ` Juan Alberto Cirez
2016-04-27 16:40 ` Juan Alberto Cirez
2016-04-27 17:23 ` Holger Hoffstätte
2016-04-27 23:19 ` Chris Murphy
2016-04-28 11:21 ` Austin S. Hemmelgarn