* Add device while rebalancing @ 2016-04-22 20:36 Juan Alberto Cirez 2016-04-23 5:38 ` Duncan 0 siblings, 1 reply; 21+ messages in thread From: Juan Alberto Cirez @ 2016-04-22 20:36 UTC (permalink / raw) To: linux-btrfs Good morning, I am new to this list and to btrfs in general. I have a quick question: Can I add a new device to the pool while the btrfs filesystem balance command is running on the drive pool? Thanks ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Add device while rebalancing 2016-04-22 20:36 Add device while rebalancing Juan Alberto Cirez @ 2016-04-23 5:38 ` Duncan 2016-04-25 11:18 ` Austin S. Hemmelgarn 0 siblings, 1 reply; 21+ messages in thread From: Duncan @ 2016-04-23 5:38 UTC (permalink / raw) To: linux-btrfs Juan Alberto Cirez posted on Fri, 22 Apr 2016 14:36:44 -0600 as excerpted: > Good morning, > I am new to this list and to btrfs in general. I have a quick question: > Can I add a new device to the pool while the btrfs filesystem balance > command is running on the drive pool? Adding a device while balancing shouldn't be a problem.  However, depending on your redundancy mode, you may wish to cancel the balance and start a new one after the device add, so the balance will take account of the new device as well and balance it into the mix. Note that while device add doesn't do more than that on its own, device delete/remove effectively initiates its own balance, moving the chunks on the device being removed to the other devices.  So you wouldn't want to be running a balance and then do a device remove at the same time. Similarly with btrfs replace, although in that case it's more directly moving data from the device being replaced (if it's still there, or using redundancy or parity to recover it if not) to the replacement device, a more limited and often faster operation.  But you probably still don't want to do a balance at the same time, as it places unnecessary stress on both the filesystem and the hardware.  Even if the filesystem and devices handle the stress fine, both operations will take longer, since they're both intensive operations that will interfere with each other to some extent. Similarly with btrfs scrub.
The operations are different enough that they shouldn't really interfere with each other logically, but they're both hardware-intensive operations that will put unnecessary stress on the system if you're doing more than one at a time, and will result in both going slower than they normally would. And again with snapshotting operations.  Making a snapshot is normally nearly instantaneous, but there's a scaling issue if you have too many per filesystem (try to keep it under 2000 snapshots per filesystem total, if possible, and definitely keep it under 10K or some operations will slow down substantially), and deleting snapshots is more work.  So while you should ordinarily thin down snapshots automatically if you're making them quite frequently (say daily or more often), you may want to put the snapshot deletion, at least, on hold while you scrub or balance or device delete or replace. Meanwhile, you mentioned being new to btrfs.  If you haven't discovered the wiki yet, please spend some time reading the user documentation there, as it's likely to clear up a lot of questions you may have, and you'll better understand how to work effectively with the filesystem when you're done.  It's well worth the time invested! =:^) https://btrfs.wiki.kernel.org -- Duncan - List replies preferred.   No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master."  Richard Stallman ^ permalink raw reply	[flat|nested] 21+ messages in thread
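The "cancel, add, re-balance" sequence Duncan recommends can be sketched as a handful of commands.  This is illustrative only: the mount point /mnt/pool and the device /dev/sdf are invented names, and exact behavior varies a little between btrfs-progs versions.

```shell
# Sketch of "cancel, add, re-balance"; /mnt/pool and /dev/sdf are
# placeholder names for your mount point and newly attached disk.
btrfs balance cancel /mnt/pool    # returns once the in-progress chunk finishes
btrfs device add /dev/sdf /mnt/pool
btrfs balance start /mnt/pool     # fresh balance now includes the new device
btrfs balance status /mnt/pool    # check progress from another terminal
```

With a redundant profile such as raid1 or raid10, it is the fresh balance that actually spreads existing chunks onto the new device; the device add on its own only makes the space available.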
* Re: Add device while rebalancing 2016-04-23 5:38 ` Duncan @ 2016-04-25 11:18 ` Austin S. Hemmelgarn 2016-04-25 12:43 ` Duncan 0 siblings, 1 reply; 21+ messages in thread From: Austin S. Hemmelgarn @ 2016-04-25 11:18 UTC (permalink / raw) To: linux-btrfs On 2016-04-23 01:38, Duncan wrote: > Juan Alberto Cirez posted on Fri, 22 Apr 2016 14:36:44 -0600 as excerpted: > >> Good morning, >> I am new to this list and to btrfs in general. I have a quick question: >> Can I add a new device to the pool while the btrfs filesystem balance >> command is running on the drive pool? > > Adding a device while balancing shouldn't be a problem. However, > depending on your redundancy mode, you may wish to cancel the balance and > start a new one after the device add, so the balance will take account of > it as well and balance it into the mix. I'm not 100% certain about how balance will handle this, except that nothing should break. I believe that it picks a device each time it goes to move a chunk, so it should evaluate any chunks operated on after the addition of the device for possible placement on that device (and it will probably end up putting a lot of them there because that device will almost certainly be less full than any of the others). That said, you probably do want to cancel the balance, add the device, and re-run the balance so that things end up more evenly distributed. > > Note that while device add doesn't do more than that on its own, device > delete/remove effectively initiates its own balance, moving the chunks on > the device being removed to the other devices. So you wouldn't want to > be running a balance and then do a device remove at the same time. IIRC, trying to delete a device while running a balance will fail, and return an error, because only one balance can be running at a given moment. 
> > Similarly with btrfs replace, although in that case, it's more directly > moving data from the device being replaced (if it's still there, or using > redundancy or parity to recover it if not) to the replacement device, a > more limited and often faster operation.  But you probably still don't > want to do a balance at the same time as it places unnecessary stress on > both the filesystem and the hardware, and even if the filesystem and > devices handle the stress fine, the result is going to be that both > operations take longer as they're both intensive operations that will > interfere with each other to some extent. Agreed, this is generally not a good idea because of the stress it puts on the devices (and because it probably isn't well tested). > > Similarly with btrfs scrub.  The operations are logically different > enough that they shouldn't really interfere with each other logically, > but they're both hardware intensive operations that will put unnecessary > stress on the system if you're doing more than one at a time, and will > result in both going slower than they normally would. Actually, depending on a number of factors, scrubbing while balancing can finish faster than running one then the other in sequence.  It's really dependent on how both decide to pick chunks, and how your underlying devices handle read and write caching, but it can happen.  Most of the time though, it should take around the same amount of time as running one then the other, or a little bit longer if you're on traditional disks. > > And again with snapshotting operations. 
Making a snapshot is normally > nearly instantaneous, but there's a scaling issue if you have too many > per filesystem (try to keep it under 2000 snapshots per filesystem total, > if possible, and definitely keep it under 10K or some operations will > slow down substantially), and deleting snapshots is more work, so while > you should ordinarily automatically thin down snapshots if you're > automatically making them quite frequently (say daily or more > frequently), you may want to put the snapshot deletion, at least, on hold > while you scrub or balance or device delete or replace. I would actually recommend putting all snapshot operations on hold, as well as most writes to the filesystem, while doing a balance or device deletion. The more writes you have while doing those, the longer they take, and the less likely that you end up with a good on-disk layout of the data. ^ permalink raw reply [flat|nested] 21+ messages in thread
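As a concrete illustration of the snapshot-thinning both posters mention, the bookkeeping can be as small as a sort over date-stamped snapshot names.  The names below are invented; in a real setup the list would come from `btrfs subvolume list -s` and the printed candidates would be fed to `btrfs subvolume delete` once no balance, scrub, or device operation is running.

```shell
# Keep the newest $keep snapshots; print the older ones as deletion
# candidates.  Date-stamped names sort chronologically, so sort(1) suffices.
snaps="snap-2016-04-19
snap-2016-04-20
snap-2016-04-21
snap-2016-04-22
snap-2016-04-23"
keep=3
echo "$snaps" | sort | head -n -"$keep"   # prints the two oldest names
```

`head -n -N` (GNU coreutils) drops the last N lines; a real rotation would typically also keep a sparser weekly/monthly tail rather than a flat newest-N cut.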
* Re: Add device while rebalancing 2016-04-25 11:18 ` Austin S. Hemmelgarn @ 2016-04-25 12:43 ` Duncan 2016-04-25 13:02 ` Austin S. Hemmelgarn 0 siblings, 1 reply; 21+ messages in thread From: Duncan @ 2016-04-25 12:43 UTC (permalink / raw) To: linux-btrfs Austin S. Hemmelgarn posted on Mon, 25 Apr 2016 07:18:10 -0400 as excerpted: > On 2016-04-23 01:38, Duncan wrote: >> >> And again with snapshotting operations.  Making a snapshot is normally >> nearly instantaneous, but there's a scaling issue if you have too many >> per filesystem (try to keep it under 2000 snapshots per filesystem >> total, if possible, and definitely keep it under 10K or some operations >> will slow down substantially), and deleting snapshots is more work, so >> while you should ordinarily automatically thin down snapshots if you're >> automatically making them quite frequently (say daily or more >> frequently), you may want to put the snapshot deletion, at least, on >> hold while you scrub or balance or device delete or replace. > I would actually recommend putting all snapshot operations on hold, as > well as most writes to the filesystem, while doing a balance or device > deletion.  The more writes you have while doing those, the longer they > take, and the less likely that you end up with a good on-disk layout of > the data. The thing with snapshot writing is that all snapshot creation effectively does is a bit of metadata writing.  What snapshots primarily do is lock existing extents in place (down within their chunk, with the higher chunk level being the scope at which balance works), extents that would otherwise be COWed elsewhere with the existing extent deleted on change, or simply deleted on file delete.  A snapshot simply adds a reference to the current version, so that deletion, either directly or from the COW, never happens, and to do that simply requires a relatively small metadata write. 
So while I agree in general that more writes means balances taking longer, snapshot creation writes are pretty tiny in the scheme of things, and won't affect the balance much, compared to larger writes you'll very possibly still be doing unless you really do suspend pretty much all write operations to that filesystem during the balance. But as I said, snapshot deletions are an entirely different story, as then all those previously locked in place extents are potentially freed, and the filesystem must do a lot of work to figure out which ones it can actually free and free them, vs. ones that still have other references which therefore cannot yet be freed. -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Add device while rebalancing 2016-04-25 12:43 ` Duncan @ 2016-04-25 13:02 ` Austin S. Hemmelgarn 2016-04-26 10:50 ` Juan Alberto Cirez 0 siblings, 1 reply; 21+ messages in thread From: Austin S. Hemmelgarn @ 2016-04-25 13:02 UTC (permalink / raw) To: linux-btrfs On 2016-04-25 08:43, Duncan wrote: > Austin S. Hemmelgarn posted on Mon, 25 Apr 2016 07:18:10 -0400 as > excerpted: > >> On 2016-04-23 01:38, Duncan wrote: >>> >>> And again with snapshotting operations.  Making a snapshot is normally >>> nearly instantaneous, but there's a scaling issue if you have too many >>> per filesystem (try to keep it under 2000 snapshots per filesystem >>> total, if possible, and definitely keep it under 10K or some operations >>> will slow down substantially), and deleting snapshots is more work, so >>> while you should ordinarily automatically thin down snapshots if you're >>> automatically making them quite frequently (say daily or more >>> frequently), you may want to put the snapshot deletion, at least, on >>> hold while you scrub or balance or device delete or replace. > >> I would actually recommend putting all snapshot operations on hold, as >> well as most writes to the filesystem, while doing a balance or device >> deletion.  The more writes you have while doing those, the longer they >> take, and the less likely that you end up with a good on-disk layout of >> the data. > > The thing with snapshot writing is that all snapshot creation effectively > does is a bit of metadata writing.  What snapshots primarily do is lock > existing extents in place (down within their chunk, with the higher chunk > level being the scope at which balance works), that would otherwise be > COWed elsewhere with the existing extent deleted on change, or simply > deleted on file delete.  A snapshot simply adds a reference to the > current version, so that deletion, either directly or from the COW, never > happens, and to do that simply requires a relatively small metadata write. 
Unless I'm mistaken about the internals of BTRFS (which might be the case), creating a snapshot has to update reference counts on every single extent in every single file in the snapshot. For something small this isn't much, but if you are snapshotting something big (say, snapshotting an entire system with all the data in one subvolume), it can amount to multiple MB of writes, and it gets even worse if you have no shared extents to begin with (which is still pretty typical). On some of the systems I work with at work, snapshotting a terabyte of data can end up resulting in 10-20 MB of writes to disk (in this case, that figure came from a partition containing mostly small files that were just big enough that they didn't fit in-line in the metadata blocks). This is of course still significantly faster than copying everything, but it's not free either. > > So while I agree in general that more writes means balances taking > longer, snapshot creation writes are pretty tiny in the scheme of things, > and won't affect the balance much, compared to larger writes you'll very > possibly still be doing unless you really do suspend pretty much all > write operations to that filesystem during the balance. In general, yes, except that there's the case of running with mostly full metadata chunks, where it might result in a further chunk allocation, which in turn can throw off the balanced layout. Balance always allocates new chunks, and doesn't write into existing ones, so if you're writing enough to allocate a new chunk while a balance is happening: 1. That chunk may or may not get considered by the balance code (I'm not 100% certain about this, but I believe it will be ignored by any balance running at the time it gets allocated). 2. You run the risk of ending up with a chunk with almost nothing in it which could be packed into another existing chunk. 
Snapshots are not likely to trigger this, but it is still possible, especially if you're taking lots of snapshots in a short period of time. > > But as I said, snapshot deletions are an entirely different story, as > then all those previously locked in place extents are potentially freed, > and the filesystem must do a lot of work to figure out which ones it can > actually free and free them, vs. ones that still have other references > which therefore cannot yet be freed. Most of the issue here with balance is that you end up potentially doing an amount of unnecessary work which is unquantifiable before it's done. ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Add device while rebalancing 2016-04-25 13:02 ` Austin S. Hemmelgarn @ 2016-04-26 10:50 ` Juan Alberto Cirez 2016-04-26 11:11 ` Austin S. Hemmelgarn 0 siblings, 1 reply; 21+ messages in thread From: Juan Alberto Cirez @ 2016-04-26 10:50 UTC (permalink / raw) To: Austin S. Hemmelgarn; +Cc: linux-btrfs Thank you guys so very kindly for all your help and taking the time to answer my question. I have been reading the wiki and online use cases and otherwise delving deeper into the btrfs architecture. I am managing a 520TB storage pool spread across 16 server pods and have tried several methods of distributed storage. The last attempt used ZFS as a base for the physical bricks and GlusterFS as a glue to string together the storage pool. I was not satisfied with the results (mainly ZFS). Once I have run btrfs for a while on the test server (32TB, 8x 4TB HDD RAID10) I will try btrfs/ceph. On Mon, Apr 25, 2016 at 7:02 AM, Austin S. Hemmelgarn <ahferroin7@gmail.com> wrote: > On 2016-04-25 08:43, Duncan wrote: >> >> Austin S. Hemmelgarn posted on Mon, 25 Apr 2016 07:18:10 -0400 as >> excerpted: >> >>> On 2016-04-23 01:38, Duncan wrote: >>>> >>>> >>>> And again with snapshotting operations.  Making a snapshot is normally >>>> nearly instantaneous, but there's a scaling issue if you have too many >>>> per filesystem (try to keep it under 2000 snapshots per filesystem >>>> total, if possible, and definitely keep it under 10K or some operations >>>> will slow down substantially), and deleting snapshots is more work, so >>>> while you should ordinarily automatically thin down snapshots if you're >>>> automatically making them quite frequently (say daily or more >>>> frequently), you may want to put the snapshot deletion, at least, on >>>> hold while you scrub or balance or device delete or replace. 
>> >> >>> I would actually recommend putting all snapshot operations on hold, as >>> well as most writes to the filesystem, while doing a balance or device >>> deletion.  The more writes you have while doing those, the longer they >>> take, and the less likely that you end up with a good on-disk layout of >>> the data. >> >> >> The thing with snapshot writing is that all snapshot creation effectively >> does is a bit of metadata writing.  What snapshots primarily do is lock >> existing extents in place (down within their chunk, with the higher chunk >> level being the scope at which balance works), that would otherwise be >> COWed elsewhere with the existing extent deleted on change, or simply >> deleted on file delete.  A snapshot simply adds a reference to the >> current version, so that deletion, either directly or from the COW, never >> happens, and to do that simply requires a relatively small metadata write. > Unless I'm mistaken about the internals of BTRFS (which might be the case), > creating a snapshot has to update reference counts on every single extent in > every single file in the snapshot.  For something small this isn't much, but > if you are snapshotting something big (say, snapshotting an entire system > with all the data in one subvolume), it can amount to multiple MB of writes, > and it gets even worse if you have no shared extents to begin with (which is > still pretty typical).  On some of the systems I work with at work, > snapshotting a terabyte of data can end up resulting in 10-20 MB of writes > to disk (in this case, that figure came from a partition containing mostly > small files that were just big enough that they didn't fit in-line in the > metadata blocks). > > This is of course still significantly faster than copying everything, but > it's not free either. 
>> >> >> So while I agree in general that more writes means balances taking >> longer, snapshot creation writes are pretty tiny in the scheme of things, >> and won't affect the balance much, compared to larger writes you'll very >> possibly still be doing unless you really do suspend pretty much all >> write operations to that filesystem during the balance. > In general, yes, except that there's the case of running with mostly full > metadata chunks, where it might result in a further chunk allocation, which > in turn can throw off the balanced layout.  Balance always allocates new > chunks, and doesn't write into existing ones, so if you're writing enough to > allocate a new chunk while a balance is happening: > 1. That chunk may or may not get considered by the balance code (I'm not > 100% certain about this, but I believe it will be ignored by any balance > running at the time it gets allocated). > 2. You run the risk of ending up with a chunk with almost nothing in it > which could be packed into another existing chunk. > Snapshots are not likely to trigger this, but it is still possible, > especially if you're taking lots of snapshots in a short period of time. >> >> >> But as I said, snapshot deletions are an entirely different story, as >> then all those previously locked in place extents are potentially freed, >> and the filesystem must do a lot of work to figure out which ones it can >> actually free and free them, vs. ones that still have other references >> which therefore cannot yet be freed. > Most of the issue here with balance is that you end up potentially doing an > amount of unnecessary work which is unquantifiable before it's done. ^ permalink raw reply	[flat|nested] 21+ messages in thread
* Re: Add device while rebalancing 2016-04-26 10:50 ` Juan Alberto Cirez @ 2016-04-26 11:11 ` Austin S. Hemmelgarn 2016-04-26 11:44 ` Juan Alberto Cirez 0 siblings, 1 reply; 21+ messages in thread From: Austin S. Hemmelgarn @ 2016-04-26 11:11 UTC (permalink / raw) To: Juan Alberto Cirez; +Cc: linux-btrfs On 2016-04-26 06:50, Juan Alberto Cirez wrote: > Thank you guys so very kindly for all your help and taking the time to > answer my question. I have been reading the wiki and online use cases > and otherwise delving deeper into the btrfs architecture. > > I am managing a 520TB storage pool spread across 16 server pods and > have tried several methods of distributed storage. Last attempt was > using Zfs as a base for the physical bricks and GlusterFS as a glue to > string together the storage pool. I was not satisfied with the results > (mainly Zfs). Once I have run btrfs for a while on the test server > (32TB, 8x 4TB HDD RAID10) for a while I will try btrfs/ceph For what it's worth, GlusterFS works great on top of BTRFS. I don't have any claims to usage in production, but I've done _a lot_ of testing with it because we're replacing one of our critical file servers at work with a couple of systems set up with Gluster on top of BTRFS, and I've been looking at setting up a small storage cluster at home using it on a couple of laptops I have which have non-functional displays. Based on what I've seen, it appears to be rock solid with respect to the common failure modes, provided you use something like raid1 mode on the BTRFS side of things. ^ permalink raw reply [flat|nested] 21+ messages in thread
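For reference, the Gluster side of the setup Austin describes (Gluster replication on top of per-node btrfs raid1 bricks) is only a couple of commands.  A hypothetical two-node sketch — server1, server2, and the brick path are invented names, and each brick is assumed to already be a mounted btrfs raid1 filesystem:

```shell
# Hypothetical replica-2 volume across two btrfs-backed bricks.
gluster volume create gv0 replica 2 \
    server1:/bricks/gv0 server2:/bricks/gv0
gluster volume start gv0
gluster volume info gv0    # confirm the type is Replicate and bricks are listed
```

With this layout btrfs raid1 handles disk failure within a node, while the replica-2 volume handles loss of a whole node.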
* Re: Add device while rebalancing 2016-04-26 11:11 ` Austin S. Hemmelgarn @ 2016-04-26 11:44 ` Juan Alberto Cirez 2016-04-26 12:04 ` Austin S. Hemmelgarn 2016-04-27  0:58 ` Chris Murphy 0 siblings, 2 replies; 21+ messages in thread From: Juan Alberto Cirez @ 2016-04-26 11:44 UTC (permalink / raw) To: Austin S. Hemmelgarn; +Cc: linux-btrfs Well, RAID1 offers no parity, striping, or spanning of disk space across multiple disks. A RAID10 configuration, on the other hand, requires a minimum of four HDDs, but it stripes data across mirrored pairs. As long as one disk in each mirrored pair is functional, data can be retrieved. With GlusterFS as a distributed volume, the files are already spread among the servers causing file I/O to be spread fairly evenly among them as well, thus probably providing the benefit one might expect with striping (RAID10). The question I have now is: Should I use RAID10 or RAID1 underneath a GlusterFS striped (and possibly replicated) volume? On Tue, Apr 26, 2016 at 5:11 AM, Austin S. Hemmelgarn <ahferroin7@gmail.com> wrote: > On 2016-04-26 06:50, Juan Alberto Cirez wrote: >> >> Thank you guys so very kindly for all your help and taking the time to >> answer my question. I have been reading the wiki and online use cases >> and otherwise delving deeper into the btrfs architecture. >> >> I am managing a 520TB storage pool spread across 16 server pods and >> have tried several methods of distributed storage. Last attempt was >> using Zfs as a base for the physical bricks and GlusterFS as a glue to >> string together the storage pool. I was not satisfied with the results >> (mainly Zfs). Once I have run btrfs for a while on the test server >> (32TB, 8x 4TB HDD RAID10) for a while I will try btrfs/ceph > > For what it's worth, GlusterFS works great on top of BTRFS. 
I don't have > any claims to usage in production, but I've done _a lot_ of testing with it > because we're replacing one of our critical file servers at work with a > couple of systems set up with Gluster on top of BTRFS, and I've been looking > at setting up a small storage cluster at home using it on a couple of > laptops I have which have non-functional displays. Based on what I've seen, > it appears to be rock solid with respect to the common failure modes, > provided you use something like raid1 mode on the BTRFS side of things. ^ permalink raw reply [flat|nested] 21+ messages in thread
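A quick back-of-the-envelope model is useful for the raid1-vs-raid10 question above: both btrfs raid1 and raid10 keep two copies of every chunk, so with equal-size drives the usable space is roughly half of raw either way, and the difference only shows up with mixed drive sizes, where no drive's data can exceed what the remaining drives can mirror.  A rough sketch of that two-copy model — the sizes are invented example values and metadata overhead is ignored:

```shell
# Rough two-copy capacity estimate: usable = min(total/2, total - largest).
# Drive sizes (in TB) are made-up example values.
awk 'BEGIN {
    n = split("4 4 4 2", d, " ")
    total = 0; largest = 0
    for (i = 1; i <= n; i++) {
        total += d[i]
        if (d[i] > largest) largest = d[i]
    }
    usable = total / 2
    if (total - largest < usable) usable = total - largest
    printf "%.1f TB usable of %.1f TB raw\n", usable, total
}'
```

For the example sizes above this prints 7.0 TB usable of 14.0 TB raw; with equal drives the `total - largest` cap never binds and the estimate is simply half of raw.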
* Re: Add device while rebalancing 2016-04-26 11:44 ` Juan Alberto Cirez @ 2016-04-26 12:04 ` Austin S. Hemmelgarn 2016-04-26 12:14 ` Juan Alberto Cirez 2016-04-27  0:58 ` Chris Murphy 1 sibling, 1 reply; 21+ messages in thread From: Austin S. Hemmelgarn @ 2016-04-26 12:04 UTC (permalink / raw) To: Juan Alberto Cirez; +Cc: linux-btrfs On 2016-04-26 07:44, Juan Alberto Cirez wrote: > Well, > RAID1 offers no parity, striping, or spanning of disk space across > multiple disks. > > RAID10 configuration, on the other hand, requires a minimum of four > HDD, but it stripes data across mirrored pairs. As long as one disk in > each mirrored pair is functional, data can be retrieved. > > With GlusterFS as a distributed volume, the files are already spread > among the servers causing file I/O to be spread fairly evenly among > them as well, thus probably providing the benefit one might expect > with stripe (RAID10). > > The question I have now is: Should I use a RAID10 or RAID1 underneath > of a GlusterFS striped (and possibly replicated) volume? If you have enough systems and a new enough version of GlusterFS, I'd suggest using raid1 on the low level, and then either a distributed replicated volume or an erasure coded volume in GlusterFS. 
It is worth noting that I would not personally trust just GlusterFS or just BTRFS with the data replication: BTRFS is still somewhat new (although I haven't had a truly broken filesystem in more than a year), and GlusterFS has a lot more failure modes because of the networking. > > On Tue, Apr 26, 2016 at 5:11 AM, Austin S. Hemmelgarn > <ahferroin7@gmail.com> wrote: >> On 2016-04-26 06:50, Juan Alberto Cirez wrote: >>> >>> Thank you guys so very kindly for all your help and taking the time to >>> answer my question. I have been reading the wiki and online use cases >>> and otherwise delving deeper into the btrfs architecture. >>> >>> I am managing a 520TB storage pool spread across 16 server pods and >>> have tried several methods of distributed storage. Last attempt was >>> using Zfs as a base for the physical bricks and GlusterFS as a glue to >>> string together the storage pool. I was not satisfied with the results >>> (mainly Zfs). Once I have run btrfs for a while on the test server >>> (32TB, 8x 4TB HDD RAID10) for a while I will try btrfs/ceph >> >> For what it's worth, GlusterFS works great on top of BTRFS. I don't have >> any claims to usage in production, but I've done _a lot_ of testing with it >> because we're replacing one of our critical file servers at work with a >> couple of systems set up with Gluster on top of BTRFS, and I've been looking >> at setting up a small storage cluster at home using it on a couple of >> laptops I have which have non-functional displays.  Based on what I've seen, >> it appears to be rock solid with respect to the common failure modes, >> provided you use something like raid1 mode on the BTRFS side of things. ^ permalink raw reply	[flat|nested] 21+ messages in thread
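The storage-efficiency tradeoff behind Austin's erasure-coding remarks is easy to quantify: with k data pieces and m redundancy pieces (so RAID5 is the k+1 case and RAID6 the k+2 case of the n,n+1 / n,n+2 notation above), the fraction of raw capacity available for data is k/(k+m), versus 1/2 for plain two-way replication, and the volume survives up to m failed bricks.  A quick sketch over a few hypothetical configurations:

```shell
# Usable fraction k/(k+m) for a few example erasure-coding shapes.
# The k+m configurations below are illustrative, not recommendations.
for cfg in "2 1" "4 1" "4 2" "8 3"; do
    set -- $cfg
    awk -v k="$1" -v m="$2" \
        'BEGIN { printf "%d+%d: %.0f%% usable, survives %d failure(s)\n",
                 k, m, 100 * k / (k + m), m }'
done
```

The wider shapes are more space-efficient, which is exactly why they get tempting on a "decent sized cluster" — and why the failure-domain risk Austin warns about deserves thought before choosing one.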
* Re: Add device while rebalancing 2016-04-26 12:04 ` Austin S. Hemmelgarn @ 2016-04-26 12:14 ` Juan Alberto Cirez 2016-04-26 12:44 ` Austin S. Hemmelgarn 0 siblings, 1 reply; 21+ messages in thread From: Juan Alberto Cirez @ 2016-04-26 12:14 UTC (permalink / raw) To: Austin S. Hemmelgarn; +Cc: linux-btrfs Thank you again, Austin. My ideal case would be high availability coupled with reliable data replication and integrity against accidental loss. I am willing to cede ground on the write speed; but the read has to be as optimized as possible. So far BTRFS, RAID10 on the 32TB test server is quite good both read & write and data loss/corruption has not been an issue yet. When I introduce the network/distributed layer, I would like the same. BTW does Ceph provide similar functionality, reliability and performance? On Tue, Apr 26, 2016 at 6:04 AM, Austin S. Hemmelgarn <ahferroin7@gmail.com> wrote: > On 2016-04-26 07:44, Juan Alberto Cirez wrote: >> >> Well, >> RAID1 offers no parity, striping, or spanning of disk space across >> multiple disks. >> >> RAID10 configuration, on the other hand, requires a minimum of four >> HDD, but it stripes data across mirrored pairs. As long as one disk in >> each mirrored pair is functional, data can be retrieved. >> >> With GlusterFS as a distributed volume, the files are already spread >> among the servers causing file I/O to be spread fairly evenly among >> them as well, thus probably providing the benefit one might expect >> with stripe (RAID10). >> >> The question I have now is: Should I use a RAID10 or RAID1 underneath >> of a GlusterFS striped (and possibly replicated) volume? > If you have enough systems and a new enough version of GlusterFS, I'd > suggest using raid1 on the low level, and then either a distributed > replicated volume or an erasure coded volume in GlusterFS. 
> Having more individual nodes involved will improve your scalability to > larger numbers of clients, and you can have more nodes with the same number > of disks if you use raid1 instead of raid10 on BTRFS. Using Erasure coding > in Gluster will provide better resiliency with higher node counts for each > individual file, at the cost of moderately higher CPU time being used. > FWIW, RAID5 and RAID6 are both specific cases of (mathematically) optimal > erasure coding (RAID5 is n,n+1 and RAID6 is n,n+2 using the normal > notation), but the equivalent forms in Gluster are somewhat risky with any > decent sized cluster. > > It is worth noting that I would not personally trust just GlusterFS or just > BTRFS with the data replication, BTRFS is still somewhat new (although I > haven't had a truly broken filesystem in more than a year), and GlusterFS > has a lot more failure modes because of the networking. > >> >> On Tue, Apr 26, 2016 at 5:11 AM, Austin S. Hemmelgarn >> <ahferroin7@gmail.com> wrote: >>> >>> On 2016-04-26 06:50, Juan Alberto Cirez wrote: >>>> >>>> >>>> Thank you guys so very kindly for all your help and taking the time to >>>> answer my question. I have been reading the wiki and online use cases >>>> and otherwise delving deeper into the btrfs architecture. >>>> >>>> I am managing a 520TB storage pool spread across 16 server pods and >>>> have tried several methods of distributed storage. Last attempt was >>>> using Zfs as a base for the physical bricks and GlusterFS as a glue to >>>> string together the storage pool. I was not satisfied with the results >>>> (mainly Zfs). Once I have run btrfs for a while on the test server >>>> (32TB, 8x 4TB HDD RAID10) for a while I will try btrfs/ceph >>> >>> >>> For what it's worth, GlusterFS works great on top of BTRFS. 
I don't have >>> any claims to usage in production, but I've done _a lot_ of testing with >>> it >>> because we're replacing one of our critical file servers at work with a >>> couple of systems set up with Gluster on top of BTRFS, and I've been >>> looking >>> at setting up a small storage cluster at home using it on a couple of >>> laptops I have which have non-functional displays. Based on what I've >>> seen, >>> it appears to be rock solid with respect to the common failure modes, >>> provided you use something like raid1 mode on the BTRFS side of things. > > ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Add device while rebalancing 2016-04-26 12:14 ` Juan Alberto Cirez @ 2016-04-26 12:44 ` Austin S. Hemmelgarn 0 siblings, 0 replies; 21+ messages in thread From: Austin S. Hemmelgarn @ 2016-04-26 12:44 UTC (permalink / raw) To: Juan Alberto Cirez; +Cc: linux-btrfs On 2016-04-26 08:14, Juan Alberto Cirez wrote: > Thank you again, Austin. > > My ideal case would be high availability coupled with reliable data > replication and integrity against accidental lost. I am willing to > cede ground on the write speed; but the read has to be as optimized as > possible. > So far BTRFS, RAID10 on the 32TB test server is quite good both read & > write and data lost/corruption has not been an issue yet. When I > introduce the network/distributed layer, I would like the same. > BTW does Ceph provides similar functionality, reliability and performace? I can't give as much advice on Ceph, except to say that when I last tested it more than 2 years ago, the filesystem front-end had some serious data integrity issues, and the block device front-end had some sanity issues when dealing with systems going off-line (either crashing, or being shut down). I don't know if they're fixed or not by now. It's worth noting that while Gluster and Ceph are both intended for cluster storage, Ceph has a much more data-center-oriented approach (it appears from what I've seen to be optimized for lots of small systems running as OSD's with a few bigger ones running as monitors and possibly MDS's), while Gluster seems (again, personal perspective) to try to be more agnostic of what hardware is involved. I will comment though that it is exponentially easier to recover data from a failed GlusterFS cluster than from a failed Ceph cluster: Gluster uses flat files with a few extended attributes for storage, whereas Ceph uses its own internal binary object format (partly because Ceph is first and foremost an object storage system, whereas Gluster is primarily intended as an actual filesystem). 
Also, with respect to performance, you may want to compare BTRFS raid10 mode to BTRFS raid1 on top of two LVM RAID0 volumes. I find this tends to get better overall performance with no difference in data safety, because BTRFS still has a pretty brain-dead I/O scheduler in the multi-device code. > On Tue, Apr 26, 2016 at 6:04 AM, Austin S. Hemmelgarn > <ahferroin7@gmail.com> wrote: >> On 2016-04-26 07:44, Juan Alberto Cirez wrote: >>> >>> Well, >>> RAID1 offers no parity, striping, or spanning of disk space across >>> multiple disks. >>> >>> RAID10 configuration, on the other hand, requires a minimum of four >>> HDD, but it stripes data across mirrored pairs. As long as one disk in >>> each mirrored pair is functional, data can be retrieved. >>> >>> With GlusterFS as a distributed volume, the files are already spread >>> among the servers causing file I/O to be spread fairly evenly among >>> them as well, thus probably providing the benefit one might expect >>> with stripe (RAID10). >>> >>> The question I have now is: Should I use a RAID10 or RAID1 underneath >>> of a GlusterFS stripped (and possibly replicated) volume? >> >> If you have enough systems and a new enough version of GlusterFS, I'd >> suggest using raid1 on the low level, and then either a distributed >> replicated volume or an erasure coded volume in GlusterFS. >> Having more individual nodes involved will improve your scalability to >> larger numbers of clients, and you can have more nodes with the same number >> of disks if you use raid1 instead of raid10 on BTRFS. Using Erasure coding >> in Gluster will provide better resiliency with higher node counts for each >> individual file, at the cost of moderately higher CPU time being used. >> FWIW, RAID5 and RAID6 are both specific cases of (mathematically) optimal >> erasure coding (RAID5 is n,n+1 and RAID6 is n,n+2 using the normal >> notation), but the equivalent forms in Gluster are somewhat risky with any >> decent sized cluster. 
>> >> It is worth noting that I would not personally trust just GlusterFS or just >> BTRFS with the data replication, BTRFS is still somewhat new (although I >> haven't had a truly broken filesystem in more than a year), and GlusterFS >> has a lot more failure modes because of the networking. >> >>> >>> On Tue, Apr 26, 2016 at 5:11 AM, Austin S. Hemmelgarn >>> <ahferroin7@gmail.com> wrote: >>>> >>>> On 2016-04-26 06:50, Juan Alberto Cirez wrote: >>>>> >>>>> >>>>> Thank you guys so very kindly for all your help and taking the time to >>>>> answer my question. I have been reading the wiki and online use cases >>>>> and otherwise delving deeper into the btrfs architecture. >>>>> >>>>> I am managing a 520TB storage pool spread across 16 server pods and >>>>> have tried several methods of distributed storage. Last attempt was >>>>> using Zfs as a base for the physical bricks and GlusterFS as a glue to >>>>> string together the storage pool. I was not satisfied with the results >>>>> (mainly Zfs). Once I have run btrfs for a while on the test server >>>>> (32TB, 8x 4TB HDD RAID10) for a while I will try btrfs/ceph >>>> >>>> >>>> For what it's worth, GlusterFS works great on top of BTRFS. I don't have >>>> any claims to usage in production, but I've done _a lot_ of testing with >>>> it >>>> because we're replacing one of our critical file servers at work with a >>>> couple of systems set up with Gluster on top of BTRFS, and I've been >>>> looking >>>> at setting up a small storage cluster at home using it on a couple of >>>> laptops I have which have non-functional displays. Based on what I've >>>> seen, >>>> it appears to be rock solid with respect to the common failure modes, >>>> provided you use something like raid1 mode on the BTRFS side of things. ^ permalink raw reply [flat|nested] 21+ messages in thread
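The raid1-over-two-RAID0s layout Austin suggests could be sketched roughly like this. This is a hypothetical, untested outline, not a recipe from the thread: the device names (/dev/sdb through /dev/sde), volume group names, and mount point are all assumptions, and the commands need root on scratch hardware.

```shell
# Sketch: BTRFS raid1 on top of two LVM striped (RAID0-style) volumes.
# All device names and sizes here are hypothetical; run only on scratch disks.

# One physical volume per disk, then two volume groups of two disks each.
pvcreate /dev/sdb /dev/sdc /dev/sdd /dev/sde
vgcreate vg0 /dev/sdb /dev/sdc
vgcreate vg1 /dev/sdd /dev/sde

# One striped logical volume per group (-i 2 = stripe across both PVs).
lvcreate -i 2 -l 100%FREE -n stripe0 vg0
lvcreate -i 2 -l 100%FREE -n stripe1 vg1

# btrfs raid1 across the two stripes: each chunk gets one copy per stripe,
# so BTRFS checksumming can repair a bad block from the other stripe.
mkfs.btrfs -d raid1 -m raid1 /dev/vg0/stripe0 /dev/vg1/stripe1
mount /dev/vg0/stripe0 /mnt/brick

# Inspect how data/metadata chunks are allocated across the two devices.
btrfs filesystem usage /mnt/brick
```

The design point, per the discussion above, is that putting the btrfs raid1 on top (rather than below) keeps the checksum-and-repair logic in the layer that holds both copies.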
* Re: Add device while rebalancing 2016-04-26 11:44 ` Juan Alberto Cirez 2016-04-26 12:04 ` Austin S. Hemmelgarn @ 2016-04-27 0:58 ` Chris Murphy 2016-04-27 10:37 ` Duncan 2016-04-27 11:22 ` Austin S. Hemmelgarn 1 sibling, 2 replies; 21+ messages in thread From: Chris Murphy @ 2016-04-27 0:58 UTC (permalink / raw) To: Juan Alberto Cirez; +Cc: Austin S. Hemmelgarn, linux-btrfs On Tue, Apr 26, 2016 at 5:44 AM, Juan Alberto Cirez <jacirez@rdcsafety.com> wrote: > Well, > RAID1 offers no parity, striping, or spanning of disk space across > multiple disks. Btrfs raid1 does span, although it's typically called the "volume", or a "pool", similar to ZFS terminology. e.g. 10 2TiB disks will get you a single volume on which you can store about 10TiB of data with two copies (called stripes in Btrfs). In effect, the way chunk replication works, it's a concat+raid1. > RAID10 configuration, on the other hand, requires a minimum of four > HDD, but it stripes data across mirrored pairs. As long as one disk in > each mirrored pair is functional, data can be retrieved. Not Btrfs raid10. It's not the devices that are mirrored pairs, but rather the chunks. There's no way to control or determine which devices the pairs land on. It's certain you get at least a partial failure (data for sure, and likely metadata if it's also using the raid10 profile) of the volume if you lose more than one device; planning-wise, you have to assume you lose the entire array. > > With GlusterFS as a distributed volume, the files are already spread > among the servers causing file I/O to be spread fairly evenly among > them as well, thus probably providing the benefit one might expect > with stripe (RAID10). Yes, the raid1 of Btrfs is just so you don't have to rebuild volumes if you lose a drive. But since raid1 is not n-way copies, and only means two copies, you don't really want the file systems getting that big or you increase the chances of a double failure. 
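As a concrete sketch of such a spanned raid1 pool, and of the add-device-then-rebalance operation this thread started with: the device names and mount point below are hypothetical, and the commands need root on scratch disks.

```shell
# Hypothetical sketch: a btrfs raid1 "pool" spanning four disks.
# raid1 always means exactly two copies per chunk, no matter how many
# devices are in the pool, so usable space is roughly half the raw total.
mkfs.btrfs -d raid1 -m raid1 /dev/sdb /dev/sdc /dev/sdd /dev/sde
mount /dev/sdb /mnt/pool

# Show per-device allocation; chunks land on whichever devices have the
# most free space, not on fixed mirrored pairs.
btrfs filesystem usage /mnt/pool

# Growing the pool (the subject of this thread): add a device, then
# rebalance so existing chunks are redistributed onto it.
btrfs device add /dev/sdf /mnt/pool
btrfs balance start /mnt/pool
```

Note that new writes go to the added device even without the balance; the balance just spreads the existing chunks.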
I've always thought it'd be neat, in a Btrfs + GlusterFS setup, if it were possible for Btrfs to inform GlusterFS of "missing/corrupt" files, and then for Btrfs to drop the references for those files, instead of either rebuilding or remaining degraded. And then let GlusterFS deal with replication of those files to maintain redundancy, i.e. the Btrfs volumes would be single profile for data, and raid1 for metadata. When there's n-way raid1, each drive can have a copy of the file system, and it'd tolerate in effect n-1 drive failures; the file system could at least still inform Gluster (or Ceph) of the missing data, the file system still remains valid, only briefly degraded, and can still be expanded when new drives become available. I'm not a big fan of hot (or cold) spares. They contribute nothing, but take up physical space and power. -- Chris Murphy ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Add device while rebalancing 2016-04-27 0:58 ` Chris Murphy @ 2016-04-27 10:37 ` Duncan 2016-04-27 11:22 ` Austin S. Hemmelgarn 1 sibling, 0 replies; 21+ messages in thread From: Duncan @ 2016-04-27 10:37 UTC (permalink / raw) To: linux-btrfs Chris Murphy posted on Tue, 26 Apr 2016 18:58:06 -0600 as excerpted: > On Tue, Apr 26, 2016 at 5:44 AM, Juan Alberto Cirez > <jacirez@rdcsafety.com> wrote: >> RAID10 configuration, on the other hand, requires a minimum of four >> HDD, but it stripes data across mirrored pairs. As long as one disk in >> each mirrored pair is functional, data can be retrieved. > > Not Btrfs raid10. It's not the devices that are mirrored pairs, but > rather the chunks. There's no way to control or determine on what > devices the pairs are on. It's certain you get at least a partial > failure (data for sure and likely metadata if it's also using raid10 > profile) of the volume if you lose more than 1 device, planning wise you > have to assume you lose the entire array. Primarily quoting and restating the above (and below) to emphasize it. Remember: * btrfs raid is chunk-level, *NOT* device-level. That has important implications in terms of recovery from degraded operation. * btrfs parity-raid (raid56 mode) isn't yet mature and is definitely nothing I'd trust in production. * btrfs redundancy-raid (raid1 and raid10 modes, as well as dup mode on a single device) is precisely pair-copy -- two copies, with the raid modes forcing each copy to a different device or set of devices. More devices simply means more space, *NOT* more redundancy/copies. Again, these copies are at the chunk level. The chunks can and will be distributed across devices based on most space available, meaning loss of more than one device will in most cases kill the array. Because mirror-pairs happen at the chunk level, not the device level, there is no such thing as losing only one mirror of each mirror pair when more than a single device fails, because statistically, the chances that both copies of some chunks were on the two now failed/missing devices are pretty high. * btrfs raid10 stripes N/2-way, while only duplicating exactly two-way. So a six-device raid10 will stripe three devices per mirror, while a 5-device raid10 will stripe 2 devices per mirror, with the odd device out being a different device for each new chunk, due to the most-space-left allocation algorithm. >> With GlusterFS as a distributed volume, the files are already spread >> among the servers causing file I/O to be spread fairly evenly among >> them as well, thus probably providing the benefit one might expect with >> stripe (RAID10). > > Yes, the raid1 of Btrfs is just so you don't have to rebuild volumes if > you lose a drive. But since raid1 is not n-way copies, and only means > two copies, you don't really want the file systems getting that big or > you increase the chances of a double failure. Again emphasizing: since you're running a distributed filesystem on top, keep the lower-level btrfs raids small and do more of them, multiple btrfs raid bricks per machine even, as long as your distributed level is specced to be able to lose the bricks of at least one entire machine, of course. OTOH, unlike traditional raid, btrfs does actual checksumming and data/metadata integrity at the block level, and can and will detect integrity issues and correct from the second copy when the raid level supplies one, assuming it's good of course. That should fix problems at the lower level that other filesystems wouldn't, meaning fewer problems ever reach the distributed level in the first place. Thus, also emphasizing something Austin suggested: you may wish to consider btrfs raid1 on top of a pair of mdraid or dmraid raid0s. 
As you are likely well aware, normally raid1 on top of raid0 is called raid01 and is discouraged in favor of raid10 (raid0 on top of raid1) for rebuild-efficiency reasons after a lost device (with raid1 underneath, the rebuild of a lost device is localized to the presumably two-device raid1; with raid1 on top, the whole raid0 stripe must be rebuilt, and that's normally at the whole-device level). Of course putting the btrfs raid1 on top reverses this and would *normally* be discouraged as raid01, but btrfs raid1's operational data integrity handling, while not getting away from having to rebuild the whole raid0 stripe from the other one, does mean that gets done for an individual bad block -- no whole-device failure necessary. And of course you can't get that by putting btrfs raid0 on top, since then the underlying raid1 layer won't be doing that integrity verification, and if that bad block happens to be returned by the underlying raid1 layer, the btrfs raid0 will simply fail the verification and error out that read, despite another good copy on the underlying raid1, because btrfs won't know anything about it. Meanwhile, as Austin says, btrfs' A/B copy read scheduling is... unoptimized. Basically, it's simple even/odd PID based, so a single read thread will always hit the same copy, leaving the other one idle. I've argued before that precisely this is a very good indication of where the btrfs devs themselves think btrfs is at: it's clearly suboptimal while much better scheduling examples, including the mdraid read-scheduling code, praised for its efficiency, exist in the kernel, so the failure to optimize must be put down either to simply lacking the time due to higher-priority development and bugfixing tasks, or to an avoidance of the dangers of "premature optimization". 
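The even/odd-PID routing just described can be modeled with a trivial toy sketch -- this is an illustration, not btrfs code, and the PIDs are made up:

```shell
# Toy model of btrfs raid1's even/odd-PID mirror selection (not actual
# btrfs code): a reader is routed to copy (pid % 2), so a single-threaded
# workload always lands on the same copy and leaves the other device idle.
for pid in 4000 4001 4002 4003; do
  echo "reader pid=$pid -> mirror $(( pid % 2 ))"
done
# prints mirrors 0, 1, 0, 1 for the four consecutive pids
```

A single long-running process keeps one pid, hence one mirror, for every read it issues, which is why the scheduling wastes half the available read bandwidth in that case.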
In either case, that such unoptimized code remains in such a highly visible and performance-critical place is an extremely strong indicator that the btrfs devs themselves don't consider btrfs a stable and mature filesystem yet. And putting a pair of md/dm raid0s below that btrfs raid1 both helps make up a bit for btrfs raid1's braindead read-scheduling and lets you exploit btrfs raid1's data integrity features. Of course it also forces btrfs into a more deterministic distribution of those chunk copies, so you can lose up to all the devices in one of those raid0s as long as the other one remains functional, but that's nothing to really count on, so you still plan for single-device-failure redundancy only at the individual brick level, and use the distributed filesystem layer to deal with whole-brick failure above that. -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Add device while rebalancing 2016-04-27 0:58 ` Chris Murphy 2016-04-27 10:37 ` Duncan @ 2016-04-27 11:22 ` Austin S. Hemmelgarn 2016-04-27 15:58 ` Juan Alberto Cirez 2016-04-27 23:19 ` Chris Murphy 1 sibling, 2 replies; 21+ messages in thread From: Austin S. Hemmelgarn @ 2016-04-27 11:22 UTC (permalink / raw) To: Chris Murphy, Juan Alberto Cirez; +Cc: linux-btrfs On 2016-04-26 20:58, Chris Murphy wrote: > On Tue, Apr 26, 2016 at 5:44 AM, Juan Alberto Cirez > <jacirez@rdcsafety.com> wrote: >> >> With GlusterFS as a distributed volume, the files are already spread >> among the servers causing file I/O to be spread fairly evenly among >> them as well, thus probably providing the benefit one might expect >> with stripe (RAID10). > > Yes, the raid1 of Btrfs is just so you don't have to rebuild volumes > if you lose a drive. But since raid1 is not n-way copies, and only > means two copies, you don't really want the file systems getting that > big or you increase the chances of a double failure. > > I've always though it'd be neat in a Btrfs + GlusterFS, if it were > possible for Btrfs to inform Gluster FS of "missing/corrupt" files, > and then for Btrfs to drop reference for those files, instead of > either rebuilding or remaining degraded. And then let GlusterFS deal > with replication of those files to maintain redundancy. i.e. the Btrfs > volumes would be single profile for data, and raid1 for metadata. When > there's n-way raid1, each drive can have a copy of the file system, > and it'd tolerate in effect n-1 drive failures and the file system > could at least still inform Gluster (or Ceph) of the missing data, the > file system still remains valid, only briefly degraded, and can still > be expanded when new drives become available. FWIW, I _think_ this can be done with the scrubbing code in GlusterFS. It's designed to repair data mismatches, but I'm not sure how it handles missing copies of data. 
However, in the current state, there's no way without external scripts to handle re-shaping of the storage bricks if part of them fails. ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Add device while rebalancing 2016-04-27 11:22 ` Austin S. Hemmelgarn @ 2016-04-27 15:58 ` Juan Alberto Cirez 2016-04-27 16:29 ` Holger Hoffstätte 2016-04-27 23:19 ` Chris Murphy 1 sibling, 1 reply; 21+ messages in thread From: Juan Alberto Cirez @ 2016-04-27 15:58 UTC (permalink / raw) To: Austin S. Hemmelgarn; +Cc: Chris Murphy, linux-btrfs WOW! Correct me if I'm wrong, but the sum total of the above seems to suggest (at first glance) that BTRFS adds several layers of complexity, but for little real benefit (at least in the use case of btrfs at the brick layer with a distributed filesystem on top)... "...I've always though it'd be neat in a Btrfs + GlusterFS, if it were possible for Btrfs to inform Gluster FS of "missing/corrupt" files, and then for Btrfs to drop reference for those files, instead of either rebuilding or remaining degraded. And then let GlusterFS deal with replication of those files to maintain redundancy. i.e. the Btrfs volumes would be single profile for data, and raid1 for metadata. When there's n-way raid1, each drive can have a copy of the file system, and it'd tolerate in effect n-1 drive failures and the file system could at least still inform Gluster (or Ceph) of the missing data, the file system still remains valid, only briefly degraded, and can still be expanded when new drives become available..." That, in my n00b opinion, would be brilliant in a real-world use case. On Wed, Apr 27, 2016 at 5:22 AM, Austin S. Hemmelgarn <ahferroin7@gmail.com> wrote: > On 2016-04-26 20:58, Chris Murphy wrote: >> >> On Tue, Apr 26, 2016 at 5:44 AM, Juan Alberto Cirez >> <jacirez@rdcsafety.com> wrote: >>> >>> >>> With GlusterFS as a distributed volume, the files are already spread >>> among the servers causing file I/O to be spread fairly evenly among >>> them as well, thus probably providing the benefit one might expect >>> with stripe (RAID10). >> >> >> Yes, the raid1 of Btrfs is just so you don't have to rebuild volumes >> if you lose a drive. 
But since raid1 is not n-way copies, and only >> means two copies, you don't really want the file systems getting that >> big or you increase the chances of a double failure. >> >> I've always though it'd be neat in a Btrfs + GlusterFS, if it were >> possible for Btrfs to inform Gluster FS of "missing/corrupt" files, >> and then for Btrfs to drop reference for those files, instead of >> either rebuilding or remaining degraded. And then let GlusterFS deal >> with replication of those files to maintain redundancy. i.e. the Btrfs >> volumes would be single profile for data, and raid1 for metadata. When >> there's n-way raid1, each drive can have a copy of the file system, >> and it'd tolerate in effect n-1 drive failures and the file system >> could at least still inform Gluster (or Ceph) of the missing data, the >> file system still remains valid, only briefly degraded, and can still >> be expanded when new drives become available. > > FWIW, I _think_ this can be done with the scrubbing code in GlusterFS. It's > designed to repair data mismatches, but I'm not sure how it handles missing > copies of data. However, in the current state, there's no way without > external scripts to handle re-shaping of the storage bricks if part of them > fails. ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Add device while rebalancing 2016-04-27 15:58 ` Juan Alberto Cirez @ 2016-04-27 16:29 ` Holger Hoffstätte 2016-04-27 16:38 ` Juan Alberto Cirez 0 siblings, 1 reply; 21+ messages in thread From: Holger Hoffstätte @ 2016-04-27 16:29 UTC (permalink / raw) To: linux-btrfs On 04/27/16 17:58, Juan Alberto Cirez wrote: > Correct me if I'm wrong but the sum total of the above seems to > suggest (at first glance) that BRTFS add several layers of complexity, > but for little real benefit (at least in the case use of btrfs at the > brick layer with a distributed filesystem on top)... This may come as a surprise, but the same can be said for every other (common) filesystem (+ device management stack) that can be used standalone. Jeff Darcy (of GlusterFS) just wrote a really nice blog post why current filesystems and their historically grown requirements (mostly as they relate to the POSIX interface standard) are in many ways just not a good fit for scale-out/redundant storage: http://pl.atyp.us/2016-05-updating-posix.html Quite a few of the capabilities & features which are useful or necessary in standalone operation (regardless of single- or multi- device setup) are *actively unhelpful* in a distributed context, which is why e.g. Ceph will soon do away with the on-disk filesystem for data, and manage metadata exclusively by itself. cheers, Holger ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Add device while rebalancing 2016-04-27 16:29 ` Holger Hoffstätte @ 2016-04-27 16:38 ` Juan Alberto Cirez 2016-04-27 16:40 ` Juan Alberto Cirez 0 siblings, 1 reply; 21+ messages in thread From: Juan Alberto Cirez @ 2016-04-27 16:38 UTC (permalink / raw) To: Holger Hoffstätte; +Cc: linux-btrfs Holger, If this is so, then it leaves me even more confused. I was under the impression that the driving imperative for the creation of btrfs was to address the shortcomings of current filesystems within the context of distributed data. That the idea was to create a low-level filesystem that would be the primary choice as a block/brick layer for scale-out, distributed data storage... On Wed, Apr 27, 2016 at 10:29 AM, Holger Hoffstätte <holger.hoffstaette@googlemail.com> wrote: > On 04/27/16 17:58, Juan Alberto Cirez wrote: >> Correct me if I'm wrong but the sum total of the above seems to >> suggest (at first glance) that BRTFS add several layers of complexity, >> but for little real benefit (at least in the case use of btrfs at the >> brick layer with a distributed filesystem on top)... > > This may come as a surprise, but the same can be said for every other > (common) filesystem (+ device management stack) that can be used > standalone. > > Jeff Darcy (of GlusterFS) just wrote a really nice blog post why > current filesystems and their historically grown requirements (mostly > as they relate to the POSIX interface standard) are in many ways > just not a good fit for scale-out/redundant storage: > http://pl.atyp.us/2016-05-updating-posix.html > > Quite a few of the capabilities & features which are useful or > necessary in standalone operation (regardless of single- or multi- > device setup) are *actively unhelpful* in a distributed context, which > is why e.g. Ceph will soon do away with the on-disk filesystem for > data, and manage metadata exclusively by itself. 
> > cheers, > Holger > > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Add device while rebalancing 2016-04-27 16:38 ` Juan Alberto Cirez @ 2016-04-27 16:40 ` Juan Alberto Cirez 2016-04-27 17:23 ` Holger Hoffstätte 0 siblings, 1 reply; 21+ messages in thread From: Juan Alberto Cirez @ 2016-04-27 16:40 UTC (permalink / raw) To: Holger Hoffstätte; +Cc: linux-btrfs Holger, If this is so, then it leaves me even more confused. I was under the impression that the driving imperative for the creation of btrfs was to address the shortcomings of current filesystems (within the context of distributed data). That the idea was to create a low-level filesystem that would be the primary choice as a block/brick layer for scale-out, distributed data storage... On Wed, Apr 27, 2016 at 10:38 AM, Juan Alberto Cirez <jacirez@rdcsafety.com> wrote: > Holger, > If this is so, then it leaves with even confused. I was under the > impression that the driving imperative for the creation of btrfs was > to address the shortcomings of current filesystem within the context > of distributed data. That the idea was to create a low level > filesystem that would the primary choice as a block/brick layer for a > scale-out, distributed data storage... > > On Wed, Apr 27, 2016 at 10:29 AM, Holger Hoffstätte > <holger.hoffstaette@googlemail.com> wrote: >> On 04/27/16 17:58, Juan Alberto Cirez wrote: >>> Correct me if I'm wrong but the sum total of the above seems to >>> suggest (at first glance) that BRTFS add several layers of complexity, >>> but for little real benefit (at least in the case use of btrfs at the >>> brick layer with a distributed filesystem on top)... >> >> This may come as a surprise, but the same can be said for every other >> (common) filesystem (+ device management stack) that can be used >> standalone. 
>> >> Jeff Darcy (of GlusterFS) just wrote a really nice blog post why >> current filesystems and their historically grown requirements (mostly >> as they relate to the POSIX interface standard) are in many ways >> just not a good fit for scale-out/redundant storage: >> http://pl.atyp.us/2016-05-updating-posix.html >> >> Quite a few of the capabilities & features which are useful or >> necessary in standalone operation (regardless of single- or multi- >> device setup) are *actively unhelpful* in a distributed context, which >> is why e.g. Ceph will soon do away with the on-disk filesystem for >> data, and manage metadata exclusively by itself. >> >> cheers, >> Holger >> >> -- >> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in >> the body of a message to majordomo@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Add device while rebalancing 2016-04-27 16:40 ` Juan Alberto Cirez @ 2016-04-27 17:23 ` Holger Hoffstätte 0 siblings, 0 replies; 21+ messages in thread From: Holger Hoffstätte @ 2016-04-27 17:23 UTC (permalink / raw) To: Juan Alberto Cirez; +Cc: linux-btrfs On 04/27/16 18:40, Juan Alberto Cirez wrote: > If this is so, then it leaves even confused. I was under the > impression that the driving imperative for the creation of btrfs was > to address the shortcomings of current filesystems (within the context > of distributed data). That the idea was to create a low level > filesystem that would be the primary choice as a block/brick layer for a > scale-out, distributed data storage... I can't speak for who was or is motivated by what. Btrfs was a necessary reaction to ZFS, and AFAIK this had nothing to do with distributed storage but rather growing concerns around reliability (checksumming), scalability and operational ease: snapshotting, growing/shrinking etc. It's true that some of btrfs' capabilities make it look like a good candidate, and e.g. Ceph started out using it. For many reasons that didn't work out (AFAIK btrfs maturity + extensibility) - but it also did not address a fundamental mismatch in requirements, which other filesystems (ext4, xfs) could not address either. btrfs simply does "too much" because it has to; you cannot remove or turn off half of what makes a kernel-based filesystem a usable filesystem. This is kind of sad because at its core btrfs *is* an object store with various trees for metadata handling and whatnot - but there's no easy way to turn off all the "Unix is stupid" stuff. AFAIK Gluster will soon also start managing xattrs differently, so this is not limited to Ceph. 
I've been following this saga for several years now and it's absolutely *astounding* how many bugs and performance problems Ceph has unearthed in existing filesystems, simply because it stresses them in ways they never have been stressed before... only to create the illusion of a distributed key/value store, badly. I don't want to argue about details; you can read more about some of the reasons in [1]. [grumble grumble exokernels and composable things in userland grumble] cheers Holger [1] http://www.slideshare.net/sageweil1/ceph-and-rocksdb ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Add device while rebalancing 2016-04-27 11:22 ` Austin S. Hemmelgarn 2016-04-27 15:58 ` Juan Alberto Cirez @ 2016-04-27 23:19 ` Chris Murphy 2016-04-28 11:21 ` Austin S. Hemmelgarn 1 sibling, 1 reply; 21+ messages in thread From: Chris Murphy @ 2016-04-27 23:19 UTC (permalink / raw) To: Austin S. Hemmelgarn; +Cc: Chris Murphy, Juan Alberto Cirez, linux-btrfs On Wed, Apr 27, 2016 at 5:22 AM, Austin S. Hemmelgarn <ahferroin7@gmail.com> wrote: > On 2016-04-26 20:58, Chris Murphy wrote: >> >> On Tue, Apr 26, 2016 at 5:44 AM, Juan Alberto Cirez >> <jacirez@rdcsafety.com> wrote: >>> >>> >>> With GlusterFS as a distributed volume, the files are already spread >>> among the servers causing file I/O to be spread fairly evenly among >>> them as well, thus probably providing the benefit one might expect >>> with stripe (RAID10). >> >> >> Yes, the raid1 of Btrfs is just so you don't have to rebuild volumes >> if you lose a drive. But since raid1 is not n-way copies, and only >> means two copies, you don't really want the file systems getting that >> big or you increase the chances of a double failure. >> >> I've always though it'd be neat in a Btrfs + GlusterFS, if it were >> possible for Btrfs to inform Gluster FS of "missing/corrupt" files, >> and then for Btrfs to drop reference for those files, instead of >> either rebuilding or remaining degraded. And then let GlusterFS deal >> with replication of those files to maintain redundancy. i.e. the Btrfs >> volumes would be single profile for data, and raid1 for metadata. When >> there's n-way raid1, each drive can have a copy of the file system, >> and it'd tolerate in effect n-1 drive failures and the file system >> could at least still inform Gluster (or Ceph) of the missing data, the >> file system still remains valid, only briefly degraded, and can still >> be expanded when new drives become available. > > FWIW, I _think_ this can be done with the scrubbing code in GlusterFS. 
It's > designed to repair data mismatches, but I'm not sure how it handles missing > copies of data. However, in the current state, there's no way without > external scripts to handle re-shaping of the storage bricks if part of them > fails. Yeah, I haven't tried doing a scrub, parsing dmesg for busted file paths, and feeding those paths into rm to see what happens. Will they get deleted without additional errors? If so, good; then a second scrub should come back clean. And then btrfs device delete missing to get rid of the broken device *and* cause missing metadata to be replicated again, and now in theory the fs should be back to normal. But it'd have to be tested with a umount followed by mount to see if -o degraded is still required. -- Chris Murphy ^ permalink raw reply [flat|nested] 21+ messages in thread
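The untested sequence Chris sketches above might look something like the following. To be clear, this is a hypothetical outline: the mount point is made up, the dmesg parsing pattern is an assumption (the kernel's checksum-error message format varies by version), and the whole flow is exactly the experiment Chris says has not been tried.

```shell
# Hypothetical sketch of the scrub-then-drop recovery flow described above.
# Untested; paths and the dmesg pattern are assumptions. Scratch data only.
MNT=/mnt/brick

# 1. Foreground scrub; with single-profile data a bad copy cannot be
#    repaired, but the affected paths land in the kernel log.
btrfs scrub start -B "$MNT"

# 2. Collect paths the kernel reported as corrupted (the grep pattern is
#    a placeholder -- adjust it to the actual message format in use).
dmesg | grep -oP 'checksum error.*path: \K\S+' | sort -u > /tmp/busted

# 3. Remove the broken files so a second scrub comes back clean, and let
#    the distributed layer (Gluster/Ceph) re-replicate them.
while read -r f; do rm -f -- "$MNT/$f"; done < /tmp/busted
btrfs scrub start -B "$MNT"

# 4. Drop the failed device; raid1 metadata missing from it gets
#    re-replicated onto the remaining devices.
btrfs device delete missing "$MNT"
```

Whether step 3 completes without further errors, and whether a remount afterwards still needs -o degraded, are exactly the open questions in the message above.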
* Re: Add device while rebalancing 2016-04-27 23:19 ` Chris Murphy @ 2016-04-28 11:21 ` Austin S. Hemmelgarn 0 siblings, 0 replies; 21+ messages in thread From: Austin S. Hemmelgarn @ 2016-04-28 11:21 UTC (permalink / raw) To: Chris Murphy; +Cc: Juan Alberto Cirez, linux-btrfs On 2016-04-27 19:19, Chris Murphy wrote: > On Wed, Apr 27, 2016 at 5:22 AM, Austin S. Hemmelgarn > <ahferroin7@gmail.com> wrote: >> On 2016-04-26 20:58, Chris Murphy wrote: >>> >>> On Tue, Apr 26, 2016 at 5:44 AM, Juan Alberto Cirez >>> <jacirez@rdcsafety.com> wrote: >>>> >>>> >>>> With GlusterFS as a distributed volume, the files are already spread >>>> among the servers causing file I/O to be spread fairly evenly among >>>> them as well, thus probably providing the benefit one might expect >>>> with stripe (RAID10). >>> >>> >>> Yes, the raid1 of Btrfs is just so you don't have to rebuild volumes >>> if you lose a drive. But since raid1 is not n-way copies, and only >>> means two copies, you don't really want the file systems getting that >>> big or you increase the chances of a double failure. >>> >>> I've always thought it'd be neat in a Btrfs + GlusterFS setup, if it were >>> possible for Btrfs to inform GlusterFS of "missing/corrupt" files, >>> and then for Btrfs to drop its references to those files, instead of >>> either rebuilding or remaining degraded. And then let GlusterFS deal >>> with replication of those files to maintain redundancy. i.e. the Btrfs >>> volumes would be single profile for data, and raid1 for metadata. When >>> there's n-way raid1, each drive can have a copy of the file system, >>> and it'd tolerate in effect n-1 drive failures and the file system >>> could at least still inform Gluster (or Ceph) of the missing data; the >>> file system still remains valid, only briefly degraded, and can still >>> be expanded when new drives become available. >> >> FWIW, I _think_ this can be done with the scrubbing code in GlusterFS. 
It's >> designed to repair data mismatches, but I'm not sure how it handles missing >> copies of data. However, in the current state, there's no way without >> external scripts to handle re-shaping of the storage bricks if part of them >> fails. > > Yeah, I haven't tried doing a scrub, parsing dmesg for busted file > paths, and feeding those paths into rm to see what happens. Will they > get deleted without additional errors? If so, good; then a second scrub > should come back clean. And then btrfs dev delete missing to get rid of the broken > device *and* cause missing metadata to be replicated again, and now in > theory the fs should be back to normal. But it'd have to be tested > with a umount followed by mount to see if -o degraded is still > required. > I'm not entirely certain. I had been planning to add a test for this to my usual testing before the system I use for it went offline, and I just haven't had the time to get it working again. If I find the time in the near future, I may just test it on my laptop in a VM. ^ permalink raw reply [flat|nested] 21+ messages in thread
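[Editorial note: the VM experiment Austin mentions could be sketched with loop devices rather than real drives. This is a hedged sketch under stated assumptions: it requires root and btrfs-progs, the backing-file and mountpoint paths are hypothetical, and it skips itself cleanly when the prerequisites are missing rather than touching any devices.]

```shell
#!/bin/sh
# Sketch only: needs root and btrfs-progs; skips cleanly otherwise.
if [ "$(id -u)" -eq 0 ] && command -v mkfs.btrfs >/dev/null 2>&1; then
    # Two sparse backing files stand in for the drives under test.
    truncate -s 1G /tmp/btrfs-img0 /tmp/btrfs-img1
    dev0=$(losetup -f --show /tmp/btrfs-img0)
    dev1=$(losetup -f --show /tmp/btrfs-img1)

    # Single-profile data, raid1 metadata -- the layout discussed above.
    mkfs.btrfs -f -d single -m raid1 "$dev0" "$dev1"
    mkdir -p /mnt/btrfs-test
    mount "$dev0" /mnt/btrfs-test

    # ... write test data, corrupt or detach one backing file, then
    # scrub in the foreground and inspect dmesg for csum failures.
    btrfs scrub start -B /mnt/btrfs-test

    umount /mnt/btrfs-test
    losetup -d "$dev0" "$dev1"
    rm -f /tmp/btrfs-img0 /tmp/btrfs-img1
    status="btrfs loop-device test ran"
else
    status="skipped: needs root and btrfs-progs"
fi
echo "$status"
```

The remount step from the thread (umount, then plain mount, checking whether -o degraded is still required) would slot in between the scrub and the teardown.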
end of thread, other threads:[~2016-04-28 11:21 UTC | newest] Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2016-04-22 20:36 Add device while rebalancing Juan Alberto Cirez 2016-04-23 5:38 ` Duncan 2016-04-25 11:18 ` Austin S. Hemmelgarn 2016-04-25 12:43 ` Duncan 2016-04-25 13:02 ` Austin S. Hemmelgarn 2016-04-26 10:50 ` Juan Alberto Cirez 2016-04-26 11:11 ` Austin S. Hemmelgarn 2016-04-26 11:44 ` Juan Alberto Cirez 2016-04-26 12:04 ` Austin S. Hemmelgarn 2016-04-26 12:14 ` Juan Alberto Cirez 2016-04-26 12:44 ` Austin S. Hemmelgarn 2016-04-27 0:58 ` Chris Murphy 2016-04-27 10:37 ` Duncan 2016-04-27 11:22 ` Austin S. Hemmelgarn 2016-04-27 15:58 ` Juan Alberto Cirez 2016-04-27 16:29 ` Holger Hoffstätte 2016-04-27 16:38 ` Juan Alberto Cirez 2016-04-27 16:40 ` Juan Alberto Cirez 2016-04-27 17:23 ` Holger Hoffstätte 2016-04-27 23:19 ` Chris Murphy 2016-04-28 11:21 ` Austin S. Hemmelgarn