* BTRFS as a GlusterFS storage back-end, and what I've learned from using it as such.
@ 2017-04-11 15:40 Austin S. Hemmelgarn
  2017-04-12  1:43 ` Ravishankar N
  2017-04-12  5:49 ` Qu Wenruo
  0 siblings, 2 replies; 7+ messages in thread
From: Austin S. Hemmelgarn @ 2017-04-11 15:40 UTC (permalink / raw)
  To: BTRFS ML

About a year ago now, I decided to set up a small storage cluster to 
store backups (and partially replace Dropbox for my usage, but that's a 
separate story).  I ended up using GlusterFS as the clustering software 
itself, and BTRFS as the back-end storage.

GlusterFS itself is actually a pretty easy workload as far as cluster 
software goes.  It does some processing prior to actually storing the 
data (a significant amount in fact), but the actual on-device storage on 
any given node is pretty simple.  You have the full directory structure 
for the whole volume, and whatever files happen to be on that node are 
located within that tree exactly like they are in the GlusterFS volume. 
Beyond the basic data, Gluster only stores 2-4 xattrs per file (used to 
track synchronization and for its internal data scrubbing), plus a 
directory called .glusterfs at the top of the back-end storage location 
for the volume, which contains the data required to figure out which 
node a file is on.  Overall, the access patterns mostly mirror whatever 
is using the Gluster volume, or are reduced to slow streaming writes 
(when writing files while the back-end nodes are computationally 
limited rather than I/O limited), with the addition of some serious 
metadata activity in the .glusterfs directory (lots of stat calls 
there, together with large numbers of small files).
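
As a concrete illustration, here is a rough sketch (Python 3 on Linux, 
run as root; the brick path is a placeholder, and the exact set of 
trusted.* xattrs depends on the volume type) that dumps the Gluster 
xattrs and the matching .glusterfs entry for one file on a brick:

#!/usr/bin/env python3
# Sketch: show the Gluster bookkeeping for one file on a brick.
# Assumes a normal replicate/distribute volume; run as root, since the
# trusted.* xattr namespace is not visible to ordinary users.
import os
import sys

brick = sys.argv[1]   # e.g. /bricks/gv0 (placeholder)
path = sys.argv[2]    # file path relative to the brick root
full = os.path.join(brick, path)

# Gluster keeps its per-file state in a handful of trusted.* xattrs
# (trusted.gfid, trusted.afr.*, and so on).
for name in os.listxattr(full):
    if name.startswith("trusted."):
        print(name, "=", os.getxattr(full, name).hex())

# trusted.gfid is a 16-byte UUID; .glusterfs/<aa>/<bb>/<uuid> on the
# brick links back to the same file, and is where a lot of the
# metadata traffic mentioned above ends up.
g = os.getxattr(full, "trusted.gfid").hex()
gfid = "-".join((g[0:8], g[8:12], g[12:16], g[16:20], g[20:32]))
print(".glusterfs entry:",
      os.path.join(brick, ".glusterfs", g[0:2], g[2:4], gfid))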

As far as overall performance goes, BTRFS is actually on par with both 
ext4 and XFS for this usage (at least on my hardware), and I actually 
see more SSD-friendly access patterns when using BTRFS in this case 
than with any other FS I tried.

After some serious experimentation with various configurations for this 
during the past few months, I've noticed a handful of other things:

1. The 'ssd' mount option does not actually improve performance on these 
SSDs.  To a certain extent this surprised me at first, but having seen 
Hans' e-mail and what he found about this option, it actually makes 
sense: erase blocks on these devices are 4MB, not 2MB, and the drives 
have a very good FTL (so they will aggregate all the little writes 
properly).

Given this, I'm beginning to wonder if it actually makes sense not to 
automatically enable this option on mount for certain types of storage 
(for example, most SATA and SAS SSDs have reasonably good FTLs, so I 
would expect them to behave similarly).  Extrapolating further, it 
might instead make sense to never enable it automatically, and to 
expose the value this option manipulates as a mount option, since there 
are other circumstances where setting specific values could improve 
performance (for example, on hardware RAID6, setting it to the stripe 
size would probably improve performance on many cheaper controllers).
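
If anyone wants to sanity-check this on their own hardware, a rough 
comparison only takes a few lines of scripting.  A minimal sketch 
(Python 3, run as root; /mnt/test is a placeholder for a scratch btrfs 
mount you can freely remount and fill with garbage):

#!/usr/bin/env python3
# Sketch: time a small-file write burst with the 'nossd' and 'ssd'
# mount options.  Only meant for relative comparison on scratch data.
import os
import subprocess
import time

MNT = "/mnt/test"   # placeholder: throwaway btrfs filesystem

def small_file_burst(base, ndirs=50, nfiles=200, size=4096):
    data = os.urandom(size)
    start = time.monotonic()
    for d in range(ndirs):
        dpath = os.path.join(base, "dir%d" % d)
        os.makedirs(dpath)
        for f in range(nfiles):
            with open(os.path.join(dpath, "f%d" % f), "wb") as fh:
                fh.write(data)
    subprocess.run(["sync"], check=True)
    return time.monotonic() - start

for opt in ("nossd", "ssd"):
    subprocess.run(["mount", "-o", "remount," + opt, MNT], check=True)
    workdir = os.path.join(MNT, "run-" + opt)
    print(opt, "%.2fs" % small_file_burst(workdir))
    subprocess.run(["rm", "-rf", workdir], check=True)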

2. Up to a certain point, running a single larger BTRFS volume with 
multiple sub-volumes is more computationally efficient than running 
multiple smaller BTRFS volumes.  More specifically, there is lower load 
on the system and lower CPU utilization by BTRFS itself without much 
noticeable difference in performance (in my tests it was about 0.5-1% 
performance difference, YMMV).  To a certain extent this makes some 
sense, but the turnover point was actually a lot higher than I expected 
(with this workload, the turnover point was around half a terabyte).

I believe this to be a side-effect of how we use per-filesystem 
worker-pools.  In essence, we can schedule parallel access better when 
it's all through the same worker pool than we can when using multiple 
worker pools.  Having realized this, I think it might be interesting to 
see if using a worker-pool per physical device (or at least what the 
system sees as a physical device) might make more sense in terms of 
performance than our current method of using a pool per-filesystem.
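
For anyone who wants to reproduce the comparison, the measurement 
itself is simple enough to sketch.  Something along these lines 
(Python 3; the mount points passed on the command line are 
placeholders: point it first at N subvolumes of one filesystem, then 
at N separate filesystems, and compare the totals):

#!/usr/bin/env python3
# Sketch: run a metadata-heavy churn in parallel against several
# directories and report wall time plus system-wide busy CPU time
# (from /proc/stat), which is where the difference between "one FS,
# many subvolumes" and "many small FSes" shows up.
import os
import sys
import time
from concurrent.futures import ThreadPoolExecutor

def churn(base, nfiles=5000):
    os.makedirs(base, exist_ok=True)
    for i in range(nfiles):
        with open(os.path.join(base, "f%d" % i), "wb") as fh:
            fh.write(b"x" * 512)
    for i in range(nfiles):
        os.stat(os.path.join(base, "f%d" % i))
    for i in range(nfiles):
        os.unlink(os.path.join(base, "f%d" % i))

def busy_jiffies():
    # first line of /proc/stat: cpu user nice system idle iowait irq ...
    vals = [int(x) for x in open("/proc/stat").readline().split()[1:]]
    return sum(vals) - vals[3] - vals[4]   # everything except idle+iowait

targets = sys.argv[1:]   # e.g. /mnt/big/sub0 /mnt/big/sub1 ... (placeholders)
cpu0, t0 = busy_jiffies(), time.monotonic()
with ThreadPoolExecutor(max_workers=len(targets)) as ex:
    list(ex.map(lambda t: churn(os.path.join(t, "churn")), targets))
print("wall %.1fs, busy CPU jiffies %d" % (time.monotonic() - t0,
                                           busy_jiffies() - cpu0))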

3. On these SSDs, running a single partition in dup mode is actually 
marginally more efficient than running 2 partitions in raid1 mode.  I 
was somewhat surprised by this, and I haven't been able to find a clear 
explanation as to why (I suspect caching may have something to do with 
it, but I'm not 100% certain about that), but some limited testing with 
other SSDs seems to indicate that this holds for most SSDs, with the 
difference being smaller on smaller and faster devices.  On a 
traditional hard disk, dup mode is significantly more efficient, but 
that's generally to be expected.
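
For reference, the two layouts being compared look roughly like this 
(a sketch; /dev/sdX1 and /dev/sdX2 are placeholders for two partitions 
on the same SSD, both commands are destructive, and data 'dup' needs a 
reasonably recent btrfs-progs):

#!/usr/bin/env python3
# Sketch: the two single-device layouts compared above.  Re-create and
# benchmark each one in turn; running both back to back as written
# just demonstrates the mkfs invocations.
import subprocess

DEV1, DEV2 = "/dev/sdX1", "/dev/sdX2"   # placeholders

# Layout A: one partition, data and metadata both in dup mode.
subprocess.run(["mkfs.btrfs", "-f", "-d", "dup", "-m", "dup", DEV1],
               check=True)

# Layout B: two partitions on the same disk, raid1 data and metadata.
subprocess.run(["mkfs.btrfs", "-f", "-d", "raid1", "-m", "raid1",
                DEV1, DEV2], check=True)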

4. Depending on other factors, compression can actually slow you down 
pretty significantly.  In the particular case where I saw this happen 
(all cores completely utilized by userspace software), LZO compression 
actually caused around 5-10% performance degradation compared to no 
compression.  This is somewhat obvious once it's explained, but it's 
not exactly intuitive, and as such it's probably worth documenting in 
the man pages that compression won't always make things better.  I may 
send a patch to add this at some point in the near future.
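
A back-of-the-envelope way to see when this flips, using made-up 
placeholder numbers rather than measurements, is to treat effective 
write throughput as whichever of the device or the compressor runs out 
of headroom first:

# Toy model only; every number below is a made-up placeholder.
def effective_mb_s(disk_bw, ratio, compress_speed, cpu_free):
    """disk_bw: device MB/s; ratio: compressed/original size;
    compress_speed: MB/s of input one idle core can compress;
    cpu_free: fraction of a core actually available."""
    disk_limit = disk_bw / ratio           # compression helps here
    cpu_limit = compress_speed * cpu_free  # and hurts here
    return min(disk_limit, cpu_limit)

# Idle box, spinning disk: ~120 MB/s uncompressed vs ~240 MB/s compressed.
print(effective_mb_s(disk_bw=120, ratio=0.5, compress_speed=400, cpu_free=1.0))

# Saturated box, fast SSD: ~500 MB/s uncompressed vs ~40 MB/s compressed.
print(effective_mb_s(disk_bw=500, ratio=0.5, compress_speed=400, cpu_free=0.1))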

* Re: BTRFS as a GlusterFS storage back-end, and what I've learned from using it as such.
  2017-04-11 15:40 BTRFS as a GlusterFS storage back-end, and what I've learned from using it as such Austin S. Hemmelgarn
@ 2017-04-12  1:43 ` Ravishankar N
  2017-04-12  5:49 ` Qu Wenruo
  1 sibling, 0 replies; 7+ messages in thread
From: Ravishankar N @ 2017-04-12  1:43 UTC (permalink / raw)
  To: Austin S. Hemmelgarn, BTRFS ML, gluster-users@gluster.org List

Adding gluster-users list. I think there are a few users out there 
running gluster on top of btrfs, so this might benefit a broader audience.

On 04/11/2017 09:10 PM, Austin S. Hemmelgarn wrote:
> [full quote of the original message trimmed]

* Re: BTRFS as a GlusterFS storage back-end, and what I've learned from using it as such.
  2017-04-11 15:40 BTRFS as a GlusterFS storage back-end, and what I've learned from using it as such Austin S. Hemmelgarn
  2017-04-12  1:43 ` Ravishankar N
@ 2017-04-12  5:49 ` Qu Wenruo
  2017-04-12  7:16   ` Sargun Dhillon
  2017-04-12 11:18   ` Austin S. Hemmelgarn
  1 sibling, 2 replies; 7+ messages in thread
From: Qu Wenruo @ 2017-04-12  5:49 UTC (permalink / raw)
  To: Austin S. Hemmelgarn, BTRFS ML



At 04/11/2017 11:40 PM, Austin S. Hemmelgarn wrote:
> About a year ago now, I decided to set up a small storage cluster to 
> store backups (and partially replace Dropbox for my usage, but that's a 
> separate story).  I ended up using GlusterFS as the clustering software 
> itself, and BTRFS as the back-end storage.
> 
> GlusterFS itself is actually a pretty easy workload as far as cluster 
> software goes.  It does some processing prior to actually storing the 
> data (a significant amount in fact), but the actual on-device storage on 
> any given node is pretty simple.  You have the full directory structure 
> for the whole volume, and whatever files happen to be on that node are 
> located within that tree exactly like they are in the GlusterFS volume. 
> Beyond the basic data, gluster only stores 2-4 xattrs per-file (which 
> are used to track synchronization, and also for it's internal data 
> scrubbing), and a directory called .glusterfs in the top of the back-end 
> storage location for the volume which contains the data required to 
> figure out which node a file is on.  Overall, the access patterns mostly 
> mirror whatever is using the Gluster volume, or are reduced to slow 
> streaming writes (when writing files and the back-end nodes are 
> computationally limited instead of I/O limited), with the addition of 
> some serious metadata operations in the .glusterfs directory (lots of 
> stat calls there, together with large numbers of small files).

Sharing real-world experience like this is always welcome.

> 
> As far as overall performance, BTRFS is actually on par for this usage 
> with both ext4 and XFS (at least, on my hardware it is), and I actually 
> see more SSD friendly access patterns when using BTRFS in this case than 
> any other FS I tried.

We also find that, for pure buffered read/write, btrfs is no worse than 
traditional fs.

In our PostgreSQL test, btrfs can even get a little better performance 
than ext4/xfs when handling DB files.

But using btrfs for the PostgreSQL write-ahead log (WAL) is another 
thing entirely: btrfs falls far behind ext4/xfs on HDD, delivering only 
about half the TPC performance for low-concurrency loads.

Due to CoW, btrfs incurs extra I/O for fsync. For example, fsyncing 
just 4K of data can cause 64K of metadata writes with the default mkfs 
options (one tree block for the log root tree and one for the log 
tree, multiplied by 2 for the default DUP metadata profile, at the 
default 16K nodesize).
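
As a worked version of that arithmetic (assuming the default 16K 
nodesize; adjust if mkfs used a different -n):

# Worked example of the fsync write amplification described above.
nodesize = 16 * 1024   # default mkfs.btrfs nodesize
log_blocks = 1 + 1     # one block for the log root tree, one for the log tree
dup_copies = 2         # default DUP metadata profile on a single device
data_bytes = 4 * 1024  # the 4K of data actually being fsynced

metadata_bytes = log_blocks * nodesize * dup_copies
print("%d KiB of metadata written for %d KiB of data"
      % (metadata_bytes // 1024, data_bytes // 1024))   # 64 KiB for 4 KiB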

> 
> After some serious experimentation with various configurations for this 
> during the past few months, I've noticed a handful of other things:
> 
> 1. The 'ssd' mount option does not actually improve performance on these 
> SSD's.  To a certain extent, this actually surprised me at first, but 
> having seen Hans' e-mail and what he found about this option, it 
> actually makes sense, since erase-blocks on these devices are 4MB, not 
> 2MB, and the drives have a very good FTL (so they will aggregate all the 
> little writes properly).
> 
> Given this, I'm beginning to wonder if it actually makes sense to not 
> automatically enable this on mount when dealing with certain types of 
> storage (for example, most SATA and SAS SSD's have reasonably good 
> FTL's, so I would expect them to have similar behavior).  Extrapolating 
> further, it might instead make sense to just never automatically enable 
> this, and expose the value this option is manipulating as a mount option 
> as there are other circumstances where setting specific values could 
> improve performance (for example, if you're on hardware RAID6, setting 
> this to the stripe size would probably improve performance on many 
> cheaper controllers).
> 
> 2. Up to a certain point, running a single larger BTRFS volume with 
> multiple sub-volumes is more computationally efficient than running 
> multiple smaller BTRFS volumes.  More specifically, there is lower load 
> on the system and lower CPU utilization by BTRFS itself without much 
> noticeable difference in performance (in my tests it was about 0.5-1% 
> performance difference, YMMV).  To a certain extent this makes some 
> sense, but the turnover point was actually a lot higher than I expected 
> (with this workload, the turnover point was around half a terabyte).

This seems to be related to tree locking overhead.

The most obvious mitigation is just what you described: use many small 
subvolumes rather than one large subvolume.

Another, less obvious, option is to reduce the tree block size at mkfs 
time.

Btrfs is simply not that good at handling metadata-heavy workloads yet, 
limited by both the overhead of mandatory metadata CoW and the current 
tree locking algorithm.
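
(For reference, the tree block size mentioned above is fixed at mkfs 
time; a sketch, with /dev/sdX as a placeholder scratch device:)

# Sketch: smaller tree blocks at mkfs time.  The nodesize cannot be
# changed later; 4K is the minimum on 4K-page systems, 16K the default.
import subprocess

subprocess.run(["mkfs.btrfs", "-f", "--nodesize", "4096", "/dev/sdX"],
               check=True)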

> 
> I believe this to be a side-effect of how we use per-filesystem 
> worker-pools.  In essence, we can schedule parallel access better when 
> it's all through the same worker pool than we can when using multiple 
> worker pools.  Having realized this, I think it might be interesting to 
> see if using a worker-pool per physical device (or at least what the 
> system sees as a physical device) might make more sense in terms of 
> performance than our current method of using a pool per-filesystem.
> 
> 3. On these SSD's, running a single partition in dup mode is actually 
> marginally more efficient than running 2 partitions in raid1 mode.  I 
> was actually somewhat surprised by this, and I haven't been able to find 
> a clear explanation as to why (I suspect caching may have something to 
> do with it, but I'm not 100% certain about that),  but some limited 
> testing with other SSD's seems to indicate that it's the case for most 
> SSD's, with the difference being smaller on smaller and faster devices. 
> On a traditional hard disk, it's significantly more efficient, but 
> that's generally to be expected.
> 
> 4. Depending on other factors, compression can actually slow you down 
> pretty significantly.  In the particular case I saw this happen (all 
> cores completely utilized by userspace software), LZO compression 
> actually caused around 5-10% performance degradation compared to no 
> compression.  This is somewhat obvious once it's explained, but it's not 
> exactly intuitive  and as such it's probably worth documenting in the 
> man pages that compression won't always make things better.  I may send 
> a patch to add this at some point in the near future.

This seems interesting.
Maybe it's the CPU limiting the performance?

Thanks,
Qu


* Re: BTRFS as a GlusterFS storage back-end, and what I've learned from using it as such.
  2017-04-12  5:49 ` Qu Wenruo
@ 2017-04-12  7:16   ` Sargun Dhillon
  2017-04-12 11:18   ` Austin S. Hemmelgarn
  1 sibling, 0 replies; 7+ messages in thread
From: Sargun Dhillon @ 2017-04-12  7:16 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Austin S. Hemmelgarn, BTRFS ML

Not to change the topic too much, but is there a suite of tracing 
scripts that one can attach to their btrfs installation to gather 
metrics about tree locking performance?  We see an awful lot of 
machines with a task waiting on btrfs_tree_lock, and a bunch of other 
tasks that are also in disk sleep waiting on btrfs.  We also see a 
bunch of hung task timeouts around btrfs_destroy_inode.  We're running 
kernel 4.8, so we can pretty easily plug BPF-based probes into the 
kernel to get this information and aggregate it.

Rather than doing this work ourselves, I'm wondering if anyone else 
has a good set of tools for collecting perf data about btrfs 
performance and lock contention?
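
(In case it helps: bcc's generic funclatency tool can already be 
pointed at btrfs_tree_lock, and rolling a dedicated probe is only a 
screenful of Python.  A sketch, assuming bcc is installed and 
btrfs_tree_lock is kprobe-able on the running kernel:)

#!/usr/bin/env python3
# Sketch: histogram of time spent inside btrfs_tree_lock(), per call,
# using bcc.  Roughly what /usr/share/bcc/tools/funclatency does.
from time import sleep
from bcc import BPF

prog = r"""
#include <uapi/linux/ptrace.h>

BPF_HASH(start, u32, u64);
BPF_HISTOGRAM(dist);

int trace_entry(struct pt_regs *ctx) {
    u32 tid = bpf_get_current_pid_tgid();
    u64 ts = bpf_ktime_get_ns();
    start.update(&tid, &ts);
    return 0;
}

int trace_return(struct pt_regs *ctx) {
    u32 tid = bpf_get_current_pid_tgid();
    u64 *tsp = start.lookup(&tid);
    if (tsp == 0)
        return 0;
    dist.increment(bpf_log2l((bpf_ktime_get_ns() - *tsp) / 1000));
    start.delete(&tid);
    return 0;
}
"""

b = BPF(text=prog)
b.attach_kprobe(event="btrfs_tree_lock", fn_name="trace_entry")
b.attach_kretprobe(event="btrfs_tree_lock", fn_name="trace_return")
print("Tracing btrfs_tree_lock()... hit Ctrl-C to print the histogram")
try:
    sleep(999999)
except KeyboardInterrupt:
    pass
b["dist"].print_log2_hist("usecs")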

On Tue, Apr 11, 2017 at 10:49 PM, Qu Wenruo <quwenruo@cn.fujitsu.com> wrote:
> [full quote of Qu's reply trimmed]

* Re: BTRFS as a GlusterFS storage back-end, and what I've learned from using it as such.
  2017-04-12  5:49 ` Qu Wenruo
  2017-04-12  7:16   ` Sargun Dhillon
@ 2017-04-12 11:18   ` Austin S. Hemmelgarn
  2017-04-12 22:48     ` Duncan
  1 sibling, 1 reply; 7+ messages in thread
From: Austin S. Hemmelgarn @ 2017-04-12 11:18 UTC (permalink / raw)
  To: Qu Wenruo, BTRFS ML

On 2017-04-12 01:49, Qu Wenruo wrote:
>
>
> At 04/11/2017 11:40 PM, Austin S. Hemmelgarn wrote:
>> About a year ago now, I decided to set up a small storage cluster to
>> store backups (and partially replace Dropbox for my usage, but that's
>> a separate story).  I ended up using GlusterFS as the clustering
>> software itself, and BTRFS as the back-end storage.
>>
>> GlusterFS itself is actually a pretty easy workload as far as cluster
>> software goes.  It does some processing prior to actually storing the
>> data (a significant amount in fact), but the actual on-device storage
>> on any given node is pretty simple.  You have the full directory
>> structure for the whole volume, and whatever files happen to be on
>> that node are located within that tree exactly like they are in the
>> GlusterFS volume. Beyond the basic data, gluster only stores 2-4
>> xattrs per-file (which are used to track synchronization, and also for
>> it's internal data scrubbing), and a directory called .glusterfs in
>> the top of the back-end storage location for the volume which contains
>> the data required to figure out which node a file is on.  Overall, the
>> access patterns mostly mirror whatever is using the Gluster volume, or
>> are reduced to slow streaming writes (when writing files and the
>> back-end nodes are computationally limited instead of I/O limited),
>> with the addition of some serious metadata operations in the
>> .glusterfs directory (lots of stat calls there, together with large
>> numbers of small files).
>
> Any real world experience is welcomed to share.
>
>>
>> As far as overall performance, BTRFS is actually on par for this usage
>> with both ext4 and XFS (at least, on my hardware it is), and I
>> actually see more SSD friendly access patterns when using BTRFS in
>> this case than any other FS I tried.
>
> We also find that, for pure buffered read/write, btrfs is no worse than
> traditional fs.
>
> In our PostgreSQL test, btrfs can even get a little better performance
> than ext4/xfs when handling DB files.
>
> But if using btrfs for PostgreSQL Write Ahead Log (WAL), then it's
> completely another thing.
> Btrfs falls far behind ext4/xfs on HDD, only half of the TPC performance
> for low concurrency load.
>
> Due to btrfs CoW, btrfs causes extra IO for fsync.
> For example, if only to fsync 4K data, btrfs can cause 64K metadata
> write for default mkfs options.
> (One tree block for log root tree, one tree block for log tree, multiple
> by 2 for default DUP profile)
>
>>
>> After some serious experimentation with various configurations for
>> this during the past few months, I've noticed a handful of other things:
>>
>> 1. The 'ssd' mount option does not actually improve performance on
>> these SSD's.  To a certain extent, this actually surprised me at
>> first, but having seen Hans' e-mail and what he found about this
>> option, it actually makes sense, since erase-blocks on these devices
>> are 4MB, not 2MB, and the drives have a very good FTL (so they will
>> aggregate all the little writes properly).
>>
>> Given this, I'm beginning to wonder if it actually makes sense to not
>> automatically enable this on mount when dealing with certain types of
>> storage (for example, most SATA and SAS SSD's have reasonably good
>> FTL's, so I would expect them to have similar behavior).
>> Extrapolating further, it might instead make sense to just never
>> automatically enable this, and expose the value this option is
>> manipulating as a mount option as there are other circumstances where
>> setting specific values could improve performance (for example, if
>> you're on hardware RAID6, setting this to the stripe size would
>> probably improve performance on many cheaper controllers).
>>
>> 2. Up to a certain point, running a single larger BTRFS volume with
>> multiple sub-volumes is more computationally efficient than running
>> multiple smaller BTRFS volumes.  More specifically, there is lower
>> load on the system and lower CPU utilization by BTRFS itself without
>> much noticeable difference in performance (in my tests it was about
>> 0.5-1% performance difference, YMMV).  To a certain extent this makes
>> some sense, but the turnover point was actually a lot higher than I
>> expected (with this workload, the turnover point was around half a
>> terabyte).
>
> This seems to be related to tree locking overhead.
My thought too, although I find it interesting that the benefit starts 
to disappear as the FS gets bigger beyond a certain point (on my system 
it was about half a terabyte).  I would expect that point to differ on 
systems with different numbers of CPU cores (differing levels of lock 
contention) or with different workloads (probably inversely 
proportional to the amount of metadata work the workload produces).
>
> The most obvious solution is just as you stated, use many small
> subvolumes other than one large subvolume.
>
> Another less obvious solution is to reduce tree block size at mkfs time.
>
> This Btrfs is not that good at handling metadata workload, limited by
> both the overhead of mandatory metadata CoW and current tree lock
> algorithm.
>
>>
>> I believe this to be a side-effect of how we use per-filesystem
>> worker-pools.  In essence, we can schedule parallel access better when
>> it's all through the same worker pool than we can when using multiple
>> worker pools.  Having realized this, I think it might be interesting
>> to see if using a worker-pool per physical device (or at least what
>> the system sees as a physical device) might make more sense in terms
>> of performance than our current method of using a pool per-filesystem.
>>
>> 3. On these SSD's, running a single partition in dup mode is actually
>> marginally more efficient than running 2 partitions in raid1 mode.  I
>> was actually somewhat surprised by this, and I haven't been able to
>> find a clear explanation as to why (I suspect caching may have
>> something to do with it, but I'm not 100% certain about that),  but
>> some limited testing with other SSD's seems to indicate that it's the
>> case for most SSD's, with the difference being smaller on smaller and
>> faster devices. On a traditional hard disk, it's significantly more
>> efficient, but that's generally to be expected.
>>
>> 4. Depending on other factors, compression can actually slow you down
>> pretty significantly.  In the particular case I saw this happen (all
>> cores completely utilized by userspace software), LZO compression
>> actually caused around 5-10% performance degradation compared to no
>> compression.  This is somewhat obvious once it's explained, but it's
>> not exactly intuitive  and as such it's probably worth documenting in
>> the man pages that compression won't always make things better.  I may
>> send a patch to add this at some point in the near future.
>
> This seems interesting.
> Maybe it's CPU limiting the performance?
In this case, I'm pretty certain that that's the cause.  I've only ever 
seen this happen though when the CPU was under either full or more than 
full load (so pretty much full utilization of all the cores), and it 
gets worse as the CPU load increases.

* Re: BTRFS as a GlusterFS storage back-end, and what I've learned from using it as such.
  2017-04-12 11:18   ` Austin S. Hemmelgarn
@ 2017-04-12 22:48     ` Duncan
  2017-04-13 11:33       ` Austin S. Hemmelgarn
  0 siblings, 1 reply; 7+ messages in thread
From: Duncan @ 2017-04-12 22:48 UTC (permalink / raw)
  To: linux-btrfs

Austin S. Hemmelgarn posted on Wed, 12 Apr 2017 07:18:44 -0400 as
excerpted:

> On 2017-04-12 01:49, Qu Wenruo wrote:
>>
>> At 04/11/2017 11:40 PM, Austin S. Hemmelgarn wrote:
>>>
>>> 4. Depending on other factors, compression can actually slow you down
>>> pretty significantly.  In the particular case I saw this happen (all
>>> cores completely utilized by userspace software), LZO compression
>>> actually caused around 5-10% performance degradation compared to no
>>> compression.  This is somewhat obvious once it's explained, but it's
>>> not exactly intuitive  and as such it's probably worth documenting in
>>> the man pages that compression won't always make things better.  I may
>>> send a patch to add this at some point in the near future.
>>
>> This seems interesting.
>> Maybe it's CPU limiting the performance?

> In this case, I'm pretty certain that that's the cause.  I've only ever
> seen this happen though when the CPU was under either full or more than
> full load (so pretty much full utilization of all the cores), and it
> gets worse as the CPU load increases.

This seems blatantly obvious to me, no explanation needed, at least 
assuming people understand what compression is and does.  It certainly 
doesn't seem btrfs specific to me.

Which makes me wonder if I'm missing something that would seem to 
counteract the obvious, but doesn't in this case.

Compression at its most basic can be described as a tradeoff of CPU 
cycles to decrease data size (by tracking and eliminating internal 
redundancy), and thus transfer time of the data.

In conditions where the bottleneck is (seek and) transfer time, as on hdds 
with mostly idle CPUs, compression therefore tends to be a pretty big 
performance boost because the lower size of the compressed data means 
fewer seeks and lower transfer time, and because that's where the 
bottleneck is, making it more efficient increases the performance of the 
entire thing.

But the context here is SSDs, with no seek time and fast transfer 
speeds, and already 100% utilized CPUs.  So the bottleneck is the 
saturated CPUs, and the extra CPU cycles needed for compression and 
decompression simply make that bottleneck worse.

So far from a mystery, this seems so basic to me that the simplest 
dunderhead should get it, at least as long as they aren't /so/ simple 
they can't understand the tradeoff inherent in the simplest compression 
basics.

But that's not the implication of the discussion quoted above, and the 
participants are both people I'd consider far more qualified to 
understand and deal with this sort of thing than I am.  So I /gotta/ be 
missing something: despite reaching the correct ultimate conclusion, I 
haven't reached it by a correct logic train, and there /must/ be some 
logic steps I've left out that would make this a rather less intuitive 
conclusion than I'm thinking.

So what am I missing?

Or is it simply that the tradeoff between CPU usage, data size, and 
minimum transfer time isn't as simple and basic for most people as I'm 
assuming here, so that it isn't obvious that compression gives more 
work to an already bottlenecked CPU, reducing performance when it /is/ 
the CPU that's bottlenecked?

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


* Re: BTRFS as a GlusterFS storage back-end, and what I've learned from using it as such.
  2017-04-12 22:48     ` Duncan
@ 2017-04-13 11:33       ` Austin S. Hemmelgarn
  0 siblings, 0 replies; 7+ messages in thread
From: Austin S. Hemmelgarn @ 2017-04-13 11:33 UTC (permalink / raw)
  To: linux-btrfs

On 2017-04-12 18:48, Duncan wrote:
> Austin S. Hemmelgarn posted on Wed, 12 Apr 2017 07:18:44 -0400 as
> excerpted:
>
> [nested quotes trimmed]
>
> This seems blatantly obvious to me, no explanation needed, at least
> assuming people understand what compression is and does.  It certainly
> doesn't seem btrfs specific to me.
>
> Which makes my wonder if I'm missing something that would seem to
> counteract the obvious, but doesn't in this case.
>
> Compression at its most basic can be described as a tradeoff of CPU
> cycles to decrease data size (by tracking and eliminating internal
> redundancy), and thus transfer time of the data.
>
> In conditions where the bottleneck is (seek and) transfer time, as on hdds
> with mostly idle CPUs, compression therefore tends to be a pretty big
> performance boost because the lower size of the compressed data means
> fewer seeks and lower transfer time, and because that's where the
> bottleneck is, making it more efficient increases the performance of the
> entire thing.
>
> But the context here is SSDs, with 0 seek time and fast transfer speeds,
> and already 100% utilized CPUs, so the bottleneck is the 100% utilized
> CPUs and the increased CPU cycles necessary for the compression/
> decompression simply increases the CPU bottleneck.
>
> So far from a mystery, this seems so basic to me that the simplest
> dunderhead should get it, at least as long as they aren't /so/ simple
> they can't understand the tradeoff inherent in the simplest compression
> basics.
>
> But that's not the implication of the discussion quoted above, and the
> participants are both what I'd consider far more qualified to understand
> and deal with this sort of thing than I, so I /gotta/ be missing
> something that despite my correct ultimate conclusion, means I haven't
> reached it using a correct logic train, and that there /must/ be some
> logic steps in there that I've left out that would intuitively switch the
> logic, making this a rather less intuitive conclusion than I'm thinking.
>
> So what am I missing?
>
> Or is it simply that the tradeoff between CPU usage and data size and
> minimum transit time isn't as simple and basic for most people as I'm
> assuming here, such that it isn't obviously giving more work to an
> already bottlenecked CPU, reducing the performance when it /is/ the CPU
> that's bottlenecked?
>
There's also CPU overhead in transferring the data.  Normally this 
isn't big, but when you're talking about workloads that manage full 
bandwidth utilization on the SSDs, it has some impact, especially in 
this case, since these are AHCI-based SATA controllers (not quite as 
bad as IDE, but still far more overhead than SAS or even parallel 
SCSI).

The other thing, though, is that I see this when dealing with 
traditional hard disks too, and the increasing impact as CPU load 
rises doesn't match what I would expect after factoring in both the 
increased scheduling overhead and the decreased runtime.  Both of 
those lead me to believe that BTRFS is doing something less 
efficiently than it could here, but I'm not sure what.
