* btrfs-progs 4.4 re-balance of RAID6 is very slow / limited to one cpu core?
From: Christian Rohmann @ 2016-01-22 13:38 UTC
  To: linux-btrfs

Hello btrfs-folks,

I am currently doing a big "btrfs balance" to extend an 8-drive RAID6
to 12 drives using
 "btrfs balance start -dstripes 1..11 -mstripes 1..11"

With kernel 4.4 and btrfs-progs 4.4 it has been running fine for a few
days now and the new disks are slowly getting more and more extents.
But the process is VERY slow (3% in 3 days) and there is almost no
additional disk utilization.

The process doing the balance is using 100% CPU (one core), so
apparently the whole thing is single-threaded and therefore CPU-bound
in this case.

Is this a known issue, or is there anything I can do to speed this up?
The disks have plenty of IOPS left to work with and the box has many
more CPU cores idling away.
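
For reference, this is roughly how one can confirm the single-core
saturation (a sketch, assuming the sysstat tools are installed and that
the balance work is accounted to the foreground "btrfs" process):

# pidstat -t -p $(pidof btrfs) 5 3   ## per-thread CPU usage, 3 samples of 5s
# mpstat -P ALL 5 3                  ## per-core view: one core pegged, rest idle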



Regards


Christian



* Re: btrfs-progs 4.4 re-balance of RAID6 is very slow / limited to one cpu core?
From: Duncan @ 2016-01-22 14:51 UTC
  To: linux-btrfs

Christian Rohmann posted on Fri, 22 Jan 2016 14:38:11 +0100 as excerpted:

> I am currently doing a big "btrfs balance" to extend an 8-drive RAID6
> to 12 drives using
>  "btrfs balance start -dstripes 1..11 -mstripes 1..11"
> 
> With kernel 4.4 and btrfs-progs 4.4 it has been running fine for a few
> days now and the new disks are slowly getting more and more extents.
> But the process is VERY slow (3% in 3 days) and there is almost no
> additional disk utilization.
> 
> The process doing the balance is using 100% CPU (one core), so
> apparently the whole thing is single-threaded and therefore CPU-bound
> in this case.
> 
> Is this a known issue, or is there anything I can do to speed this up?
> The disks have plenty of IOPS left to work with and the box has many
> more CPU cores idling away.

[This is only intended to be a stop-gap reply, until someone with more 
detailed/direct knowledge/experience on the topic can reply.]

My own use-case is btrfs raid1, but from what I've seen on the list, 
raid56-mode maintenance that involves recalculating parity, as 
converting from an 8-device stripe to a 12-device stripe does, is 
indeed /very/ slow.

I didn't know it was single-core limited, however.  If it's slow, 
complex calculations, AND limited to a single core, plus given the 
likely size of a filesystem of 8-12 devices in the day of multi-TB 
devices... at ~1%/day, that's 100 days to complete... Ouch, that's 
going to be painful!

The good thing is that it happens online, so you can be using the 
filesystem and the other cores while it's happening.  Plus, balances 
are interruptible.  You can reboot or whatever and it should pick up 
and continue where it left off.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



* Re: btrfs-progs 4.4 re-balance of RAID6 is very slow / limited to one cpu core?
From: Henk Slager @ 2016-01-24  2:30 UTC
  To: linux-btrfs

On Fri, Jan 22, 2016 at 2:38 PM, Christian Rohmann
<crohmann@netcologne.de> wrote:
> Hello btrfs-folks,
>
> I am currently doing a big "btrfs balance" to extend an 8-drive RAID6
> to 12 drives using
>  "btrfs balance start -dstripes 1..11 -mstripes 1..11"

I am not sure why you use/need the stripes filter here; in fact I think
you want a full balance. If you cancel at some point during the ongoing
balance and later want to continue, the filter might be useful to avoid
redoing the already-balanced chunks; maybe that is the idea.
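
For what it's worth, balance also has built-in pause/resume that keeps
its own progress state, so the filter shouldn't be needed just for
resumability; a sketch, with /mnt standing in for the actual mount
point:

# btrfs balance pause /mnt    ## interrupt the running balance
# btrfs balance status /mnt   ## show how many chunks are done
# btrfs balance resume /mnt   ## continue where it left off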

> With kernel 4.4 and btrfs-progs 4.4 it has been running fine for a few
> days now and the new disks are slowly getting more and more extents.
> But the process is VERY slow (3% in 3 days) and there is almost no
> additional disk utilization.
>
> The process doing the balance is using 100% CPU (one core), so
> apparently the whole thing is single-threaded and therefore CPU-bound
> in this case.
>
> Is this a known issue, or is there anything I can do to speed this up?
> The disks have plenty of IOPS left to work with and the box has many
> more CPU cores idling away.

I have been using raid5 with kernels 3.11..4.1.6 and through several
disk swaps (add command, delete command, dd, but not the replace
command).

Before the raid5 functionality was complete in the kernel, low-level
operations were OK w.r.t. speed (like a raid0) as far as I remember.
With later kernels the operations were very slow, with very high CPU
load. It was single-core at times (3.x kernels, I believe), but also
multicore and still slow. In fact so slow that Samba gave up and the
filesystem/server was simply unusable for hours, days, or weeks.

One reason was that I wanted 4x 4TB disks and was halfway through that
upgrade (2x 2TB + 2x 4TB). As balances were crashing and very slow,
btrfs was using the 4x 2TB of space for 'normal' 4-device raid5 (3x
data + parity), but for the second half of the 4TB disks just data +
parity. The 'normal' raid5 involving the 2TB disks was very slow, with
high fragmentation etc.

So my experience is: yes, it is, or can be, slow, very slow. Scrub is
also roughly 10x slower (with 4.3.x kernels at least) than it should
be. A likely reason is that readahead for raid56 is currently not
working (see patches on the list) for some operations, though not for
all, AFAIU. If you use iostat you will get an idea of the speed. It
might also be that there are 512- vs. 4096-byte sector-size effects,
but this is just speculation.
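
A sketch of the kind of iostat invocation I mean (assuming sysstat is
installed; adjust the device list to the actual array members):

# iostat -dxm 3 /dev/sd[a-l]   ## per-device MB/s, await and %util every 3s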

It might be that just a full balance, with no filters, runs faster; you
could try that. Otherwise I wouldn't know how to speed it up; hopefully
the fs is still usable while balancing.


* Re: btrfs-progs 4.4 re-balance of RAID6 is very slow / limited to one cpu core?
From: Christian Rohmann @ 2016-01-25 11:34 UTC
  To: Henk Slager, linux-btrfs

Hey there Henk, btrfs-enthusiasts,


On 01/24/2016 03:30 AM, Henk Slager wrote:
> It might be that just a full balance, with no filters, runs faster; you
> could try that. Otherwise I wouldn't know how to speed it up; hopefully
> the fs is still usable while balancing.

Yes, the FS is still usable; Munin shows just a little increase in IOPS
and disk latency. The filter should not affect the performance of the
balance at all. I am simply telling it to only consider chunks which
are not yet spread across all disks. Finding out a chunk's data
distribution should not add any burden to the balancing.

The balancing is still VERY VERY slow; we still have 93% left to
balance. But since I did not hit any hardware limit (CPU or disk IO), I
am confident in saying btrfs-balance is buggy in this regard. CPU
single-thread performance will not explode anytime soon, but disks (or
SSDs) will keep growing in size, and so will their potential IOPS.

Growing an array from 8 to 12 disks is not something crazy that has
never been done before on a storage array either ;-)



Regards

Christian


* Re: btrfs-progs 4.4 re-balance of RAID6 is very slow / limited to one cpu core?
From: Chris Murphy @ 2016-01-25 22:13 UTC
  To: Christian Rohmann; +Cc: Henk Slager, linux-btrfs

On Mon, Jan 25, 2016 at 4:34 AM, Christian Rohmann
<crohmann@netcologne.de> wrote:
> Hey there Henk, btrfs-enthusiasts,
>
>
> On 01/24/2016 03:30 AM, Henk Slager wrote:
>> It might be that just a full balance, with no filters, runs faster; you
>> could try that. Otherwise I wouldn't know how to speed it up; hopefully
>> the fs is still usable while balancing.
>
> Yes, the FS is still usable; Munin shows just a little increase in IOPS
> and disk latency. The filter should not affect the performance of the
> balance at all. I am simply telling it to only consider chunks which
> are not yet spread across all disks. Finding out a chunk's data
> distribution should not add any burden to the balancing.
>
> The balancing is still VERY VERY slow; we still have 93% left to
> balance. But since I did not hit any hardware limit (CPU or disk IO), I
> am confident in saying btrfs-balance is buggy in this regard. CPU
> single-thread performance will not explode anytime soon, but disks (or
> SSDs) will keep growing in size, and so will their potential IOPS.
>
> Growing an array from 8 to 12 disks is not something crazy that has
> never been done before on a storage array either ;-)

Does anyone suspect a kernel regression here? I wonder if it's worth
suggesting testing the current version of all fairly recent kernels:
4.5-rc1, 4.4, 4.3.4, 4.2.8, 4.1.16? I think going farther back to
3.18.x isn't worth it, since that predates the major raid56 work. Quite
a while ago I did a raid56 rebuild and balance that was pretty fast,
but it was only a 4- or 5-device test.

-- 
Chris Murphy


* Fwd: btrfs-progs 4.4 re-balance of RAID6 is very slow / limited to one cpu core?
From: Justin Brown @ 2016-01-26  0:44 UTC
  To: linux-btrfs

> Does anyone suspect a kernel regression here? I wonder if it's worth
> suggesting testing the current version of all fairly recent kernels:
> 4.5-rc1, 4.4, 4.3.4, 4.2.8, 4.1.16?

I don't have any useful information about parity RAID modes or large
arrays, so this might be totally useless. Nonetheless, just last week
I added a 2TB drive to an existing Btrfs raid10 array (5x 2TB before
the addition) and did a balance afterwards. I didn't take any numbers,
but I was frequently looking at htop and iotop. I thought the numbers
were extremely good: 100-120MB/s sustained for each drive, with the
"total" reported by iotop exceeding 600MB/s. That's with the
integrated SATA controller on an Intel Z97 mini-ITX motherboard
(i7-4770 CPU). Significantly faster than anticipated. I started it one
evening, and it was finished when I awoke the next morning.

That was on 4.2.8-300.fc23.x86_64 with btrfs-progs 4.3.1.


* Re: btrfs-progs 4.4 re-balance of RAID6 is very slow / limited to one cpu core?
From: Chris Murphy @ 2016-01-26  5:17 UTC
  To: linux-btrfs

On Mon, Jan 25, 2016 at 5:44 PM, Justin Brown <justin.brown@fandingo.org> wrote:
>> Does anyone suspect a kernel regression here? I wonder if it's worth
>> suggesting testing the current version of all fairly recent kernels:
>> 4.5-rc1, 4.4, 4.3.4, 4.2.8, 4.1.16?
>
> I don't have any useful information about parity RAID modes or large
> arrays, so this might be totally useless. Nonetheless, just last week
> I added a 2TB drive to an existing Btrfs raid10 array (5x 2TB before
> the addition) and did a balance afterwards. I didn't take any numbers,
> but I was frequently looking at htop and iotop. I thought the numbers
> were extremely good: 100-120MB/s sustained for each drive, with the
> "total" reported by iotop exceeding 600MB/s. That's with the
> integrated SATA controller on an Intel Z97 mini-ITX motherboard
> (i7-4770 CPU). Significantly faster than anticipated. I started it one
> evening, and it was finished when I awoke the next morning.
>
> That was on 4.2.8-300.fc23.x86_64 with btrfs-progs 4.3.1.

That's been my experience also with raid0 and raid10. Because P+Q
computation is more expensive with raid6, this may need specific
testing with a raid6. If Christian can successfully cancel the balance,
umount, then reboot into another kernel version and retry, it might be
useful in tracking down the problem (or someone else willing to test
could do so). I'd do it, but I don't have enough drive space at the
moment to do it with anything other than a VM and qcow2 files on a
single SSD, although that should at least come close to saturating the
SSD. Even so, it would still be faster than what Christian is
reporting.


-- 
Chris Murphy


* Re: btrfs-progs 4.4 re-balance of RAID6 is very slow / limited to one cpu core?
From: Chris Murphy @ 2016-01-26  6:14 UTC
  To: Christian Rohmann; +Cc: linux-btrfs

1495MiB used, raid6, in a VM using 4x qcow2 files on an SSD.
The host is using kernel 4.4.0.
The guest is using kernel 4.5.0rc0.git9.1.fc24 (this is a Fedora
Rawhide debug kernel, so it'll be a bit slower), btrfs-progs 4.3.1.

Not degraded, the balance takes 11 seconds; that's ~136MiB/s.
iotop isn't consistent; the max is ~300MiB/s write.

Rebooting with 1 device missing and a new empty qcow2 in its place,
mounting degraded, 'btrfs replace' takes 13 seconds according to
'btrfs replace status'.

I don't know that this is very useful information, though.

Christian, what are you getting for 'iotop -d3 -o' or 'iostat -d3'? Is
it consistent or is it fluctuating all over the place? What sort of
eyeball avg/min/max are you getting?


Chris Murphy


* Re: btrfs-progs 4.4 re-balance of RAID6 is very slow / limited to one cpu core?
From: Christian Rohmann @ 2016-01-26  8:54 UTC
  To: Chris Murphy; +Cc: linux-btrfs

Hey Chris and all,

On 01/25/2016 11:13 PM, Chris Murphy wrote:
> Does anyone suspect a kernel regression here? I wonder if it's worth
> suggesting testing the current version of all fairly recent kernels:
> 4.5-rc1, 4.4, 4.3.4, 4.2.8, 4.1.16? I think going farther back to
> 3.18.x isn't worth it, since that predates the major raid56 work. Quite
> a while ago I did a raid56 rebuild and balance that was pretty fast,
> but it was only a 4- or 5-device test.

The problem is that this balance did not work before going to the 4.4
kernel; it was simply crashing after about an hour or two of runtime.

Currently I am on kernel 4.4 + btrfs-progs 4.4, so apart from 4.5-rc1 I
cannot get any more bleeding edge.

I am happy to try 4.5, but not rc1, as there are already some bugs
popping up regarding the btrfs changes.


On 01/26/2016 07:14 AM, Chris Murphy wrote:
> Christian, what are you getting for 'iotop -d3 -o' or 'iostat -d3'? Is
> it consistent or is it fluctuating all over the place? What sort of
> eyeball avg/min/max are you getting?

"1672.81 K/s 1672.81 K/s  0.00 %  6.99 % btrfs balance start -dstripes
1..11 -mstripes 1..11 "

but it's jumping up to 25MB/s for a few polls, but most of the time it's
at 1.3 to 1.7 MB/s


You may check out more the various munin graphs of the box if you like:
 * http://mirror.netcologne.de/munin

has all the goods.
This also brings me to mention, that the disks (it's 12 disks though!)
read somewhere between 20 and 60MB/s constantly.




Regards

Christian






* Re: btrfs-progs 4.4 re-balance of RAID6 is very slow / limited to one cpu core?
From: Chris Murphy @ 2016-01-26 19:26 UTC
  To: Christian Rohmann; +Cc: Chris Murphy, linux-btrfs

On Tue, Jan 26, 2016 at 1:54 AM, Christian Rohmann
<crohmann@netcologne.de> wrote:
> Hey Chris and all,
>
> On 01/25/2016 11:13 PM, Chris Murphy wrote:
>> Does anyone suspect a kernel regression here? I wonder if its worth it
>> to suggest testing the current version of all fairly recent kernels:
>> 4.5.rc1, 4.4, 4.3.4, 4.2.8, 4.1.16? I think going farther back to
>> 3.18.x isn't worth it since that's before the major work since raid56
>> was added. Quite a while ago I've done a raid56 rebuild and balance
>> that was pretty fast but it was only a 4 or 5 device test.
>
> Problem is that this balance did not work before going to 4.4 kernel,
> it's was simply crashing after about an hour or two of runtime.
>
> Currently I am using 4.4 kernel + btrfs-progs, so apart from 4.5rc1 I
> can not get any more bleeding edge.
>
> 4.5 I am happy to try, but not RC1 as there are already some bugs
> popping up regarding the BTRFS changes.
>
>
> On 01/26/2016 07:14 AM, Chris Murphy wrote:
>> Christian, what are you getting for 'iotop -d3 -o' or 'iostat -d3'. Is
>> it consistent or is it fluctuating all over the place? What sort of
>> eyeball avg/min/max are you getting?
>
> "1672.81 K/s 1672.81 K/s  0.00 %  6.99 % btrfs balance start -dstripes
> 1..11 -mstripes 1..11 "
>
> but it's jumping up to 25MB/s for a few polls, but most of the time it's
> at 1.3 to 1.7 MB/s


That is really slow. The fact that you couldn't balance without
crashing prior to the 4.4 kernel makes me suspicious about the state of
the file system. What about reading and writing files, what's the
performance in that case? Is it just the balance that's this slow? Do
you have the call traces for the older kernels' crashes during balance?
Which btrfs-progs was used to create the raid6 volume?

Maybe the slowness is due to the -dstripes/-mstripes filters. Those are
relatively new, and I didn't try them. I also don't really understand
the values you picked. It seems to me that if you've added four drives
relatively recently, there won't be many chunks using 12-strip stripes;
most of them will be 8-strip stripes. So I don't really know what
you're limiting.
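
If you want to see what the filter can actually match, a hedged sketch:
the chunk tree (tree 3) dump lists a num_stripes value for every chunk,
though the exact output format depends on the progs version; /dev/sda
stands in for any member device.

# btrfs-debug-tree -t 3 /dev/sda | grep -c 'num_stripes 12'  ## chunks already on all 12
# btrfs-debug-tree -t 3 /dev/sda | grep -c 'num_stripes 8'   ## chunks still on the old 8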


-- 
Chris Murphy


* Re: btrfs-progs 4.4 re-balance of RAID6 is very slow / limited to one cpu core?
From: Chris Murphy @ 2016-01-26 19:27 UTC
  To: Chris Murphy; +Cc: Christian Rohmann, linux-btrfs

On Tue, Jan 26, 2016 at 12:26 PM, Chris Murphy <lists@colorremedies.com> wrote:
> On Tue, Jan 26, 2016 at 1:54 AM, Christian Rohmann
> <crohmann@netcologne.de> wrote:
>> Hey Chris and all,
>>
>> On 01/25/2016 11:13 PM, Chris Murphy wrote:
>>> Does anyone suspect a kernel regression here? I wonder if it's worth
>>> suggesting testing the current version of all fairly recent kernels:
>>> 4.5-rc1, 4.4, 4.3.4, 4.2.8, 4.1.16? I think going farther back to
>>> 3.18.x isn't worth it, since that predates the major raid56 work.
>>> Quite a while ago I did a raid56 rebuild and balance that was pretty
>>> fast, but it was only a 4- or 5-device test.
>>
>> The problem is that this balance did not work before going to the 4.4
>> kernel; it was simply crashing after about an hour or two of runtime.
>>
>> Currently I am on kernel 4.4 + btrfs-progs 4.4, so apart from 4.5-rc1 I
>> cannot get any more bleeding edge.
>>
>> I am happy to try 4.5, but not rc1, as there are already some bugs
>> popping up regarding the btrfs changes.
>>
>>
>> On 01/26/2016 07:14 AM, Chris Murphy wrote:
>>> Christian, what are you getting for 'iotop -d3 -o' or 'iostat -d3'? Is
>>> it consistent or is it fluctuating all over the place? What sort of
>>> eyeball avg/min/max are you getting?
>>
>> "1672.81 K/s 1672.81 K/s  0.00 %  6.99 % btrfs balance start -dstripes
>> 1..11 -mstripes 1..11 "
>>
>> It jumps up to 25MB/s for a few polls, but most of the time it's at 1.3
>> to 1.7 MB/s.
>
>
> That is really slow. The fact that you couldn't balance without
> crashing prior to the 4.4 kernel makes me suspicious about the state of
> the file system. What about reading and writing files, what's the
> performance in that case? Is it just the balance that's this slow? Do
> you have the call traces for the older kernels' crashes during balance?
> Which btrfs-progs was used to create the raid6 volume?
>
> Maybe the slowness is due to the -dstripes/-mstripes filters. Those are
> relatively new, and I didn't try them. I also don't really understand
> the values you picked. It seems to me that if you've added four drives
> relatively recently, there won't be many chunks using 12-strip stripes;
> most of them will be 8-strip stripes. So I don't really know what
> you're limiting.

I guess the bottom line of what I'm suggesting, before trying anything
else, is to stop the balance, start a normal one without filters, and
see how that performs.
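
In commands, that would be something like this sketch (/mnt is a
placeholder for the actual mount point):

# btrfs balance cancel /mnt   ## stop the filtered balance
# btrfs balance start /mnt    ## full balance, no filters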


-- 
Chris Murphy


* Re: btrfs-progs 4.4 re-balance of RAID6 is very slow / limited to one cpu core?
From: Austin S. Hemmelgarn @ 2016-01-26 19:57 UTC
  To: Chris Murphy, Christian Rohmann; +Cc: linux-btrfs

On 2016-01-26 14:26, Chris Murphy wrote:
> On Tue, Jan 26, 2016 at 1:54 AM, Christian Rohmann
> <crohmann@netcologne.de> wrote:
>> Hey Chris and all,
>>
>> On 01/25/2016 11:13 PM, Chris Murphy wrote:
>>> Does anyone suspect a kernel regression here? I wonder if it's worth
>>> suggesting testing the current version of all fairly recent kernels:
>>> 4.5-rc1, 4.4, 4.3.4, 4.2.8, 4.1.16? I think going farther back to
>>> 3.18.x isn't worth it, since that predates the major raid56 work.
>>> Quite a while ago I did a raid56 rebuild and balance that was pretty
>>> fast, but it was only a 4- or 5-device test.
>>
>> The problem is that this balance did not work before going to the 4.4
>> kernel; it was simply crashing after about an hour or two of runtime.
>>
>> Currently I am on kernel 4.4 + btrfs-progs 4.4, so apart from 4.5-rc1 I
>> cannot get any more bleeding edge.
>>
>> I am happy to try 4.5, but not rc1, as there are already some bugs
>> popping up regarding the btrfs changes.
>>
>>
>> On 01/26/2016 07:14 AM, Chris Murphy wrote:
>>> Christian, what are you getting for 'iotop -d3 -o' or 'iostat -d3'? Is
>>> it consistent or is it fluctuating all over the place? What sort of
>>> eyeball avg/min/max are you getting?
>>
>> "1672.81 K/s 1672.81 K/s  0.00 %  6.99 % btrfs balance start -dstripes
>> 1..11 -mstripes 1..11 "
>>
>> It jumps up to 25MB/s for a few polls, but most of the time it's at 1.3
>> to 1.7 MB/s.
>
>
> That is really slow. The fact that you couldn't balance without
> crashing prior to the 4.4 kernel makes me suspicious about the state of
> the file system. What about reading and writing files, what's the
> performance in that case? Is it just the balance that's this slow? Do
> you have the call traces for the older kernels' crashes during balance?
> Which btrfs-progs was used to create the raid6 volume?
>
> Maybe the slowness is due to the -dstripes/-mstripes filters. Those are
> relatively new, and I didn't try them. I also don't really understand
> the values you picked. It seems to me that if you've added four drives
> relatively recently, there won't be many chunks using 12-strip stripes;
> most of them will be 8-strip stripes. So I don't really know what
> you're limiting.
>
The filters he used are telling balance to re-stripe anything spanning 
less than 12 devices.  So, in essence, it's only going to re-stripe the 
chunks from before the four new disks were added.
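
In command form, that is the equivalent of something like this sketch
(/mnt is a placeholder): chunks whose stripe count is already 12 fall
outside the 1..11 range and are skipped.

# btrfs balance start -dstripes=1..11 -mstripes=1..11 /mnt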



* Re: btrfs-progs 4.4 re-balance of RAID6 is very slow / limited to one cpu core?
From: Chris Murphy @ 2016-01-26 20:20 UTC
  To: Austin S. Hemmelgarn; +Cc: Chris Murphy, Christian Rohmann, linux-btrfs

On Tue, Jan 26, 2016 at 12:57 PM, Austin S. Hemmelgarn
<ahferroin7@gmail.com> wrote:
> On 2016-01-26 14:26, Chris Murphy wrote:
>>
>> On Tue, Jan 26, 2016 at 1:54 AM, Christian Rohmann
>> <crohmann@netcologne.de> wrote:
>>>
>>> Hey Chris and all,
>>>
>>> On 01/25/2016 11:13 PM, Chris Murphy wrote:
>>>>
>>>> Does anyone suspect a kernel regression here? I wonder if it's worth
>>>> suggesting testing the current version of all fairly recent kernels:
>>>> 4.5-rc1, 4.4, 4.3.4, 4.2.8, 4.1.16? I think going farther back to
>>>> 3.18.x isn't worth it, since that predates the major raid56 work.
>>>> Quite a while ago I did a raid56 rebuild and balance that was pretty
>>>> fast, but it was only a 4- or 5-device test.
>>>
>>>
>>> The problem is that this balance did not work before going to the 4.4
>>> kernel; it was simply crashing after about an hour or two of runtime.
>>>
>>> Currently I am on kernel 4.4 + btrfs-progs 4.4, so apart from 4.5-rc1 I
>>> cannot get any more bleeding edge.
>>>
>>> I am happy to try 4.5, but not rc1, as there are already some bugs
>>> popping up regarding the btrfs changes.
>>>
>>>
>>> On 01/26/2016 07:14 AM, Chris Murphy wrote:
>>>>
>>>> Christian, what are you getting for 'iotop -d3 -o' or 'iostat -d3'? Is
>>>> it consistent or is it fluctuating all over the place? What sort of
>>>> eyeball avg/min/max are you getting?
>>>
>>>
>>> "1672.81 K/s 1672.81 K/s  0.00 %  6.99 % btrfs balance start -dstripes
>>> 1..11 -mstripes 1..11 "
>>>
>>> It jumps up to 25MB/s for a few polls, but most of the time it's at 1.3
>>> to 1.7 MB/s.
>>
>>
>>
>> That is really slow. The fact that you couldn't balance without
>> crashing prior to the 4.4 kernel makes me suspicious about the state of
>> the file system. What about reading and writing files, what's the
>> performance in that case? Is it just the balance that's this slow? Do
>> you have the call traces for the older kernels' crashes during balance?
>> Which btrfs-progs was used to create the raid6 volume?
>>
>> Maybe the slowness is due to the -dstripes/-mstripes filters. Those are
>> relatively new, and I didn't try them. I also don't really understand
>> the values you picked. It seems to me that if you've added four drives
>> relatively recently, there won't be many chunks using 12-strip stripes;
>> most of them will be 8-strip stripes. So I don't really know what
>> you're limiting.
>>
> The filters he used are telling balance to re-stripe anything spanning
> less than 12 devices.  So, in essence, it's only going to re-stripe the
> chunks from before the four new disks were added.

Which is most of what's on the volume unless the 4 disks were added
and used for a while, but I can't tell what the time frame is. Anyway,
it seems reasonable to try a balance without the filters to see if
that's a factor, because those filters are brand new in btrfs-progs
4.4. Granted, I'd expect they've been tested by upstream developers,
but I don't know if there's an fstest for balance with these specific
filters yet.



-- 
Chris Murphy


* Re: btrfs-progs 4.4 re-balance of RAID6 is very slow / limited to one cpu core?
From: Christian Rohmann @ 2016-01-27  8:48 UTC
  To: Chris Murphy, Austin S. Hemmelgarn; +Cc: linux-btrfs



On 01/26/2016 09:20 PM, Chris Murphy wrote:
> Anyway,
> it seems reasonable to try a balance without the filters to see if
> that's a factor, because those filters are brand new in btrfs-progs
> 4.4. Granted, I'd expect they've been tested by upstream developers,
> but I don't know if there's an fstest for balance with these specific
> filters yet.

I have another box with an 8-disk RAID6 on which I simply did a
balance, with no newly added drives. Same issue ... a VERY slowly
running balance, with IO nowhere near 100% utilization and many, many
days of runtime to finish.


Regards

Christian


* Re: btrfs-progs 4.4 re-balance of RAID6 is very slow / limited to one cpu core?
From: Austin S. Hemmelgarn @ 2016-01-27 16:34 UTC
  To: Christian Rohmann, Chris Murphy; +Cc: linux-btrfs

On 2016-01-27 03:48, Christian Rohmann wrote:
>
>
> On 01/26/2016 09:20 PM, Chris Murphy wrote:
>> Anyway,
>> it seems reasonable to try a balance without the filters to see if
>> that's a factor, because those filters are brand new in btrfs-progs
>> 4.4. Granted, I'd expect they've been tested by upstream developers,
>> but I don't know if there's an fstest for balance with these specific
>> filters yet.
>
> I have another box with an 8-disk RAID6 on which I simply did a
> balance, with no newly added drives. Same issue ... a VERY slowly
> running balance, with IO nowhere near 100% utilization and many, many
> days of runtime to finish.
Hmm, I did some automated testing in a couple of VMs last night, and I 
have to agree, this _really_ needs to get optimized.  Using the same 
data set on otherwise identical VMs, I saw an average 28x slowdown 
(best case was 16x, worst was almost 100x) for balancing a RAID6 set 
versus a RAID1 set.  While the parity computations add to the time, 
there is absolutely no way that alone can explain why this is taking 
so long.  The closest comparison using MD or DM RAID is probably a full 
verification of the array, and the greatest difference there that I've 
seen is around 10x.


* Re: btrfs-progs 4.4 re-balance of RAID6 is very slow / limited to one cpu core?
From: bbrendon @ 2016-01-27 20:58 UTC
  Cc: linux-btrfs

I ran into some major problems with balancing a raid6 array recently
on 4.4. It wouldn't resume without crashing, and yes, it was VERY slow.
In my case, it took 4 days. Here is a link to the posting:

http://www.spinics.net/lists/linux-btrfs/msg51159.html


On Wed, Jan 27, 2016 at 8:34 AM, Austin S. Hemmelgarn
<ahferroin7@gmail.com> wrote:
> On 2016-01-27 03:48, Christian Rohmann wrote:
>>
>>
>>
>> On 01/26/2016 09:20 PM, Chris Murphy wrote:
>>>
>>> Anyway,
>>> it seems reasonable to try a balance without the filters to see if
>>> that's a factor, because those filters are brand new in btrfs-progs
>>> 4.4. Granted, I'd expect they've been tested by upstream developers,
>>> but I don't know if there's an fstest for balance with these specific
>>> filters yet.
>>
>>
>> I have another box with an 8-disk RAID6 on which I simply did a
>> balance, with no newly added drives. Same issue ... a VERY slowly
>> running balance, with IO nowhere near 100% utilization and many, many
>> days of runtime to finish.
>
> Hmm, I did some automated testing in a couple of VMs last night, and I
> have to agree, this _really_ needs to get optimized.  Using the same
> data set on otherwise identical VMs, I saw an average 28x slowdown
> (best case was 16x, worst was almost 100x) for balancing a RAID6 set
> versus a RAID1 set.  While the parity computations add to the time,
> there is absolutely no way that alone can explain why this is taking
> so long.  The closest comparison using MD or DM RAID is probably a full
> verification of the array, and the greatest difference there that I've
> seen is around 10x.


* Re: btrfs-progs 4.4 re-balance of RAID6 is very slow / limited to one cpu core?
From: Chris Murphy @ 2016-01-27 21:53 UTC
  To: Austin S. Hemmelgarn; +Cc: Christian Rohmann, linux-btrfs

On Wed, Jan 27, 2016 at 9:34 AM, Austin S. Hemmelgarn
<ahferroin7@gmail.com> wrote:

> Hmm, I did some automated testing in a couple of VMs last night, and I
> have to agree, this _really_ needs to get optimized.  Using the same
> data set on otherwise identical VMs, I saw an average 28x slowdown
> (best case was 16x, worst was almost 100x) for balancing a RAID6 set
> versus a RAID1 set.  While the parity computations add to the time,
> there is absolutely no way that alone can explain why this is taking
> so long.  The closest comparison using MD or DM RAID is probably a full
> verification of the array, and the greatest difference there that I've
> seen is around 10x.

I can't exactly reproduce this. I'm using +C (nocow) qcow2 files on
Btrfs on one SSD to back the drives in the VM.

2x btrfs raid1 with files totalling 5G consistently takes ~1 minute
[1] to balance (no filters).

4x btrfs raid6 with the same files *inconsistently* takes ~1m15s [2]
to balance (no filters).
iotop is all over the place, from 21MB/s writes to 527MB/s.


Do both of you get something like this:
[root@f23m ~]# dmesg | grep -i raid
[    1.518682] raid6: sse2x1   gen()  4531 MB/s
[    1.535663] raid6: sse2x1   xor()  3783 MB/s
[    1.552683] raid6: sse2x2   gen() 10140 MB/s
[    1.569658] raid6: sse2x2   xor()  7306 MB/s
[    1.586673] raid6: sse2x4   gen() 11261 MB/s
[    1.603683] raid6: sse2x4   xor()  7009 MB/s
[    1.603685] raid6: using algorithm sse2x4 gen() 11261 MB/s
[    1.603686] raid6: .... xor() 7009 MB/s, rmw enabled
[    1.603687] raid6: using ssse3x2 recovery algorithm



[1] Did it 3 times:
1m8s
0m58s
0m40s

[2] Did this multiple times:
1m15s
0m55s
0m49s
And then from that point all attempts were 2+ minutes, but never more
than 2m29s. I'm not sure why, but there were a lot of dropouts in iotop
where it'd go to 0MB/s for a couple of seconds. I captured some sysrq+t
output for this.

https://drive.google.com/open?id=0B_2Asp8DGjJ9SE5ZNTBGQUV1ZUk


-- 
Chris Murphy


* Re: btrfs-progs 4.4 re-balance of RAID6 is very slow / limited to one cpu core?
From: Austin S. Hemmelgarn @ 2016-01-28 12:27 UTC
  To: Chris Murphy; +Cc: Christian Rohmann, linux-btrfs

On 2016-01-27 16:53, Chris Murphy wrote:
> On Wed, Jan 27, 2016 at 9:34 AM, Austin S. Hemmelgarn
> <ahferroin7@gmail.com> wrote:
>
>> Hmm, I did some automated testing in a couple of VMs last night, and I
>> have to agree, this _really_ needs to get optimized.  Using the same
>> data set on otherwise identical VMs, I saw an average 28x slowdown
>> (best case was 16x, worst was almost 100x) for balancing a RAID6 set
>> versus a RAID1 set.  While the parity computations add to the time,
>> there is absolutely no way that alone can explain why this is taking
>> so long.  The closest comparison using MD or DM RAID is probably a full
>> verification of the array, and the greatest difference there that I've
>> seen is around 10x.
>
> I can't exactly reproduce this. I'm using +C (nocow) qcow2 files on
> Btrfs on one SSD to back the drives in the VM.
In my case I was using a set of 8 thinly provisioned 256G (virtual 
size) LVM volumes exposed directly to a Xen VM as virtual block 
devices, physically backed by traditional hard drives.
For both tests, I used a filesystem spanning all the disks which had a 
lot of sparse files, and which had had a lot of data chunks 
force-allocated and then made almost empty.  I made a point of using 
snapshots to ensure that the filesystem itself was not a variable in 
this.  It's probably worth noting that the system I ran this on does 
have other VMs running at the same time on the same physical CPUs, but 
we need to plan for that use case also.
>
> 2x btrfs raid1 with files totalling 5G consistently takes ~1 minute
> [1] to balance (no filters).
Similar times here.
>
> 4x btrfs raid6 with the same files *inconsistently* takes ~1m15s [2]
> to balance (no filters).
On this I was literally getting around 30 minutes on average, with one 
case where it only took 16, and one where it took 97.  On both 
configurations, I did 12 runs total.
> iotop is all over the place, from 21MB/s writes to 527MB/s.
Similar results with iotop, with values ranging from 2MB/s up to spikes 
of 100MB/s (which is about 150% of the measured streaming write speed 
from the VM going straight to the virtual disk).
>
>
> Do both of you get something like this:
> [root@f23m ~]# dmesg | grep -i raid
> [    1.518682] raid6: sse2x1   gen()  4531 MB/s
> [    1.535663] raid6: sse2x1   xor()  3783 MB/s
> [    1.552683] raid6: sse2x2   gen() 10140 MB/s
> [    1.569658] raid6: sse2x2   xor()  7306 MB/s
> [    1.586673] raid6: sse2x4   gen() 11261 MB/s
> [    1.603683] raid6: sse2x4   xor()  7009 MB/s
> [    1.603685] raid6: using algorithm sse2x4 gen() 11261 MB/s
> [    1.603686] raid6: .... xor() 7009 MB/s, rmw enabled
> [    1.603687] raid6: using ssse3x2 recovery algorithm
My system picks avx2x4, which supposedly gets 6.6 GB/s on this hardware, 
although I've never seen any raid recovery, even on RAM disks, manage 
that kind of computational throughput.
>
>
>
> [1] Did it 3 times:
> 1m8s
> 0m58s
> 0m40s
>
> [2] Did this multiple times:
> 1m15s
> 0m55s
> 0m49s
> And then from that point all attempts were 2+ minutes, but never more
> than 2m29s. I'm not sure why, but there were a lot of dropouts in iotop
> where it'd go to 0MB/s for a couple of seconds. I captured some sysrq+t
> output for this.
I saw similar drops in IO performance as well, although I didn't get any 
traces for it.
>
> https://drive.google.com/open?id=0B_2Asp8DGjJ9SE5ZNTBGQUV1ZUk
>
>



* Re: btrfs-progs 4.4 re-balance of RAID6 is very slow / limited to one cpu core?
From: Christian Rohmann @ 2016-02-01 14:10 UTC
  To: Chris Murphy, Austin S. Hemmelgarn; +Cc: linux-btrfs

Hey Chris,


sorry for the late reply.


On 01/27/2016 10:53 PM, Chris Murphy wrote:
> I can't exactly reproduce this. I'm using +C (nocow) qcow2 files on
> Btrfs on one SSD to back the drives in the VM.
> 
> 2x btrfs raid1 with files totalling 5G consistently takes ~1 minute
> [1] to balance (no filters).
> 
> 4x btrfs raid6 with the same files *inconsistently* takes ~1m15s [2]
> to balance (no filters).
> iotop is all over the place, from 21MB/s writes to 527MB/s.

To be honest, 5G is not really 21T spread across 12 spindles with LOTS
of data on them. On another box with 8x4TB spinning rust it's also very
slow.


> Do both of you get something like this:
> [root@f23m ~]# dmesg | grep -i raid
> [    1.518682] raid6: sse2x1   gen()  4531 MB/s
> [    1.535663] raid6: sse2x1   xor()  3783 MB/s
> [    1.552683] raid6: sse2x2   gen() 10140 MB/s
> [    1.569658] raid6: sse2x2   xor()  7306 MB/s
> [    1.586673] raid6: sse2x4   gen() 11261 MB/s
> [    1.603683] raid6: sse2x4   xor()  7009 MB/s
> [    1.603685] raid6: using algorithm sse2x4 gen() 11261 MB/s
> [    1.603686] raid6: .... xor() 7009 MB/s, rmw enabled
> [    1.603687] raid6: using ssse3x2 recovery algorithm


Yes:
--- cut ---
[    4.704396] raid6: sse2x1   gen()  4288 MB/s
[    4.772401] raid6: sse2x1   xor()  4036 MB/s
[    4.840403] raid6: sse2x2   gen()  7629 MB/s
[    4.908405] raid6: sse2x2   xor()  6247 MB/s
[    4.976404] raid6: sse2x4   gen() 10221 MB/s
[    5.044397] raid6: sse2x4   xor()  7620 MB/s
[    5.044525] raid6: using algorithm sse2x4 gen() 10221 MB/s
[    5.044641] raid6: .... xor() 7620 MB/s, rmw enabled
[    5.044767] raid6: using ssse3x2 recovery algorithm
--- cut ---






Would some sort of stracing or profiling of the process help to narrow
down where the time is currently spent and why the balancing is only
running single-threaded?




Regards

Christian


* Re: btrfs-progs 4.4 re-balance of RAID6 is very slow / limited to one cpu core?
From: Chris Murphy @ 2016-02-01 20:52 UTC
  To: Christian Rohmann; +Cc: Chris Murphy, Austin S. Hemmelgarn, linux-btrfs

On Mon, Feb 1, 2016 at 7:10 AM, Christian Rohmann
<crohmann@netcologne.de> wrote:
> Hey Chris,
>
>
> sorry for the late reply.
>
>
> On 01/27/2016 10:53 PM, Chris Murphy wrote:
>> I can't exactly reproduce this. I'm using +C (nocow) qcow2 files on
>> Btrfs on one SSD to back the drives in the VM.
>>
>> 2x btrfs raid1 with files totalling 5G consistently takes ~1 minute
>> [1] to balance (no filters).
>>
>> 4x btrfs raid6 with the same files *inconsistently* takes ~1m15s [2]
>> to balance (no filters).
>> iotop is all over the place, from 21MB/s writes to 527MB/s.
>
> To be honest, 5G is not really 21T spread across 12 spindles with LOTS
> of data on them. On another box with 8x4TB spinning rust it's also very
> slow.

5G vs 21T is relevant if the mere fact that there's more metadata (a
bigger file system) is the source of the problem. Otherwise, at any
given moment, neither one of us has 5G, let alone 21T, of data
in flight.

But you have 12 drives, with a theoretical data bandwidth for reads and
writes of about 1GiB/s, depending on the performance of the drives and
on where on the platter the read/write happens. So my test is actually
the disadvantaged one. My scenario with 4 qcow2 files on a single SSD
should not perform better, except possibly with respect to IOPS. But
mine was not a metadata-intensive test; it was merely two large
sequential files.

So if you have a very metadata-intensive workload, that's actually
pretty bad for any RAID6, and it's probably not great for Btrfs either.
A consideration is how metadata chunks get balanced on raid6, where the
strip size is 64K and the nodesize is 16K. If there's a lot of metadata
being produced, I think we'd expect first that the 16K nodes are fully
packed, then that each 64K strip per device is fully packed, then
parity is computed for that stripe, and then the whole stripe is
written.
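
As a worked example of those numbers (my arithmetic, assuming
Christian's 12-device raid6 with the default 64K strip and 16K
nodesize):

  data strips per full stripe = 12 devices - 2 parity = 10
  data per full stripe        = 10 x 64K = 640K (+ 2 x 64K parity)
  16K nodes per 64K strip     = 64K / 16K = 4
  16K nodes per full stripe   = 10 x 4 = 40

So dirtying even a single node can implicate a 768K-wide stripe.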

But when something is modified, what does a single key change look
like? The minimum initial change is that a single 16KiB node has to be
CoW'd, but since it's raid6, what does that mean?

1. Read the 64K strip containing the 16K node.
2. Read the separate 64K strip containing its csum? I'm not sure if the
node's csum is actually in the node itself.
3. Does btrfs raid6 always check parity on every read? That's not the
case with md raid: on normal reads where the drive does not report a
read error, the parity strips are never read, so in effect it's a raid0
using n-2 drives, with the strip being the minimum read size.

Depending on all of this, a single 16K read means 1-3 IOs, and a
modification would require 4-6 IOs. Each IO is 64K. So this is not
going to be small-file friendly at all, the way I see it, hence why it
could be really valuable to have raid1 metadata (with n-way mirroring).
Or possibly set the nodesize to 64K to match the strip size?
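
If someone wants to test the raid1-metadata idea, the conversion would
be something like this sketch, using the long-standing convert filter
(/mnt is a placeholder):

# btrfs balance start -mconvert=raid1 /mnt   ## data stays raid6, metadata moves to raid1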

So the test I did is relevant in that a) it's sufficiently different
from your setup, and b) I can't reproduce the problem where a raid6
balance takes longer than a raid1 balance. So there's something else
going on other than it merely being raid6. It's raid6 *and* something
else, like the workload.


> Would some sort of stracing or profiling of the process help to narrow
> down where the time is currently spent and why the balancing is only
> running single-threaded?

This can't be straced. Someone a lot more knowledgeable than I am
might figure out where all the waits are with just a sysrq+t, if it
is a hold-up in, say, parity computations. Otherwise there's perf,
which is a rabbit hole, but perf top is kinda cool to watch. That
might give you an idea of where most of the CPU cycles are going, if
you can isolate the workload to just the balance. Otherwise you may
end up with noisy data.



-- 
Chris Murphy


* Re: btrfs-progs 4.4 re-balance of RAID6 is very slow / limited to one cpu core?
From: Christian Rohmann @ 2016-02-09 13:48 UTC
  To: Chris Murphy; +Cc: Austin S. Hemmelgarn, linux-btrfs



On 02/01/2016 09:52 PM, Chris Murphy wrote:
>> Would some sort of stracing or profiling of the process help to narrow
>> down where the time is currently spent and why the balancing is only
>> running single-threaded?
> This can't be straced. Someone a lot more knowledgeable than I am
> might figure out where all the waits are with just a sysrq+t, if it
> is a hold-up in, say, parity computations. Otherwise there's perf,
> which is a rabbit hole, but perf top is kinda cool to watch. That
> might give you an idea of where most of the CPU cycles are going, if
> you can isolate the workload to just the balance. Otherwise you may
> end up with noisy data.

My balance run has now been working away since the 19th of January:
 "885 out of about 3492 chunks balanced (996 considered),  75% left"

So this will take several more WEEKS to finish. Is there really nothing
anyone here wants me to do or analyze to help find the root cause of
this? I mean, with this kind of performance there is no way a RAID6 can
be used in production. Not because the code is not stable or
functioning, but because regular maintenance like replacing a drive or
growing an array takes WEEKS, during which another maintenance
procedure could become necessary or, much worse, another drive might
fail.

What I'm saying is: such a slow RAID6 balance renders the redundancy
unusable, because drives might fail quicker than the potential rebuild
(read: "balance").



Regards

Christian


* Re: btrfs-progs 4.4 re-balance of RAID6 is very slow / limited to one cpu core?
From: Marc MERLIN @ 2016-02-09 16:46 UTC
  To: Christian Rohmann; +Cc: Chris Murphy, Austin S. Hemmelgarn, linux-btrfs

On Tue, Feb 09, 2016 at 02:48:14PM +0100, Christian Rohmann wrote:
> 
> 
> On 02/01/2016 09:52 PM, Chris Murphy wrote:
>>> Would some sort of stracing or profiling of the process help to narrow
>>> down where the time is currently spent and why the balancing is only
>>> running single-threaded?
>> This can't be straced. Someone a lot more knowledgeable than I am
>> might figure out where all the waits are with just a sysrq+t, if it
>> is a hold-up in, say, parity computations. Otherwise there's perf,
>> which is a rabbit hole, but perf top is kinda cool to watch. That
>> might give you an idea of where most of the CPU cycles are going, if
>> you can isolate the workload to just the balance. Otherwise you may
>> end up with noisy data.
> 
> My balance run has now been working away since the 19th of January:
>  "885 out of about 3492 chunks balanced (996 considered),  75% left"
> 
> So this will take several more WEEKS to finish. Is there really nothing
> anyone here wants me to do or analyze to help find the root cause of
> this? I mean, with this kind of performance there is no way a RAID6 can
> be used in production. Not because the code is not stable or
> functioning, but because regular maintenance like replacing a drive or
> growing an array takes WEEKS, during which another maintenance
> procedure could become necessary or, much worse, another drive might
> fail.
> 
> What I'm saying is: such a slow RAID6 balance renders the redundancy
> unusable, because drives might fail quicker than the potential rebuild
> (read: "balance").

I agree, this is bad.
For what it's worth, one of my own filesystems (a target for backups,
with many, many files) has apparently become slow enough that it half
hangs my system when I'm using it.
I've just unmounted it to make sure my overall system performance comes
back, and I may have to delete and recreate it.

Sadly, this also means that btrfs still seems to get itself into corner
cases that cause performance issues.
I'm not saying that this is what you hit, but it is possible.

Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/                         | PGP 1024R/763BE901


* Re: btrfs-progs 4.4 re-balance of RAID6 is very slow / limited to one cpu core?
From: Chris Murphy @ 2016-02-09 21:46 UTC
  To: Christian Rohmann; +Cc: Chris Murphy, Austin S. Hemmelgarn, linux-btrfs

On Tue, Feb 9, 2016 at 6:48 AM, Christian Rohmann
<crohmann@netcologne.de> wrote:
>
>
> On 02/01/2016 09:52 PM, Chris Murphy wrote:
>>> Would some sort of stracing or profiling of the process help to narrow
>>> down where the time is currently spent and why the balancing is only
>>> running single-threaded?
>> This can't be straced. Someone a lot more knowledgeable than I am
>> might figure out where all the waits are with just a sysrq+t, if it
>> is a hold-up in, say, parity computations. Otherwise there's perf,
>> which is a rabbit hole, but perf top is kinda cool to watch. That
>> might give you an idea of where most of the CPU cycles are going, if
>> you can isolate the workload to just the balance. Otherwise you may
>> end up with noisy data.
>
> My balance run has now been working away since the 19th of January:
>  "885 out of about 3492 chunks balanced (996 considered),  75% left"
>
> So this will take several more WEEKS to finish. Is there really nothing
> anyone here wants me to do or analyze to help find the root cause of
> this?

Can you run 'perf top', let it run for a few minutes, and then
copy/paste or screenshot it somewhere? I'll say in advance that this is
just a matter of curiosity about where the kernel is spending all of
its time while this is going so slowly; in no way can I imagine being
able to help fix it. I'm a bit surprised there's no dev response; maybe
try the IRC channel? Weeks is just too long. My concern is, if there's
a drive failure: a) what state is the fs going to be in, and b) will
device replace be this slow too? I'd expect the code path for balance
and replace to be the same, so I suspect yes.


> I mean, with this kind of performance there is no way a RAID6 can
> be used in production. Not because the code is not stable or
> functioning, but because regular maintenance like replacing a drive or
> growing an array takes WEEKS, during which another maintenance
> procedure could become necessary or, much worse, another drive might
> fail.

That's right.

In my dummy test, which should have run slower than your setup, the
other differences on my end were:

elevator=noop    ## because I'm running an SSD
kernel 4.5rc0

I could redo my test, also using 'perf top', and see if there's any
glaring difference in where the kernel spends its time on a system
pushing the block device to its maximum write ability versus one that
isn't. I don't have any other ideas. I'd rather a developer said "try
this" to gather more useful information, rather than have us just
poking things with a random stick.



-- 
Chris Murphy


* Re: btrfs-progs 4.4 re-balance of RAID6 is very slow / limited to one cpu core?
From: Chris Murphy @ 2016-02-10  2:23 UTC
  Cc: Christian Rohmann, Austin S. Hemmelgarn, linux-btrfs

# perf stat -e 'btrfs:*' -a sleep 10


## This is single device HDD, balance of a root fs was started before
these 10 seconds of sampling. There are some differences in the
statistics depending on whether there are predominately reads or
writes for the balance, so clearly balance does predominately reads,
then predominately writes. Unsurprising but the three tries I did were
largely in agreement (orders of magnitude wise).
http://fpaste.org/320551/06921614/


# perf record -e block:block_rq_issue -ag
^C   ## after ~30 seconds
# perf report

## Single-device HDD, balance of the root fs started before perf
record. There's a lot of data, collapsed by default; I expanded a few
items at random just as an example. I suspect the write of the
perf.data file is a non-factor, because it was just under 2MiB.
http://fpaste.org/320555/14550698/raw/


# perf top

## Single-device HDD, balance of the root fs started before issuing
this command, which then ran for about 20 seconds. This is actually not
as interesting as I thought it might be, but I don't really know what
I'm looking for. I'd need something else to compare it to.
http://fpaste.org/320559/55070873/


Anyway, all of these are single-device, so it's not an apples-to-apples
comparison, but it is a working balance (full speed for the block
device).


Chris Murphy


* Re: btrfs-progs 4.4 re-balance of RAID6 is very slow / limited to one cpu core?
From: Chris Murphy @ 2016-02-10  2:36 UTC
  Cc: Christian Rohmann, Austin S. Hemmelgarn, linux-btrfs

This could also be interesting. It means canceling the balance in
progress, starting a fresh one under perf stat, waiting some time, and
then cancelling that one so that perf stat returns its results.

# perf stat -B btrfs balance start /

## Again, a single-device example, balancing at the expected performance.
http://fpaste.org/320562/55071438/

I didn't try this, but it looks like it'd be a variation on the above,
attaching to a running balance:

# perf stat -B -p <pidforbalance> sleep 60

Anyway...

Chris Murphy


* Re: btrfs-progs 4.4 re-balance of RAID6 is very slow / limited to one cpu core?
From: Christian Rohmann @ 2016-02-10 13:19 UTC
  To: Chris Murphy; +Cc: Austin S. Hemmelgarn, linux-btrfs

Hey btrfs-folks,


I did a bit of digging using "perf":


1)
 * "perf stat -B -p 3933 sleep 60"
 * "perf stat -e 'btrfs:*' -a sleep 60"
 -> http://fpaste.org/320718/10016145/



2)
 * "perf record -e block:block_rq_issue -ag" for about 30 seconds:
 -> http://fpaste.org/320719/51101751/raw/


3)
* perf top
 -> http://fpaste.org/320720/45511028/






Regards

Christian


* Re: btrfs-progs 4.4 re-balance of RAID6 is very slow / limited to one cpu core?
From: Chris Murphy @ 2016-02-10 19:16 UTC
  To: Christian Rohmann; +Cc: Chris Murphy, Austin S. Hemmelgarn, linux-btrfs

http://fpaste.org/320720/45511028/

What is rb_next? (It's the kernel's red-black tree iterator, used all
over, including by btrfs.) See if you can expand that entry and find
out more about why so much time is being spent there. In my capture,
rb_next is less than 1% overhead, but for you it's the top item. That's
suspicious.
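
A sketch of how to expand it, assuming perf was built with call-graph
support and using <pidforbalance> as above:

# perf record -g -p <pidforbalance> -- sleep 60
# perf report --stdio    ## shows which call chains funnel into rb_next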


http://fpaste.org/320718/10016145/
Lines 72-73: we both have counts for the qgroup events. Mine are much,
much lower than yours. I have never had quotas enabled on any of my
filesystems, so I don't know why there are any such counts at all. But
since your values are nearly three orders of magnitude greater than
mine, I have to ask whether you have quotas enabled, or ever had them
enabled? That might be a factor here...
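
A quick way to check, as a sketch (/mnt is a placeholder):

# btrfs qgroup show /mnt     ## should error out if quotas were never enabled
# btrfs quota disable /mnt   ## if they are on, disabling them before balancing
                             ## would be a useful data point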



Chris Murphy


* Re: btrfs-progs 4.4 re-balance of RAID6 is very slow / limited to one cpu core?
From: Chris Murphy @ 2016-02-10 19:38 UTC
  To: Christian Rohmann; +Cc: Austin S. Hemmelgarn, linux-btrfs

Sometimes, when things are really slow or even hung up with Btrfs yet
there's no blocked task being reported, a dev has asked for sysrq+t; so
that might also be something to issue while the slow balance is
happening, and then run dmesg to grab the result. The thing is, I have
no idea how to read the output, but maybe if it gets posted up
somewhere we can figure it out. I mean, obviously this is a bug; it
shouldn't take two weeks or more to balance a raid6 volume.

I'd like to think this would have been caught much sooner in regression
testing before release, so it makes me wonder if this is an edge case
related to hardware, to the kernel build, or, more likely, to some
state of the affected file systems that the test file systems aren't
in. It might be more helpful to sort through the xfstests that call
balance and raid56 and see if there's something that's just not being
tested but applies to the actual filesystems involved, rather than
trying to decipher kernel output. *shrug*


Chris Murphy

