linux-btrfs.vger.kernel.org archive mirror
* Re: [REGRESSION] Super slow balance in v5.0-rc1
@ 2019-01-14 11:47 Tomasz Chmielewski
  2019-01-14 12:34 ` Qu Wenruo
  0 siblings, 1 reply; 6+ messages in thread
From: Tomasz Chmielewski @ 2019-01-14 11:47 UTC (permalink / raw)
  To: linux-btrfs

> When rebasing my qgroup + balance optimization patches, I found one very
> obvious performance regression for balance.
> 
> For a normal 4G subvolume, 16 snapshots, balance workload, the v4.20 kernel
> only takes 3s to relocate a metadata block group, while for v5.0-rc1, I
> don't really know how long it will take as it hasn't finished yet.

Are you sure it's a v5.0-rc1 regression, and not an earlier one?

I'm trying to do a metadata-only balance from RAID-5 to RAID-1 with 
4.19.8.

It was going relatively "normally" until it got stuck, showing no 
progress.

I canceled the balance, upgraded to 4.20, and started the balance again. 
For 11 days straight it has rewritten terabytes of data on the disks, with 
no progress at all. Also, on 4.19.8 the balance was interrupted because of 
"out of space", even though we have terabytes free.

Metadata RAID-5 usage has stayed at 4.12GiB for the past 11 days (and a few 
more days with 4.19.8).



# btrfs fi usage /data
WARNING: RAID56 detected, not implemented
WARNING: RAID56 detected, not implemented
Overall:
     Device size:                  14.47TiB
     Device allocated:            112.06GiB
     Device unallocated:           14.36TiB
     Device missing:                  0.00B
     Used:                        107.93GiB
     Free (estimated):                0.00B      (min: 8.00EiB)
     Data ratio:                       0.00
     Metadata ratio:                   1.64
     Global reserve:              512.00MiB      (used: 1.86MiB)

Data,RAID5: Size:5.28TiB, Used:3.04TiB
    /dev/sda5       1.76TiB
    /dev/sdb5       1.76TiB
    /dev/sdc5       1.76TiB
    /dev/sdd5       1.76TiB

Metadata,RAID1: Size:56.00GiB, Used:53.97GiB
    /dev/sda5      29.00GiB
    /dev/sdb5      27.00GiB
    /dev/sdc5      27.00GiB
    /dev/sdd5      29.00GiB

Metadata,RAID5: Size:12.38GiB, Used:11.13GiB
    /dev/sda5       4.12GiB
    /dev/sdb5       4.12GiB
    /dev/sdc5       4.12GiB
    /dev/sdd5       4.12GiB

System,RAID1: Size:32.00MiB, Used:416.00KiB
    /dev/sdb5      32.00MiB
    /dev/sdc5      32.00MiB

Unallocated:
    /dev/sda5       1.83TiB
    /dev/sdb5       1.83TiB
    /dev/sdc5       1.83TiB
    /dev/sdd5       1.83TiB


# btrfs balance status /data
Balance on '/data' is running
13 out of about 64 chunks balanced (15 considered),  80% left
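
Editor's sketch: the two outputs above can be read mechanically. The Python below is only an illustration, assuming the line formats shown in this report (other btrfs-progs versions may print differently). The helper names `profiles_by_type` and `parse_balance_status` are made up for this example. It shows that metadata still exists under two profiles, and that the "80% left" figure is consistent with 13 of about 64 chunks being done.

```python
import re

# Sample lines copied from the report above (not freshly captured output).
USAGE_HEADERS = """\
Data,RAID5: Size:5.28TiB, Used:3.04TiB
Metadata,RAID1: Size:56.00GiB, Used:53.97GiB
Metadata,RAID5: Size:12.38GiB, Used:11.13GiB
System,RAID1: Size:32.00MiB, Used:416.00KiB
"""
STATUS_LINE = "13 out of about 64 chunks balanced (15 considered),  80% left"


def profiles_by_type(usage_text):
    """Map each block group type to the list of profiles it currently uses."""
    found = {}
    for match in re.finditer(r"^(\w+),(\w+): Size:", usage_text, re.MULTILINE):
        kind, profile = match.groups()
        found.setdefault(kind, []).append(profile)
    return found


def parse_balance_status(line):
    """Return (balanced, estimated_total, considered, percent_left)."""
    match = re.search(
        r"(\d+) out of about (\d+) chunks balanced \((\d+) considered\),\s+(\d+)% left",
        line,
    )
    if match is None:
        raise ValueError("unrecognized balance status line")
    return tuple(int(group) for group in match.groups())


profiles = profiles_by_type(USAGE_HEADERS)
balanced, total, considered, percent_left = parse_balance_status(STATUS_LINE)

# Metadata listed under two profiles means the conversion has not completed.
print(profiles["Metadata"])
# Share of the estimated chunks still to be balanced, as a whole percentage.
print(round(100 * (total - balanced) / total))
```

Polling the status line periodically and comparing the parsed tuple between samples would be one way to confirm that the counters are not moving.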


* Re: [REGRESSION] Super slow balance in v5.0-rc1
  2019-01-14 11:47 [REGRESSION] Super slow balance in v5.0-rc1 Tomasz Chmielewski
@ 2019-01-14 12:34 ` Qu Wenruo
  2019-01-14 13:06   ` Tomasz Chmielewski
  0 siblings, 1 reply; 6+ messages in thread
From: Qu Wenruo @ 2019-01-14 12:34 UTC (permalink / raw)
  To: Tomasz Chmielewski, linux-btrfs





On 2019/1/14 7:47 PM, Tomasz Chmielewski wrote:
>> When rebasing my qgroup + balance optimization patches, I found one very
>> obvious performance regression for balance.
>>
>> For a normal 4G subvolume, 16 snapshots, balance workload, the v4.20 kernel
>> only takes 3s to relocate a metadata block group, while for v5.0-rc1, I
>> don't really know how long it will take as it hasn't finished yet.
> 
> Are you sure it's a v5.0-rc1 regression, and not an earlier one?

At least for what I'm describing, yes. v5.0-rc1 introduced the bug, and
it has already been pinned down.

So I'm afraid you're hitting a different bug.

> 
> I'm trying to do a metadata-only balance from RAID-5 to RAID-1 with
> 4.19.8.
> 
> It was going relatively "normally" until it got stuck, showing no
> progress.
> 
> I canceled the balance, upgraded to 4.20, and started the balance again.
> For 11 days straight it has rewritten terabytes of data on the disks, with
> no progress at all. Also, on 4.19.8 the balance was interrupted because of
> "out of space", even though we have terabytes free.
> 
> Metadata RAID-5 usage has stayed at 4.12GiB for the past 11 days (and a few
> more days with 4.19.8).

Are you using qgroups? They are another major cause of slow balance.

Thanks,
Qu


* Re: [REGRESSION] Super slow balance in v5.0-rc1
  2019-01-14 12:34 ` Qu Wenruo
@ 2019-01-14 13:06   ` Tomasz Chmielewski
  0 siblings, 0 replies; 6+ messages in thread
From: Tomasz Chmielewski @ 2019-01-14 13:06 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On 2019-01-14 21:34, Qu Wenruo wrote:
> On 2019/1/14 7:47 PM, Tomasz Chmielewski wrote:
>>> When rebasing my qgroup + balance optimization patches, I found one very
>>> obvious performance regression for balance.
>>> 
>>> For a normal 4G subvolume, 16 snapshots, balance workload, the v4.20 kernel
>>> only takes 3s to relocate a metadata block group, while for v5.0-rc1, I
>>> don't really know how long it will take as it hasn't finished yet.
>> 
>> Are you sure it's a v5.0-rc1 regression, and not an earlier one?
> 
> At least for what I'm describing, yes. v5.0-rc1 introduced the bug, and
> it has already been pinned down.
> 
> So I'm afraid you're hitting a different bug.
> 
>> 
>> I'm trying to do a metadata-only balance from RAID-5 to RAID-1 with
>> 4.19.8.
>> 
>> It was going relatively "normally" until it got stuck, showing no
>> progress.
>> 
>> I canceled the balance, upgraded to 4.20, and started the balance again.
>> For 11 days straight it has rewritten terabytes of data on the disks, with
>> no progress at all. Also, on 4.19.8 the balance was interrupted because of
>> "out of space", even though we have terabytes free.
>> 
>> Metadata RAID-5 usage has stayed at 4.12GiB for the past 11 days (and a few
>> more days with 4.19.8).
> 
> Are you using qgroups? They are another major cause of slow balance.

No, no qgroups.

# btrfs quota rescan /data
ERROR: quota rescan failed: Invalid argument


* Re: [REGRESSION] Super slow balance in v5.0-rc1
  2019-01-14  9:35 ` David Sterba
@ 2019-01-14 10:00   ` Qu Wenruo
  0 siblings, 0 replies; 6+ messages in thread
From: Qu Wenruo @ 2019-01-14 10:00 UTC (permalink / raw)
  To: dsterba, linux-btrfs





On 2019/1/14 5:35 PM, David Sterba wrote:
> On Mon, Jan 14, 2019 at 01:39:46PM +0800, Qu Wenruo wrote:
>> Hi,
>>
>> When rebasing my qgroup + balance optimization patches, I found one very
>> obvious performance regression for balance.
>>
>> For a normal 4G subvolume, 16 snapshots, balance workload, the v4.20 kernel
>> only takes 3s to relocate a metadata block group, while for v5.0-rc1, I
>> don't really know how long it will take as it hasn't finished yet.
> 
> This looks like a lockup, unbounded waiting or missed wakeup.

Nope.

It's committing transactions like crazy.

With a much smaller dataset it can in fact finish, but where v4.20
finishes in just seconds, v5.0-rc1 takes nearly 400 seconds.

And during those 400 seconds, btrfs commits over 2000 times.
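
Editor's sketch: to put those two figures in proportion, the arithmetic below compares them with periodic commits alone. It assumes the default btrfs commit interval of 30 seconds (the `commit=` mount option); the test setup's actual mount options are not stated in this thread. Both inputs are the approximate numbers quoted above.

```python
# Approximate figures from the message: "near 400 seconds", "over 2000 times".
duration_s = 400
commits = 2000

# Assumption: default btrfs commit interval (mount option commit=30).
default_commit_interval_s = 30

commits_per_second = commits / duration_s
expected_periodic_commits = duration_s / default_commit_interval_s

print(commits_per_second)                          # 5.0
print(round(expected_periodic_commits))            # 13
print(round(commits / expected_periodic_commits))  # 150
```

That is roughly one commit every 0.2 seconds, on the order of 150 times more than the periodic commit timer alone would produce over the same span.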

> 
>> And the most important part is, this happens when quota is *DISABLED*!!!
>>
>> I'm bisecting for this regression, but if there are some users trying
>> latest rc kernel, please be aware of this regression.
> 
> The rc1 can go pretty wild and issues could be caused by other
> subsystems, so I'd try to test the merged (32ee34eddad13cd4) and
> non-merged (52042d8e82ff50d) branches, this should tell you if it's a
> genuine btrfs bug or not.

I have already bisected the bug; it's 64403612b73a ("btrfs: rework
btrfs_check_space_for_delayed_refs").

Furthermore, I submitted an RFC patch for fstests, which everyone can
test with, without relying on the uncertain contents of '/usr'.
https://patchwork.kernel.org/patch/10761715/

This turns out to involve several changes in relocation behavior, at least.

Without snapshots, with just one subvolume and only several megabytes of
metadata to relocate, it simply returns ENOSPC.
With enough snapshots, it commits like crazy.

The bisect is based on relocation duration; I haven't dug deep enough
to judge the ENOSPC behavior yet.

Thanks,
Qu
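
Editor's sketch: a bisect "based on relocation duration" amounts to a binary search in which a commit counts as bad when the measured balance time exceeds some threshold. The Python below illustrates only that idea. It is not the procedure actually used in this thread; the function name, the 60 second threshold, the intermediate commit labels, and all timings are hypothetical.

```python
def first_slow_commit(commits, duration_of, threshold_s):
    """Return the first commit whose measured duration exceeds threshold_s.

    Assumes commits[0] is known fast, commits[-1] is known slow, and every
    commit after the first slow one is slow as well.
    """
    low, high = 0, len(commits) - 1  # low: known fast, high: known slow
    while high - low > 1:
        mid = (low + high) // 2
        if duration_of(commits[mid]) > threshold_s:
            high = mid  # mid is slow: the culprit is mid or earlier
        else:
            low = mid   # mid is fast: the culprit is later
    return commits[high]


# Hypothetical history; labels c1..c5 and every timing here are invented.
history = ["v4.20", "c1", "c2", "c3", "c4", "c5", "v5.0-rc1"]
timings = {"v4.20": 3, "c1": 3, "c2": 4, "c3": 3, "c4": 390, "c5": 395, "v5.0-rc1": 400}

print(first_slow_commit(history, timings.__getitem__, threshold_s=60))
```

Classifying by duration only locates the slowdown; a behavior change that fails quickly, such as the ENOSPC case mentioned above, would need its own pass/fail criterion.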



* Re: [REGRESSION] Super slow balance in v5.0-rc1
  2019-01-14  5:39 Qu Wenruo
@ 2019-01-14  9:35 ` David Sterba
  2019-01-14 10:00   ` Qu Wenruo
  0 siblings, 1 reply; 6+ messages in thread
From: David Sterba @ 2019-01-14  9:35 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Mon, Jan 14, 2019 at 01:39:46PM +0800, Qu Wenruo wrote:
> Hi,
> 
> When rebasing my qgroup + balance optimization patches, I found one very
> obvious performance regression for balance.
> 
> For a normal 4G subvolume, 16 snapshots, balance workload, the v4.20 kernel
> only takes 3s to relocate a metadata block group, while for v5.0-rc1, I
> don't really know how long it will take as it hasn't finished yet.

This looks like a lockup, unbounded waiting or missed wakeup.

> And the most important part is, this happens when quota is *DISABLED*!!!
> 
> I'm bisecting for this regression, but if there are some users trying
> latest rc kernel, please be aware of this regression.

The rc1 can go pretty wild and issues could be caused by other
subsystems, so I'd try to test the merged (32ee34eddad13cd4) and
non-merged (52042d8e82ff50d) branches, this should tell you if it's a
genuine btrfs bug or not.


* [REGRESSION] Super slow balance in v5.0-rc1
@ 2019-01-14  5:39 Qu Wenruo
  2019-01-14  9:35 ` David Sterba
  0 siblings, 1 reply; 6+ messages in thread
From: Qu Wenruo @ 2019-01-14  5:39 UTC (permalink / raw)
  To: linux-btrfs



Hi,

When rebasing my qgroup + balance optimization patches, I found one very
obvious performance regression for balance.

For a normal 4G subvolume, 16 snapshots, balance workload, the v4.20 kernel
only takes 3s to relocate a metadata block group, while for v5.0-rc1, I
don't really know how long it will take as it hasn't finished yet.

And the most important part is, this happens when quota is *DISABLED*!!!

I'm bisecting for this regression, but if there are some users trying
latest rc kernel, please be aware of this regression.

Thanks,
Qu

