Linux-BTRFS Archive on lore.kernel.org
* Help needed, server is unresponsive after btrfs balance
@ 2019-02-04 11:47 Moritz M
  2019-02-04 11:59 ` Qu Wenruo
  0 siblings, 1 reply; 5+ messages in thread
From: Moritz M @ 2019-02-04 11:47 UTC
  To: linux-btrfs

Hi,

I'm running an Ubuntu server with a btrfs RAID1 consisting of three HDDs.

I do balancing daily via

> btrfs balance start -dusage=50 -dlimit=2 -musage=50 -mlimit=4 /

It usually takes between 1 and 10 minutes.

But today the server was unresponsive (no SSH connection possible, no
direct login via keyboard possible), even after 7 hours.

I had a similar situation two weeks ago. I did not find anything and 
finally checked and repaired the filesystem with

> btrfs check --repair /dev/sda3

Which found some qgroup related problems:

> enabling repair mode
> Checking filesystem on /dev/sda3
> UUID: cf8c4bb2-6a75-4e1d-983c-19583a93a546
> No device size related problem found
> cache and super generation don't match, space cache will be invalidated
> Counts for qgroup id: 0/257 are different
> our:   referenced 127300112384 referenced compressed 127300112384
> disk:  referenced 18446743939800129536 referenced compressed 18446743939800129536
> diff:  referenced 261209534464 referenced compressed 261209534464
> our:   exclusive 56360521728 exclusive compressed 56360521728
> disk:  exclusive 56360521728 exclusive compressed 56360521728
> Repair qgroup 0/257
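
A note on the numbers above: the on-disk "referenced" value sits just below 2^64, which is what a negative byte count looks like when it is stored in an unsigned 64-bit field. A minimal sketch in Python, assuming (the output does not say so) that the counter is an unsigned 64-bit integer; the helper name `as_signed64` is made up for illustration:

```python
# Assumption: qgroup byte counters are unsigned 64-bit values, so a
# counter decremented below zero wraps around to just under 2**64.
U64_MODULUS = 2**64

def as_signed64(value: int) -> int:
    """Reinterpret an unsigned 64-bit value as a signed 64-bit one."""
    return value - U64_MODULUS if value >= 2**63 else value

# Values copied from the btrfs check output above.
our_referenced = 127300112384
disk_referenced = 18446743939800129536
reported_diff = 261209534464

disk_signed = as_signed64(disk_referenced)
computed_diff = our_referenced - disk_signed

print(disk_signed)                      # -133909422080
print(computed_diff == reported_diff)   # True
```

Read this way, the on-disk counter had underflowed by roughly 124.7 GiB, and the reported diff is exactly the gap between the recomputed value and that negative number.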

Today I had to boot a live system, mount the btrfs filesystem with
-o skip_balance and cancel the balance there.

Mounting took ~30 minutes, and in the journal of the live system I found this:

> Feb 04 09:42:28 ubuntu kernel: INFO: task btrfs-transacti:7527 blocked for more than 120 seconds.
> Feb 04 09:42:28 ubuntu kernel:       Not tainted 4.15.0-29-generic #31-Ubuntu
> Feb 04 09:42:28 ubuntu kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Feb 04 09:42:28 ubuntu kernel: btrfs-transacti D    0  7527      2 0x80000000
> Feb 04 09:42:28 ubuntu kernel: Call Trace:
> Feb 04 09:42:28 ubuntu kernel:  __schedule+0x291/0x8a0
> Feb 04 09:42:28 ubuntu kernel:  schedule+0x2c/0x80
> Feb 04 09:42:28 ubuntu kernel:  btrfs_commit_transaction+0x81d/0x8f0 [btrfs]
> Feb 04 09:42:28 ubuntu kernel:  ? wait_woken+0x80/0x80
> Feb 04 09:42:28 ubuntu kernel:  transaction_kthread+0x18d/0x1b0 [btrfs]
> Feb 04 09:42:28 ubuntu kernel:  kthread+0x121/0x140
> Feb 04 09:42:28 ubuntu kernel:  ? btrfs_cleanup_transaction+0x560/0x560 [btrfs]
> Feb 04 09:42:28 ubuntu kernel:  ? kthread_create_worker_on_cpu+0x70/0x70
> Feb 04 09:42:28 ubuntu kernel:  ? do_syscall_64+0x73/0x130
> Feb 04 09:42:28 ubuntu kernel:  ? SyS_exit_group+0x14/0x20

After rebooting, the server behaved normally. The only thing I could find
in the journal was:

> Feb 04 02:00:02 server kernel: BTRFS info (device sda3): relocating block group 7246746484736 flags data|raid1
> 
> Feb 04 02:05:23 server kernel: BTRFS info (device sda3): found 3 extents
> Feb 04 02:06:12 server kernel: BTRFS info (device sda3): found 3 extents
> Feb 04 02:07:01 server kernel: BTRFS info (device sda3): relocating block group 7059915407360 flags metadata|raid1

Btrfs balancing starts at 02:00.
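
For what it is worth, the kernel messages above already hint at how slow the balance was. A rough sketch in Python that measures the gap between consecutive "relocating block group" messages; the regular expression and the helper name `relocation_gaps` are illustrative, and the year is an assumption because the journal timestamps omit it:

```python
import re
from datetime import datetime

# Log lines copied from the journal excerpt above (wrapped lines rejoined).
LOG = """\
Feb 04 02:00:02 server kernel: BTRFS info (device sda3): relocating block group 7246746484736 flags data|raid1
Feb 04 02:05:23 server kernel: BTRFS info (device sda3): found 3 extents
Feb 04 02:06:12 server kernel: BTRFS info (device sda3): found 3 extents
Feb 04 02:07:01 server kernel: BTRFS info (device sda3): relocating block group 7059915407360 flags metadata|raid1
"""

PATTERN = re.compile(
    r"^(\w{3} \d{2} \d{2}:\d{2}:\d{2}) .*relocating block group (\d+) flags (\S+)$"
)

def relocation_gaps(log: str, year: int = 2019):
    """Return (block_group, flags, seconds until the next relocation starts)."""
    events = []
    for line in log.splitlines():
        match = PATTERN.match(line)
        if match:
            # The year is assumed; the journal lines only carry month and day.
            stamp = datetime.strptime(f"{year} {match.group(1)}", "%Y %b %d %H:%M:%S")
            events.append((stamp, int(match.group(2)), match.group(3)))
    return [
        (group, flags, (nxt[0] - stamp).total_seconds())
        for (stamp, group, flags), nxt in zip(events, events[1:])
    ]

for group, flags, seconds in relocation_gaps(LOG):
    print(f"block group {group} ({flags}): {seconds:.0f} s")
# block group 7246746484736 (data|raid1): 419 s
```

Roughly seven minutes for a single data block group, when the whole run normally finishes in 1 to 10 minutes, suggests the slowdown began with the very first relocation.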

Can anybody give me a hint as to what causes this?

I suspect some kind of hardware failure but can't find anything. Any 
idea where to look?

My setup:
> Linux server 4.15.0-45-generic #48-Ubuntu SMP Tue Jan 29 16:28:13 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
> 
> btrfs-progs v4.15.1
> 
> Label: 'rootfs'  uuid: cf8c4bb2-6a75-4e1d-983c-19583a93a546
> 
>         Total devices 3 FS bytes used 620.55GiB
>         devid    1 size 923.13GiB used 446.03GiB path /dev/sdc3
>         devid    2 size 923.13GiB used 449.00GiB path /dev/sda3
>         devid    3 size 923.13GiB used 447.03GiB path /dev/sdb3
> 
> Data, RAID1: total=667.00GiB, used=617.65GiB
> System, RAID1: total=32.00MiB, used=176.00KiB
> Metadata, RAID1: total=4.00GiB, used=2.90GiB
> GlobalReserve, single: total=512.00MiB, used=0.00B
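
As a sanity check on the figures above: in RAID1 every chunk is stored twice, so the per-device "used" values should add up to about twice the RAID1 totals. A minimal Python sketch with the numbers copied from the output; the variable names are illustrative, and the two-copies factor is the only assumption:

```python
# Per-device allocation in GiB, from the "devid ... used" lines above.
device_used_gib = {"/dev/sdc3": 446.03, "/dev/sda3": 449.00, "/dev/sdb3": 447.03}

# RAID1 chunk totals in GiB (System is 32.00MiB = 32 / 1024 GiB).
raid1_totals_gib = {"Data": 667.00, "System": 32.00 / 1024, "Metadata": 4.00}

RAID1_COPIES = 2  # assumption: btrfs RAID1 keeps two copies of every chunk

raw_allocated = sum(device_used_gib.values())
expected_raw = RAID1_COPIES * sum(raid1_totals_gib.values())

print(f"allocated on devices: {raw_allocated:.2f} GiB")   # 1342.06 GiB
print(f"expected from totals: {expected_raw:.2f} GiB")    # 1342.06 GiB
```

The two figures agree to within rounding, so the chunk allocation accounting itself looks consistent.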

Dmesg output is not provided because there was nothing in it after the reboot.

Thanks

Moritz


* Re: Help needed, server is unresponsive after btrfs balance
  2019-02-04 11:47 Help needed, server is unresponsive after btrfs balance Moritz M
@ 2019-02-04 11:59 ` Qu Wenruo
  2019-02-04 12:52   ` Moritz M
       [not found]   ` <1a6d00fce82926ce9ec7db7bbab37c12@moritzmueller.ee>
  0 siblings, 2 replies; 5+ messages in thread
From: Qu Wenruo @ 2019-02-04 11:59 UTC
  To: Moritz M, linux-btrfs

On 2019/2/4 7:47 PM, Moritz M wrote:
> Hi,
> 
> I'm running an Ubuntu server with a btrfs RAID1 consisting of three HDDs.
> 
> I do balancing daily via
> 
>> btrfs balance start -dusage=50 -dlimit=2 -musage=50 -mlimit=4 /
> 
> It usually takes between 1 and 10 minutes.
> 
> But today the server was unresponsive (no SSH connection possible, no
> direct login via keyboard possible), even after 7 hours.
> 
> I had a similar situation two weeks ago. I did not find anything and
> finally checked and repaired the filesystem with
> 
>> btrfs check --repair /dev/sda3
> 
> Which found some qgroup related problems:
> 
>> enabling repair mode
>> Checking filesystem on /dev/sda3
>> UUID: cf8c4bb2-6a75-4e1d-983c-19583a93a546
>> No device size related problem found
>> cache and super generation don't match, space cache will be invalidated
>> Counts for qgroup id: 0/257 are different
>> our:        referenced 127300112384 referenced compressed 127300112384
>> disk:        referenced 18446743939800129536 referenced compressed 18446743939800129536
>> diff:        referenced 261209534464 referenced compressed 261209534464
>> our:        exclusive 56360521728 exclusive compressed 56360521728
>> disk:        exclusive 56360521728 exclusive compressed 56360521728
> …
>> Repair qgroup 0/257

You're using qgroups, which are known to cause huge performance overhead
for balance.

We have upcoming patches to solve this, but they will not reach mainline
before the v5.1 kernel.

So please disable qgroups if you're not actively using them.

Thanks,
Qu

> 
> Today I had to boot a Live system, mount the btrfs filessystem with
> -o skip_balance and cancel the balancing there.
> 
> Mounting took ~30 mins and in journalctl of the Live system I found this
> 
>> Feb 04 09:42:28 ubuntu kernel: INFO: task btrfs-transacti:7527 blocked
>> for
>> more than 120 seconds.
>> Feb 04 09:42:28 ubuntu kernel:       Not tainted
>> 4.15.0-29-generic #31-Ubuntu
>> Feb 04 09:42:28 ubuntu kernel: "echo 0 >
>> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> Feb 04 09:42:28 ubuntu kernel: btrfs-transacti D    0  7527      2
>> 0x80000000
>> Feb 04 09:42:28 ubuntu kernel: Call Trace:
>> Feb 04 09:42:28 ubuntu kernel:  __schedule+0x291/0x8a0
>> Feb 04 09:42:28 ubuntu kernel:  schedule+0x2c/0x80
>> Feb 04 09:42:28 ubuntu kernel:  btrfs_commit_transaction+0x81d/0x8f0
>> [btrfs]
>> Feb 04 09:42:28 ubuntu kernel:  ? wait_woken+0x80/0x80
>> Feb 04 09:42:28 ubuntu kernel:  transaction_kthread+0x18d/0x1b0 [btrfs]
>> Feb 04 09:42:28 ubuntu kernel:  kthread+0x121/0x140
>> Feb 04 09:42:28 ubuntu kernel:  ? btrfs_cleanup_transaction+0x560/0x560
>> [btrfs] Feb 04 09:42:28 ubuntu kernel:  ?
>> kthread_create_worker_on_cpu+0x70/0x70 Feb 04 09:42:28 ubuntu kernel:  ?
>> do_syscall_64+0x73/0x130
>> Feb 04 09:42:28 ubuntu kernel:  ? SyS_exit_group+0x14/0x20
> 
> After rebooting the server acted normal. The only thing I could find in
> the journalctl was:
> 
>> Feb 04 02:00:02 server kernel: BTRFS info (device sda3): relocating block
>> group 7246746484736 flags data|raid1
>>
>> Feb 04 02:05:23 server kernel: BTRFS info (device sda3): found 3 extents
>> Feb 04 02:06:12 server kernel: BTRFS info (device sda3): found 3 extents
>> Feb 04 02:07:01 server kernel: BTRFS info (device sda3): relocating block
>> group 7059915407360 flags metadata|raid1
> 
> Btrfs balancing starts at 02:00.
> 
> Can anybody give me a hint what causes this?
> 
> I suspect some kind of hardware failure but can't find anything. Any
> idea where to look?
> 
> My setup:
>> Linux server 4.15.0-45-generic #48-Ubuntu SMP Tue Jan 29 16:28:13 UTC
>> 2019
>> x86_64 x86_64 x86_64 GNU/Linux
>>
>> btrfs-progs v4.15.1
>>
>> Label: 'rootfs'  uuid: cf8c4bb2-6a75-4e1d-983c-19583a93a546
>>
>>         Total devices 3 FS bytes used 620.55GiB
>>         devid    1 size 923.13GiB used 446.03GiB path /dev/sdc3
>>         devid    2 size 923.13GiB used 449.00GiB path /dev/sda3
>>         devid    3 size 923.13GiB used 447.03GiB path /dev/sdb3
>>
>> Data, RAID1: total=667.00GiB, used=617.65GiB
>> System, RAID1: total=32.00MiB, used=176.00KiB
>> Metadata, RAID1: total=4.00GiB, used=2.90GiB
>> GlobalReserve, single: total=512.00MiB, used=0.00B
> 
> Dmesg output is not provided there was nothing after reboot.
> 
> Thanks
> 
> Moritz



* Re: Help needed, server is unresponsive after btrfs balance
  2019-02-04 11:59 ` Qu Wenruo
@ 2019-02-04 12:52   ` Moritz M
       [not found]   ` <1a6d00fce82926ce9ec7db7bbab37c12@moritzmueller.ee>
  1 sibling, 0 replies; 5+ messages in thread
From: Moritz M @ 2019-02-04 12:52 UTC
  To: Qu Wenruo; +Cc: linux-btrfs, linux-btrfs-owner

> 
> You're using qgroups, which are known to cause huge performance overhead
> for balance.
> 
> We have upcoming patches to solve this, but they will not reach mainline
> before the v5.1 kernel.
> 
> So please disable qgroups if you're not actively using them.
> 

Thanks, I was not aware that I had turned it on. Is

btrfs quota disable /

enough for disabling qgroups?


* Re: Help needed, server is unresponsive after btrfs balance
       [not found]     ` <30c7e926-98c7-fd82-f587-1478d31cbf58@gmx.com>
@ 2019-02-05 11:45       ` Moritz M
  2019-02-05 12:17         ` Qu Wenruo
  0 siblings, 1 reply; 5+ messages in thread
From: Moritz M @ 2019-02-05 11:45 UTC
  To: Qu Wenruo; +Cc: linux-btrfs



On 2019-02-04 14:55, Qu Wenruo wrote:
> On 2019/2/4 8:49 PM, Moritz M wrote:
>>> 
>>> You're using qgroups, which are known to cause huge performance
>>> overhead for balance.
>>> 
>>> We have upcoming patches to solve this, but they will not reach
>>> mainline before the v5.1 kernel.
>>> 
>>> So please disable qgroups if you're not actively using them.
>>> 
>> 
>> Thanks, was not aware that I turned it on. Is
>> 
>> btrfs quota disable /
>> 
>> enough for disabling qgroups?
> 
> Yep

Do you have any idea how to verify whether the enabled qgroups caused my
problem?



* Re: Help needed, server is unresponsive after btrfs balance
  2019-02-05 11:45       ` Moritz M
@ 2019-02-05 12:17         ` Qu Wenruo
  0 siblings, 0 replies; 5+ messages in thread
From: Qu Wenruo @ 2019-02-05 12:17 UTC
  To: Moritz M; +Cc: linux-btrfs

On 2019/2/5 7:45 PM, Moritz M wrote:
> Do you have any idea how to validate if the enabled qgroups led to my
> problem?

It's a well-known issue, especially for balance.

In this patch series you can see some of the performance difference:
https://patchwork.kernel.org/cover/10725589/

The final fix should make balance take only 3 seconds for that workload,
exactly the same as with qgroups disabled.

In short, with quota enabled, balance takes 20 s to 2 min (depending on the
kernel version); with quota disabled or with all my optimizations, it takes 3 s.

Thanks,
Qu

