On 2019/2/4 7:47 PM, Moritz M wrote:
> Hi,
>
> I'm running an Ubuntu server with a btrfs RAID1 consisting of three HDDs.
>
> I do balancing daily via
>
>> btrfs balance start -dusage=50 -dlimit=2 -musage=50 -mlimit=4 /
>
> It usually takes between 1 - 10 minutes.
>
> But today the server was unresponsive (no ssh connect possible, no
> direct login via keyboard possible) even after 7 hours.
>
> I had a similar situation two weeks ago. I did not find anything and
> finally checked and repaired the filesystem with
>
>> btrfs check --repair /dev/sda3
>
> Which found some qgroup related problems:
>
>> enabling repair mode
>> Checking filesystem on /dev/sda3
>> UUID: cf8c4bb2-6a75-4e1d-983c-19583a93a546
>> No device size related problem found
>> cache and super generation don't match, space cache will be invalidated
>> Counts for qgroup id: 0/257 are different
>> our:        referenced 127300112384 referenced compressed 127300112384
>> disk:        referenced 18446743939800129536 referenced compressed 18446743939800129536
>> diff:        referenced 261209534464 referenced compressed 261209534464
>> our:        exclusive 56360521728 exclusive compressed 56360521728
>> disk:        exclusive 56360521728 exclusive compressed 56360521728
> …
>> Repair qgroup 0/257

You're using qgroups, which are known to cause huge performance overhead
for balance.

We have upcoming patches to solve this, but they are not going to reach
mainline before the v5.1 kernel.

So please disable qgroups if you're not actively using them.

Thanks,
Qu

>
> Today I had to boot a Live system, mount the btrfs filesystem with
> -o skip_balance and cancel the balancing there.
>
> Mounting took ~30 mins and in journalctl of the Live system I found this
>
>> Feb 04 09:42:28 ubuntu kernel: INFO: task btrfs-transacti:7527 blocked for more than 120 seconds.
>> Feb 04 09:42:28 ubuntu kernel:       Not tainted 4.15.0-29-generic #31-Ubuntu
>> Feb 04 09:42:28 ubuntu kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> Feb 04 09:42:28 ubuntu kernel: btrfs-transacti D    0  7527      2 0x80000000
>> Feb 04 09:42:28 ubuntu kernel: Call Trace:
>> Feb 04 09:42:28 ubuntu kernel:  __schedule+0x291/0x8a0
>> Feb 04 09:42:28 ubuntu kernel:  schedule+0x2c/0x80
>> Feb 04 09:42:28 ubuntu kernel:  btrfs_commit_transaction+0x81d/0x8f0 [btrfs]
>> Feb 04 09:42:28 ubuntu kernel:  ? wait_woken+0x80/0x80
>> Feb 04 09:42:28 ubuntu kernel:  transaction_kthread+0x18d/0x1b0 [btrfs]
>> Feb 04 09:42:28 ubuntu kernel:  kthread+0x121/0x140
>> Feb 04 09:42:28 ubuntu kernel:  ? btrfs_cleanup_transaction+0x560/0x560 [btrfs]
>> Feb 04 09:42:28 ubuntu kernel:  ? kthread_create_worker_on_cpu+0x70/0x70
>> Feb 04 09:42:28 ubuntu kernel:  ? do_syscall_64+0x73/0x130
>> Feb 04 09:42:28 ubuntu kernel:  ? SyS_exit_group+0x14/0x20
>
> After rebooting the server acted normal. The only thing I could find in
> the journalctl was:
>
>> Feb 04 02:00:02 server kernel: BTRFS info (device sda3): relocating block group 7246746484736 flags data|raid1
>>
>> Feb 04 02:05:23 server kernel: BTRFS info (device sda3): found 3 extents
>> Feb 04 02:06:12 server kernel: BTRFS info (device sda3): found 3 extents
>> Feb 04 02:07:01 server kernel: BTRFS info (device sda3): relocating block group 7059915407360 flags metadata|raid1
>
> Btrfs balancing starts at 02:00.
>
> Can anybody give me a hint what causes this?
>
> I suspect some kind of hardware failure but can't find anything. Any
> idea where to look?
>
> My setup:
>> Linux server 4.15.0-45-generic #48-Ubuntu SMP Tue Jan 29 16:28:13 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
>>
>> btrfs-progs v4.15.1
>>
>> Label: 'rootfs'  uuid: cf8c4bb2-6a75-4e1d-983c-19583a93a546
>>
>>         Total devices 3 FS bytes used 620.55GiB
>>         devid    1 size 923.13GiB used 446.03GiB path /dev/sdc3
>>         devid    2 size 923.13GiB used 449.00GiB path /dev/sda3
>>         devid    3 size 923.13GiB used 447.03GiB path /dev/sdb3
>>
>> Data, RAID1: total=667.00GiB, used=617.65GiB
>> System, RAID1: total=32.00MiB, used=176.00KiB
>> Metadata, RAID1: total=4.00GiB, used=2.90GiB
>> GlobalReserve, single: total=512.00MiB, used=0.00B
>
> Dmesg output is not provided because there was nothing after reboot.
>
> Thanks
>
> Moritz
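
A side note on the btrfs check output quoted above: the on-disk "referenced" count for qgroup 0/257 (18446743939800129536) sits just below 2^64, which looks like a negative number stored in an unsigned 64-bit field. The short Python check below is only an arithmetic sketch. The helper name as_signed_64 is made up for illustration, and reading the field as two's-complement is an assumption, not something confirmed against the btrfs-progs source. Under that reading, subtracting the on-disk value from the "our" count reproduces the reported diff.

```python
# Values copied from the quoted btrfs check output for qgroup 0/257.
OUR_REFERENCED = 127300112384
DISK_REFERENCED = 18446743939800129536
REPORTED_DIFF = 261209534464


def as_signed_64(value: int) -> int:
    """Reinterpret an unsigned 64-bit integer as two's-complement signed.

    Hypothetical helper for illustration; not part of any btrfs tool.
    """
    value &= (1 << 64) - 1
    return value - (1 << 64) if value >= (1 << 63) else value


# Assumption: the on-disk counter is a wrapped (underflowed) 64-bit value.
disk_signed = as_signed_64(DISK_REFERENCED)

print(disk_signed)                   # -133909422080
print(OUR_REFERENCED - disk_signed)  # 261209534464, equal to REPORTED_DIFF
```

If that reading is right, the on-disk qgroup accounting had drifted below zero, which fits the qgroup mismatch that btrfs check reported and repaired.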