linux-btrfs.vger.kernel.org archive mirror
* BUG at fs/btrfs/relocation.c:794!
@ 2020-06-30 22:10 David Sterba
  2020-07-23 21:56 ` Zygo Blaxell
  0 siblings, 1 reply; 13+ messages in thread
From: David Sterba @ 2020-06-30 22:10 UTC
  To: linux-btrfs; +Cc: wqu

Hi,

I've hit a crash in relocation I've never seen before.

[ 2129.210066] kernel BUG at fs/btrfs/relocation.c:794!
[ 2129.215268] invalid opcode: 0000 [#1] PREEMPT SMP
[ 2129.220114] CPU: 1 PID: 3303 Comm: btrfs Not tainted 5.8.0-rc3-git+ #638
[ 2129.220116] Hardware name: empty empty/S3993, BIOS PAQEX0-3 02/24/2008
[ 2129.220265] RIP: 0010:create_reloc_root+0x214/0x260 [btrfs]
[ 2129.258760] RSP: 0018:ffffbe1e809b38b8 EFLAGS: 00010282
[ 2129.258763] RAX: 00000000ffffffef RBX: ffff988d577f9000 RCX: 0000000000000000
[ 2129.258765] RDX: 0000000000000001 RSI: ffffffff8e2a2580 RDI: ffff988d64aaa6a8
[ 2129.258766] RBP: ffff988d5dfcdc00 R08: 0000000000000000 R09: 0000000000000000
[ 2129.258767] R10: 0000000000000001 R11: 0000000000000000 R12: ffff988d0e02fa78
[ 2129.258769] R13: 0000000000000005 R14: ffff988d64fe8000 R15: ffff988d0e02fa78
[ 2129.258771] FS:  00007f82a612e8c0(0000) GS:ffff988d67000000(0000) knlGS:0000000000000000
[ 2129.258772] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2129.258774] CR2: 000000000559d028 CR3: 000000020b289000 CR4: 00000000000006e0
[ 2129.258775] Call Trace:
[ 2129.258825]  btrfs_init_reloc_root+0xe8/0x120 [btrfs]
[ 2129.258862]  record_root_in_trans+0xae/0xd0 [btrfs]
[ 2129.258901]  btrfs_record_root_in_trans+0x51/0x70 [btrfs]
[ 2129.340388]  select_reloc_root+0x94/0x340 [btrfs]
[ 2129.340433]  do_relocation+0xda/0x7b0 [btrfs]
[ 2129.349854]  ? _raw_spin_unlock+0x1f/0x40
[ 2129.349898]  relocate_tree_blocks+0x336/0x670 [btrfs]
[ 2129.359325]  relocate_block_group+0x2f6/0x600 [btrfs]
[ 2129.359365]  btrfs_relocate_block_group+0x15e/0x340 [btrfs]
[ 2129.359408]  btrfs_relocate_chunk+0x38/0x110 [btrfs]
[ 2129.375494]  __btrfs_balance+0x42c/0xce0 [btrfs]
[ 2129.375553]  btrfs_balance+0x66a/0xbe0 [btrfs]
[ 2129.375562]  ? kmem_cache_alloc_trace+0x19c/0x330
[ 2129.389852]  btrfs_ioctl_balance+0x298/0x350 [btrfs]
[ 2129.389887]  btrfs_ioctl+0x304/0x2490 [btrfs]
[ 2129.389898]  ? do_user_addr_fault+0x221/0x49c
[ 2129.404070]  ? sched_clock_cpu+0x15/0x140
[ 2129.404073]  ? do_user_addr_fault+0x221/0x49c
[ 2129.404079]  ? up_read+0x18/0x240
[ 2129.404086]  ? ksys_ioctl+0x68/0xa0
[ 2129.404091]  ksys_ioctl+0x68/0xa0
[ 2129.423308]  __x64_sys_ioctl+0x16/0x20
[ 2129.423312]  do_syscall_64+0x50/0xe0
[ 2129.423315]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 2129.423318] RIP: 0033:0x7f82a51c6327
[ 2129.423319] Code: Bad RIP value.
[ 2129.423348] RSP: 002b:00007ffd32cf6218 EFLAGS: 00000206 ORIG_RAX: 0000000000000010
[ 2129.423367] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f82a51c6327
[ 2129.423368] RDX: 00007ffd32cf62a0 RSI: 00000000c4009420 RDI: 0000000000000003
[ 2129.423372] RBP: 0000000000000003 R08: 0000000000000000 R09: 0000000000000000
[ 2129.423377] R10: 000000000fa99fa0 R11: 0000000000000206 R12: 00007ffd32cf8823
[ 2129.423379] R13: 00007ffd32cf62a0 R14: 0000000000000001 R15: 0000000000000000

Relevant code called from create_reloc_root:

        ret = btrfs_insert_root(trans, fs_info->tree_root,
                                &root_key, root_item);
        BUG_ON(ret);

and according to EAX (0xffffffef), ret is -17, which is -EEXIST.

I don't have a reproducer. The testing image was filled with random git
checkouts, deduplicated by BEES, then tons of snapshots were created until the
metadata got exhausted, followed by some file deletion and balances.

This is the same image that led to the patch "btrfs: allow use of global block
reserve for balance item deletion", so this could have left it in some
intermediate state where neither the balance item nor the reloc tree was
removed.

There were a few unsuccessful mounts due to relocation recovery, which I was
trying to debug, but then it started to work.

The error happened with this 'fi df' saved after the balance start:

# btrfs fi df mnt
Data, single: total=80.01GiB, used=38.67GiB
System, single: total=4.00MiB, used=16.00KiB
Metadata, single: total=19.99GiB, used=19.46GiB
GlobalReserve, single: total=512.00MiB, used=44.00KiB

The error looks like a repeated relocation tree creation, which would point to
the unsuccessful balances or an inconsistent state (balance item, reloc trees).
It's not a "typical" mix of operations but I'd appreciate any insights here.


* Re: BUG at fs/btrfs/relocation.c:794!
  2020-06-30 22:10 BUG at fs/btrfs/relocation.c:794! David Sterba
@ 2020-07-23 21:56 ` Zygo Blaxell
  2020-07-24  0:19   ` Qu Wenruo
  0 siblings, 1 reply; 13+ messages in thread
From: Zygo Blaxell @ 2020-07-23 21:56 UTC
  To: David Sterba; +Cc: linux-btrfs, wqu

On Wed, Jul 01, 2020 at 12:10:06AM +0200, David Sterba wrote:
> Hi,
> 
> I've hit a crash in relocation I've never seen before.
> 
> [ 2129.210066] kernel BUG at fs/btrfs/relocation.c:794!

I hit an issue yesterday that reminded me of this.

> [ 2129.215268] invalid opcode: 0000 [#1] PREEMPT SMP
> [ 2129.220114] CPU: 1 PID: 3303 Comm: btrfs Not tainted 5.8.0-rc3-git+ #638
> [ 2129.220116] Hardware name: empty empty/S3993, BIOS PAQEX0-3 02/24/2008
> [ 2129.220265] RIP: 0010:create_reloc_root+0x214/0x260 [btrfs]
> [ 2129.258760] RSP: 0018:ffffbe1e809b38b8 EFLAGS: 00010282
> [ 2129.258763] RAX: 00000000ffffffef RBX: ffff988d577f9000 RCX: 0000000000000000
> [ 2129.258765] RDX: 0000000000000001 RSI: ffffffff8e2a2580 RDI: ffff988d64aaa6a8
> [ 2129.258766] RBP: ffff988d5dfcdc00 R08: 0000000000000000 R09: 0000000000000000
> [ 2129.258767] R10: 0000000000000001 R11: 0000000000000000 R12: ffff988d0e02fa78
> [ 2129.258769] R13: 0000000000000005 R14: ffff988d64fe8000 R15: ffff988d0e02fa78
> [ 2129.258771] FS:  00007f82a612e8c0(0000) GS:ffff988d67000000(0000) knlGS:0000000000000000
> [ 2129.258772] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 2129.258774] CR2: 000000000559d028 CR3: 000000020b289000 CR4: 00000000000006e0
> [ 2129.258775] Call Trace:
> [ 2129.258825]  btrfs_init_reloc_root+0xe8/0x120 [btrfs]
> [ 2129.258862]  record_root_in_trans+0xae/0xd0 [btrfs]
> [ 2129.258901]  btrfs_record_root_in_trans+0x51/0x70 [btrfs]
> [ 2129.340388]  select_reloc_root+0x94/0x340 [btrfs]
> [ 2129.340433]  do_relocation+0xda/0x7b0 [btrfs]
> [ 2129.349854]  ? _raw_spin_unlock+0x1f/0x40
> [ 2129.349898]  relocate_tree_blocks+0x336/0x670 [btrfs]
> [ 2129.359325]  relocate_block_group+0x2f6/0x600 [btrfs]
> [ 2129.359365]  btrfs_relocate_block_group+0x15e/0x340 [btrfs]
> [ 2129.359408]  btrfs_relocate_chunk+0x38/0x110 [btrfs]
> [ 2129.375494]  __btrfs_balance+0x42c/0xce0 [btrfs]
> [ 2129.375553]  btrfs_balance+0x66a/0xbe0 [btrfs]
> [ 2129.375562]  ? kmem_cache_alloc_trace+0x19c/0x330
> [ 2129.389852]  btrfs_ioctl_balance+0x298/0x350 [btrfs]
> [ 2129.389887]  btrfs_ioctl+0x304/0x2490 [btrfs]
> [ 2129.389898]  ? do_user_addr_fault+0x221/0x49c
> [ 2129.404070]  ? sched_clock_cpu+0x15/0x140
> [ 2129.404073]  ? do_user_addr_fault+0x221/0x49c
> [ 2129.404079]  ? up_read+0x18/0x240
> [ 2129.404086]  ? ksys_ioctl+0x68/0xa0
> [ 2129.404091]  ksys_ioctl+0x68/0xa0
> [ 2129.423308]  __x64_sys_ioctl+0x16/0x20
> [ 2129.423312]  do_syscall_64+0x50/0xe0
> [ 2129.423315]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [ 2129.423318] RIP: 0033:0x7f82a51c6327
> [ 2129.423319] Code: Bad RIP value.
> [ 2129.423348] RSP: 002b:00007ffd32cf6218 EFLAGS: 00000206 ORIG_RAX: 0000000000000010
> [ 2129.423367] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f82a51c6327
> [ 2129.423368] RDX: 00007ffd32cf62a0 RSI: 00000000c4009420 RDI: 0000000000000003
> [ 2129.423372] RBP: 0000000000000003 R08: 0000000000000000 R09: 0000000000000000
> [ 2129.423377] R10: 000000000fa99fa0 R11: 0000000000000206 R12: 00007ffd32cf8823
> [ 2129.423379] R13: 00007ffd32cf62a0 R14: 0000000000000001 R15: 0000000000000000
> 
> Relevant code called from create_reloc_root:
> 
>         ret = btrfs_insert_root(trans, fs_info->tree_root,
>                                 &root_key, root_item);
>         BUG_ON(ret)
> 
> and according to EAX, ret is -17 which is EEXIST.
> 
> I don't have a reproducer, the testing image has been filled by random git
> checkouts, deduplicated by BEES, then tons of snapshots created until the
> metadata got exhausted, some file deletion and balances.

Mine is rsync, bees, lots of snapshots, balances, scrubs.  I recently also
added random 'killall -INT btrfs' to send balance some fatal signals.

> This is the same image that led to the patch "btrfs: allow use of global block
> reserve for balance item deletion", so this could have left it in some
> intermediate state where the balance item was not removed and the reloc tree as
> well.
> 
> There were a few unsuccessful mounts due to relocation recovery, that was
> trying to debug but then it started to work.
> 
> The error happened with this 'fi df' saved after the balance start:
> 
> # btrfs fi df mnt
> Data, single: total=80.01GiB, used=38.67GiB
> System, single: total=4.00MiB, used=16.00KiB
> Metadata, single: total=19.99GiB, used=19.46GiB
> GlobalReserve, single: total=512.00MiB, used=44.00KiB

Mine is:

	Data, single: total=1.75TiB, used=1.74TiB
	System, RAID1: total=32.00MiB, used=208.00KiB
	Metadata, RAID1: total=25.00GiB, used=22.89GiB
	GlobalReserve, single: total=512.00MiB, used=0.00B

though this is some time after the failure (and a reboot).  I do notice
that there's lots of unallocated space, but metadata usage is close
to allocated, and I have been experiencing a lot of EROFS events when
that happens, even if there's gigabytes unallocated.

btrfs fi us:

	Overall:
	    Device size:                   2.00TiB
	    Device allocated:              1.80TiB
	    Device unallocated:          208.94GiB
	    Device missing:                  0.00B
	    Used:                          1.79TiB
	    Free (estimated):            211.30GiB      (min: 106.83GiB)
	    Data ratio:                       1.00
	    Metadata ratio:                   2.00
	    Global reserve:              512.00MiB      (used: 0.00B)

	Data,single: Size:1.75TiB, Used:1.74TiB (99.87%)
	   /dev/mapper/vgtest-tvdb        894.00GiB
	   /dev/mapper/vgtest-tvdc        895.00GiB

	Metadata,RAID1: Size:25.00GiB, Used:22.87GiB (91.47%)
	   /dev/mapper/vgtest-tvdb         25.00GiB
	   /dev/mapper/vgtest-tvdc         25.00GiB

	System,RAID1: Size:32.00MiB, Used:208.00KiB (0.63%)
	   /dev/mapper/vgtest-tvdb         32.00MiB
	   /dev/mapper/vgtest-tvdc         32.00MiB

	Unallocated:
	   /dev/mapper/vgtest-tvdb        104.97GiB
	   /dev/mapper/vgtest-tvdc        103.97GiB

> The error looks like a repeated relocation tree creation, which would point to
> the unsuccessful balances or inconsistent state (balance item, reloc trees).
> It's not a "typical" mix of operations but I'd appreciate any insights here.

I have the same line but different call stack, with misc-next
e3027d10af42d24940be74dabaf1550cd770bd48:

	[ 9717.746937][T13609] BTRFS info (device dm-0): balance: start -mlimit=1 -slimit=1
	[ 9717.765086][T13609] BTRFS info (device dm-0): relocating block group 10991411658752 flags metadata|raid1
	[ 9718.511137][T13609] ------------[ cut here ]------------
	[ 9718.512293][T13609] kernel BUG at fs/btrfs/relocation.c:794!
	[ 9718.513421][T13609] invalid opcode: 0000 [#1] SMP KASAN PTI
	[ 9718.514590][T13609] CPU: 1 PID: 13609 Comm: btrfs Tainted: G        W         5.8.0-6582a95aabfe+ #44
	[ 9718.516178][T13609] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
	[ 9718.517750][T13609] RIP: 0010:create_reloc_root+0x468/0x480
	[ 9718.518717][T13609] Code: e8 bd 5b bd ff 4d 8b 76 50 be 08 00 00 00 49 8d bc 24 f0 00 00 00 e8 c7 5b bd ff 4d 89 b4 24 f0 00 00 00 e9 ee fc ff ff 0f 0b <0f> 0b 0f 0b 0f 0b 0f 0b e8 9b df 07 01 66 66 2e 0f 1f 84 00 00 00
	[ 9718.521995][T13609] RSP: 0018:ffffc900018e7018 EFLAGS: 00010282
	[ 9718.522991][T13609] RAX: 00000000ffffffef RBX: ffff8881e103a400 RCX: 0000000000000000
	[ 9718.524300][T13609] RDX: dffffc0000000000 RSI: 0000000000000000 RDI: 0000000000000246
	[ 9718.525612][T13609] RBP: ffffc900018e7108 R08: 0000000000000000 R09: 0000000000000001
	[ 9718.527056][T13609] R10: 0000000000000001 R11: fffffbfff3dfb081 R12: ffff8881f37c8020
	[ 9718.528386][T13609] R13: ffff88801fbc5b28 R14: ffff8881f37c8000 R15: ffffc900018e70a0
	[ 9718.529756][T13609] FS:  00007f9577d928c0(0000) GS:ffff8881f5800000(0000) knlGS:0000000000000000
	[ 9718.531211][T13609] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
	[ 9718.532295][T13609] CR2: 00007f9823e35500 CR3: 00000000a52e0002 CR4: 00000000001606e0
	[ 9718.533608][T13609] Call Trace:
	[ 9718.534151][T13609]  ? update_backref_node+0xf0/0xf0
	[ 9718.535137][T13609]  ? check_chain_key+0x1e6/0x2e0
	[ 9718.536057][T13609]  btrfs_init_reloc_root+0x2d7/0x310
	[ 9718.537016][T13609]  ? find_reloc_root+0x200/0x200
	[ 9718.537992][T13609]  ? do_raw_spin_unlock+0xa8/0x140
	[ 9718.538899][T13609]  record_root_in_trans+0x18c/0x1d0
	[ 9718.539848][T13609]  btrfs_record_root_in_trans+0x8b/0xc0
	[ 9718.540843][T13609]  select_reloc_root+0x15f/0x6a0
	[ 9718.541943][T13609]  ? create_reloc_inode.isra.28+0x410/0x410
	[ 9718.543066][T13609]  ? rcu_read_lock_sched_held+0xa1/0xd0
	[ 9718.544333][T13609]  ? check_flags.part.44+0x86/0x220
	[ 9718.545186][T13609]  ? check_flags+0x26/0x30
	[ 9718.545870][T13609]  ? lock_is_held_type+0xc9/0x100
	[ 9718.546651][T13609]  do_relocation+0x242/0xc90
	[ 9718.547372][T13609]  ? select_reloc_root+0x6a0/0x6a0
	[ 9718.548160][T13609]  ? check_flags.part.44+0x86/0x220
	[ 9718.548969][T13609]  ? __kasan_check_read+0x11/0x20
	[ 9718.549745][T13609]  ? mark_lock+0xa8/0x440
	[ 9718.550426][T13609]  ? mark_held_locks+0x8d/0xb0
	[ 9718.551165][T13609]  ? btrfs_backref_cleanup_node+0x5c1/0x600
	[ 9718.552079][T13609]  ? memcpy+0x4d/0x60
	[ 9718.552694][T13609]  ? read_extent_buffer+0xcc/0x120
	[ 9718.553478][T13609]  relocate_tree_blocks+0xa29/0xb00
	[ 9718.554255][T13609]  ? do_relocation+0xc90/0xc90
	[ 9718.554978][T13609]  ? kmem_cache_alloc_trace+0x5af/0x740
	[ 9718.555855][T13609]  ? free_extent_buffer.part.46+0x90/0x140
	[ 9718.556756][T13609]  ? rb_insert_color+0x342/0x360
	[ 9718.557581][T13609]  ? free_extent_buffer+0x13/0x20
	[ 9718.558445][T13609]  ? add_tree_block.isra.34+0x236/0x2b0
	[ 9718.559387][T13609]  relocate_block_group+0x52e/0x830
	[ 9718.560275][T13609]  ? merge_reloc_roots+0x4b0/0x4b0
	[ 9718.561137][T13609]  btrfs_relocate_block_group+0x26e/0x4c0
	[ 9718.562137][T13609]  btrfs_relocate_chunk+0x52/0x120
	[ 9718.562918][T13609]  btrfs_balance+0xe22/0x1910
	[ 9718.563605][T13609]  ? check_chain_key+0x1e6/0x2e0
	[ 9718.564331][T13609]  ? btrfs_relocate_chunk+0x120/0x120
	[ 9718.565126][T13609]  ? kmem_cache_alloc_trace+0x5af/0x740
	[ 9718.565943][T13609]  ? _copy_from_user+0x95/0xd0
	[ 9718.566649][T13609]  btrfs_ioctl_balance+0x3de/0x4c0
	[ 9718.567414][T13609]  btrfs_ioctl+0x2385/0x4250
	[ 9718.568090][T13609]  ? __kasan_check_read+0x11/0x20
	[ 9718.568830][T13609]  ? check_chain_key+0x1e6/0x2e0
	[ 9718.569619][T13609]  ? btrfs_ioctl_get_supported_features+0x30/0x30
	[ 9718.570658][T13609]  ? kvm_sched_clock_read+0x18/0x30
	[ 9718.571526][T13609]  ? check_chain_key+0x1e6/0x2e0
	[ 9718.572348][T13609]  ? lock_downgrade+0x3e0/0x3e0
	[ 9718.573121][T13609]  ? do_vfs_ioctl+0xfc/0x9d0
	[ 9718.573835][T13609]  ? ioctl_file_clone+0xe0/0xe0
	[ 9718.574637][T13609]  ? check_flags.part.44+0x86/0x220
	[ 9718.575472][T13609]  ? check_flags+0x26/0x30
	[ 9718.576190][T13609]  ? lock_is_held_type+0xc9/0x100
	[ 9718.576990][T13609]  ? check_flags.part.44+0x86/0x220
	[ 9718.577836][T13609]  ? check_flags+0x26/0x30
	[ 9718.578542][T13609]  ? lock_is_held_type+0xc9/0x100
	[ 9718.579403][T13609]  ? __kasan_check_read+0x11/0x20
	[ 9718.580225][T13609]  ? __fget_light+0xae/0x110
	[ 9718.580983][T13609]  ksys_ioctl+0xa1/0xe0
	[ 9718.581628][T13609]  __x64_sys_ioctl+0x43/0x50
	[ 9718.582334][T13609]  do_syscall_64+0x60/0xf0
	[ 9718.583285][T13609]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
	[ 9718.584378][T13609] RIP: 0033:0x7f9577e85427
	[ 9718.585289][T13609] Code: Bad RIP value.
	[ 9718.586076][T13609] RSP: 002b:00007ffdc7b82548 EFLAGS: 00000206 ORIG_RAX: 0000000000000010
	[ 9718.587896][T13609] RAX: ffffffffffffffda RBX: 00007ffdc7b825e8 RCX: 00007f9577e85427
	[ 9718.589391][T13609] RDX: 00007ffdc7b825e8 RSI: 00000000c4009420 RDI: 0000000000000003
	[ 9718.590817][T13609] RBP: 0000000000000003 R08: 0000000000000003 R09: 0000000000000078
	[ 9718.592631][T13609] R10: fffffffffffff31c R11: 0000000000000206 R12: 0000000000000001
	[ 9718.594405][T13609] R13: 0000000000000000 R14: 00007ffdc7b84a48 R15: 0000000000000001
	[ 9718.596109][T13609] Modules linked in:
	[ 9718.597056][T13609] ---[ end trace 2cf173f8217fc093 ]---
	[ 9718.598018][T13609] RIP: 0010:create_reloc_root+0x468/0x480
	[ 9718.602850][T13609] Code: e8 bd 5b bd ff 4d 8b 76 50 be 08 00 00 00 49 8d bc 24 f0 00 00 00 e8 c7 5b bd ff 4d 89 b4 24 f0 00 00 00 e9 ee fc ff ff 0f 0b <0f> 0b 0f 0b 0f 0b 0f 0b e8 9b df 07 01 66 66 2e 0f 1f 84 00 00 00
	[ 9718.613371][T13609] RSP: 0018:ffffc900018e7018 EFLAGS: 00010282
	[ 9718.621286][T13609] RAX: 00000000ffffffef RBX: ffff8881e103a400 RCX: 0000000000000000
	[ 9718.631255][T13609] RDX: dffffc0000000000 RSI: 0000000000000000 RDI: 0000000000000246
	[ 9718.639764][T13609] RBP: ffffc900018e7108 R08: 0000000000000000 R09: 0000000000000001
	[ 9718.641533][T13609] R10: 0000000000000001 R11: fffffbfff3dfb081 R12: ffff8881f37c8020
	[ 9718.643173][T13609] R13: ffff88801fbc5b28 R14: ffff8881f37c8000 R15: ffffc900018e70a0
	[ 9718.644840][T13609] FS:  00007f9577d928c0(0000) GS:ffff8881f5800000(0000) knlGS:0000000000000000
	[ 9718.646728][T13609] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
	[ 9718.648607][T13609] CR2: 00007f9823e35500 CR3: 00000000a52e0002 CR4: 00000000001606e0
	[ 9718.869689][ T4545] ==================================================================

same line, different call stack:

	0xffffffff81933dd8 is in create_reloc_root (fs/btrfs/relocation.c:794).
	789             btrfs_tree_unlock(eb);
	790             free_extent_buffer(eb);
	791
	792             ret = btrfs_insert_root(trans, fs_info->tree_root,
	793                                     &root_key, root_item);
	794             BUG_ON(ret);
	795             kfree(root_item);
	796
	797             reloc_root = btrfs_read_tree_root(fs_info->tree_root, &root_key);
	798             BUG_ON(IS_ERR(reloc_root));

followed by

	[ 9718.869689][ T4545] ==================================================================
	[ 9718.871333][ T4545] BUG: KASAN: use-after-free in __mutex_lock+0x202/0xce0
	[ 9718.872483][ T4545] Read of size 4 at addr ffff888014e9402c by task crawl_28443/4545
	[ 9718.873746][ T4545] 
	[ 9718.874106][ T4545] CPU: 1 PID: 4545 Comm: crawl_28443 Tainted: G      D W         5.8.0-6582a95aabfe+ #44
	[ 9718.875684][ T4545] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
	[ 9718.877149][ T4545] Call Trace:
	[ 9718.877655][ T4545]  dump_stack+0xc8/0x11a
	[ 9718.878317][ T4545]  ? __mutex_lock+0x202/0xce0
	[ 9718.879065][ T4545]  print_address_description.constprop.8+0x1f/0x200
	[ 9718.880167][ T4545]  ? __mutex_lock+0x202/0xce0
	[ 9718.880916][ T4545]  ? __mutex_lock+0x202/0xce0
	[ 9718.881666][ T4545]  kasan_report.cold.11+0x20/0x3e
	[ 9718.882483][ T4545]  ? __mutex_lock+0x202/0xce0
	[ 9718.883229][ T4545]  __asan_load4+0x69/0x90
	[ 9718.883920][ T4545]  __mutex_lock+0x202/0xce0
	[ 9718.884651][ T4545]  ? wait_current_trans+0xb7/0x230
	[ 9718.885465][ T4545]  ? btrfs_record_root_in_trans+0x7e/0xc0
	[ 9718.886388][ T4545]  ? mutex_lock_io_nested+0xc20/0xc20
	[ 9718.887246][ T4545]  ? __kasan_check_read+0x11/0x20
	[ 9718.888035][ T4545]  ? join_transaction+0x32/0x6f0
	[ 9718.888854][ T4545]  ? join_transaction+0x1a6/0x6f0
	[ 9718.889679][ T4545]  ? lock_downgrade+0x3e0/0x3e0
	[ 9718.890496][ T4545]  ? __kasan_check_write+0x14/0x20
	[ 9718.891308][ T4545]  ? lock_contended+0x720/0x720
	[ 9718.892093][ T4545]  ? do_raw_spin_lock+0x1e0/0x1e0
	[ 9718.892912][ T4545]  ? wait_current_trans+0xb7/0x230
	[ 9718.893705][ T4545]  mutex_lock_nested+0x1b/0x20
	[ 9718.894494][ T4545]  ? mutex_lock_nested+0x1b/0x20
	[ 9718.895317][ T4545]  btrfs_record_root_in_trans+0x7e/0xc0
	[ 9718.896245][ T4545]  start_transaction+0x189/0x8f0
	[ 9718.897081][ T4545]  btrfs_start_transaction+0x1e/0x20
	[ 9718.897941][ T4545]  btrfs_cont_expand+0x549/0x7a0
	[ 9718.898805][ T4545]  ? btrfs_truncate_block+0x930/0x930
	[ 9718.899665][ T4545]  ? inode_newsize_ok+0x75/0xc0
	[ 9718.900438][ T4545]  ? setattr_prepare+0x9c/0x310
	[ 9718.901242][ T4545]  btrfs_setattr+0x514/0x850
	[ 9718.902035][ T4545]  ? current_time+0x8c/0xe0
	[ 9718.902799][ T4545]  notify_change+0x4ec/0x700
	[ 9718.903584][ T4545]  ? do_sys_ftruncate+0x108/0x220
	[ 9718.904459][ T4545]  do_truncate+0xe4/0x160
	[ 9718.905200][ T4545]  ? __x64_sys_openat2+0x170/0x170
	[ 9718.906116][ T4545]  ? __sb_start_write+0x1a1/0x270
	[ 9718.906954][ T4545]  do_sys_ftruncate+0x1b8/0x220
	[ 9718.907759][ T4545]  __x64_sys_ftruncate+0x36/0x40
	[ 9718.908577][ T4545]  do_syscall_64+0x60/0xf0
	[ 9718.909292][ T4545]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
	[ 9718.910521][ T4545] RIP: 0033:0x7f201fcab947
	[ 9718.911247][ T4545] Code: Bad RIP value.
	[ 9718.911915][ T4545] RSP: 002b:00007f201d3abeb8 EFLAGS: 00000202 ORIG_RAX: 000000000000004d
	[ 9718.913285][ T4545] RAX: ffffffffffffffda RBX: 00007f201d3abfa0 RCX: 00007f201fcab947
	[ 9718.914613][ T4545] RDX: 000000005f18a6d2 RSI: 0000000000286000 RDI: 0000000000000ec1
	[ 9718.915921][ T4545] RBP: 00007f1fb01c2f00 R08: 00007ffe1e345080 R09: 00000000011b1f78
	[ 9718.917236][ T4545] R10: 00000000011b1f78 R11: 0000000000000202 R12: 00007f201d3abf20
	[ 9718.918556][ T4545] R13: 00007f201d3abef0 R14: 00007f201d3abf50 R15: 00007f201d3abed0
	[ 9718.919882][ T4545] 
	[ 9718.920268][ T4545] Allocated by task 6732:
	[ 9718.920973][ T4545]  save_stack+0x21/0x50
	[ 9718.921648][ T4545]  __kasan_kmalloc.constprop.17+0xc1/0xd0
	[ 9718.922580][ T4545]  kasan_slab_alloc+0x12/0x20
	[ 9718.923345][ T4545]  kmem_cache_alloc_node+0x113/0x720
	[ 9718.924203][ T4545]  copy_process+0x357/0x3680
	[ 9718.924955][ T4545]  _do_fork+0xed/0x880
	[ 9718.925622][ T4545]  __do_sys_clone+0xee/0x130
	[ 9718.926369][ T4545]  __x64_sys_clone+0x67/0x80
	[ 9718.927119][ T4545]  do_syscall_64+0x60/0xf0
	[ 9718.927848][ T4545]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
	[ 9718.928812][ T4545] 
	[ 9718.929173][ T4545] Freed by task 24:
	[ 9718.929787][ T4545]  save_stack+0x21/0x50
	[ 9718.930453][ T4545]  __kasan_slab_free+0x118/0x170
	[ 9718.931242][ T4545]  kasan_slab_free+0xe/0x10
	[ 9718.931970][ T4545]  kmem_cache_free+0x5f/0x280
	[ 9718.932730][ T4545]  free_task+0x73/0x90
	[ 9718.933391][ T4545]  __put_task_struct+0x199/0x1d0
	[ 9718.934187][ T4545]  delayed_put_task_struct+0x124/0x1b0
	[ 9718.935071][ T4545]  rcu_core+0x3b0/0xeb0
	[ 9718.935758][ T4545]  rcu_core_si+0xe/0x10
	[ 9718.936433][ T4545]  __do_softirq+0x120/0x5e3
	[ 9718.937165][ T4545] 
	[ 9718.937545][ T4545] The buggy address belongs to the object at ffff888014e94000
	[ 9718.937545][ T4545]  which belongs to the cache task_struct(168:screen-wrapper.service) of size 11072
	[ 9718.940391][ T4545] The buggy address is located 44 bytes inside of
	[ 9718.940391][ T4545]  11072-byte region [ffff888014e94000, ffff888014e96b40)
	[ 9718.942559][ T4545] The buggy address belongs to the page:
	[ 9718.943454][ T4545] page:ffffea000053a500 refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff888014e97fff head:ffffea000053a500 order:2 compound_mapcount:0 compound_pincount:0
	[ 9718.946072][ T4545] flags: 0xfffe0000010200(slab|head)
	[ 9718.946958][ T4545] raw: 00fffe0000010200 ffffea00011ab108 ffffea0001d6f108 ffff8881eabd9700
	[ 9718.948406][ T4545] raw: ffff888014e97fff ffff888014e94000 0000000100000001 0000000000000000
	[ 9718.949889][ T4545] page dumped because: kasan: bad access detected
	[ 9718.950977][ T4545] 
	[ 9718.951354][ T4545] Memory state around the buggy address:
	[ 9718.952296][ T4545]  ffff888014e93f00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
	[ 9718.953641][ T4545]  ffff888014e93f80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
	[ 9718.955004][ T4545] >ffff888014e94000: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
	[ 9718.956366][ T4545]                                   ^
	[ 9718.957258][ T4545]  ffff888014e94080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
	[ 9718.958653][ T4545]  ffff888014e94100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
	[ 9718.960034][ T4545] ==================================================================



* Re: BUG at fs/btrfs/relocation.c:794!
  2020-07-23 21:56 ` Zygo Blaxell
@ 2020-07-24  0:19   ` Qu Wenruo
  2020-08-04 16:16     ` Zygo Blaxell
  0 siblings, 1 reply; 13+ messages in thread
From: Qu Wenruo @ 2020-07-24  0:19 UTC
  To: Zygo Blaxell, David Sterba; +Cc: linux-btrfs, wqu

On 2020/7/24 5:56 AM, Zygo Blaxell wrote:
> On Wed, Jul 01, 2020 at 12:10:06AM +0200, David Sterba wrote:
>> Hi,
>>
>> I've hit a crash in relocation I've never seen before.
>>
>> [ 2129.210066] kernel BUG at fs/btrfs/relocation.c:794!
> 
> I hit an issue yesterday that reminded me of this.
> 
>> [ 2129.215268] invalid opcode: 0000 [#1] PREEMPT SMP
>> [ 2129.220114] CPU: 1 PID: 3303 Comm: btrfs Not tainted 5.8.0-rc3-git+ #638
>> [ 2129.220116] Hardware name: empty empty/S3993, BIOS PAQEX0-3 02/24/2008
>> [ 2129.220265] RIP: 0010:create_reloc_root+0x214/0x260 [btrfs]
>> [ 2129.258760] RSP: 0018:ffffbe1e809b38b8 EFLAGS: 00010282
>> [ 2129.258763] RAX: 00000000ffffffef RBX: ffff988d577f9000 RCX: 0000000000000000
>> [ 2129.258765] RDX: 0000000000000001 RSI: ffffffff8e2a2580 RDI: ffff988d64aaa6a8
>> [ 2129.258766] RBP: ffff988d5dfcdc00 R08: 0000000000000000 R09: 0000000000000000
>> [ 2129.258767] R10: 0000000000000001 R11: 0000000000000000 R12: ffff988d0e02fa78
>> [ 2129.258769] R13: 0000000000000005 R14: ffff988d64fe8000 R15: ffff988d0e02fa78
>> [ 2129.258771] FS:  00007f82a612e8c0(0000) GS:ffff988d67000000(0000) knlGS:0000000000000000
>> [ 2129.258772] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [ 2129.258774] CR2: 000000000559d028 CR3: 000000020b289000 CR4: 00000000000006e0
>> [ 2129.258775] Call Trace:
>> [ 2129.258825]  btrfs_init_reloc_root+0xe8/0x120 [btrfs]
>> [ 2129.258862]  record_root_in_trans+0xae/0xd0 [btrfs]
>> [ 2129.258901]  btrfs_record_root_in_trans+0x51/0x70 [btrfs]
>> [ 2129.340388]  select_reloc_root+0x94/0x340 [btrfs]
>> [ 2129.340433]  do_relocation+0xda/0x7b0 [btrfs]
>> [ 2129.349854]  ? _raw_spin_unlock+0x1f/0x40
>> [ 2129.349898]  relocate_tree_blocks+0x336/0x670 [btrfs]
>> [ 2129.359325]  relocate_block_group+0x2f6/0x600 [btrfs]
>> [ 2129.359365]  btrfs_relocate_block_group+0x15e/0x340 [btrfs]
>> [ 2129.359408]  btrfs_relocate_chunk+0x38/0x110 [btrfs]
>> [ 2129.375494]  __btrfs_balance+0x42c/0xce0 [btrfs]
>> [ 2129.375553]  btrfs_balance+0x66a/0xbe0 [btrfs]
>> [ 2129.375562]  ? kmem_cache_alloc_trace+0x19c/0x330
>> [ 2129.389852]  btrfs_ioctl_balance+0x298/0x350 [btrfs]
>> [ 2129.389887]  btrfs_ioctl+0x304/0x2490 [btrfs]
>> [ 2129.389898]  ? do_user_addr_fault+0x221/0x49c
>> [ 2129.404070]  ? sched_clock_cpu+0x15/0x140
>> [ 2129.404073]  ? do_user_addr_fault+0x221/0x49c
>> [ 2129.404079]  ? up_read+0x18/0x240
>> [ 2129.404086]  ? ksys_ioctl+0x68/0xa0
>> [ 2129.404091]  ksys_ioctl+0x68/0xa0
>> [ 2129.423308]  __x64_sys_ioctl+0x16/0x20
>> [ 2129.423312]  do_syscall_64+0x50/0xe0
>> [ 2129.423315]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>> [ 2129.423318] RIP: 0033:0x7f82a51c6327
>> [ 2129.423319] Code: Bad RIP value.
>> [ 2129.423348] RSP: 002b:00007ffd32cf6218 EFLAGS: 00000206 ORIG_RAX: 0000000000000010
>> [ 2129.423367] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f82a51c6327
>> [ 2129.423368] RDX: 00007ffd32cf62a0 RSI: 00000000c4009420 RDI: 0000000000000003
>> [ 2129.423372] RBP: 0000000000000003 R08: 0000000000000000 R09: 0000000000000000
>> [ 2129.423377] R10: 000000000fa99fa0 R11: 0000000000000206 R12: 00007ffd32cf8823
>> [ 2129.423379] R13: 00007ffd32cf62a0 R14: 0000000000000001 R15: 0000000000000000
>>
>> Relevant code called from create_reloc_root:
>>
>>         ret = btrfs_insert_root(trans, fs_info->tree_root,
>>                                 &root_key, root_item);
>>         BUG_ON(ret)
>>
>> and according to EAX, ret is -17 which is EEXIST.
>>
>> I don't have a reproducer, the testing image has been filled by random git
>> checkouts, deduplicated by BEES, then tons of snapshots created until the
>> metadata got exhausted, some file deletion and balances.
> 
> Mine is rsync, bees, lots of snapshots, balances, scrubs.  I recently also
> added random 'killall -INT btrfs' to send balance some fatal signals.
> 
>> This is the same image that led to the patch "btrfs: allow use of global block
>> reserve for balance item deletion", so this could have left it in some
>> intermediate state where the balance item was not removed and the reloc tree as
>> well.
>>
>> There were a few unsuccessful mounts due to relocation recovery, that was
>> trying to debug but then it started to work.
>>
>> The error happened with this 'fi df' saved after the balance start:
>>
>> # btrfs fi df mnt
>> Data, single: total=80.01GiB, used=38.67GiB
>> System, single: total=4.00MiB, used=16.00KiB
>> Metadata, single: total=19.99GiB, used=19.46GiB
>> GlobalReserve, single: total=512.00MiB, used=44.00KiB
> 
> Mine is:
> 
> 	Data, single: total=1.75TiB, used=1.74TiB
> 	System, RAID1: total=32.00MiB, used=208.00KiB
> 	Metadata, RAID1: total=25.00GiB, used=22.89GiB
> 	GlobalReserve, single: total=512.00MiB, used=0.00B
> 
> though this is some time after the failure (and a reboot).  I do notice
> that there's lots of unallocated space, but metadata usage is close
> to allocated, and I have been experiencing a lot of EROFS events when
> that happens, even if there's gigabytes unallocated.
> 
> btrfs fi us:
> 
> 	Overall:
> 	    Device size:                   2.00TiB
> 	    Device allocated:              1.80TiB
> 	    Device unallocated:          208.94GiB
> 	    Device missing:                  0.00B
> 	    Used:                          1.79TiB
> 	    Free (estimated):            211.30GiB      (min: 106.83GiB)
> 	    Data ratio:                       1.00
> 	    Metadata ratio:                   2.00
> 	    Global reserve:              512.00MiB      (used: 0.00B)
> 
> 	Data,single: Size:1.75TiB, Used:1.74TiB (99.87%)
> 	   /dev/mapper/vgtest-tvdb        894.00GiB
> 	   /dev/mapper/vgtest-tvdc        895.00GiB
> 
> 	Metadata,RAID1: Size:25.00GiB, Used:22.87GiB (91.47%)
> 	   /dev/mapper/vgtest-tvdb         25.00GiB
> 	   /dev/mapper/vgtest-tvdc         25.00GiB
> 
> 	System,RAID1: Size:32.00MiB, Used:208.00KiB (0.63%)
> 	   /dev/mapper/vgtest-tvdb         32.00MiB
> 	   /dev/mapper/vgtest-tvdc         32.00MiB
> 
> 	Unallocated:
> 	   /dev/mapper/vgtest-tvdb        104.97GiB
> 	   /dev/mapper/vgtest-tvdc        103.97GiB
> 
>> The error looks like a repeated relocation tree creation, which would point to
>> the unsuccessful balances or inconsistent state (balance item, reloc trees).
>> It's not a "typical" mix of operations but I'd appreciate any insights here.
> 
> I have the same line but different call stack, with misc-next
> e3027d10af42d24940be74dabaf1550cd770bd48:
> 
> 	[ 9717.746937][T13609] BTRFS info (device dm-0): balance: start -mlimit=1 -slimit=1
> 	[ 9717.765086][T13609] BTRFS info (device dm-0): relocating block group 10991411658752 flags metadata|raid1
> 	[ 9718.511137][T13609] ------------[ cut here ]------------
> 	[ 9718.512293][T13609] kernel BUG at fs/btrfs/relocation.c:794!
> 	[ 9718.513421][T13609] invalid opcode: 0000 [#1] SMP KASAN PTI
> 	[ 9718.514590][T13609] CPU: 1 PID: 13609 Comm: btrfs Tainted: G        W         5.8.0-6582a95aabfe+ #44
> 	[ 9718.516178][T13609] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
> 	[ 9718.517750][T13609] RIP: 0010:create_reloc_root+0x468/0x480
> 	[ 9718.518717][T13609] Code: e8 bd 5b bd ff 4d 8b 76 50 be 08 00 00 00 49 8d bc 24 f0 00 00 00 e8 c7 5b bd ff 4d 89 b4 24 f0 00 00 00 e9 ee fc ff ff 0f 0b <0f> 0b 0f 0b 0f 0b 0f 0b e8 9b df 07 01 66 66 2e 0f 1f 84 00 00 00
> 	[ 9718.521995][T13609] RSP: 0018:ffffc900018e7018 EFLAGS: 00010282
> 	[ 9718.522991][T13609] RAX: 00000000ffffffef RBX: ffff8881e103a400 RCX: 0000000000000000
> 	[ 9718.524300][T13609] RDX: dffffc0000000000 RSI: 0000000000000000 RDI: 0000000000000246
> 	[ 9718.525612][T13609] RBP: ffffc900018e7108 R08: 0000000000000000 R09: 0000000000000001
> 	[ 9718.527056][T13609] R10: 0000000000000001 R11: fffffbfff3dfb081 R12: ffff8881f37c8020
> 	[ 9718.528386][T13609] R13: ffff88801fbc5b28 R14: ffff8881f37c8000 R15: ffffc900018e70a0
> 	[ 9718.529756][T13609] FS:  00007f9577d928c0(0000) GS:ffff8881f5800000(0000) knlGS:0000000000000000
> 	[ 9718.531211][T13609] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> 	[ 9718.532295][T13609] CR2: 00007f9823e35500 CR3: 00000000a52e0002 CR4: 00000000001606e0
> 	[ 9718.533608][T13609] Call Trace:
> 	[ 9718.534151][T13609]  ? update_backref_node+0xf0/0xf0
> 	[ 9718.535137][T13609]  ? check_chain_key+0x1e6/0x2e0
> 	[ 9718.536057][T13609]  btrfs_init_reloc_root+0x2d7/0x310

That's the same problem.

btrfs_init_reloc_root() hit -EEXIST (from btrfs_insert_root() in
create_reloc_root()), which triggered the BUG_ON().
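
For reference, the -EEXIST reading comes from RAX in the trace above
(00000000ffffffef). A minimal Python sketch of that decoding, assuming
the return value is a 32-bit C int held in the low half of RAX;
decode_ret is just a helper name made up for this illustration:

```python
import errno
import os


def decode_ret(rax_hex: str) -> int:
    """Interpret the low 32 bits of a register dump value as a signed C int."""
    value = int(rax_hex, 16) & 0xFFFFFFFF
    # Two's complement: values with the top bit set are negative.
    return value - (1 << 32) if value & 0x80000000 else value


ret = decode_ret("00000000ffffffef")  # RAX as printed in the oops
name = errno.errorcode.get(-ret, "unknown")
print(ret, name, os.strerror(-ret))
```

On Linux errno 17 is EEXIST, so this agrees with the failing
btrfs_insert_root() call.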

That means some reloc trees were not cleaned up.

Would you mind providing the "btrfs ins dump-tree -t root" dump for
that fs if the problem still happens?
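
What I would look for in that dump are leftover reloc tree root items.
A rough Python sketch of such a scan follows. It is not a btrfs-progs
feature, and it assumes dump-tree prints those items with a key of the
form "(TREE_RELOC ROOT_ITEM <subvolid>)". The sample text and the
find_reloc_roots name are made up for illustration only:

```python
import re

# Assumed dump-tree key format for reloc tree root items.
RELOC_KEY = re.compile(r"key \(TREE_RELOC ROOT_ITEM (\d+)\)")


def find_reloc_roots(dump_text: str) -> list:
    """Return sorted subvolume ids that have a TREE_RELOC root item."""
    return sorted({int(m.group(1)) for m in RELOC_KEY.finditer(dump_text)})


# Hypothetical sample, NOT taken from any real filesystem.
sample = """\
\titem 0 key (EXTENT_TREE ROOT_ITEM 0) itemoff 15844 itemsize 439
\titem 7 key (257 ROOT_ITEM 0) itemoff 12000 itemsize 439
\titem 9 key (TREE_RELOC ROOT_ITEM 257) itemoff 11000 itemsize 439
\titem 10 key (TREE_RELOC ROOT_ITEM 5) itemoff 10500 itemsize 439
"""

print(find_reloc_roots(sample))
```

With a real dump saved to a file (say root-tree.txt, a hypothetical
name), pass the file contents to find_reloc_roots(). Any ids reported
while no balance is running would be candidates for stale reloc trees.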

Thanks,
Qu
> 	[ 9718.537016][T13609]  ? find_reloc_root+0x200/0x200
> 	[ 9718.537992][T13609]  ? do_raw_spin_unlock+0xa8/0x140
> 	[ 9718.538899][T13609]  record_root_in_trans+0x18c/0x1d0
> 	[ 9718.539848][T13609]  btrfs_record_root_in_trans+0x8b/0xc0
> 	[ 9718.540843][T13609]  select_reloc_root+0x15f/0x6a0
> 	[ 9718.541943][T13609]  ? create_reloc_inode.isra.28+0x410/0x410
> 	[ 9718.543066][T13609]  ? rcu_read_lock_sched_held+0xa1/0xd0
> 	[ 9718.544333][T13609]  ? check_flags.part.44+0x86/0x220
> 	[ 9718.545186][T13609]  ? check_flags+0x26/0x30
> 	[ 9718.545870][T13609]  ? lock_is_held_type+0xc9/0x100
> 	[ 9718.546651][T13609]  do_relocation+0x242/0xc90
> 	[ 9718.547372][T13609]  ? select_reloc_root+0x6a0/0x6a0
> 	[ 9718.548160][T13609]  ? check_flags.part.44+0x86/0x220
> 	[ 9718.548969][T13609]  ? __kasan_check_read+0x11/0x20
> 	[ 9718.549745][T13609]  ? mark_lock+0xa8/0x440
> 	[ 9718.550426][T13609]  ? mark_held_locks+0x8d/0xb0
> 	[ 9718.551165][T13609]  ? btrfs_backref_cleanup_node+0x5c1/0x600
> 	[ 9718.552079][T13609]  ? memcpy+0x4d/0x60
> 	[ 9718.552694][T13609]  ? read_extent_buffer+0xcc/0x120
> 	[ 9718.553478][T13609]  relocate_tree_blocks+0xa29/0xb00
> 	[ 9718.554255][T13609]  ? do_relocation+0xc90/0xc90
> 	[ 9718.554978][T13609]  ? kmem_cache_alloc_trace+0x5af/0x740
> 	[ 9718.555855][T13609]  ? free_extent_buffer.part.46+0x90/0x140
> 	[ 9718.556756][T13609]  ? rb_insert_color+0x342/0x360
> 	[ 9718.557581][T13609]  ? free_extent_buffer+0x13/0x20
> 	[ 9718.558445][T13609]  ? add_tree_block.isra.34+0x236/0x2b0
> 	[ 9718.559387][T13609]  relocate_block_group+0x52e/0x830
> 	[ 9718.560275][T13609]  ? merge_reloc_roots+0x4b0/0x4b0
> 	[ 9718.561137][T13609]  btrfs_relocate_block_group+0x26e/0x4c0
> 	[ 9718.562137][T13609]  btrfs_relocate_chunk+0x52/0x120
> 	[ 9718.562918][T13609]  btrfs_balance+0xe22/0x1910
> 	[ 9718.563605][T13609]  ? check_chain_key+0x1e6/0x2e0
> 	[ 9718.564331][T13609]  ? btrfs_relocate_chunk+0x120/0x120
> 	[ 9718.565126][T13609]  ? kmem_cache_alloc_trace+0x5af/0x740
> 	[ 9718.565943][T13609]  ? _copy_from_user+0x95/0xd0
> 	[ 9718.566649][T13609]  btrfs_ioctl_balance+0x3de/0x4c0
> 	[ 9718.567414][T13609]  btrfs_ioctl+0x2385/0x4250
> 	[ 9718.568090][T13609]  ? __kasan_check_read+0x11/0x20
> 	[ 9718.568830][T13609]  ? check_chain_key+0x1e6/0x2e0
> 	[ 9718.569619][T13609]  ? btrfs_ioctl_get_supported_features+0x30/0x30
> 	[ 9718.570658][T13609]  ? kvm_sched_clock_read+0x18/0x30
> 	[ 9718.571526][T13609]  ? check_chain_key+0x1e6/0x2e0
> 	[ 9718.572348][T13609]  ? lock_downgrade+0x3e0/0x3e0
> 	[ 9718.573121][T13609]  ? do_vfs_ioctl+0xfc/0x9d0
> 	[ 9718.573835][T13609]  ? ioctl_file_clone+0xe0/0xe0
> 	[ 9718.574637][T13609]  ? check_flags.part.44+0x86/0x220
> 	[ 9718.575472][T13609]  ? check_flags+0x26/0x30
> 	[ 9718.576190][T13609]  ? lock_is_held_type+0xc9/0x100
> 	[ 9718.576990][T13609]  ? check_flags.part.44+0x86/0x220
> 	[ 9718.577836][T13609]  ? check_flags+0x26/0x30
> 	[ 9718.578542][T13609]  ? lock_is_held_type+0xc9/0x100
> 	[ 9718.579403][T13609]  ? __kasan_check_read+0x11/0x20
> 	[ 9718.580225][T13609]  ? __fget_light+0xae/0x110
> 	[ 9718.580983][T13609]  ksys_ioctl+0xa1/0xe0
> 	[ 9718.581628][T13609]  __x64_sys_ioctl+0x43/0x50
> 	[ 9718.582334][T13609]  do_syscall_64+0x60/0xf0
> 	[ 9718.583285][T13609]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> 	[ 9718.584378][T13609] RIP: 0033:0x7f9577e85427
> 	[ 9718.585289][T13609] Code: Bad RIP value.
> 	[ 9718.586076][T13609] RSP: 002b:00007ffdc7b82548 EFLAGS: 00000206 ORIG_RAX: 0000000000000010
> 	[ 9718.587896][T13609] RAX: ffffffffffffffda RBX: 00007ffdc7b825e8 RCX: 00007f9577e85427
> 	[ 9718.589391][T13609] RDX: 00007ffdc7b825e8 RSI: 00000000c4009420 RDI: 0000000000000003
> 	[ 9718.590817][T13609] RBP: 0000000000000003 R08: 0000000000000003 R09: 0000000000000078
> 	[ 9718.592631][T13609] R10: fffffffffffff31c R11: 0000000000000206 R12: 0000000000000001
> 	[ 9718.594405][T13609] R13: 0000000000000000 R14: 00007ffdc7b84a48 R15: 0000000000000001
> 	[ 9718.596109][T13609] Modules linked in:
> 	[ 9718.597056][T13609] ---[ end trace 2cf173f8217fc093 ]---
> 	[ 9718.598018][T13609] RIP: 0010:create_reloc_root+0x468/0x480
> 	[ 9718.602850][T13609] Code: e8 bd 5b bd ff 4d 8b 76 50 be 08 00 00 00 49 8d bc 24 f0 00 00 00 e8 c7 5b bd ff 4d 89 b4 24 f0 00 00 00 e9 ee fc ff ff 0f 0b <0f> 0b 0f 0b 0f 0b 0f 0b e8 9b df 07 01 66 66 2e 0f 1f 84 00 00 00
> 	[ 9718.613371][T13609] RSP: 0018:ffffc900018e7018 EFLAGS: 00010282
> 	[ 9718.621286][T13609] RAX: 00000000ffffffef RBX: ffff8881e103a400 RCX: 0000000000000000
> 	[ 9718.631255][T13609] RDX: dffffc0000000000 RSI: 0000000000000000 RDI: 0000000000000246
> 	[ 9718.639764][T13609] RBP: ffffc900018e7108 R08: 0000000000000000 R09: 0000000000000001
> 	[ 9718.641533][T13609] R10: 0000000000000001 R11: fffffbfff3dfb081 R12: ffff8881f37c8020
> 	[ 9718.643173][T13609] R13: ffff88801fbc5b28 R14: ffff8881f37c8000 R15: ffffc900018e70a0
> 	[ 9718.644840][T13609] FS:  00007f9577d928c0(0000) GS:ffff8881f5800000(0000) knlGS:0000000000000000
> 	[ 9718.646728][T13609] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> 	[ 9718.648607][T13609] CR2: 00007f9823e35500 CR3: 00000000a52e0002 CR4: 00000000001606e0
> 	[ 9718.869689][ T4545] ==================================================================
> 
> same line, different call stack:
> 
> 	0xffffffff81933dd8 is in create_reloc_root (fs/btrfs/relocation.c:794).
> 	789             btrfs_tree_unlock(eb);
> 	790             free_extent_buffer(eb);
> 	791
> 	792             ret = btrfs_insert_root(trans, fs_info->tree_root,
> 	793                                     &root_key, root_item);
> 	794             BUG_ON(ret);
> 	795             kfree(root_item);
> 	796
> 	797             reloc_root = btrfs_read_tree_root(fs_info->tree_root, &root_key);
> 	798             BUG_ON(IS_ERR(reloc_root));
> 
> followed by
> 
> 	[ 9718.869689][ T4545] ==================================================================
> 	[ 9718.871333][ T4545] BUG: KASAN: use-after-free in __mutex_lock+0x202/0xce0
> 	[ 9718.872483][ T4545] Read of size 4 at addr ffff888014e9402c by task crawl_28443/4545
> 	[ 9718.873746][ T4545] 
> 	[ 9718.874106][ T4545] CPU: 1 PID: 4545 Comm: crawl_28443 Tainted: G      D W         5.8.0-6582a95aabfe+ #44
> 	[ 9718.875684][ T4545] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
> 	[ 9718.877149][ T4545] Call Trace:
> 	[ 9718.877655][ T4545]  dump_stack+0xc8/0x11a
> 	[ 9718.878317][ T4545]  ? __mutex_lock+0x202/0xce0
> 	[ 9718.879065][ T4545]  print_address_description.constprop.8+0x1f/0x200
> 	[ 9718.880167][ T4545]  ? __mutex_lock+0x202/0xce0
> 	[ 9718.880916][ T4545]  ? __mutex_lock+0x202/0xce0
> 	[ 9718.881666][ T4545]  kasan_report.cold.11+0x20/0x3e
> 	[ 9718.882483][ T4545]  ? __mutex_lock+0x202/0xce0
> 	[ 9718.883229][ T4545]  __asan_load4+0x69/0x90
> 	[ 9718.883920][ T4545]  __mutex_lock+0x202/0xce0
> 	[ 9718.884651][ T4545]  ? wait_current_trans+0xb7/0x230
> 	[ 9718.885465][ T4545]  ? btrfs_record_root_in_trans+0x7e/0xc0
> 	[ 9718.886388][ T4545]  ? mutex_lock_io_nested+0xc20/0xc20
> 	[ 9718.887246][ T4545]  ? __kasan_check_read+0x11/0x20
> 	[ 9718.888035][ T4545]  ? join_transaction+0x32/0x6f0
> 	[ 9718.888854][ T4545]  ? join_transaction+0x1a6/0x6f0
> 	[ 9718.889679][ T4545]  ? lock_downgrade+0x3e0/0x3e0
> 	[ 9718.890496][ T4545]  ? __kasan_check_write+0x14/0x20
> 	[ 9718.891308][ T4545]  ? lock_contended+0x720/0x720
> 	[ 9718.892093][ T4545]  ? do_raw_spin_lock+0x1e0/0x1e0
> 	[ 9718.892912][ T4545]  ? wait_current_trans+0xb7/0x230
> 	[ 9718.893705][ T4545]  mutex_lock_nested+0x1b/0x20
> 	[ 9718.894494][ T4545]  ? mutex_lock_nested+0x1b/0x20
> 	[ 9718.895317][ T4545]  btrfs_record_root_in_trans+0x7e/0xc0
> 	[ 9718.896245][ T4545]  start_transaction+0x189/0x8f0
> 	[ 9718.897081][ T4545]  btrfs_start_transaction+0x1e/0x20
> 	[ 9718.897941][ T4545]  btrfs_cont_expand+0x549/0x7a0
> 	[ 9718.898805][ T4545]  ? btrfs_truncate_block+0x930/0x930
> 	[ 9718.899665][ T4545]  ? inode_newsize_ok+0x75/0xc0
> 	[ 9718.900438][ T4545]  ? setattr_prepare+0x9c/0x310
> 	[ 9718.901242][ T4545]  btrfs_setattr+0x514/0x850
> 	[ 9718.902035][ T4545]  ? current_time+0x8c/0xe0
> 	[ 9718.902799][ T4545]  notify_change+0x4ec/0x700
> 	[ 9718.903584][ T4545]  ? do_sys_ftruncate+0x108/0x220
> 	[ 9718.904459][ T4545]  do_truncate+0xe4/0x160
> 	[ 9718.905200][ T4545]  ? __x64_sys_openat2+0x170/0x170
> 	[ 9718.906116][ T4545]  ? __sb_start_write+0x1a1/0x270
> 	[ 9718.906954][ T4545]  do_sys_ftruncate+0x1b8/0x220
> 	[ 9718.907759][ T4545]  __x64_sys_ftruncate+0x36/0x40
> 	[ 9718.908577][ T4545]  do_syscall_64+0x60/0xf0
> 	[ 9718.909292][ T4545]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> 	[ 9718.910521][ T4545] RIP: 0033:0x7f201fcab947
> 	[ 9718.911247][ T4545] Code: Bad RIP value.
> 	[ 9718.911915][ T4545] RSP: 002b:00007f201d3abeb8 EFLAGS: 00000202 ORIG_RAX: 000000000000004d
> 	[ 9718.913285][ T4545] RAX: ffffffffffffffda RBX: 00007f201d3abfa0 RCX: 00007f201fcab947
> 	[ 9718.914613][ T4545] RDX: 000000005f18a6d2 RSI: 0000000000286000 RDI: 0000000000000ec1
> 	[ 9718.915921][ T4545] RBP: 00007f1fb01c2f00 R08: 00007ffe1e345080 R09: 00000000011b1f78
> 	[ 9718.917236][ T4545] R10: 00000000011b1f78 R11: 0000000000000202 R12: 00007f201d3abf20
> 	[ 9718.918556][ T4545] R13: 00007f201d3abef0 R14: 00007f201d3abf50 R15: 00007f201d3abed0
> 	[ 9718.919882][ T4545] 
> 	[ 9718.920268][ T4545] Allocated by task 6732:
> 	[ 9718.920973][ T4545]  save_stack+0x21/0x50
> 	[ 9718.921648][ T4545]  __kasan_kmalloc.constprop.17+0xc1/0xd0
> 	[ 9718.922580][ T4545]  kasan_slab_alloc+0x12/0x20
> 	[ 9718.923345][ T4545]  kmem_cache_alloc_node+0x113/0x720
> 	[ 9718.924203][ T4545]  copy_process+0x357/0x3680
> 	[ 9718.924955][ T4545]  _do_fork+0xed/0x880
> 	[ 9718.925622][ T4545]  __do_sys_clone+0xee/0x130
> 	[ 9718.926369][ T4545]  __x64_sys_clone+0x67/0x80
> 	[ 9718.927119][ T4545]  do_syscall_64+0x60/0xf0
> 	[ 9718.927848][ T4545]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> 	[ 9718.928812][ T4545] 
> 	[ 9718.929173][ T4545] Freed by task 24:
> 	[ 9718.929787][ T4545]  save_stack+0x21/0x50
> 	[ 9718.930453][ T4545]  __kasan_slab_free+0x118/0x170
> 	[ 9718.931242][ T4545]  kasan_slab_free+0xe/0x10
> 	[ 9718.931970][ T4545]  kmem_cache_free+0x5f/0x280
> 	[ 9718.932730][ T4545]  free_task+0x73/0x90
> 	[ 9718.933391][ T4545]  __put_task_struct+0x199/0x1d0
> 	[ 9718.934187][ T4545]  delayed_put_task_struct+0x124/0x1b0
> 	[ 9718.935071][ T4545]  rcu_core+0x3b0/0xeb0
> 	[ 9718.935758][ T4545]  rcu_core_si+0xe/0x10
> 	[ 9718.936433][ T4545]  __do_softirq+0x120/0x5e3
> 	[ 9718.937165][ T4545] 
> 	[ 9718.937545][ T4545] The buggy address belongs to the object at ffff888014e94000
> 	[ 9718.937545][ T4545]  which belongs to the cache task_struct(168:screen-wrapper.service) of size 11072
> 	[ 9718.940391][ T4545] The buggy address is located 44 bytes inside of
> 	[ 9718.940391][ T4545]  11072-byte region [ffff888014e94000, ffff888014e96b40)
> 	[ 9718.942559][ T4545] The buggy address belongs to the page:
> 	[ 9718.943454][ T4545] page:ffffea000053a500 refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff888014e97fff head:ffffea000053a500 order:2 compound_mapcount:0 compound_pincount:0
> 	[ 9718.946072][ T4545] flags: 0xfffe0000010200(slab|head)
> 	[ 9718.946958][ T4545] raw: 00fffe0000010200 ffffea00011ab108 ffffea0001d6f108 ffff8881eabd9700
> 	[ 9718.948406][ T4545] raw: ffff888014e97fff ffff888014e94000 0000000100000001 0000000000000000
> 	[ 9718.949889][ T4545] page dumped because: kasan: bad access detected
> 	[ 9718.950977][ T4545] 
> 	[ 9718.951354][ T4545] Memory state around the buggy address:
> 	[ 9718.952296][ T4545]  ffff888014e93f00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 	[ 9718.953641][ T4545]  ffff888014e93f80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 	[ 9718.955004][ T4545] >ffff888014e94000: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> 	[ 9718.956366][ T4545]                                   ^
> 	[ 9718.957258][ T4545]  ffff888014e94080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> 	[ 9718.958653][ T4545]  ffff888014e94100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> 	[ 9718.960034][ T4545] ==================================================================
> 



* Re: BUG at fs/btrfs/relocation.c:794!
  2020-07-24  0:19   ` Qu Wenruo
@ 2020-08-04 16:16     ` Zygo Blaxell
  2020-08-28  0:03       ` BUG at fs/btrfs/relocation.c:794! Still happening on misc-next and 5.8.3 Zygo Blaxell
  0 siblings, 1 reply; 13+ messages in thread
From: Zygo Blaxell @ 2020-08-04 16:16 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: David Sterba, linux-btrfs, wqu

On Fri, Jul 24, 2020 at 08:19:36AM +0800, Qu Wenruo wrote:
> 
> 
> On 2020/7/24 上午5:56, Zygo Blaxell wrote:
> > On Wed, Jul 01, 2020 at 12:10:06AM +0200, David Sterba wrote:
> >> Hi,
> >>
> >> I've hit a crash in relocation I've never seen before.
> >>
> >> [ 2129.210066] kernel BUG at fs/btrfs/relocation.c:794!
> > 
> > I hit an issue yesterday that reminded me of this.
> > 
> >> [ 2129.215268] invalid opcode: 0000 [#1] PREEMPT SMP
> >> [ 2129.220114] CPU: 1 PID: 3303 Comm: btrfs Not tainted 5.8.0-rc3-git+ #638
> >> [ 2129.220116] Hardware name: empty empty/S3993, BIOS PAQEX0-3 02/24/2008
> >> [ 2129.220265] RIP: 0010:create_reloc_root+0x214/0x260 [btrfs]
> >> [ 2129.258760] RSP: 0018:ffffbe1e809b38b8 EFLAGS: 00010282
> >> [ 2129.258763] RAX: 00000000ffffffef RBX: ffff988d577f9000 RCX: 0000000000000000
> >> [ 2129.258765] RDX: 0000000000000001 RSI: ffffffff8e2a2580 RDI: ffff988d64aaa6a8
> >> [ 2129.258766] RBP: ffff988d5dfcdc00 R08: 0000000000000000 R09: 0000000000000000
> >> [ 2129.258767] R10: 0000000000000001 R11: 0000000000000000 R12: ffff988d0e02fa78
> >> [ 2129.258769] R13: 0000000000000005 R14: ffff988d64fe8000 R15: ffff988d0e02fa78
> >> [ 2129.258771] FS:  00007f82a612e8c0(0000) GS:ffff988d67000000(0000) knlGS:0000000000000000
> >> [ 2129.258772] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >> [ 2129.258774] CR2: 000000000559d028 CR3: 000000020b289000 CR4: 00000000000006e0
> >> [ 2129.258775] Call Trace:
> >> [ 2129.258825]  btrfs_init_reloc_root+0xe8/0x120 [btrfs]
> >> [ 2129.258862]  record_root_in_trans+0xae/0xd0 [btrfs]
> >> [ 2129.258901]  btrfs_record_root_in_trans+0x51/0x70 [btrfs]
> >> [ 2129.340388]  select_reloc_root+0x94/0x340 [btrfs]
> >> [ 2129.340433]  do_relocation+0xda/0x7b0 [btrfs]
> >> [ 2129.349854]  ? _raw_spin_unlock+0x1f/0x40
> >> [ 2129.349898]  relocate_tree_blocks+0x336/0x670 [btrfs]
> >> [ 2129.359325]  relocate_block_group+0x2f6/0x600 [btrfs]
> >> [ 2129.359365]  btrfs_relocate_block_group+0x15e/0x340 [btrfs]
> >> [ 2129.359408]  btrfs_relocate_chunk+0x38/0x110 [btrfs]
> >> [ 2129.375494]  __btrfs_balance+0x42c/0xce0 [btrfs]
> >> [ 2129.375553]  btrfs_balance+0x66a/0xbe0 [btrfs]
> >> [ 2129.375562]  ? kmem_cache_alloc_trace+0x19c/0x330
> >> [ 2129.389852]  btrfs_ioctl_balance+0x298/0x350 [btrfs]
> >> [ 2129.389887]  btrfs_ioctl+0x304/0x2490 [btrfs]
> >> [ 2129.389898]  ? do_user_addr_fault+0x221/0x49c
> >> [ 2129.404070]  ? sched_clock_cpu+0x15/0x140
> >> [ 2129.404073]  ? do_user_addr_fault+0x221/0x49c
> >> [ 2129.404079]  ? up_read+0x18/0x240
> >> [ 2129.404086]  ? ksys_ioctl+0x68/0xa0
> >> [ 2129.404091]  ksys_ioctl+0x68/0xa0
> >> [ 2129.423308]  __x64_sys_ioctl+0x16/0x20
> >> [ 2129.423312]  do_syscall_64+0x50/0xe0
> >> [ 2129.423315]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> >> [ 2129.423318] RIP: 0033:0x7f82a51c6327
> >> [ 2129.423319] Code: Bad RIP value.
> >> [ 2129.423348] RSP: 002b:00007ffd32cf6218 EFLAGS: 00000206 ORIG_RAX: 0000000000000010
> >> [ 2129.423367] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f82a51c6327
> >> [ 2129.423368] RDX: 00007ffd32cf62a0 RSI: 00000000c4009420 RDI: 0000000000000003
> >> [ 2129.423372] RBP: 0000000000000003 R08: 0000000000000000 R09: 0000000000000000
> >> [ 2129.423377] R10: 000000000fa99fa0 R11: 0000000000000206 R12: 00007ffd32cf8823
> >> [ 2129.423379] R13: 00007ffd32cf62a0 R14: 0000000000000001 R15: 0000000000000000
> >>
> >> Relevant code called from create_reloc_root:
> >>
> >>         ret = btrfs_insert_root(trans, fs_info->tree_root,
> >>                                 &root_key, root_item);
> >>         BUG_ON(ret)
> >>
> >> and according to EAX, ret is -17 which is EEXIST.
> >>
> >> I don't have a reproducer, the testing image has been filled by random git
> >> checkouts, deduplicated by BEES, then tons of snapshots created until the
> >> metadata got exhausted, some file deletion and balances.
> > 
> > Mine is rsync, bees, lots of snapshots, balances, scrubs.  I recently also
> > added random 'killall -INT btrfs' to send balance some fatal signals.
> > 
> >> This is the same image that led to the patch "btrfs: allow use of global block
> >> reserve for balance item deletion", so this could have left it in some
> >> intermediate state where the balance item was not removed and the reloc tree as
> >> well.
> >>
> >> There were a few unsuccessful mounts due to relocation recovery, that was
> >> trying to debug but then it started to work.
> >>
> >> The error happened with this 'fi df' saved after the balance start:
> >>
> >> # btrfs fi df mnt
> >> Data, single: total=80.01GiB, used=38.67GiB
> >> System, single: total=4.00MiB, used=16.00KiB
> >> Metadata, single: total=19.99GiB, used=19.46GiB
> >> GlobalReserve, single: total=512.00MiB, used=44.00KiB
> > 
> > Mine is:
> > 
> > 	Data, single: total=1.75TiB, used=1.74TiB
> > 	System, RAID1: total=32.00MiB, used=208.00KiB
> > 	Metadata, RAID1: total=25.00GiB, used=22.89GiB
> > 	GlobalReserve, single: total=512.00MiB, used=0.00B
> > 
> > though this is some time after the failure (and a reboot).  I do notice
> > that there's lots of unallocated space, but metadata usage is close
> > to allocated, and I have been experiencing a lot of EROFS events when
> > that happens, even if there's gigabytes unallocated.
> > 
> > btrfs fi us:
> > 
> > 	Overall:
> > 	    Device size:                   2.00TiB
> > 	    Device allocated:              1.80TiB
> > 	    Device unallocated:          208.94GiB
> > 	    Device missing:                  0.00B
> > 	    Used:                          1.79TiB
> > 	    Free (estimated):            211.30GiB      (min: 106.83GiB)
> > 	    Data ratio:                       1.00
> > 	    Metadata ratio:                   2.00
> > 	    Global reserve:              512.00MiB      (used: 0.00B)
> > 
> > 	Data,single: Size:1.75TiB, Used:1.74TiB (99.87%)
> > 	   /dev/mapper/vgtest-tvdb        894.00GiB
> > 	   /dev/mapper/vgtest-tvdc        895.00GiB
> > 
> > 	Metadata,RAID1: Size:25.00GiB, Used:22.87GiB (91.47%)
> > 	   /dev/mapper/vgtest-tvdb         25.00GiB
> > 	   /dev/mapper/vgtest-tvdc         25.00GiB
> > 
> > 	System,RAID1: Size:32.00MiB, Used:208.00KiB (0.63%)
> > 	   /dev/mapper/vgtest-tvdb         32.00MiB
> > 	   /dev/mapper/vgtest-tvdc         32.00MiB
> > 
> > 	Unallocated:
> > 	   /dev/mapper/vgtest-tvdb        104.97GiB
> > 	   /dev/mapper/vgtest-tvdc        103.97GiB
> > 
> >> The error looks like a repeated relocation tree creation, which would point to
> >> the unsuccessful balances or inconsistent state (balance item, reloc trees).
> >> It's not a "typical" mix of operations but I'd appreciate any insights here.
> > 
> > I have the same line but different call stack, with misc-next
> > e3027d10af42d24940be74dabaf1550cd770bd48:
> > 
> > 	[ 9717.746937][T13609] BTRFS info (device dm-0): balance: start -mlimit=1 -slimit=1
> > 	[ 9717.765086][T13609] BTRFS info (device dm-0): relocating block group 10991411658752 flags metadata|raid1
> > 	[ 9718.511137][T13609] ------------[ cut here ]------------
> > 	[ 9718.512293][T13609] kernel BUG at fs/btrfs/relocation.c:794!
> > 	[ 9718.513421][T13609] invalid opcode: 0000 [#1] SMP KASAN PTI
> > 	[ 9718.514590][T13609] CPU: 1 PID: 13609 Comm: btrfs Tainted: G        W         5.8.0-6582a95aabfe+ #44
> > 	[ 9718.516178][T13609] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
> > 	[ 9718.517750][T13609] RIP: 0010:create_reloc_root+0x468/0x480
> > 	[ 9718.518717][T13609] Code: e8 bd 5b bd ff 4d 8b 76 50 be 08 00 00 00 49 8d bc 24 f0 00 00 00 e8 c7 5b bd ff 4d 89 b4 24 f0 00 00 00 e9 ee fc ff ff 0f 0b <0f> 0b 0f 0b 0f 0b 0f 0b e8 9b df 07 01 66 66 2e 0f 1f 84 00 00 00
> > 	[ 9718.521995][T13609] RSP: 0018:ffffc900018e7018 EFLAGS: 00010282
> > 	[ 9718.522991][T13609] RAX: 00000000ffffffef RBX: ffff8881e103a400 RCX: 0000000000000000
> > 	[ 9718.524300][T13609] RDX: dffffc0000000000 RSI: 0000000000000000 RDI: 0000000000000246
> > 	[ 9718.525612][T13609] RBP: ffffc900018e7108 R08: 0000000000000000 R09: 0000000000000001
> > 	[ 9718.527056][T13609] R10: 0000000000000001 R11: fffffbfff3dfb081 R12: ffff8881f37c8020
> > 	[ 9718.528386][T13609] R13: ffff88801fbc5b28 R14: ffff8881f37c8000 R15: ffffc900018e70a0
> > 	[ 9718.529756][T13609] FS:  00007f9577d928c0(0000) GS:ffff8881f5800000(0000) knlGS:0000000000000000
> > 	[ 9718.531211][T13609] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > 	[ 9718.532295][T13609] CR2: 00007f9823e35500 CR3: 00000000a52e0002 CR4: 00000000001606e0
> > 	[ 9718.533608][T13609] Call Trace:
> > 	[ 9718.534151][T13609]  ? update_backref_node+0xf0/0xf0
> > 	[ 9718.535137][T13609]  ? check_chain_key+0x1e6/0x2e0
> > 	[ 9718.536057][T13609]  btrfs_init_reloc_root+0x2d7/0x310
> 
> That's the same problem.
> 
> btrfs_init_reloc_root() hit -EEXIST (from btrfs_insert_root() in
> create_reloc_root()), which triggered the BUG_ON().
> 
> That means some reloc trees were not cleaned up.
> 
> Would you mind providing the "btrfs ins dump-tree -t root" dump for
> that fs if the problem still happens?

http://furryterror.org/~zblaxell/tmp/.tvdb/tvdb.txt

The problem is now happening multiple times per day, starting with
kdave's misc-next e3027d10af42d24940be74dabaf1550cd770bd48 (Date: Thu
Jul 23 00:18:04 2020 +0900) and continuing on v5.8.0.

The previous misc-next that I have test data for,
cb799f0a0bb372f37f96893d2e80c1dc2f5206da (Date: Thu Jul 16 13:29:46 2020
-0700), does not have this problem.

These commit hashes are from https://gitlab.com/kdave/btrfs-devel.


> Thanks,
> Qu
> > 	[ 9718.537016][T13609]  ? find_reloc_root+0x200/0x200
> > 	[ 9718.537992][T13609]  ? do_raw_spin_unlock+0xa8/0x140
> > 	[ 9718.538899][T13609]  record_root_in_trans+0x18c/0x1d0
> > 	[ 9718.539848][T13609]  btrfs_record_root_in_trans+0x8b/0xc0
> > 	[ 9718.540843][T13609]  select_reloc_root+0x15f/0x6a0
> > 	[ 9718.541943][T13609]  ? create_reloc_inode.isra.28+0x410/0x410
> > 	[ 9718.543066][T13609]  ? rcu_read_lock_sched_held+0xa1/0xd0
> > 	[ 9718.544333][T13609]  ? check_flags.part.44+0x86/0x220
> > 	[ 9718.545186][T13609]  ? check_flags+0x26/0x30
> > 	[ 9718.545870][T13609]  ? lock_is_held_type+0xc9/0x100
> > 	[ 9718.546651][T13609]  do_relocation+0x242/0xc90
> > 	[ 9718.547372][T13609]  ? select_reloc_root+0x6a0/0x6a0
> > 	[ 9718.548160][T13609]  ? check_flags.part.44+0x86/0x220
> > 	[ 9718.548969][T13609]  ? __kasan_check_read+0x11/0x20
> > 	[ 9718.549745][T13609]  ? mark_lock+0xa8/0x440
> > 	[ 9718.550426][T13609]  ? mark_held_locks+0x8d/0xb0
> > 	[ 9718.551165][T13609]  ? btrfs_backref_cleanup_node+0x5c1/0x600
> > 	[ 9718.552079][T13609]  ? memcpy+0x4d/0x60
> > 	[ 9718.552694][T13609]  ? read_extent_buffer+0xcc/0x120
> > 	[ 9718.553478][T13609]  relocate_tree_blocks+0xa29/0xb00
> > 	[ 9718.554255][T13609]  ? do_relocation+0xc90/0xc90
> > 	[ 9718.554978][T13609]  ? kmem_cache_alloc_trace+0x5af/0x740
> > 	[ 9718.555855][T13609]  ? free_extent_buffer.part.46+0x90/0x140
> > 	[ 9718.556756][T13609]  ? rb_insert_color+0x342/0x360
> > 	[ 9718.557581][T13609]  ? free_extent_buffer+0x13/0x20
> > 	[ 9718.558445][T13609]  ? add_tree_block.isra.34+0x236/0x2b0
> > 	[ 9718.559387][T13609]  relocate_block_group+0x52e/0x830
> > 	[ 9718.560275][T13609]  ? merge_reloc_roots+0x4b0/0x4b0
> > 	[ 9718.561137][T13609]  btrfs_relocate_block_group+0x26e/0x4c0
> > 	[ 9718.562137][T13609]  btrfs_relocate_chunk+0x52/0x120
> > 	[ 9718.562918][T13609]  btrfs_balance+0xe22/0x1910
> > 	[ 9718.563605][T13609]  ? check_chain_key+0x1e6/0x2e0
> > 	[ 9718.564331][T13609]  ? btrfs_relocate_chunk+0x120/0x120
> > 	[ 9718.565126][T13609]  ? kmem_cache_alloc_trace+0x5af/0x740
> > 	[ 9718.565943][T13609]  ? _copy_from_user+0x95/0xd0
> > 	[ 9718.566649][T13609]  btrfs_ioctl_balance+0x3de/0x4c0
> > 	[ 9718.567414][T13609]  btrfs_ioctl+0x2385/0x4250
> > 	[ 9718.568090][T13609]  ? __kasan_check_read+0x11/0x20
> > 	[ 9718.568830][T13609]  ? check_chain_key+0x1e6/0x2e0
> > 	[ 9718.569619][T13609]  ? btrfs_ioctl_get_supported_features+0x30/0x30
> > 	[ 9718.570658][T13609]  ? kvm_sched_clock_read+0x18/0x30
> > 	[ 9718.571526][T13609]  ? check_chain_key+0x1e6/0x2e0
> > 	[ 9718.572348][T13609]  ? lock_downgrade+0x3e0/0x3e0
> > 	[ 9718.573121][T13609]  ? do_vfs_ioctl+0xfc/0x9d0
> > 	[ 9718.573835][T13609]  ? ioctl_file_clone+0xe0/0xe0
> > 	[ 9718.574637][T13609]  ? check_flags.part.44+0x86/0x220
> > 	[ 9718.575472][T13609]  ? check_flags+0x26/0x30
> > 	[ 9718.576190][T13609]  ? lock_is_held_type+0xc9/0x100
> > 	[ 9718.576990][T13609]  ? check_flags.part.44+0x86/0x220
> > 	[ 9718.577836][T13609]  ? check_flags+0x26/0x30
> > 	[ 9718.578542][T13609]  ? lock_is_held_type+0xc9/0x100
> > 	[ 9718.579403][T13609]  ? __kasan_check_read+0x11/0x20
> > 	[ 9718.580225][T13609]  ? __fget_light+0xae/0x110
> > 	[ 9718.580983][T13609]  ksys_ioctl+0xa1/0xe0
> > 	[ 9718.581628][T13609]  __x64_sys_ioctl+0x43/0x50
> > 	[ 9718.582334][T13609]  do_syscall_64+0x60/0xf0
> > 	[ 9718.583285][T13609]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > 	[ 9718.584378][T13609] RIP: 0033:0x7f9577e85427
> > 	[ 9718.585289][T13609] Code: Bad RIP value.
> > 	[ 9718.586076][T13609] RSP: 002b:00007ffdc7b82548 EFLAGS: 00000206 ORIG_RAX: 0000000000000010
> > 	[ 9718.587896][T13609] RAX: ffffffffffffffda RBX: 00007ffdc7b825e8 RCX: 00007f9577e85427
> > 	[ 9718.589391][T13609] RDX: 00007ffdc7b825e8 RSI: 00000000c4009420 RDI: 0000000000000003
> > 	[ 9718.590817][T13609] RBP: 0000000000000003 R08: 0000000000000003 R09: 0000000000000078
> > 	[ 9718.592631][T13609] R10: fffffffffffff31c R11: 0000000000000206 R12: 0000000000000001
> > 	[ 9718.594405][T13609] R13: 0000000000000000 R14: 00007ffdc7b84a48 R15: 0000000000000001
> > 	[ 9718.596109][T13609] Modules linked in:
> > 	[ 9718.597056][T13609] ---[ end trace 2cf173f8217fc093 ]---
> > 	[ 9718.598018][T13609] RIP: 0010:create_reloc_root+0x468/0x480
> > 	[ 9718.602850][T13609] Code: e8 bd 5b bd ff 4d 8b 76 50 be 08 00 00 00 49 8d bc 24 f0 00 00 00 e8 c7 5b bd ff 4d 89 b4 24 f0 00 00 00 e9 ee fc ff ff 0f 0b <0f> 0b 0f 0b 0f 0b 0f 0b e8 9b df 07 01 66 66 2e 0f 1f 84 00 00 00
> > 	[ 9718.613371][T13609] RSP: 0018:ffffc900018e7018 EFLAGS: 00010282
> > 	[ 9718.621286][T13609] RAX: 00000000ffffffef RBX: ffff8881e103a400 RCX: 0000000000000000
> > 	[ 9718.631255][T13609] RDX: dffffc0000000000 RSI: 0000000000000000 RDI: 0000000000000246
> > 	[ 9718.639764][T13609] RBP: ffffc900018e7108 R08: 0000000000000000 R09: 0000000000000001
> > 	[ 9718.641533][T13609] R10: 0000000000000001 R11: fffffbfff3dfb081 R12: ffff8881f37c8020
> > 	[ 9718.643173][T13609] R13: ffff88801fbc5b28 R14: ffff8881f37c8000 R15: ffffc900018e70a0
> > 	[ 9718.644840][T13609] FS:  00007f9577d928c0(0000) GS:ffff8881f5800000(0000) knlGS:0000000000000000
> > 	[ 9718.646728][T13609] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > 	[ 9718.648607][T13609] CR2: 00007f9823e35500 CR3: 00000000a52e0002 CR4: 00000000001606e0
> > 	[ 9718.869689][ T4545] ==================================================================
> > 
> > same line, different call stack:
> > 
> > 	0xffffffff81933dd8 is in create_reloc_root (fs/btrfs/relocation.c:794).
> > 	789             btrfs_tree_unlock(eb);
> > 	790             free_extent_buffer(eb);
> > 	791
> > 	792             ret = btrfs_insert_root(trans, fs_info->tree_root,
> > 	793                                     &root_key, root_item);
> > 	794             BUG_ON(ret);
> > 	795             kfree(root_item);
> > 	796
> > 	797             reloc_root = btrfs_read_tree_root(fs_info->tree_root, &root_key);
> > 	798             BUG_ON(IS_ERR(reloc_root));
> > 
> > followed by
> > 
> > 	[ 9718.869689][ T4545] ==================================================================
> > 	[ 9718.871333][ T4545] BUG: KASAN: use-after-free in __mutex_lock+0x202/0xce0
> > 	[ 9718.872483][ T4545] Read of size 4 at addr ffff888014e9402c by task crawl_28443/4545
> > 	[ 9718.873746][ T4545] 
> > 	[ 9718.874106][ T4545] CPU: 1 PID: 4545 Comm: crawl_28443 Tainted: G      D W         5.8.0-6582a95aabfe+ #44
> > 	[ 9718.875684][ T4545] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
> > 	[ 9718.877149][ T4545] Call Trace:
> > 	[ 9718.877655][ T4545]  dump_stack+0xc8/0x11a
> > 	[ 9718.878317][ T4545]  ? __mutex_lock+0x202/0xce0
> > 	[ 9718.879065][ T4545]  print_address_description.constprop.8+0x1f/0x200
> > 	[ 9718.880167][ T4545]  ? __mutex_lock+0x202/0xce0
> > 	[ 9718.880916][ T4545]  ? __mutex_lock+0x202/0xce0
> > 	[ 9718.881666][ T4545]  kasan_report.cold.11+0x20/0x3e
> > 	[ 9718.882483][ T4545]  ? __mutex_lock+0x202/0xce0
> > 	[ 9718.883229][ T4545]  __asan_load4+0x69/0x90
> > 	[ 9718.883920][ T4545]  __mutex_lock+0x202/0xce0
> > 	[ 9718.884651][ T4545]  ? wait_current_trans+0xb7/0x230
> > 	[ 9718.885465][ T4545]  ? btrfs_record_root_in_trans+0x7e/0xc0
> > 	[ 9718.886388][ T4545]  ? mutex_lock_io_nested+0xc20/0xc20
> > 	[ 9718.887246][ T4545]  ? __kasan_check_read+0x11/0x20
> > 	[ 9718.888035][ T4545]  ? join_transaction+0x32/0x6f0
> > 	[ 9718.888854][ T4545]  ? join_transaction+0x1a6/0x6f0
> > 	[ 9718.889679][ T4545]  ? lock_downgrade+0x3e0/0x3e0
> > 	[ 9718.890496][ T4545]  ? __kasan_check_write+0x14/0x20
> > 	[ 9718.891308][ T4545]  ? lock_contended+0x720/0x720
> > 	[ 9718.892093][ T4545]  ? do_raw_spin_lock+0x1e0/0x1e0
> > 	[ 9718.892912][ T4545]  ? wait_current_trans+0xb7/0x230
> > 	[ 9718.893705][ T4545]  mutex_lock_nested+0x1b/0x20
> > 	[ 9718.894494][ T4545]  ? mutex_lock_nested+0x1b/0x20
> > 	[ 9718.895317][ T4545]  btrfs_record_root_in_trans+0x7e/0xc0
> > 	[ 9718.896245][ T4545]  start_transaction+0x189/0x8f0
> > 	[ 9718.897081][ T4545]  btrfs_start_transaction+0x1e/0x20
> > 	[ 9718.897941][ T4545]  btrfs_cont_expand+0x549/0x7a0
> > 	[ 9718.898805][ T4545]  ? btrfs_truncate_block+0x930/0x930
> > 	[ 9718.899665][ T4545]  ? inode_newsize_ok+0x75/0xc0
> > 	[ 9718.900438][ T4545]  ? setattr_prepare+0x9c/0x310
> > 	[ 9718.901242][ T4545]  btrfs_setattr+0x514/0x850
> > 	[ 9718.902035][ T4545]  ? current_time+0x8c/0xe0
> > 	[ 9718.902799][ T4545]  notify_change+0x4ec/0x700
> > 	[ 9718.903584][ T4545]  ? do_sys_ftruncate+0x108/0x220
> > 	[ 9718.904459][ T4545]  do_truncate+0xe4/0x160
> > 	[ 9718.905200][ T4545]  ? __x64_sys_openat2+0x170/0x170
> > 	[ 9718.906116][ T4545]  ? __sb_start_write+0x1a1/0x270
> > 	[ 9718.906954][ T4545]  do_sys_ftruncate+0x1b8/0x220
> > 	[ 9718.907759][ T4545]  __x64_sys_ftruncate+0x36/0x40
> > 	[ 9718.908577][ T4545]  do_syscall_64+0x60/0xf0
> > 	[ 9718.909292][ T4545]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > 	[ 9718.910521][ T4545] RIP: 0033:0x7f201fcab947
> > 	[ 9718.911247][ T4545] Code: Bad RIP value.
> > 	[ 9718.911915][ T4545] RSP: 002b:00007f201d3abeb8 EFLAGS: 00000202 ORIG_RAX: 000000000000004d
> > 	[ 9718.913285][ T4545] RAX: ffffffffffffffda RBX: 00007f201d3abfa0 RCX: 00007f201fcab947
> > 	[ 9718.914613][ T4545] RDX: 000000005f18a6d2 RSI: 0000000000286000 RDI: 0000000000000ec1
> > 	[ 9718.915921][ T4545] RBP: 00007f1fb01c2f00 R08: 00007ffe1e345080 R09: 00000000011b1f78
> > 	[ 9718.917236][ T4545] R10: 00000000011b1f78 R11: 0000000000000202 R12: 00007f201d3abf20
> > 	[ 9718.918556][ T4545] R13: 00007f201d3abef0 R14: 00007f201d3abf50 R15: 00007f201d3abed0
> > 	[ 9718.919882][ T4545] 
> > 	[ 9718.920268][ T4545] Allocated by task 6732:
> > 	[ 9718.920973][ T4545]  save_stack+0x21/0x50
> > 	[ 9718.921648][ T4545]  __kasan_kmalloc.constprop.17+0xc1/0xd0
> > 	[ 9718.922580][ T4545]  kasan_slab_alloc+0x12/0x20
> > 	[ 9718.923345][ T4545]  kmem_cache_alloc_node+0x113/0x720
> > 	[ 9718.924203][ T4545]  copy_process+0x357/0x3680
> > 	[ 9718.924955][ T4545]  _do_fork+0xed/0x880
> > 	[ 9718.925622][ T4545]  __do_sys_clone+0xee/0x130
> > 	[ 9718.926369][ T4545]  __x64_sys_clone+0x67/0x80
> > 	[ 9718.927119][ T4545]  do_syscall_64+0x60/0xf0
> > 	[ 9718.927848][ T4545]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > 	[ 9718.928812][ T4545] 
> > 	[ 9718.929173][ T4545] Freed by task 24:
> > 	[ 9718.929787][ T4545]  save_stack+0x21/0x50
> > 	[ 9718.930453][ T4545]  __kasan_slab_free+0x118/0x170
> > 	[ 9718.931242][ T4545]  kasan_slab_free+0xe/0x10
> > 	[ 9718.931970][ T4545]  kmem_cache_free+0x5f/0x280
> > 	[ 9718.932730][ T4545]  free_task+0x73/0x90
> > 	[ 9718.933391][ T4545]  __put_task_struct+0x199/0x1d0
> > 	[ 9718.934187][ T4545]  delayed_put_task_struct+0x124/0x1b0
> > 	[ 9718.935071][ T4545]  rcu_core+0x3b0/0xeb0
> > 	[ 9718.935758][ T4545]  rcu_core_si+0xe/0x10
> > 	[ 9718.936433][ T4545]  __do_softirq+0x120/0x5e3
> > 	[ 9718.937165][ T4545] 
> > 	[ 9718.937545][ T4545] The buggy address belongs to the object at ffff888014e94000
> > 	[ 9718.937545][ T4545]  which belongs to the cache task_struct(168:screen-wrapper.service) of size 11072
> > 	[ 9718.940391][ T4545] The buggy address is located 44 bytes inside of
> > 	[ 9718.940391][ T4545]  11072-byte region [ffff888014e94000, ffff888014e96b40)
> > 	[ 9718.942559][ T4545] The buggy address belongs to the page:
> > 	[ 9718.943454][ T4545] page:ffffea000053a500 refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff888014e97fff head:ffffea000053a500 order:2 compound_mapcount:0 compound_pincount:0
> > 	[ 9718.946072][ T4545] flags: 0xfffe0000010200(slab|head)
> > 	[ 9718.946958][ T4545] raw: 00fffe0000010200 ffffea00011ab108 ffffea0001d6f108 ffff8881eabd9700
> > 	[ 9718.948406][ T4545] raw: ffff888014e97fff ffff888014e94000 0000000100000001 0000000000000000
> > 	[ 9718.949889][ T4545] page dumped because: kasan: bad access detected
> > 	[ 9718.950977][ T4545] 
> > 	[ 9718.951354][ T4545] Memory state around the buggy address:
> > 	[ 9718.952296][ T4545]  ffff888014e93f00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> > 	[ 9718.953641][ T4545]  ffff888014e93f80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> > 	[ 9718.955004][ T4545] >ffff888014e94000: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> > 	[ 9718.956366][ T4545]                                   ^
> > 	[ 9718.957258][ T4545]  ffff888014e94080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> > 	[ 9718.958653][ T4545]  ffff888014e94100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> > 	[ 9718.960034][ T4545] ==================================================================
> > 
> 





* Re: BUG at fs/btrfs/relocation.c:794! Still happening on misc-next and 5.8.3
  2020-08-04 16:16     ` Zygo Blaxell
@ 2020-08-28  0:03       ` Zygo Blaxell
  2020-08-28  0:08         ` Zygo Blaxell
  0 siblings, 1 reply; 13+ messages in thread
From: Zygo Blaxell @ 2020-08-28  0:03 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: David Sterba, linux-btrfs, wqu

On Tue, Aug 04, 2020 at 12:16:26PM -0400, Zygo Blaxell wrote:
> On Fri, Jul 24, 2020 at 08:19:36AM +0800, Qu Wenruo wrote:
> > 
> > 
> > On 2020/7/24 上午5:56, Zygo Blaxell wrote:
> > > On Wed, Jul 01, 2020 at 12:10:06AM +0200, David Sterba wrote:
> > >> Hi,
> > >>
> > >> I've hit a crash in relocation I've never seen before.
> > >>
> > >> [ 2129.210066] kernel BUG at fs/btrfs/relocation.c:794!
> > > 
> > > I hit an issue yesterday that reminded me of this.
> > > 
> > >> [ 2129.215268] invalid opcode: 0000 [#1] PREEMPT SMP
> > >> [ 2129.220114] CPU: 1 PID: 3303 Comm: btrfs Not tainted 5.8.0-rc3-git+ #638
> > >> [ 2129.220116] Hardware name: empty empty/S3993, BIOS PAQEX0-3 02/24/2008
> > >> [ 2129.220265] RIP: 0010:create_reloc_root+0x214/0x260 [btrfs]
> > >> [ 2129.258760] RSP: 0018:ffffbe1e809b38b8 EFLAGS: 00010282
> > >> [ 2129.258763] RAX: 00000000ffffffef RBX: ffff988d577f9000 RCX: 0000000000000000
> > >> [ 2129.258765] RDX: 0000000000000001 RSI: ffffffff8e2a2580 RDI: ffff988d64aaa6a8
> > >> [ 2129.258766] RBP: ffff988d5dfcdc00 R08: 0000000000000000 R09: 0000000000000000
> > >> [ 2129.258767] R10: 0000000000000001 R11: 0000000000000000 R12: ffff988d0e02fa78
> > >> [ 2129.258769] R13: 0000000000000005 R14: ffff988d64fe8000 R15: ffff988d0e02fa78
> > >> [ 2129.258771] FS:  00007f82a612e8c0(0000) GS:ffff988d67000000(0000) knlGS:0000000000000000
> > >> [ 2129.258772] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > >> [ 2129.258774] CR2: 000000000559d028 CR3: 000000020b289000 CR4: 00000000000006e0
> > >> [ 2129.258775] Call Trace:
> > >> [ 2129.258825]  btrfs_init_reloc_root+0xe8/0x120 [btrfs]
> > >> [ 2129.258862]  record_root_in_trans+0xae/0xd0 [btrfs]
> > >> [ 2129.258901]  btrfs_record_root_in_trans+0x51/0x70 [btrfs]
> > >> [ 2129.340388]  select_reloc_root+0x94/0x340 [btrfs]
> > >> [ 2129.340433]  do_relocation+0xda/0x7b0 [btrfs]
> > >> [ 2129.349854]  ? _raw_spin_unlock+0x1f/0x40
> > >> [ 2129.349898]  relocate_tree_blocks+0x336/0x670 [btrfs]
> > >> [ 2129.359325]  relocate_block_group+0x2f6/0x600 [btrfs]
> > >> [ 2129.359365]  btrfs_relocate_block_group+0x15e/0x340 [btrfs]
> > >> [ 2129.359408]  btrfs_relocate_chunk+0x38/0x110 [btrfs]
> > >> [ 2129.375494]  __btrfs_balance+0x42c/0xce0 [btrfs]
> > >> [ 2129.375553]  btrfs_balance+0x66a/0xbe0 [btrfs]
> > >> [ 2129.375562]  ? kmem_cache_alloc_trace+0x19c/0x330
> > >> [ 2129.389852]  btrfs_ioctl_balance+0x298/0x350 [btrfs]
> > >> [ 2129.389887]  btrfs_ioctl+0x304/0x2490 [btrfs]
> > >> [ 2129.389898]  ? do_user_addr_fault+0x221/0x49c
> > >> [ 2129.404070]  ? sched_clock_cpu+0x15/0x140
> > >> [ 2129.404073]  ? do_user_addr_fault+0x221/0x49c
> > >> [ 2129.404079]  ? up_read+0x18/0x240
> > >> [ 2129.404086]  ? ksys_ioctl+0x68/0xa0
> > >> [ 2129.404091]  ksys_ioctl+0x68/0xa0
> > >> [ 2129.423308]  __x64_sys_ioctl+0x16/0x20
> > >> [ 2129.423312]  do_syscall_64+0x50/0xe0
> > >> [ 2129.423315]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > >> [ 2129.423318] RIP: 0033:0x7f82a51c6327
> > >> [ 2129.423319] Code: Bad RIP value.
> > >> [ 2129.423348] RSP: 002b:00007ffd32cf6218 EFLAGS: 00000206 ORIG_RAX: 0000000000000010
> > >> [ 2129.423367] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f82a51c6327
> > >> [ 2129.423368] RDX: 00007ffd32cf62a0 RSI: 00000000c4009420 RDI: 0000000000000003
> > >> [ 2129.423372] RBP: 0000000000000003 R08: 0000000000000000 R09: 0000000000000000
> > >> [ 2129.423377] R10: 000000000fa99fa0 R11: 0000000000000206 R12: 00007ffd32cf8823
> > >> [ 2129.423379] R13: 00007ffd32cf62a0 R14: 0000000000000001 R15: 0000000000000000
> > >>
> > >> Relevant code called from create_reloc_root:
> > >>
> > >>         ret = btrfs_insert_root(trans, fs_info->tree_root,
> > >>                                 &root_key, root_item);
> > >>         BUG_ON(ret)
> > >>
> > >> and according to EAX, ret is -17 which is EEXIST.
> > >>
> > >> I don't have a reproducer, the testing image has been filled by random git
> > >> checkouts, deduplicated by BEES, then tons of snapshots created until the
> > >> metadata got exhausted, some file deletion and balances.
> > > 
> > > Mine is rsync, bees, lots of snapshots, balances, scrubs.  I recently also
> > > added random 'killall -INT btrfs' to send balance some fatal signals.
> > > 
> > >> This is the same image that led to the patch "btrfs: allow use of global block
> > >> reserve for balance item deletion", so this could have left it in some
> > >> intermediate state where the balance item was not removed and the reloc tree as
> > >> well.
> > >>
> > > >> There were a few unsuccessful mounts due to relocation recovery that I was
> > > >> trying to debug, but then it started to work.
> > >>
> > >> The error happened with this 'fi df' saved after the balance start:
> > >>
> > >> # btrfs fi df mnt
> > >> Data, single: total=80.01GiB, used=38.67GiB
> > >> System, single: total=4.00MiB, used=16.00KiB
> > >> Metadata, single: total=19.99GiB, used=19.46GiB
> > >> GlobalReserve, single: total=512.00MiB, used=44.00KiB
> > > 
> > > Mine is:
> > > 
> > > 	Data, single: total=1.75TiB, used=1.74TiB
> > > 	System, RAID1: total=32.00MiB, used=208.00KiB
> > > 	Metadata, RAID1: total=25.00GiB, used=22.89GiB
> > > 	GlobalReserve, single: total=512.00MiB, used=0.00B
> > > 
> > > though this is some time after the failure (and a reboot).  I do notice
> > > that there's lots of unallocated space, but metadata usage is close
> > > to allocated, and I have been experiencing a lot of EROFS events when
> > > that happens, even if there's gigabytes unallocated.
> > > 
> > > btrfs fi us:
> > > 
> > > 	Overall:
> > > 	    Device size:                   2.00TiB
> > > 	    Device allocated:              1.80TiB
> > > 	    Device unallocated:          208.94GiB
> > > 	    Device missing:                  0.00B
> > > 	    Used:                          1.79TiB
> > > 	    Free (estimated):            211.30GiB      (min: 106.83GiB)
> > > 	    Data ratio:                       1.00
> > > 	    Metadata ratio:                   2.00
> > > 	    Global reserve:              512.00MiB      (used: 0.00B)
> > > 
> > > 	Data,single: Size:1.75TiB, Used:1.74TiB (99.87%)
> > > 	   /dev/mapper/vgtest-tvdb        894.00GiB
> > > 	   /dev/mapper/vgtest-tvdc        895.00GiB
> > > 
> > > 	Metadata,RAID1: Size:25.00GiB, Used:22.87GiB (91.47%)
> > > 	   /dev/mapper/vgtest-tvdb         25.00GiB
> > > 	   /dev/mapper/vgtest-tvdc         25.00GiB
> > > 
> > > 	System,RAID1: Size:32.00MiB, Used:208.00KiB (0.63%)
> > > 	   /dev/mapper/vgtest-tvdb         32.00MiB
> > > 	   /dev/mapper/vgtest-tvdc         32.00MiB
> > > 
> > > 	Unallocated:
> > > 	   /dev/mapper/vgtest-tvdb        104.97GiB
> > > 	   /dev/mapper/vgtest-tvdc        103.97GiB
> > > 
> > >> The error looks like a repeated relocation tree creation, which would point to
> > > >> the unsuccessful balances or inconsistent state (balance item, reloc trees).
> > >> It's not a "typical" mix of operations but I'd appreciate any insights here.
> > > 
> > > I have the same line but different call stack, with misc-next
> > > e3027d10af42d24940be74dabaf1550cd770bd48:
> > > 
> > > 	[ 9717.746937][T13609] BTRFS info (device dm-0): balance: start -mlimit=1 -slimit=1
> > > 	[ 9717.765086][T13609] BTRFS info (device dm-0): relocating block group 10991411658752 flags metadata|raid1
> > > 	[ 9718.511137][T13609] ------------[ cut here ]------------
> > > 	[ 9718.512293][T13609] kernel BUG at fs/btrfs/relocation.c:794!
> > > 	[ 9718.513421][T13609] invalid opcode: 0000 [#1] SMP KASAN PTI
> > > 	[ 9718.514590][T13609] CPU: 1 PID: 13609 Comm: btrfs Tainted: G        W         5.8.0-6582a95aabfe+ #44
> > > 	[ 9718.516178][T13609] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
> > > 	[ 9718.517750][T13609] RIP: 0010:create_reloc_root+0x468/0x480
> > > 	[ 9718.518717][T13609] Code: e8 bd 5b bd ff 4d 8b 76 50 be 08 00 00 00 49 8d bc 24 f0 00 00 00 e8 c7 5b bd ff 4d 89 b4 24 f0 00 00 00 e9 ee fc ff ff 0f 0b <0f> 0b 0f 0b 0f 0b 0f 0b e8 9b df 07 01 66 66 2e 0f 1f 84 00 00 00
> > > 	[ 9718.521995][T13609] RSP: 0018:ffffc900018e7018 EFLAGS: 00010282
> > > 	[ 9718.522991][T13609] RAX: 00000000ffffffef RBX: ffff8881e103a400 RCX: 0000000000000000
> > > 	[ 9718.524300][T13609] RDX: dffffc0000000000 RSI: 0000000000000000 RDI: 0000000000000246
> > > 	[ 9718.525612][T13609] RBP: ffffc900018e7108 R08: 0000000000000000 R09: 0000000000000001
> > > 	[ 9718.527056][T13609] R10: 0000000000000001 R11: fffffbfff3dfb081 R12: ffff8881f37c8020
> > > 	[ 9718.528386][T13609] R13: ffff88801fbc5b28 R14: ffff8881f37c8000 R15: ffffc900018e70a0
> > > 	[ 9718.529756][T13609] FS:  00007f9577d928c0(0000) GS:ffff8881f5800000(0000) knlGS:0000000000000000
> > > 	[ 9718.531211][T13609] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > 	[ 9718.532295][T13609] CR2: 00007f9823e35500 CR3: 00000000a52e0002 CR4: 00000000001606e0
> > > 	[ 9718.533608][T13609] Call Trace:
> > > 	[ 9718.534151][T13609]  ? update_backref_node+0xf0/0xf0
> > > 	[ 9718.535137][T13609]  ? check_chain_key+0x1e6/0x2e0
> > > 	[ 9718.536057][T13609]  btrfs_init_reloc_root+0x2d7/0x310
> > 
> > That's the same problem.
> > 
> > btrfs_init_reloc_root() got -EEXIST and triggered BUG_ON().
> > 
> > In that case, that means there are some reloc trees not cleaned up.
> > 
> > Would you mind providing the "btrfs ins dump-tree -t root" dump for
> > that fs if the problem still happens?
> 
> http://furryterror.org/~zblaxell/tmp/.tvdb/tvdb.txt
> 
> The problem is now happening multiple times per day, starting with
> kdave's misc-next e3027d10af42d24940be74dabaf1550cd770bd48 Date:  Thu
> Jul 23 00:18:04 2020 +0900 and continuing on v5.8.0.
> 
> The previous misc-next (that I have test data for),
> cb799f0a0bb372f37f96893d2e80c1dc2f5206da Date: Thu Jul 16 13:29:46 2020
> -0700 does not have this problem.
> 
> These commit hashes are from https://gitlab.com/kdave/btrfs-devel.

Still hitting this bug every few hours on all 5.8.x so far, and misc-next.

There is a strong correlation between hitting the bug and balance starting
to relocate a metadata block group, and a weaker correlation with data
block groups:

	Aug 23 05:04:05 regress kernel: [53458.128928][ T9737] BTRFS info (device dm-0): relocating block group 14939862335488 flags metadata|raid1
	Aug 23 05:04:05 regress kernel: [53458.999342][ T9737] ------------[ cut here ]------------
	Aug 23 05:04:05 regress kernel: [53459.000275][ T9737] kernel BUG at fs/btrfs/relocation.c:794!

	Aug 24 01:23:52 regress kernel: [58662.545914][T17474] BTRFS info (device dm-0): relocating block group 15083978620928 flags metadata|raid1
	Aug 24 01:23:54 regress kernel: [58664.778274][T17474] ------------[ cut here ]------------
	Aug 24 01:23:54 regress kernel: [58664.782182][T17474] kernel BUG at fs/btrfs/relocation.c:794!

	Aug 24 07:17:07 regress kernel: [21068.421134][T29457] BTRFS info (device dm-0): relocating block group 15160784715776 flags metadata|raid1
	Aug 24 07:17:08 regress kernel: [21069.307661][ T5176] ------------[ cut here ]------------
	Aug 24 07:17:08 regress kernel: [21069.309195][ T5176] kernel BUG at fs/btrfs/relocation.c:794!

	Aug 25 18:58:26 regress kernel: [22013.457555][ T2164] BTRFS info (device dm-0): relocating block group 15530051239936 flags metadata|raid1
	Aug 25 18:58:27 regress kernel: [22014.460689][ T4939] ------------[ cut here ]------------
	Aug 25 18:58:27 regress kernel: [22014.461653][ T4939] kernel BUG at fs/btrfs/relocation.c:794!

	Aug 26 03:39:20 regress kernel: [31172.016638][T30882] BTRFS info (device dm-0): relocating block group 15576759009280 flags metadata|raid1
	Aug 26 03:39:21 regress kernel: [31173.329719][T12663] ------------[ cut here ]------------
	Aug 26 03:39:21 regress kernel: [31173.330682][T12663] kernel BUG at fs/btrfs/relocation.c:794!

	Aug 26 16:00:02 regress kernel: [44334.231395][T25917] BTRFS info (device dm-0): relocating block group 15631888941056 flags data
	Aug 26 16:00:04 regress kernel: [44336.800710][T26519] ------------[ cut here ]------------
	Aug 26 16:00:04 regress kernel: [44336.802888][T26519] kernel BUG at fs/btrfs/relocation.c:794!

	Aug 27 15:45:29 regress kernel: [55423.626717][ T5878] BTRFS info (device dm-0): relocating block group 15820229967872 flags metadata|raid1
	Aug 27 15:45:29 regress kernel: [55423.798584][T15744] ------------[ cut here ]------------
	Aug 27 15:45:29 regress kernel: [55423.802581][T15744] kernel BUG at fs/btrfs/relocation.c:794!

	Aug 27 17:35:26 regress kernel: [ 6459.129124][T21053] BTRFS info (device dm-0): relocating block group 15831168712704 flags metadata|raid1
	Aug 27 17:35:43 regress kernel: [ 6475.931029][T25720] ------------[ cut here ]------------
	Aug 27 17:35:43 regress kernel: [ 6475.932403][T25720] kernel BUG at fs/btrfs/relocation.c:794!

Every instance of the BUG so far seems to have occurred within 30 seconds
of starting a balance.
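
The 30-second figure can be checked from the bracketed seconds-since-boot
timestamps in the log lines above.  A minimal C sketch, assuming syslog-style
lines that contain "kernel: [<seconds>]"; the helper names kmsg_timestamp and
bug_delay are made up for illustration and are not part of any btrfs tool:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Return the seconds-since-boot timestamp from a syslog-style kernel line
 * such as "... kernel: [53458.128928][ T9737] BTRFS info ...",
 * or -1.0 if no "kernel: [" timestamp can be parsed.
 */
double kmsg_timestamp(const char *line)
{
	const char *marker = "kernel: [";
	const char *p;
	double ts;

	assert(line != NULL);
	p = strstr(line, marker);
	if (!p)
		return -1.0;
	p += strlen(marker);
	/* %lf skips the leading padding in timestamps like "[ 6459.129124]" */
	if (sscanf(p, "%lf", &ts) != 1)
		return -1.0;
	return ts;
}

/*
 * Seconds between the "relocating block group" line and the "kernel BUG"
 * line.  Returns -1.0 if either line cannot be parsed or if the BUG
 * timestamp is earlier than the start (e.g. lines from different boots).
 */
double bug_delay(const char *start_line, const char *bug_line)
{
	double start = kmsg_timestamp(start_line);
	double bug = kmsg_timestamp(bug_line);

	if (start < 0.0 || bug < 0.0 || bug < start)
		return -1.0;
	return bug - start;
}
```

For the Aug 23 05:04 pair above this is 53459.000275 - 53458.128928, roughly
0.87 seconds; for the Aug 27 17:35 pair it is roughly 16.8 seconds.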

The on-disk data is fine.  After a reboot the same block group can be
successfully balanced.
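
For reference, the quoted traces show RAX: 00000000ffffffef at the BUG, which
the earlier messages read as ret == -17, i.e. -EEXIST.  A minimal C sketch of
that decoding, assuming the int return value sits in the low 32 bits of RAX
(as on x86-64) and Linux errno numbering; reg_to_int and reg_is_eexist are
hypothetical helper names:

```c
#include <errno.h>
#include <stdint.h>

/*
 * Interpret the low 32 bits of a register dump value as a signed int,
 * the way an `int ret` held in EAX would be read.  The two's complement
 * conversion is spelled out to avoid implementation-defined casts.
 */
int reg_to_int(uint64_t reg)
{
	uint32_t lo = (uint32_t)reg;

	if (lo >= 0x80000000u)
		return -(int)(~lo) - 1;
	return (int)lo;
}

/* Nonzero if the register value decodes to -EEXIST. */
int reg_is_eexist(uint64_t reg)
{
	return reg_to_int(reg) == -EEXIST;
}
```

With the value from the traces, reg_to_int(0xffffffef) gives -17.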

> 
> > Thanks,
> > Qu
> > > 	[ 9718.537016][T13609]  ? find_reloc_root+0x200/0x200
> > > 	[ 9718.537992][T13609]  ? do_raw_spin_unlock+0xa8/0x140
> > > 	[ 9718.538899][T13609]  record_root_in_trans+0x18c/0x1d0
> > > 	[ 9718.539848][T13609]  btrfs_record_root_in_trans+0x8b/0xc0
> > > 	[ 9718.540843][T13609]  select_reloc_root+0x15f/0x6a0
> > > 	[ 9718.541943][T13609]  ? create_reloc_inode.isra.28+0x410/0x410
> > > 	[ 9718.543066][T13609]  ? rcu_read_lock_sched_held+0xa1/0xd0
> > > 	[ 9718.544333][T13609]  ? check_flags.part.44+0x86/0x220
> > > 	[ 9718.545186][T13609]  ? check_flags+0x26/0x30
> > > 	[ 9718.545870][T13609]  ? lock_is_held_type+0xc9/0x100
> > > 	[ 9718.546651][T13609]  do_relocation+0x242/0xc90
> > > 	[ 9718.547372][T13609]  ? select_reloc_root+0x6a0/0x6a0
> > > 	[ 9718.548160][T13609]  ? check_flags.part.44+0x86/0x220
> > > 	[ 9718.548969][T13609]  ? __kasan_check_read+0x11/0x20
> > > 	[ 9718.549745][T13609]  ? mark_lock+0xa8/0x440
> > > 	[ 9718.550426][T13609]  ? mark_held_locks+0x8d/0xb0
> > > 	[ 9718.551165][T13609]  ? btrfs_backref_cleanup_node+0x5c1/0x600
> > > 	[ 9718.552079][T13609]  ? memcpy+0x4d/0x60
> > > 	[ 9718.552694][T13609]  ? read_extent_buffer+0xcc/0x120
> > > 	[ 9718.553478][T13609]  relocate_tree_blocks+0xa29/0xb00
> > > 	[ 9718.554255][T13609]  ? do_relocation+0xc90/0xc90
> > > 	[ 9718.554978][T13609]  ? kmem_cache_alloc_trace+0x5af/0x740
> > > 	[ 9718.555855][T13609]  ? free_extent_buffer.part.46+0x90/0x140
> > > 	[ 9718.556756][T13609]  ? rb_insert_color+0x342/0x360
> > > 	[ 9718.557581][T13609]  ? free_extent_buffer+0x13/0x20
> > > 	[ 9718.558445][T13609]  ? add_tree_block.isra.34+0x236/0x2b0
> > > 	[ 9718.559387][T13609]  relocate_block_group+0x52e/0x830
> > > 	[ 9718.560275][T13609]  ? merge_reloc_roots+0x4b0/0x4b0
> > > 	[ 9718.561137][T13609]  btrfs_relocate_block_group+0x26e/0x4c0
> > > 	[ 9718.562137][T13609]  btrfs_relocate_chunk+0x52/0x120
> > > 	[ 9718.562918][T13609]  btrfs_balance+0xe22/0x1910
> > > 	[ 9718.563605][T13609]  ? check_chain_key+0x1e6/0x2e0
> > > 	[ 9718.564331][T13609]  ? btrfs_relocate_chunk+0x120/0x120
> > > 	[ 9718.565126][T13609]  ? kmem_cache_alloc_trace+0x5af/0x740
> > > 	[ 9718.565943][T13609]  ? _copy_from_user+0x95/0xd0
> > > 	[ 9718.566649][T13609]  btrfs_ioctl_balance+0x3de/0x4c0
> > > 	[ 9718.567414][T13609]  btrfs_ioctl+0x2385/0x4250
> > > 	[ 9718.568090][T13609]  ? __kasan_check_read+0x11/0x20
> > > 	[ 9718.568830][T13609]  ? check_chain_key+0x1e6/0x2e0
> > > 	[ 9718.569619][T13609]  ? btrfs_ioctl_get_supported_features+0x30/0x30
> > > 	[ 9718.570658][T13609]  ? kvm_sched_clock_read+0x18/0x30
> > > 	[ 9718.571526][T13609]  ? check_chain_key+0x1e6/0x2e0
> > > 	[ 9718.572348][T13609]  ? lock_downgrade+0x3e0/0x3e0
> > > 	[ 9718.573121][T13609]  ? do_vfs_ioctl+0xfc/0x9d0
> > > 	[ 9718.573835][T13609]  ? ioctl_file_clone+0xe0/0xe0
> > > 	[ 9718.574637][T13609]  ? check_flags.part.44+0x86/0x220
> > > 	[ 9718.575472][T13609]  ? check_flags+0x26/0x30
> > > 	[ 9718.576190][T13609]  ? lock_is_held_type+0xc9/0x100
> > > 	[ 9718.576990][T13609]  ? check_flags.part.44+0x86/0x220
> > > 	[ 9718.577836][T13609]  ? check_flags+0x26/0x30
> > > 	[ 9718.578542][T13609]  ? lock_is_held_type+0xc9/0x100
> > > 	[ 9718.579403][T13609]  ? __kasan_check_read+0x11/0x20
> > > 	[ 9718.580225][T13609]  ? __fget_light+0xae/0x110
> > > 	[ 9718.580983][T13609]  ksys_ioctl+0xa1/0xe0
> > > 	[ 9718.581628][T13609]  __x64_sys_ioctl+0x43/0x50
> > > 	[ 9718.582334][T13609]  do_syscall_64+0x60/0xf0
> > > 	[ 9718.583285][T13609]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > > 	[ 9718.584378][T13609] RIP: 0033:0x7f9577e85427
> > > 	[ 9718.585289][T13609] Code: Bad RIP value.
> > > 	[ 9718.586076][T13609] RSP: 002b:00007ffdc7b82548 EFLAGS: 00000206 ORIG_RAX: 0000000000000010
> > > 	[ 9718.587896][T13609] RAX: ffffffffffffffda RBX: 00007ffdc7b825e8 RCX: 00007f9577e85427
> > > 	[ 9718.589391][T13609] RDX: 00007ffdc7b825e8 RSI: 00000000c4009420 RDI: 0000000000000003
> > > 	[ 9718.590817][T13609] RBP: 0000000000000003 R08: 0000000000000003 R09: 0000000000000078
> > > 	[ 9718.592631][T13609] R10: fffffffffffff31c R11: 0000000000000206 R12: 0000000000000001
> > > 	[ 9718.594405][T13609] R13: 0000000000000000 R14: 00007ffdc7b84a48 R15: 0000000000000001
> > > 	[ 9718.596109][T13609] Modules linked in:
> > > 	[ 9718.597056][T13609] ---[ end trace 2cf173f8217fc093 ]---
> > > 	[ 9718.598018][T13609] RIP: 0010:create_reloc_root+0x468/0x480
> > > 	[ 9718.602850][T13609] Code: e8 bd 5b bd ff 4d 8b 76 50 be 08 00 00 00 49 8d bc 24 f0 00 00 00 e8 c7 5b bd ff 4d 89 b4 24 f0 00 00 00 e9 ee fc ff ff 0f 0b <0f> 0b 0f 0b 0f 0b 0f 0b e8 9b df 07 01 66 66 2e 0f 1f 84 00 00 00
> > > 	[ 9718.613371][T13609] RSP: 0018:ffffc900018e7018 EFLAGS: 00010282
> > > 	[ 9718.621286][T13609] RAX: 00000000ffffffef RBX: ffff8881e103a400 RCX: 0000000000000000
> > > 	[ 9718.631255][T13609] RDX: dffffc0000000000 RSI: 0000000000000000 RDI: 0000000000000246
> > > 	[ 9718.639764][T13609] RBP: ffffc900018e7108 R08: 0000000000000000 R09: 0000000000000001
> > > 	[ 9718.641533][T13609] R10: 0000000000000001 R11: fffffbfff3dfb081 R12: ffff8881f37c8020
> > > 	[ 9718.643173][T13609] R13: ffff88801fbc5b28 R14: ffff8881f37c8000 R15: ffffc900018e70a0
> > > 	[ 9718.644840][T13609] FS:  00007f9577d928c0(0000) GS:ffff8881f5800000(0000) knlGS:0000000000000000
> > > 	[ 9718.646728][T13609] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > 	[ 9718.648607][T13609] CR2: 00007f9823e35500 CR3: 00000000a52e0002 CR4: 00000000001606e0
> > > 	[ 9718.869689][ T4545] ==================================================================
> > > 
> > > same line, different call stack:
> > > 
> > > 	0xffffffff81933dd8 is in create_reloc_root (fs/btrfs/relocation.c:794).
> > > 	789             btrfs_tree_unlock(eb);
> > > 	790             free_extent_buffer(eb);
> > > 	791
> > > 	792             ret = btrfs_insert_root(trans, fs_info->tree_root,
> > > 	793                                     &root_key, root_item);
> > > 	794             BUG_ON(ret);
> > > 	795             kfree(root_item);
> > > 	796
> > > 	797             reloc_root = btrfs_read_tree_root(fs_info->tree_root, &root_key);
> > > 	798             BUG_ON(IS_ERR(reloc_root));
> > > 
> > > followed by
> > > 
> > > 	[ 9718.869689][ T4545] ==================================================================
> > > 	[ 9718.871333][ T4545] BUG: KASAN: use-after-free in __mutex_lock+0x202/0xce0
> > > 	[ 9718.872483][ T4545] Read of size 4 at addr ffff888014e9402c by task crawl_28443/4545
> > > 	[ 9718.873746][ T4545] 
> > > 	[ 9718.874106][ T4545] CPU: 1 PID: 4545 Comm: crawl_28443 Tainted: G      D W         5.8.0-6582a95aabfe+ #44
> > > 	[ 9718.875684][ T4545] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
> > > 	[ 9718.877149][ T4545] Call Trace:
> > > 	[ 9718.877655][ T4545]  dump_stack+0xc8/0x11a
> > > 	[ 9718.878317][ T4545]  ? __mutex_lock+0x202/0xce0
> > > 	[ 9718.879065][ T4545]  print_address_description.constprop.8+0x1f/0x200
> > > 	[ 9718.880167][ T4545]  ? __mutex_lock+0x202/0xce0
> > > 	[ 9718.880916][ T4545]  ? __mutex_lock+0x202/0xce0
> > > 	[ 9718.881666][ T4545]  kasan_report.cold.11+0x20/0x3e
> > > 	[ 9718.882483][ T4545]  ? __mutex_lock+0x202/0xce0
> > > 	[ 9718.883229][ T4545]  __asan_load4+0x69/0x90
> > > 	[ 9718.883920][ T4545]  __mutex_lock+0x202/0xce0
> > > 	[ 9718.884651][ T4545]  ? wait_current_trans+0xb7/0x230
> > > 	[ 9718.885465][ T4545]  ? btrfs_record_root_in_trans+0x7e/0xc0
> > > 	[ 9718.886388][ T4545]  ? mutex_lock_io_nested+0xc20/0xc20
> > > 	[ 9718.887246][ T4545]  ? __kasan_check_read+0x11/0x20
> > > 	[ 9718.888035][ T4545]  ? join_transaction+0x32/0x6f0
> > > 	[ 9718.888854][ T4545]  ? join_transaction+0x1a6/0x6f0
> > > 	[ 9718.889679][ T4545]  ? lock_downgrade+0x3e0/0x3e0
> > > 	[ 9718.890496][ T4545]  ? __kasan_check_write+0x14/0x20
> > > 	[ 9718.891308][ T4545]  ? lock_contended+0x720/0x720
> > > 	[ 9718.892093][ T4545]  ? do_raw_spin_lock+0x1e0/0x1e0
> > > 	[ 9718.892912][ T4545]  ? wait_current_trans+0xb7/0x230
> > > 	[ 9718.893705][ T4545]  mutex_lock_nested+0x1b/0x20
> > > 	[ 9718.894494][ T4545]  ? mutex_lock_nested+0x1b/0x20
> > > 	[ 9718.895317][ T4545]  btrfs_record_root_in_trans+0x7e/0xc0
> > > 	[ 9718.896245][ T4545]  start_transaction+0x189/0x8f0
> > > 	[ 9718.897081][ T4545]  btrfs_start_transaction+0x1e/0x20
> > > 	[ 9718.897941][ T4545]  btrfs_cont_expand+0x549/0x7a0
> > > 	[ 9718.898805][ T4545]  ? btrfs_truncate_block+0x930/0x930
> > > 	[ 9718.899665][ T4545]  ? inode_newsize_ok+0x75/0xc0
> > > 	[ 9718.900438][ T4545]  ? setattr_prepare+0x9c/0x310
> > > 	[ 9718.901242][ T4545]  btrfs_setattr+0x514/0x850
> > > 	[ 9718.902035][ T4545]  ? current_time+0x8c/0xe0
> > > 	[ 9718.902799][ T4545]  notify_change+0x4ec/0x700
> > > 	[ 9718.903584][ T4545]  ? do_sys_ftruncate+0x108/0x220
> > > 	[ 9718.904459][ T4545]  do_truncate+0xe4/0x160
> > > 	[ 9718.905200][ T4545]  ? __x64_sys_openat2+0x170/0x170
> > > 	[ 9718.906116][ T4545]  ? __sb_start_write+0x1a1/0x270
> > > 	[ 9718.906954][ T4545]  do_sys_ftruncate+0x1b8/0x220
> > > 	[ 9718.907759][ T4545]  __x64_sys_ftruncate+0x36/0x40
> > > 	[ 9718.908577][ T4545]  do_syscall_64+0x60/0xf0
> > > 	[ 9718.909292][ T4545]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > > 	[ 9718.910521][ T4545] RIP: 0033:0x7f201fcab947
> > > 	[ 9718.911247][ T4545] Code: Bad RIP value.
> > > 	[ 9718.911915][ T4545] RSP: 002b:00007f201d3abeb8 EFLAGS: 00000202 ORIG_RAX: 000000000000004d
> > > 	[ 9718.913285][ T4545] RAX: ffffffffffffffda RBX: 00007f201d3abfa0 RCX: 00007f201fcab947
> > > 	[ 9718.914613][ T4545] RDX: 000000005f18a6d2 RSI: 0000000000286000 RDI: 0000000000000ec1
> > > 	[ 9718.915921][ T4545] RBP: 00007f1fb01c2f00 R08: 00007ffe1e345080 R09: 00000000011b1f78
> > > 	[ 9718.917236][ T4545] R10: 00000000011b1f78 R11: 0000000000000202 R12: 00007f201d3abf20
> > > 	[ 9718.918556][ T4545] R13: 00007f201d3abef0 R14: 00007f201d3abf50 R15: 00007f201d3abed0
> > > 	[ 9718.919882][ T4545] 
> > > 	[ 9718.920268][ T4545] Allocated by task 6732:
> > > 	[ 9718.920973][ T4545]  save_stack+0x21/0x50
> > > 	[ 9718.921648][ T4545]  __kasan_kmalloc.constprop.17+0xc1/0xd0
> > > 	[ 9718.922580][ T4545]  kasan_slab_alloc+0x12/0x20
> > > 	[ 9718.923345][ T4545]  kmem_cache_alloc_node+0x113/0x720
> > > 	[ 9718.924203][ T4545]  copy_process+0x357/0x3680
> > > 	[ 9718.924955][ T4545]  _do_fork+0xed/0x880
> > > 	[ 9718.925622][ T4545]  __do_sys_clone+0xee/0x130
> > > 	[ 9718.926369][ T4545]  __x64_sys_clone+0x67/0x80
> > > 	[ 9718.927119][ T4545]  do_syscall_64+0x60/0xf0
> > > 	[ 9718.927848][ T4545]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > > 	[ 9718.928812][ T4545] 
> > > 	[ 9718.929173][ T4545] Freed by task 24:
> > > 	[ 9718.929787][ T4545]  save_stack+0x21/0x50
> > > 	[ 9718.930453][ T4545]  __kasan_slab_free+0x118/0x170
> > > 	[ 9718.931242][ T4545]  kasan_slab_free+0xe/0x10
> > > 	[ 9718.931970][ T4545]  kmem_cache_free+0x5f/0x280
> > > 	[ 9718.932730][ T4545]  free_task+0x73/0x90
> > > 	[ 9718.933391][ T4545]  __put_task_struct+0x199/0x1d0
> > > 	[ 9718.934187][ T4545]  delayed_put_task_struct+0x124/0x1b0
> > > 	[ 9718.935071][ T4545]  rcu_core+0x3b0/0xeb0
> > > 	[ 9718.935758][ T4545]  rcu_core_si+0xe/0x10
> > > 	[ 9718.936433][ T4545]  __do_softirq+0x120/0x5e3
> > > 	[ 9718.937165][ T4545] 
> > > 	[ 9718.937545][ T4545] The buggy address belongs to the object at ffff888014e94000
> > > 	[ 9718.937545][ T4545]  which belongs to the cache task_struct(168:screen-wrapper.service) of size 11072
> > > 	[ 9718.940391][ T4545] The buggy address is located 44 bytes inside of
> > > 	[ 9718.940391][ T4545]  11072-byte region [ffff888014e94000, ffff888014e96b40)
> > > 	[ 9718.942559][ T4545] The buggy address belongs to the page:
> > > 	[ 9718.943454][ T4545] page:ffffea000053a500 refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff888014e97fff head:ffffea000053a500 order:2 compound_mapcount:0 compound_pincount:0
> > > 	[ 9718.946072][ T4545] flags: 0xfffe0000010200(slab|head)
> > > 	[ 9718.946958][ T4545] raw: 00fffe0000010200 ffffea00011ab108 ffffea0001d6f108 ffff8881eabd9700
> > > 	[ 9718.948406][ T4545] raw: ffff888014e97fff ffff888014e94000 0000000100000001 0000000000000000
> > > 	[ 9718.949889][ T4545] page dumped because: kasan: bad access detected
> > > 	[ 9718.950977][ T4545] 
> > > 	[ 9718.951354][ T4545] Memory state around the buggy address:
> > > 	[ 9718.952296][ T4545]  ffff888014e93f00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> > > 	[ 9718.953641][ T4545]  ffff888014e93f80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> > > 	[ 9718.955004][ T4545] >ffff888014e94000: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> > > 	[ 9718.956366][ T4545]                                   ^
> > > 	[ 9718.957258][ T4545]  ffff888014e94080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> > > 	[ 9718.958653][ T4545]  ffff888014e94100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> > > 	[ 9718.960034][ T4545] ==================================================================
> > > 
> > 
> 
> 
> 


* Re: BUG at fs/btrfs/relocation.c:794! Still happening on misc-next and 5.8.3
  2020-08-28  0:03       ` BUG at fs/btrfs/relocation.c:794! Still happening on misc-next and 5.8.3 Zygo Blaxell
@ 2020-08-28  0:08         ` Zygo Blaxell
  2020-08-28  6:34           ` Nikolay Borisov
  0 siblings, 1 reply; 13+ messages in thread
From: Zygo Blaxell @ 2020-08-28  0:08 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: David Sterba, linux-btrfs, wqu

On Thu, Aug 27, 2020 at 08:03:13PM -0400, Zygo Blaxell wrote:
> On Tue, Aug 04, 2020 at 12:16:26PM -0400, Zygo Blaxell wrote:
> > On Fri, Jul 24, 2020 at 08:19:36AM +0800, Qu Wenruo wrote:
> > > 
> > > 
> > > On 2020/7/24 上午5:56, Zygo Blaxell wrote:
> > > > On Wed, Jul 01, 2020 at 12:10:06AM +0200, David Sterba wrote:
> > > >> Hi,
> > > >>
> > > >> I've hit a crash in relocation I've never seen before.
> > > >>
> > > >> [ 2129.210066] kernel BUG at fs/btrfs/relocation.c:794!
> > > > 
> > > > I hit an issue yesterday that reminded me of this.
> > > > 
> > > >> [ 2129.215268] invalid opcode: 0000 [#1] PREEMPT SMP
> > > >> [ 2129.220114] CPU: 1 PID: 3303 Comm: btrfs Not tainted 5.8.0-rc3-git+ #638
> > > >> [ 2129.220116] Hardware name: empty empty/S3993, BIOS PAQEX0-3 02/24/2008
> > > >> [ 2129.220265] RIP: 0010:create_reloc_root+0x214/0x260 [btrfs]
> > > >> [ 2129.258760] RSP: 0018:ffffbe1e809b38b8 EFLAGS: 00010282
> > > >> [ 2129.258763] RAX: 00000000ffffffef RBX: ffff988d577f9000 RCX: 0000000000000000
> > > >> [ 2129.258765] RDX: 0000000000000001 RSI: ffffffff8e2a2580 RDI: ffff988d64aaa6a8
> > > >> [ 2129.258766] RBP: ffff988d5dfcdc00 R08: 0000000000000000 R09: 0000000000000000
> > > >> [ 2129.258767] R10: 0000000000000001 R11: 0000000000000000 R12: ffff988d0e02fa78
> > > >> [ 2129.258769] R13: 0000000000000005 R14: ffff988d64fe8000 R15: ffff988d0e02fa78
> > > >> [ 2129.258771] FS:  00007f82a612e8c0(0000) GS:ffff988d67000000(0000) knlGS:0000000000000000
> > > >> [ 2129.258772] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > >> [ 2129.258774] CR2: 000000000559d028 CR3: 000000020b289000 CR4: 00000000000006e0
> > > >> [ 2129.258775] Call Trace:
> > > >> [ 2129.258825]  btrfs_init_reloc_root+0xe8/0x120 [btrfs]
> > > >> [ 2129.258862]  record_root_in_trans+0xae/0xd0 [btrfs]
> > > >> [ 2129.258901]  btrfs_record_root_in_trans+0x51/0x70 [btrfs]
> > > >> [ 2129.340388]  select_reloc_root+0x94/0x340 [btrfs]
> > > >> [ 2129.340433]  do_relocation+0xda/0x7b0 [btrfs]
> > > >> [ 2129.349854]  ? _raw_spin_unlock+0x1f/0x40
> > > >> [ 2129.349898]  relocate_tree_blocks+0x336/0x670 [btrfs]
> > > >> [ 2129.359325]  relocate_block_group+0x2f6/0x600 [btrfs]
> > > >> [ 2129.359365]  btrfs_relocate_block_group+0x15e/0x340 [btrfs]
> > > >> [ 2129.359408]  btrfs_relocate_chunk+0x38/0x110 [btrfs]
> > > >> [ 2129.375494]  __btrfs_balance+0x42c/0xce0 [btrfs]
> > > >> [ 2129.375553]  btrfs_balance+0x66a/0xbe0 [btrfs]
> > > >> [ 2129.375562]  ? kmem_cache_alloc_trace+0x19c/0x330
> > > >> [ 2129.389852]  btrfs_ioctl_balance+0x298/0x350 [btrfs]
> > > >> [ 2129.389887]  btrfs_ioctl+0x304/0x2490 [btrfs]
> > > >> [ 2129.389898]  ? do_user_addr_fault+0x221/0x49c
> > > >> [ 2129.404070]  ? sched_clock_cpu+0x15/0x140
> > > >> [ 2129.404073]  ? do_user_addr_fault+0x221/0x49c
> > > >> [ 2129.404079]  ? up_read+0x18/0x240
> > > >> [ 2129.404086]  ? ksys_ioctl+0x68/0xa0
> > > >> [ 2129.404091]  ksys_ioctl+0x68/0xa0
> > > >> [ 2129.423308]  __x64_sys_ioctl+0x16/0x20
> > > >> [ 2129.423312]  do_syscall_64+0x50/0xe0
> > > >> [ 2129.423315]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > > >> [ 2129.423318] RIP: 0033:0x7f82a51c6327
> > > >> [ 2129.423319] Code: Bad RIP value.
> > > >> [ 2129.423348] RSP: 002b:00007ffd32cf6218 EFLAGS: 00000206 ORIG_RAX: 0000000000000010
> > > >> [ 2129.423367] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f82a51c6327
> > > >> [ 2129.423368] RDX: 00007ffd32cf62a0 RSI: 00000000c4009420 RDI: 0000000000000003
> > > >> [ 2129.423372] RBP: 0000000000000003 R08: 0000000000000000 R09: 0000000000000000
> > > >> [ 2129.423377] R10: 000000000fa99fa0 R11: 0000000000000206 R12: 00007ffd32cf8823
> > > >> [ 2129.423379] R13: 00007ffd32cf62a0 R14: 0000000000000001 R15: 0000000000000000
> > > >>
> > > >> Relevant code called from create_reloc_root:
> > > >>
> > > >>         ret = btrfs_insert_root(trans, fs_info->tree_root,
> > > >>                                 &root_key, root_item);
> > > >>         BUG_ON(ret)
> > > >>
> > > >> and according to EAX, ret is -17 which is EEXIST.
> > > >>
> > > >> I don't have a reproducer, the testing image has been filled by random git
> > > >> checkouts, deduplicated by BEES, then tons of snapshots created until the
> > > >> metadata got exhausted, some file deletion and balances.
> > > > 
> > > > Mine is rsync, bees, lots of snapshots, balances, scrubs.  I recently also
> > > > added random 'killall -INT btrfs' to send balance some fatal signals.
> > > > 
> > > >> This is the same image that led to the patch "btrfs: allow use of global block
> > > >> reserve for balance item deletion", so this could have left it in some
> > > >> intermediate state where the balance item was not removed and the reloc tree as
> > > >> well.
> > > >>
> > > >> There were a few unsuccessful mounts due to relocation recovery, which I was
> > > >> trying to debug, but then it started to work.
> > > >>
> > > >> The error happened with this 'fi df' saved after the balance start:
> > > >>
> > > >> # btrfs fi df mnt
> > > >> Data, single: total=80.01GiB, used=38.67GiB
> > > >> System, single: total=4.00MiB, used=16.00KiB
> > > >> Metadata, single: total=19.99GiB, used=19.46GiB
> > > >> GlobalReserve, single: total=512.00MiB, used=44.00KiB
> > > > 
> > > > Mine is:
> > > > 
> > > > 	Data, single: total=1.75TiB, used=1.74TiB
> > > > 	System, RAID1: total=32.00MiB, used=208.00KiB
> > > > 	Metadata, RAID1: total=25.00GiB, used=22.89GiB
> > > > 	GlobalReserve, single: total=512.00MiB, used=0.00B
> > > > 
> > > > though this is some time after the failure (and a reboot).  I do notice
> > > > that there's lots of unallocated space, but metadata usage is close
> > > > to allocated, and I have been experiencing a lot of EROFS events when
> > > > that happens, even if there's gigabytes unallocated.
> > > > 
> > > > btrfs fi us:
> > > > 
> > > > 	Overall:
> > > > 	    Device size:                   2.00TiB
> > > > 	    Device allocated:              1.80TiB
> > > > 	    Device unallocated:          208.94GiB
> > > > 	    Device missing:                  0.00B
> > > > 	    Used:                          1.79TiB
> > > > 	    Free (estimated):            211.30GiB      (min: 106.83GiB)
> > > > 	    Data ratio:                       1.00
> > > > 	    Metadata ratio:                   2.00
> > > > 	    Global reserve:              512.00MiB      (used: 0.00B)
> > > > 
> > > > 	Data,single: Size:1.75TiB, Used:1.74TiB (99.87%)
> > > > 	   /dev/mapper/vgtest-tvdb        894.00GiB
> > > > 	   /dev/mapper/vgtest-tvdc        895.00GiB
> > > > 
> > > > 	Metadata,RAID1: Size:25.00GiB, Used:22.87GiB (91.47%)
> > > > 	   /dev/mapper/vgtest-tvdb         25.00GiB
> > > > 	   /dev/mapper/vgtest-tvdc         25.00GiB
> > > > 
> > > > 	System,RAID1: Size:32.00MiB, Used:208.00KiB (0.63%)
> > > > 	   /dev/mapper/vgtest-tvdb         32.00MiB
> > > > 	   /dev/mapper/vgtest-tvdc         32.00MiB
> > > > 
> > > > 	Unallocated:
> > > > 	   /dev/mapper/vgtest-tvdb        104.97GiB
> > > > 	   /dev/mapper/vgtest-tvdc        103.97GiB
> > > > 
> > > >> The error looks like a repeated relocation tree creation, which would point to
> > > >> the unsuccessful balances or inconsistent state (balance item, reloc trees).
> > > >> It's not a "typical" mix of operations but I'd appreciate any insights here.
> > > > 
> > > > I have the same line but different call stack, with misc-next
> > > > e3027d10af42d24940be74dabaf1550cd770bd48:
> > > > 
> > > > 	[ 9717.746937][T13609] BTRFS info (device dm-0): balance: start -mlimit=1 -slimit=1
> > > > 	[ 9717.765086][T13609] BTRFS info (device dm-0): relocating block group 10991411658752 flags metadata|raid1
> > > > 	[ 9718.511137][T13609] ------------[ cut here ]------------
> > > > 	[ 9718.512293][T13609] kernel BUG at fs/btrfs/relocation.c:794!
> > > > 	[ 9718.513421][T13609] invalid opcode: 0000 [#1] SMP KASAN PTI
> > > > 	[ 9718.514590][T13609] CPU: 1 PID: 13609 Comm: btrfs Tainted: G        W         5.8.0-6582a95aabfe+ #44
> > > > 	[ 9718.516178][T13609] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
> > > > 	[ 9718.517750][T13609] RIP: 0010:create_reloc_root+0x468/0x480
> > > > 	[ 9718.518717][T13609] Code: e8 bd 5b bd ff 4d 8b 76 50 be 08 00 00 00 49 8d bc 24 f0 00 00 00 e8 c7 5b bd ff 4d 89 b4 24 f0 00 00 00 e9 ee fc ff ff 0f 0b <0f> 0b 0f 0b 0f 0b 0f 0b e8 9b df 07 01 66 66 2e 0f 1f 84 00 00 00
> > > > 	[ 9718.521995][T13609] RSP: 0018:ffffc900018e7018 EFLAGS: 00010282
> > > > 	[ 9718.522991][T13609] RAX: 00000000ffffffef RBX: ffff8881e103a400 RCX: 0000000000000000
> > > > 	[ 9718.524300][T13609] RDX: dffffc0000000000 RSI: 0000000000000000 RDI: 0000000000000246
> > > > 	[ 9718.525612][T13609] RBP: ffffc900018e7108 R08: 0000000000000000 R09: 0000000000000001
> > > > 	[ 9718.527056][T13609] R10: 0000000000000001 R11: fffffbfff3dfb081 R12: ffff8881f37c8020
> > > > 	[ 9718.528386][T13609] R13: ffff88801fbc5b28 R14: ffff8881f37c8000 R15: ffffc900018e70a0
> > > > 	[ 9718.529756][T13609] FS:  00007f9577d928c0(0000) GS:ffff8881f5800000(0000) knlGS:0000000000000000
> > > > 	[ 9718.531211][T13609] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > > 	[ 9718.532295][T13609] CR2: 00007f9823e35500 CR3: 00000000a52e0002 CR4: 00000000001606e0
> > > > 	[ 9718.533608][T13609] Call Trace:
> > > > 	[ 9718.534151][T13609]  ? update_backref_node+0xf0/0xf0
> > > > 	[ 9718.535137][T13609]  ? check_chain_key+0x1e6/0x2e0
> > > > 	[ 9718.536057][T13609]  btrfs_init_reloc_root+0x2d7/0x310
> > > 
> > > That's the same problem.
> > > 
> > > btrfs_init_reloc_root() got -EEXIST and triggered the BUG_ON().
> > > 
> > > In that case, that means there are some reloc trees not cleaned up.
> > > 
> > > Would you mind providing the "btrfs ins dump-tree -t root" dump for
> > > that fs if the problem still happens?
> > 
> > http://furryterror.org/~zblaxell/tmp/.tvdb/tvdb.txt
> > 
> > The problem is now happening multiple times per day, starting with
> > kdave's misc-next e3027d10af42d24940be74dabaf1550cd770bd48 Date:  Thu
> > Jul 23 00:18:04 2020 +0900 and continuing on v5.8.0.
> > 
> > The previous misc-next (that I have test data for),
> > cb799f0a0bb372f37f96893d2e80c1dc2f5206da Date: Thu Jul 16 13:29:46 2020
> > -0700 does not have this problem.
> > 
> > These commit hashes are from https://gitlab.com/kdave/btrfs-devel.
> 
> Still hitting this bug every few hours on all 5.8.x so far, and misc-next.
> 
> There is a strong correlation between hitting the bug and starting a metadata
> block group in balance, and a weaker correlation with data balances:
> 
> 	Aug 23 05:04:05 regress kernel: [53458.128928][ T9737] BTRFS info (device dm-0): relocating block group 14939862335488 flags metadata|raid1
> 	Aug 23 05:04:05 regress kernel: [53458.999342][ T9737] ------------[ cut here ]------------
> 	Aug 23 05:04:05 regress kernel: [53459.000275][ T9737] kernel BUG at fs/btrfs/relocation.c:794!
> 
> 	Aug 24 01:23:52 regress kernel: [58662.545914][T17474] BTRFS info (device dm-0): relocating block group 15083978620928 flags metadata|raid1
> 	Aug 24 01:23:54 regress kernel: [58664.778274][T17474] ------------[ cut here ]------------
> 	Aug 24 01:23:54 regress kernel: [58664.782182][T17474] kernel BUG at fs/btrfs/relocation.c:794!
> 
> 	Aug 24 07:17:07 regress kernel: [21068.421134][T29457] BTRFS info (device dm-0): relocating block group 15160784715776 flags metadata|raid1
> 	Aug 24 07:17:08 regress kernel: [21069.307661][ T5176] ------------[ cut here ]------------
> 	Aug 24 07:17:08 regress kernel: [21069.309195][ T5176] kernel BUG at fs/btrfs/relocation.c:794!
> 
> 	Aug 25 18:58:26 regress kernel: [22013.457555][ T2164] BTRFS info (device dm-0): relocating block group 15530051239936 flags metadata|raid1
> 	Aug 25 18:58:27 regress kernel: [22014.460689][ T4939] ------------[ cut here ]------------
> 	Aug 25 18:58:27 regress kernel: [22014.461653][ T4939] kernel BUG at fs/btrfs/relocation.c:794!
> 
> 	Aug 26 03:39:20 regress kernel: [31172.016638][T30882] BTRFS info (device dm-0): relocating block group 15576759009280 flags metadata|raid1
> 	Aug 26 03:39:21 regress kernel: [31173.329719][T12663] ------------[ cut here ]------------
> 	Aug 26 03:39:21 regress kernel: [31173.330682][T12663] kernel BUG at fs/btrfs/relocation.c:794!
> 
> 	Aug 26 16:00:02 regress kernel: [44334.231395][T25917] BTRFS info (device dm-0): relocating block group 15631888941056 flags data
> 	Aug 26 16:00:04 regress kernel: [44336.800710][T26519] ------------[ cut here ]------------
> 	Aug 26 16:00:04 regress kernel: [44336.802888][T26519] kernel BUG at fs/btrfs/relocation.c:794!
> 
> 	Aug 27 15:45:29 regress kernel: [55423.626717][ T5878] BTRFS info (device dm-0): relocating block group 15820229967872 flags metadata|raid1
> 	Aug 27 15:45:29 regress kernel: [55423.798584][T15744] ------------[ cut here ]------------
> 	Aug 27 15:45:29 regress kernel: [55423.802581][T15744] kernel BUG at fs/btrfs/relocation.c:794!
> 
> 	Aug 27 17:35:26 regress kernel: [ 6459.129124][T21053] BTRFS info (device dm-0): relocating block group 15831168712704 flags metadata|raid1
> 	Aug 27 17:35:43 regress kernel: [ 6475.931029][T25720] ------------[ cut here ]------------
> 	Aug 27 17:35:43 regress kernel: [ 6475.932403][T25720] kernel BUG at fs/btrfs/relocation.c:794!
> 
> There don't seem to be any instances of the BUG that did not occur
> within 30 seconds of starting a balance.
> 
> The on-disk data is fine.  After a reboot the same block group can be
> successfully balanced.

Forgot to mention the failure rate:  8 crashes (listed above) among 1492
block groups balanced over the same 4-day period.
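
For reference, the numbers above can be tallied with a short script.  This
is only a sketch: the timestamps are transcribed by hand from the eight log
excerpts quoted above, and the names EVENTS, BLOCK_GROUPS_BALANCED and
summarize are made up for illustration.

```python
# (balance start, BUG) kernel timestamps in seconds, plus block group flags,
# transcribed by hand from the eight log excerpts quoted above.
EVENTS = [
    (53458.128928, 53458.999342, "metadata|raid1"),
    (58662.545914, 58664.778274, "metadata|raid1"),
    (21068.421134, 21069.307661, "metadata|raid1"),
    (22013.457555, 22014.460689, "metadata|raid1"),
    (31172.016638, 31173.329719, "metadata|raid1"),
    (44334.231395, 44336.800710, "data"),
    (55423.626717, 55423.798584, "metadata|raid1"),
    (6459.129124, 6475.931029, "metadata|raid1"),
]

# Total block groups balanced over the same period, as stated above.
BLOCK_GROUPS_BALANCED = 1492


def summarize(events, total):
    """Tally crash count, crash rate, block group type and start-to-BUG delay."""
    delays = [bug - start for start, bug, _ in events]
    metadata = sum(1 for _, _, flags in events if flags.startswith("metadata"))
    return {
        "crashes": len(events),
        "rate_percent": 100.0 * len(events) / total,
        "metadata": metadata,
        "data": len(events) - metadata,
        "max_delay_s": max(delays),
    }


summary = summarize(EVENTS, BLOCK_GROUPS_BALANCED)
print(summary)
```

With these inputs the rate comes out to roughly 0.54% of block groups
(8 of 1492), 7 of the 8 crashes follow a metadata block group, and the
longest gap between the "relocating block group" line and the BUG is
under 17 seconds.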

> > 
> > > Thanks,
> > > Qu
> > > > 	[ 9718.537016][T13609]  ? find_reloc_root+0x200/0x200
> > > > 	[ 9718.537992][T13609]  ? do_raw_spin_unlock+0xa8/0x140
> > > > 	[ 9718.538899][T13609]  record_root_in_trans+0x18c/0x1d0
> > > > 	[ 9718.539848][T13609]  btrfs_record_root_in_trans+0x8b/0xc0
> > > > 	[ 9718.540843][T13609]  select_reloc_root+0x15f/0x6a0
> > > > 	[ 9718.541943][T13609]  ? create_reloc_inode.isra.28+0x410/0x410
> > > > 	[ 9718.543066][T13609]  ? rcu_read_lock_sched_held+0xa1/0xd0
> > > > 	[ 9718.544333][T13609]  ? check_flags.part.44+0x86/0x220
> > > > 	[ 9718.545186][T13609]  ? check_flags+0x26/0x30
> > > > 	[ 9718.545870][T13609]  ? lock_is_held_type+0xc9/0x100
> > > > 	[ 9718.546651][T13609]  do_relocation+0x242/0xc90
> > > > 	[ 9718.547372][T13609]  ? select_reloc_root+0x6a0/0x6a0
> > > > 	[ 9718.548160][T13609]  ? check_flags.part.44+0x86/0x220
> > > > 	[ 9718.548969][T13609]  ? __kasan_check_read+0x11/0x20
> > > > 	[ 9718.549745][T13609]  ? mark_lock+0xa8/0x440
> > > > 	[ 9718.550426][T13609]  ? mark_held_locks+0x8d/0xb0
> > > > 	[ 9718.551165][T13609]  ? btrfs_backref_cleanup_node+0x5c1/0x600
> > > > 	[ 9718.552079][T13609]  ? memcpy+0x4d/0x60
> > > > 	[ 9718.552694][T13609]  ? read_extent_buffer+0xcc/0x120
> > > > 	[ 9718.553478][T13609]  relocate_tree_blocks+0xa29/0xb00
> > > > 	[ 9718.554255][T13609]  ? do_relocation+0xc90/0xc90
> > > > 	[ 9718.554978][T13609]  ? kmem_cache_alloc_trace+0x5af/0x740
> > > > 	[ 9718.555855][T13609]  ? free_extent_buffer.part.46+0x90/0x140
> > > > 	[ 9718.556756][T13609]  ? rb_insert_color+0x342/0x360
> > > > 	[ 9718.557581][T13609]  ? free_extent_buffer+0x13/0x20
> > > > 	[ 9718.558445][T13609]  ? add_tree_block.isra.34+0x236/0x2b0
> > > > 	[ 9718.559387][T13609]  relocate_block_group+0x52e/0x830
> > > > 	[ 9718.560275][T13609]  ? merge_reloc_roots+0x4b0/0x4b0
> > > > 	[ 9718.561137][T13609]  btrfs_relocate_block_group+0x26e/0x4c0
> > > > 	[ 9718.562137][T13609]  btrfs_relocate_chunk+0x52/0x120
> > > > 	[ 9718.562918][T13609]  btrfs_balance+0xe22/0x1910
> > > > 	[ 9718.563605][T13609]  ? check_chain_key+0x1e6/0x2e0
> > > > 	[ 9718.564331][T13609]  ? btrfs_relocate_chunk+0x120/0x120
> > > > 	[ 9718.565126][T13609]  ? kmem_cache_alloc_trace+0x5af/0x740
> > > > 	[ 9718.565943][T13609]  ? _copy_from_user+0x95/0xd0
> > > > 	[ 9718.566649][T13609]  btrfs_ioctl_balance+0x3de/0x4c0
> > > > 	[ 9718.567414][T13609]  btrfs_ioctl+0x2385/0x4250
> > > > 	[ 9718.568090][T13609]  ? __kasan_check_read+0x11/0x20
> > > > 	[ 9718.568830][T13609]  ? check_chain_key+0x1e6/0x2e0
> > > > 	[ 9718.569619][T13609]  ? btrfs_ioctl_get_supported_features+0x30/0x30
> > > > 	[ 9718.570658][T13609]  ? kvm_sched_clock_read+0x18/0x30
> > > > 	[ 9718.571526][T13609]  ? check_chain_key+0x1e6/0x2e0
> > > > 	[ 9718.572348][T13609]  ? lock_downgrade+0x3e0/0x3e0
> > > > 	[ 9718.573121][T13609]  ? do_vfs_ioctl+0xfc/0x9d0
> > > > 	[ 9718.573835][T13609]  ? ioctl_file_clone+0xe0/0xe0
> > > > 	[ 9718.574637][T13609]  ? check_flags.part.44+0x86/0x220
> > > > 	[ 9718.575472][T13609]  ? check_flags+0x26/0x30
> > > > 	[ 9718.576190][T13609]  ? lock_is_held_type+0xc9/0x100
> > > > 	[ 9718.576990][T13609]  ? check_flags.part.44+0x86/0x220
> > > > 	[ 9718.577836][T13609]  ? check_flags+0x26/0x30
> > > > 	[ 9718.578542][T13609]  ? lock_is_held_type+0xc9/0x100
> > > > 	[ 9718.579403][T13609]  ? __kasan_check_read+0x11/0x20
> > > > 	[ 9718.580225][T13609]  ? __fget_light+0xae/0x110
> > > > 	[ 9718.580983][T13609]  ksys_ioctl+0xa1/0xe0
> > > > 	[ 9718.581628][T13609]  __x64_sys_ioctl+0x43/0x50
> > > > 	[ 9718.582334][T13609]  do_syscall_64+0x60/0xf0
> > > > 	[ 9718.583285][T13609]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > > > 	[ 9718.584378][T13609] RIP: 0033:0x7f9577e85427
> > > > 	[ 9718.585289][T13609] Code: Bad RIP value.
> > > > 	[ 9718.586076][T13609] RSP: 002b:00007ffdc7b82548 EFLAGS: 00000206 ORIG_RAX: 0000000000000010
> > > > 	[ 9718.587896][T13609] RAX: ffffffffffffffda RBX: 00007ffdc7b825e8 RCX: 00007f9577e85427
> > > > 	[ 9718.589391][T13609] RDX: 00007ffdc7b825e8 RSI: 00000000c4009420 RDI: 0000000000000003
> > > > 	[ 9718.590817][T13609] RBP: 0000000000000003 R08: 0000000000000003 R09: 0000000000000078
> > > > 	[ 9718.592631][T13609] R10: fffffffffffff31c R11: 0000000000000206 R12: 0000000000000001
> > > > 	[ 9718.594405][T13609] R13: 0000000000000000 R14: 00007ffdc7b84a48 R15: 0000000000000001
> > > > 	[ 9718.596109][T13609] Modules linked in:
> > > > 	[ 9718.597056][T13609] ---[ end trace 2cf173f8217fc093 ]---
> > > > 	[ 9718.598018][T13609] RIP: 0010:create_reloc_root+0x468/0x480
> > > > 	[ 9718.602850][T13609] Code: e8 bd 5b bd ff 4d 8b 76 50 be 08 00 00 00 49 8d bc 24 f0 00 00 00 e8 c7 5b bd ff 4d 89 b4 24 f0 00 00 00 e9 ee fc ff ff 0f 0b <0f> 0b 0f 0b 0f 0b 0f 0b e8 9b df 07 01 66 66 2e 0f 1f 84 00 00 00
> > > > 	[ 9718.613371][T13609] RSP: 0018:ffffc900018e7018 EFLAGS: 00010282
> > > > 	[ 9718.621286][T13609] RAX: 00000000ffffffef RBX: ffff8881e103a400 RCX: 0000000000000000
> > > > 	[ 9718.631255][T13609] RDX: dffffc0000000000 RSI: 0000000000000000 RDI: 0000000000000246
> > > > 	[ 9718.639764][T13609] RBP: ffffc900018e7108 R08: 0000000000000000 R09: 0000000000000001
> > > > 	[ 9718.641533][T13609] R10: 0000000000000001 R11: fffffbfff3dfb081 R12: ffff8881f37c8020
> > > > 	[ 9718.643173][T13609] R13: ffff88801fbc5b28 R14: ffff8881f37c8000 R15: ffffc900018e70a0
> > > > 	[ 9718.644840][T13609] FS:  00007f9577d928c0(0000) GS:ffff8881f5800000(0000) knlGS:0000000000000000
> > > > 	[ 9718.646728][T13609] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > > 	[ 9718.648607][T13609] CR2: 00007f9823e35500 CR3: 00000000a52e0002 CR4: 00000000001606e0
> > > > 	[ 9718.869689][ T4545] ==================================================================
> > > > 
> > > > same line, different call stack:
> > > > 
> > > > 	0xffffffff81933dd8 is in create_reloc_root (fs/btrfs/relocation.c:794).
> > > > 	789             btrfs_tree_unlock(eb);
> > > > 	790             free_extent_buffer(eb);
> > > > 	791
> > > > 	792             ret = btrfs_insert_root(trans, fs_info->tree_root,
> > > > 	793                                     &root_key, root_item);
> > > > 	794             BUG_ON(ret);
> > > > 	795             kfree(root_item);
> > > > 	796
> > > > 	797             reloc_root = btrfs_read_tree_root(fs_info->tree_root, &root_key);
> > > > 	798             BUG_ON(IS_ERR(reloc_root));
> > > > 
> > > > followed by
> > > > 
> > > > 	[ 9718.869689][ T4545] ==================================================================
> > > > 	[ 9718.871333][ T4545] BUG: KASAN: use-after-free in __mutex_lock+0x202/0xce0
> > > > 	[ 9718.872483][ T4545] Read of size 4 at addr ffff888014e9402c by task crawl_28443/4545
> > > > 	[ 9718.873746][ T4545] 
> > > > 	[ 9718.874106][ T4545] CPU: 1 PID: 4545 Comm: crawl_28443 Tainted: G      D W         5.8.0-6582a95aabfe+ #44
> > > > 	[ 9718.875684][ T4545] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
> > > > 	[ 9718.877149][ T4545] Call Trace:
> > > > 	[ 9718.877655][ T4545]  dump_stack+0xc8/0x11a
> > > > 	[ 9718.878317][ T4545]  ? __mutex_lock+0x202/0xce0
> > > > 	[ 9718.879065][ T4545]  print_address_description.constprop.8+0x1f/0x200
> > > > 	[ 9718.880167][ T4545]  ? __mutex_lock+0x202/0xce0
> > > > 	[ 9718.880916][ T4545]  ? __mutex_lock+0x202/0xce0
> > > > 	[ 9718.881666][ T4545]  kasan_report.cold.11+0x20/0x3e
> > > > 	[ 9718.882483][ T4545]  ? __mutex_lock+0x202/0xce0
> > > > 	[ 9718.883229][ T4545]  __asan_load4+0x69/0x90
> > > > 	[ 9718.883920][ T4545]  __mutex_lock+0x202/0xce0
> > > > 	[ 9718.884651][ T4545]  ? wait_current_trans+0xb7/0x230
> > > > 	[ 9718.885465][ T4545]  ? btrfs_record_root_in_trans+0x7e/0xc0
> > > > 	[ 9718.886388][ T4545]  ? mutex_lock_io_nested+0xc20/0xc20
> > > > 	[ 9718.887246][ T4545]  ? __kasan_check_read+0x11/0x20
> > > > 	[ 9718.888035][ T4545]  ? join_transaction+0x32/0x6f0
> > > > 	[ 9718.888854][ T4545]  ? join_transaction+0x1a6/0x6f0
> > > > 	[ 9718.889679][ T4545]  ? lock_downgrade+0x3e0/0x3e0
> > > > 	[ 9718.890496][ T4545]  ? __kasan_check_write+0x14/0x20
> > > > 	[ 9718.891308][ T4545]  ? lock_contended+0x720/0x720
> > > > 	[ 9718.892093][ T4545]  ? do_raw_spin_lock+0x1e0/0x1e0
> > > > 	[ 9718.892912][ T4545]  ? wait_current_trans+0xb7/0x230
> > > > 	[ 9718.893705][ T4545]  mutex_lock_nested+0x1b/0x20
> > > > 	[ 9718.894494][ T4545]  ? mutex_lock_nested+0x1b/0x20
> > > > 	[ 9718.895317][ T4545]  btrfs_record_root_in_trans+0x7e/0xc0
> > > > 	[ 9718.896245][ T4545]  start_transaction+0x189/0x8f0
> > > > 	[ 9718.897081][ T4545]  btrfs_start_transaction+0x1e/0x20
> > > > 	[ 9718.897941][ T4545]  btrfs_cont_expand+0x549/0x7a0
> > > > 	[ 9718.898805][ T4545]  ? btrfs_truncate_block+0x930/0x930
> > > > 	[ 9718.899665][ T4545]  ? inode_newsize_ok+0x75/0xc0
> > > > 	[ 9718.900438][ T4545]  ? setattr_prepare+0x9c/0x310
> > > > 	[ 9718.901242][ T4545]  btrfs_setattr+0x514/0x850
> > > > 	[ 9718.902035][ T4545]  ? current_time+0x8c/0xe0
> > > > 	[ 9718.902799][ T4545]  notify_change+0x4ec/0x700
> > > > 	[ 9718.903584][ T4545]  ? do_sys_ftruncate+0x108/0x220
> > > > 	[ 9718.904459][ T4545]  do_truncate+0xe4/0x160
> > > > 	[ 9718.905200][ T4545]  ? __x64_sys_openat2+0x170/0x170
> > > > 	[ 9718.906116][ T4545]  ? __sb_start_write+0x1a1/0x270
> > > > 	[ 9718.906954][ T4545]  do_sys_ftruncate+0x1b8/0x220
> > > > 	[ 9718.907759][ T4545]  __x64_sys_ftruncate+0x36/0x40
> > > > 	[ 9718.908577][ T4545]  do_syscall_64+0x60/0xf0
> > > > 	[ 9718.909292][ T4545]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > > > 	[ 9718.910521][ T4545] RIP: 0033:0x7f201fcab947
> > > > 	[ 9718.911247][ T4545] Code: Bad RIP value.
> > > > 	[ 9718.911915][ T4545] RSP: 002b:00007f201d3abeb8 EFLAGS: 00000202 ORIG_RAX: 000000000000004d
> > > > 	[ 9718.913285][ T4545] RAX: ffffffffffffffda RBX: 00007f201d3abfa0 RCX: 00007f201fcab947
> > > > 	[ 9718.914613][ T4545] RDX: 000000005f18a6d2 RSI: 0000000000286000 RDI: 0000000000000ec1
> > > > 	[ 9718.915921][ T4545] RBP: 00007f1fb01c2f00 R08: 00007ffe1e345080 R09: 00000000011b1f78
> > > > 	[ 9718.917236][ T4545] R10: 00000000011b1f78 R11: 0000000000000202 R12: 00007f201d3abf20
> > > > 	[ 9718.918556][ T4545] R13: 00007f201d3abef0 R14: 00007f201d3abf50 R15: 00007f201d3abed0
> > > > 	[ 9718.919882][ T4545] 
> > > > 	[ 9718.920268][ T4545] Allocated by task 6732:
> > > > 	[ 9718.920973][ T4545]  save_stack+0x21/0x50
> > > > 	[ 9718.921648][ T4545]  __kasan_kmalloc.constprop.17+0xc1/0xd0
> > > > 	[ 9718.922580][ T4545]  kasan_slab_alloc+0x12/0x20
> > > > 	[ 9718.923345][ T4545]  kmem_cache_alloc_node+0x113/0x720
> > > > 	[ 9718.924203][ T4545]  copy_process+0x357/0x3680
> > > > 	[ 9718.924955][ T4545]  _do_fork+0xed/0x880
> > > > 	[ 9718.925622][ T4545]  __do_sys_clone+0xee/0x130
> > > > 	[ 9718.926369][ T4545]  __x64_sys_clone+0x67/0x80
> > > > 	[ 9718.927119][ T4545]  do_syscall_64+0x60/0xf0
> > > > 	[ 9718.927848][ T4545]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > > > 	[ 9718.928812][ T4545] 
> > > > 	[ 9718.929173][ T4545] Freed by task 24:
> > > > 	[ 9718.929787][ T4545]  save_stack+0x21/0x50
> > > > 	[ 9718.930453][ T4545]  __kasan_slab_free+0x118/0x170
> > > > 	[ 9718.931242][ T4545]  kasan_slab_free+0xe/0x10
> > > > 	[ 9718.931970][ T4545]  kmem_cache_free+0x5f/0x280
> > > > 	[ 9718.932730][ T4545]  free_task+0x73/0x90
> > > > 	[ 9718.933391][ T4545]  __put_task_struct+0x199/0x1d0
> > > > 	[ 9718.934187][ T4545]  delayed_put_task_struct+0x124/0x1b0
> > > > 	[ 9718.935071][ T4545]  rcu_core+0x3b0/0xeb0
> > > > 	[ 9718.935758][ T4545]  rcu_core_si+0xe/0x10
> > > > 	[ 9718.936433][ T4545]  __do_softirq+0x120/0x5e3
> > > > 	[ 9718.937165][ T4545] 
> > > > 	[ 9718.937545][ T4545] The buggy address belongs to the object at ffff888014e94000
> > > > 	[ 9718.937545][ T4545]  which belongs to the cache task_struct(168:screen-wrapper.service) of size 11072
> > > > 	[ 9718.940391][ T4545] The buggy address is located 44 bytes inside of
> > > > 	[ 9718.940391][ T4545]  11072-byte region [ffff888014e94000, ffff888014e96b40)
> > > > 	[ 9718.942559][ T4545] The buggy address belongs to the page:
> > > > 	[ 9718.943454][ T4545] page:ffffea000053a500 refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff888014e97fff head:ffffea000053a500 order:2 compound_mapcount:0 compound_pincount:0
> > > > 	[ 9718.946072][ T4545] flags: 0xfffe0000010200(slab|head)
> > > > 	[ 9718.946958][ T4545] raw: 00fffe0000010200 ffffea00011ab108 ffffea0001d6f108 ffff8881eabd9700
> > > > 	[ 9718.948406][ T4545] raw: ffff888014e97fff ffff888014e94000 0000000100000001 0000000000000000
> > > > 	[ 9718.949889][ T4545] page dumped because: kasan: bad access detected
> > > > 	[ 9718.950977][ T4545] 
> > > > 	[ 9718.951354][ T4545] Memory state around the buggy address:
> > > > 	[ 9718.952296][ T4545]  ffff888014e93f00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> > > > 	[ 9718.953641][ T4545]  ffff888014e93f80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> > > > 	[ 9718.955004][ T4545] >ffff888014e94000: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> > > > 	[ 9718.956366][ T4545]                                   ^
> > > > 	[ 9718.957258][ T4545]  ffff888014e94080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> > > > 	[ 9718.958653][ T4545]  ffff888014e94100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> > > > 	[ 9718.960034][ T4545] ==================================================================
> > > > 
> > > 
> > 
> > 
> > 


* Re: BUG at fs/btrfs/relocation.c:794! Still happening on misc-next and 5.8.3
  2020-08-28  0:08         ` Zygo Blaxell
@ 2020-08-28  6:34           ` Nikolay Borisov
  2020-08-28 20:42             ` Zygo Blaxell
  0 siblings, 1 reply; 13+ messages in thread
From: Nikolay Borisov @ 2020-08-28  6:34 UTC (permalink / raw)
  To: Zygo Blaxell, Qu Wenruo; +Cc: David Sterba, linux-btrfs, wqu



On 28.08.20 at 3:08, Zygo Blaxell wrote:
> On Thu, Aug 27, 2020 at 08:03:13PM -0400, Zygo Blaxell wrote:
>> On Tue, Aug 04, 2020 at 12:16:26PM -0400, Zygo Blaxell wrote:

<snip>

>>
>> 	Aug 23 05:04:05 regress kernel: [53458.128928][ T9737] BTRFS info (device dm-0): relocating block group 14939862335488 flags metadata|raid1
>> 	Aug 23 05:04:05 regress kernel: [53458.999342][ T9737] ------------[ cut here ]------------
>> 	Aug 23 05:04:05 regress kernel: [53459.000275][ T9737] kernel BUG at fs/btrfs/relocation.c:794!
>>
>> 	Aug 24 01:23:52 regress kernel: [58662.545914][T17474] BTRFS info (device dm-0): relocating block group 15083978620928 flags metadata|raid1
>> 	Aug 24 01:23:54 regress kernel: [58664.778274][T17474] ------------[ cut here ]------------
>> 	Aug 24 01:23:54 regress kernel: [58664.782182][T17474] kernel BUG at fs/btrfs/relocation.c:794!
>>
>> 	Aug 24 07:17:07 regress kernel: [21068.421134][T29457] BTRFS info (device dm-0): relocating block group 15160784715776 flags metadata|raid1
>> 	Aug 24 07:17:08 regress kernel: [21069.307661][ T5176] ------------[ cut here ]------------
>> 	Aug 24 07:17:08 regress kernel: [21069.309195][ T5176] kernel BUG at fs/btrfs/relocation.c:794!
>>
>> 	Aug 25 18:58:26 regress kernel: [22013.457555][ T2164] BTRFS info (device dm-0): relocating block group 15530051239936 flags metadata|raid1
>> 	Aug 25 18:58:27 regress kernel: [22014.460689][ T4939] ------------[ cut here ]------------
>> 	Aug 25 18:58:27 regress kernel: [22014.461653][ T4939] kernel BUG at fs/btrfs/relocation.c:794!
>>
>> 	Aug 26 03:39:20 regress kernel: [31172.016638][T30882] BTRFS info (device dm-0): relocating block group 15576759009280 flags metadata|raid1
>> 	Aug 26 03:39:21 regress kernel: [31173.329719][T12663] ------------[ cut here ]------------
>> 	Aug 26 03:39:21 regress kernel: [31173.330682][T12663] kernel BUG at fs/btrfs/relocation.c:794!
>>
>> 	Aug 26 16:00:02 regress kernel: [44334.231395][T25917] BTRFS info (device dm-0): relocating block group 15631888941056 flags data
>> 	Aug 26 16:00:04 regress kernel: [44336.800710][T26519] ------------[ cut here ]------------
>> 	Aug 26 16:00:04 regress kernel: [44336.802888][T26519] kernel BUG at fs/btrfs/relocation.c:794!
>>
>> 	Aug 27 15:45:29 regress kernel: [55423.626717][ T5878] BTRFS info (device dm-0): relocating block group 15820229967872 flags metadata|raid1
>> 	Aug 27 15:45:29 regress kernel: [55423.798584][T15744] ------------[ cut here ]------------
>> 	Aug 27 15:45:29 regress kernel: [55423.802581][T15744] kernel BUG at fs/btrfs/relocation.c:794!
>>
>> 	Aug 27 17:35:26 regress kernel: [ 6459.129124][T21053] BTRFS info (device dm-0): relocating block group 15831168712704 flags metadata|raid1
>> 	Aug 27 17:35:43 regress kernel: [ 6475.931029][T25720] ------------[ cut here ]------------
>> 	Aug 27 17:35:43 regress kernel: [ 6475.932403][T25720] kernel BUG at fs/btrfs/relocation.c:794!
>>
>> There don't seem to be any instances of the BUG that did not occur
>> within 30 seconds of starting a balance.
>>
>> The on-disk data is fine.  After a reboot the same block group can be
>> successfully balanced.
> 
> Forgot to mention the failure rate:  8 crashes (listed above) among 1492
> block groups balanced over the same 4-day period.

Since you can reproduce this reliably, could you modify the code in
create_reloc_root so that it prints the returned error value? I'd
speculate it's EEXIST from

btrfs_insert_root
  btrfs_insert_item
   btrfs_insert_empty_item
     btrfs_insert_empty_items
       btrfs_search_slot

But better be sure.
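
As a cross-check that needs no kernel changes, the RAX value in the oopses
already posted in this thread can be decoded in userspace.  This is a
minimal sketch, assuming RAX still holds the return value of
btrfs_insert_root() at the BUG_ON (which is how the original report reads
it); reg_to_int32 is a hypothetical helper name, not something from the
kernel tree.

```python
import errno


def reg_to_int32(value):
    """Interpret the low 32 bits of a register dump as a signed C int."""
    low = value & 0xFFFFFFFF
    return low - (1 << 32) if low & 0x80000000 else low


# RAX as printed in the register dumps earlier in the thread.
rax = 0x00000000FFFFFFEF

ret = reg_to_int32(rax)
# errno.errorcode maps positive errno numbers to their symbolic names.
name = errno.errorcode.get(-ret, "unknown")
print(f"ret = {ret} (-{name})")
```

This only confirms the value at the BUG_ON; it does not show which
function in the call chain produced it, so the instrumentation is still
worth doing.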

> 


* Re: BUG at fs/btrfs/relocation.c:794! Still happening on misc-next and 5.8.3
  2020-08-28  6:34           ` Nikolay Borisov
@ 2020-08-28 20:42             ` Zygo Blaxell
  2020-09-01 22:53               ` Zygo Blaxell
  0 siblings, 1 reply; 13+ messages in thread
From: Zygo Blaxell @ 2020-08-28 20:42 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: Qu Wenruo, David Sterba, linux-btrfs, wqu

On Fri, Aug 28, 2020 at 09:34:31AM +0300, Nikolay Borisov wrote:
> On 28.08.20 at 3:08, Zygo Blaxell wrote:
> > On Thu, Aug 27, 2020 at 08:03:13PM -0400, Zygo Blaxell wrote:
> >> On Tue, Aug 04, 2020 at 12:16:26PM -0400, Zygo Blaxell wrote:
> 
> <snip>
> 
> Since you can reproduce this reliably, could you modify the code in
> create_reloc_root so that it prints the returned error value? I'd
> speculate it's EEXIST from
> 
> btrfs_insert_root
>   btrfs_insert_item
>    btrfs_insert_empty_item
>      btrfs_insert_empty_items
>        btrfs_search_slot
> 
> But better be sure.

Here you go, EEXIST == 17:

	Aug 28 15:30:55 regress kernel: [18452.845182][T31546] BTRFS info (device dm-0): balance: start -dlimit=9
	Aug 28 15:30:55 regress kernel: [18452.874627][T31546] BTRFS info (device dm-0): relocating block group 15873413742592 flags data
	Aug 28 15:30:55 regress kernel: [18453.097516][ T2100] btrfs_search_slot ret = 0
	Aug 28 15:30:55 regress kernel: [18453.104865][ T2100] btrfs_search_slot ret = 0
	Aug 28 15:30:55 regress kernel: [18453.109751][ T2100] btrfs_search_slot ret = 0
	Aug 28 15:30:56 regress kernel: [18454.453135][ T2100] btrfs_search_slot ret = 0
	Aug 28 15:30:56 regress kernel: [18454.453955][ T2100] btrfs_insert_empty_item ret = -17
	Aug 28 15:30:56 regress kernel: [18454.455022][ T2100] btrfs_insert_root ret = -17
	Aug 28 15:30:56 regress kernel: [18454.456229][ T2100] ------------[ cut here ]------------
	Aug 28 15:30:56 regress kernel: [18454.457622][ T2100] kernel BUG at fs/btrfs/relocation.c:795!
	Aug 28 15:30:56 regress kernel: [18454.459006][ T2100] invalid opcode: 0000 [#1] SMP KASAN PTI
	Aug 28 15:30:56 regress kernel: [18454.460356][ T2100] CPU: 2 PID: 2100 Comm: rsync Tainted: G        W         5.8.5-8de74804e45b+ #6
	Aug 28 15:30:57 regress kernel: [18454.462324][ T2100] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
	Aug 28 15:30:57 regress kernel: [18454.464289][ T2100] RIP: 0010:create_reloc_root+0x47a/0x490
	Aug 28 15:30:57 regress kernel: [18454.465507][ T2100] Code: e8 5b 3b bd ff 4d 8b 76 50 be 08 00 00 00 49 8d bc 24 f0 00 00 00 e8 65 3b bd ff 4d 89 b4 24 f0 00 00 00 e9 dc fc ff ff 0f 0b <0f> 0b 0f 0b 0f 0b 0f 0b e8 b9 90 07 01 66 0f 1f 84 00 00 00 00 00
	Aug 28 15:30:57 regress kernel: [18454.468861][ T2100] RSP: 0018:ffffc90000c777d0 EFLAGS: 00010282
	Aug 28 15:30:57 regress kernel: [18454.469787][ T2100] RAX: 000000000000001b RBX: ffff88817cbc9400 RCX: ffffffffa5273b42
	Aug 28 15:30:57 regress kernel: [18454.471005][ T2100] RDX: 0000000000000000 RSI: 0000000000000008 RDI: ffff8881f5dff28c
	Aug 28 15:30:57 regress kernel: [18454.472278][ T2100] RBP: ffffc90000c778c0 R08: ffffed103ebc1645 R09: ffffed103ebc1645
	Aug 28 15:30:57 regress kernel: [18454.473547][ T2100] R10: ffff8881f5e0b227 R11: ffffed103ebc1644 R12: ffff8881cb710020
	Aug 28 15:30:57 regress kernel: [18454.474949][ T2100] R13: ffff888118800a80 R14: 00000000ffffffef R15: ffffc90000c77858
	Aug 28 15:30:57 regress kernel: [18454.476224][ T2100] FS:  00007f1b8f7d9b80(0000) GS:ffff8881f5c00000(0000) knlGS:0000000000000000
	Aug 28 15:30:57 regress kernel: [18454.477635][ T2100] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
	Aug 28 15:30:57 regress kernel: [18454.478661][ T2100] CR2: 00007fc1d25e7100 CR3: 0000000120a8e006 CR4: 00000000001606e0
	Aug 28 15:30:57 regress kernel: [18454.479894][ T2100] Call Trace:
	Aug 28 15:30:57 regress kernel: [18454.480416][ T2100]  ? update_backref_node+0xf0/0xf0
	Aug 28 15:30:57 regress kernel: [18454.481209][ T2100]  ? check_chain_key+0x1e6/0x2e0
	Aug 28 15:30:57 regress kernel: [18454.482012][ T2100]  btrfs_init_reloc_root+0x1b0/0x310
	Aug 28 15:30:57 regress kernel: [18454.482859][ T2100]  ? find_reloc_root+0x200/0x200
	Aug 28 15:30:57 regress kernel: [18454.483661][ T2100]  ? do_raw_spin_unlock+0xa8/0x140
	Aug 28 15:30:57 regress kernel: [18454.484482][ T2100]  record_root_in_trans+0x18c/0x1d0
	Aug 28 15:30:57 regress kernel: [18454.485435][ T2100]  btrfs_record_root_in_trans+0x8b/0xc0
	Aug 28 15:30:57 regress kernel: [18454.486301][ T2100]  start_transaction+0x16b/0x8f0
	Aug 28 15:30:57 regress kernel: [18454.487082][ T2100]  btrfs_start_transaction+0x1e/0x20
	Aug 28 15:30:57 regress kernel: [18454.487905][ T2100]  btrfs_cont_expand+0x549/0x7a0
	Aug 28 15:30:57 regress kernel: [18454.488680][ T2100]  ? btrfs_truncate_block+0x970/0x970
	Aug 28 15:30:57 regress kernel: [18454.489527][ T2100]  ? timestamp_truncate+0x180/0x180
	Aug 28 15:30:57 regress kernel: [18454.490344][ T2100]  ? check_chain_key+0x1e6/0x2e0
	Aug 28 15:30:57 regress kernel: [18454.491117][ T2100]  btrfs_file_write_iter+0x7ae/0x957
	Aug 28 15:30:57 regress kernel: [18454.491938][ T2100]  ? btrfs_sync_file+0x7c0/0x7c0
	Aug 28 15:30:57 regress kernel: [18454.492710][ T2100]  ? iov_iter_init+0x99/0xd0
	Aug 28 15:30:57 regress kernel: [18454.493426][ T2100]  new_sync_write+0x2ad/0x3f0
	Aug 28 15:30:57 regress kernel: [18454.494153][ T2100]  ? new_sync_read+0x3e0/0x3e0
	Aug 28 15:30:57 regress kernel: [18454.494890][ T2100]  ? check_flags+0x26/0x30
	Aug 28 15:30:57 regress kernel: [18454.495582][ T2100]  ? lock_is_held_type+0xc9/0x100
	Aug 28 15:30:57 regress kernel: [18454.496365][ T2100]  ? rcu_read_lock_any_held+0xd2/0x100
	Aug 28 15:30:57 regress kernel: [18454.497211][ T2100]  ? rcu_read_lock_held+0xb0/0xb0
	Aug 28 15:30:57 regress kernel: [18454.497985][ T2100]  ? __sb_start_write+0x1a1/0x270
	Aug 28 15:30:57 regress kernel: [18454.498768][ T2100]  vfs_write+0x2d2/0x300
	Aug 28 15:30:57 regress kernel: [18454.499417][ T2100]  ksys_write+0xcc/0x170
	Aug 28 15:30:57 regress kernel: [18454.500064][ T2100]  ? __ia32_sys_read+0x50/0x50
	Aug 28 15:30:57 regress kernel: [18454.500783][ T2100]  ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
	Aug 28 15:30:57 regress kernel: [18454.501704][ T2100]  __x64_sys_write+0x43/0x50
	Aug 28 15:30:57 regress kernel: [18454.502403][ T2100]  do_syscall_64+0x60/0xf0
	Aug 28 15:30:57 regress kernel: [18454.503079][ T2100]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
	Aug 28 15:30:57 regress kernel: [18454.503971][ T2100] RIP: 0033:0x7f1b8f8c5504
	Aug 28 15:30:57 regress kernel: [18454.504644][ T2100] Code: 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b3 0f 1f 80 00 00 00 00 48 8d 05 f9 61 0d 00 8b 00 85 c0 75 13 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 54 c3 0f 1f 00 41 54 49 89 d4 55 48 89 f5 53
	Aug 28 15:30:57 regress kernel: [18454.507565][ T2100] RSP: 002b:00007fff3419eaa8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
	Aug 28 15:30:57 regress kernel: [18454.508800][ T2100] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f1b8f8c5504
	Aug 28 15:30:57 regress kernel: [18454.509982][ T2100] RDX: 0000000000000400 RSI: 000055e56f375bb0 RDI: 0000000000000001
	Aug 28 15:30:57 regress kernel: [18454.511153][ T2100] RBP: 0000000000000400 R08: 0000000000000400 R09: 000000002c4a4095
	Aug 28 15:30:57 regress kernel: [18454.512325][ T2100] R10: 000000000a7b98fd R11: 0000000000000246 R12: 000055e56f375bb0
	Aug 28 15:30:57 regress kernel: [18454.513503][ T2100] R13: 000055e56f375bb0 R14: 0000000000008000 R15: 0000000000000400
	Aug 28 15:30:57 regress kernel: [18454.514685][ T2100] Modules linked in:
	Aug 28 15:30:57 regress kernel: [18454.515321][ T2100] ---[ end trace dc1ad17026339b11 ]---
	Aug 28 15:30:57 regress kernel: [18454.516184][ T2100] RIP: 0010:create_reloc_root+0x47a/0x490
	Aug 28 15:30:57 regress kernel: [18454.517085][ T2100] Code: e8 5b 3b bd ff 4d 8b 76 50 be 08 00 00 00 49 8d bc 24 f0 00 00 00 e8 65 3b bd ff 4d 89 b4 24 f0 00 00 00 e9 dc fc ff ff 0f 0b <0f> 0b 0f 0b 0f 0b 0f 0b e8 b9 90 07 01 66 0f 1f 84 00 00 00 00 00
	Aug 28 15:30:57 regress kernel: [18454.520010][ T2100] RSP: 0018:ffffc90000c777d0 EFLAGS: 00010282
	Aug 28 15:30:57 regress kernel: [18454.520935][ T2100] RAX: 000000000000001b RBX: ffff88817cbc9400 RCX: ffffffffa5273b42
	Aug 28 15:30:57 regress kernel: [18454.522172][ T2100] RDX: 0000000000000000 RSI: 0000000000000008 RDI: ffff8881f5dff28c
	Aug 28 15:30:57 regress kernel: [18454.523567][ T2100] RBP: ffffc90000c778c0 R08: ffffed103ebc1645 R09: ffffed103ebc1645
	Aug 28 15:30:57 regress kernel: [18454.524985][ T2100] R10: ffff8881f5e0b227 R11: ffffed103ebc1644 R12: ffff8881cb710020
	Aug 28 15:30:57 regress kernel: [18454.526404][ T2100] R13: ffff888118800a80 R14: 00000000ffffffef R15: ffffc90000c77858
	Aug 28 15:30:57 regress kernel: [18454.527887][ T2100] FS:  00007f1b8f7d9b80(0000) GS:ffff8881f5c00000(0000) knlGS:0000000000000000
	Aug 28 15:30:57 regress kernel: [18454.529576][ T2100] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
	Aug 28 15:30:57 regress kernel: [18454.530845][ T2100] CR2: 00007fc1d25e7100 CR3: 0000000120a8e006 CR4: 00000000001606e0
	Aug 28 15:30:57 regress kernel: [18454.821401][T32222] ==================================================================
	Aug 28 15:30:57 regress kernel: [18454.822634][T32222] BUG: KASAN: use-after-free in __mutex_lock+0x202/0xce0
	Aug 28 15:30:57 regress kernel: [18454.823654][T32222] Read of size 4 at addr ffff88811329c02c by task mkdir/32222
	Aug 28 15:30:57 regress kernel: [18454.824781][T32222] 
	Aug 28 15:30:57 regress kernel: [18454.825148][T32222] CPU: 1 PID: 32222 Comm: mkdir Tainted: G      D W         5.8.5-8de74804e45b+ #6
	Aug 28 15:30:57 regress kernel: [18454.826616][T32222] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
	Aug 28 15:30:57 regress kernel: [18454.828088][T32222] Call Trace:
	Aug 28 15:30:57 regress kernel: [18454.828607][T32222]  dump_stack+0xc8/0x11a
	Aug 28 15:30:57 regress kernel: [18454.829297][T32222]  ? __mutex_lock+0x202/0xce0
	Aug 28 15:30:57 regress kernel: [18454.830033][T32222]  print_address_description.constprop.8+0x1f/0x200
	Aug 28 15:30:57 regress kernel: [18454.831062][T32222]  ? __mutex_lock+0x202/0xce0
	Aug 28 15:30:57 regress kernel: [18454.831783][T32222]  ? __mutex_lock+0x202/0xce0
	Aug 28 15:30:57 regress kernel: [18454.832537][T32222]  kasan_report.cold.11+0x20/0x3e
	Aug 28 15:30:57 regress kernel: [18454.833323][T32222]  ? __mutex_lock+0x202/0xce0
	Aug 28 15:30:57 regress kernel: [18454.834056][T32222]  __asan_load4+0x69/0x90
	Aug 28 15:30:57 regress kernel: [18454.834754][T32222]  __mutex_lock+0x202/0xce0
	Aug 28 15:30:57 regress kernel: [18454.835475][T32222]  ? wait_current_trans+0xb7/0x230
	Aug 28 15:30:57 regress kernel: [18454.836295][T32222]  ? btrfs_record_root_in_trans+0x7e/0xc0
	Aug 28 15:30:57 regress kernel: [18454.837206][T32222]  ? mutex_lock_io_nested+0xc20/0xc20
	Aug 28 15:30:57 regress kernel: [18454.838064][T32222]  ? __kasan_check_read+0x11/0x20
	Aug 28 15:30:57 regress kernel: [18454.838860][T32222]  ? join_transaction+0x32/0x6f0
	Aug 28 15:30:57 regress kernel: [18454.839653][T32222]  ? join_transaction+0x1a6/0x6f0
	Aug 28 15:30:57 regress kernel: [18454.840592][T32222]  ? lock_downgrade+0x3e0/0x3e0
	Aug 28 15:30:57 regress kernel: [18454.841401][T32222]  ? __kasan_check_write+0x14/0x20
	Aug 28 15:30:57 regress kernel: [18454.842165][T32222]  ? lock_contended+0x720/0x720
	Aug 28 15:30:57 regress kernel: [18454.842883][T32222]  ? do_raw_spin_lock+0x1e0/0x1e0
	Aug 28 15:30:57 regress kernel: [18454.843629][T32222]  ? wait_current_trans+0xb7/0x230
	Aug 28 15:30:57 regress kernel: [18454.844409][T32222]  mutex_lock_nested+0x1b/0x20
	Aug 28 15:30:57 regress kernel: [18454.845121][T32222]  ? mutex_lock_nested+0x1b/0x20
	Aug 28 15:30:57 regress kernel: [18454.845867][T32222]  btrfs_record_root_in_trans+0x7e/0xc0
	Aug 28 15:30:57 regress kernel: [18454.846694][T32222]  start_transaction+0x16b/0x8f0
	Aug 28 15:30:57 regress kernel: [18454.847438][T32222]  btrfs_start_transaction+0x1e/0x20
	Aug 28 15:30:57 regress kernel: [18454.848223][T32222]  btrfs_mkdir+0xf5/0x3b0
	Aug 28 15:30:57 regress kernel: [18454.848863][T32222]  ? make_kprojid+0x20/0x20
	Aug 28 15:30:57 regress kernel: [18454.849533][T32222]  ? putname+0x6b/0x80
	Aug 28 15:30:57 regress kernel: [18454.850141][T32222]  ? btrfs_rename2+0x2b20/0x2b20
	Aug 28 15:30:57 regress kernel: [18454.850866][T32222]  ? generic_permission+0x58/0x250
	Aug 28 15:30:57 regress kernel: [18454.851753][T32222]  ? security_inode_permission+0x1d/0x70
	Aug 28 15:30:57 regress kernel: [18454.852598][T32222]  ? inode_permission+0x7a/0x1f0
	Aug 28 15:30:57 regress kernel: [18454.853343][T32222]  vfs_mkdir+0x1e1/0x2f0
	Aug 28 15:30:57 regress kernel: [18454.853971][T32222]  do_mkdirat+0x192/0x1c0
	Aug 28 15:30:57 regress kernel: [18454.854620][T32222]  ? __ia32_sys_mknod+0x50/0x50
	Aug 28 15:30:57 regress kernel: [18454.855357][T32222]  ? trace_hardirqs_on_prepare+0x35/0x170
	Aug 28 15:30:57 regress kernel: [18454.856239][T32222]  __x64_sys_mkdir+0x37/0x40
	Aug 28 15:30:57 regress kernel: [18454.856951][T32222]  do_syscall_64+0x60/0xf0
	Aug 28 15:30:57 regress kernel: [18454.857645][T32222]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
	Aug 28 15:30:57 regress kernel: [18454.858609][T32222] RIP: 0033:0x7f36074470d7
	Aug 28 15:30:57 regress kernel: [18454.859287][T32222] Code: 1f 40 00 48 8b 05 b9 0d 0d 00 64 c7 00 5f 00 00 00 b8 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 b8 53 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 89 0d 0d 00 f7 d8 64 89 01 48
	Aug 28 15:30:57 regress kernel: [18454.862597][T32222] RSP: 002b:00007ffc5c8419e8 EFLAGS: 00000206 ORIG_RAX: 0000000000000053
	Aug 28 15:30:57 regress kernel: [18454.863874][T32222] RAX: ffffffffffffffda RBX: 00007ffc5c842bc8 RCX: 00007f36074470d7
	Aug 28 15:30:57 regress kernel: [18454.865087][T32222] RDX: 0000000000000000 RSI: 00000000000001ff RDI: 00007ffc5c842bc8
	Aug 28 15:30:57 regress kernel: [18454.866297][T32222] RBP: 00007ffc5c842bc8 R08: 00000000000001ff R09: 0000557174728c00
	Aug 28 15:30:57 regress kernel: [18454.867501][T32222] R10: fffffffffffff35a R11: 0000000000000206 R12: 00000000000001ff
	Aug 28 15:30:57 regress kernel: [18454.868709][T32222] R13: 0000000000000000 R14: 00007ffc5c841b60 R15: 00007ffc5c841cf0
	Aug 28 15:30:57 regress kernel: [18454.869923][T32222] 
	Aug 28 15:30:57 regress kernel: [18454.870296][T32222] Allocated by task 2066:
	Aug 28 15:30:57 regress kernel: [18454.870939][T32222]  save_stack+0x21/0x50
	Aug 28 15:30:57 regress kernel: [18454.871572][T32222]  __kasan_kmalloc.constprop.17+0xc1/0xd0
	Aug 28 15:30:57 regress kernel: [18454.872434][T32222]  kasan_slab_alloc+0x12/0x20
	Aug 28 15:30:57 regress kernel: [18454.873133][T32222]  kmem_cache_alloc_node+0x113/0x720
	Aug 28 15:30:57 regress kernel: [18454.873914][T32222]  copy_process+0x357/0x3680
	Aug 28 15:30:57 regress kernel: [18454.874653][T32222]  _do_fork+0xed/0x880
	Aug 28 15:30:57 regress kernel: [18454.875353][T32222]  __do_sys_clone+0xee/0x130
	Aug 28 15:30:57 regress kernel: [18454.876057][T32222]  __x64_sys_clone+0x67/0x80
	Aug 28 15:30:57 regress kernel: [18454.876782][T32222]  do_syscall_64+0x60/0xf0
	Aug 28 15:30:57 regress kernel: [18454.877476][T32222]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
	Aug 28 15:30:57 regress kernel: [18454.878398][T32222] 
	Aug 28 15:30:57 regress kernel: [18454.878760][T32222] Freed by task 3558:
	Aug 28 15:30:57 regress kernel: [18454.879384][T32222]  save_stack+0x21/0x50
	Aug 28 15:30:57 regress kernel: [18454.880038][T32222]  __kasan_slab_free+0x118/0x170
	Aug 28 15:30:57 regress kernel: [18454.880855][T32222]  kasan_slab_free+0xe/0x10
	Aug 28 15:30:57 regress kernel: [18454.881565][T32222]  kmem_cache_free+0x5f/0x280
	Aug 28 15:30:57 regress kernel: [18454.882297][T32222]  free_task+0x73/0x90
	Aug 28 15:30:57 regress kernel: [18454.882928][T32222]  __put_task_struct+0x199/0x1d0
	Aug 28 15:30:57 regress kernel: [18454.883699][T32222]  delayed_put_task_struct+0x124/0x1b0
	Aug 28 15:30:57 regress kernel: [18454.884615][T32222]  rcu_core+0x3b0/0xea0
	Aug 28 15:30:57 regress kernel: [18454.885273][T32222]  rcu_core_si+0xe/0x10
	Aug 28 15:30:57 regress kernel: [18454.886251][T32222]  __do_softirq+0x120/0x5e3
	Aug 28 15:30:57 regress kernel: [18454.886964][T32222] 
	Aug 28 15:30:57 regress kernel: [18454.887332][T32222] The buggy address belongs to the object at ffff88811329c000
	Aug 28 15:30:57 regress kernel: [18454.887332][T32222]  which belongs to the cache task_struct(192:ssh.service) of size 11072
	Aug 28 15:30:57 regress kernel: [18454.889771][T32222] The buggy address is located 44 bytes inside of
	Aug 28 15:30:57 regress kernel: [18454.889771][T32222]  11072-byte region [ffff88811329c000, ffff88811329eb40)
	Aug 28 15:30:57 regress kernel: [18454.891843][T32222] The buggy address belongs to the page:
	Aug 28 15:30:57 regress kernel: [18454.892718][T32222] page:ffffea00044ca700 refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff88811329ffff head:ffffea00044ca700 order:2 compound_mapcount:0 compound_pincount:0
	Aug 28 15:30:57 regress kernel: [18454.895303][T32222] flags: 0x17ffe0000010200(slab|head)
	Aug 28 15:30:57 regress kernel: [18454.896186][T32222] raw: 017ffe0000010200 ffffea0001a49908 ffff8881f5b36498 ffff8881eb5a1380
	Aug 28 15:30:57 regress kernel: [18454.897618][T32222] raw: ffff88811329ffff ffff88811329c000 0000000100000001 0000000000000000
	Aug 28 15:30:57 regress kernel: [18454.899016][T32222] page dumped because: kasan: bad access detected
	Aug 28 15:30:57 regress kernel: [18454.900061][T32222] 
	Aug 28 15:30:57 regress kernel: [18454.900439][T32222] Memory state around the buggy address:
	Aug 28 15:30:57 regress kernel: [18454.901364][T32222]  ffff88811329bf00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
	Aug 28 15:30:57 regress kernel: [18454.902699][T32222]  ffff88811329bf80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
	Aug 28 15:30:57 regress kernel: [18454.904052][T32222] >ffff88811329c000: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
	Aug 28 15:30:57 regress kernel: [18454.905345][T32222]                                   ^
	Aug 28 15:30:57 regress kernel: [18454.906245][T32222]  ffff88811329c080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
	Aug 28 15:30:57 regress kernel: [18454.907675][T32222]  ffff88811329c100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
	Aug 28 15:30:57 regress kernel: [18454.909247][T32222] ==================================================================
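
For anyone wading through similar logs, here is a minimal sketch that
picks out the first non-zero "ret = N" line from this kind of debug
output.  It assumes the ad hoc "<function> ret = <number>" format of
the printk lines added for this test; the names RET_RE and
first_failure are invented for the example.

```python
import re

# Matches the ad hoc debug lines, e.g. "...[ T2100] btrfs_insert_root ret = -17"
RET_RE = re.compile(r"\]\s+(\w+) ret = (-?\d+)\s*$")


def first_failure(lines):
    """Return (function, ret) for the first debug line with ret != 0, else None."""
    for line in lines:
        match = RET_RE.search(line)
        if match and int(match.group(2)) != 0:
            return match.group(1), int(match.group(2))
    return None


# Three lines copied from the log above.
LOG = """\
Aug 28 15:30:56 regress kernel: [18454.453135][ T2100] btrfs_search_slot ret = 0
Aug 28 15:30:56 regress kernel: [18454.453955][ T2100] btrfs_insert_empty_item ret = -17
Aug 28 15:30:56 regress kernel: [18454.455022][ T2100] btrfs_insert_root ret = -17
""".splitlines()

print(first_failure(LOG))
```

On the lines above it reports btrfs_insert_empty_item with -17, i.e.
the innermost function that logged the error before it propagated up to
btrfs_insert_root.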


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: BUG at fs/btrfs/relocation.c:794! Still happening on misc-next and 5.8.3
  2020-08-28 20:42             ` Zygo Blaxell
@ 2020-09-01 22:53               ` Zygo Blaxell
  2020-09-01 23:33                 ` Qu Wenruo
  0 siblings, 1 reply; 13+ messages in thread
From: Zygo Blaxell @ 2020-09-01 22:53 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: Qu Wenruo, David Sterba, linux-btrfs, wqu

On Fri, Aug 28, 2020 at 04:42:55PM -0400, Zygo Blaxell wrote:
> On Fri, Aug 28, 2020 at 09:34:31AM +0300, Nikolay Borisov wrote:
> > On 28.08.20 г. 3:08 ч., Zygo Blaxell wrote:
> > > On Thu, Aug 27, 2020 at 08:03:13PM -0400, Zygo Blaxell wrote:
> > >> On Tue, Aug 04, 2020 at 12:16:26PM -0400, Zygo Blaxell wrote:
> > 
> > <snip>
> > 
> > Since you can repro reliably could you modify the code in
> > create_reloc_root so it prints what's the returned error value, I'd
> > speculate it's EEXIST from
> > 
> > btrfs_insert_root
> >   btrfs_insert_item
> >    btrfs_insert_empty_item
> >      btrfs_insert_empty_items
> >        btrfs_search_slot
> > 
> > But better be sure.
> 
> Here you go, EEXIST == 17:
> 
> 	Aug 28 15:30:55 regress kernel: [18452.845182][T31546] BTRFS info (device dm-0): balance: start -dlimit=9
> 	Aug 28 15:30:55 regress kernel: [18452.874627][T31546] BTRFS info (device dm-0): relocating block group 15873413742592 flags data
> 	Aug 28 15:30:55 regress kernel: [18453.097516][ T2100] btrfs_search_slot ret = 0
> 	Aug 28 15:30:55 regress kernel: [18453.104865][ T2100] btrfs_search_slot ret = 0
> 	Aug 28 15:30:55 regress kernel: [18453.109751][ T2100] btrfs_search_slot ret = 0
> 	Aug 28 15:30:56 regress kernel: [18454.453135][ T2100] btrfs_search_slot ret = 0
> 	Aug 28 15:30:56 regress kernel: [18454.453955][ T2100] btrfs_insert_empty_item ret = -17
> 	Aug 28 15:30:56 regress kernel: [18454.455022][ T2100] btrfs_insert_root ret = -17
> 	Aug 28 15:30:56 regress kernel: [18454.456229][ T2100] ------------[ cut here ]------------
> 	Aug 28 15:30:56 regress kernel: [18454.457622][ T2100] kernel BUG at fs/btrfs/relocation.c:795!

I did a low-resolution bisect for this issue.  I dug up 5.4, 5.5, 5.6,
and 5.7 kernel sources, backported btrfs fixes from 5.4 to the obsolete
kernels, and ran the tests on each kernel.  Results:

	5.8:  kernel BUG at fs/btrfs/relocation.c:794

	5.7:  kernel BUG (same code but different line number)

	5.6:  kernel BUG (same as the others)

	5.5:  assertion failure (stack trace below)

	5.4:  kernel BUG (!)

The 5.4 result is interesting--I've been running 5.4 for some time and
not hit this before.  So there are 3 possible theories:

	1.  It's because of sending signals to balance.  That was
	added to my test workload after 5.7 was released, so earlier
	tests on 5.4 would not have triggered it.

	2.  There's a regression in 5.4-stable, which I've cherry-picked
	to all the other kernels during my test setup.  (On the other
	hand, if I don't backport some fixes, kernels 5.5..5.7 crash
	before they get to this bug.)

	3.  There's something rotten in my test filesystem, and the
	BUG will go away for a while if I do a mkfs.  Qu asked for
	a dump earlier in this thread, and I provided one.

All three of these theories are testable to some extent, so I'll have
my test VM run some variations.

The full test workload is:

	balance metadata or data at random intervals

	scrub, scrub cancel at random intervals

	20x rsync updating files

	snapshot create, delete at random intervals

	bees dedupe (lots of EXTENT_SAME and LOGICAL_INO calls)

	balance cancel at random intervals

	kill -9 $(pidof btrfs balance) at random intervals (NEW - added
	when 5.7 came out)
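
A rough sketch of how the "random intervals" part of such a workload
could be planned, in case someone wants to approximate it.  This is not
the actual test harness.  The mount point, the gap bounds, and the
-mlimit=9 filter are assumptions for illustration (only -dlimit=9
appears in the logs); rsync, snapshots, bees, and the kill -9 step are
left out.  It only builds a schedule and runs nothing.

```python
import random

MNT = "/mnt/test"  # hypothetical mount point, not from this thread

# (label, argv) pairs using standard btrfs-progs subcommands.
ACTIONS = [
    ("balance data", ["btrfs", "balance", "start", "-dlimit=9", MNT]),
    ("balance metadata", ["btrfs", "balance", "start", "-mlimit=9", MNT]),
    ("balance cancel", ["btrfs", "balance", "cancel", MNT]),
    ("scrub", ["btrfs", "scrub", "start", MNT]),
    ("scrub cancel", ["btrfs", "scrub", "cancel", MNT]),
]


def make_schedule(seed, duration_s, min_gap_s=30.0, max_gap_s=600.0):
    """Plan each action at independent random gaps; return events sorted by time."""
    rng = random.Random(seed)
    events = []
    for label, argv in ACTIONS:
        t = rng.uniform(min_gap_s, max_gap_s)
        while t < duration_s:
            events.append((t, label, argv))
            t += rng.uniform(min_gap_s, max_gap_s)
    return sorted(events, key=lambda event: event[0])


# Dry run: print the first few planned events for a one-hour window.
for t, label, argv in make_schedule(seed=1, duration_s=3600)[:5]:
    print(f"{t:8.1f}s  {label:18s} {' '.join(argv)}")
```

Fixing the seed makes a run repeatable, which may help when comparing
kernels against the same sequence of balance and scrub operations.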

This is the 5.5 root assertion failure:

	Sep  1 04:48:48 regress kernel: [10642.537825][T24161] assertion failed: root, in fs/btrfs/relocation.c:837
	Sep  1 04:48:48 regress kernel: [10642.538911][T24161] ------------[ cut here ]------------
	Sep  1 04:48:48 regress kernel: [10642.539704][T24161] kernel BUG at fs/btrfs/ctree.h:3125!
	Sep  1 04:48:48 regress kernel: [10642.540621][T24161] invalid opcode: 0000 [#1] SMP KASAN PTI
	Sep  1 04:48:48 regress kernel: [10642.540624][ T4639] irq event stamp: 49626809
	Sep  1 04:48:48 regress kernel: [10642.540632][ T4639] hardirqs last  enabled at (49626809): [<ffffffff8a00481a>] trace_hardirqs_on_thunk+0x1a/0x1c
	Sep  1 04:48:48 regress kernel: [10642.541451][T24161] CPU: 1 PID: 24161 Comm: btrfs Tainted: G        W         5.5.19-76348822ab91+ #14
	Sep  1 04:48:48 regress kernel: [10642.542114][ T4639] hardirqs last disabled at (49626808): [<ffffffff8a004836>] trace_hardirqs_off_thunk+0x1a/0x1c
	Sep  1 04:48:48 regress kernel: [10642.543693][T24161] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
	Sep  1 04:48:48 regress kernel: [10642.545017][ T4639] softirqs last  enabled at (49626726): [<ffffffff8bc0048b>] __do_softirq+0x48b/0x5be
	Sep  1 04:48:48 regress kernel: [10642.545023][ T4639] softirqs last disabled at (49626715): [<ffffffff8a1258a2>] irq_exit+0x112/0x120
	Sep  1 04:48:48 regress kernel: [10642.546536][T24161] RIP: 0010:assertfail.constprop.42+0x1c/0x1e
	Sep  1 04:48:49 regress kernel: [10642.551589][T24161] Code: 48 c7 c6 c0 90 03 8c e8 89 29 f1 ff 0f 0b 55 89 f1 48 c7 c2 40 82 03 8c 48 89 fe 48 c7 c7 60 83 03 8c 48 89 e5 e8 41 b0 90 ff <0f> 0b 0f 1f 44 00 00 55 48 89 e5 41 54 49 89 f4 53 48 89 fb 48 83
	Sep  1 04:48:49 regress kernel: [10642.554495][T24161] RSP: 0018:ffffc90002327150 EFLAGS: 00010282
	Sep  1 04:48:49 regress kernel: [10642.555371][T24161] RAX: 0000000000000034 RBX: 000004513701c000 RCX: ffffffff8a264242
	Sep  1 04:48:49 regress kernel: [10642.556515][T24161] RDX: 0000000000000000 RSI: 0000000000000008 RDI: ffff8881f580004c
	Sep  1 04:48:49 regress kernel: [10642.557680][T24161] RBP: ffffc90002327150 R08: ffffed103eb017e1 R09: ffffed103eb017e1
	Sep  1 04:48:49 regress kernel: [10642.558895][T24161] R10: ffffed103eb017e0 R11: ffff8881f580bf07 R12: ffff88800d1ea5c0
	Sep  1 04:48:49 regress kernel: [10642.560139][T24161] R13: ffffc900023273e0 R14: 0000000000000000 R15: ffff8880bbf238f0
	Sep  1 04:48:49 regress kernel: [10642.561391][T24161] FS:  00007f03f48488c0(0000) GS:ffff8881f5600000(0000) knlGS:0000000000000000
	Sep  1 04:48:49 regress kernel: [10642.562779][T24161] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
	Sep  1 04:48:49 regress kernel: [10642.563801][T24161] CR2: 00007f1cab76f718 CR3: 0000000189a5e004 CR4: 00000000001606e0
	Sep  1 04:48:49 regress kernel: [10642.565046][T24161] Call Trace:
	Sep  1 04:48:49 regress kernel: [10642.565565][T24161]  build_backref_tree+0x186b/0x2590
	Sep  1 04:48:49 regress kernel: [10642.566389][T24161]  ? relocate_data_extent+0x1a0/0x1a0
	Sep  1 04:48:49 regress kernel: [10642.567295][T24161]  ? lock_downgrade+0x3d0/0x3d0
	Sep  1 04:48:49 regress kernel: [10642.568142][T24161]  ? match_held_lock+0x20/0x260
	Sep  1 04:48:49 regress kernel: [10642.568925][T24161]  ? do_raw_spin_unlock+0xa8/0x140
	Sep  1 04:48:49 regress kernel: [10642.569765][T24161]  ? _raw_spin_trylock_bh+0x60/0x80
	Sep  1 04:48:49 regress kernel: [10642.570605][T24161]  ? release_extent_buffer+0x13b/0x230
	Sep  1 04:48:49 regress kernel: [10642.571480][T24161]  ? free_extent_buffer.part.45+0xd7/0x140
	Sep  1 04:48:49 regress kernel: [10642.572406][T24161]  relocate_tree_blocks+0x204/0xa50
	Sep  1 04:48:49 regress kernel: [10642.573244][T24161]  ? build_backref_tree+0x2590/0x2590
	Sep  1 04:48:49 regress kernel: [10642.574103][T24161]  ? rb_insert_color+0x3af/0x400
	Sep  1 04:48:49 regress kernel: [10642.574896][T24161]  ? kmem_cache_alloc_trace+0x5af/0x740
	Sep  1 04:48:49 regress kernel: [10642.575785][T24161]  ? tree_insert+0x90/0xb0
	Sep  1 04:48:49 regress kernel: [10642.576495][T24161]  ? add_tree_block.isra.38+0x1d6/0x230
	Sep  1 04:48:49 regress kernel: [10642.577387][T24161]  relocate_block_group+0x528/0x9d0
	Sep  1 04:48:49 regress kernel: [10642.578220][T24161]  ? merge_reloc_roots+0x470/0x470
	Sep  1 04:48:49 regress kernel: [10642.579047][T24161]  btrfs_relocate_block_group+0x26e/0x4c0
	Sep  1 04:48:49 regress kernel: [10642.579968][T24161]  btrfs_relocate_chunk+0x52/0xf0
	Sep  1 04:48:49 regress kernel: [10642.580773][T24161]  btrfs_balance+0xe5b/0x1800
	Sep  1 04:48:49 regress kernel: [10642.581542][T24161]  ? btrfs_relocate_chunk+0xf0/0xf0
	Sep  1 04:48:49 regress kernel: [10642.582381][T24161]  ? kmem_cache_alloc_trace+0x5af/0x740
	Sep  1 04:48:49 regress kernel: [10642.583270][T24161]  ? _copy_from_user+0xaa/0xd0
	Sep  1 04:48:49 regress kernel: [10642.584022][T24161]  btrfs_ioctl_balance+0x3de/0x4c0
	Sep  1 04:48:49 regress kernel: [10642.584819][T24161]  btrfs_ioctl+0x3122/0x4470
	Sep  1 04:48:49 regress kernel: [10642.585540][T24161]  ? __asan_loadN+0xf/0x20
	Sep  1 04:48:49 regress kernel: [10642.586229][T24161]  ? __asan_loadN+0xf/0x20
	Sep  1 04:48:49 regress kernel: [10642.586920][T24161]  ? btrfs_ioctl_get_supported_features+0x30/0x30
	Sep  1 04:48:49 regress kernel: [10642.587935][T24161]  ? __asan_loadN+0xf/0x20
	Sep  1 04:48:49 regress kernel: [10642.588649][T24161]  ? pvclock_clocksource_read+0xeb/0x190
	Sep  1 04:48:49 regress kernel: [10642.589566][T24161]  ? __asan_loadN+0xf/0x20
	Sep  1 04:48:49 regress kernel: [10642.590254][T24161]  ? pvclock_clocksource_read+0xeb/0x190
	Sep  1 04:48:49 regress kernel: [10642.591128][T24161]  ? __kasan_check_read+0x11/0x20
	Sep  1 04:48:49 regress kernel: [10642.591913][T24161]  ? check_chain_key+0x1e6/0x2e0
	Sep  1 04:48:49 regress kernel: [10642.592707][T24161]  ? __asan_loadN+0xf/0x20
	Sep  1 04:48:49 regress kernel: [10642.593409][T24161]  ? pvclock_clocksource_read+0xeb/0x190
	Sep  1 04:48:49 regress kernel: [10642.594312][T24161]  ? kvm_sched_clock_read+0x18/0x30
	Sep  1 04:48:49 regress kernel: [10642.595139][T24161]  ? check_chain_key+0x1e6/0x2e0
	Sep  1 04:48:49 regress kernel: [10642.595929][T24161]  ? sched_clock_cpu+0x1b/0x120
	Sep  1 04:48:49 regress kernel: [10642.596712][T24161]  do_vfs_ioctl+0x13e/0xad0
	Sep  1 04:48:49 regress kernel: [10642.597432][T24161]  ? btrfs_ioctl_get_supported_features+0x30/0x30
	Sep  1 04:48:49 regress kernel: [10642.598455][T24161]  ? do_vfs_ioctl+0x13e/0xad0
	Sep  1 04:48:49 regress kernel: [10642.599202][T24161]  ? compat_ioctl_preallocate+0x170/0x170
	Sep  1 04:48:49 regress kernel: [10642.600128][T24161]  ? __kasan_check_write+0x14/0x20
	Sep  1 04:48:49 regress kernel: [10642.600949][T24161]  ? up_read+0x176/0x4f0
	Sep  1 04:48:49 regress kernel: [10642.601648][T24161]  ? down_write_nested+0x2d0/0x2d0
	Sep  1 04:48:49 regress kernel: [10642.602476][T24161]  ? handle_mm_fault+0x211/0x480
	Sep  1 04:48:49 regress kernel: [10642.603263][T24161]  ? __kasan_check_read+0x11/0x20
	Sep  1 04:48:49 regress kernel: [10642.604062][T24161]  ? __fget_light+0xb2/0x110
	Sep  1 04:48:49 regress kernel: [10642.604805][T24161]  ksys_ioctl+0x67/0x90
	Sep  1 04:48:49 regress kernel: [10642.605471][T24161]  __x64_sys_ioctl+0x43/0x50
	Sep  1 04:48:49 regress kernel: [10642.606203][T24161]  do_syscall_64+0x77/0x2d0
	Sep  1 04:48:49 regress kernel: [10642.606933][T24161]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
	Sep  1 04:48:49 regress kernel: [10642.607875][T24161] RIP: 0033:0x7f03f493b427
	Sep  1 04:48:49 regress kernel: [10642.608588][T24161] Code: 00 00 90 48 8b 05 69 aa 0c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 39 aa 0c 00 f7 d8 64 89 01 48
	Sep  1 04:48:49 regress kernel: [10642.611697][T24161] RSP: 002b:00007ffd6bd78fb8 EFLAGS: 00000206 ORIG_RAX: 0000000000000010
	Sep  1 04:48:49 regress kernel: [10642.613035][T24161] RAX: ffffffffffffffda RBX: 00007ffd6bd79058 RCX: 00007f03f493b427
	Sep  1 04:48:49 regress kernel: [10642.614313][T24161] RDX: 00007ffd6bd79058 RSI: 00000000c4009420 RDI: 0000000000000003
	Sep  1 04:48:49 regress kernel: [10642.615605][T24161] RBP: 0000000000000003 R08: 0000000000000003 R09: 0000000000000078
	Sep  1 04:48:49 regress kernel: [10642.616877][T24161] R10: fffffffffffff5ab R11: 0000000000000206 R12: 0000000000000001
	Sep  1 04:48:49 regress kernel: [10642.618132][T24161] R13: 0000000000000000 R14: 00007ffd6bd7aa46 R15: 0000000000000001
	Sep  1 04:48:49 regress kernel: [10642.619378][T24161] Modules linked in:
	Sep  1 04:48:49 regress kernel: [10642.620153][T24161] ---[ end trace a33c17a7d43dd973 ]---


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: BUG at fs/btrfs/relocation.c:794! Still happening on misc-next and 5.8.3
  2020-09-01 22:53               ` Zygo Blaxell
@ 2020-09-01 23:33                 ` Qu Wenruo
  2020-09-02  0:14                   ` Zygo Blaxell
  0 siblings, 1 reply; 13+ messages in thread
From: Qu Wenruo @ 2020-09-01 23:33 UTC (permalink / raw)
  To: Zygo Blaxell, Nikolay Borisov; +Cc: David Sterba, linux-btrfs, wqu

This looks like a race with reloc tree creation from some other code
path.

If you could add debug output for create_reloc_root() and its callers,
we may have a chance to debug it.

But as a first step, we can hunt down the BUG_ON()s and make them
exit more gracefully.

I'll try to spare some time to do this in the following week.

Thanks,
Qu

On 2020/9/2 6:53 AM, Zygo Blaxell wrote:
> On Fri, Aug 28, 2020 at 04:42:55PM -0400, Zygo Blaxell wrote:
>> On Fri, Aug 28, 2020 at 09:34:31AM +0300, Nikolay Borisov wrote:
>>> On 28.08.20 г. 3:08 ч., Zygo Blaxell wrote:
>>>> On Thu, Aug 27, 2020 at 08:03:13PM -0400, Zygo Blaxell wrote:
>>>>> On Tue, Aug 04, 2020 at 12:16:26PM -0400, Zygo Blaxell wrote:
>>>
>>> <snip>
>>>
>>> Since you can repro reliably could you modify the code in
>>> create_reloc_root so it prints what's the returned error value, I'd
>>> speculate it's EEXIST from
>>>
>>> btrfs_insert_root
>>>   btrfs_insert_item
>>>    btrfs_insert_empty_item
>>>      btrfs_insert_empty_items
>>>        btrfs_search_slot
>>>
>>> But better be sure.
>>
>> Here you go, EEXIST == 17:
>>
>> 	Aug 28 15:30:55 regress kernel: [18452.845182][T31546] BTRFS info (device dm-0): balance: start -dlimit=9
>> 	Aug 28 15:30:55 regress kernel: [18452.874627][T31546] BTRFS info (device dm-0): relocating block group 15873413742592 flags data
>> 	Aug 28 15:30:55 regress kernel: [18453.097516][ T2100] btrfs_search_slot ret = 0
>> 	Aug 28 15:30:55 regress kernel: [18453.104865][ T2100] btrfs_search_slot ret = 0
>> 	Aug 28 15:30:55 regress kernel: [18453.109751][ T2100] btrfs_search_slot ret = 0
>> 	Aug 28 15:30:56 regress kernel: [18454.453135][ T2100] btrfs_search_slot ret = 0
>> 	Aug 28 15:30:56 regress kernel: [18454.453955][ T2100] btrfs_insert_empty_item ret = -17
>> 	Aug 28 15:30:56 regress kernel: [18454.455022][ T2100] btrfs_insert_root ret = -17
>> 	Aug 28 15:30:56 regress kernel: [18454.456229][ T2100] ------------[ cut here ]------------
>> 	Aug 28 15:30:56 regress kernel: [18454.457622][ T2100] kernel BUG at fs/btrfs/relocation.c:795!
>
> I did a low-resolution bisect for this issue.  I dug up 5.4, 5.5, 5.6,
> and 5.7 kernel sources, backported btrfs fixes from 5.4 to the obsolete
> kernels, and ran the tests on each kernel.  Results:
>
> 	5.8:  kernel BUG at fs/btrfs/relocation.c:794
>
> 	5.7:  kernel BUG (same code but different line number)
>
> 	5.6:  kernel BUG (same as the others)
>
> 	5.5:  assertion failure (stack trace below)
>
> 	5.4:  kernel BUG (!)
>
> The 5.4 result is interesting--I've been running 5.4 for some time and
> not hit this before.  So there are 3 possible theories:
>
> 	1.  It's because of sending signals to balance.  That has been
> 	added to my test workload after 5.7 was released, so earlier
> 	tests on 5.4 would not have triggered it.
>
> 	2.  There's a regression in 5.4-stable, which I've cherry-picked
> 	to all the other kernels during my test setup.	(On the other
> 	hand, if I don't backport some fixes, kernels 5.5..5.7 crash
> 	before they get to this bug.)
>
> 	3.  There's something rotten in my test filesystem, and the
> 	BUG will go away for a while if I do a mkfs.  Qu asked for
> 	a dump earlier in this thread, and I provided one.
>
> All three of these theories are testable to some extent, so I'll have
> my test VM run some variations.
>
> The full test workload is:
>
> 	balance metadata or data at random intervals
>
> 	scrub, scrub cancel at random intervals
>
> 	20x rsync updating files
>
> 	snapshot create, delete at random intervals
>
> 	bees dedupe (lots of EXTENT_SAME and LOGICAL_INO calls)
>
> 	balance cancel at random intervals
>
> 	kill -9 $(pidof btrfs balance) at random intervals (NEW - added
> 	when 5.7 came out)
>
> This is the 5.5 root assertion failure:
>
> 	Sep  1 04:48:48 regress kernel: [10642.537825][T24161] assertion failed: root, in fs/btrfs/relocation.c:837
> 	Sep  1 04:48:48 regress kernel: [10642.538911][T24161] ------------[ cut here ]------------
> 	Sep  1 04:48:48 regress kernel: [10642.539704][T24161] kernel BUG at fs/btrfs/ctree.h:3125!
> 	Sep  1 04:48:48 regress kernel: [10642.540621][T24161] invalid opcode: 0000 [#1] SMP KASAN PTI
> 	Sep  1 04:48:48 regress kernel: [10642.540624][ T4639] irq event stamp: 49626809
> 	Sep  1 04:48:48 regress kernel: [10642.540632][ T4639] hardirqs last  enabled at (49626809): [<ffffffff8a00481a>] trace_hardirqs_on_thunk+0x1a/0x1c
> 	Sep  1 04:48:48 regress kernel: [10642.541451][T24161] CPU: 1 PID: 24161 Comm: btrfs Tainted: G        W         5.5.19-76348822ab91+ #14
> 	Sep  1 04:48:48 regress kernel: [10642.542114][ T4639] hardirqs last disabled at (49626808): [<ffffffff8a004836>] trace_hardirqs_off_thunk+0x1a/0x1c
> 	Sep  1 04:48:48 regress kernel: [10642.543693][T24161] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
> 	Sep  1 04:48:48 regress kernel: [10642.545017][ T4639] softirqs last  enabled at (49626726): [<ffffffff8bc0048b>] __do_softirq+0x48b/0x5be
> 	Sep  1 04:48:48 regress kernel: [10642.545023][ T4639] softirqs last disabled at (49626715): [<ffffffff8a1258a2>] irq_exit+0x112/0x120
> 	Sep  1 04:48:48 regress kernel: [10642.546536][T24161] RIP: 0010:assertfail.constprop.42+0x1c/0x1e
> 	Sep  1 04:48:49 regress kernel: [10642.551589][T24161] Code: 48 c7 c6 c0 90 03 8c e8 89 29 f1 ff 0f 0b 55 89 f1 48 c7 c2 40 82 03 8c 48 89 fe 48 c7 c7 60 83 03 8c 48 89 e5 e8 41 b0 90 ff <0f> 0b 0f 1f 44 00 00 55 48 89 e5 41 54 49 89 f4 53 48 89 fb 48 83
> 	Sep  1 04:48:49 regress kernel: [10642.554495][T24161] RSP: 0018:ffffc90002327150 EFLAGS: 00010282
> 	Sep  1 04:48:49 regress kernel: [10642.555371][T24161] RAX: 0000000000000034 RBX: 000004513701c000 RCX: ffffffff8a264242
> 	Sep  1 04:48:49 regress kernel: [10642.556515][T24161] RDX: 0000000000000000 RSI: 0000000000000008 RDI: ffff8881f580004c
> 	Sep  1 04:48:49 regress kernel: [10642.557680][T24161] RBP: ffffc90002327150 R08: ffffed103eb017e1 R09: ffffed103eb017e1
> 	Sep  1 04:48:49 regress kernel: [10642.558895][T24161] R10: ffffed103eb017e0 R11: ffff8881f580bf07 R12: ffff88800d1ea5c0
> 	Sep  1 04:48:49 regress kernel: [10642.560139][T24161] R13: ffffc900023273e0 R14: 0000000000000000 R15: ffff8880bbf238f0
> 	Sep  1 04:48:49 regress kernel: [10642.561391][T24161] FS:  00007f03f48488c0(0000) GS:ffff8881f5600000(0000) knlGS:0000000000000000
> 	Sep  1 04:48:49 regress kernel: [10642.562779][T24161] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> 	Sep  1 04:48:49 regress kernel: [10642.563801][T24161] CR2: 00007f1cab76f718 CR3: 0000000189a5e004 CR4: 00000000001606e0
> 	Sep  1 04:48:49 regress kernel: [10642.565046][T24161] Call Trace:
> 	Sep  1 04:48:49 regress kernel: [10642.565565][T24161]  build_backref_tree+0x186b/0x2590
> 	Sep  1 04:48:49 regress kernel: [10642.566389][T24161]  ? relocate_data_extent+0x1a0/0x1a0
> 	Sep  1 04:48:49 regress kernel: [10642.567295][T24161]  ? lock_downgrade+0x3d0/0x3d0
> 	Sep  1 04:48:49 regress kernel: [10642.568142][T24161]  ? match_held_lock+0x20/0x260
> 	Sep  1 04:48:49 regress kernel: [10642.568925][T24161]  ? do_raw_spin_unlock+0xa8/0x140
> 	Sep  1 04:48:49 regress kernel: [10642.569765][T24161]  ? _raw_spin_trylock_bh+0x60/0x80
> 	Sep  1 04:48:49 regress kernel: [10642.570605][T24161]  ? release_extent_buffer+0x13b/0x230
> 	Sep  1 04:48:49 regress kernel: [10642.571480][T24161]  ? free_extent_buffer.part.45+0xd7/0x140
> 	Sep  1 04:48:49 regress kernel: [10642.572406][T24161]  relocate_tree_blocks+0x204/0xa50
> 	Sep  1 04:48:49 regress kernel: [10642.573244][T24161]  ? build_backref_tree+0x2590/0x2590
> 	Sep  1 04:48:49 regress kernel: [10642.574103][T24161]  ? rb_insert_color+0x3af/0x400
> 	Sep  1 04:48:49 regress kernel: [10642.574896][T24161]  ? kmem_cache_alloc_trace+0x5af/0x740
> 	Sep  1 04:48:49 regress kernel: [10642.575785][T24161]  ? tree_insert+0x90/0xb0
> 	Sep  1 04:48:49 regress kernel: [10642.576495][T24161]  ? add_tree_block.isra.38+0x1d6/0x230
> 	Sep  1 04:48:49 regress kernel: [10642.577387][T24161]  relocate_block_group+0x528/0x9d0
> 	Sep  1 04:48:49 regress kernel: [10642.578220][T24161]  ? merge_reloc_roots+0x470/0x470
> 	Sep  1 04:48:49 regress kernel: [10642.579047][T24161]  btrfs_relocate_block_group+0x26e/0x4c0
> 	Sep  1 04:48:49 regress kernel: [10642.579968][T24161]  btrfs_relocate_chunk+0x52/0xf0
> 	Sep  1 04:48:49 regress kernel: [10642.580773][T24161]  btrfs_balance+0xe5b/0x1800
> 	Sep  1 04:48:49 regress kernel: [10642.581542][T24161]  ? btrfs_relocate_chunk+0xf0/0xf0
> 	Sep  1 04:48:49 regress kernel: [10642.582381][T24161]  ? kmem_cache_alloc_trace+0x5af/0x740
> 	Sep  1 04:48:49 regress kernel: [10642.583270][T24161]  ? _copy_from_user+0xaa/0xd0
> 	Sep  1 04:48:49 regress kernel: [10642.584022][T24161]  btrfs_ioctl_balance+0x3de/0x4c0
> 	Sep  1 04:48:49 regress kernel: [10642.584819][T24161]  btrfs_ioctl+0x3122/0x4470
> 	Sep  1 04:48:49 regress kernel: [10642.585540][T24161]  ? __asan_loadN+0xf/0x20
> 	Sep  1 04:48:49 regress kernel: [10642.586229][T24161]  ? __asan_loadN+0xf/0x20
> 	Sep  1 04:48:49 regress kernel: [10642.586920][T24161]  ? btrfs_ioctl_get_supported_features+0x30/0x30
> 	Sep  1 04:48:49 regress kernel: [10642.587935][T24161]  ? __asan_loadN+0xf/0x20
> 	Sep  1 04:48:49 regress kernel: [10642.588649][T24161]  ? pvclock_clocksource_read+0xeb/0x190
> 	Sep  1 04:48:49 regress kernel: [10642.589566][T24161]  ? __asan_loadN+0xf/0x20
> 	Sep  1 04:48:49 regress kernel: [10642.590254][T24161]  ? pvclock_clocksource_read+0xeb/0x190
> 	Sep  1 04:48:49 regress kernel: [10642.591128][T24161]  ? __kasan_check_read+0x11/0x20
> 	Sep  1 04:48:49 regress kernel: [10642.591913][T24161]  ? check_chain_key+0x1e6/0x2e0
> 	Sep  1 04:48:49 regress kernel: [10642.592707][T24161]  ? __asan_loadN+0xf/0x20
> 	Sep  1 04:48:49 regress kernel: [10642.593409][T24161]  ? pvclock_clocksource_read+0xeb/0x190
> 	Sep  1 04:48:49 regress kernel: [10642.594312][T24161]  ? kvm_sched_clock_read+0x18/0x30
> 	Sep  1 04:48:49 regress kernel: [10642.595139][T24161]  ? check_chain_key+0x1e6/0x2e0
> 	Sep  1 04:48:49 regress kernel: [10642.595929][T24161]  ? sched_clock_cpu+0x1b/0x120
> 	Sep  1 04:48:49 regress kernel: [10642.596712][T24161]  do_vfs_ioctl+0x13e/0xad0
> 	Sep  1 04:48:49 regress kernel: [10642.597432][T24161]  ? btrfs_ioctl_get_supported_features+0x30/0x30
> 	Sep  1 04:48:49 regress kernel: [10642.598455][T24161]  ? do_vfs_ioctl+0x13e/0xad0
> 	Sep  1 04:48:49 regress kernel: [10642.599202][T24161]  ? compat_ioctl_preallocate+0x170/0x170
> 	Sep  1 04:48:49 regress kernel: [10642.600128][T24161]  ? __kasan_check_write+0x14/0x20
> 	Sep  1 04:48:49 regress kernel: [10642.600949][T24161]  ? up_read+0x176/0x4f0
> 	Sep  1 04:48:49 regress kernel: [10642.601648][T24161]  ? down_write_nested+0x2d0/0x2d0
> 	Sep  1 04:48:49 regress kernel: [10642.602476][T24161]  ? handle_mm_fault+0x211/0x480
> 	Sep  1 04:48:49 regress kernel: [10642.603263][T24161]  ? __kasan_check_read+0x11/0x20
> 	Sep  1 04:48:49 regress kernel: [10642.604062][T24161]  ? __fget_light+0xb2/0x110
> 	Sep  1 04:48:49 regress kernel: [10642.604805][T24161]  ksys_ioctl+0x67/0x90
> 	Sep  1 04:48:49 regress kernel: [10642.605471][T24161]  __x64_sys_ioctl+0x43/0x50
> 	Sep  1 04:48:49 regress kernel: [10642.606203][T24161]  do_syscall_64+0x77/0x2d0
> 	Sep  1 04:48:49 regress kernel: [10642.606933][T24161]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> 	Sep  1 04:48:49 regress kernel: [10642.607875][T24161] RIP: 0033:0x7f03f493b427
> 	Sep  1 04:48:49 regress kernel: [10642.608588][T24161] Code: 00 00 90 48 8b 05 69 aa 0c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 39 aa 0c 00 f7 d8 64 89 01 48
> 	Sep  1 04:48:49 regress kernel: [10642.611697][T24161] RSP: 002b:00007ffd6bd78fb8 EFLAGS: 00000206 ORIG_RAX: 0000000000000010
> 	Sep  1 04:48:49 regress kernel: [10642.613035][T24161] RAX: ffffffffffffffda RBX: 00007ffd6bd79058 RCX: 00007f03f493b427
> 	Sep  1 04:48:49 regress kernel: [10642.614313][T24161] RDX: 00007ffd6bd79058 RSI: 00000000c4009420 RDI: 0000000000000003
> 	Sep  1 04:48:49 regress kernel: [10642.615605][T24161] RBP: 0000000000000003 R08: 0000000000000003 R09: 0000000000000078
> 	Sep  1 04:48:49 regress kernel: [10642.616877][T24161] R10: fffffffffffff5ab R11: 0000000000000206 R12: 0000000000000001
> 	Sep  1 04:48:49 regress kernel: [10642.618132][T24161] R13: 0000000000000000 R14: 00007ffd6bd7aa46 R15: 0000000000000001
> 	Sep  1 04:48:49 regress kernel: [10642.619378][T24161] Modules linked in:
> 	Sep  1 04:48:49 regress kernel: [10642.620153][T24161] ---[ end trace a33c17a7d43dd973 ]---
>


* Re: BUG at fs/btrfs/relocation.c:794! Still happening on misc-next and 5.8.3
  2020-09-01 23:33                 ` Qu Wenruo
@ 2020-09-02  0:14                   ` Zygo Blaxell
  2020-09-02  1:46                     ` Qu Wenruo
  0 siblings, 1 reply; 13+ messages in thread
From: Zygo Blaxell @ 2020-09-02  0:14 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Nikolay Borisov, David Sterba, linux-btrfs, wqu

On Wed, Sep 02, 2020 at 07:33:21AM +0800, Qu Wenruo wrote:
> This looks like a race with reloc tree creation from some other
> code path.
> 
> If you could add debug output for create_reloc_root() and its callers,
> we may have a chance to debug it.

The callers are always the same:

	btrfs_init_reloc_root+0x1b0
	record_root_in_trans+0x18c
	record_root_in_trans+0x8b
	start_transaction+0x189

	(gdb) l *(create_reloc_root+0x468)
	0xffffffff81930848 is in create_reloc_root (fs/btrfs/relocation.c:1503).
	1498            btrfs_tree_unlock(eb);
	1499            free_extent_buffer(eb);
	1500
	1501            ret = btrfs_insert_root(trans, fs_info->tree_root,
	1502                                    &root_key, root_item);
	1503            BUG_ON(ret);
	1504            kfree(root_item);
	1505
	1506            reloc_root = btrfs_read_tree_root(fs_info->tree_root, &root_key);
	1507            BUG_ON(IS_ERR(reloc_root));
	(gdb) l *(btrfs_init_reloc_root+0x1b0)
	0xffffffff81937db0 is in btrfs_init_reloc_root (fs/btrfs/relocation.c:1567).
	1562            if (!trans->reloc_reserved) {
	1563                    rsv = trans->block_rsv;
	1564                    trans->block_rsv = rc->block_rsv;
	1565                    clear_rsv = 1;
	1566            }
	1567            reloc_root = create_reloc_root(trans, root, root->root_key.objectid);
	1568            if (clear_rsv)
	1569                    trans->block_rsv = rsv;
	1570
	1571            ret = __add_reloc_root(reloc_root);
	(gdb) l *(record_root_in_trans+0x18c)
	0xffffffff81889bfc is in record_root_in_trans (./include/asm-generic/bitops/instrumented-atomic.h:41).
	36       *
	37       * This is a relaxed atomic operation (no implied memory barriers).
	38       */
	39      static inline void clear_bit(long nr, volatile unsigned long *addr)
	40      {
	41              kasan_check_write(addr + BIT_WORD(nr), sizeof(long));
	42              arch_clear_bit(nr, addr);
	43      }
	44
	45      /**
	(gdb) l *(start_transaction+0x189)
	0xffffffff8188f0d9 is in start_transaction (fs/btrfs/transaction.c:697).
	692              * Thus it need to be called after current->journal_info initialized,
	693              * or we can deadlock.
	694              */
	695             btrfs_record_root_in_trans(h, root);
	696
	697             return h;
	698
	699     join_fail:
	700             if (type & __TRANS_FREEZABLE)
	701                     sb_end_intwrite(fs_info->sb);
	(gdb) quit

It seems to be very early in the transaction.  Is there anything to
output here?  Or are we more interested in what is left over from
the previous transaction?

> But as a first step, we can hunt down the BUG_ON()s and make them
> exit more gracefully.
> 
> I'll try to spare some time to do this in the following week.
> 
> Thanks,
> Qu
> 


* Re: BUG at fs/btrfs/relocation.c:794! Still happening on misc-next and 5.8.3
  2020-09-02  0:14                   ` Zygo Blaxell
@ 2020-09-02  1:46                     ` Qu Wenruo
  2020-09-04 15:54                       ` Zygo Blaxell
  0 siblings, 1 reply; 13+ messages in thread
From: Qu Wenruo @ 2020-09-02  1:46 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: Nikolay Borisov, David Sterba, linux-btrfs, wqu



On 2020/9/2 8:14 AM, Zygo Blaxell wrote:
> On Wed, Sep 02, 2020 at 07:33:21AM +0800, Qu Wenruo wrote:
>> This looks like a race with reloc tree creation from some other
>> code path.
>>
>> If you could add debug output for create_reloc_root() and its callers,
>> we may have a chance to debug it.
>
> The callers are always the same:
>
> 	btrfs_init_reloc_root+0x1b0
> 	record_root_in_trans+0x18c
> 	record_root_in_trans+0x8b
> 	start_transaction+0x189
>
> 	(gdb) l *(create_reloc_root+0x468)
> 	0xffffffff81930848 is in create_reloc_root (fs/btrfs/relocation.c:1503).
> 	1498            btrfs_tree_unlock(eb);
> 	1499            free_extent_buffer(eb);
> 	1500
> 	1501            ret = btrfs_insert_root(trans, fs_info->tree_root,
> 	1502                                    &root_key, root_item);
> 	1503            BUG_ON(ret);
> 	1504            kfree(root_item);
> 	1505
> 	1506            reloc_root = btrfs_read_tree_root(fs_info->tree_root, &root_key);
> 	1507            BUG_ON(IS_ERR(reloc_root));
> 	(gdb) l *(btrfs_init_reloc_root+0x1b0)
> 	0xffffffff81937db0 is in btrfs_init_reloc_root (fs/btrfs/relocation.c:1567).
> 	1562            if (!trans->reloc_reserved) {
> 	1563                    rsv = trans->block_rsv;
> 	1564                    trans->block_rsv = rc->block_rsv;
> 	1565                    clear_rsv = 1;
> 	1566            }
> 	1567            reloc_root = create_reloc_root(trans, root, root->root_key.objectid);
> 	1568            if (clear_rsv)
> 	1569                    trans->block_rsv = rsv;
> 	1570
> 	1571            ret = __add_reloc_root(reloc_root);
> 	(gdb) l *(record_root_in_trans+0x18c)
> 	0xffffffff81889bfc is in record_root_in_trans (./include/asm-generic/bitops/instrumented-atomic.h:41).
> 	36       *
> 	37       * This is a relaxed atomic operation (no implied memory barriers).
> 	38       */
> 	39      static inline void clear_bit(long nr, volatile unsigned long *addr)
> 	40      {
> 	41              kasan_check_write(addr + BIT_WORD(nr), sizeof(long));
> 	42              arch_clear_bit(nr, addr);
> 	43      }
> 	44
> 	45      /**
> 	(gdb) l *(start_transaction+0x189)
> 	0xffffffff8188f0d9 is in start_transaction (fs/btrfs/transaction.c:697).
> 	692              * Thus it need to be called after current->journal_info initialized,
> 	693              * or we can deadlock.
> 	694              */
> 	695             btrfs_record_root_in_trans(h, root);
> 	696
> 	697             return h;
> 	698
> 	699     join_fail:
> 	700             if (type & __TRANS_FREEZABLE)
> 	701                     sb_end_intwrite(fs_info->sb);
> 	(gdb) quit
>
> It seems to be very early in the transaction.  Is there anything to
> output here?  Or are we more interested in what is left over from
> the previous transaction?

What I mean is, I want to see who else created the reloc tree, not only
who triggered the EEXIST BUG_ON().

That's why I'd like enough debug pr_info() (or similar) output added to
create_reloc_root(), so that we can catch the ordinary calls that seem
safe on their own but may be unsafe alongside other callers.
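
To make the request concrete, below is a small user-space sketch of the kind
of tracing meant here, under stated assumptions. It is not a patch:
traced_create(), TRACED_CREATE(), first_path() and second_path() are
hypothetical names, and the table is a toy model of "one reloc root per
objectid". In the kernel, the equivalent would probably be a pr_info() plus
dump_stack() on every call to create_reloc_root(), so the log shows the
earlier, successful creator as well as the one that hits EEXIST.

```c
#include <errno.h>
#include <stdio.h>

#define TRACE_MAX_ROOTS 8

/* Hypothetical record of which caller created the reloc root per objectid. */
static const char *creator[TRACE_MAX_ROOTS];

/* Log every creation; on a duplicate, name both the first and second caller. */
static int traced_create(unsigned int objectid, const char *caller)
{
	if (objectid >= TRACE_MAX_ROOTS)
		return -EINVAL;
	if (creator[objectid]) {
		fprintf(stderr,
			"reloc root %u: already created by %s, again from %s\n",
			objectid, creator[objectid], caller);
		return -EEXIST;
	}
	creator[objectid] = caller;
	fprintf(stderr, "reloc root %u: created by %s\n", objectid, caller);
	return 0;
}

/* Capture the calling function's name at the call site. */
#define TRACED_CREATE(objectid) traced_create((objectid), __func__)

/* Two hypothetical code paths that both try to create the same reloc root. */
static int first_path(void)
{
	return TRACED_CREATE(5);
}

static int second_path(void)
{
	return TRACED_CREATE(5);
}
```

Calling first_path() and then second_path() logs the first creator by name
before the second call fails with -EEXIST, which is the information missing
from the traces so far.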

Thanks,
Qu

>
>> But as a first step, we can hunt down the BUG_ON()s and make them
>> exit more gracefully.
>>
>> I'll try to spare some time to do this in the following week.
>>
>> Thanks,
>> Qu
>>
>> On 2020/9/2 6:53 AM, Zygo Blaxell wrote:
>>> On Fri, Aug 28, 2020 at 04:42:55PM -0400, Zygo Blaxell wrote:
>>>> On Fri, Aug 28, 2020 at 09:34:31AM +0300, Nikolay Borisov wrote:
>>>>> On 28.08.20 at 3:08, Zygo Blaxell wrote:
>>>>>> On Thu, Aug 27, 2020 at 08:03:13PM -0400, Zygo Blaxell wrote:
>>>>>>> On Tue, Aug 04, 2020 at 12:16:26PM -0400, Zygo Blaxell wrote:
>>>>>
>>>>> <snip>
>>>>>
>>>>> Since you can repro reliably could you modify the code in
>>>>> create_reloc_root so it prints what's the returned error value, I'd
>>>>> speculate it's EEXIST from
>>>>>
>>>>> btrfs_insert_root
>>>>>   btrfs_insert_item
>>>>>    btrfs_insert_empty_item
>>>>>      btrfs_insert_empty_items
>>>>>        btrfs_search_slot
>>>>>
>>>>> But better be sure.
>>>>
>>>> Here you go, EEXIST == 17:
>>>>
>>>> 	Aug 28 15:30:55 regress kernel: [18452.845182][T31546] BTRFS info (device dm-0): balance: start -dlimit=9
>>>> 	Aug 28 15:30:55 regress kernel: [18452.874627][T31546] BTRFS info (device dm-0): relocating block group 15873413742592 flags data
>>>> 	Aug 28 15:30:55 regress kernel: [18453.097516][ T2100] btrfs_search_slot ret = 0
>>>> 	Aug 28 15:30:55 regress kernel: [18453.104865][ T2100] btrfs_search_slot ret = 0
>>>> 	Aug 28 15:30:55 regress kernel: [18453.109751][ T2100] btrfs_search_slot ret = 0
>>>> 	Aug 28 15:30:56 regress kernel: [18454.453135][ T2100] btrfs_search_slot ret = 0
>>>> 	Aug 28 15:30:56 regress kernel: [18454.453955][ T2100] btrfs_insert_empty_item ret = -17
>>>> 	Aug 28 15:30:56 regress kernel: [18454.455022][ T2100] btrfs_insert_root ret = -17
>>>> 	Aug 28 15:30:56 regress kernel: [18454.456229][ T2100] ------------[ cut here ]------------
>>>> 	Aug 28 15:30:56 regress kernel: [18454.457622][ T2100] kernel BUG at fs/btrfs/relocation.c:795!
>>>
>>> I did a low-resolution bisect for this issue.  I dug up 5.4, 5.5, 5.6,
>>> and 5.7 kernel sources, backported btrfs fixes from 5.4 to the obsolete
>>> kernels, and ran the tests on each kernel.  Results:
>>>
>>> 	5.8:  kernel BUG at fs/btrfs/relocation.c:794
>>>
>>> 	5.7:  kernel BUG (same code but different line number)
>>>
>>> 	5.6:  kernel BUG (same as the others)
>>>
>>> 	5.5:  assertion failure (stack trace below)
>>>
>>> 	5.4:  kernel BUG (!)
>>>
>>> The 5.4 result is interesting--I've been running 5.4 for some time and
>>> not hit this before.  So there are 3 possible theories:
>>>
>>> 	1.  It's because of sending signals to balance.  That has been
>>> 	added to my test workload after 5.7 was released, so earlier
>>> 	tests on 5.4 would not have triggered it.
>>>
>>> 	2.  There's a regression in 5.4-stable, which I've cherry-picked
>>> 	to all the other kernels during my test setup.  (On the other
>>> 	hand, if I don't backport some fixes, kernels 5.5..5.7 crash
>>> 	before they get to this bug.)
>>>
>>> 	3.  There's something rotten in my test filesystem, and the
>>> 	BUG will go away for a while if I do a mkfs.  Qu asked for
>>> 	a dump earlier in this thread, and I provided one.
>>>
>>> All three of these theories are testable to some extent, so I'll have
>>> my test VM run some variations.
>>>
>>> The full test workload is:
>>>
>>> 	balance metadata or data at random intervals
>>>
>>> 	scrub, scrub cancel at random intervals
>>>
>>> 	20x rsync updating files
>>>
>>> 	snapshot create, delete at random intervals
>>>
>>> 	bees dedupe (lots of EXTENT_SAME and LOGICAL_INO calls)
>>>
>>> 	balance cancel at random intervals
>>>
>>> 	kill -9 $(pidof btrfs balance) at random intervals (NEW - added
>>> 	when 5.7 came out)
>>>
>>> This is the 5.5 root assertion failure:
>>>
>>> 	Sep  1 04:48:48 regress kernel: [10642.537825][T24161] assertion failed: root, in fs/btrfs/relocation.c:837
>>> 	Sep  1 04:48:48 regress kernel: [10642.538911][T24161] ------------[ cut here ]------------
>>> 	Sep  1 04:48:48 regress kernel: [10642.539704][T24161] kernel BUG at fs/btrfs/ctree.h:3125!
>>> 	Sep  1 04:48:48 regress kernel: [10642.540621][T24161] invalid opcode: 0000 [#1] SMP KASAN PTI
>>> 	Sep  1 04:48:48 regress kernel: [10642.540624][ T4639] irq event stamp: 49626809
>>> 	Sep  1 04:48:48 regress kernel: [10642.540632][ T4639] hardirqs last  enabled at (49626809): [<ffffffff8a00481a>] trace_hardirqs_on_thunk+0x1a/0x1c
>>> 	Sep  1 04:48:48 regress kernel: [10642.541451][T24161] CPU: 1 PID: 24161 Comm: btrfs Tainted: G        W         5.5.19-76348822ab91+ #14
>>> 	Sep  1 04:48:48 regress kernel: [10642.542114][ T4639] hardirqs last disabled at (49626808): [<ffffffff8a004836>] trace_hardirqs_off_thunk+0x1a/0x1c
>>> 	Sep  1 04:48:48 regress kernel: [10642.543693][T24161] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
>>> 	Sep  1 04:48:48 regress kernel: [10642.545017][ T4639] softirqs last  enabled at (49626726): [<ffffffff8bc0048b>] __do_softirq+0x48b/0x5be
>>> 	Sep  1 04:48:48 regress kernel: [10642.545023][ T4639] softirqs last disabled at (49626715): [<ffffffff8a1258a2>] irq_exit+0x112/0x120
>>> 	Sep  1 04:48:48 regress kernel: [10642.546536][T24161] RIP: 0010:assertfail.constprop.42+0x1c/0x1e
>>> 	Sep  1 04:48:49 regress kernel: [10642.551589][T24161] Code: 48 c7 c6 c0 90 03 8c e8 89 29 f1 ff 0f 0b 55 89 f1 48 c7 c2 40 82 03 8c 48 89 fe 48 c7 c7 60 83 03 8c 48 89 e5 e8 41 b0 90 ff <0f> 0b 0f 1f 44 00 00 55 48 89 e5 41 54 49 89 f4 53 48 89 fb 48 83
>>> 	Sep  1 04:48:49 regress kernel: [10642.554495][T24161] RSP: 0018:ffffc90002327150 EFLAGS: 00010282
>>> 	Sep  1 04:48:49 regress kernel: [10642.555371][T24161] RAX: 0000000000000034 RBX: 000004513701c000 RCX: ffffffff8a264242
>>> 	Sep  1 04:48:49 regress kernel: [10642.556515][T24161] RDX: 0000000000000000 RSI: 0000000000000008 RDI: ffff8881f580004c
>>> 	Sep  1 04:48:49 regress kernel: [10642.557680][T24161] RBP: ffffc90002327150 R08: ffffed103eb017e1 R09: ffffed103eb017e1
>>> 	Sep  1 04:48:49 regress kernel: [10642.558895][T24161] R10: ffffed103eb017e0 R11: ffff8881f580bf07 R12: ffff88800d1ea5c0
>>> 	Sep  1 04:48:49 regress kernel: [10642.560139][T24161] R13: ffffc900023273e0 R14: 0000000000000000 R15: ffff8880bbf238f0
>>> 	Sep  1 04:48:49 regress kernel: [10642.561391][T24161] FS:  00007f03f48488c0(0000) GS:ffff8881f5600000(0000) knlGS:0000000000000000
>>> 	Sep  1 04:48:49 regress kernel: [10642.562779][T24161] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> 	Sep  1 04:48:49 regress kernel: [10642.563801][T24161] CR2: 00007f1cab76f718 CR3: 0000000189a5e004 CR4: 00000000001606e0
>>> 	Sep  1 04:48:49 regress kernel: [10642.565046][T24161] Call Trace:
>>> 	Sep  1 04:48:49 regress kernel: [10642.565565][T24161]  build_backref_tree+0x186b/0x2590
>>> 	Sep  1 04:48:49 regress kernel: [10642.566389][T24161]  ? relocate_data_extent+0x1a0/0x1a0
>>> 	Sep  1 04:48:49 regress kernel: [10642.567295][T24161]  ? lock_downgrade+0x3d0/0x3d0
>>> 	Sep  1 04:48:49 regress kernel: [10642.568142][T24161]  ? match_held_lock+0x20/0x260
>>> 	Sep  1 04:48:49 regress kernel: [10642.568925][T24161]  ? do_raw_spin_unlock+0xa8/0x140
>>> 	Sep  1 04:48:49 regress kernel: [10642.569765][T24161]  ? _raw_spin_trylock_bh+0x60/0x80
>>> 	Sep  1 04:48:49 regress kernel: [10642.570605][T24161]  ? release_extent_buffer+0x13b/0x230
>>> 	Sep  1 04:48:49 regress kernel: [10642.571480][T24161]  ? free_extent_buffer.part.45+0xd7/0x140
>>> 	Sep  1 04:48:49 regress kernel: [10642.572406][T24161]  relocate_tree_blocks+0x204/0xa50
>>> 	Sep  1 04:48:49 regress kernel: [10642.573244][T24161]  ? build_backref_tree+0x2590/0x2590
>>> 	Sep  1 04:48:49 regress kernel: [10642.574103][T24161]  ? rb_insert_color+0x3af/0x400
>>> 	Sep  1 04:48:49 regress kernel: [10642.574896][T24161]  ? kmem_cache_alloc_trace+0x5af/0x740
>>> 	Sep  1 04:48:49 regress kernel: [10642.575785][T24161]  ? tree_insert+0x90/0xb0
>>> 	Sep  1 04:48:49 regress kernel: [10642.576495][T24161]  ? add_tree_block.isra.38+0x1d6/0x230
>>> 	Sep  1 04:48:49 regress kernel: [10642.577387][T24161]  relocate_block_group+0x528/0x9d0
>>> 	Sep  1 04:48:49 regress kernel: [10642.578220][T24161]  ? merge_reloc_roots+0x470/0x470
>>> 	Sep  1 04:48:49 regress kernel: [10642.579047][T24161]  btrfs_relocate_block_group+0x26e/0x4c0
>>> 	Sep  1 04:48:49 regress kernel: [10642.579968][T24161]  btrfs_relocate_chunk+0x52/0xf0
>>> 	Sep  1 04:48:49 regress kernel: [10642.580773][T24161]  btrfs_balance+0xe5b/0x1800
>>> 	Sep  1 04:48:49 regress kernel: [10642.581542][T24161]  ? btrfs_relocate_chunk+0xf0/0xf0
>>> 	Sep  1 04:48:49 regress kernel: [10642.582381][T24161]  ? kmem_cache_alloc_trace+0x5af/0x740
>>> 	Sep  1 04:48:49 regress kernel: [10642.583270][T24161]  ? _copy_from_user+0xaa/0xd0
>>> 	Sep  1 04:48:49 regress kernel: [10642.584022][T24161]  btrfs_ioctl_balance+0x3de/0x4c0
>>> 	Sep  1 04:48:49 regress kernel: [10642.584819][T24161]  btrfs_ioctl+0x3122/0x4470
>>> 	Sep  1 04:48:49 regress kernel: [10642.585540][T24161]  ? __asan_loadN+0xf/0x20
>>> 	Sep  1 04:48:49 regress kernel: [10642.586229][T24161]  ? __asan_loadN+0xf/0x20
>>> 	Sep  1 04:48:49 regress kernel: [10642.586920][T24161]  ? btrfs_ioctl_get_supported_features+0x30/0x30
>>> 	Sep  1 04:48:49 regress kernel: [10642.587935][T24161]  ? __asan_loadN+0xf/0x20
>>> 	Sep  1 04:48:49 regress kernel: [10642.588649][T24161]  ? pvclock_clocksource_read+0xeb/0x190
>>> 	Sep  1 04:48:49 regress kernel: [10642.589566][T24161]  ? __asan_loadN+0xf/0x20
>>> 	Sep  1 04:48:49 regress kernel: [10642.590254][T24161]  ? pvclock_clocksource_read+0xeb/0x190
>>> 	Sep  1 04:48:49 regress kernel: [10642.591128][T24161]  ? __kasan_check_read+0x11/0x20
>>> 	Sep  1 04:48:49 regress kernel: [10642.591913][T24161]  ? check_chain_key+0x1e6/0x2e0
>>> 	Sep  1 04:48:49 regress kernel: [10642.592707][T24161]  ? __asan_loadN+0xf/0x20
>>> 	Sep  1 04:48:49 regress kernel: [10642.593409][T24161]  ? pvclock_clocksource_read+0xeb/0x190
>>> 	Sep  1 04:48:49 regress kernel: [10642.594312][T24161]  ? kvm_sched_clock_read+0x18/0x30
>>> 	Sep  1 04:48:49 regress kernel: [10642.595139][T24161]  ? check_chain_key+0x1e6/0x2e0
>>> 	Sep  1 04:48:49 regress kernel: [10642.595929][T24161]  ? sched_clock_cpu+0x1b/0x120
>>> 	Sep  1 04:48:49 regress kernel: [10642.596712][T24161]  do_vfs_ioctl+0x13e/0xad0
>>> 	Sep  1 04:48:49 regress kernel: [10642.597432][T24161]  ? btrfs_ioctl_get_supported_features+0x30/0x30
>>> 	Sep  1 04:48:49 regress kernel: [10642.598455][T24161]  ? do_vfs_ioctl+0x13e/0xad0
>>> 	Sep  1 04:48:49 regress kernel: [10642.599202][T24161]  ? compat_ioctl_preallocate+0x170/0x170
>>> 	Sep  1 04:48:49 regress kernel: [10642.600128][T24161]  ? __kasan_check_write+0x14/0x20
>>> 	Sep  1 04:48:49 regress kernel: [10642.600949][T24161]  ? up_read+0x176/0x4f0
>>> 	Sep  1 04:48:49 regress kernel: [10642.601648][T24161]  ? down_write_nested+0x2d0/0x2d0
>>> 	Sep  1 04:48:49 regress kernel: [10642.602476][T24161]  ? handle_mm_fault+0x211/0x480
>>> 	Sep  1 04:48:49 regress kernel: [10642.603263][T24161]  ? __kasan_check_read+0x11/0x20
>>> 	Sep  1 04:48:49 regress kernel: [10642.604062][T24161]  ? __fget_light+0xb2/0x110
>>> 	Sep  1 04:48:49 regress kernel: [10642.604805][T24161]  ksys_ioctl+0x67/0x90
>>> 	Sep  1 04:48:49 regress kernel: [10642.605471][T24161]  __x64_sys_ioctl+0x43/0x50
>>> 	Sep  1 04:48:49 regress kernel: [10642.606203][T24161]  do_syscall_64+0x77/0x2d0
>>> 	Sep  1 04:48:49 regress kernel: [10642.606933][T24161]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
>>> 	Sep  1 04:48:49 regress kernel: [10642.607875][T24161] RIP: 0033:0x7f03f493b427
>>> 	Sep  1 04:48:49 regress kernel: [10642.608588][T24161] Code: 00 00 90 48 8b 05 69 aa 0c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 39 aa 0c 00 f7 d8 64 89 01 48
>>> 	Sep  1 04:48:49 regress kernel: [10642.611697][T24161] RSP: 002b:00007ffd6bd78fb8 EFLAGS: 00000206 ORIG_RAX: 0000000000000010
>>> 	Sep  1 04:48:49 regress kernel: [10642.613035][T24161] RAX: ffffffffffffffda RBX: 00007ffd6bd79058 RCX: 00007f03f493b427
>>> 	Sep  1 04:48:49 regress kernel: [10642.614313][T24161] RDX: 00007ffd6bd79058 RSI: 00000000c4009420 RDI: 0000000000000003
>>> 	Sep  1 04:48:49 regress kernel: [10642.615605][T24161] RBP: 0000000000000003 R08: 0000000000000003 R09: 0000000000000078
>>> 	Sep  1 04:48:49 regress kernel: [10642.616877][T24161] R10: fffffffffffff5ab R11: 0000000000000206 R12: 0000000000000001
>>> 	Sep  1 04:48:49 regress kernel: [10642.618132][T24161] R13: 0000000000000000 R14: 00007ffd6bd7aa46 R15: 0000000000000001
>>> 	Sep  1 04:48:49 regress kernel: [10642.619378][T24161] Modules linked in:
>>> 	Sep  1 04:48:49 regress kernel: [10642.620153][T24161] ---[ end trace a33c17a7d43dd973 ]---
>>>


* Re: BUG at fs/btrfs/relocation.c:794! Still happening on misc-next and 5.8.3
  2020-09-02  1:46                     ` Qu Wenruo
@ 2020-09-04 15:54                       ` Zygo Blaxell
  0 siblings, 0 replies; 13+ messages in thread
From: Zygo Blaxell @ 2020-09-04 15:54 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Nikolay Borisov, David Sterba, linux-btrfs, wqu

On Wed, Sep 02, 2020 at 09:46:29AM +0800, Qu Wenruo wrote:
> On 2020/9/2 8:14 AM, Zygo Blaxell wrote:
> > On Wed, Sep 02, 2020 at 07:33:21AM +0800, Qu Wenruo wrote:
> >> This looks like a race with reloc tree creation in some other part
> >> of the code.
> >>
> >> If you could add debug output for create_reloc_root() and its callers,
> >> we may have a chance to debug it.
> 
> What I mean is, I want to see who else created the reloc tree, not only
> who caused the EEXIST BUG_ON().
> 
> That's why I'd like to add enough debug pr_info (or similar) to
> create_reloc_root(), so that we can catch the ordinary calls that seem
> safe but may be unsafe for other callers.

There doesn't appear to be a race with multiple instances of
create_reloc_root as nobody else seems to be calling it at the time
when it fails.  On the other hand, it is a kworker thread, so it could
be racing with something else.

	Sep  4 01:46:42 regress kernel: [12131.050264][ T5245] btrfs_search_slot ret = 0
	Sep  4 01:46:51 regress kernel: [12140.058734][ T5245] btrfs_search_slot ret = 0
	Sep  4 01:47:00 regress kernel: [12149.079892][ T5245] btrfs_search_slot ret = 0
	Sep  4 01:47:09 regress kernel: [12158.091883][ T5245] btrfs_search_slot ret = 0
	Sep  4 01:47:14 regress kernel: [12162.521167][ T2993] btrfs_search_slot ret = 0
	Sep  4 01:47:14 regress kernel: [12162.823894][ T2993] btrfs_search_slot ret = 0
	Sep  4 01:47:14 regress kernel: [12162.991624][ T2993] btrfs_search_slot ret = 0
	Sep  4 01:47:14 regress kernel: [12162.999665][ T2993] btrfs_search_slot ret = 0
	Sep  4 01:47:19 regress kernel: [12167.117620][ T5245] btrfs_search_slot ret = 0
	Sep  4 01:47:28 regress kernel: [12176.232713][ T5245] btrfs_search_slot ret = 0
	Sep  4 01:47:37 regress kernel: [12185.237905][ T5245] btrfs_search_slot ret = 0
	Sep  4 01:47:50 regress kernel: [12199.005753][ T5245] btrfs_search_slot ret = 0
	Sep  4 01:47:51 regress kernel: [12199.953977][T27716] BTRFS info (device dm-0): balance: start -dlimit=9
	Sep  4 01:47:51 regress kernel: [12199.992918][T27716] BTRFS info (device dm-0): relocating block group 16502453436416 flags data
	Sep  4 01:47:54 regress kernel: [12202.443621][T11829] root->root_key.objectid == 0, objectid = 10502
	Sep  4 01:47:54 regress kernel: [12202.444916][T11829] CPU: 0 PID: 11829 Comm: kworker/u8:20 Tainted: G        W         5.8.6-ce459d8ff170+ #8
	Sep  4 01:47:54 regress kernel: [12202.446791][T11829] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
	Sep  4 01:47:54 regress kernel: [12202.449187][T11829] Workqueue: btrfs-endio-write btrfs_work_helper
	Sep  4 01:47:54 regress kernel: [12202.450355][T11829] Call Trace:
	Sep  4 01:47:54 regress kernel: [12202.451580][T11829]  dump_stack+0xc8/0x11a
	Sep  4 01:47:54 regress kernel: [12202.452475][T11829]  create_reloc_root.cold.42+0x60/0x4d9
	Sep  4 01:47:54 regress kernel: [12202.453512][T11829]  ? invalidate_extent_cache+0x2a0/0x2a0
	Sep  4 01:47:54 regress kernel: [12202.454538][T11829]  ? check_chain_key+0x1e6/0x2e0
	Sep  4 01:47:54 regress kernel: [12202.455479][T11829]  btrfs_init_reloc_root+0x2d7/0x310
	Sep  4 01:47:54 regress kernel: [12202.456493][T11829]  ? find_reloc_root+0x200/0x200
	Sep  4 01:47:54 regress kernel: [12202.457510][T11829]  ? do_raw_spin_unlock+0xa8/0x140
	Sep  4 01:47:54 regress kernel: [12202.458446][T11829]  record_root_in_trans+0x18c/0x1d0
	Sep  4 01:47:54 regress kernel: [12202.459394][T11829]  btrfs_record_root_in_trans+0x8b/0xc0
	Sep  4 01:47:54 regress kernel: [12202.460673][T11829]  start_transaction+0x16b/0x8f0
	Sep  4 01:47:54 regress kernel: [12202.461595][T11829]  btrfs_join_transaction+0x1d/0x20
	Sep  4 01:47:54 regress kernel: [12202.462586][T11829]  btrfs_finish_ordered_io+0x535/0xd10
	Sep  4 01:47:54 regress kernel: [12202.463590][T11829]  ? register_lock_class+0x900/0x900
	Sep  4 01:47:54 regress kernel: [12202.464576][T11829]  ? btrfs_update_inode_fallback+0x40/0x40
	Sep  4 01:47:54 regress kernel: [12202.465584][T11829]  ? rcu_read_lock_sched_held+0xa1/0xd0
	Sep  4 01:47:54 regress kernel: [12202.466547][T11829]  ? rcu_read_lock_bh_held+0xb0/0xb0
	Sep  4 01:47:54 regress kernel: [12202.467463][T11829]  ? lock_is_held_type+0xc9/0x100
	Sep  4 01:47:54 regress kernel: [12202.468371][T11829]  finish_ordered_fn+0x15/0x20
	Sep  4 01:47:54 regress kernel: [12202.469224][T11829]  btrfs_work_helper+0x118/0x920
	Sep  4 01:47:54 regress kernel: [12202.470105][T11829]  ? rcu_read_lock_bh_held+0xb0/0xb0
	Sep  4 01:47:54 regress kernel: [12202.471082][T11829]  ? trace_hardirqs_on+0x57/0x140
	Sep  4 01:47:54 regress kernel: [12202.471998][T11829]  process_one_work+0x507/0xa70
	Sep  4 01:47:54 regress kernel: [12202.472885][T11829]  ? pwq_dec_nr_in_flight+0x130/0x130
	Sep  4 01:47:54 regress kernel: [12202.473816][T11829]  ? do_raw_spin_lock+0x1e0/0x1e0
	Sep  4 01:47:54 regress kernel: [12202.474716][T11829]  worker_thread+0x63/0x5a0
	Sep  4 01:47:54 regress kernel: [12202.475510][T11829]  ? process_one_work+0xa70/0xa70
	Sep  4 01:47:54 regress kernel: [12202.476428][T11829]  kthread+0x20c/0x230
	Sep  4 01:47:54 regress kernel: [12202.477137][T11829]  ? kthread_create_worker_on_cpu+0xc0/0xc0
	Sep  4 01:47:54 regress kernel: [12202.478152][T11829]  ret_from_fork+0x22/0x30
	Sep  4 01:47:54 regress kernel: [12202.480616][T11829] btrfs_search_slot ret = 0
	Sep  4 01:47:54 regress kernel: [12202.482834][T11829] btrfs_insert_empty_item ret = -17
	Sep  4 01:47:54 regress kernel: [12202.485447][T11829] btrfs_insert_root ret = -17
	Sep  4 01:47:54 regress kernel: [12202.487775][T11829] ------------[ cut here ]------------
	Sep  4 01:47:54 regress kernel: [12202.490086][T11829] kernel BUG at fs/btrfs/relocation.c:798!
	Sep  4 01:47:54 regress kernel: [12202.491104][T11829] invalid opcode: 0000 [#1] SMP KASAN PTI
	Sep  4 01:47:54 regress kernel: [12202.492056][T11829] CPU: 1 PID: 11829 Comm: kworker/u8:20 Tainted: G        W         5.8.6-ce459d8ff170+ #8
	Sep  4 01:47:54 regress kernel: [12202.493712][T11829] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
	Sep  4 01:47:54 regress kernel: [12202.495311][T11829] Workqueue: btrfs-endio-write btrfs_work_helper
	Sep  4 01:47:54 regress kernel: [12202.496424][T11829] RIP: 0010:create_reloc_root.cold.42+0x434/0x4d9
	Sep  4 01:47:54 regress kernel: [12202.497550][T11829] Code: e8 6c d6 f3 ff 48 c7 c7 e0 98 03 8f 89 c6 89 85 30 ff ff ff e8 0d 53 8c ff 8b 95 30 ff ff ff 4c 8b 8d 28 ff ff ff 85 d2 74 02 <0f> 0b 4c 89 cf e8 fd 56 bc ff 4c 89 e7 e8 45 9c bc ff 49 8b 7f 20
	Sep  4 01:47:54 regress kernel: [12202.501225][T11829] RSP: 0018:ffffc9000b80f920 EFLAGS: 00010282
	Sep  4 01:47:54 regress kernel: [12202.503239][T11829] RAX: 000000000000001b RBX: 1ffff92001701f29 RCX: ffffffff8d273af2
	Sep  4 01:47:54 regress kernel: [12202.504644][T11829] RDX: 00000000ffffffef RSI: 0000000000000008 RDI: ffff8881f59ff28c
	Sep  4 01:47:54 regress kernel: [12202.507025][T11829] RBP: ffffc9000b80fa10 R08: ffffed103eb41645 R09: ffff8880a598b400
	Sep  4 01:47:54 regress kernel: [12202.509429][T11829] R10: ffff8881f5a0b227 R11: ffffed103eb41644 R12: ffff8881c93e8020
	Sep  4 01:47:54 regress kernel: [12202.510781][T11829] R13: ffff8881cbefd2a0 R14: ffffc9000b80f9a8 R15: ffff8881c93e8000
	Sep  4 01:47:54 regress kernel: [12202.512142][T11829] FS:  0000000000000000(0000) GS:ffff8881f5800000(0000) knlGS:0000000000000000
	Sep  4 01:47:54 regress kernel: [12202.513651][T11829] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
	Sep  4 01:47:54 regress kernel: [12202.514790][T11829] CR2: 00007fb4c11f0a68 CR3: 00000001dc604005 CR4: 00000000001606e0
	Sep  4 01:47:54 regress kernel: [12202.516258][T11829] Call Trace:
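
As a reading aid for the numbers above (an illustrative user-space sketch,
not part of the kernel patch): the -17 printed for btrfs_insert_root and
the 00000000ffffffef shown in RDX are the same value, because 0xffffffef
read as a signed 32-bit integer is -17, i.e. -EEXIST on Linux.  RDX is
presumably the ret value that BUG_ON() tested.  The helper name
errno_from_reg is hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/*
 * Interpret the low 32 bits of a register value as a signed int and
 * return the errno magnitude if it is negative, or 0 otherwise.
 * Hypothetical helper for reading oops register dumps by hand.
 */
uint32_t errno_from_reg(uint64_t reg)
{
	uint32_t low = (uint32_t)reg;

	if (!(low & 0x80000000u))
		return 0;		/* non-negative as a 32-bit int */
	return 0u - low;		/* two's complement negation, defined for unsigned */
}

/* Self-check with register values that appear in the dump above. */
void check_thread_values(void)
{
	/* RDX: 00000000ffffffef  ->  -17  ->  -EEXIST */
	assert(errno_from_reg(0x00000000ffffffefULL) == EEXIST);
	/* RAX: 000000000000001b is positive, so it is not an errno */
	assert(errno_from_reg(0x000000000000001bULL) == 0);
}
```

Calling check_thread_values() from a small main() is enough to confirm the
arithmetic on a Linux host, where EEXIST is 17.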

For reference, here's my kernel logging so far:

diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index 82ab6e5a386d..b98b74397afc 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -4748,10 +4748,14 @@ int btrfs_insert_empty_items(struct btrfs_trans_handle *trans,
 
	total_size = total_data + (nr * sizeof(struct btrfs_item));
	ret = btrfs_search_slot(trans, root, cpu_key, path, total_size, 1);
-	if (ret == 0)
+	if (ret == 0) {
+		printk(KERN_ERR "btrfs_search_slot ret = %d\n", ret);
		return -EEXIST;
-	if (ret < 0)
+	}
+	if (ret < 0) {
+		printk(KERN_ERR "btrfs_search_slot ret = %d\n", ret);
		return ret;
+	}
 
	slot = path->slots[0];
	BUG_ON(slot < 0);
@@ -4775,14 +4779,18 @@ int btrfs_insert_item(struct btrfs_trans_handle *trans, struct btrfs_root *root,
	unsigned long ptr;
 
	path = btrfs_alloc_path();
-	if (!path)
+	if (!path) {
+		printk(KERN_ERR "btrfs_alloc_path ENOMEM\n");
		return -ENOMEM;
+	}
	ret = btrfs_insert_empty_item(trans, root, path, cpu_key, data_size);
	if (!ret) {
		leaf = path->nodes[0];
		ptr = btrfs_item_ptr_offset(leaf, path->slots[0]);
		write_extent_buffer(leaf, data, ptr, data_size);
		btrfs_mark_buffer_dirty(leaf);
+	} else {
+		printk(KERN_ERR "btrfs_insert_empty_item ret = %d\n", ret);
	}
	btrfs_free_path(path);
	return ret;
diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 350050b288e4..23fffd4bfd41 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -744,6 +744,9 @@ static struct btrfs_root *create_reloc_root(struct btrfs_trans_handle *trans,
	root_key.type = BTRFS_ROOT_ITEM_KEY;
	root_key.offset = objectid;
 
+	printk(KERN_ERR "root->root_key.objectid == %llu, objectid = %llu\n", root->root_key.objectid, objectid);
+	dump_stack();
+
	if (root->root_key.objectid == objectid) {
		u64 commit_root_gen;
 
@@ -791,6 +794,7 @@ static struct btrfs_root *create_reloc_root(struct btrfs_trans_handle *trans,
 
	ret = btrfs_insert_root(trans, fs_info->tree_root,
				&root_key, root_item);
+	printk(KERN_ERR "btrfs_insert_root ret = %d\n", ret);
	BUG_ON(ret);
	kfree(root_item);
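
The two instrumented paths in this patch can be modelled in user space.
This is a minimal sketch under stated assumptions, not the btrfs code and
not a proposed fix: every toy_* name is hypothetical, the "tree" is a flat
array, and the only behaviour taken from the patch is that a search result
of 0 (key already present) is turned into -EEXIST by the insert path.  The
caller hands the error back instead of aborting, which is the general
direction Qu suggested earlier in the thread for the BUG_ON()s.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define TOY_MAX_KEYS 16

/* Hypothetical stand-in for a tree of root items, keyed by objectid. */
struct toy_tree {
	unsigned long long keys[TOY_MAX_KEYS];
	size_t nr;
};

/*
 * Same return convention as the search in the patch: 0 means the key
 * was found, 1 means it is absent.  (The real function can also fail
 * with a negative value; this toy version cannot.)
 */
int toy_search_slot(const struct toy_tree *t, unsigned long long key)
{
	for (size_t i = 0; i < t->nr; i++)
		if (t->keys[i] == key)
			return 0;
	return 1;
}

/* Insert that refuses duplicates, like the "ret == 0" branch above. */
int toy_insert_item(struct toy_tree *t, unsigned long long key)
{
	if (toy_search_slot(t, key) == 0)
		return -EEXIST;
	if (t->nr == TOY_MAX_KEYS)
		return -ENOSPC;
	t->keys[t->nr++] = key;
	return 0;
}

/* Caller that propagates the error; the code in this thread does BUG_ON(ret). */
int toy_create_reloc_root(struct toy_tree *t, unsigned long long objectid)
{
	int ret = toy_insert_item(t, objectid);

	if (ret)
		return ret;
	return 0;
}
```

Creating the same objectid twice (for example 10502, as in the log) makes
the second call return -EEXIST while leaving the first entry in place.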

> >>> The 5.4 result is interesting--I've been running 5.4 for some time and
> >>> not hit this before.  So there are 3 possible theories:
> >>>
> >>> 	1.  It's because of sending signals to balance.  That has been
> >>> 	added to my test workload after 5.7 was released, so earlier
> >>> 	tests on 5.4 would not have triggered it.

This might be related.  I removed 'kill the balance process' from my
test workload, and didn't have any BUG_ONs for two days.  When I put
the kill-the-balance-process test back in the workload, it went back
to BUG_ON at fairly reliable 1-5 hour intervals.  Of course that's just
correlation, and with random events at that, but so far the data supports
theory 1 and refutes theory 3.

> >>> 	2.  There's a regression in 5.4-stable, which I've cherry-picked
> >>> 	to all the other kernels during my test setup.  (On the other
> >>> 	hand, if I don't backport some fixes, kernels 5.5..5.7 crash
> >>> 	before they get to this bug.)
> >>>
> >>> 	3.  There's something rotten in my test filesystem, and the
> >>> 	BUG will go away for a while if I do a mkfs.  Qu asked for
> >>> 	a dump earlier in this thread, and I provided one.


Thread overview: 13+ messages
2020-06-30 22:10 BUG at fs/btrfs/relocation.c:794! David Sterba
2020-07-23 21:56 ` Zygo Blaxell
2020-07-24  0:19   ` Qu Wenruo
2020-08-04 16:16     ` Zygo Blaxell
2020-08-28  0:03       ` BUG at fs/btrfs/relocation.c:794! Still happening on misc-next and 5.8.3 Zygo Blaxell
2020-08-28  0:08         ` Zygo Blaxell
2020-08-28  6:34           ` Nikolay Borisov
2020-08-28 20:42             ` Zygo Blaxell
2020-09-01 22:53               ` Zygo Blaxell
2020-09-01 23:33                 ` Qu Wenruo
2020-09-02  0:14                   ` Zygo Blaxell
2020-09-02  1:46                     ` Qu Wenruo
2020-09-04 15:54                       ` Zygo Blaxell
