* BUG at fs/btrfs/relocation.c:794!
@ 2020-06-30 22:10 David Sterba
2020-07-23 21:56 ` Zygo Blaxell
0 siblings, 1 reply; 13+ messages in thread
From: David Sterba @ 2020-06-30 22:10 UTC
To: linux-btrfs; +Cc: wqu
Hi,
I've hit a crash in relocation I've never seen before.
[ 2129.210066] kernel BUG at fs/btrfs/relocation.c:794!
[ 2129.215268] invalid opcode: 0000 [#1] PREEMPT SMP
[ 2129.220114] CPU: 1 PID: 3303 Comm: btrfs Not tainted 5.8.0-rc3-git+ #638
[ 2129.220116] Hardware name: empty empty/S3993, BIOS PAQEX0-3 02/24/2008
[ 2129.220265] RIP: 0010:create_reloc_root+0x214/0x260 [btrfs]
[ 2129.258760] RSP: 0018:ffffbe1e809b38b8 EFLAGS: 00010282
[ 2129.258763] RAX: 00000000ffffffef RBX: ffff988d577f9000 RCX: 0000000000000000
[ 2129.258765] RDX: 0000000000000001 RSI: ffffffff8e2a2580 RDI: ffff988d64aaa6a8
[ 2129.258766] RBP: ffff988d5dfcdc00 R08: 0000000000000000 R09: 0000000000000000
[ 2129.258767] R10: 0000000000000001 R11: 0000000000000000 R12: ffff988d0e02fa78
[ 2129.258769] R13: 0000000000000005 R14: ffff988d64fe8000 R15: ffff988d0e02fa78
[ 2129.258771] FS: 00007f82a612e8c0(0000) GS:ffff988d67000000(0000) knlGS:0000000000000000
[ 2129.258772] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2129.258774] CR2: 000000000559d028 CR3: 000000020b289000 CR4: 00000000000006e0
[ 2129.258775] Call Trace:
[ 2129.258825] btrfs_init_reloc_root+0xe8/0x120 [btrfs]
[ 2129.258862] record_root_in_trans+0xae/0xd0 [btrfs]
[ 2129.258901] btrfs_record_root_in_trans+0x51/0x70 [btrfs]
[ 2129.340388] select_reloc_root+0x94/0x340 [btrfs]
[ 2129.340433] do_relocation+0xda/0x7b0 [btrfs]
[ 2129.349854] ? _raw_spin_unlock+0x1f/0x40
[ 2129.349898] relocate_tree_blocks+0x336/0x670 [btrfs]
[ 2129.359325] relocate_block_group+0x2f6/0x600 [btrfs]
[ 2129.359365] btrfs_relocate_block_group+0x15e/0x340 [btrfs]
[ 2129.359408] btrfs_relocate_chunk+0x38/0x110 [btrfs]
[ 2129.375494] __btrfs_balance+0x42c/0xce0 [btrfs]
[ 2129.375553] btrfs_balance+0x66a/0xbe0 [btrfs]
[ 2129.375562] ? kmem_cache_alloc_trace+0x19c/0x330
[ 2129.389852] btrfs_ioctl_balance+0x298/0x350 [btrfs]
[ 2129.389887] btrfs_ioctl+0x304/0x2490 [btrfs]
[ 2129.389898] ? do_user_addr_fault+0x221/0x49c
[ 2129.404070] ? sched_clock_cpu+0x15/0x140
[ 2129.404073] ? do_user_addr_fault+0x221/0x49c
[ 2129.404079] ? up_read+0x18/0x240
[ 2129.404086] ? ksys_ioctl+0x68/0xa0
[ 2129.404091] ksys_ioctl+0x68/0xa0
[ 2129.423308] __x64_sys_ioctl+0x16/0x20
[ 2129.423312] do_syscall_64+0x50/0xe0
[ 2129.423315] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 2129.423318] RIP: 0033:0x7f82a51c6327
[ 2129.423319] Code: Bad RIP value.
[ 2129.423348] RSP: 002b:00007ffd32cf6218 EFLAGS: 00000206 ORIG_RAX: 0000000000000010
[ 2129.423367] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f82a51c6327
[ 2129.423368] RDX: 00007ffd32cf62a0 RSI: 00000000c4009420 RDI: 0000000000000003
[ 2129.423372] RBP: 0000000000000003 R08: 0000000000000000 R09: 0000000000000000
[ 2129.423377] R10: 000000000fa99fa0 R11: 0000000000000206 R12: 00007ffd32cf8823
[ 2129.423379] R13: 00007ffd32cf62a0 R14: 0000000000000001 R15: 0000000000000000
Relevant code called from create_reloc_root:
	ret = btrfs_insert_root(trans, fs_info->tree_root,
				&root_key, root_item);
	BUG_ON(ret);
and according to EAX (the low 32 bits of RAX, 0xffffffef), ret is -17, which
is -EEXIST.
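For anyone who wants to double-check the register decoding, here is a minimal
userspace sketch. The helper names ret_from_rax() and rax_is_eexist() are made
up for this illustration and are not kernel functions; the only assumption is
that an int return value sits in the low 32 bits of RAX, as is usual on x86-64.

```c
#include <errno.h>
#include <stdint.h>

/*
 * Interpret the low 32 bits of a 64-bit register dump (EAX within RAX)
 * as a signed int, the way a C "int ret" return value is held there.
 */
int ret_from_rax(uint64_t rax)
{
	uint32_t eax = (uint32_t)(rax & 0xffffffffu);

	return (int)(int32_t)eax;
}

/* Return 1 if the register value encodes -EEXIST, 0 otherwise. */
int rax_is_eexist(uint64_t rax)
{
	return ret_from_rax(rax) == -EEXIST;
}
```

Feeding it the RAX value from the dump above (0x00000000ffffffef) gives -17.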
I don't have a reproducer. The testing image was filled by random git
checkouts, deduplicated by BEES, and then tons of snapshots were created until
the metadata got exhausted, followed by some file deletions and balances.
This is the same image that led to the patch "btrfs: allow use of global block
reserve for balance item deletion", so this could have left it in some
intermediate state where neither the balance item nor the reloc tree was
removed.
There were a few unsuccessful mounts due to relocation recovery, which I was
trying to debug, but then it started to work.
The error happened with this 'fi df' saved after the balance start:
# btrfs fi df mnt
Data, single: total=80.01GiB, used=38.67GiB
System, single: total=4.00MiB, used=16.00KiB
Metadata, single: total=19.99GiB, used=19.46GiB
GlobalReserve, single: total=512.00MiB, used=44.00KiB
The error looks like a repeated relocation tree creation, which would point to
the unsuccessful balances or an inconsistent state (balance item, reloc trees).
It's not a "typical" mix of operations but I'd appreciate any insights here.
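To illustrate the "repeated relocation tree creation" theory, here is a toy
userspace model, not kernel code. It assumes, from my reading of the btrfs
headers, that a reloc root item is keyed as (TREE_RELOC objectid, ROOT_ITEM,
id of the tree being relocated); all toy_* names and the fixed-size array are
invented for the sketch.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed to match the btrfs on-disk format headers. */
#define TREE_RELOC_OBJECTID ((uint64_t)-8)
#define ROOT_ITEM_KEY 132

#define TOY_MAX_ITEMS 16

struct toy_key {
	uint64_t objectid;
	uint8_t type;
	uint64_t offset;
};

/* Stand-in for the root tree: a flat list of keys, no item payload. */
struct toy_root_tree {
	struct toy_key items[TOY_MAX_ITEMS];
	size_t nr;
};

/* Key under which the reloc root of a given subvolume would be stored. */
struct toy_key toy_reloc_key(uint64_t subvol_id)
{
	struct toy_key key = { TREE_RELOC_OBJECTID, ROOT_ITEM_KEY, subvol_id };

	return key;
}

/* Insert a key; an already present identical key yields -EEXIST. */
int toy_insert_root(struct toy_root_tree *tree, const struct toy_key *key)
{
	size_t i;

	for (i = 0; i < tree->nr; i++) {
		const struct toy_key *cur = &tree->items[i];

		if (cur->objectid == key->objectid &&
		    cur->type == key->type &&
		    cur->offset == key->offset)
			return -EEXIST;
	}
	if (tree->nr == TOY_MAX_ITEMS)
		return -ENOSPC;
	tree->items[tree->nr++] = *key;
	return 0;
}
```

In this model, a second toy_insert_root() with the key for the same subvolume
returns -EEXIST, which is what a leftover reloc root item from an earlier
balance would look like to the insertion in create_reloc_root().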
* Re: BUG at fs/btrfs/relocation.c:794!
2020-06-30 22:10 BUG at fs/btrfs/relocation.c:794! David Sterba
@ 2020-07-23 21:56 ` Zygo Blaxell
2020-07-24 0:19 ` Qu Wenruo
0 siblings, 1 reply; 13+ messages in thread
From: Zygo Blaxell @ 2020-07-23 21:56 UTC
To: David Sterba; +Cc: linux-btrfs, wqu
On Wed, Jul 01, 2020 at 12:10:06AM +0200, David Sterba wrote:
> Hi,
>
> I've hit a crash in relocation I've never seen before.
>
> [ 2129.210066] kernel BUG at fs/btrfs/relocation.c:794!
I hit an issue yesterday that reminded me of this.
> [ 2129.215268] invalid opcode: 0000 [#1] PREEMPT SMP
> [ 2129.220114] CPU: 1 PID: 3303 Comm: btrfs Not tainted 5.8.0-rc3-git+ #638
> [ 2129.220116] Hardware name: empty empty/S3993, BIOS PAQEX0-3 02/24/2008
> [ 2129.220265] RIP: 0010:create_reloc_root+0x214/0x260 [btrfs]
> [ 2129.258760] RSP: 0018:ffffbe1e809b38b8 EFLAGS: 00010282
> [ 2129.258763] RAX: 00000000ffffffef RBX: ffff988d577f9000 RCX: 0000000000000000
> [ 2129.258765] RDX: 0000000000000001 RSI: ffffffff8e2a2580 RDI: ffff988d64aaa6a8
> [ 2129.258766] RBP: ffff988d5dfcdc00 R08: 0000000000000000 R09: 0000000000000000
> [ 2129.258767] R10: 0000000000000001 R11: 0000000000000000 R12: ffff988d0e02fa78
> [ 2129.258769] R13: 0000000000000005 R14: ffff988d64fe8000 R15: ffff988d0e02fa78
> [ 2129.258771] FS: 00007f82a612e8c0(0000) GS:ffff988d67000000(0000) knlGS:0000000000000000
> [ 2129.258772] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 2129.258774] CR2: 000000000559d028 CR3: 000000020b289000 CR4: 00000000000006e0
> [ 2129.258775] Call Trace:
> [ 2129.258825] btrfs_init_reloc_root+0xe8/0x120 [btrfs]
> [ 2129.258862] record_root_in_trans+0xae/0xd0 [btrfs]
> [ 2129.258901] btrfs_record_root_in_trans+0x51/0x70 [btrfs]
> [ 2129.340388] select_reloc_root+0x94/0x340 [btrfs]
> [ 2129.340433] do_relocation+0xda/0x7b0 [btrfs]
> [ 2129.349854] ? _raw_spin_unlock+0x1f/0x40
> [ 2129.349898] relocate_tree_blocks+0x336/0x670 [btrfs]
> [ 2129.359325] relocate_block_group+0x2f6/0x600 [btrfs]
> [ 2129.359365] btrfs_relocate_block_group+0x15e/0x340 [btrfs]
> [ 2129.359408] btrfs_relocate_chunk+0x38/0x110 [btrfs]
> [ 2129.375494] __btrfs_balance+0x42c/0xce0 [btrfs]
> [ 2129.375553] btrfs_balance+0x66a/0xbe0 [btrfs]
> [ 2129.375562] ? kmem_cache_alloc_trace+0x19c/0x330
> [ 2129.389852] btrfs_ioctl_balance+0x298/0x350 [btrfs]
> [ 2129.389887] btrfs_ioctl+0x304/0x2490 [btrfs]
> [ 2129.389898] ? do_user_addr_fault+0x221/0x49c
> [ 2129.404070] ? sched_clock_cpu+0x15/0x140
> [ 2129.404073] ? do_user_addr_fault+0x221/0x49c
> [ 2129.404079] ? up_read+0x18/0x240
> [ 2129.404086] ? ksys_ioctl+0x68/0xa0
> [ 2129.404091] ksys_ioctl+0x68/0xa0
> [ 2129.423308] __x64_sys_ioctl+0x16/0x20
> [ 2129.423312] do_syscall_64+0x50/0xe0
> [ 2129.423315] entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [ 2129.423318] RIP: 0033:0x7f82a51c6327
> [ 2129.423319] Code: Bad RIP value.
> [ 2129.423348] RSP: 002b:00007ffd32cf6218 EFLAGS: 00000206 ORIG_RAX: 0000000000000010
> [ 2129.423367] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f82a51c6327
> [ 2129.423368] RDX: 00007ffd32cf62a0 RSI: 00000000c4009420 RDI: 0000000000000003
> [ 2129.423372] RBP: 0000000000000003 R08: 0000000000000000 R09: 0000000000000000
> [ 2129.423377] R10: 000000000fa99fa0 R11: 0000000000000206 R12: 00007ffd32cf8823
> [ 2129.423379] R13: 00007ffd32cf62a0 R14: 0000000000000001 R15: 0000000000000000
>
> Relevant code called from create_reloc_root:
>
> ret = btrfs_insert_root(trans, fs_info->tree_root,
> &root_key, root_item);
> BUG_ON(ret)
>
> and according to EAX, ret is -17 which is EEXIST.
>
> I don't have a reproducer, the testing image has been filled by random git
> checkouts, deduplicated by BEES, then tons of snapshots created until the
> metadata got exhausted, some file deletion and balances.
Mine is rsync, bees, lots of snapshots, balances, scrubs. I recently also
added random 'killall -INT btrfs' to send balance some fatal signals.
> This is the same image that led to the patch "btrfs: allow use of global block
> reserve for balance item deletion", so this could have left it in some
> intermediate state where the balance item was not removed and the reloc tree as
> well.
>
> There were a few unsuccessful mounts due to relocation recovery, that was
> trying to debug but then it started to work.
>
> The error happened with this 'fi df' saved after the balance start:
>
> # btrfs fi df mnt
> Data, single: total=80.01GiB, used=38.67GiB
> System, single: total=4.00MiB, used=16.00KiB
> Metadata, single: total=19.99GiB, used=19.46GiB
> GlobalReserve, single: total=512.00MiB, used=44.00KiB
Mine is:
Data, single: total=1.75TiB, used=1.74TiB
System, RAID1: total=32.00MiB, used=208.00KiB
Metadata, RAID1: total=25.00GiB, used=22.89GiB
GlobalReserve, single: total=512.00MiB, used=0.00B
though this is some time after the failure (and a reboot). I do notice
that there's lots of unallocated space, but metadata usage is close
to allocated, and I have been experiencing a lot of EROFS events when
that happens, even if there's gigabytes unallocated.
btrfs fi us:
Overall:
Device size: 2.00TiB
Device allocated: 1.80TiB
Device unallocated: 208.94GiB
Device missing: 0.00B
Used: 1.79TiB
Free (estimated): 211.30GiB (min: 106.83GiB)
Data ratio: 1.00
Metadata ratio: 2.00
Global reserve: 512.00MiB (used: 0.00B)
Data,single: Size:1.75TiB, Used:1.74TiB (99.87%)
/dev/mapper/vgtest-tvdb 894.00GiB
/dev/mapper/vgtest-tvdc 895.00GiB
Metadata,RAID1: Size:25.00GiB, Used:22.87GiB (91.47%)
/dev/mapper/vgtest-tvdb 25.00GiB
/dev/mapper/vgtest-tvdc 25.00GiB
System,RAID1: Size:32.00MiB, Used:208.00KiB (0.63%)
/dev/mapper/vgtest-tvdb 32.00MiB
/dev/mapper/vgtest-tvdc 32.00MiB
Unallocated:
/dev/mapper/vgtest-tvdb 104.97GiB
/dev/mapper/vgtest-tvdc 103.97GiB
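A quick arithmetic check of the figures above, as a sketch only. The helper
names are made up, and the free_min_gib() formula is just an observation that
fits these numbers (unallocated space divided by the RAID1 ratio of 2.00, plus
the free space inside data chunks); I'm not claiming this is how btrfs-progs
computes it.

```c
/* Percentage of an allocation that is in use. */
double used_percent(double used_gib, double size_gib)
{
	return 100.0 * used_gib / size_gib;
}

/* Space still free inside already allocated chunks. */
double headroom_gib(double used_gib, double size_gib)
{
	return size_gib - used_gib;
}

/*
 * Pessimistic free-space figure: free space inside data chunks plus the
 * unallocated space divided by an assumed replication ratio.
 */
double free_min_gib(double data_free_gib, double unallocated_gib, double ratio)
{
	return data_free_gib + unallocated_gib / ratio;
}
```

With Used:22.87GiB of Size:25.00GiB, metadata has roughly 2.13GiB of headroom
inside its chunks. Taking the data free space as 211.30 - 208.94 = 2.36GiB,
free_min_gib(2.36, 208.94, 2.0) comes out at about 106.83GiB, matching the
"min" value shown.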
> The error looks like a repeated relocation tree creation, which would point to
> the unsuccessful balances or inconsistent state (balance item, reloc trees).
> It's not a "typical" mix of operations but I'd appreciate any insights here.
I have the same line but different call stack, with misc-next
e3027d10af42d24940be74dabaf1550cd770bd48:
[ 9717.746937][T13609] BTRFS info (device dm-0): balance: start -mlimit=1 -slimit=1
[ 9717.765086][T13609] BTRFS info (device dm-0): relocating block group 10991411658752 flags metadata|raid1
[ 9718.511137][T13609] ------------[ cut here ]------------
[ 9718.512293][T13609] kernel BUG at fs/btrfs/relocation.c:794!
[ 9718.513421][T13609] invalid opcode: 0000 [#1] SMP KASAN PTI
[ 9718.514590][T13609] CPU: 1 PID: 13609 Comm: btrfs Tainted: G W 5.8.0-6582a95aabfe+ #44
[ 9718.516178][T13609] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
[ 9718.517750][T13609] RIP: 0010:create_reloc_root+0x468/0x480
[ 9718.518717][T13609] Code: e8 bd 5b bd ff 4d 8b 76 50 be 08 00 00 00 49 8d bc 24 f0 00 00 00 e8 c7 5b bd ff 4d 89 b4 24 f0 00 00 00 e9 ee fc ff ff 0f 0b <0f> 0b 0f 0b 0f 0b 0f 0b e8 9b df 07 01 66 66 2e 0f 1f 84 00 00 00
[ 9718.521995][T13609] RSP: 0018:ffffc900018e7018 EFLAGS: 00010282
[ 9718.522991][T13609] RAX: 00000000ffffffef RBX: ffff8881e103a400 RCX: 0000000000000000
[ 9718.524300][T13609] RDX: dffffc0000000000 RSI: 0000000000000000 RDI: 0000000000000246
[ 9718.525612][T13609] RBP: ffffc900018e7108 R08: 0000000000000000 R09: 0000000000000001
[ 9718.527056][T13609] R10: 0000000000000001 R11: fffffbfff3dfb081 R12: ffff8881f37c8020
[ 9718.528386][T13609] R13: ffff88801fbc5b28 R14: ffff8881f37c8000 R15: ffffc900018e70a0
[ 9718.529756][T13609] FS: 00007f9577d928c0(0000) GS:ffff8881f5800000(0000) knlGS:0000000000000000
[ 9718.531211][T13609] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 9718.532295][T13609] CR2: 00007f9823e35500 CR3: 00000000a52e0002 CR4: 00000000001606e0
[ 9718.533608][T13609] Call Trace:
[ 9718.534151][T13609] ? update_backref_node+0xf0/0xf0
[ 9718.535137][T13609] ? check_chain_key+0x1e6/0x2e0
[ 9718.536057][T13609] btrfs_init_reloc_root+0x2d7/0x310
[ 9718.537016][T13609] ? find_reloc_root+0x200/0x200
[ 9718.537992][T13609] ? do_raw_spin_unlock+0xa8/0x140
[ 9718.538899][T13609] record_root_in_trans+0x18c/0x1d0
[ 9718.539848][T13609] btrfs_record_root_in_trans+0x8b/0xc0
[ 9718.540843][T13609] select_reloc_root+0x15f/0x6a0
[ 9718.541943][T13609] ? create_reloc_inode.isra.28+0x410/0x410
[ 9718.543066][T13609] ? rcu_read_lock_sched_held+0xa1/0xd0
[ 9718.544333][T13609] ? check_flags.part.44+0x86/0x220
[ 9718.545186][T13609] ? check_flags+0x26/0x30
[ 9718.545870][T13609] ? lock_is_held_type+0xc9/0x100
[ 9718.546651][T13609] do_relocation+0x242/0xc90
[ 9718.547372][T13609] ? select_reloc_root+0x6a0/0x6a0
[ 9718.548160][T13609] ? check_flags.part.44+0x86/0x220
[ 9718.548969][T13609] ? __kasan_check_read+0x11/0x20
[ 9718.549745][T13609] ? mark_lock+0xa8/0x440
[ 9718.550426][T13609] ? mark_held_locks+0x8d/0xb0
[ 9718.551165][T13609] ? btrfs_backref_cleanup_node+0x5c1/0x600
[ 9718.552079][T13609] ? memcpy+0x4d/0x60
[ 9718.552694][T13609] ? read_extent_buffer+0xcc/0x120
[ 9718.553478][T13609] relocate_tree_blocks+0xa29/0xb00
[ 9718.554255][T13609] ? do_relocation+0xc90/0xc90
[ 9718.554978][T13609] ? kmem_cache_alloc_trace+0x5af/0x740
[ 9718.555855][T13609] ? free_extent_buffer.part.46+0x90/0x140
[ 9718.556756][T13609] ? rb_insert_color+0x342/0x360
[ 9718.557581][T13609] ? free_extent_buffer+0x13/0x20
[ 9718.558445][T13609] ? add_tree_block.isra.34+0x236/0x2b0
[ 9718.559387][T13609] relocate_block_group+0x52e/0x830
[ 9718.560275][T13609] ? merge_reloc_roots+0x4b0/0x4b0
[ 9718.561137][T13609] btrfs_relocate_block_group+0x26e/0x4c0
[ 9718.562137][T13609] btrfs_relocate_chunk+0x52/0x120
[ 9718.562918][T13609] btrfs_balance+0xe22/0x1910
[ 9718.563605][T13609] ? check_chain_key+0x1e6/0x2e0
[ 9718.564331][T13609] ? btrfs_relocate_chunk+0x120/0x120
[ 9718.565126][T13609] ? kmem_cache_alloc_trace+0x5af/0x740
[ 9718.565943][T13609] ? _copy_from_user+0x95/0xd0
[ 9718.566649][T13609] btrfs_ioctl_balance+0x3de/0x4c0
[ 9718.567414][T13609] btrfs_ioctl+0x2385/0x4250
[ 9718.568090][T13609] ? __kasan_check_read+0x11/0x20
[ 9718.568830][T13609] ? check_chain_key+0x1e6/0x2e0
[ 9718.569619][T13609] ? btrfs_ioctl_get_supported_features+0x30/0x30
[ 9718.570658][T13609] ? kvm_sched_clock_read+0x18/0x30
[ 9718.571526][T13609] ? check_chain_key+0x1e6/0x2e0
[ 9718.572348][T13609] ? lock_downgrade+0x3e0/0x3e0
[ 9718.573121][T13609] ? do_vfs_ioctl+0xfc/0x9d0
[ 9718.573835][T13609] ? ioctl_file_clone+0xe0/0xe0
[ 9718.574637][T13609] ? check_flags.part.44+0x86/0x220
[ 9718.575472][T13609] ? check_flags+0x26/0x30
[ 9718.576190][T13609] ? lock_is_held_type+0xc9/0x100
[ 9718.576990][T13609] ? check_flags.part.44+0x86/0x220
[ 9718.577836][T13609] ? check_flags+0x26/0x30
[ 9718.578542][T13609] ? lock_is_held_type+0xc9/0x100
[ 9718.579403][T13609] ? __kasan_check_read+0x11/0x20
[ 9718.580225][T13609] ? __fget_light+0xae/0x110
[ 9718.580983][T13609] ksys_ioctl+0xa1/0xe0
[ 9718.581628][T13609] __x64_sys_ioctl+0x43/0x50
[ 9718.582334][T13609] do_syscall_64+0x60/0xf0
[ 9718.583285][T13609] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 9718.584378][T13609] RIP: 0033:0x7f9577e85427
[ 9718.585289][T13609] Code: Bad RIP value.
[ 9718.586076][T13609] RSP: 002b:00007ffdc7b82548 EFLAGS: 00000206 ORIG_RAX: 0000000000000010
[ 9718.587896][T13609] RAX: ffffffffffffffda RBX: 00007ffdc7b825e8 RCX: 00007f9577e85427
[ 9718.589391][T13609] RDX: 00007ffdc7b825e8 RSI: 00000000c4009420 RDI: 0000000000000003
[ 9718.590817][T13609] RBP: 0000000000000003 R08: 0000000000000003 R09: 0000000000000078
[ 9718.592631][T13609] R10: fffffffffffff31c R11: 0000000000000206 R12: 0000000000000001
[ 9718.594405][T13609] R13: 0000000000000000 R14: 00007ffdc7b84a48 R15: 0000000000000001
[ 9718.596109][T13609] Modules linked in:
[ 9718.597056][T13609] ---[ end trace 2cf173f8217fc093 ]---
[ 9718.598018][T13609] RIP: 0010:create_reloc_root+0x468/0x480
[ 9718.602850][T13609] Code: e8 bd 5b bd ff 4d 8b 76 50 be 08 00 00 00 49 8d bc 24 f0 00 00 00 e8 c7 5b bd ff 4d 89 b4 24 f0 00 00 00 e9 ee fc ff ff 0f 0b <0f> 0b 0f 0b 0f 0b 0f 0b e8 9b df 07 01 66 66 2e 0f 1f 84 00 00 00
[ 9718.613371][T13609] RSP: 0018:ffffc900018e7018 EFLAGS: 00010282
[ 9718.621286][T13609] RAX: 00000000ffffffef RBX: ffff8881e103a400 RCX: 0000000000000000
[ 9718.631255][T13609] RDX: dffffc0000000000 RSI: 0000000000000000 RDI: 0000000000000246
[ 9718.639764][T13609] RBP: ffffc900018e7108 R08: 0000000000000000 R09: 0000000000000001
[ 9718.641533][T13609] R10: 0000000000000001 R11: fffffbfff3dfb081 R12: ffff8881f37c8020
[ 9718.643173][T13609] R13: ffff88801fbc5b28 R14: ffff8881f37c8000 R15: ffffc900018e70a0
[ 9718.644840][T13609] FS: 00007f9577d928c0(0000) GS:ffff8881f5800000(0000) knlGS:0000000000000000
[ 9718.646728][T13609] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 9718.648607][T13609] CR2: 00007f9823e35500 CR3: 00000000a52e0002 CR4: 00000000001606e0
[ 9718.869689][ T4545] ==================================================================
same line, different call stack:
0xffffffff81933dd8 is in create_reloc_root (fs/btrfs/relocation.c:794).
789 btrfs_tree_unlock(eb);
790 free_extent_buffer(eb);
791
792 ret = btrfs_insert_root(trans, fs_info->tree_root,
793 &root_key, root_item);
794 BUG_ON(ret);
795 kfree(root_item);
796
797 reloc_root = btrfs_read_tree_root(fs_info->tree_root, &root_key);
798 BUG_ON(IS_ERR(reloc_root));
followed by
[ 9718.869689][ T4545] ==================================================================
[ 9718.871333][ T4545] BUG: KASAN: use-after-free in __mutex_lock+0x202/0xce0
[ 9718.872483][ T4545] Read of size 4 at addr ffff888014e9402c by task crawl_28443/4545
[ 9718.873746][ T4545]
[ 9718.874106][ T4545] CPU: 1 PID: 4545 Comm: crawl_28443 Tainted: G D W 5.8.0-6582a95aabfe+ #44
[ 9718.875684][ T4545] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
[ 9718.877149][ T4545] Call Trace:
[ 9718.877655][ T4545] dump_stack+0xc8/0x11a
[ 9718.878317][ T4545] ? __mutex_lock+0x202/0xce0
[ 9718.879065][ T4545] print_address_description.constprop.8+0x1f/0x200
[ 9718.880167][ T4545] ? __mutex_lock+0x202/0xce0
[ 9718.880916][ T4545] ? __mutex_lock+0x202/0xce0
[ 9718.881666][ T4545] kasan_report.cold.11+0x20/0x3e
[ 9718.882483][ T4545] ? __mutex_lock+0x202/0xce0
[ 9718.883229][ T4545] __asan_load4+0x69/0x90
[ 9718.883920][ T4545] __mutex_lock+0x202/0xce0
[ 9718.884651][ T4545] ? wait_current_trans+0xb7/0x230
[ 9718.885465][ T4545] ? btrfs_record_root_in_trans+0x7e/0xc0
[ 9718.886388][ T4545] ? mutex_lock_io_nested+0xc20/0xc20
[ 9718.887246][ T4545] ? __kasan_check_read+0x11/0x20
[ 9718.888035][ T4545] ? join_transaction+0x32/0x6f0
[ 9718.888854][ T4545] ? join_transaction+0x1a6/0x6f0
[ 9718.889679][ T4545] ? lock_downgrade+0x3e0/0x3e0
[ 9718.890496][ T4545] ? __kasan_check_write+0x14/0x20
[ 9718.891308][ T4545] ? lock_contended+0x720/0x720
[ 9718.892093][ T4545] ? do_raw_spin_lock+0x1e0/0x1e0
[ 9718.892912][ T4545] ? wait_current_trans+0xb7/0x230
[ 9718.893705][ T4545] mutex_lock_nested+0x1b/0x20
[ 9718.894494][ T4545] ? mutex_lock_nested+0x1b/0x20
[ 9718.895317][ T4545] btrfs_record_root_in_trans+0x7e/0xc0
[ 9718.896245][ T4545] start_transaction+0x189/0x8f0
[ 9718.897081][ T4545] btrfs_start_transaction+0x1e/0x20
[ 9718.897941][ T4545] btrfs_cont_expand+0x549/0x7a0
[ 9718.898805][ T4545] ? btrfs_truncate_block+0x930/0x930
[ 9718.899665][ T4545] ? inode_newsize_ok+0x75/0xc0
[ 9718.900438][ T4545] ? setattr_prepare+0x9c/0x310
[ 9718.901242][ T4545] btrfs_setattr+0x514/0x850
[ 9718.902035][ T4545] ? current_time+0x8c/0xe0
[ 9718.902799][ T4545] notify_change+0x4ec/0x700
[ 9718.903584][ T4545] ? do_sys_ftruncate+0x108/0x220
[ 9718.904459][ T4545] do_truncate+0xe4/0x160
[ 9718.905200][ T4545] ? __x64_sys_openat2+0x170/0x170
[ 9718.906116][ T4545] ? __sb_start_write+0x1a1/0x270
[ 9718.906954][ T4545] do_sys_ftruncate+0x1b8/0x220
[ 9718.907759][ T4545] __x64_sys_ftruncate+0x36/0x40
[ 9718.908577][ T4545] do_syscall_64+0x60/0xf0
[ 9718.909292][ T4545] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 9718.910521][ T4545] RIP: 0033:0x7f201fcab947
[ 9718.911247][ T4545] Code: Bad RIP value.
[ 9718.911915][ T4545] RSP: 002b:00007f201d3abeb8 EFLAGS: 00000202 ORIG_RAX: 000000000000004d
[ 9718.913285][ T4545] RAX: ffffffffffffffda RBX: 00007f201d3abfa0 RCX: 00007f201fcab947
[ 9718.914613][ T4545] RDX: 000000005f18a6d2 RSI: 0000000000286000 RDI: 0000000000000ec1
[ 9718.915921][ T4545] RBP: 00007f1fb01c2f00 R08: 00007ffe1e345080 R09: 00000000011b1f78
[ 9718.917236][ T4545] R10: 00000000011b1f78 R11: 0000000000000202 R12: 00007f201d3abf20
[ 9718.918556][ T4545] R13: 00007f201d3abef0 R14: 00007f201d3abf50 R15: 00007f201d3abed0
[ 9718.919882][ T4545]
[ 9718.920268][ T4545] Allocated by task 6732:
[ 9718.920973][ T4545] save_stack+0x21/0x50
[ 9718.921648][ T4545] __kasan_kmalloc.constprop.17+0xc1/0xd0
[ 9718.922580][ T4545] kasan_slab_alloc+0x12/0x20
[ 9718.923345][ T4545] kmem_cache_alloc_node+0x113/0x720
[ 9718.924203][ T4545] copy_process+0x357/0x3680
[ 9718.924955][ T4545] _do_fork+0xed/0x880
[ 9718.925622][ T4545] __do_sys_clone+0xee/0x130
[ 9718.926369][ T4545] __x64_sys_clone+0x67/0x80
[ 9718.927119][ T4545] do_syscall_64+0x60/0xf0
[ 9718.927848][ T4545] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 9718.928812][ T4545]
[ 9718.929173][ T4545] Freed by task 24:
[ 9718.929787][ T4545] save_stack+0x21/0x50
[ 9718.930453][ T4545] __kasan_slab_free+0x118/0x170
[ 9718.931242][ T4545] kasan_slab_free+0xe/0x10
[ 9718.931970][ T4545] kmem_cache_free+0x5f/0x280
[ 9718.932730][ T4545] free_task+0x73/0x90
[ 9718.933391][ T4545] __put_task_struct+0x199/0x1d0
[ 9718.934187][ T4545] delayed_put_task_struct+0x124/0x1b0
[ 9718.935071][ T4545] rcu_core+0x3b0/0xeb0
[ 9718.935758][ T4545] rcu_core_si+0xe/0x10
[ 9718.936433][ T4545] __do_softirq+0x120/0x5e3
[ 9718.937165][ T4545]
[ 9718.937545][ T4545] The buggy address belongs to the object at ffff888014e94000
[ 9718.937545][ T4545] which belongs to the cache task_struct(168:screen-wrapper.service) of size 11072
[ 9718.940391][ T4545] The buggy address is located 44 bytes inside of
[ 9718.940391][ T4545] 11072-byte region [ffff888014e94000, ffff888014e96b40)
[ 9718.942559][ T4545] The buggy address belongs to the page:
[ 9718.943454][ T4545] page:ffffea000053a500 refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff888014e97fff head:ffffea000053a500 order:2 compound_mapcount:0 compound_pincount:0
[ 9718.946072][ T4545] flags: 0xfffe0000010200(slab|head)
[ 9718.946958][ T4545] raw: 00fffe0000010200 ffffea00011ab108 ffffea0001d6f108 ffff8881eabd9700
[ 9718.948406][ T4545] raw: ffff888014e97fff ffff888014e94000 0000000100000001 0000000000000000
[ 9718.949889][ T4545] page dumped because: kasan: bad access detected
[ 9718.950977][ T4545]
[ 9718.951354][ T4545] Memory state around the buggy address:
[ 9718.952296][ T4545] ffff888014e93f00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 9718.953641][ T4545] ffff888014e93f80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 9718.955004][ T4545] >ffff888014e94000: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 9718.956366][ T4545] ^
[ 9718.957258][ T4545] ffff888014e94080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 9718.958653][ T4545] ffff888014e94100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 9718.960034][ T4545] ==================================================================
* Re: BUG at fs/btrfs/relocation.c:794!
2020-07-23 21:56 ` Zygo Blaxell
@ 2020-07-24 0:19 ` Qu Wenruo
2020-08-04 16:16 ` Zygo Blaxell
0 siblings, 1 reply; 13+ messages in thread
From: Qu Wenruo @ 2020-07-24 0:19 UTC
To: Zygo Blaxell, David Sterba; +Cc: linux-btrfs, wqu
On 2020/7/24 5:56 AM, Zygo Blaxell wrote:
> On Wed, Jul 01, 2020 at 12:10:06AM +0200, David Sterba wrote:
>> Hi,
>>
>> I've hit a crash in relocation I've never seen before.
>>
>> [ 2129.210066] kernel BUG at fs/btrfs/relocation.c:794!
>
> I hit an issue yesterday that reminded me of this.
>
>> [ 2129.215268] invalid opcode: 0000 [#1] PREEMPT SMP
>> [ 2129.220114] CPU: 1 PID: 3303 Comm: btrfs Not tainted 5.8.0-rc3-git+ #638
>> [ 2129.220116] Hardware name: empty empty/S3993, BIOS PAQEX0-3 02/24/2008
>> [ 2129.220265] RIP: 0010:create_reloc_root+0x214/0x260 [btrfs]
>> [ 2129.258760] RSP: 0018:ffffbe1e809b38b8 EFLAGS: 00010282
>> [ 2129.258763] RAX: 00000000ffffffef RBX: ffff988d577f9000 RCX: 0000000000000000
>> [ 2129.258765] RDX: 0000000000000001 RSI: ffffffff8e2a2580 RDI: ffff988d64aaa6a8
>> [ 2129.258766] RBP: ffff988d5dfcdc00 R08: 0000000000000000 R09: 0000000000000000
>> [ 2129.258767] R10: 0000000000000001 R11: 0000000000000000 R12: ffff988d0e02fa78
>> [ 2129.258769] R13: 0000000000000005 R14: ffff988d64fe8000 R15: ffff988d0e02fa78
>> [ 2129.258771] FS: 00007f82a612e8c0(0000) GS:ffff988d67000000(0000) knlGS:0000000000000000
>> [ 2129.258772] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [ 2129.258774] CR2: 000000000559d028 CR3: 000000020b289000 CR4: 00000000000006e0
>> [ 2129.258775] Call Trace:
>> [ 2129.258825] btrfs_init_reloc_root+0xe8/0x120 [btrfs]
>> [ 2129.258862] record_root_in_trans+0xae/0xd0 [btrfs]
>> [ 2129.258901] btrfs_record_root_in_trans+0x51/0x70 [btrfs]
>> [ 2129.340388] select_reloc_root+0x94/0x340 [btrfs]
>> [ 2129.340433] do_relocation+0xda/0x7b0 [btrfs]
>> [ 2129.349854] ? _raw_spin_unlock+0x1f/0x40
>> [ 2129.349898] relocate_tree_blocks+0x336/0x670 [btrfs]
>> [ 2129.359325] relocate_block_group+0x2f6/0x600 [btrfs]
>> [ 2129.359365] btrfs_relocate_block_group+0x15e/0x340 [btrfs]
>> [ 2129.359408] btrfs_relocate_chunk+0x38/0x110 [btrfs]
>> [ 2129.375494] __btrfs_balance+0x42c/0xce0 [btrfs]
>> [ 2129.375553] btrfs_balance+0x66a/0xbe0 [btrfs]
>> [ 2129.375562] ? kmem_cache_alloc_trace+0x19c/0x330
>> [ 2129.389852] btrfs_ioctl_balance+0x298/0x350 [btrfs]
>> [ 2129.389887] btrfs_ioctl+0x304/0x2490 [btrfs]
>> [ 2129.389898] ? do_user_addr_fault+0x221/0x49c
>> [ 2129.404070] ? sched_clock_cpu+0x15/0x140
>> [ 2129.404073] ? do_user_addr_fault+0x221/0x49c
>> [ 2129.404079] ? up_read+0x18/0x240
>> [ 2129.404086] ? ksys_ioctl+0x68/0xa0
>> [ 2129.404091] ksys_ioctl+0x68/0xa0
>> [ 2129.423308] __x64_sys_ioctl+0x16/0x20
>> [ 2129.423312] do_syscall_64+0x50/0xe0
>> [ 2129.423315] entry_SYSCALL_64_after_hwframe+0x44/0xa9
>> [ 2129.423318] RIP: 0033:0x7f82a51c6327
>> [ 2129.423319] Code: Bad RIP value.
>> [ 2129.423348] RSP: 002b:00007ffd32cf6218 EFLAGS: 00000206 ORIG_RAX: 0000000000000010
>> [ 2129.423367] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f82a51c6327
>> [ 2129.423368] RDX: 00007ffd32cf62a0 RSI: 00000000c4009420 RDI: 0000000000000003
>> [ 2129.423372] RBP: 0000000000000003 R08: 0000000000000000 R09: 0000000000000000
>> [ 2129.423377] R10: 000000000fa99fa0 R11: 0000000000000206 R12: 00007ffd32cf8823
>> [ 2129.423379] R13: 00007ffd32cf62a0 R14: 0000000000000001 R15: 0000000000000000
>>
>> Relevant code called from create_reloc_root:
>>
>> ret = btrfs_insert_root(trans, fs_info->tree_root,
>> &root_key, root_item);
>> BUG_ON(ret)
>>
>> and according to EAX, ret is -17 which is EEXIST.
>>
>> I don't have a reproducer, the testing image has been filled by random git
>> checkouts, deduplicated by BEES, then tons of snapshots created until the
>> metadata got exhausted, some file deletion and balances.
>
> Mine is rsync, bees, lots of snapshots, balances, scrubs. I recently also
> added random 'killall -INT btrfs' to send balance some fatal signals.
>
>> This is the same image that led to the patch "btrfs: allow use of global block
>> reserve for balance item deletion", so this could have left it in some
>> intermediate state where the balance item was not removed and the reloc tree as
>> well.
>>
>> There were a few unsuccessful mounts due to relocation recovery, that was
>> trying to debug but then it started to work.
>>
>> The error happened with this 'fi df' saved after the balance start:
>>
>> # btrfs fi df mnt
>> Data, single: total=80.01GiB, used=38.67GiB
>> System, single: total=4.00MiB, used=16.00KiB
>> Metadata, single: total=19.99GiB, used=19.46GiB
>> GlobalReserve, single: total=512.00MiB, used=44.00KiB
>
> Mine is:
>
> Data, single: total=1.75TiB, used=1.74TiB
> System, RAID1: total=32.00MiB, used=208.00KiB
> Metadata, RAID1: total=25.00GiB, used=22.89GiB
> GlobalReserve, single: total=512.00MiB, used=0.00B
>
> though this is some time after the failure (and a reboot). I do notice
> that there's lots of unallocated space, but metadata usage is close
> to allocated, and I have been experiencing a lot of EROFS events when
> that happens, even if there's gigabytes unallocated.
>
> btrfs fi us:
>
> Overall:
> Device size: 2.00TiB
> Device allocated: 1.80TiB
> Device unallocated: 208.94GiB
> Device missing: 0.00B
> Used: 1.79TiB
> Free (estimated): 211.30GiB (min: 106.83GiB)
> Data ratio: 1.00
> Metadata ratio: 2.00
> Global reserve: 512.00MiB (used: 0.00B)
>
> Data,single: Size:1.75TiB, Used:1.74TiB (99.87%)
> /dev/mapper/vgtest-tvdb 894.00GiB
> /dev/mapper/vgtest-tvdc 895.00GiB
>
> Metadata,RAID1: Size:25.00GiB, Used:22.87GiB (91.47%)
> /dev/mapper/vgtest-tvdb 25.00GiB
> /dev/mapper/vgtest-tvdc 25.00GiB
>
> System,RAID1: Size:32.00MiB, Used:208.00KiB (0.63%)
> /dev/mapper/vgtest-tvdb 32.00MiB
> /dev/mapper/vgtest-tvdc 32.00MiB
>
> Unallocated:
> /dev/mapper/vgtest-tvdb 104.97GiB
> /dev/mapper/vgtest-tvdc 103.97GiB
>
>> The error looks like a repeated relocation tree creation, which would point to
>> the unsuccessful balances or inconsistent state (balance item, reloc trees).
>> It's not a "typical" mix of operations but I'd appreciate any insights here.
>
> I have the same line but different call stack, with misc-next
> e3027d10af42d24940be74dabaf1550cd770bd48:
>
> [ 9717.746937][T13609] BTRFS info (device dm-0): balance: start -mlimit=1 -slimit=1
> [ 9717.765086][T13609] BTRFS info (device dm-0): relocating block group 10991411658752 flags metadata|raid1
> [ 9718.511137][T13609] ------------[ cut here ]------------
> [ 9718.512293][T13609] kernel BUG at fs/btrfs/relocation.c:794!
> [ 9718.513421][T13609] invalid opcode: 0000 [#1] SMP KASAN PTI
> [ 9718.514590][T13609] CPU: 1 PID: 13609 Comm: btrfs Tainted: G W 5.8.0-6582a95aabfe+ #44
> [ 9718.516178][T13609] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
> [ 9718.517750][T13609] RIP: 0010:create_reloc_root+0x468/0x480
> [ 9718.518717][T13609] Code: e8 bd 5b bd ff 4d 8b 76 50 be 08 00 00 00 49 8d bc 24 f0 00 00 00 e8 c7 5b bd ff 4d 89 b4 24 f0 00 00 00 e9 ee fc ff ff 0f 0b <0f> 0b 0f 0b 0f 0b 0f 0b e8 9b df 07 01 66 66 2e 0f 1f 84 00 00 00
> [ 9718.521995][T13609] RSP: 0018:ffffc900018e7018 EFLAGS: 00010282
> [ 9718.522991][T13609] RAX: 00000000ffffffef RBX: ffff8881e103a400 RCX: 0000000000000000
> [ 9718.524300][T13609] RDX: dffffc0000000000 RSI: 0000000000000000 RDI: 0000000000000246
> [ 9718.525612][T13609] RBP: ffffc900018e7108 R08: 0000000000000000 R09: 0000000000000001
> [ 9718.527056][T13609] R10: 0000000000000001 R11: fffffbfff3dfb081 R12: ffff8881f37c8020
> [ 9718.528386][T13609] R13: ffff88801fbc5b28 R14: ffff8881f37c8000 R15: ffffc900018e70a0
> [ 9718.529756][T13609] FS: 00007f9577d928c0(0000) GS:ffff8881f5800000(0000) knlGS:0000000000000000
> [ 9718.531211][T13609] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 9718.532295][T13609] CR2: 00007f9823e35500 CR3: 00000000a52e0002 CR4: 00000000001606e0
> [ 9718.533608][T13609] Call Trace:
> [ 9718.534151][T13609] ? update_backref_node+0xf0/0xf0
> [ 9718.535137][T13609] ? check_chain_key+0x1e6/0x2e0
> [ 9718.536057][T13609] btrfs_init_reloc_root+0x2d7/0x310
That's the same problem.
btrfs_init_reloc_root() got -EEXIST and triggered the BUG_ON().
In that case, it means some reloc trees were not cleaned up.
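For reference, the -EEXIST reading can be checked against RAX in the register dump above (00000000ffffffef): the low 32 bits, taken as a signed int, are -17. A minimal userspace sketch of that arithmetic follows; the helper names are hypothetical and only illustrate the decoding, they are not kernel functions.

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>

/*
 * Interpret the low 32 bits of a 64-bit register value as a signed int,
 * the way an "int ret" return value sits in EAX on x86-64.
 * (Hypothetical helper, for illustration only.)
 */
int reg_low32_as_int(uint64_t reg)
{
    uint32_t low = (uint32_t)reg;
    int32_t value;

    /* memcpy avoids relying on implementation-defined integer conversion. */
    memcpy(&value, &low, sizeof(value));
    return value;
}

/* Return 1 if the register holds -EEXIST in its low 32 bits, else 0. */
int reg_is_eexist(uint64_t reg)
{
    return reg_low32_as_int(reg) == -EEXIST;
}
```

Passing 0x00000000ffffffef to reg_low32_as_int() gives -17, which is -EEXIST on Linux.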
Would you mind providing the "btrfs ins dump-tree -t root" dump for
that fs if the problem still happens?
Thanks,
Qu
> [ 9718.537016][T13609] ? find_reloc_root+0x200/0x200
> [ 9718.537992][T13609] ? do_raw_spin_unlock+0xa8/0x140
> [ 9718.538899][T13609] record_root_in_trans+0x18c/0x1d0
> [ 9718.539848][T13609] btrfs_record_root_in_trans+0x8b/0xc0
> [ 9718.540843][T13609] select_reloc_root+0x15f/0x6a0
> [ 9718.541943][T13609] ? create_reloc_inode.isra.28+0x410/0x410
> [ 9718.543066][T13609] ? rcu_read_lock_sched_held+0xa1/0xd0
> [ 9718.544333][T13609] ? check_flags.part.44+0x86/0x220
> [ 9718.545186][T13609] ? check_flags+0x26/0x30
> [ 9718.545870][T13609] ? lock_is_held_type+0xc9/0x100
> [ 9718.546651][T13609] do_relocation+0x242/0xc90
> [ 9718.547372][T13609] ? select_reloc_root+0x6a0/0x6a0
> [ 9718.548160][T13609] ? check_flags.part.44+0x86/0x220
> [ 9718.548969][T13609] ? __kasan_check_read+0x11/0x20
> [ 9718.549745][T13609] ? mark_lock+0xa8/0x440
> [ 9718.550426][T13609] ? mark_held_locks+0x8d/0xb0
> [ 9718.551165][T13609] ? btrfs_backref_cleanup_node+0x5c1/0x600
> [ 9718.552079][T13609] ? memcpy+0x4d/0x60
> [ 9718.552694][T13609] ? read_extent_buffer+0xcc/0x120
> [ 9718.553478][T13609] relocate_tree_blocks+0xa29/0xb00
> [ 9718.554255][T13609] ? do_relocation+0xc90/0xc90
> [ 9718.554978][T13609] ? kmem_cache_alloc_trace+0x5af/0x740
> [ 9718.555855][T13609] ? free_extent_buffer.part.46+0x90/0x140
> [ 9718.556756][T13609] ? rb_insert_color+0x342/0x360
> [ 9718.557581][T13609] ? free_extent_buffer+0x13/0x20
> [ 9718.558445][T13609] ? add_tree_block.isra.34+0x236/0x2b0
> [ 9718.559387][T13609] relocate_block_group+0x52e/0x830
> [ 9718.560275][T13609] ? merge_reloc_roots+0x4b0/0x4b0
> [ 9718.561137][T13609] btrfs_relocate_block_group+0x26e/0x4c0
> [ 9718.562137][T13609] btrfs_relocate_chunk+0x52/0x120
> [ 9718.562918][T13609] btrfs_balance+0xe22/0x1910
> [ 9718.563605][T13609] ? check_chain_key+0x1e6/0x2e0
> [ 9718.564331][T13609] ? btrfs_relocate_chunk+0x120/0x120
> [ 9718.565126][T13609] ? kmem_cache_alloc_trace+0x5af/0x740
> [ 9718.565943][T13609] ? _copy_from_user+0x95/0xd0
> [ 9718.566649][T13609] btrfs_ioctl_balance+0x3de/0x4c0
> [ 9718.567414][T13609] btrfs_ioctl+0x2385/0x4250
> [ 9718.568090][T13609] ? __kasan_check_read+0x11/0x20
> [ 9718.568830][T13609] ? check_chain_key+0x1e6/0x2e0
> [ 9718.569619][T13609] ? btrfs_ioctl_get_supported_features+0x30/0x30
> [ 9718.570658][T13609] ? kvm_sched_clock_read+0x18/0x30
> [ 9718.571526][T13609] ? check_chain_key+0x1e6/0x2e0
> [ 9718.572348][T13609] ? lock_downgrade+0x3e0/0x3e0
> [ 9718.573121][T13609] ? do_vfs_ioctl+0xfc/0x9d0
> [ 9718.573835][T13609] ? ioctl_file_clone+0xe0/0xe0
> [ 9718.574637][T13609] ? check_flags.part.44+0x86/0x220
> [ 9718.575472][T13609] ? check_flags+0x26/0x30
> [ 9718.576190][T13609] ? lock_is_held_type+0xc9/0x100
> [ 9718.576990][T13609] ? check_flags.part.44+0x86/0x220
> [ 9718.577836][T13609] ? check_flags+0x26/0x30
> [ 9718.578542][T13609] ? lock_is_held_type+0xc9/0x100
> [ 9718.579403][T13609] ? __kasan_check_read+0x11/0x20
> [ 9718.580225][T13609] ? __fget_light+0xae/0x110
> [ 9718.580983][T13609] ksys_ioctl+0xa1/0xe0
> [ 9718.581628][T13609] __x64_sys_ioctl+0x43/0x50
> [ 9718.582334][T13609] do_syscall_64+0x60/0xf0
> [ 9718.583285][T13609] entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [ 9718.584378][T13609] RIP: 0033:0x7f9577e85427
> [ 9718.585289][T13609] Code: Bad RIP value.
> [ 9718.586076][T13609] RSP: 002b:00007ffdc7b82548 EFLAGS: 00000206 ORIG_RAX: 0000000000000010
> [ 9718.587896][T13609] RAX: ffffffffffffffda RBX: 00007ffdc7b825e8 RCX: 00007f9577e85427
> [ 9718.589391][T13609] RDX: 00007ffdc7b825e8 RSI: 00000000c4009420 RDI: 0000000000000003
> [ 9718.590817][T13609] RBP: 0000000000000003 R08: 0000000000000003 R09: 0000000000000078
> [ 9718.592631][T13609] R10: fffffffffffff31c R11: 0000000000000206 R12: 0000000000000001
> [ 9718.594405][T13609] R13: 0000000000000000 R14: 00007ffdc7b84a48 R15: 0000000000000001
> [ 9718.596109][T13609] Modules linked in:
> [ 9718.597056][T13609] ---[ end trace 2cf173f8217fc093 ]---
> [ 9718.598018][T13609] RIP: 0010:create_reloc_root+0x468/0x480
> [ 9718.602850][T13609] Code: e8 bd 5b bd ff 4d 8b 76 50 be 08 00 00 00 49 8d bc 24 f0 00 00 00 e8 c7 5b bd ff 4d 89 b4 24 f0 00 00 00 e9 ee fc ff ff 0f 0b <0f> 0b 0f 0b 0f 0b 0f 0b e8 9b df 07 01 66 66 2e 0f 1f 84 00 00 00
> [ 9718.613371][T13609] RSP: 0018:ffffc900018e7018 EFLAGS: 00010282
> [ 9718.621286][T13609] RAX: 00000000ffffffef RBX: ffff8881e103a400 RCX: 0000000000000000
> [ 9718.631255][T13609] RDX: dffffc0000000000 RSI: 0000000000000000 RDI: 0000000000000246
> [ 9718.639764][T13609] RBP: ffffc900018e7108 R08: 0000000000000000 R09: 0000000000000001
> [ 9718.641533][T13609] R10: 0000000000000001 R11: fffffbfff3dfb081 R12: ffff8881f37c8020
> [ 9718.643173][T13609] R13: ffff88801fbc5b28 R14: ffff8881f37c8000 R15: ffffc900018e70a0
> [ 9718.644840][T13609] FS: 00007f9577d928c0(0000) GS:ffff8881f5800000(0000) knlGS:0000000000000000
> [ 9718.646728][T13609] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 9718.648607][T13609] CR2: 00007f9823e35500 CR3: 00000000a52e0002 CR4: 00000000001606e0
> [ 9718.869689][ T4545] ==================================================================
>
> same line, different call stack:
>
> 0xffffffff81933dd8 is in create_reloc_root (fs/btrfs/relocation.c:794).
> 789 btrfs_tree_unlock(eb);
> 790 free_extent_buffer(eb);
> 791
> 792 ret = btrfs_insert_root(trans, fs_info->tree_root,
> 793 &root_key, root_item);
> 794 BUG_ON(ret);
> 795 kfree(root_item);
> 796
> 797 reloc_root = btrfs_read_tree_root(fs_info->tree_root, &root_key);
> 798 BUG_ON(IS_ERR(reloc_root));
>
> followed by
>
> [ 9718.869689][ T4545] ==================================================================
> [ 9718.871333][ T4545] BUG: KASAN: use-after-free in __mutex_lock+0x202/0xce0
> [ 9718.872483][ T4545] Read of size 4 at addr ffff888014e9402c by task crawl_28443/4545
> [ 9718.873746][ T4545]
> [ 9718.874106][ T4545] CPU: 1 PID: 4545 Comm: crawl_28443 Tainted: G D W 5.8.0-6582a95aabfe+ #44
> [ 9718.875684][ T4545] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
> [ 9718.877149][ T4545] Call Trace:
> [ 9718.877655][ T4545] dump_stack+0xc8/0x11a
> [ 9718.878317][ T4545] ? __mutex_lock+0x202/0xce0
> [ 9718.879065][ T4545] print_address_description.constprop.8+0x1f/0x200
> [ 9718.880167][ T4545] ? __mutex_lock+0x202/0xce0
> [ 9718.880916][ T4545] ? __mutex_lock+0x202/0xce0
> [ 9718.881666][ T4545] kasan_report.cold.11+0x20/0x3e
> [ 9718.882483][ T4545] ? __mutex_lock+0x202/0xce0
> [ 9718.883229][ T4545] __asan_load4+0x69/0x90
> [ 9718.883920][ T4545] __mutex_lock+0x202/0xce0
> [ 9718.884651][ T4545] ? wait_current_trans+0xb7/0x230
> [ 9718.885465][ T4545] ? btrfs_record_root_in_trans+0x7e/0xc0
> [ 9718.886388][ T4545] ? mutex_lock_io_nested+0xc20/0xc20
> [ 9718.887246][ T4545] ? __kasan_check_read+0x11/0x20
> [ 9718.888035][ T4545] ? join_transaction+0x32/0x6f0
> [ 9718.888854][ T4545] ? join_transaction+0x1a6/0x6f0
> [ 9718.889679][ T4545] ? lock_downgrade+0x3e0/0x3e0
> [ 9718.890496][ T4545] ? __kasan_check_write+0x14/0x20
> [ 9718.891308][ T4545] ? lock_contended+0x720/0x720
> [ 9718.892093][ T4545] ? do_raw_spin_lock+0x1e0/0x1e0
> [ 9718.892912][ T4545] ? wait_current_trans+0xb7/0x230
> [ 9718.893705][ T4545] mutex_lock_nested+0x1b/0x20
> [ 9718.894494][ T4545] ? mutex_lock_nested+0x1b/0x20
> [ 9718.895317][ T4545] btrfs_record_root_in_trans+0x7e/0xc0
> [ 9718.896245][ T4545] start_transaction+0x189/0x8f0
> [ 9718.897081][ T4545] btrfs_start_transaction+0x1e/0x20
> [ 9718.897941][ T4545] btrfs_cont_expand+0x549/0x7a0
> [ 9718.898805][ T4545] ? btrfs_truncate_block+0x930/0x930
> [ 9718.899665][ T4545] ? inode_newsize_ok+0x75/0xc0
> [ 9718.900438][ T4545] ? setattr_prepare+0x9c/0x310
> [ 9718.901242][ T4545] btrfs_setattr+0x514/0x850
> [ 9718.902035][ T4545] ? current_time+0x8c/0xe0
> [ 9718.902799][ T4545] notify_change+0x4ec/0x700
> [ 9718.903584][ T4545] ? do_sys_ftruncate+0x108/0x220
> [ 9718.904459][ T4545] do_truncate+0xe4/0x160
> [ 9718.905200][ T4545] ? __x64_sys_openat2+0x170/0x170
> [ 9718.906116][ T4545] ? __sb_start_write+0x1a1/0x270
> [ 9718.906954][ T4545] do_sys_ftruncate+0x1b8/0x220
> [ 9718.907759][ T4545] __x64_sys_ftruncate+0x36/0x40
> [ 9718.908577][ T4545] do_syscall_64+0x60/0xf0
> [ 9718.909292][ T4545] entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [ 9718.910521][ T4545] RIP: 0033:0x7f201fcab947
> [ 9718.911247][ T4545] Code: Bad RIP value.
> [ 9718.911915][ T4545] RSP: 002b:00007f201d3abeb8 EFLAGS: 00000202 ORIG_RAX: 000000000000004d
> [ 9718.913285][ T4545] RAX: ffffffffffffffda RBX: 00007f201d3abfa0 RCX: 00007f201fcab947
> [ 9718.914613][ T4545] RDX: 000000005f18a6d2 RSI: 0000000000286000 RDI: 0000000000000ec1
> [ 9718.915921][ T4545] RBP: 00007f1fb01c2f00 R08: 00007ffe1e345080 R09: 00000000011b1f78
> [ 9718.917236][ T4545] R10: 00000000011b1f78 R11: 0000000000000202 R12: 00007f201d3abf20
> [ 9718.918556][ T4545] R13: 00007f201d3abef0 R14: 00007f201d3abf50 R15: 00007f201d3abed0
> [ 9718.919882][ T4545]
> [ 9718.920268][ T4545] Allocated by task 6732:
> [ 9718.920973][ T4545] save_stack+0x21/0x50
> [ 9718.921648][ T4545] __kasan_kmalloc.constprop.17+0xc1/0xd0
> [ 9718.922580][ T4545] kasan_slab_alloc+0x12/0x20
> [ 9718.923345][ T4545] kmem_cache_alloc_node+0x113/0x720
> [ 9718.924203][ T4545] copy_process+0x357/0x3680
> [ 9718.924955][ T4545] _do_fork+0xed/0x880
> [ 9718.925622][ T4545] __do_sys_clone+0xee/0x130
> [ 9718.926369][ T4545] __x64_sys_clone+0x67/0x80
> [ 9718.927119][ T4545] do_syscall_64+0x60/0xf0
> [ 9718.927848][ T4545] entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [ 9718.928812][ T4545]
> [ 9718.929173][ T4545] Freed by task 24:
> [ 9718.929787][ T4545] save_stack+0x21/0x50
> [ 9718.930453][ T4545] __kasan_slab_free+0x118/0x170
> [ 9718.931242][ T4545] kasan_slab_free+0xe/0x10
> [ 9718.931970][ T4545] kmem_cache_free+0x5f/0x280
> [ 9718.932730][ T4545] free_task+0x73/0x90
> [ 9718.933391][ T4545] __put_task_struct+0x199/0x1d0
> [ 9718.934187][ T4545] delayed_put_task_struct+0x124/0x1b0
> [ 9718.935071][ T4545] rcu_core+0x3b0/0xeb0
> [ 9718.935758][ T4545] rcu_core_si+0xe/0x10
> [ 9718.936433][ T4545] __do_softirq+0x120/0x5e3
> [ 9718.937165][ T4545]
> [ 9718.937545][ T4545] The buggy address belongs to the object at ffff888014e94000
> [ 9718.937545][ T4545] which belongs to the cache task_struct(168:screen-wrapper.service) of size 11072
> [ 9718.940391][ T4545] The buggy address is located 44 bytes inside of
> [ 9718.940391][ T4545] 11072-byte region [ffff888014e94000, ffff888014e96b40)
> [ 9718.942559][ T4545] The buggy address belongs to the page:
> [ 9718.943454][ T4545] page:ffffea000053a500 refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff888014e97fff head:ffffea000053a500 order:2 compound_mapcount:0 compound_pincount:0
> [ 9718.946072][ T4545] flags: 0xfffe0000010200(slab|head)
> [ 9718.946958][ T4545] raw: 00fffe0000010200 ffffea00011ab108 ffffea0001d6f108 ffff8881eabd9700
> [ 9718.948406][ T4545] raw: ffff888014e97fff ffff888014e94000 0000000100000001 0000000000000000
> [ 9718.949889][ T4545] page dumped because: kasan: bad access detected
> [ 9718.950977][ T4545]
> [ 9718.951354][ T4545] Memory state around the buggy address:
> [ 9718.952296][ T4545] ffff888014e93f00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> [ 9718.953641][ T4545] ffff888014e93f80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> [ 9718.955004][ T4545] >ffff888014e94000: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> [ 9718.956366][ T4545] ^
> [ 9718.957258][ T4545] ffff888014e94080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> [ 9718.958653][ T4545] ffff888014e94100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> [ 9718.960034][ T4545] ==================================================================
>
* Re: BUG at fs/btrfs/relocation.c:794!
2020-07-24 0:19 ` Qu Wenruo
@ 2020-08-04 16:16 ` Zygo Blaxell
2020-08-28 0:03 ` BUG at fs/btrfs/relocation.c:794! Still happening on misc-next and 5.8.3 Zygo Blaxell
0 siblings, 1 reply; 13+ messages in thread
From: Zygo Blaxell @ 2020-08-04 16:16 UTC (permalink / raw)
To: Qu Wenruo; +Cc: David Sterba, linux-btrfs, wqu
On Fri, Jul 24, 2020 at 08:19:36AM +0800, Qu Wenruo wrote:
>
>
> On 2020/7/24 上午5:56, Zygo Blaxell wrote:
> > On Wed, Jul 01, 2020 at 12:10:06AM +0200, David Sterba wrote:
> >> Hi,
> >>
> >> I've hit a crash in relocation I've never seen before.
> >>
> >> [ 2129.210066] kernel BUG at fs/btrfs/relocation.c:794!
> >
> > I hit an issue yesterday that reminded me of this.
> >
> >> [ 2129.215268] invalid opcode: 0000 [#1] PREEMPT SMP
> >> [ 2129.220114] CPU: 1 PID: 3303 Comm: btrfs Not tainted 5.8.0-rc3-git+ #638
> >> [ 2129.220116] Hardware name: empty empty/S3993, BIOS PAQEX0-3 02/24/2008
> >> [ 2129.220265] RIP: 0010:create_reloc_root+0x214/0x260 [btrfs]
> >> [ 2129.258760] RSP: 0018:ffffbe1e809b38b8 EFLAGS: 00010282
> >> [ 2129.258763] RAX: 00000000ffffffef RBX: ffff988d577f9000 RCX: 0000000000000000
> >> [ 2129.258765] RDX: 0000000000000001 RSI: ffffffff8e2a2580 RDI: ffff988d64aaa6a8
> >> [ 2129.258766] RBP: ffff988d5dfcdc00 R08: 0000000000000000 R09: 0000000000000000
> >> [ 2129.258767] R10: 0000000000000001 R11: 0000000000000000 R12: ffff988d0e02fa78
> >> [ 2129.258769] R13: 0000000000000005 R14: ffff988d64fe8000 R15: ffff988d0e02fa78
> >> [ 2129.258771] FS: 00007f82a612e8c0(0000) GS:ffff988d67000000(0000) knlGS:0000000000000000
> >> [ 2129.258772] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >> [ 2129.258774] CR2: 000000000559d028 CR3: 000000020b289000 CR4: 00000000000006e0
> >> [ 2129.258775] Call Trace:
> >> [ 2129.258825] btrfs_init_reloc_root+0xe8/0x120 [btrfs]
> >> [ 2129.258862] record_root_in_trans+0xae/0xd0 [btrfs]
> >> [ 2129.258901] btrfs_record_root_in_trans+0x51/0x70 [btrfs]
> >> [ 2129.340388] select_reloc_root+0x94/0x340 [btrfs]
> >> [ 2129.340433] do_relocation+0xda/0x7b0 [btrfs]
> >> [ 2129.349854] ? _raw_spin_unlock+0x1f/0x40
> >> [ 2129.349898] relocate_tree_blocks+0x336/0x670 [btrfs]
> >> [ 2129.359325] relocate_block_group+0x2f6/0x600 [btrfs]
> >> [ 2129.359365] btrfs_relocate_block_group+0x15e/0x340 [btrfs]
> >> [ 2129.359408] btrfs_relocate_chunk+0x38/0x110 [btrfs]
> >> [ 2129.375494] __btrfs_balance+0x42c/0xce0 [btrfs]
> >> [ 2129.375553] btrfs_balance+0x66a/0xbe0 [btrfs]
> >> [ 2129.375562] ? kmem_cache_alloc_trace+0x19c/0x330
> >> [ 2129.389852] btrfs_ioctl_balance+0x298/0x350 [btrfs]
> >> [ 2129.389887] btrfs_ioctl+0x304/0x2490 [btrfs]
> >> [ 2129.389898] ? do_user_addr_fault+0x221/0x49c
> >> [ 2129.404070] ? sched_clock_cpu+0x15/0x140
> >> [ 2129.404073] ? do_user_addr_fault+0x221/0x49c
> >> [ 2129.404079] ? up_read+0x18/0x240
> >> [ 2129.404086] ? ksys_ioctl+0x68/0xa0
> >> [ 2129.404091] ksys_ioctl+0x68/0xa0
> >> [ 2129.423308] __x64_sys_ioctl+0x16/0x20
> >> [ 2129.423312] do_syscall_64+0x50/0xe0
> >> [ 2129.423315] entry_SYSCALL_64_after_hwframe+0x44/0xa9
> >> [ 2129.423318] RIP: 0033:0x7f82a51c6327
> >> [ 2129.423319] Code: Bad RIP value.
> >> [ 2129.423348] RSP: 002b:00007ffd32cf6218 EFLAGS: 00000206 ORIG_RAX: 0000000000000010
> >> [ 2129.423367] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f82a51c6327
> >> [ 2129.423368] RDX: 00007ffd32cf62a0 RSI: 00000000c4009420 RDI: 0000000000000003
> >> [ 2129.423372] RBP: 0000000000000003 R08: 0000000000000000 R09: 0000000000000000
> >> [ 2129.423377] R10: 000000000fa99fa0 R11: 0000000000000206 R12: 00007ffd32cf8823
> >> [ 2129.423379] R13: 00007ffd32cf62a0 R14: 0000000000000001 R15: 0000000000000000
> >>
> >> Relevant code called from create_reloc_root:
> >>
> >> ret = btrfs_insert_root(trans, fs_info->tree_root,
> >> &root_key, root_item);
> >> BUG_ON(ret)
> >>
> >> and according to EAX, ret is -17 which is EEXIST.
> >>
> >> I don't have a reproducer, the testing image has been filled by random git
> >> checkouts, deduplicated by BEES, then tons of snapshots created until the
> >> metadata got exhausted, some file deletion and balances.
> >
> > Mine is rsync, bees, lots of snapshots, balances, scrubs. I recently also
> > added random 'killall -INT btrfs' to send balance some fatal signals.
> >
> >> This is the same image that led to the patch "btrfs: allow use of global block
> >> reserve for balance item deletion", so this could have left it in some
> >> intermediate state where the balance item was not removed and the reloc tree as
> >> well.
> >>
> >> There were a few unsuccessful mounts due to relocation recovery, which I was
> >> trying to debug, but then it started to work.
> >>
> >> The error happened with this 'fi df' saved after the balance start:
> >>
> >> # btrfs fi df mnt
> >> Data, single: total=80.01GiB, used=38.67GiB
> >> System, single: total=4.00MiB, used=16.00KiB
> >> Metadata, single: total=19.99GiB, used=19.46GiB
> >> GlobalReserve, single: total=512.00MiB, used=44.00KiB
> >
> > Mine is:
> >
> > Data, single: total=1.75TiB, used=1.74TiB
> > System, RAID1: total=32.00MiB, used=208.00KiB
> > Metadata, RAID1: total=25.00GiB, used=22.89GiB
> > GlobalReserve, single: total=512.00MiB, used=0.00B
> >
> > though this is some time after the failure (and a reboot). I do notice
> > that there's lots of unallocated space, but metadata usage is close
> > to allocated, and I have been experiencing a lot of EROFS events when
> > that happens, even if there's gigabytes unallocated.
> >
> > btrfs fi us:
> >
> > Overall:
> > Device size: 2.00TiB
> > Device allocated: 1.80TiB
> > Device unallocated: 208.94GiB
> > Device missing: 0.00B
> > Used: 1.79TiB
> > Free (estimated): 211.30GiB (min: 106.83GiB)
> > Data ratio: 1.00
> > Metadata ratio: 2.00
> > Global reserve: 512.00MiB (used: 0.00B)
> >
> > Data,single: Size:1.75TiB, Used:1.74TiB (99.87%)
> > /dev/mapper/vgtest-tvdb 894.00GiB
> > /dev/mapper/vgtest-tvdc 895.00GiB
> >
> > Metadata,RAID1: Size:25.00GiB, Used:22.87GiB (91.47%)
> > /dev/mapper/vgtest-tvdb 25.00GiB
> > /dev/mapper/vgtest-tvdc 25.00GiB
> >
> > System,RAID1: Size:32.00MiB, Used:208.00KiB (0.63%)
> > /dev/mapper/vgtest-tvdb 32.00MiB
> > /dev/mapper/vgtest-tvdc 32.00MiB
> >
> > Unallocated:
> > /dev/mapper/vgtest-tvdb 104.97GiB
> > /dev/mapper/vgtest-tvdc 103.97GiB
> >
> >> The error looks like a repeated relocation tree creation, which would point to
> >> the unsuccessful balances or inconsistent state (balance item, reloc trees).
> >> It's not a "typical" mix of operations but I'd appreciate any insights here.
> >
> > I have the same line but different call stack, with misc-next
> > e3027d10af42d24940be74dabaf1550cd770bd48:
> >
> > [ 9717.746937][T13609] BTRFS info (device dm-0): balance: start -mlimit=1 -slimit=1
> > [ 9717.765086][T13609] BTRFS info (device dm-0): relocating block group 10991411658752 flags metadata|raid1
> > [ 9718.511137][T13609] ------------[ cut here ]------------
> > [ 9718.512293][T13609] kernel BUG at fs/btrfs/relocation.c:794!
> > [ 9718.513421][T13609] invalid opcode: 0000 [#1] SMP KASAN PTI
> > [ 9718.514590][T13609] CPU: 1 PID: 13609 Comm: btrfs Tainted: G W 5.8.0-6582a95aabfe+ #44
> > [ 9718.516178][T13609] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
> > [ 9718.517750][T13609] RIP: 0010:create_reloc_root+0x468/0x480
> > [ 9718.518717][T13609] Code: e8 bd 5b bd ff 4d 8b 76 50 be 08 00 00 00 49 8d bc 24 f0 00 00 00 e8 c7 5b bd ff 4d 89 b4 24 f0 00 00 00 e9 ee fc ff ff 0f 0b <0f> 0b 0f 0b 0f 0b 0f 0b e8 9b df 07 01 66 66 2e 0f 1f 84 00 00 00
> > [ 9718.521995][T13609] RSP: 0018:ffffc900018e7018 EFLAGS: 00010282
> > [ 9718.522991][T13609] RAX: 00000000ffffffef RBX: ffff8881e103a400 RCX: 0000000000000000
> > [ 9718.524300][T13609] RDX: dffffc0000000000 RSI: 0000000000000000 RDI: 0000000000000246
> > [ 9718.525612][T13609] RBP: ffffc900018e7108 R08: 0000000000000000 R09: 0000000000000001
> > [ 9718.527056][T13609] R10: 0000000000000001 R11: fffffbfff3dfb081 R12: ffff8881f37c8020
> > [ 9718.528386][T13609] R13: ffff88801fbc5b28 R14: ffff8881f37c8000 R15: ffffc900018e70a0
> > [ 9718.529756][T13609] FS: 00007f9577d928c0(0000) GS:ffff8881f5800000(0000) knlGS:0000000000000000
> > [ 9718.531211][T13609] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [ 9718.532295][T13609] CR2: 00007f9823e35500 CR3: 00000000a52e0002 CR4: 00000000001606e0
> > [ 9718.533608][T13609] Call Trace:
> > [ 9718.534151][T13609] ? update_backref_node+0xf0/0xf0
> > [ 9718.535137][T13609] ? check_chain_key+0x1e6/0x2e0
> > [ 9718.536057][T13609] btrfs_init_reloc_root+0x2d7/0x310
>
> That's the same problem.
>
> btrfs_init_reloc_root() got -EEXIST and triggered the BUG_ON().
>
> In that case, it means some reloc trees were not cleaned up.
>
> Would you mind providing the "btrfs ins dump-tree -t root" dump for
> that fs if the problem still happens?
http://furryterror.org/~zblaxell/tmp/.tvdb/tvdb.txt
The problem is now happening multiple times per day, starting with
kdave's misc-next e3027d10af42d24940be74dabaf1550cd770bd48 (Date: Thu
Jul 23 00:18:04 2020 +0900) and continuing on v5.8.0.
The previous misc-next that I have test data for,
cb799f0a0bb372f37f96893d2e80c1dc2f5206da (Date: Thu Jul 16 13:29:46 2020
-0700), does not have this problem.
These commit hashes are from https://gitlab.com/kdave/btrfs-devel.
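When reading a root tree dump like the one linked above, the entries of interest are the reloc tree root items, since create_reloc_root() inserts a root item for the reloc tree and the insert fails with -EEXIST if one is already there. A minimal sketch for picking them out of the text follows. It assumes the dump prints such keys as "(TREE_RELOC ROOT_ITEM <id>)", which is an assumption about the tool's output format and not taken from this thread; the function names are hypothetical.

```c
#include <stdio.h>
#include <string.h>

/*
 * Return 1 and store the id found in the key offset if `line` looks like a
 * reloc tree root item in "btrfs ins dump-tree -t root" output, e.g.
 *   item 12 key (TREE_RELOC ROOT_ITEM 257) itemoff 11971 itemsize 439
 * Return 0 otherwise. The line format is assumed, see the note above.
 */
int parse_reloc_root_line(const char *line, unsigned long long *id)
{
    const char *key = strstr(line, "key (TREE_RELOC ROOT_ITEM ");

    if (!key)
        return 0;
    return sscanf(key, "key (TREE_RELOC ROOT_ITEM %llu)", id) == 1;
}

/* Print and count the reloc tree root items in a dump read from `in`. */
unsigned int count_reloc_roots(FILE *in)
{
    char line[1024];
    unsigned long long id;
    unsigned int count = 0;

    while (fgets(line, sizeof(line), in)) {
        if (parse_reloc_root_line(line, &id)) {
            printf("reloc tree root item, key offset %llu\n", id);
            count++;
        }
    }
    return count;
}
```

Calling count_reloc_roots(stdin) with the dump text piped in lists any reloc tree root items still present. A nonzero count while no balance is running would be consistent with reloc trees that were not cleaned up.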
> Thanks,
> Qu
> > [ 9718.537016][T13609] ? find_reloc_root+0x200/0x200
> > [ 9718.537992][T13609] ? do_raw_spin_unlock+0xa8/0x140
> > [ 9718.538899][T13609] record_root_in_trans+0x18c/0x1d0
> > [ 9718.539848][T13609] btrfs_record_root_in_trans+0x8b/0xc0
> > [ 9718.540843][T13609] select_reloc_root+0x15f/0x6a0
> > [ 9718.541943][T13609] ? create_reloc_inode.isra.28+0x410/0x410
> > [ 9718.543066][T13609] ? rcu_read_lock_sched_held+0xa1/0xd0
> > [ 9718.544333][T13609] ? check_flags.part.44+0x86/0x220
> > [ 9718.545186][T13609] ? check_flags+0x26/0x30
> > [ 9718.545870][T13609] ? lock_is_held_type+0xc9/0x100
> > [ 9718.546651][T13609] do_relocation+0x242/0xc90
> > [ 9718.547372][T13609] ? select_reloc_root+0x6a0/0x6a0
> > [ 9718.548160][T13609] ? check_flags.part.44+0x86/0x220
> > [ 9718.548969][T13609] ? __kasan_check_read+0x11/0x20
> > [ 9718.549745][T13609] ? mark_lock+0xa8/0x440
> > [ 9718.550426][T13609] ? mark_held_locks+0x8d/0xb0
> > [ 9718.551165][T13609] ? btrfs_backref_cleanup_node+0x5c1/0x600
> > [ 9718.552079][T13609] ? memcpy+0x4d/0x60
> > [ 9718.552694][T13609] ? read_extent_buffer+0xcc/0x120
> > [ 9718.553478][T13609] relocate_tree_blocks+0xa29/0xb00
> > [ 9718.554255][T13609] ? do_relocation+0xc90/0xc90
> > [ 9718.554978][T13609] ? kmem_cache_alloc_trace+0x5af/0x740
> > [ 9718.555855][T13609] ? free_extent_buffer.part.46+0x90/0x140
> > [ 9718.556756][T13609] ? rb_insert_color+0x342/0x360
> > [ 9718.557581][T13609] ? free_extent_buffer+0x13/0x20
> > [ 9718.558445][T13609] ? add_tree_block.isra.34+0x236/0x2b0
> > [ 9718.559387][T13609] relocate_block_group+0x52e/0x830
> > [ 9718.560275][T13609] ? merge_reloc_roots+0x4b0/0x4b0
> > [ 9718.561137][T13609] btrfs_relocate_block_group+0x26e/0x4c0
> > [ 9718.562137][T13609] btrfs_relocate_chunk+0x52/0x120
> > [ 9718.562918][T13609] btrfs_balance+0xe22/0x1910
> > [ 9718.563605][T13609] ? check_chain_key+0x1e6/0x2e0
> > [ 9718.564331][T13609] ? btrfs_relocate_chunk+0x120/0x120
> > [ 9718.565126][T13609] ? kmem_cache_alloc_trace+0x5af/0x740
> > [ 9718.565943][T13609] ? _copy_from_user+0x95/0xd0
> > [ 9718.566649][T13609] btrfs_ioctl_balance+0x3de/0x4c0
> > [ 9718.567414][T13609] btrfs_ioctl+0x2385/0x4250
> > [ 9718.568090][T13609] ? __kasan_check_read+0x11/0x20
> > [ 9718.568830][T13609] ? check_chain_key+0x1e6/0x2e0
> > [ 9718.569619][T13609] ? btrfs_ioctl_get_supported_features+0x30/0x30
> > [ 9718.570658][T13609] ? kvm_sched_clock_read+0x18/0x30
> > [ 9718.571526][T13609] ? check_chain_key+0x1e6/0x2e0
> > [ 9718.572348][T13609] ? lock_downgrade+0x3e0/0x3e0
> > [ 9718.573121][T13609] ? do_vfs_ioctl+0xfc/0x9d0
> > [ 9718.573835][T13609] ? ioctl_file_clone+0xe0/0xe0
> > [ 9718.574637][T13609] ? check_flags.part.44+0x86/0x220
> > [ 9718.575472][T13609] ? check_flags+0x26/0x30
> > [ 9718.576190][T13609] ? lock_is_held_type+0xc9/0x100
> > [ 9718.576990][T13609] ? check_flags.part.44+0x86/0x220
> > [ 9718.577836][T13609] ? check_flags+0x26/0x30
> > [ 9718.578542][T13609] ? lock_is_held_type+0xc9/0x100
> > [ 9718.579403][T13609] ? __kasan_check_read+0x11/0x20
> > [ 9718.580225][T13609] ? __fget_light+0xae/0x110
> > [ 9718.580983][T13609] ksys_ioctl+0xa1/0xe0
> > [ 9718.581628][T13609] __x64_sys_ioctl+0x43/0x50
> > [ 9718.582334][T13609] do_syscall_64+0x60/0xf0
> > [ 9718.583285][T13609] entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > [ 9718.584378][T13609] RIP: 0033:0x7f9577e85427
> > [ 9718.585289][T13609] Code: Bad RIP value.
> > [ 9718.586076][T13609] RSP: 002b:00007ffdc7b82548 EFLAGS: 00000206 ORIG_RAX: 0000000000000010
> > [ 9718.587896][T13609] RAX: ffffffffffffffda RBX: 00007ffdc7b825e8 RCX: 00007f9577e85427
> > [ 9718.589391][T13609] RDX: 00007ffdc7b825e8 RSI: 00000000c4009420 RDI: 0000000000000003
> > [ 9718.590817][T13609] RBP: 0000000000000003 R08: 0000000000000003 R09: 0000000000000078
> > [ 9718.592631][T13609] R10: fffffffffffff31c R11: 0000000000000206 R12: 0000000000000001
> > [ 9718.594405][T13609] R13: 0000000000000000 R14: 00007ffdc7b84a48 R15: 0000000000000001
> > [ 9718.596109][T13609] Modules linked in:
> > [ 9718.597056][T13609] ---[ end trace 2cf173f8217fc093 ]---
> > [ 9718.598018][T13609] RIP: 0010:create_reloc_root+0x468/0x480
> > [ 9718.602850][T13609] Code: e8 bd 5b bd ff 4d 8b 76 50 be 08 00 00 00 49 8d bc 24 f0 00 00 00 e8 c7 5b bd ff 4d 89 b4 24 f0 00 00 00 e9 ee fc ff ff 0f 0b <0f> 0b 0f 0b 0f 0b 0f 0b e8 9b df 07 01 66 66 2e 0f 1f 84 00 00 00
> > [ 9718.613371][T13609] RSP: 0018:ffffc900018e7018 EFLAGS: 00010282
> > [ 9718.621286][T13609] RAX: 00000000ffffffef RBX: ffff8881e103a400 RCX: 0000000000000000
> > [ 9718.631255][T13609] RDX: dffffc0000000000 RSI: 0000000000000000 RDI: 0000000000000246
> > [ 9718.639764][T13609] RBP: ffffc900018e7108 R08: 0000000000000000 R09: 0000000000000001
> > [ 9718.641533][T13609] R10: 0000000000000001 R11: fffffbfff3dfb081 R12: ffff8881f37c8020
> > [ 9718.643173][T13609] R13: ffff88801fbc5b28 R14: ffff8881f37c8000 R15: ffffc900018e70a0
> > [ 9718.644840][T13609] FS: 00007f9577d928c0(0000) GS:ffff8881f5800000(0000) knlGS:0000000000000000
> > [ 9718.646728][T13609] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [ 9718.648607][T13609] CR2: 00007f9823e35500 CR3: 00000000a52e0002 CR4: 00000000001606e0
> > [ 9718.869689][ T4545] ==================================================================
> >
> > same line, different call stack:
> >
> > 0xffffffff81933dd8 is in create_reloc_root (fs/btrfs/relocation.c:794).
> > 789 btrfs_tree_unlock(eb);
> > 790 free_extent_buffer(eb);
> > 791
> > 792 ret = btrfs_insert_root(trans, fs_info->tree_root,
> > 793 &root_key, root_item);
> > 794 BUG_ON(ret);
> > 795 kfree(root_item);
> > 796
> > 797 reloc_root = btrfs_read_tree_root(fs_info->tree_root, &root_key);
> > 798 BUG_ON(IS_ERR(reloc_root));
> >
> > followed by
> >
> > [ 9718.869689][ T4545] ==================================================================
> > [ 9718.871333][ T4545] BUG: KASAN: use-after-free in __mutex_lock+0x202/0xce0
> > [ 9718.872483][ T4545] Read of size 4 at addr ffff888014e9402c by task crawl_28443/4545
> > [ 9718.873746][ T4545]
> > [ 9718.874106][ T4545] CPU: 1 PID: 4545 Comm: crawl_28443 Tainted: G D W 5.8.0-6582a95aabfe+ #44
> > [ 9718.875684][ T4545] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
> > [ 9718.877149][ T4545] Call Trace:
> > [ 9718.877655][ T4545] dump_stack+0xc8/0x11a
> > [ 9718.878317][ T4545] ? __mutex_lock+0x202/0xce0
> > [ 9718.879065][ T4545] print_address_description.constprop.8+0x1f/0x200
> > [ 9718.880167][ T4545] ? __mutex_lock+0x202/0xce0
> > [ 9718.880916][ T4545] ? __mutex_lock+0x202/0xce0
> > [ 9718.881666][ T4545] kasan_report.cold.11+0x20/0x3e
> > [ 9718.882483][ T4545] ? __mutex_lock+0x202/0xce0
> > [ 9718.883229][ T4545] __asan_load4+0x69/0x90
> > [ 9718.883920][ T4545] __mutex_lock+0x202/0xce0
> > [ 9718.884651][ T4545] ? wait_current_trans+0xb7/0x230
> > [ 9718.885465][ T4545] ? btrfs_record_root_in_trans+0x7e/0xc0
> > [ 9718.886388][ T4545] ? mutex_lock_io_nested+0xc20/0xc20
> > [ 9718.887246][ T4545] ? __kasan_check_read+0x11/0x20
> > [ 9718.888035][ T4545] ? join_transaction+0x32/0x6f0
> > [ 9718.888854][ T4545] ? join_transaction+0x1a6/0x6f0
> > [ 9718.889679][ T4545] ? lock_downgrade+0x3e0/0x3e0
> > [ 9718.890496][ T4545] ? __kasan_check_write+0x14/0x20
> > [ 9718.891308][ T4545] ? lock_contended+0x720/0x720
> > [ 9718.892093][ T4545] ? do_raw_spin_lock+0x1e0/0x1e0
> > [ 9718.892912][ T4545] ? wait_current_trans+0xb7/0x230
> > [ 9718.893705][ T4545] mutex_lock_nested+0x1b/0x20
> > [ 9718.894494][ T4545] ? mutex_lock_nested+0x1b/0x20
> > [ 9718.895317][ T4545] btrfs_record_root_in_trans+0x7e/0xc0
> > [ 9718.896245][ T4545] start_transaction+0x189/0x8f0
> > [ 9718.897081][ T4545] btrfs_start_transaction+0x1e/0x20
> > [ 9718.897941][ T4545] btrfs_cont_expand+0x549/0x7a0
> > [ 9718.898805][ T4545] ? btrfs_truncate_block+0x930/0x930
> > [ 9718.899665][ T4545] ? inode_newsize_ok+0x75/0xc0
> > [ 9718.900438][ T4545] ? setattr_prepare+0x9c/0x310
> > [ 9718.901242][ T4545] btrfs_setattr+0x514/0x850
> > [ 9718.902035][ T4545] ? current_time+0x8c/0xe0
> > [ 9718.902799][ T4545] notify_change+0x4ec/0x700
> > [ 9718.903584][ T4545] ? do_sys_ftruncate+0x108/0x220
> > [ 9718.904459][ T4545] do_truncate+0xe4/0x160
> > [ 9718.905200][ T4545] ? __x64_sys_openat2+0x170/0x170
> > [ 9718.906116][ T4545] ? __sb_start_write+0x1a1/0x270
> > [ 9718.906954][ T4545] do_sys_ftruncate+0x1b8/0x220
> > [ 9718.907759][ T4545] __x64_sys_ftruncate+0x36/0x40
> > [ 9718.908577][ T4545] do_syscall_64+0x60/0xf0
> > [ 9718.909292][ T4545] entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > [ 9718.910521][ T4545] RIP: 0033:0x7f201fcab947
> > [ 9718.911247][ T4545] Code: Bad RIP value.
> > [ 9718.911915][ T4545] RSP: 002b:00007f201d3abeb8 EFLAGS: 00000202 ORIG_RAX: 000000000000004d
> > [ 9718.913285][ T4545] RAX: ffffffffffffffda RBX: 00007f201d3abfa0 RCX: 00007f201fcab947
> > [ 9718.914613][ T4545] RDX: 000000005f18a6d2 RSI: 0000000000286000 RDI: 0000000000000ec1
> > [ 9718.915921][ T4545] RBP: 00007f1fb01c2f00 R08: 00007ffe1e345080 R09: 00000000011b1f78
> > [ 9718.917236][ T4545] R10: 00000000011b1f78 R11: 0000000000000202 R12: 00007f201d3abf20
> > [ 9718.918556][ T4545] R13: 00007f201d3abef0 R14: 00007f201d3abf50 R15: 00007f201d3abed0
> > [ 9718.919882][ T4545]
> > [ 9718.920268][ T4545] Allocated by task 6732:
> > [ 9718.920973][ T4545] save_stack+0x21/0x50
> > [ 9718.921648][ T4545] __kasan_kmalloc.constprop.17+0xc1/0xd0
> > [ 9718.922580][ T4545] kasan_slab_alloc+0x12/0x20
> > [ 9718.923345][ T4545] kmem_cache_alloc_node+0x113/0x720
> > [ 9718.924203][ T4545] copy_process+0x357/0x3680
> > [ 9718.924955][ T4545] _do_fork+0xed/0x880
> > [ 9718.925622][ T4545] __do_sys_clone+0xee/0x130
> > [ 9718.926369][ T4545] __x64_sys_clone+0x67/0x80
> > [ 9718.927119][ T4545] do_syscall_64+0x60/0xf0
> > [ 9718.927848][ T4545] entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > [ 9718.928812][ T4545]
> > [ 9718.929173][ T4545] Freed by task 24:
> > [ 9718.929787][ T4545] save_stack+0x21/0x50
> > [ 9718.930453][ T4545] __kasan_slab_free+0x118/0x170
> > [ 9718.931242][ T4545] kasan_slab_free+0xe/0x10
> > [ 9718.931970][ T4545] kmem_cache_free+0x5f/0x280
> > [ 9718.932730][ T4545] free_task+0x73/0x90
> > [ 9718.933391][ T4545] __put_task_struct+0x199/0x1d0
> > [ 9718.934187][ T4545] delayed_put_task_struct+0x124/0x1b0
> > [ 9718.935071][ T4545] rcu_core+0x3b0/0xeb0
> > [ 9718.935758][ T4545] rcu_core_si+0xe/0x10
> > [ 9718.936433][ T4545] __do_softirq+0x120/0x5e3
> > [ 9718.937165][ T4545]
> > [ 9718.937545][ T4545] The buggy address belongs to the object at ffff888014e94000
> > [ 9718.937545][ T4545] which belongs to the cache task_struct(168:screen-wrapper.service) of size 11072
> > [ 9718.940391][ T4545] The buggy address is located 44 bytes inside of
> > [ 9718.940391][ T4545] 11072-byte region [ffff888014e94000, ffff888014e96b40)
> > [ 9718.942559][ T4545] The buggy address belongs to the page:
> > [ 9718.943454][ T4545] page:ffffea000053a500 refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff888014e97fff head:ffffea000053a500 order:2 compound_mapcount:0 compound_pincount:0
> > [ 9718.946072][ T4545] flags: 0xfffe0000010200(slab|head)
> > [ 9718.946958][ T4545] raw: 00fffe0000010200 ffffea00011ab108 ffffea0001d6f108 ffff8881eabd9700
> > [ 9718.948406][ T4545] raw: ffff888014e97fff ffff888014e94000 0000000100000001 0000000000000000
> > [ 9718.949889][ T4545] page dumped because: kasan: bad access detected
> > [ 9718.950977][ T4545]
> > [ 9718.951354][ T4545] Memory state around the buggy address:
> > [ 9718.952296][ T4545] ffff888014e93f00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> > [ 9718.953641][ T4545] ffff888014e93f80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> > [ 9718.955004][ T4545] >ffff888014e94000: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> > [ 9718.956366][ T4545] ^
> > [ 9718.957258][ T4545] ffff888014e94080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> > [ 9718.958653][ T4545] ffff888014e94100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> > [ 9718.960034][ T4545] ==================================================================
> >
>
* Re: BUG at fs/btrfs/relocation.c:794! Still happening on misc-next and 5.8.3
2020-08-04 16:16 ` Zygo Blaxell
@ 2020-08-28 0:03 ` Zygo Blaxell
2020-08-28 0:08 ` Zygo Blaxell
0 siblings, 1 reply; 13+ messages in thread
From: Zygo Blaxell @ 2020-08-28 0:03 UTC (permalink / raw)
To: Qu Wenruo; +Cc: David Sterba, linux-btrfs, wqu
On Tue, Aug 04, 2020 at 12:16:26PM -0400, Zygo Blaxell wrote:
> On Fri, Jul 24, 2020 at 08:19:36AM +0800, Qu Wenruo wrote:
> >
> >
> > > On 2020/7/24 5:56 AM, Zygo Blaxell wrote:
> > > On Wed, Jul 01, 2020 at 12:10:06AM +0200, David Sterba wrote:
> > >> Hi,
> > >>
> > >> I've hit a crash in relocation I've never seen before.
> > >>
> > >> [ 2129.210066] kernel BUG at fs/btrfs/relocation.c:794!
> > >
> > > I hit an issue yesterday that reminded me of this.
> > >
> > >> [ 2129.215268] invalid opcode: 0000 [#1] PREEMPT SMP
> > >> [ 2129.220114] CPU: 1 PID: 3303 Comm: btrfs Not tainted 5.8.0-rc3-git+ #638
> > >> [ 2129.220116] Hardware name: empty empty/S3993, BIOS PAQEX0-3 02/24/2008
> > >> [ 2129.220265] RIP: 0010:create_reloc_root+0x214/0x260 [btrfs]
> > >> [ 2129.258760] RSP: 0018:ffffbe1e809b38b8 EFLAGS: 00010282
> > >> [ 2129.258763] RAX: 00000000ffffffef RBX: ffff988d577f9000 RCX: 0000000000000000
> > >> [ 2129.258765] RDX: 0000000000000001 RSI: ffffffff8e2a2580 RDI: ffff988d64aaa6a8
> > >> [ 2129.258766] RBP: ffff988d5dfcdc00 R08: 0000000000000000 R09: 0000000000000000
> > >> [ 2129.258767] R10: 0000000000000001 R11: 0000000000000000 R12: ffff988d0e02fa78
> > >> [ 2129.258769] R13: 0000000000000005 R14: ffff988d64fe8000 R15: ffff988d0e02fa78
> > >> [ 2129.258771] FS: 00007f82a612e8c0(0000) GS:ffff988d67000000(0000) knlGS:0000000000000000
> > >> [ 2129.258772] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > >> [ 2129.258774] CR2: 000000000559d028 CR3: 000000020b289000 CR4: 00000000000006e0
> > >> [ 2129.258775] Call Trace:
> > >> [ 2129.258825] btrfs_init_reloc_root+0xe8/0x120 [btrfs]
> > >> [ 2129.258862] record_root_in_trans+0xae/0xd0 [btrfs]
> > >> [ 2129.258901] btrfs_record_root_in_trans+0x51/0x70 [btrfs]
> > >> [ 2129.340388] select_reloc_root+0x94/0x340 [btrfs]
> > >> [ 2129.340433] do_relocation+0xda/0x7b0 [btrfs]
> > >> [ 2129.349854] ? _raw_spin_unlock+0x1f/0x40
> > >> [ 2129.349898] relocate_tree_blocks+0x336/0x670 [btrfs]
> > >> [ 2129.359325] relocate_block_group+0x2f6/0x600 [btrfs]
> > >> [ 2129.359365] btrfs_relocate_block_group+0x15e/0x340 [btrfs]
> > >> [ 2129.359408] btrfs_relocate_chunk+0x38/0x110 [btrfs]
> > >> [ 2129.375494] __btrfs_balance+0x42c/0xce0 [btrfs]
> > >> [ 2129.375553] btrfs_balance+0x66a/0xbe0 [btrfs]
> > >> [ 2129.375562] ? kmem_cache_alloc_trace+0x19c/0x330
> > >> [ 2129.389852] btrfs_ioctl_balance+0x298/0x350 [btrfs]
> > >> [ 2129.389887] btrfs_ioctl+0x304/0x2490 [btrfs]
> > >> [ 2129.389898] ? do_user_addr_fault+0x221/0x49c
> > >> [ 2129.404070] ? sched_clock_cpu+0x15/0x140
> > >> [ 2129.404073] ? do_user_addr_fault+0x221/0x49c
> > >> [ 2129.404079] ? up_read+0x18/0x240
> > >> [ 2129.404086] ? ksys_ioctl+0x68/0xa0
> > >> [ 2129.404091] ksys_ioctl+0x68/0xa0
> > >> [ 2129.423308] __x64_sys_ioctl+0x16/0x20
> > >> [ 2129.423312] do_syscall_64+0x50/0xe0
> > >> [ 2129.423315] entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > >> [ 2129.423318] RIP: 0033:0x7f82a51c6327
> > >> [ 2129.423319] Code: Bad RIP value.
> > >> [ 2129.423348] RSP: 002b:00007ffd32cf6218 EFLAGS: 00000206 ORIG_RAX: 0000000000000010
> > >> [ 2129.423367] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f82a51c6327
> > >> [ 2129.423368] RDX: 00007ffd32cf62a0 RSI: 00000000c4009420 RDI: 0000000000000003
> > >> [ 2129.423372] RBP: 0000000000000003 R08: 0000000000000000 R09: 0000000000000000
> > >> [ 2129.423377] R10: 000000000fa99fa0 R11: 0000000000000206 R12: 00007ffd32cf8823
> > >> [ 2129.423379] R13: 00007ffd32cf62a0 R14: 0000000000000001 R15: 0000000000000000
> > >>
> > >> Relevant code called from create_reloc_root:
> > >>
> > >> ret = btrfs_insert_root(trans, fs_info->tree_root,
> > >> &root_key, root_item);
> > >> BUG_ON(ret)
> > >>
> > >> and according to EAX, ret is -17 which is EEXIST.
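The -17 reading can be double-checked from the register dump, which shows RAX: 00000000ffffffef. The snippet below is only a sketch: decode_ret is a made-up helper name, and it assumes the return value is a 32-bit signed int held in the low half of RAX.

```python
import errno

def decode_ret(rax):
    """Interpret the low 32 bits of RAX as a signed C int and name the errno.

    Hypothetical helper for illustration; not part of any kernel tooling.
    """
    value = rax & 0xFFFFFFFF
    if value >= 0x80000000:
        value -= 1 << 32  # two's-complement sign extension
    name = errno.errorcode.get(-value, "unknown") if value < 0 else None
    return value, name

# RAX value taken from the oops above.
print(decode_ret(0x00000000FFFFFFEF))
```

On Linux this should come out as -17 / EEXIST, matching the analysis quoted above.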
> > >>
> > >> I don't have a reproducer, the testing image has been filled by random git
> > >> checkouts, deduplicated by BEES, then tons of snapshots created until the
> > >> metadata got exhausted, some file deletion and balances.
> > >
> > > Mine is rsync, bees, lots of snapshots, balances, scrubs. I recently also
> > > added random 'killall -INT btrfs' to send balance some fatal signals.
> > >
> > >> This is the same image that led to the patch "btrfs: allow use of global block
> > >> reserve for balance item deletion", so this could have left it in some
> > >> intermediate state where the balance item was not removed and the reloc tree as
> > >> well.
> > >>
> > > >> There were a few unsuccessful mounts due to relocation recovery, which I was
> > > >> trying to debug, but then it started to work.
> > >>
> > >> The error happened with this 'fi df' saved after the balance start:
> > >>
> > >> # btrfs fi df mnt
> > >> Data, single: total=80.01GiB, used=38.67GiB
> > >> System, single: total=4.00MiB, used=16.00KiB
> > >> Metadata, single: total=19.99GiB, used=19.46GiB
> > >> GlobalReserve, single: total=512.00MiB, used=44.00KiB
> > >
> > > Mine is:
> > >
> > > Data, single: total=1.75TiB, used=1.74TiB
> > > System, RAID1: total=32.00MiB, used=208.00KiB
> > > Metadata, RAID1: total=25.00GiB, used=22.89GiB
> > > GlobalReserve, single: total=512.00MiB, used=0.00B
> > >
> > > though this is some time after the failure (and a reboot). I do notice
> > > that there's lots of unallocated space, but metadata usage is close
> > > to allocated, and I have been experiencing a lot of EROFS events when
> > > that happens, even if there's gigabytes unallocated.
> > >
> > > btrfs fi us:
> > >
> > > Overall:
> > > Device size: 2.00TiB
> > > Device allocated: 1.80TiB
> > > Device unallocated: 208.94GiB
> > > Device missing: 0.00B
> > > Used: 1.79TiB
> > > Free (estimated): 211.30GiB (min: 106.83GiB)
> > > Data ratio: 1.00
> > > Metadata ratio: 2.00
> > > Global reserve: 512.00MiB (used: 0.00B)
> > >
> > > Data,single: Size:1.75TiB, Used:1.74TiB (99.87%)
> > > /dev/mapper/vgtest-tvdb 894.00GiB
> > > /dev/mapper/vgtest-tvdc 895.00GiB
> > >
> > > Metadata,RAID1: Size:25.00GiB, Used:22.87GiB (91.47%)
> > > /dev/mapper/vgtest-tvdb 25.00GiB
> > > /dev/mapper/vgtest-tvdc 25.00GiB
> > >
> > > System,RAID1: Size:32.00MiB, Used:208.00KiB (0.63%)
> > > /dev/mapper/vgtest-tvdb 32.00MiB
> > > /dev/mapper/vgtest-tvdc 32.00MiB
> > >
> > > Unallocated:
> > > /dev/mapper/vgtest-tvdb 104.97GiB
> > > /dev/mapper/vgtest-tvdc 103.97GiB
> > >
> > >> The error looks like a repeated relocation tree creation, which would point to
> > > >> the unsuccessful balances or inconsistent state (balance item, reloc trees).
> > >> It's not a "typical" mix of operations but I'd appreciate any insights here.
> > >
> > > I have the same line but different call stack, with misc-next
> > > e3027d10af42d24940be74dabaf1550cd770bd48:
> > >
> > > [ 9717.746937][T13609] BTRFS info (device dm-0): balance: start -mlimit=1 -slimit=1
> > > [ 9717.765086][T13609] BTRFS info (device dm-0): relocating block group 10991411658752 flags metadata|raid1
> > > [ 9718.511137][T13609] ------------[ cut here ]------------
> > > [ 9718.512293][T13609] kernel BUG at fs/btrfs/relocation.c:794!
> > > [ 9718.513421][T13609] invalid opcode: 0000 [#1] SMP KASAN PTI
> > > [ 9718.514590][T13609] CPU: 1 PID: 13609 Comm: btrfs Tainted: G W 5.8.0-6582a95aabfe+ #44
> > > [ 9718.516178][T13609] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
> > > [ 9718.517750][T13609] RIP: 0010:create_reloc_root+0x468/0x480
> > > [ 9718.518717][T13609] Code: e8 bd 5b bd ff 4d 8b 76 50 be 08 00 00 00 49 8d bc 24 f0 00 00 00 e8 c7 5b bd ff 4d 89 b4 24 f0 00 00 00 e9 ee fc ff ff 0f 0b <0f> 0b 0f 0b 0f 0b 0f 0b
> > > e8 9b df 07 01 66 66 2e 0f 1f 84 00 00 00
> > > [ 9718.521995][T13609] RSP: 0018:ffffc900018e7018 EFLAGS: 00010282
> > > [ 9718.522991][T13609] RAX: 00000000ffffffef RBX: ffff8881e103a400 RCX: 0000000000000000
> > > [ 9718.524300][T13609] RDX: dffffc0000000000 RSI: 0000000000000000 RDI: 0000000000000246
> > > [ 9718.525612][T13609] RBP: ffffc900018e7108 R08: 0000000000000000 R09: 0000000000000001
> > > [ 9718.527056][T13609] R10: 0000000000000001 R11: fffffbfff3dfb081 R12: ffff8881f37c8020
> > > [ 9718.528386][T13609] R13: ffff88801fbc5b28 R14: ffff8881f37c8000 R15: ffffc900018e70a0
> > > [ 9718.529756][T13609] FS: 00007f9577d928c0(0000) GS:ffff8881f5800000(0000) knlGS:0000000000000000
> > > [ 9718.531211][T13609] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > [ 9718.532295][T13609] CR2: 00007f9823e35500 CR3: 00000000a52e0002 CR4: 00000000001606e0
> > > [ 9718.533608][T13609] Call Trace:
> > > [ 9718.534151][T13609] ? update_backref_node+0xf0/0xf0
> > > [ 9718.535137][T13609] ? check_chain_key+0x1e6/0x2e0
> > > [ 9718.536057][T13609] btrfs_init_reloc_root+0x2d7/0x310
> >
> > That's the same problem.
> >
> > > btrfs_init_reloc_root() got -EEXIST, which triggered the BUG_ON().
> > >
> > > In that case, that means there are some reloc trees not cleaned up.
> > >
> > > Would you mind providing the "btrfs ins dump-tree -t root" dump for
> > > that fs if the problem still happens?
>
> http://furryterror.org/~zblaxell/tmp/.tvdb/tvdb.txt
>
> The problem is now happening multiple times per day, starting with
> kdave's misc-next e3027d10af42d24940be74dabaf1550cd770bd48 Date: Thu
> Jul 23 00:18:04 2020 +0900 and continuing on v5.8.0.
>
> The previous misc-next (that I have test data for),
> cb799f0a0bb372f37f96893d2e80c1dc2f5206da Date: Thu Jul 16 13:29:46 2020
> -0700 does not have this problem.
>
> These commit hashes are from https://gitlab.com/kdave/btrfs-devel.
Still hitting this bug every few hours on all 5.8.x so far, and misc-next.
There is a strong correlation between hitting the bug and starting a metadata
block group in balance, and a weaker correlation with data balances:
Aug 23 05:04:05 regress kernel: [53458.128928][ T9737] BTRFS info (device dm-0): relocating block group 14939862335488 flags metadata|raid1
Aug 23 05:04:05 regress kernel: [53458.999342][ T9737] ------------[ cut here ]------------
Aug 23 05:04:05 regress kernel: [53459.000275][ T9737] kernel BUG at fs/btrfs/relocation.c:794!
Aug 24 01:23:52 regress kernel: [58662.545914][T17474] BTRFS info (device dm-0): relocating block group 15083978620928 flags metadata|raid1
Aug 24 01:23:54 regress kernel: [58664.778274][T17474] ------------[ cut here ]------------
Aug 24 01:23:54 regress kernel: [58664.782182][T17474] kernel BUG at fs/btrfs/relocation.c:794!
Aug 24 07:17:07 regress kernel: [21068.421134][T29457] BTRFS info (device dm-0): relocating block group 15160784715776 flags metadata|raid1
Aug 24 07:17:08 regress kernel: [21069.307661][ T5176] ------------[ cut here ]------------
Aug 24 07:17:08 regress kernel: [21069.309195][ T5176] kernel BUG at fs/btrfs/relocation.c:794!
Aug 25 18:58:26 regress kernel: [22013.457555][ T2164] BTRFS info (device dm-0): relocating block group 15530051239936 flags metadata|raid1
Aug 25 18:58:27 regress kernel: [22014.460689][ T4939] ------------[ cut here ]------------
Aug 25 18:58:27 regress kernel: [22014.461653][ T4939] kernel BUG at fs/btrfs/relocation.c:794!
Aug 26 03:39:20 regress kernel: [31172.016638][T30882] BTRFS info (device dm-0): relocating block group 15576759009280 flags metadata|raid1
Aug 26 03:39:21 regress kernel: [31173.329719][T12663] ------------[ cut here ]------------
Aug 26 03:39:21 regress kernel: [31173.330682][T12663] kernel BUG at fs/btrfs/relocation.c:794!
Aug 26 16:00:02 regress kernel: [44334.231395][T25917] BTRFS info (device dm-0): relocating block group 15631888941056 flags data
Aug 26 16:00:04 regress kernel: [44336.800710][T26519] ------------[ cut here ]------------
Aug 26 16:00:04 regress kernel: [44336.802888][T26519] kernel BUG at fs/btrfs/relocation.c:794!
Aug 27 15:45:29 regress kernel: [55423.626717][ T5878] BTRFS info (device dm-0): relocating block group 15820229967872 flags metadata|raid1
Aug 27 15:45:29 regress kernel: [55423.798584][T15744] ------------[ cut here ]------------
Aug 27 15:45:29 regress kernel: [55423.802581][T15744] kernel BUG at fs/btrfs/relocation.c:794!
Aug 27 17:35:26 regress kernel: [ 6459.129124][T21053] BTRFS info (device dm-0): relocating block group 15831168712704 flags metadata|raid1
Aug 27 17:35:43 regress kernel: [ 6475.931029][T25720] ------------[ cut here ]------------
Aug 27 17:35:43 regress kernel: [ 6475.932403][T25720] kernel BUG at fs/btrfs/relocation.c:794!
There don't seem to be any instances of the BUG that did not occur
within 30 seconds of starting a balance.
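A minimal sketch of how the gap between a "relocating block group" line and the following BUG line could be measured from a syslog capture. The function name bug_delays, the regex, and the year (syslog does not record one) are assumptions made for illustration; the sample lines are copied from the excerpt above.

```python
import re
from datetime import datetime

# Matches syslog lines shaped like the excerpt above:
# "<Mon> <day> <HH:MM:SS> <host> kernel: [<uptime>][<task>] <message>"
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+kernel:\s+"
    r"\[\s*\d+\.\d+\]\[\s*T\d+\]\s+(?P<msg>.*)$"
)
START_RE = re.compile(r"relocating block group (\d+) flags (\S+)")
BUG_TEXT = "kernel BUG at fs/btrfs/relocation.c:794!"

def bug_delays(lines, year=2020):
    """Return (block_group, flags, seconds) for each BUG that follows a relocation start."""
    results = []
    last = None  # (timestamp, block group, flags) of the latest relocation start
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue
        ts = datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S")
        start = START_RE.search(m["msg"])
        if start:
            last = (ts, int(start[1]), start[2])
        elif BUG_TEXT in m["msg"] and last:
            results.append((last[1], last[2], int((ts - last[0]).total_seconds())))
            last = None
    return results

SAMPLE = [
    "Aug 23 05:04:05 regress kernel: [53458.128928][ T9737] BTRFS info (device dm-0): relocating block group 14939862335488 flags metadata|raid1",
    "Aug 23 05:04:05 regress kernel: [53458.999342][ T9737] ------------[ cut here ]------------",
    "Aug 23 05:04:05 regress kernel: [53459.000275][ T9737] kernel BUG at fs/btrfs/relocation.c:794!",
    "Aug 27 17:35:26 regress kernel: [ 6459.129124][T21053] BTRFS info (device dm-0): relocating block group 15831168712704 flags metadata|raid1",
    "Aug 27 17:35:43 regress kernel: [ 6475.932403][T25720] kernel BUG at fs/btrfs/relocation.c:794!",
]

for block_group, flags, seconds in bug_delays(SAMPLE):
    print(block_group, flags, seconds)
```

Wall-clock timestamps only have one-second resolution, so the bracketed uptime values would give a finer measurement; the sketch uses wall-clock time because the uptime counter resets across the reboots that follow each crash.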
The on-disk data is fine. After a reboot the same block group can be
successfully balanced.
>
> > Thanks,
> > Qu
> > > [ 9718.537016][T13609] ? find_reloc_root+0x200/0x200
> > > [ 9718.537992][T13609] ? do_raw_spin_unlock+0xa8/0x140
> > > [ 9718.538899][T13609] record_root_in_trans+0x18c/0x1d0
> > > [ 9718.539848][T13609] btrfs_record_root_in_trans+0x8b/0xc0
> > > [ 9718.540843][T13609] select_reloc_root+0x15f/0x6a0
> > > [ 9718.541943][T13609] ? create_reloc_inode.isra.28+0x410/0x410
> > > [ 9718.543066][T13609] ? rcu_read_lock_sched_held+0xa1/0xd0
> > > [ 9718.544333][T13609] ? check_flags.part.44+0x86/0x220
> > > [ 9718.545186][T13609] ? check_flags+0x26/0x30
> > > [ 9718.545870][T13609] ? lock_is_held_type+0xc9/0x100
> > > [ 9718.546651][T13609] do_relocation+0x242/0xc90
> > > [ 9718.547372][T13609] ? select_reloc_root+0x6a0/0x6a0
> > > [ 9718.548160][T13609] ? check_flags.part.44+0x86/0x220
> > > [ 9718.548969][T13609] ? __kasan_check_read+0x11/0x20
> > > [ 9718.549745][T13609] ? mark_lock+0xa8/0x440
> > > [ 9718.550426][T13609] ? mark_held_locks+0x8d/0xb0
> > > [ 9718.551165][T13609] ? btrfs_backref_cleanup_node+0x5c1/0x600
> > > [ 9718.552079][T13609] ? memcpy+0x4d/0x60
> > > [ 9718.552694][T13609] ? read_extent_buffer+0xcc/0x120
> > > [ 9718.553478][T13609] relocate_tree_blocks+0xa29/0xb00
> > > [ 9718.554255][T13609] ? do_relocation+0xc90/0xc90
> > > [ 9718.554978][T13609] ? kmem_cache_alloc_trace+0x5af/0x740
> > > [ 9718.555855][T13609] ? free_extent_buffer.part.46+0x90/0x140
> > > [ 9718.556756][T13609] ? rb_insert_color+0x342/0x360
> > > [ 9718.557581][T13609] ? free_extent_buffer+0x13/0x20
> > > [ 9718.558445][T13609] ? add_tree_block.isra.34+0x236/0x2b0
> > > [ 9718.559387][T13609] relocate_block_group+0x52e/0x830
> > > [ 9718.560275][T13609] ? merge_reloc_roots+0x4b0/0x4b0
> > > [ 9718.561137][T13609] btrfs_relocate_block_group+0x26e/0x4c0
> > > [ 9718.562137][T13609] btrfs_relocate_chunk+0x52/0x120
> > > [ 9718.562918][T13609] btrfs_balance+0xe22/0x1910
> > > [ 9718.563605][T13609] ? check_chain_key+0x1e6/0x2e0
> > > [ 9718.564331][T13609] ? btrfs_relocate_chunk+0x120/0x120
> > > [ 9718.565126][T13609] ? kmem_cache_alloc_trace+0x5af/0x740
> > > [ 9718.565943][T13609] ? _copy_from_user+0x95/0xd0
> > > [ 9718.566649][T13609] btrfs_ioctl_balance+0x3de/0x4c0
> > > [ 9718.567414][T13609] btrfs_ioctl+0x2385/0x4250
> > > [ 9718.568090][T13609] ? __kasan_check_read+0x11/0x20
> > > [ 9718.568830][T13609] ? check_chain_key+0x1e6/0x2e0
> > > [ 9718.569619][T13609] ? btrfs_ioctl_get_supported_features+0x30/0x30
> > > [ 9718.570658][T13609] ? kvm_sched_clock_read+0x18/0x30
> > > [ 9718.571526][T13609] ? check_chain_key+0x1e6/0x2e0
> > > [ 9718.572348][T13609] ? lock_downgrade+0x3e0/0x3e0
> > > [ 9718.573121][T13609] ? do_vfs_ioctl+0xfc/0x9d0
> > > [ 9718.573835][T13609] ? ioctl_file_clone+0xe0/0xe0
> > > [ 9718.574637][T13609] ? check_flags.part.44+0x86/0x220
> > > [ 9718.575472][T13609] ? check_flags+0x26/0x30
> > > [ 9718.576190][T13609] ? lock_is_held_type+0xc9/0x100
> > > [ 9718.576990][T13609] ? check_flags.part.44+0x86/0x220
> > > [ 9718.577836][T13609] ? check_flags+0x26/0x30
> > > [ 9718.578542][T13609] ? lock_is_held_type+0xc9/0x100
> > > [ 9718.579403][T13609] ? __kasan_check_read+0x11/0x20
> > > [ 9718.580225][T13609] ? __fget_light+0xae/0x110
> > > [ 9718.580983][T13609] ksys_ioctl+0xa1/0xe0
> > > [ 9718.581628][T13609] __x64_sys_ioctl+0x43/0x50
> > > [ 9718.582334][T13609] do_syscall_64+0x60/0xf0
> > > [ 9718.583285][T13609] entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > > [ 9718.584378][T13609] RIP: 0033:0x7f9577e85427
> > > [ 9718.585289][T13609] Code: Bad RIP value.
> > > [ 9718.586076][T13609] RSP: 002b:00007ffdc7b82548 EFLAGS: 00000206 ORIG_RAX: 0000000000000010
> > > [ 9718.587896][T13609] RAX: ffffffffffffffda RBX: 00007ffdc7b825e8 RCX: 00007f9577e85427
> > > [ 9718.589391][T13609] RDX: 00007ffdc7b825e8 RSI: 00000000c4009420 RDI: 0000000000000003
> > > [ 9718.590817][T13609] RBP: 0000000000000003 R08: 0000000000000003 R09: 0000000000000078
> > > [ 9718.592631][T13609] R10: fffffffffffff31c R11: 0000000000000206 R12: 0000000000000001
> > > [ 9718.594405][T13609] R13: 0000000000000000 R14: 00007ffdc7b84a48 R15: 0000000000000001
> > > [ 9718.596109][T13609] Modules linked in:
> > > [ 9718.597056][T13609] ---[ end trace 2cf173f8217fc093 ]---
> > > [ 9718.598018][T13609] RIP: 0010:create_reloc_root+0x468/0x480
> > > [ 9718.602850][T13609] Code: e8 bd 5b bd ff 4d 8b 76 50 be 08 00 00 00 49 8d bc 24 f0 00 00 00 e8 c7 5b bd ff 4d 89 b4 24 f0 00 00 00 e9 ee fc ff ff 0f 0b <0f> 0b 0f 0b 0f 0b 0f 0b e8 9b df 07 01 66 66 2e 0f 1f 84 00 00 00
> > > [ 9718.613371][T13609] RSP: 0018:ffffc900018e7018 EFLAGS: 00010282
> > > [ 9718.621286][T13609] RAX: 00000000ffffffef RBX: ffff8881e103a400 RCX: 0000000000000000
> > > [ 9718.631255][T13609] RDX: dffffc0000000000 RSI: 0000000000000000 RDI: 0000000000000246
> > > [ 9718.639764][T13609] RBP: ffffc900018e7108 R08: 0000000000000000 R09: 0000000000000001
> > > [ 9718.641533][T13609] R10: 0000000000000001 R11: fffffbfff3dfb081 R12: ffff8881f37c8020
> > > [ 9718.643173][T13609] R13: ffff88801fbc5b28 R14: ffff8881f37c8000 R15: ffffc900018e70a0
> > > [ 9718.644840][T13609] FS: 00007f9577d928c0(0000) GS:ffff8881f5800000(0000) knlGS:0000000000000000
> > > [ 9718.646728][T13609] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > [ 9718.648607][T13609] CR2: 00007f9823e35500 CR3: 00000000a52e0002 CR4: 00000000001606e0
> > > [ 9718.869689][ T4545] ==================================================================
> > >
> > > same line, different call stack:
> > >
> > > 0xffffffff81933dd8 is in create_reloc_root (fs/btrfs/relocation.c:794).
> > > 789 btrfs_tree_unlock(eb);
> > > 790 free_extent_buffer(eb);
> > > 791
> > > 792 ret = btrfs_insert_root(trans, fs_info->tree_root,
> > > 793 &root_key, root_item);
> > > 794 BUG_ON(ret);
> > > 795 kfree(root_item);
> > > 796
> > > 797 reloc_root = btrfs_read_tree_root(fs_info->tree_root, &root_key);
> > > 798 BUG_ON(IS_ERR(reloc_root));
> > >
> > > followed by
> > >
> > > [ 9718.869689][ T4545] ==================================================================
> > > [ 9718.871333][ T4545] BUG: KASAN: use-after-free in __mutex_lock+0x202/0xce0
> > > [ 9718.872483][ T4545] Read of size 4 at addr ffff888014e9402c by task crawl_28443/4545
> > > [ 9718.873746][ T4545]
> > > [ 9718.874106][ T4545] CPU: 1 PID: 4545 Comm: crawl_28443 Tainted: G D W 5.8.0-6582a95aabfe+ #44
> > > [ 9718.875684][ T4545] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
> > > [ 9718.877149][ T4545] Call Trace:
> > > [ 9718.877655][ T4545] dump_stack+0xc8/0x11a
> > > [ 9718.878317][ T4545] ? __mutex_lock+0x202/0xce0
> > > [ 9718.879065][ T4545] print_address_description.constprop.8+0x1f/0x200
> > > [ 9718.880167][ T4545] ? __mutex_lock+0x202/0xce0
> > > [ 9718.880916][ T4545] ? __mutex_lock+0x202/0xce0
> > > [ 9718.881666][ T4545] kasan_report.cold.11+0x20/0x3e
> > > [ 9718.882483][ T4545] ? __mutex_lock+0x202/0xce0
> > > [ 9718.883229][ T4545] __asan_load4+0x69/0x90
> > > [ 9718.883920][ T4545] __mutex_lock+0x202/0xce0
> > > [ 9718.884651][ T4545] ? wait_current_trans+0xb7/0x230
> > > [ 9718.885465][ T4545] ? btrfs_record_root_in_trans+0x7e/0xc0
> > > [ 9718.886388][ T4545] ? mutex_lock_io_nested+0xc20/0xc20
> > > [ 9718.887246][ T4545] ? __kasan_check_read+0x11/0x20
> > > [ 9718.888035][ T4545] ? join_transaction+0x32/0x6f0
> > > [ 9718.888854][ T4545] ? join_transaction+0x1a6/0x6f0
> > > [ 9718.889679][ T4545] ? lock_downgrade+0x3e0/0x3e0
> > > [ 9718.890496][ T4545] ? __kasan_check_write+0x14/0x20
> > > [ 9718.891308][ T4545] ? lock_contended+0x720/0x720
> > > [ 9718.892093][ T4545] ? do_raw_spin_lock+0x1e0/0x1e0
> > > [ 9718.892912][ T4545] ? wait_current_trans+0xb7/0x230
> > > [ 9718.893705][ T4545] mutex_lock_nested+0x1b/0x20
> > > [ 9718.894494][ T4545] ? mutex_lock_nested+0x1b/0x20
> > > [ 9718.895317][ T4545] btrfs_record_root_in_trans+0x7e/0xc0
> > > [ 9718.896245][ T4545] start_transaction+0x189/0x8f0
> > > [ 9718.897081][ T4545] btrfs_start_transaction+0x1e/0x20
> > > [ 9718.897941][ T4545] btrfs_cont_expand+0x549/0x7a0
> > > [ 9718.898805][ T4545] ? btrfs_truncate_block+0x930/0x930
> > > [ 9718.899665][ T4545] ? inode_newsize_ok+0x75/0xc0
> > > [ 9718.900438][ T4545] ? setattr_prepare+0x9c/0x310
> > > [ 9718.901242][ T4545] btrfs_setattr+0x514/0x850
> > > [ 9718.902035][ T4545] ? current_time+0x8c/0xe0
> > > [ 9718.902799][ T4545] notify_change+0x4ec/0x700
> > > [ 9718.903584][ T4545] ? do_sys_ftruncate+0x108/0x220
> > > [ 9718.904459][ T4545] do_truncate+0xe4/0x160
> > > [ 9718.905200][ T4545] ? __x64_sys_openat2+0x170/0x170
> > > [ 9718.906116][ T4545] ? __sb_start_write+0x1a1/0x270
> > > [ 9718.906954][ T4545] do_sys_ftruncate+0x1b8/0x220
> > > [ 9718.907759][ T4545] __x64_sys_ftruncate+0x36/0x40
> > > [ 9718.908577][ T4545] do_syscall_64+0x60/0xf0
> > > [ 9718.909292][ T4545] entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > > [ 9718.910521][ T4545] RIP: 0033:0x7f201fcab947
> > > [ 9718.911247][ T4545] Code: Bad RIP value.
> > > [ 9718.911915][ T4545] RSP: 002b:00007f201d3abeb8 EFLAGS: 00000202 ORIG_RAX: 000000000000004d
> > > [ 9718.913285][ T4545] RAX: ffffffffffffffda RBX: 00007f201d3abfa0 RCX: 00007f201fcab947
> > > [ 9718.914613][ T4545] RDX: 000000005f18a6d2 RSI: 0000000000286000 RDI: 0000000000000ec1
> > > [ 9718.915921][ T4545] RBP: 00007f1fb01c2f00 R08: 00007ffe1e345080 R09: 00000000011b1f78
> > > [ 9718.917236][ T4545] R10: 00000000011b1f78 R11: 0000000000000202 R12: 00007f201d3abf20
> > > [ 9718.918556][ T4545] R13: 00007f201d3abef0 R14: 00007f201d3abf50 R15: 00007f201d3abed0
> > > [ 9718.919882][ T4545]
> > > [ 9718.920268][ T4545] Allocated by task 6732:
> > > [ 9718.920973][ T4545] save_stack+0x21/0x50
> > > [ 9718.921648][ T4545] __kasan_kmalloc.constprop.17+0xc1/0xd0
> > > [ 9718.922580][ T4545] kasan_slab_alloc+0x12/0x20
> > > [ 9718.923345][ T4545] kmem_cache_alloc_node+0x113/0x720
> > > [ 9718.924203][ T4545] copy_process+0x357/0x3680
> > > [ 9718.924955][ T4545] _do_fork+0xed/0x880
> > > [ 9718.925622][ T4545] __do_sys_clone+0xee/0x130
> > > [ 9718.926369][ T4545] __x64_sys_clone+0x67/0x80
> > > [ 9718.927119][ T4545] do_syscall_64+0x60/0xf0
> > > [ 9718.927848][ T4545] entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > > [ 9718.928812][ T4545]
> > > [ 9718.929173][ T4545] Freed by task 24:
> > > [ 9718.929787][ T4545] save_stack+0x21/0x50
> > > [ 9718.930453][ T4545] __kasan_slab_free+0x118/0x170
> > > [ 9718.931242][ T4545] kasan_slab_free+0xe/0x10
> > > [ 9718.931970][ T4545] kmem_cache_free+0x5f/0x280
> > > [ 9718.932730][ T4545] free_task+0x73/0x90
> > > [ 9718.933391][ T4545] __put_task_struct+0x199/0x1d0
> > > [ 9718.934187][ T4545] delayed_put_task_struct+0x124/0x1b0
> > > [ 9718.935071][ T4545] rcu_core+0x3b0/0xeb0
> > > [ 9718.935758][ T4545] rcu_core_si+0xe/0x10
> > > [ 9718.936433][ T4545] __do_softirq+0x120/0x5e3
> > > [ 9718.937165][ T4545]
> > > [ 9718.937545][ T4545] The buggy address belongs to the object at ffff888014e94000
> > > [ 9718.937545][ T4545] which belongs to the cache task_struct(168:screen-wrapper.service) of size 11072
> > > [ 9718.940391][ T4545] The buggy address is located 44 bytes inside of
> > > [ 9718.940391][ T4545] 11072-byte region [ffff888014e94000, ffff888014e96b40)
> > > [ 9718.942559][ T4545] The buggy address belongs to the page:
> > > [ 9718.943454][ T4545] page:ffffea000053a500 refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff888014e97fff head:ffffea000053a500 order:2 compound_mapcount:0 compound_pincount:0
> > > [ 9718.946072][ T4545] flags: 0xfffe0000010200(slab|head)
> > > [ 9718.946958][ T4545] raw: 00fffe0000010200 ffffea00011ab108 ffffea0001d6f108 ffff8881eabd9700
> > > [ 9718.948406][ T4545] raw: ffff888014e97fff ffff888014e94000 0000000100000001 0000000000000000
> > > [ 9718.949889][ T4545] page dumped because: kasan: bad access detected
> > > [ 9718.950977][ T4545]
> > > [ 9718.951354][ T4545] Memory state around the buggy address:
> > > [ 9718.952296][ T4545] ffff888014e93f00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> > > [ 9718.953641][ T4545] ffff888014e93f80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> > > [ 9718.955004][ T4545] >ffff888014e94000: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> > > [ 9718.956366][ T4545] ^
> > > [ 9718.957258][ T4545] ffff888014e94080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> > > [ 9718.958653][ T4545] ffff888014e94100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> > > [ 9718.960034][ T4545] ==================================================================
> > >
> >
>
>
>
* Re: BUG at fs/btrfs/relocation.c:794! Still happening on misc-next and 5.8.3
2020-08-28 0:03 ` BUG at fs/btrfs/relocation.c:794! Still happening on misc-next and 5.8.3 Zygo Blaxell
@ 2020-08-28 0:08 ` Zygo Blaxell
2020-08-28 6:34 ` Nikolay Borisov
0 siblings, 1 reply; 13+ messages in thread
From: Zygo Blaxell @ 2020-08-28 0:08 UTC (permalink / raw)
To: Qu Wenruo; +Cc: David Sterba, linux-btrfs, wqu
On Thu, Aug 27, 2020 at 08:03:13PM -0400, Zygo Blaxell wrote:
> On Tue, Aug 04, 2020 at 12:16:26PM -0400, Zygo Blaxell wrote:
> > On Fri, Jul 24, 2020 at 08:19:36AM +0800, Qu Wenruo wrote:
> > >
> > >
> > > > On 2020/7/24 5:56 AM, Zygo Blaxell wrote:
> > > > On Wed, Jul 01, 2020 at 12:10:06AM +0200, David Sterba wrote:
> > > >> Hi,
> > > >>
> > > >> I've hit a crash in relocation I've never seen before.
> > > >>
> > > >> [ 2129.210066] kernel BUG at fs/btrfs/relocation.c:794!
> > > >
> > > > I hit an issue yesterday that reminded me of this.
> > > >
> > > >> [ 2129.215268] invalid opcode: 0000 [#1] PREEMPT SMP
> > > >> [ 2129.220114] CPU: 1 PID: 3303 Comm: btrfs Not tainted 5.8.0-rc3-git+ #638
> > > >> [ 2129.220116] Hardware name: empty empty/S3993, BIOS PAQEX0-3 02/24/2008
> > > >> [ 2129.220265] RIP: 0010:create_reloc_root+0x214/0x260 [btrfs]
> > > >> [ 2129.258760] RSP: 0018:ffffbe1e809b38b8 EFLAGS: 00010282
> > > >> [ 2129.258763] RAX: 00000000ffffffef RBX: ffff988d577f9000 RCX: 0000000000000000
> > > >> [ 2129.258765] RDX: 0000000000000001 RSI: ffffffff8e2a2580 RDI: ffff988d64aaa6a8
> > > >> [ 2129.258766] RBP: ffff988d5dfcdc00 R08: 0000000000000000 R09: 0000000000000000
> > > >> [ 2129.258767] R10: 0000000000000001 R11: 0000000000000000 R12: ffff988d0e02fa78
> > > >> [ 2129.258769] R13: 0000000000000005 R14: ffff988d64fe8000 R15: ffff988d0e02fa78
> > > >> [ 2129.258771] FS: 00007f82a612e8c0(0000) GS:ffff988d67000000(0000) knlGS:0000000000000000
> > > >> [ 2129.258772] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > >> [ 2129.258774] CR2: 000000000559d028 CR3: 000000020b289000 CR4: 00000000000006e0
> > > >> [ 2129.258775] Call Trace:
> > > >> [ 2129.258825] btrfs_init_reloc_root+0xe8/0x120 [btrfs]
> > > >> [ 2129.258862] record_root_in_trans+0xae/0xd0 [btrfs]
> > > >> [ 2129.258901] btrfs_record_root_in_trans+0x51/0x70 [btrfs]
> > > >> [ 2129.340388] select_reloc_root+0x94/0x340 [btrfs]
> > > >> [ 2129.340433] do_relocation+0xda/0x7b0 [btrfs]
> > > >> [ 2129.349854] ? _raw_spin_unlock+0x1f/0x40
> > > >> [ 2129.349898] relocate_tree_blocks+0x336/0x670 [btrfs]
> > > >> [ 2129.359325] relocate_block_group+0x2f6/0x600 [btrfs]
> > > >> [ 2129.359365] btrfs_relocate_block_group+0x15e/0x340 [btrfs]
> > > >> [ 2129.359408] btrfs_relocate_chunk+0x38/0x110 [btrfs]
> > > >> [ 2129.375494] __btrfs_balance+0x42c/0xce0 [btrfs]
> > > >> [ 2129.375553] btrfs_balance+0x66a/0xbe0 [btrfs]
> > > >> [ 2129.375562] ? kmem_cache_alloc_trace+0x19c/0x330
> > > >> [ 2129.389852] btrfs_ioctl_balance+0x298/0x350 [btrfs]
> > > >> [ 2129.389887] btrfs_ioctl+0x304/0x2490 [btrfs]
> > > >> [ 2129.389898] ? do_user_addr_fault+0x221/0x49c
> > > >> [ 2129.404070] ? sched_clock_cpu+0x15/0x140
> > > >> [ 2129.404073] ? do_user_addr_fault+0x221/0x49c
> > > >> [ 2129.404079] ? up_read+0x18/0x240
> > > >> [ 2129.404086] ? ksys_ioctl+0x68/0xa0
> > > >> [ 2129.404091] ksys_ioctl+0x68/0xa0
> > > >> [ 2129.423308] __x64_sys_ioctl+0x16/0x20
> > > >> [ 2129.423312] do_syscall_64+0x50/0xe0
> > > >> [ 2129.423315] entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > > >> [ 2129.423318] RIP: 0033:0x7f82a51c6327
> > > >> [ 2129.423319] Code: Bad RIP value.
> > > >> [ 2129.423348] RSP: 002b:00007ffd32cf6218 EFLAGS: 00000206 ORIG_RAX: 0000000000000010
> > > >> [ 2129.423367] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f82a51c6327
> > > >> [ 2129.423368] RDX: 00007ffd32cf62a0 RSI: 00000000c4009420 RDI: 0000000000000003
> > > >> [ 2129.423372] RBP: 0000000000000003 R08: 0000000000000000 R09: 0000000000000000
> > > >> [ 2129.423377] R10: 000000000fa99fa0 R11: 0000000000000206 R12: 00007ffd32cf8823
> > > >> [ 2129.423379] R13: 00007ffd32cf62a0 R14: 0000000000000001 R15: 0000000000000000
> > > >>
> > > >> Relevant code called from create_reloc_root:
> > > >>
> > > >> ret = btrfs_insert_root(trans, fs_info->tree_root,
> > > >> &root_key, root_item);
> > > >> BUG_ON(ret)
> > > >>
> > > >> and according to EAX, ret is -17 which is EEXIST.
> > > >>
> > > >> I don't have a reproducer, the testing image has been filled by random git
> > > >> checkouts, deduplicated by BEES, then tons of snapshots created until the
> > > >> metadata got exhausted, some file deletion and balances.
> > > >
> > > > Mine is rsync, bees, lots of snapshots, balances, scrubs. I recently also
> > > > added random 'killall -INT btrfs' to send balance some fatal signals.
> > > >
> > > >> This is the same image that led to the patch "btrfs: allow use of global block
> > > >> reserve for balance item deletion", so this could have left it in some
> > > >> intermediate state where the balance item was not removed and the reloc tree as
> > > >> well.
> > > >>
> > > >> There were a few unsuccessful mounts due to relocation recovery, which I
> > > >> was trying to debug, but then it started to work.
> > > >>
> > > >> The error happened with this 'fi df' saved after the balance start:
> > > >>
> > > >> # btrfs fi df mnt
> > > >> Data, single: total=80.01GiB, used=38.67GiB
> > > >> System, single: total=4.00MiB, used=16.00KiB
> > > >> Metadata, single: total=19.99GiB, used=19.46GiB
> > > >> GlobalReserve, single: total=512.00MiB, used=44.00KiB
> > > >
> > > > Mine is:
> > > >
> > > > Data, single: total=1.75TiB, used=1.74TiB
> > > > System, RAID1: total=32.00MiB, used=208.00KiB
> > > > Metadata, RAID1: total=25.00GiB, used=22.89GiB
> > > > GlobalReserve, single: total=512.00MiB, used=0.00B
> > > >
> > > > though this is some time after the failure (and a reboot). I do notice
> > > > that there's lots of unallocated space, but metadata usage is close
> > > > to allocated, and I have been experiencing a lot of EROFS events when
> > > > that happens, even if there's gigabytes unallocated.
> > > >
> > > > btrfs fi us:
> > > >
> > > > Overall:
> > > > Device size: 2.00TiB
> > > > Device allocated: 1.80TiB
> > > > Device unallocated: 208.94GiB
> > > > Device missing: 0.00B
> > > > Used: 1.79TiB
> > > > Free (estimated): 211.30GiB (min: 106.83GiB)
> > > > Data ratio: 1.00
> > > > Metadata ratio: 2.00
> > > > Global reserve: 512.00MiB (used: 0.00B)
> > > >
> > > > Data,single: Size:1.75TiB, Used:1.74TiB (99.87%)
> > > > /dev/mapper/vgtest-tvdb 894.00GiB
> > > > /dev/mapper/vgtest-tvdc 895.00GiB
> > > >
> > > > Metadata,RAID1: Size:25.00GiB, Used:22.87GiB (91.47%)
> > > > /dev/mapper/vgtest-tvdb 25.00GiB
> > > > /dev/mapper/vgtest-tvdc 25.00GiB
> > > >
> > > > System,RAID1: Size:32.00MiB, Used:208.00KiB (0.63%)
> > > > /dev/mapper/vgtest-tvdb 32.00MiB
> > > > /dev/mapper/vgtest-tvdc 32.00MiB
> > > >
> > > > Unallocated:
> > > > /dev/mapper/vgtest-tvdb 104.97GiB
> > > > /dev/mapper/vgtest-tvdc 103.97GiB
> > > >
> > > >> The error looks like a repeated relocation tree creation, which would point to
> > > >> the unsuccessful balances or inconsistent state (balance item, reloc trees).
> > > >> It's not a "typical" mix of operations but I'd appreciate any insights here.
> > > >
> > > > I have the same line but different call stack, with misc-next
> > > > e3027d10af42d24940be74dabaf1550cd770bd48:
> > > >
> > > > [ 9717.746937][T13609] BTRFS info (device dm-0): balance: start -mlimit=1 -slimit=1
> > > > [ 9717.765086][T13609] BTRFS info (device dm-0): relocating block group 10991411658752 flags metadata|raid1
> > > > [ 9718.511137][T13609] ------------[ cut here ]------------
> > > > [ 9718.512293][T13609] kernel BUG at fs/btrfs/relocation.c:794!
> > > > [ 9718.513421][T13609] invalid opcode: 0000 [#1] SMP KASAN PTI
> > > > [ 9718.514590][T13609] CPU: 1 PID: 13609 Comm: btrfs Tainted: G W 5.8.0-6582a95aabfe+ #44
> > > > [ 9718.516178][T13609] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
> > > > [ 9718.517750][T13609] RIP: 0010:create_reloc_root+0x468/0x480
> > > > [ 9718.518717][T13609] Code: e8 bd 5b bd ff 4d 8b 76 50 be 08 00 00 00 49 8d bc 24 f0 00 00 00 e8 c7 5b bd ff 4d 89 b4 24 f0 00 00 00 e9 ee fc ff ff 0f 0b <0f> 0b 0f 0b 0f 0b 0f 0b e8 9b df 07 01 66 66 2e 0f 1f 84 00 00 00
> > > > [ 9718.521995][T13609] RSP: 0018:ffffc900018e7018 EFLAGS: 00010282
> > > > [ 9718.522991][T13609] RAX: 00000000ffffffef RBX: ffff8881e103a400 RCX: 0000000000000000
> > > > [ 9718.524300][T13609] RDX: dffffc0000000000 RSI: 0000000000000000 RDI: 0000000000000246
> > > > [ 9718.525612][T13609] RBP: ffffc900018e7108 R08: 0000000000000000 R09: 0000000000000001
> > > > [ 9718.527056][T13609] R10: 0000000000000001 R11: fffffbfff3dfb081 R12: ffff8881f37c8020
> > > > [ 9718.528386][T13609] R13: ffff88801fbc5b28 R14: ffff8881f37c8000 R15: ffffc900018e70a0
> > > > [ 9718.529756][T13609] FS: 00007f9577d928c0(0000) GS:ffff8881f5800000(0000) knlGS:0000000000000000
> > > > [ 9718.531211][T13609] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > > [ 9718.532295][T13609] CR2: 00007f9823e35500 CR3: 00000000a52e0002 CR4: 00000000001606e0
> > > > [ 9718.533608][T13609] Call Trace:
> > > > [ 9718.534151][T13609] ? update_backref_node+0xf0/0xf0
> > > > [ 9718.535137][T13609] ? check_chain_key+0x1e6/0x2e0
> > > > [ 9718.536057][T13609] btrfs_init_reloc_root+0x2d7/0x310
> > >
> > > That's the same problem.
> > >
> > > btrfs_init_reloc_root() got -EEXIST and triggered BUG_ON().
> > >
> > > In that case, that means there are some reloc trees not cleaned up.
> > >
> > > Would you mind providing the "btrfs ins dump-tree -t root" dump for
> > > that fs if the problem still happens?
> >
> > http://furryterror.org/~zblaxell/tmp/.tvdb/tvdb.txt
> >
> > The problem is now happening multiple times per day, starting with
> > kdave's misc-next e3027d10af42d24940be74dabaf1550cd770bd48 Date: Thu
> > Jul 23 00:18:04 2020 +0900 and continuing on v5.8.0.
> >
> > The previous misc-next (that I have test data for),
> > cb799f0a0bb372f37f96893d2e80c1dc2f5206da Date: Thu Jul 16 13:29:46 2020
> > -0700 does not have this problem.
> >
> > These commit hashes are from https://gitlab.com/kdave/btrfs-devel.
>
> Still hitting this bug every few hours on all 5.8.x so far, and misc-next.
>
> There is a strong correlation between hitting the bug and starting a metadata
> block group in balance, and a weaker correlation with data balances:
>
> Aug 23 05:04:05 regress kernel: [53458.128928][ T9737] BTRFS info (device dm-0): relocating block group 14939862335488 flags metadata|raid1
> Aug 23 05:04:05 regress kernel: [53458.999342][ T9737] ------------[ cut here ]------------
> Aug 23 05:04:05 regress kernel: [53459.000275][ T9737] kernel BUG at fs/btrfs/relocation.c:794!
>
> Aug 24 01:23:52 regress kernel: [58662.545914][T17474] BTRFS info (device dm-0): relocating block group 15083978620928 flags metadata|raid1
> Aug 24 01:23:54 regress kernel: [58664.778274][T17474] ------------[ cut here ]------------
> Aug 24 01:23:54 regress kernel: [58664.782182][T17474] kernel BUG at fs/btrfs/relocation.c:794!
>
> Aug 24 07:17:07 regress kernel: [21068.421134][T29457] BTRFS info (device dm-0): relocating block group 15160784715776 flags metadata|raid1
> Aug 24 07:17:08 regress kernel: [21069.307661][ T5176] ------------[ cut here ]------------
> Aug 24 07:17:08 regress kernel: [21069.309195][ T5176] kernel BUG at fs/btrfs/relocation.c:794!
>
> Aug 25 18:58:26 regress kernel: [22013.457555][ T2164] BTRFS info (device dm-0): relocating block group 15530051239936 flags metadata|raid1
> Aug 25 18:58:27 regress kernel: [22014.460689][ T4939] ------------[ cut here ]------------
> Aug 25 18:58:27 regress kernel: [22014.461653][ T4939] kernel BUG at fs/btrfs/relocation.c:794!
>
> Aug 26 03:39:20 regress kernel: [31172.016638][T30882] BTRFS info (device dm-0): relocating block group 15576759009280 flags metadata|raid1
> Aug 26 03:39:21 regress kernel: [31173.329719][T12663] ------------[ cut here ]------------
> Aug 26 03:39:21 regress kernel: [31173.330682][T12663] kernel BUG at fs/btrfs/relocation.c:794!
>
> Aug 26 16:00:02 regress kernel: [44334.231395][T25917] BTRFS info (device dm-0): relocating block group 15631888941056 flags data
> Aug 26 16:00:04 regress kernel: [44336.800710][T26519] ------------[ cut here ]------------
> Aug 26 16:00:04 regress kernel: [44336.802888][T26519] kernel BUG at fs/btrfs/relocation.c:794!
>
> Aug 27 15:45:29 regress kernel: [55423.626717][ T5878] BTRFS info (device dm-0): relocating block group 15820229967872 flags metadata|raid1
> Aug 27 15:45:29 regress kernel: [55423.798584][T15744] ------------[ cut here ]------------
> Aug 27 15:45:29 regress kernel: [55423.802581][T15744] kernel BUG at fs/btrfs/relocation.c:794!
>
> Aug 27 17:35:26 regress kernel: [ 6459.129124][T21053] BTRFS info (device dm-0): relocating block group 15831168712704 flags metadata|raid1
> Aug 27 17:35:43 regress kernel: [ 6475.931029][T25720] ------------[ cut here ]------------
> Aug 27 17:35:43 regress kernel: [ 6475.932403][T25720] kernel BUG at fs/btrfs/relocation.c:794!
>
> There don't seem to be any instances of the BUG that did not occur
> within 30 seconds of starting a balance.
>
> The on-disk data is fine. After a reboot the same block group can be
> successfully balanced.
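The 30-second observation in the quoted text can be checked mechanically against the syslog excerpts. What follows is only a minimal userspace sketch, assuming the line layout shown in the log above (a "kernel: [seconds.micros]" uptime stamp). The helper names kernel_ts and bug_delay are made up for illustration and are not part of any btrfs tooling.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: pull the kernel uptime stamp out of a syslog line
 * shaped like "... kernel: [53458.128928][ T9737] ...".
 * Returns 0 on success, -1 if no stamp is found. */
static int kernel_ts(const char *line, double *out)
{
	static const char tag[] = "kernel: [";
	const char *p = strstr(line, tag);

	if (!p)
		return -1;
	/* %lf skips the leading padding in stamps like "[ 6459.129124]" */
	return sscanf(p + sizeof(tag) - 1, "%lf", out) == 1 ? 0 : -1;
}

/* Seconds between the "relocating block group" line and the BUG line,
 * or -1.0 if either line lacks a stamp. */
static double bug_delay(const char *reloc_line, const char *bug_line)
{
	double t0, t1;

	if (kernel_ts(reloc_line, &t0) || kernel_ts(bug_line, &t1))
		return -1.0;
	return t1 - t0;
}

#ifdef DEMO_MAIN
int main(void)
{
	/* Last pair from the list above, the slowest of the eight. */
	double d = bug_delay(
		"Aug 27 17:35:26 regress kernel: [ 6459.129124][T21053] BTRFS info (device dm-0): relocating block group 15831168712704 flags metadata|raid1",
		"Aug 27 17:35:43 regress kernel: [ 6475.932403][T25720] kernel BUG at fs/btrfs/relocation.c:794!");

	assert(d >= 0.0 && d < 30.0);
	printf("delay: %.3f s\n", d);
	return 0;
}
#endif
```

Build with -DDEMO_MAIN to run the demo. The uptime stamp is used rather than the wall-clock prefix because it has microsecond resolution, but the two lines must come from the same boot for the difference to mean anything.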
Forgot to mention the failure rate: 8 crashes (listed above) among 1492
block groups balanced over the same 4-day period.
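For scale, those figures work out to a little over half a percent of balanced block groups, or about one crash per 186 block groups. A small sketch of the arithmetic, with a hypothetical helper name:

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: share of balanced block groups that hit the BUG,
 * as a percentage. Returns 0.0 when nothing was balanced. */
static double failure_rate_percent(unsigned int crashes,
				   unsigned int block_groups)
{
	if (block_groups == 0)
		return 0.0;
	return 100.0 * (double)crashes / (double)block_groups;
}

#ifdef DEMO_MAIN
int main(void)
{
	/* 8 crashes among 1492 block groups, as reported above. */
	double rate = failure_rate_percent(8, 1492);

	assert(rate > 0.0 && rate < 1.0);
	printf("failure rate: %.2f%% (one crash per %.1f block groups)\n",
	       rate, 1492.0 / 8.0);
	return 0;
}
#endif
```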
> >
> > > Thanks,
> > > Qu
> > > > [ 9718.537016][T13609] ? find_reloc_root+0x200/0x200
> > > > [ 9718.537992][T13609] ? do_raw_spin_unlock+0xa8/0x140
> > > > [ 9718.538899][T13609] record_root_in_trans+0x18c/0x1d0
> > > > [ 9718.539848][T13609] btrfs_record_root_in_trans+0x8b/0xc0
> > > > [ 9718.540843][T13609] select_reloc_root+0x15f/0x6a0
> > > > [ 9718.541943][T13609] ? create_reloc_inode.isra.28+0x410/0x410
> > > > [ 9718.543066][T13609] ? rcu_read_lock_sched_held+0xa1/0xd0
> > > > [ 9718.544333][T13609] ? check_flags.part.44+0x86/0x220
> > > > [ 9718.545186][T13609] ? check_flags+0x26/0x30
> > > > [ 9718.545870][T13609] ? lock_is_held_type+0xc9/0x100
> > > > [ 9718.546651][T13609] do_relocation+0x242/0xc90
> > > > [ 9718.547372][T13609] ? select_reloc_root+0x6a0/0x6a0
> > > > [ 9718.548160][T13609] ? check_flags.part.44+0x86/0x220
> > > > [ 9718.548969][T13609] ? __kasan_check_read+0x11/0x20
> > > > [ 9718.549745][T13609] ? mark_lock+0xa8/0x440
> > > > [ 9718.550426][T13609] ? mark_held_locks+0x8d/0xb0
> > > > [ 9718.551165][T13609] ? btrfs_backref_cleanup_node+0x5c1/0x600
> > > > [ 9718.552079][T13609] ? memcpy+0x4d/0x60
> > > > [ 9718.552694][T13609] ? read_extent_buffer+0xcc/0x120
> > > > [ 9718.553478][T13609] relocate_tree_blocks+0xa29/0xb00
> > > > [ 9718.554255][T13609] ? do_relocation+0xc90/0xc90
> > > > [ 9718.554978][T13609] ? kmem_cache_alloc_trace+0x5af/0x740
> > > > [ 9718.555855][T13609] ? free_extent_buffer.part.46+0x90/0x140
> > > > [ 9718.556756][T13609] ? rb_insert_color+0x342/0x360
> > > > [ 9718.557581][T13609] ? free_extent_buffer+0x13/0x20
> > > > [ 9718.558445][T13609] ? add_tree_block.isra.34+0x236/0x2b0
> > > > [ 9718.559387][T13609] relocate_block_group+0x52e/0x830
> > > > [ 9718.560275][T13609] ? merge_reloc_roots+0x4b0/0x4b0
> > > > [ 9718.561137][T13609] btrfs_relocate_block_group+0x26e/0x4c0
> > > > [ 9718.562137][T13609] btrfs_relocate_chunk+0x52/0x120
> > > > [ 9718.562918][T13609] btrfs_balance+0xe22/0x1910
> > > > [ 9718.563605][T13609] ? check_chain_key+0x1e6/0x2e0
> > > > [ 9718.564331][T13609] ? btrfs_relocate_chunk+0x120/0x120
> > > > [ 9718.565126][T13609] ? kmem_cache_alloc_trace+0x5af/0x740
> > > > [ 9718.565943][T13609] ? _copy_from_user+0x95/0xd0
> > > > [ 9718.566649][T13609] btrfs_ioctl_balance+0x3de/0x4c0
> > > > [ 9718.567414][T13609] btrfs_ioctl+0x2385/0x4250
> > > > [ 9718.568090][T13609] ? __kasan_check_read+0x11/0x20
> > > > [ 9718.568830][T13609] ? check_chain_key+0x1e6/0x2e0
> > > > [ 9718.569619][T13609] ? btrfs_ioctl_get_supported_features+0x30/0x30
> > > > [ 9718.570658][T13609] ? kvm_sched_clock_read+0x18/0x30
> > > > [ 9718.571526][T13609] ? check_chain_key+0x1e6/0x2e0
> > > > [ 9718.572348][T13609] ? lock_downgrade+0x3e0/0x3e0
> > > > [ 9718.573121][T13609] ? do_vfs_ioctl+0xfc/0x9d0
> > > > [ 9718.573835][T13609] ? ioctl_file_clone+0xe0/0xe0
> > > > [ 9718.574637][T13609] ? check_flags.part.44+0x86/0x220
> > > > [ 9718.575472][T13609] ? check_flags+0x26/0x30
> > > > [ 9718.576190][T13609] ? lock_is_held_type+0xc9/0x100
> > > > [ 9718.576990][T13609] ? check_flags.part.44+0x86/0x220
> > > > [ 9718.577836][T13609] ? check_flags+0x26/0x30
> > > > [ 9718.578542][T13609] ? lock_is_held_type+0xc9/0x100
> > > > [ 9718.579403][T13609] ? __kasan_check_read+0x11/0x20
> > > > [ 9718.580225][T13609] ? __fget_light+0xae/0x110
> > > > [ 9718.580983][T13609] ksys_ioctl+0xa1/0xe0
> > > > [ 9718.581628][T13609] __x64_sys_ioctl+0x43/0x50
> > > > [ 9718.582334][T13609] do_syscall_64+0x60/0xf0
> > > > [ 9718.583285][T13609] entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > > > [ 9718.584378][T13609] RIP: 0033:0x7f9577e85427
> > > > [ 9718.585289][T13609] Code: Bad RIP value.
> > > > [ 9718.586076][T13609] RSP: 002b:00007ffdc7b82548 EFLAGS: 00000206 ORIG_RAX: 0000000000000010
> > > > [ 9718.587896][T13609] RAX: ffffffffffffffda RBX: 00007ffdc7b825e8 RCX: 00007f9577e85427
> > > > [ 9718.589391][T13609] RDX: 00007ffdc7b825e8 RSI: 00000000c4009420 RDI: 0000000000000003
> > > > [ 9718.590817][T13609] RBP: 0000000000000003 R08: 0000000000000003 R09: 0000000000000078
> > > > [ 9718.592631][T13609] R10: fffffffffffff31c R11: 0000000000000206 R12: 0000000000000001
> > > > [ 9718.594405][T13609] R13: 0000000000000000 R14: 00007ffdc7b84a48 R15: 0000000000000001
> > > > [ 9718.596109][T13609] Modules linked in:
> > > > [ 9718.597056][T13609] ---[ end trace 2cf173f8217fc093 ]---
> > > > [ 9718.598018][T13609] RIP: 0010:create_reloc_root+0x468/0x480
> > > > [ 9718.602850][T13609] Code: e8 bd 5b bd ff 4d 8b 76 50 be 08 00 00 00 49 8d bc 24 f0 00 00 00 e8 c7 5b bd ff 4d 89 b4 24 f0 00 00 00 e9 ee fc ff ff 0f 0b <0f> 0b 0f 0b 0f 0b 0f 0b e8 9b df 07 01 66 66 2e 0f 1f 84 00 00 00
> > > > [ 9718.613371][T13609] RSP: 0018:ffffc900018e7018 EFLAGS: 00010282
> > > > [ 9718.621286][T13609] RAX: 00000000ffffffef RBX: ffff8881e103a400 RCX: 0000000000000000
> > > > [ 9718.631255][T13609] RDX: dffffc0000000000 RSI: 0000000000000000 RDI: 0000000000000246
> > > > [ 9718.639764][T13609] RBP: ffffc900018e7108 R08: 0000000000000000 R09: 0000000000000001
> > > > [ 9718.641533][T13609] R10: 0000000000000001 R11: fffffbfff3dfb081 R12: ffff8881f37c8020
> > > > [ 9718.643173][T13609] R13: ffff88801fbc5b28 R14: ffff8881f37c8000 R15: ffffc900018e70a0
> > > > [ 9718.644840][T13609] FS: 00007f9577d928c0(0000) GS:ffff8881f5800000(0000) knlGS:0000000000000000
> > > > [ 9718.646728][T13609] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > > [ 9718.648607][T13609] CR2: 00007f9823e35500 CR3: 00000000a52e0002 CR4: 00000000001606e0
> > > > [ 9718.869689][ T4545] ==================================================================
> > > >
> > > > same line, different call stack:
> > > >
> > > > 0xffffffff81933dd8 is in create_reloc_root (fs/btrfs/relocation.c:794).
> > > > 789 btrfs_tree_unlock(eb);
> > > > 790 free_extent_buffer(eb);
> > > > 791
> > > > 792 ret = btrfs_insert_root(trans, fs_info->tree_root,
> > > > 793 &root_key, root_item);
> > > > 794 BUG_ON(ret);
> > > > 795 kfree(root_item);
> > > > 796
> > > > 797 reloc_root = btrfs_read_tree_root(fs_info->tree_root, &root_key);
> > > > 798 BUG_ON(IS_ERR(reloc_root));
> > > >
> > > > followed by
> > > >
> > > > [ 9718.869689][ T4545] ==================================================================
> > > > [ 9718.871333][ T4545] BUG: KASAN: use-after-free in __mutex_lock+0x202/0xce0
> > > > [ 9718.872483][ T4545] Read of size 4 at addr ffff888014e9402c by task crawl_28443/4545
> > > > [ 9718.873746][ T4545]
> > > > [ 9718.874106][ T4545] CPU: 1 PID: 4545 Comm: crawl_28443 Tainted: G D W 5.8.0-6582a95aabfe+ #44
> > > > [ 9718.875684][ T4545] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
> > > > [ 9718.877149][ T4545] Call Trace:
> > > > [ 9718.877655][ T4545] dump_stack+0xc8/0x11a
> > > > [ 9718.878317][ T4545] ? __mutex_lock+0x202/0xce0
> > > > [ 9718.879065][ T4545] print_address_description.constprop.8+0x1f/0x200
> > > > [ 9718.880167][ T4545] ? __mutex_lock+0x202/0xce0
> > > > [ 9718.880916][ T4545] ? __mutex_lock+0x202/0xce0
> > > > [ 9718.881666][ T4545] kasan_report.cold.11+0x20/0x3e
> > > > [ 9718.882483][ T4545] ? __mutex_lock+0x202/0xce0
> > > > [ 9718.883229][ T4545] __asan_load4+0x69/0x90
> > > > [ 9718.883920][ T4545] __mutex_lock+0x202/0xce0
> > > > [ 9718.884651][ T4545] ? wait_current_trans+0xb7/0x230
> > > > [ 9718.885465][ T4545] ? btrfs_record_root_in_trans+0x7e/0xc0
> > > > [ 9718.886388][ T4545] ? mutex_lock_io_nested+0xc20/0xc20
> > > > [ 9718.887246][ T4545] ? __kasan_check_read+0x11/0x20
> > > > [ 9718.888035][ T4545] ? join_transaction+0x32/0x6f0
> > > > [ 9718.888854][ T4545] ? join_transaction+0x1a6/0x6f0
> > > > [ 9718.889679][ T4545] ? lock_downgrade+0x3e0/0x3e0
> > > > [ 9718.890496][ T4545] ? __kasan_check_write+0x14/0x20
> > > > [ 9718.891308][ T4545] ? lock_contended+0x720/0x720
> > > > [ 9718.892093][ T4545] ? do_raw_spin_lock+0x1e0/0x1e0
> > > > [ 9718.892912][ T4545] ? wait_current_trans+0xb7/0x230
> > > > [ 9718.893705][ T4545] mutex_lock_nested+0x1b/0x20
> > > > [ 9718.894494][ T4545] ? mutex_lock_nested+0x1b/0x20
> > > > [ 9718.895317][ T4545] btrfs_record_root_in_trans+0x7e/0xc0
> > > > [ 9718.896245][ T4545] start_transaction+0x189/0x8f0
> > > > [ 9718.897081][ T4545] btrfs_start_transaction+0x1e/0x20
> > > > [ 9718.897941][ T4545] btrfs_cont_expand+0x549/0x7a0
> > > > [ 9718.898805][ T4545] ? btrfs_truncate_block+0x930/0x930
> > > > [ 9718.899665][ T4545] ? inode_newsize_ok+0x75/0xc0
> > > > [ 9718.900438][ T4545] ? setattr_prepare+0x9c/0x310
> > > > [ 9718.901242][ T4545] btrfs_setattr+0x514/0x850
> > > > [ 9718.902035][ T4545] ? current_time+0x8c/0xe0
> > > > [ 9718.902799][ T4545] notify_change+0x4ec/0x700
> > > > [ 9718.903584][ T4545] ? do_sys_ftruncate+0x108/0x220
> > > > [ 9718.904459][ T4545] do_truncate+0xe4/0x160
> > > > [ 9718.905200][ T4545] ? __x64_sys_openat2+0x170/0x170
> > > > [ 9718.906116][ T4545] ? __sb_start_write+0x1a1/0x270
> > > > [ 9718.906954][ T4545] do_sys_ftruncate+0x1b8/0x220
> > > > [ 9718.907759][ T4545] __x64_sys_ftruncate+0x36/0x40
> > > > [ 9718.908577][ T4545] do_syscall_64+0x60/0xf0
> > > > [ 9718.909292][ T4545] entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > > > [ 9718.910521][ T4545] RIP: 0033:0x7f201fcab947
> > > > [ 9718.911247][ T4545] Code: Bad RIP value.
> > > > [ 9718.911915][ T4545] RSP: 002b:00007f201d3abeb8 EFLAGS: 00000202 ORIG_RAX: 000000000000004d
> > > > [ 9718.913285][ T4545] RAX: ffffffffffffffda RBX: 00007f201d3abfa0 RCX: 00007f201fcab947
> > > > [ 9718.914613][ T4545] RDX: 000000005f18a6d2 RSI: 0000000000286000 RDI: 0000000000000ec1
> > > > [ 9718.915921][ T4545] RBP: 00007f1fb01c2f00 R08: 00007ffe1e345080 R09: 00000000011b1f78
> > > > [ 9718.917236][ T4545] R10: 00000000011b1f78 R11: 0000000000000202 R12: 00007f201d3abf20
> > > > [ 9718.918556][ T4545] R13: 00007f201d3abef0 R14: 00007f201d3abf50 R15: 00007f201d3abed0
> > > > [ 9718.919882][ T4545]
> > > > [ 9718.920268][ T4545] Allocated by task 6732:
> > > > [ 9718.920973][ T4545] save_stack+0x21/0x50
> > > > [ 9718.921648][ T4545] __kasan_kmalloc.constprop.17+0xc1/0xd0
> > > > [ 9718.922580][ T4545] kasan_slab_alloc+0x12/0x20
> > > > [ 9718.923345][ T4545] kmem_cache_alloc_node+0x113/0x720
> > > > [ 9718.924203][ T4545] copy_process+0x357/0x3680
> > > > [ 9718.924955][ T4545] _do_fork+0xed/0x880
> > > > [ 9718.925622][ T4545] __do_sys_clone+0xee/0x130
> > > > [ 9718.926369][ T4545] __x64_sys_clone+0x67/0x80
> > > > [ 9718.927119][ T4545] do_syscall_64+0x60/0xf0
> > > > [ 9718.927848][ T4545] entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > > > [ 9718.928812][ T4545]
> > > > [ 9718.929173][ T4545] Freed by task 24:
> > > > [ 9718.929787][ T4545] save_stack+0x21/0x50
> > > > [ 9718.930453][ T4545] __kasan_slab_free+0x118/0x170
> > > > [ 9718.931242][ T4545] kasan_slab_free+0xe/0x10
> > > > [ 9718.931970][ T4545] kmem_cache_free+0x5f/0x280
> > > > [ 9718.932730][ T4545] free_task+0x73/0x90
> > > > [ 9718.933391][ T4545] __put_task_struct+0x199/0x1d0
> > > > [ 9718.934187][ T4545] delayed_put_task_struct+0x124/0x1b0
> > > > [ 9718.935071][ T4545] rcu_core+0x3b0/0xeb0
> > > > [ 9718.935758][ T4545] rcu_core_si+0xe/0x10
> > > > [ 9718.936433][ T4545] __do_softirq+0x120/0x5e3
> > > > [ 9718.937165][ T4545]
> > > > [ 9718.937545][ T4545] The buggy address belongs to the object at ffff888014e94000
> > > > [ 9718.937545][ T4545] which belongs to the cache task_struct(168:screen-wrapper.service) of size 11072
> > > > [ 9718.940391][ T4545] The buggy address is located 44 bytes inside of
> > > > [ 9718.940391][ T4545] 11072-byte region [ffff888014e94000, ffff888014e96b40)
> > > > [ 9718.942559][ T4545] The buggy address belongs to the page:
> > > > [ 9718.943454][ T4545] page:ffffea000053a500 refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff888014e97fff head:ffffea000053a500 order:2 compound_mapcount:0 compound_pincount:0
> > > > [ 9718.946072][ T4545] flags: 0xfffe0000010200(slab|head)
> > > > [ 9718.946958][ T4545] raw: 00fffe0000010200 ffffea00011ab108 ffffea0001d6f108 ffff8881eabd9700
> > > > [ 9718.948406][ T4545] raw: ffff888014e97fff ffff888014e94000 0000000100000001 0000000000000000
> > > > [ 9718.949889][ T4545] page dumped because: kasan: bad access detected
> > > > [ 9718.950977][ T4545]
> > > > [ 9718.951354][ T4545] Memory state around the buggy address:
> > > > [ 9718.952296][ T4545] ffff888014e93f00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> > > > [ 9718.953641][ T4545] ffff888014e93f80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> > > > [ 9718.955004][ T4545] >ffff888014e94000: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> > > > [ 9718.956366][ T4545] ^
> > > > [ 9718.957258][ T4545] ffff888014e94080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> > > > [ 9718.958653][ T4545] ffff888014e94100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> > > > [ 9718.960034][ T4545] ==================================================================
> > > >
> > >
> >
> >
> >
* Re: BUG at fs/btrfs/relocation.c:794! Still happening on misc-next and 5.8.3
2020-08-28 0:08 ` Zygo Blaxell
@ 2020-08-28 6:34 ` Nikolay Borisov
2020-08-28 20:42 ` Zygo Blaxell
0 siblings, 1 reply; 13+ messages in thread
From: Nikolay Borisov @ 2020-08-28 6:34 UTC (permalink / raw)
To: Zygo Blaxell, Qu Wenruo; +Cc: David Sterba, linux-btrfs, wqu
On 28.08.20 at 3:08, Zygo Blaxell wrote:
> On Thu, Aug 27, 2020 at 08:03:13PM -0400, Zygo Blaxell wrote:
>> On Tue, Aug 04, 2020 at 12:16:26PM -0400, Zygo Blaxell wrote:
<snip>
>>
>> Aug 23 05:04:05 regress kernel: [53458.128928][ T9737] BTRFS info (device dm-0): relocating block group 14939862335488 flags metadata|raid1
>> Aug 23 05:04:05 regress kernel: [53458.999342][ T9737] ------------[ cut here ]------------
>> Aug 23 05:04:05 regress kernel: [53459.000275][ T9737] kernel BUG at fs/btrfs/relocation.c:794!
>>
>> Aug 24 01:23:52 regress kernel: [58662.545914][T17474] BTRFS info (device dm-0): relocating block group 15083978620928 flags metadata|raid1
>> Aug 24 01:23:54 regress kernel: [58664.778274][T17474] ------------[ cut here ]------------
>> Aug 24 01:23:54 regress kernel: [58664.782182][T17474] kernel BUG at fs/btrfs/relocation.c:794!
>>
>> Aug 24 07:17:07 regress kernel: [21068.421134][T29457] BTRFS info (device dm-0): relocating block group 15160784715776 flags metadata|raid1
>> Aug 24 07:17:08 regress kernel: [21069.307661][ T5176] ------------[ cut here ]------------
>> Aug 24 07:17:08 regress kernel: [21069.309195][ T5176] kernel BUG at fs/btrfs/relocation.c:794!
>>
>> Aug 25 18:58:26 regress kernel: [22013.457555][ T2164] BTRFS info (device dm-0): relocating block group 15530051239936 flags metadata|raid1
>> Aug 25 18:58:27 regress kernel: [22014.460689][ T4939] ------------[ cut here ]------------
>> Aug 25 18:58:27 regress kernel: [22014.461653][ T4939] kernel BUG at fs/btrfs/relocation.c:794!
>>
>> Aug 26 03:39:20 regress kernel: [31172.016638][T30882] BTRFS info (device dm-0): relocating block group 15576759009280 flags metadata|raid1
>> Aug 26 03:39:21 regress kernel: [31173.329719][T12663] ------------[ cut here ]------------
>> Aug 26 03:39:21 regress kernel: [31173.330682][T12663] kernel BUG at fs/btrfs/relocation.c:794!
>>
>> Aug 26 16:00:02 regress kernel: [44334.231395][T25917] BTRFS info (device dm-0): relocating block group 15631888941056 flags data
>> Aug 26 16:00:04 regress kernel: [44336.800710][T26519] ------------[ cut here ]------------
>> Aug 26 16:00:04 regress kernel: [44336.802888][T26519] kernel BUG at fs/btrfs/relocation.c:794!
>>
>> Aug 27 15:45:29 regress kernel: [55423.626717][ T5878] BTRFS info (device dm-0): relocating block group 15820229967872 flags metadata|raid1
>> Aug 27 15:45:29 regress kernel: [55423.798584][T15744] ------------[ cut here ]------------
>> Aug 27 15:45:29 regress kernel: [55423.802581][T15744] kernel BUG at fs/btrfs/relocation.c:794!
>>
>> Aug 27 17:35:26 regress kernel: [ 6459.129124][T21053] BTRFS info (device dm-0): relocating block group 15831168712704 flags metadata|raid1
>> Aug 27 17:35:43 regress kernel: [ 6475.931029][T25720] ------------[ cut here ]------------
>> Aug 27 17:35:43 regress kernel: [ 6475.932403][T25720] kernel BUG at fs/btrfs/relocation.c:794!
>>
>> There don't seem to be any instances of the BUG that did not occur
>> within 30 seconds of starting a balance.
>>
>> The on-disk data is fine. After a reboot the same block group can be
>> successfully balanced.
>
> Forgot to mention the failure rate: 8 crashes (listed above) among 1492
> block groups balanced over the same 4-day period.
Since you can reproduce this reliably, could you modify the code in
create_reloc_root so that it prints the returned error value? I'd
speculate it's EEXIST from

btrfs_insert_root
 btrfs_insert_item
  btrfs_insert_empty_item
   btrfs_insert_empty_items
    btrfs_search_slot

But better be sure.
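The register dump in the earlier oops already points the same way: RAX holds 00000000ffffffef, and a function returning int leaves its result in the low 32 bits of that register. The following userspace sketch shows the decoding, under the assumption that RAX still held the return value when BUG_ON() fired. The helper name reg_to_ret is hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: reinterpret the low 32 bits of a dumped 64-bit
 * register as the signed int a C function would have returned in it. */
static int reg_to_ret(uint64_t reg)
{
	uint32_t low = (uint32_t)reg;

	/* Two's-complement decode without relying on
	 * implementation-defined unsigned-to-signed conversion. */
	if (low > (uint32_t)INT32_MAX)
		return -(int)(UINT32_MAX - low) - 1;
	return (int)low;
}

#ifdef DEMO_MAIN
int main(void)
{
	/* RAX as printed in the quoted oops. */
	int ret = reg_to_ret(0x00000000ffffffefULL);

	assert(ret == -EEXIST);
	printf("ret = %d (%s)\n", ret, strerror(-ret));
	return 0;
}
#endif
```

This is only a hint, not proof: any call made between the failing insert and the BUG_ON() can overwrite RAX, which is one more reason to print the value explicitly.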
>
* Re: BUG at fs/btrfs/relocation.c:794! Still happening on misc-next and 5.8.3
2020-08-28 6:34 ` Nikolay Borisov
@ 2020-08-28 20:42 ` Zygo Blaxell
2020-09-01 22:53 ` Zygo Blaxell
0 siblings, 1 reply; 13+ messages in thread
From: Zygo Blaxell @ 2020-08-28 20:42 UTC (permalink / raw)
To: Nikolay Borisov; +Cc: Qu Wenruo, David Sterba, linux-btrfs, wqu
On Fri, Aug 28, 2020 at 09:34:31AM +0300, Nikolay Borisov wrote:
> On 28.08.20 at 3:08, Zygo Blaxell wrote:
> > On Thu, Aug 27, 2020 at 08:03:13PM -0400, Zygo Blaxell wrote:
> >> On Tue, Aug 04, 2020 at 12:16:26PM -0400, Zygo Blaxell wrote:
>
> <snip>
>
> Since you can repro reliably could you modify the code in
> create_reloc_root so it prints what's the returned error value, I'd
> speculate it's EEXIST from
>
> btrfs_insert_root
> btrfs_insert_item
> btrfs_insert_empty_item
> btrfs_insert_empty_items
> btrfs_search_slot
>
> But better be sure.
Here you go, EEXIST == 17:
Aug 28 15:30:55 regress kernel: [18452.845182][T31546] BTRFS info (device dm-0): balance: start -dlimit=9
Aug 28 15:30:55 regress kernel: [18452.874627][T31546] BTRFS info (device dm-0): relocating block group 15873413742592 flags data
Aug 28 15:30:55 regress kernel: [18453.097516][ T2100] btrfs_search_slot ret = 0
Aug 28 15:30:55 regress kernel: [18453.104865][ T2100] btrfs_search_slot ret = 0
Aug 28 15:30:55 regress kernel: [18453.109751][ T2100] btrfs_search_slot ret = 0
Aug 28 15:30:56 regress kernel: [18454.453135][ T2100] btrfs_search_slot ret = 0
Aug 28 15:30:56 regress kernel: [18454.453955][ T2100] btrfs_insert_empty_item ret = -17
Aug 28 15:30:56 regress kernel: [18454.455022][ T2100] btrfs_insert_root ret = -17
Aug 28 15:30:56 regress kernel: [18454.456229][ T2100] ------------[ cut here ]------------
Aug 28 15:30:56 regress kernel: [18454.457622][ T2100] kernel BUG at fs/btrfs/relocation.c:795!
Aug 28 15:30:56 regress kernel: [18454.459006][ T2100] invalid opcode: 0000 [#1] SMP KASAN PTI
Aug 28 15:30:56 regress kernel: [18454.460356][ T2100] CPU: 2 PID: 2100 Comm: rsync Tainted: G W 5.8.5-8de74804e45b+ #6
Aug 28 15:30:57 regress kernel: [18454.462324][ T2100] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
Aug 28 15:30:57 regress kernel: [18454.464289][ T2100] RIP: 0010:create_reloc_root+0x47a/0x490
Aug 28 15:30:57 regress kernel: [18454.465507][ T2100] Code: e8 5b 3b bd ff 4d 8b 76 50 be 08 00 00 00 49 8d bc 24 f0 00 00 00 e8 65 3b bd ff 4d 89 b4 24 f0 00 00 00 e9 dc fc ff ff 0f 0b <0f> 0b 0f 0b 0f 0b 0f 0b e8 b9 90 07 01 66 0f 1f 84 00 00 00 00 00
Aug 28 15:30:57 regress kernel: [18454.468861][ T2100] RSP: 0018:ffffc90000c777d0 EFLAGS: 00010282
Aug 28 15:30:57 regress kernel: [18454.469787][ T2100] RAX: 000000000000001b RBX: ffff88817cbc9400 RCX: ffffffffa5273b42
Aug 28 15:30:57 regress kernel: [18454.471005][ T2100] RDX: 0000000000000000 RSI: 0000000000000008 RDI: ffff8881f5dff28c
Aug 28 15:30:57 regress kernel: [18454.472278][ T2100] RBP: ffffc90000c778c0 R08: ffffed103ebc1645 R09: ffffed103ebc1645
Aug 28 15:30:57 regress kernel: [18454.473547][ T2100] R10: ffff8881f5e0b227 R11: ffffed103ebc1644 R12: ffff8881cb710020
Aug 28 15:30:57 regress kernel: [18454.474949][ T2100] R13: ffff888118800a80 R14: 00000000ffffffef R15: ffffc90000c77858
Aug 28 15:30:57 regress kernel: [18454.476224][ T2100] FS: 00007f1b8f7d9b80(0000) GS:ffff8881f5c00000(0000) knlGS:0000000000000000
Aug 28 15:30:57 regress kernel: [18454.477635][ T2100] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Aug 28 15:30:57 regress kernel: [18454.478661][ T2100] CR2: 00007fc1d25e7100 CR3: 0000000120a8e006 CR4: 00000000001606e0
Aug 28 15:30:57 regress kernel: [18454.479894][ T2100] Call Trace:
Aug 28 15:30:57 regress kernel: [18454.480416][ T2100] ? update_backref_node+0xf0/0xf0
Aug 28 15:30:57 regress kernel: [18454.481209][ T2100] ? check_chain_key+0x1e6/0x2e0
Aug 28 15:30:57 regress kernel: [18454.482012][ T2100] btrfs_init_reloc_root+0x1b0/0x310
Aug 28 15:30:57 regress kernel: [18454.482859][ T2100] ? find_reloc_root+0x200/0x200
Aug 28 15:30:57 regress kernel: [18454.483661][ T2100] ? do_raw_spin_unlock+0xa8/0x140
Aug 28 15:30:57 regress kernel: [18454.484482][ T2100] record_root_in_trans+0x18c/0x1d0
Aug 28 15:30:57 regress kernel: [18454.485435][ T2100] btrfs_record_root_in_trans+0x8b/0xc0
Aug 28 15:30:57 regress kernel: [18454.486301][ T2100] start_transaction+0x16b/0x8f0
Aug 28 15:30:57 regress kernel: [18454.487082][ T2100] btrfs_start_transaction+0x1e/0x20
Aug 28 15:30:57 regress kernel: [18454.487905][ T2100] btrfs_cont_expand+0x549/0x7a0
Aug 28 15:30:57 regress kernel: [18454.488680][ T2100] ? btrfs_truncate_block+0x970/0x970
Aug 28 15:30:57 regress kernel: [18454.489527][ T2100] ? timestamp_truncate+0x180/0x180
Aug 28 15:30:57 regress kernel: [18454.490344][ T2100] ? check_chain_key+0x1e6/0x2e0
Aug 28 15:30:57 regress kernel: [18454.491117][ T2100] btrfs_file_write_iter+0x7ae/0x957
Aug 28 15:30:57 regress kernel: [18454.491938][ T2100] ? btrfs_sync_file+0x7c0/0x7c0
Aug 28 15:30:57 regress kernel: [18454.492710][ T2100] ? iov_iter_init+0x99/0xd0
Aug 28 15:30:57 regress kernel: [18454.493426][ T2100] new_sync_write+0x2ad/0x3f0
Aug 28 15:30:57 regress kernel: [18454.494153][ T2100] ? new_sync_read+0x3e0/0x3e0
Aug 28 15:30:57 regress kernel: [18454.494890][ T2100] ? check_flags+0x26/0x30
Aug 28 15:30:57 regress kernel: [18454.495582][ T2100] ? lock_is_held_type+0xc9/0x100
Aug 28 15:30:57 regress kernel: [18454.496365][ T2100] ? rcu_read_lock_any_held+0xd2/0x100
Aug 28 15:30:57 regress kernel: [18454.497211][ T2100] ? rcu_read_lock_held+0xb0/0xb0
Aug 28 15:30:57 regress kernel: [18454.497985][ T2100] ? __sb_start_write+0x1a1/0x270
Aug 28 15:30:57 regress kernel: [18454.498768][ T2100] vfs_write+0x2d2/0x300
Aug 28 15:30:57 regress kernel: [18454.499417][ T2100] ksys_write+0xcc/0x170
Aug 28 15:30:57 regress kernel: [18454.500064][ T2100] ? __ia32_sys_read+0x50/0x50
Aug 28 15:30:57 regress kernel: [18454.500783][ T2100] ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
Aug 28 15:30:57 regress kernel: [18454.501704][ T2100] __x64_sys_write+0x43/0x50
Aug 28 15:30:57 regress kernel: [18454.502403][ T2100] do_syscall_64+0x60/0xf0
Aug 28 15:30:57 regress kernel: [18454.503079][ T2100] entry_SYSCALL_64_after_hwframe+0x44/0xa9
Aug 28 15:30:57 regress kernel: [18454.503971][ T2100] RIP: 0033:0x7f1b8f8c5504
Aug 28 15:30:57 regress kernel: [18454.504644][ T2100] Code: 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b3 0f 1f 80 00 00 00 00 48 8d 05 f9 61 0d 00 8b 00 85 c0 75 13 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 54 c3 0f 1f 00 41 54 49 89 d4 55 48 89 f5 53
Aug 28 15:30:57 regress kernel: [18454.507565][ T2100] RSP: 002b:00007fff3419eaa8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
Aug 28 15:30:57 regress kernel: [18454.508800][ T2100] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f1b8f8c5504
Aug 28 15:30:57 regress kernel: [18454.509982][ T2100] RDX: 0000000000000400 RSI: 000055e56f375bb0 RDI: 0000000000000001
Aug 28 15:30:57 regress kernel: [18454.511153][ T2100] RBP: 0000000000000400 R08: 0000000000000400 R09: 000000002c4a4095
Aug 28 15:30:57 regress kernel: [18454.512325][ T2100] R10: 000000000a7b98fd R11: 0000000000000246 R12: 000055e56f375bb0
Aug 28 15:30:57 regress kernel: [18454.513503][ T2100] R13: 000055e56f375bb0 R14: 0000000000008000 R15: 0000000000000400
Aug 28 15:30:57 regress kernel: [18454.514685][ T2100] Modules linked in:
Aug 28 15:30:57 regress kernel: [18454.515321][ T2100] ---[ end trace dc1ad17026339b11 ]---
Aug 28 15:30:57 regress kernel: [18454.516184][ T2100] RIP: 0010:create_reloc_root+0x47a/0x490
Aug 28 15:30:57 regress kernel: [18454.517085][ T2100] Code: e8 5b 3b bd ff 4d 8b 76 50 be 08 00 00 00 49 8d bc 24 f0 00 00 00 e8 65 3b bd ff 4d 89 b4 24 f0 00 00 00 e9 dc fc ff ff 0f 0b <0f> 0b 0f 0b 0f 0b 0f 0b e8 b9 90 07 01 66 0f 1f 84 00 00 00 00 00
Aug 28 15:30:57 regress kernel: [18454.520010][ T2100] RSP: 0018:ffffc90000c777d0 EFLAGS: 00010282
Aug 28 15:30:57 regress kernel: [18454.520935][ T2100] RAX: 000000000000001b RBX: ffff88817cbc9400 RCX: ffffffffa5273b42
Aug 28 15:30:57 regress kernel: [18454.522172][ T2100] RDX: 0000000000000000 RSI: 0000000000000008 RDI: ffff8881f5dff28c
Aug 28 15:30:57 regress kernel: [18454.523567][ T2100] RBP: ffffc90000c778c0 R08: ffffed103ebc1645 R09: ffffed103ebc1645
Aug 28 15:30:57 regress kernel: [18454.524985][ T2100] R10: ffff8881f5e0b227 R11: ffffed103ebc1644 R12: ffff8881cb710020
Aug 28 15:30:57 regress kernel: [18454.526404][ T2100] R13: ffff888118800a80 R14: 00000000ffffffef R15: ffffc90000c77858
Aug 28 15:30:57 regress kernel: [18454.527887][ T2100] FS: 00007f1b8f7d9b80(0000) GS:ffff8881f5c00000(0000) knlGS:0000000000000000
Aug 28 15:30:57 regress kernel: [18454.529576][ T2100] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Aug 28 15:30:57 regress kernel: [18454.530845][ T2100] CR2: 00007fc1d25e7100 CR3: 0000000120a8e006 CR4: 00000000001606e0
Aug 28 15:30:57 regress kernel: [18454.821401][T32222] ==================================================================
Aug 28 15:30:57 regress kernel: [18454.822634][T32222] BUG: KASAN: use-after-free in __mutex_lock+0x202/0xce0
Aug 28 15:30:57 regress kernel: [18454.823654][T32222] Read of size 4 at addr ffff88811329c02c by task mkdir/32222
Aug 28 15:30:57 regress kernel: [18454.824781][T32222]
Aug 28 15:30:57 regress kernel: [18454.825148][T32222] CPU: 1 PID: 32222 Comm: mkdir Tainted: G D W 5.8.5-8de74804e45b+ #6
Aug 28 15:30:57 regress kernel: [18454.826616][T32222] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
Aug 28 15:30:57 regress kernel: [18454.828088][T32222] Call Trace:
Aug 28 15:30:57 regress kernel: [18454.828607][T32222] dump_stack+0xc8/0x11a
Aug 28 15:30:57 regress kernel: [18454.829297][T32222] ? __mutex_lock+0x202/0xce0
Aug 28 15:30:57 regress kernel: [18454.830033][T32222] print_address_description.constprop.8+0x1f/0x200
Aug 28 15:30:57 regress kernel: [18454.831062][T32222] ? __mutex_lock+0x202/0xce0
Aug 28 15:30:57 regress kernel: [18454.831783][T32222] ? __mutex_lock+0x202/0xce0
Aug 28 15:30:57 regress kernel: [18454.832537][T32222] kasan_report.cold.11+0x20/0x3e
Aug 28 15:30:57 regress kernel: [18454.833323][T32222] ? __mutex_lock+0x202/0xce0
Aug 28 15:30:57 regress kernel: [18454.834056][T32222] __asan_load4+0x69/0x90
Aug 28 15:30:57 regress kernel: [18454.834754][T32222] __mutex_lock+0x202/0xce0
Aug 28 15:30:57 regress kernel: [18454.835475][T32222] ? wait_current_trans+0xb7/0x230
Aug 28 15:30:57 regress kernel: [18454.836295][T32222] ? btrfs_record_root_in_trans+0x7e/0xc0
Aug 28 15:30:57 regress kernel: [18454.837206][T32222] ? mutex_lock_io_nested+0xc20/0xc20
Aug 28 15:30:57 regress kernel: [18454.838064][T32222] ? __kasan_check_read+0x11/0x20
Aug 28 15:30:57 regress kernel: [18454.838860][T32222] ? join_transaction+0x32/0x6f0
Aug 28 15:30:57 regress kernel: [18454.839653][T32222] ? join_transaction+0x1a6/0x6f0
Aug 28 15:30:57 regress kernel: [18454.840592][T32222] ? lock_downgrade+0x3e0/0x3e0
Aug 28 15:30:57 regress kernel: [18454.841401][T32222] ? __kasan_check_write+0x14/0x20
Aug 28 15:30:57 regress kernel: [18454.842165][T32222] ? lock_contended+0x720/0x720
Aug 28 15:30:57 regress kernel: [18454.842883][T32222] ? do_raw_spin_lock+0x1e0/0x1e0
Aug 28 15:30:57 regress kernel: [18454.843629][T32222] ? wait_current_trans+0xb7/0x230
Aug 28 15:30:57 regress kernel: [18454.844409][T32222] mutex_lock_nested+0x1b/0x20
Aug 28 15:30:57 regress kernel: [18454.845121][T32222] ? mutex_lock_nested+0x1b/0x20
Aug 28 15:30:57 regress kernel: [18454.845867][T32222] btrfs_record_root_in_trans+0x7e/0xc0
Aug 28 15:30:57 regress kernel: [18454.846694][T32222] start_transaction+0x16b/0x8f0
Aug 28 15:30:57 regress kernel: [18454.847438][T32222] btrfs_start_transaction+0x1e/0x20
Aug 28 15:30:57 regress kernel: [18454.848223][T32222] btrfs_mkdir+0xf5/0x3b0
Aug 28 15:30:57 regress kernel: [18454.848863][T32222] ? make_kprojid+0x20/0x20
Aug 28 15:30:57 regress kernel: [18454.849533][T32222] ? putname+0x6b/0x80
Aug 28 15:30:57 regress kernel: [18454.850141][T32222] ? btrfs_rename2+0x2b20/0x2b20
Aug 28 15:30:57 regress kernel: [18454.850866][T32222] ? generic_permission+0x58/0x250
Aug 28 15:30:57 regress kernel: [18454.851753][T32222] ? security_inode_permission+0x1d/0x70
Aug 28 15:30:57 regress kernel: [18454.852598][T32222] ? inode_permission+0x7a/0x1f0
Aug 28 15:30:57 regress kernel: [18454.853343][T32222] vfs_mkdir+0x1e1/0x2f0
Aug 28 15:30:57 regress kernel: [18454.853971][T32222] do_mkdirat+0x192/0x1c0
Aug 28 15:30:57 regress kernel: [18454.854620][T32222] ? __ia32_sys_mknod+0x50/0x50
Aug 28 15:30:57 regress kernel: [18454.855357][T32222] ? trace_hardirqs_on_prepare+0x35/0x170
Aug 28 15:30:57 regress kernel: [18454.856239][T32222] __x64_sys_mkdir+0x37/0x40
Aug 28 15:30:57 regress kernel: [18454.856951][T32222] do_syscall_64+0x60/0xf0
Aug 28 15:30:57 regress kernel: [18454.857645][T32222] entry_SYSCALL_64_after_hwframe+0x44/0xa9
Aug 28 15:30:57 regress kernel: [18454.858609][T32222] RIP: 0033:0x7f36074470d7
Aug 28 15:30:57 regress kernel: [18454.859287][T32222] Code: 1f 40 00 48 8b 05 b9 0d 0d 00 64 c7 00 5f 00 00 00 b8 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 b8 53 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 89 0d 0d 00 f7 d8 64 89 01 48
Aug 28 15:30:57 regress kernel: [18454.862597][T32222] RSP: 002b:00007ffc5c8419e8 EFLAGS: 00000206 ORIG_RAX: 0000000000000053
Aug 28 15:30:57 regress kernel: [18454.863874][T32222] RAX: ffffffffffffffda RBX: 00007ffc5c842bc8 RCX: 00007f36074470d7
Aug 28 15:30:57 regress kernel: [18454.865087][T32222] RDX: 0000000000000000 RSI: 00000000000001ff RDI: 00007ffc5c842bc8
Aug 28 15:30:57 regress kernel: [18454.866297][T32222] RBP: 00007ffc5c842bc8 R08: 00000000000001ff R09: 0000557174728c00
Aug 28 15:30:57 regress kernel: [18454.867501][T32222] R10: fffffffffffff35a R11: 0000000000000206 R12: 00000000000001ff
Aug 28 15:30:57 regress kernel: [18454.868709][T32222] R13: 0000000000000000 R14: 00007ffc5c841b60 R15: 00007ffc5c841cf0
Aug 28 15:30:57 regress kernel: [18454.869923][T32222]
Aug 28 15:30:57 regress kernel: [18454.870296][T32222] Allocated by task 2066:
Aug 28 15:30:57 regress kernel: [18454.870939][T32222] save_stack+0x21/0x50
Aug 28 15:30:57 regress kernel: [18454.871572][T32222] __kasan_kmalloc.constprop.17+0xc1/0xd0
Aug 28 15:30:57 regress kernel: [18454.872434][T32222] kasan_slab_alloc+0x12/0x20
Aug 28 15:30:57 regress kernel: [18454.873133][T32222] kmem_cache_alloc_node+0x113/0x720
Aug 28 15:30:57 regress kernel: [18454.873914][T32222] copy_process+0x357/0x3680
Aug 28 15:30:57 regress kernel: [18454.874653][T32222] _do_fork+0xed/0x880
Aug 28 15:30:57 regress kernel: [18454.875353][T32222] __do_sys_clone+0xee/0x130
Aug 28 15:30:57 regress kernel: [18454.876057][T32222] __x64_sys_clone+0x67/0x80
Aug 28 15:30:57 regress kernel: [18454.876782][T32222] do_syscall_64+0x60/0xf0
Aug 28 15:30:57 regress kernel: [18454.877476][T32222] entry_SYSCALL_64_after_hwframe+0x44/0xa9
Aug 28 15:30:57 regress kernel: [18454.878398][T32222]
Aug 28 15:30:57 regress kernel: [18454.878760][T32222] Freed by task 3558:
Aug 28 15:30:57 regress kernel: [18454.879384][T32222] save_stack+0x21/0x50
Aug 28 15:30:57 regress kernel: [18454.880038][T32222] __kasan_slab_free+0x118/0x170
Aug 28 15:30:57 regress kernel: [18454.880855][T32222] kasan_slab_free+0xe/0x10
Aug 28 15:30:57 regress kernel: [18454.881565][T32222] kmem_cache_free+0x5f/0x280
Aug 28 15:30:57 regress kernel: [18454.882297][T32222] free_task+0x73/0x90
Aug 28 15:30:57 regress kernel: [18454.882928][T32222] __put_task_struct+0x199/0x1d0
Aug 28 15:30:57 regress kernel: [18454.883699][T32222] delayed_put_task_struct+0x124/0x1b0
Aug 28 15:30:57 regress kernel: [18454.884615][T32222] rcu_core+0x3b0/0xea0
Aug 28 15:30:57 regress kernel: [18454.885273][T32222] rcu_core_si+0xe/0x10
Aug 28 15:30:57 regress kernel: [18454.886251][T32222] __do_softirq+0x120/0x5e3
Aug 28 15:30:57 regress kernel: [18454.886964][T32222]
Aug 28 15:30:57 regress kernel: [18454.887332][T32222] The buggy address belongs to the object at ffff88811329c000
Aug 28 15:30:57 regress kernel: [18454.887332][T32222] which belongs to the cache task_struct(192:ssh.service) of size 11072
Aug 28 15:30:57 regress kernel: [18454.889771][T32222] The buggy address is located 44 bytes inside of
Aug 28 15:30:57 regress kernel: [18454.889771][T32222] 11072-byte region [ffff88811329c000, ffff88811329eb40)
Aug 28 15:30:57 regress kernel: [18454.891843][T32222] The buggy address belongs to the page:
Aug 28 15:30:57 regress kernel: [18454.892718][T32222] page:ffffea00044ca700 refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff88811329ffff head:ffffea00044ca700 order:2 compound_mapcount:0 compound_pincount:0
Aug 28 15:30:57 regress kernel: [18454.895303][T32222] flags: 0x17ffe0000010200(slab|head)
Aug 28 15:30:57 regress kernel: [18454.896186][T32222] raw: 017ffe0000010200 ffffea0001a49908 ffff8881f5b36498 ffff8881eb5a1380
Aug 28 15:30:57 regress kernel: [18454.897618][T32222] raw: ffff88811329ffff ffff88811329c000 0000000100000001 0000000000000000
Aug 28 15:30:57 regress kernel: [18454.899016][T32222] page dumped because: kasan: bad access detected
Aug 28 15:30:57 regress kernel: [18454.900061][T32222]
Aug 28 15:30:57 regress kernel: [18454.900439][T32222] Memory state around the buggy address:
Aug 28 15:30:57 regress kernel: [18454.901364][T32222] ffff88811329bf00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Aug 28 15:30:57 regress kernel: [18454.902699][T32222] ffff88811329bf80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Aug 28 15:30:57 regress kernel: [18454.904052][T32222] >ffff88811329c000: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
Aug 28 15:30:57 regress kernel: [18454.905345][T32222] ^
Aug 28 15:30:57 regress kernel: [18454.906245][T32222] ffff88811329c080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
Aug 28 15:30:57 regress kernel: [18454.907675][T32222] ffff88811329c100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
Aug 28 15:30:57 regress kernel: [18454.909247][T32222] ==================================================================
* Re: BUG at fs/btrfs/relocation.c:794! Still happening on misc-next and 5.8.3
2020-08-28 20:42 ` Zygo Blaxell
@ 2020-09-01 22:53 ` Zygo Blaxell
2020-09-01 23:33 ` Qu Wenruo
0 siblings, 1 reply; 13+ messages in thread
From: Zygo Blaxell @ 2020-09-01 22:53 UTC (permalink / raw)
To: Nikolay Borisov; +Cc: Qu Wenruo, David Sterba, linux-btrfs, wqu
On Fri, Aug 28, 2020 at 04:42:55PM -0400, Zygo Blaxell wrote:
> On Fri, Aug 28, 2020 at 09:34:31AM +0300, Nikolay Borisov wrote:
> > On 28.08.20 at 3:08, Zygo Blaxell wrote:
> > > On Thu, Aug 27, 2020 at 08:03:13PM -0400, Zygo Blaxell wrote:
> > >> On Tue, Aug 04, 2020 at 12:16:26PM -0400, Zygo Blaxell wrote:
> >
> > <snip>
> >
> > Since you can repro reliably could you modify the code in
> > create_reloc_root so it prints what's the returned error value, I'd
> > speculate it's EEXIST from
> >
> > btrfs_insert_root
> > btrfs_insert_item
> > btrfs_insert_empty_item
> > btrfs_insert_empty_items
> > btrfs_search_slot
> >
> > But better be sure.
>
> Here you go, EEXIST == 17:
>
> Aug 28 15:30:55 regress kernel: [18452.845182][T31546] BTRFS info (device dm-0): balance: start -dlimit=9
> Aug 28 15:30:55 regress kernel: [18452.874627][T31546] BTRFS info (device dm-0): relocating block group 15873413742592 flags data
> Aug 28 15:30:55 regress kernel: [18453.097516][ T2100] btrfs_search_slot ret = 0
> Aug 28 15:30:55 regress kernel: [18453.104865][ T2100] btrfs_search_slot ret = 0
> Aug 28 15:30:55 regress kernel: [18453.109751][ T2100] btrfs_search_slot ret = 0
> Aug 28 15:30:56 regress kernel: [18454.453135][ T2100] btrfs_search_slot ret = 0
> Aug 28 15:30:56 regress kernel: [18454.453955][ T2100] btrfs_insert_empty_item ret = -17
> Aug 28 15:30:56 regress kernel: [18454.455022][ T2100] btrfs_insert_root ret = -17
> Aug 28 15:30:56 regress kernel: [18454.456229][ T2100] ------------[ cut here ]------------
> Aug 28 15:30:56 regress kernel: [18454.457622][ T2100] kernel BUG at fs/btrfs/relocation.c:795!
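For anyone reading the numbers: the -17 printed above and the 00000000ffffffef sitting in R14 of the create_reloc_root register dump look like the same thing, a negative errno held in a 32-bit int. A minimal Python sketch of the decoding follows; the helper name decode_errno is made up for illustration and is not part of any tool used here.

```python
import errno

def decode_errno(reg_value, bits=32):
    """Interpret a register value as a signed integer and name the errno."""
    # Keep only the low `bits` bits, then apply two's-complement sign extension.
    value = reg_value & ((1 << bits) - 1)
    if value >= 1 << (bits - 1):
        value -= 1 << bits
    # Kernel functions return -errno, so look up the positive number.
    name = errno.errorcode.get(-value, "unknown")
    return value, name

print(decode_errno(0x00000000ffffffef))  # (-17, 'EEXIST')
```

That matches the "EEXIST == 17" reading of the debug output.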
I did a low-resolution bisect for this issue. I dug up 5.4, 5.5, 5.6,
and 5.7 kernel sources, backported btrfs fixes from 5.4 to the obsolete
kernels, and ran the tests on each kernel. Results:
5.8: kernel BUG at fs/btrfs/relocation.c:794
5.7: kernel BUG (same code but different line number)
5.6: kernel BUG (same as the others)
5.5: assertion failure (stack trace below)
5.4: kernel BUG (!)
The 5.4 result is interesting--I've been running 5.4 for some time and
not hit this before. So there are 3 possible theories:
1. It's because of sending signals to balance. That was
added to my test workload after 5.7 was released, so earlier
tests on 5.4 would not have triggered it.
2. There's a regression in 5.4-stable, which I've cherry-picked
to all the other kernels during my test setup. (On the other
hand, if I don't backport some fixes, kernels 5.5..5.7 crash
before they get to this bug.)
3. There's something rotten in my test filesystem, and the
BUG will go away for a while if I do a mkfs. Qu asked for
a dump earlier in this thread, and I provided one.
All three of these theories are testable to some extent, so I'll have
my test VM run some variations.
The full test workload is:
balance metadata or data at random intervals
scrub, scrub cancel at random intervals
20x rsync updating files
snapshot create, delete at random intervals
bees dedupe (lots of EXTENT_SAME and LOGICAL_INO calls)
balance cancel at random intervals
kill -9 $(pidof btrfs balance) at random intervals (NEW - added
when 5.7 came out)
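As a rough illustration of the last item only, here is a Python sketch of a loop that kills balance at random intervals. It is not the actual test harness: the 30..600 second bounds, the function names, and the use of pkill -f (from procps) to match the command line are all assumptions made for the example.

```python
import random
import subprocess
import time

def kill_balance():
    """Send SIGKILL to any process whose command line contains 'btrfs balance'."""
    # pkill -f matches against the full command line; it returns 1 if nothing matched.
    return subprocess.run(["pkill", "-9", "-f", "btrfs balance"]).returncode

def random_delays(rng, count, low=30.0, high=600.0):
    """Return `count` sleep intervals in seconds, drawn uniformly from [low, high]."""
    return [rng.uniform(low, high) for _ in range(count)]

def run(count=10, seed=None):
    """Sleep for a random interval, kill balance, and repeat `count` times."""
    rng = random.Random(seed)
    for delay in random_delays(rng, count):
        time.sleep(delay)
        kill_balance()
```

Calling run() alongside the rest of the workload would approximate that item; passing a fixed seed makes a given kill schedule repeatable.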
This is the 5.5 root assertion failure:
Sep 1 04:48:48 regress kernel: [10642.537825][T24161] assertion failed: root, in fs/btrfs/relocation.c:837
Sep 1 04:48:48 regress kernel: [10642.538911][T24161] ------------[ cut here ]------------
Sep 1 04:48:48 regress kernel: [10642.539704][T24161] kernel BUG at fs/btrfs/ctree.h:3125!
Sep 1 04:48:48 regress kernel: [10642.540621][T24161] invalid opcode: 0000 [#1] SMP KASAN PTI
Sep 1 04:48:48 regress kernel: [10642.540624][ T4639] irq event stamp: 49626809
Sep 1 04:48:48 regress kernel: [10642.540632][ T4639] hardirqs last enabled at (49626809): [<ffffffff8a00481a>] trace_hardirqs_on_thunk+0x1a/0x1c
Sep 1 04:48:48 regress kernel: [10642.541451][T24161] CPU: 1 PID: 24161 Comm: btrfs Tainted: G W 5.5.19-76348822ab91+ #14
Sep 1 04:48:48 regress kernel: [10642.542114][ T4639] hardirqs last disabled at (49626808): [<ffffffff8a004836>] trace_hardirqs_off_thunk+0x1a/0x1c
Sep 1 04:48:48 regress kernel: [10642.543693][T24161] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
Sep 1 04:48:48 regress kernel: [10642.545017][ T4639] softirqs last enabled at (49626726): [<ffffffff8bc0048b>] __do_softirq+0x48b/0x5be
Sep 1 04:48:48 regress kernel: [10642.545023][ T4639] softirqs last disabled at (49626715): [<ffffffff8a1258a2>] irq_exit+0x112/0x120
Sep 1 04:48:48 regress kernel: [10642.546536][T24161] RIP: 0010:assertfail.constprop.42+0x1c/0x1e
Sep 1 04:48:49 regress kernel: [10642.551589][T24161] Code: 48 c7 c6 c0 90 03 8c e8 89 29 f1 ff 0f 0b 55 89 f1 48 c7 c2 40 82 03 8c 48 89 fe 48 c7 c7 60 83 03 8c 48 89 e5 e8 41 b0 90 ff <0f> 0b 0f 1f 44 00 00 55 48 89 e5 41 54 49 89 f4 53 48 89 fb 48 83
Sep 1 04:48:49 regress kernel: [10642.554495][T24161] RSP: 0018:ffffc90002327150 EFLAGS: 00010282
Sep 1 04:48:49 regress kernel: [10642.555371][T24161] RAX: 0000000000000034 RBX: 000004513701c000 RCX: ffffffff8a264242
Sep 1 04:48:49 regress kernel: [10642.556515][T24161] RDX: 0000000000000000 RSI: 0000000000000008 RDI: ffff8881f580004c
Sep 1 04:48:49 regress kernel: [10642.557680][T24161] RBP: ffffc90002327150 R08: ffffed103eb017e1 R09: ffffed103eb017e1
Sep 1 04:48:49 regress kernel: [10642.558895][T24161] R10: ffffed103eb017e0 R11: ffff8881f580bf07 R12: ffff88800d1ea5c0
Sep 1 04:48:49 regress kernel: [10642.560139][T24161] R13: ffffc900023273e0 R14: 0000000000000000 R15: ffff8880bbf238f0
Sep 1 04:48:49 regress kernel: [10642.561391][T24161] FS: 00007f03f48488c0(0000) GS:ffff8881f5600000(0000) knlGS:0000000000000000
Sep 1 04:48:49 regress kernel: [10642.562779][T24161] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Sep 1 04:48:49 regress kernel: [10642.563801][T24161] CR2: 00007f1cab76f718 CR3: 0000000189a5e004 CR4: 00000000001606e0
Sep 1 04:48:49 regress kernel: [10642.565046][T24161] Call Trace:
Sep 1 04:48:49 regress kernel: [10642.565565][T24161] build_backref_tree+0x186b/0x2590
Sep 1 04:48:49 regress kernel: [10642.566389][T24161] ? relocate_data_extent+0x1a0/0x1a0
Sep 1 04:48:49 regress kernel: [10642.567295][T24161] ? lock_downgrade+0x3d0/0x3d0
Sep 1 04:48:49 regress kernel: [10642.568142][T24161] ? match_held_lock+0x20/0x260
Sep 1 04:48:49 regress kernel: [10642.568925][T24161] ? do_raw_spin_unlock+0xa8/0x140
Sep 1 04:48:49 regress kernel: [10642.569765][T24161] ? _raw_spin_trylock_bh+0x60/0x80
Sep 1 04:48:49 regress kernel: [10642.570605][T24161] ? release_extent_buffer+0x13b/0x230
Sep 1 04:48:49 regress kernel: [10642.571480][T24161] ? free_extent_buffer.part.45+0xd7/0x140
Sep 1 04:48:49 regress kernel: [10642.572406][T24161] relocate_tree_blocks+0x204/0xa50
Sep 1 04:48:49 regress kernel: [10642.573244][T24161] ? build_backref_tree+0x2590/0x2590
Sep 1 04:48:49 regress kernel: [10642.574103][T24161] ? rb_insert_color+0x3af/0x400
Sep 1 04:48:49 regress kernel: [10642.574896][T24161] ? kmem_cache_alloc_trace+0x5af/0x740
Sep 1 04:48:49 regress kernel: [10642.575785][T24161] ? tree_insert+0x90/0xb0
Sep 1 04:48:49 regress kernel: [10642.576495][T24161] ? add_tree_block.isra.38+0x1d6/0x230
Sep 1 04:48:49 regress kernel: [10642.577387][T24161] relocate_block_group+0x528/0x9d0
Sep 1 04:48:49 regress kernel: [10642.578220][T24161] ? merge_reloc_roots+0x470/0x470
Sep 1 04:48:49 regress kernel: [10642.579047][T24161] btrfs_relocate_block_group+0x26e/0x4c0
Sep 1 04:48:49 regress kernel: [10642.579968][T24161] btrfs_relocate_chunk+0x52/0xf0
Sep 1 04:48:49 regress kernel: [10642.580773][T24161] btrfs_balance+0xe5b/0x1800
Sep 1 04:48:49 regress kernel: [10642.581542][T24161] ? btrfs_relocate_chunk+0xf0/0xf0
Sep 1 04:48:49 regress kernel: [10642.582381][T24161] ? kmem_cache_alloc_trace+0x5af/0x740
Sep 1 04:48:49 regress kernel: [10642.583270][T24161] ? _copy_from_user+0xaa/0xd0
Sep 1 04:48:49 regress kernel: [10642.584022][T24161] btrfs_ioctl_balance+0x3de/0x4c0
Sep 1 04:48:49 regress kernel: [10642.584819][T24161] btrfs_ioctl+0x3122/0x4470
Sep 1 04:48:49 regress kernel: [10642.585540][T24161] ? __asan_loadN+0xf/0x20
Sep 1 04:48:49 regress kernel: [10642.586229][T24161] ? __asan_loadN+0xf/0x20
Sep 1 04:48:49 regress kernel: [10642.586920][T24161] ? btrfs_ioctl_get_supported_features+0x30/0x30
Sep 1 04:48:49 regress kernel: [10642.587935][T24161] ? __asan_loadN+0xf/0x20
Sep 1 04:48:49 regress kernel: [10642.588649][T24161] ? pvclock_clocksource_read+0xeb/0x190
Sep 1 04:48:49 regress kernel: [10642.589566][T24161] ? __asan_loadN+0xf/0x20
Sep 1 04:48:49 regress kernel: [10642.590254][T24161] ? pvclock_clocksource_read+0xeb/0x190
Sep 1 04:48:49 regress kernel: [10642.591128][T24161] ? __kasan_check_read+0x11/0x20
Sep 1 04:48:49 regress kernel: [10642.591913][T24161] ? check_chain_key+0x1e6/0x2e0
Sep 1 04:48:49 regress kernel: [10642.592707][T24161] ? __asan_loadN+0xf/0x20
Sep 1 04:48:49 regress kernel: [10642.593409][T24161] ? pvclock_clocksource_read+0xeb/0x190
Sep 1 04:48:49 regress kernel: [10642.594312][T24161] ? kvm_sched_clock_read+0x18/0x30
Sep 1 04:48:49 regress kernel: [10642.595139][T24161] ? check_chain_key+0x1e6/0x2e0
Sep 1 04:48:49 regress kernel: [10642.595929][T24161] ? sched_clock_cpu+0x1b/0x120
Sep 1 04:48:49 regress kernel: [10642.596712][T24161] do_vfs_ioctl+0x13e/0xad0
Sep 1 04:48:49 regress kernel: [10642.597432][T24161] ? btrfs_ioctl_get_supported_features+0x30/0x30
Sep 1 04:48:49 regress kernel: [10642.598455][T24161] ? do_vfs_ioctl+0x13e/0xad0
Sep 1 04:48:49 regress kernel: [10642.599202][T24161] ? compat_ioctl_preallocate+0x170/0x170
Sep 1 04:48:49 regress kernel: [10642.600128][T24161] ? __kasan_check_write+0x14/0x20
Sep 1 04:48:49 regress kernel: [10642.600949][T24161] ? up_read+0x176/0x4f0
Sep 1 04:48:49 regress kernel: [10642.601648][T24161] ? down_write_nested+0x2d0/0x2d0
Sep 1 04:48:49 regress kernel: [10642.602476][T24161] ? handle_mm_fault+0x211/0x480
Sep 1 04:48:49 regress kernel: [10642.603263][T24161] ? __kasan_check_read+0x11/0x20
Sep 1 04:48:49 regress kernel: [10642.604062][T24161] ? __fget_light+0xb2/0x110
Sep 1 04:48:49 regress kernel: [10642.604805][T24161] ksys_ioctl+0x67/0x90
Sep 1 04:48:49 regress kernel: [10642.605471][T24161] __x64_sys_ioctl+0x43/0x50
Sep 1 04:48:49 regress kernel: [10642.606203][T24161] do_syscall_64+0x77/0x2d0
Sep 1 04:48:49 regress kernel: [10642.606933][T24161] entry_SYSCALL_64_after_hwframe+0x49/0xbe
Sep 1 04:48:49 regress kernel: [10642.607875][T24161] RIP: 0033:0x7f03f493b427
Sep 1 04:48:49 regress kernel: [10642.608588][T24161] Code: 00 00 90 48 8b 05 69 aa 0c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 39 aa 0c 00 f7 d8 64 89 01 48
Sep 1 04:48:49 regress kernel: [10642.611697][T24161] RSP: 002b:00007ffd6bd78fb8 EFLAGS: 00000206 ORIG_RAX: 0000000000000010
Sep 1 04:48:49 regress kernel: [10642.613035][T24161] RAX: ffffffffffffffda RBX: 00007ffd6bd79058 RCX: 00007f03f493b427
Sep 1 04:48:49 regress kernel: [10642.614313][T24161] RDX: 00007ffd6bd79058 RSI: 00000000c4009420 RDI: 0000000000000003
Sep 1 04:48:49 regress kernel: [10642.615605][T24161] RBP: 0000000000000003 R08: 0000000000000003 R09: 0000000000000078
Sep 1 04:48:49 regress kernel: [10642.616877][T24161] R10: fffffffffffff5ab R11: 0000000000000206 R12: 0000000000000001
Sep 1 04:48:49 regress kernel: [10642.618132][T24161] R13: 0000000000000000 R14: 00007ffd6bd7aa46 R15: 0000000000000001
Sep 1 04:48:49 regress kernel: [10642.619378][T24161] Modules linked in:
Sep 1 04:48:49 regress kernel: [10642.620153][T24161] ---[ end trace a33c17a7d43dd973 ]---
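Note that the first few lines of this trace are interleaved with output from another task (T4639). A small Python sketch for pulling out one task's lines is below; the regex assumes the "[seconds][ Tnnnn]" prefix seen in these logs, and lines_for_task is a made-up helper name.

```python
import re

# Matches the "[seconds.micros][ Tnnnn]" prefix seen in these logs.
PREFIX = re.compile(r"\[\s*(\d+\.\d+)\]\[\s*T(\d+)\]\s?(.*)")

def lines_for_task(log_lines, tid):
    """Yield (timestamp, message) for lines emitted by task `tid`."""
    for line in log_lines:
        m = PREFIX.search(line)
        if m and int(m.group(2)) == tid:
            yield float(m.group(1)), m.group(3)

sample = [
    "Sep 1 04:48:48 regress kernel: [10642.540621][T24161] invalid opcode: 0000 [#1] SMP KASAN PTI",
    "Sep 1 04:48:48 regress kernel: [10642.540624][ T4639] irq event stamp: 49626809",
    "Sep 1 04:48:48 regress kernel: [10642.541451][T24161] CPU: 1 PID: 24161 Comm: btrfs Tainted: G W 5.5.19-76348822ab91+ #14",
]

# Print only the lines that belong to the crashing btrfs task.
for ts, msg in lines_for_task(sample, 24161):
    print(ts, msg)
```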
* Re: BUG at fs/btrfs/relocation.c:794! Still happening on misc-next and 5.8.3
2020-09-01 22:53 ` Zygo Blaxell
@ 2020-09-01 23:33 ` Qu Wenruo
2020-09-02 0:14 ` Zygo Blaxell
0 siblings, 1 reply; 13+ messages in thread
From: Qu Wenruo @ 2020-09-01 23:33 UTC (permalink / raw)
To: Zygo Blaxell, Nikolay Borisov; +Cc: David Sterba, linux-btrfs, wqu
This looks like a race with reloc tree creation happening in some other
part of the code.
If you could add debug output for create_reloc_root() and its callers,
we may have a chance to debug it.
As a first step, we can hunt down the BUG_ON()s and make them
exit more gracefully.
I'll try to spare some time to do this in the following week.
Thanks,
Qu
On 2020/9/2 6:53 AM, Zygo Blaxell wrote:
> On Fri, Aug 28, 2020 at 04:42:55PM -0400, Zygo Blaxell wrote:
>> On Fri, Aug 28, 2020 at 09:34:31AM +0300, Nikolay Borisov wrote:
>>> On 28.08.20 at 3:08, Zygo Blaxell wrote:
>>>> On Thu, Aug 27, 2020 at 08:03:13PM -0400, Zygo Blaxell wrote:
>>>>> On Tue, Aug 04, 2020 at 12:16:26PM -0400, Zygo Blaxell wrote:
>>>
>>> <snip>
>>>
>>> Since you can repro reliably could you modify the code in
>>> create_reloc_root so it prints what's the returned error value, I'd
>>> speculate it's EEXIST from
>>>
>>> btrfs_insert_root
>>> btrfs_insert_item
>>> btrfs_insert_empty_item
>>> btrfs_insert_empty_items
>>> btrfs_search_slot
>>>
>>> But better be sure.
>>
>> Here you go, EEXIST == 17:
>>
>> Aug 28 15:30:55 regress kernel: [18452.845182][T31546] BTRFS info (device dm-0): balance: start -dlimit=9
>> Aug 28 15:30:55 regress kernel: [18452.874627][T31546] BTRFS info (device dm-0): relocating block group 15873413742592 flags data
>> Aug 28 15:30:55 regress kernel: [18453.097516][ T2100] btrfs_search_slot ret = 0
>> Aug 28 15:30:55 regress kernel: [18453.104865][ T2100] btrfs_search_slot ret = 0
>> Aug 28 15:30:55 regress kernel: [18453.109751][ T2100] btrfs_search_slot ret = 0
>> Aug 28 15:30:56 regress kernel: [18454.453135][ T2100] btrfs_search_slot ret = 0
>> Aug 28 15:30:56 regress kernel: [18454.453955][ T2100] btrfs_insert_empty_item ret = -17
>> Aug 28 15:30:56 regress kernel: [18454.455022][ T2100] btrfs_insert_root ret = -17
>> Aug 28 15:30:56 regress kernel: [18454.456229][ T2100] ------------[ cut here ]------------
>> Aug 28 15:30:56 regress kernel: [18454.457622][ T2100] kernel BUG at fs/btrfs/relocation.c:795!
>
> I did a low-resolution bisect for this issue. I dug up 5.4, 5.5, 5.6,
> and 5.7 kernel sources, backported btrfs fixes from 5.4 to the obsolete
> kernels, and ran the tests on each kernel. Results:
>
> 5.8: kernel BUG at fs/btrfs/relocation.c:794
>
> 5.7: kernel BUG (same code but different line number)
>
> 5.6: kernel BUG (same as the others)
>
> 5.5: assertion failure (stack trace below)
>
> 5.4: kernel BUG (!)
>
> The 5.4 result is interesting--I've been running 5.4 for some time and
> not hit this before. So there are 3 possible theories:
>
> 1. It's because of sending signals to balance. That has been
> added to my test workload after 5.7 was released, so earlier
> tests on 5.4 would not have triggered it.
>
> 2. There's a regression in 5.4-stable, which I've cherry-picked
> to all the other kernels during my test setup. (On the other
> hand, if I don't backport some fixes, kernels 5.5..5.7 crash
> before they get to this bug.)
>
> 3. There's something rotten in my test filesystem, and the
> BUG will go away for a while if I do a mkfs. Qu asked for
> a dump earlier in this thread, and I provided one.
>
> All three of these theories are testable to some extent, so I'll have
> my test VM run some variations.
>
> The full test workload is:
>
> balance metadata or data at random intervals
>
> scrub, scrub cancel at random intervals
>
> 20x rsync updating files
>
> snapshot create, delete at random intervals
>
> bees dedupe (lots of EXTENT_SAME and LOGICAL_INO calls)
>
> balance cancel at random intervals
>
> kill -9 $(pidof btrfs balance) at random intervals (NEW - added
> when 5.7 came out)
>
> This is the 5.5 root assertion failure:
>
> Sep 1 04:48:48 regress kernel: [10642.537825][T24161] assertion failed: root, in fs/btrfs/relocation.c:837
> Sep 1 04:48:48 regress kernel: [10642.538911][T24161] ------------[ cut here ]------------
> Sep 1 04:48:48 regress kernel: [10642.539704][T24161] kernel BUG at fs/btrfs/ctree.h:3125!
> Sep 1 04:48:48 regress kernel: [10642.540621][T24161] invalid opcode: 0000 [#1] SMP KASAN PTI
> Sep 1 04:48:48 regress kernel: [10642.540624][ T4639] irq event stamp: 49626809
> Sep 1 04:48:48 regress kernel: [10642.540632][ T4639] hardirqs last enabled at (49626809): [<ffffffff8a00481a>] trace_hardirqs_on_thunk+0x1a/0x1c
> Sep 1 04:48:48 regress kernel: [10642.541451][T24161] CPU: 1 PID: 24161 Comm: btrfs Tainted: G W 5.5.19-76348822ab91+ #14
> Sep 1 04:48:48 regress kernel: [10642.542114][ T4639] hardirqs last disabled at (49626808): [<ffffffff8a004836>] trace_hardirqs_off_thunk+0x1a/0x1c
> Sep 1 04:48:48 regress kernel: [10642.543693][T24161] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
> Sep 1 04:48:48 regress kernel: [10642.545017][ T4639] softirqs last enabled at (49626726): [<ffffffff8bc0048b>] __do_softirq+0x48b/0x5be
> Sep 1 04:48:48 regress kernel: [10642.545023][ T4639] softirqs last disabled at (49626715): [<ffffffff8a1258a2>] irq_exit+0x112/0x120
> Sep 1 04:48:48 regress kernel: [10642.546536][T24161] RIP: 0010:assertfail.constprop.42+0x1c/0x1e
> Sep 1 04:48:49 regress kernel: [10642.551589][T24161] Code: 48 c7 c6 c0 90 03 8c e8 89 29 f1 ff 0f 0b 55 89 f1 48 c7 c2 40 82 03 8c 48 89 fe 48 c7 c7 60 83 03 8c 48 89 e5 e8 41 b0 90 ff <0f> 0b 0f 1f 44 00 00 55 48 89 e5 41 54 49 89 f4 53 48 89 fb 48 83
> Sep 1 04:48:49 regress kernel: [10642.554495][T24161] RSP: 0018:ffffc90002327150 EFLAGS: 00010282
> Sep 1 04:48:49 regress kernel: [10642.555371][T24161] RAX: 0000000000000034 RBX: 000004513701c000 RCX: ffffffff8a264242
> Sep 1 04:48:49 regress kernel: [10642.556515][T24161] RDX: 0000000000000000 RSI: 0000000000000008 RDI: ffff8881f580004c
> Sep 1 04:48:49 regress kernel: [10642.557680][T24161] RBP: ffffc90002327150 R08: ffffed103eb017e1 R09: ffffed103eb017e1
> Sep 1 04:48:49 regress kernel: [10642.558895][T24161] R10: ffffed103eb017e0 R11: ffff8881f580bf07 R12: ffff88800d1ea5c0
> Sep 1 04:48:49 regress kernel: [10642.560139][T24161] R13: ffffc900023273e0 R14: 0000000000000000 R15: ffff8880bbf238f0
> Sep 1 04:48:49 regress kernel: [10642.561391][T24161] FS: 00007f03f48488c0(0000) GS:ffff8881f5600000(0000) knlGS:0000000000000000
> Sep 1 04:48:49 regress kernel: [10642.562779][T24161] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> Sep 1 04:48:49 regress kernel: [10642.563801][T24161] CR2: 00007f1cab76f718 CR3: 0000000189a5e004 CR4: 00000000001606e0
> Sep 1 04:48:49 regress kernel: [10642.565046][T24161] Call Trace:
> Sep 1 04:48:49 regress kernel: [10642.565565][T24161] build_backref_tree+0x186b/0x2590
> Sep 1 04:48:49 regress kernel: [10642.566389][T24161] ? relocate_data_extent+0x1a0/0x1a0
> Sep 1 04:48:49 regress kernel: [10642.567295][T24161] ? lock_downgrade+0x3d0/0x3d0
> Sep 1 04:48:49 regress kernel: [10642.568142][T24161] ? match_held_lock+0x20/0x260
> Sep 1 04:48:49 regress kernel: [10642.568925][T24161] ? do_raw_spin_unlock+0xa8/0x140
> Sep 1 04:48:49 regress kernel: [10642.569765][T24161] ? _raw_spin_trylock_bh+0x60/0x80
> Sep 1 04:48:49 regress kernel: [10642.570605][T24161] ? release_extent_buffer+0x13b/0x230
> Sep 1 04:48:49 regress kernel: [10642.571480][T24161] ? free_extent_buffer.part.45+0xd7/0x140
> Sep 1 04:48:49 regress kernel: [10642.572406][T24161] relocate_tree_blocks+0x204/0xa50
> Sep 1 04:48:49 regress kernel: [10642.573244][T24161] ? build_backref_tree+0x2590/0x2590
> Sep 1 04:48:49 regress kernel: [10642.574103][T24161] ? rb_insert_color+0x3af/0x400
> Sep 1 04:48:49 regress kernel: [10642.574896][T24161] ? kmem_cache_alloc_trace+0x5af/0x740
> Sep 1 04:48:49 regress kernel: [10642.575785][T24161] ? tree_insert+0x90/0xb0
> Sep 1 04:48:49 regress kernel: [10642.576495][T24161] ? add_tree_block.isra.38+0x1d6/0x230
> Sep 1 04:48:49 regress kernel: [10642.577387][T24161] relocate_block_group+0x528/0x9d0
> Sep 1 04:48:49 regress kernel: [10642.578220][T24161] ? merge_reloc_roots+0x470/0x470
> Sep 1 04:48:49 regress kernel: [10642.579047][T24161] btrfs_relocate_block_group+0x26e/0x4c0
> Sep 1 04:48:49 regress kernel: [10642.579968][T24161] btrfs_relocate_chunk+0x52/0xf0
> Sep 1 04:48:49 regress kernel: [10642.580773][T24161] btrfs_balance+0xe5b/0x1800
> Sep 1 04:48:49 regress kernel: [10642.581542][T24161] ? btrfs_relocate_chunk+0xf0/0xf0
> Sep 1 04:48:49 regress kernel: [10642.582381][T24161] ? kmem_cache_alloc_trace+0x5af/0x740
> Sep 1 04:48:49 regress kernel: [10642.583270][T24161] ? _copy_from_user+0xaa/0xd0
> Sep 1 04:48:49 regress kernel: [10642.584022][T24161] btrfs_ioctl_balance+0x3de/0x4c0
> Sep 1 04:48:49 regress kernel: [10642.584819][T24161] btrfs_ioctl+0x3122/0x4470
> Sep 1 04:48:49 regress kernel: [10642.585540][T24161] ? __asan_loadN+0xf/0x20
> Sep 1 04:48:49 regress kernel: [10642.586229][T24161] ? __asan_loadN+0xf/0x20
> Sep 1 04:48:49 regress kernel: [10642.586920][T24161] ? btrfs_ioctl_get_supported_features+0x30/0x30
> Sep 1 04:48:49 regress kernel: [10642.587935][T24161] ? __asan_loadN+0xf/0x20
> Sep 1 04:48:49 regress kernel: [10642.588649][T24161] ? pvclock_clocksource_read+0xeb/0x190
> Sep 1 04:48:49 regress kernel: [10642.589566][T24161] ? __asan_loadN+0xf/0x20
> Sep 1 04:48:49 regress kernel: [10642.590254][T24161] ? pvclock_clocksource_read+0xeb/0x190
> Sep 1 04:48:49 regress kernel: [10642.591128][T24161] ? __kasan_check_read+0x11/0x20
> Sep 1 04:48:49 regress kernel: [10642.591913][T24161] ? check_chain_key+0x1e6/0x2e0
> Sep 1 04:48:49 regress kernel: [10642.592707][T24161] ? __asan_loadN+0xf/0x20
> Sep 1 04:48:49 regress kernel: [10642.593409][T24161] ? pvclock_clocksource_read+0xeb/0x190
> Sep 1 04:48:49 regress kernel: [10642.594312][T24161] ? kvm_sched_clock_read+0x18/0x30
> Sep 1 04:48:49 regress kernel: [10642.595139][T24161] ? check_chain_key+0x1e6/0x2e0
> Sep 1 04:48:49 regress kernel: [10642.595929][T24161] ? sched_clock_cpu+0x1b/0x120
> Sep 1 04:48:49 regress kernel: [10642.596712][T24161] do_vfs_ioctl+0x13e/0xad0
> Sep 1 04:48:49 regress kernel: [10642.597432][T24161] ? btrfs_ioctl_get_supported_features+0x30/0x30
> Sep 1 04:48:49 regress kernel: [10642.598455][T24161] ? do_vfs_ioctl+0x13e/0xad0
> Sep 1 04:48:49 regress kernel: [10642.599202][T24161] ? compat_ioctl_preallocate+0x170/0x170
> Sep 1 04:48:49 regress kernel: [10642.600128][T24161] ? __kasan_check_write+0x14/0x20
> Sep 1 04:48:49 regress kernel: [10642.600949][T24161] ? up_read+0x176/0x4f0
> Sep 1 04:48:49 regress kernel: [10642.601648][T24161] ? down_write_nested+0x2d0/0x2d0
> Sep 1 04:48:49 regress kernel: [10642.602476][T24161] ? handle_mm_fault+0x211/0x480
> Sep 1 04:48:49 regress kernel: [10642.603263][T24161] ? __kasan_check_read+0x11/0x20
> Sep 1 04:48:49 regress kernel: [10642.604062][T24161] ? __fget_light+0xb2/0x110
> Sep 1 04:48:49 regress kernel: [10642.604805][T24161] ksys_ioctl+0x67/0x90
> Sep 1 04:48:49 regress kernel: [10642.605471][T24161] __x64_sys_ioctl+0x43/0x50
> Sep 1 04:48:49 regress kernel: [10642.606203][T24161] do_syscall_64+0x77/0x2d0
> Sep 1 04:48:49 regress kernel: [10642.606933][T24161] entry_SYSCALL_64_after_hwframe+0x49/0xbe
> Sep 1 04:48:49 regress kernel: [10642.607875][T24161] RIP: 0033:0x7f03f493b427
> Sep 1 04:48:49 regress kernel: [10642.608588][T24161] Code: 00 00 90 48 8b 05 69 aa 0c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 39 aa 0c 00 f7 d8 64 89 01 48
> Sep 1 04:48:49 regress kernel: [10642.611697][T24161] RSP: 002b:00007ffd6bd78fb8 EFLAGS: 00000206 ORIG_RAX: 0000000000000010
> Sep 1 04:48:49 regress kernel: [10642.613035][T24161] RAX: ffffffffffffffda RBX: 00007ffd6bd79058 RCX: 00007f03f493b427
> Sep 1 04:48:49 regress kernel: [10642.614313][T24161] RDX: 00007ffd6bd79058 RSI: 00000000c4009420 RDI: 0000000000000003
> Sep 1 04:48:49 regress kernel: [10642.615605][T24161] RBP: 0000000000000003 R08: 0000000000000003 R09: 0000000000000078
> Sep 1 04:48:49 regress kernel: [10642.616877][T24161] R10: fffffffffffff5ab R11: 0000000000000206 R12: 0000000000000001
> Sep 1 04:48:49 regress kernel: [10642.618132][T24161] R13: 0000000000000000 R14: 00007ffd6bd7aa46 R15: 0000000000000001
> Sep 1 04:48:49 regress kernel: [10642.619378][T24161] Modules linked in:
> Sep 1 04:48:49 regress kernel: [10642.620153][T24161] ---[ end trace a33c17a7d43dd973 ]---
>
* Re: BUG at fs/btrfs/relocation.c:794! Still happening on misc-next and 5.8.3
2020-09-01 23:33 ` Qu Wenruo
@ 2020-09-02 0:14 ` Zygo Blaxell
2020-09-02 1:46 ` Qu Wenruo
0 siblings, 1 reply; 13+ messages in thread
From: Zygo Blaxell @ 2020-09-02 0:14 UTC (permalink / raw)
To: Qu Wenruo; +Cc: Nikolay Borisov, David Sterba, linux-btrfs, wqu
On Wed, Sep 02, 2020 at 07:33:21AM +0800, Qu Wenruo wrote:
> This looks like a race between some reloc tree creation from some other
> part.
>
> If you could add debug output for create_reloc_root() and its callers,
> we may have a chance to debug it.
The callers are always the same:
btrfs_init_reloc_root+0x1b0
record_root_in_trans+0x18c
record_root_in_trans+0x8b
start_transaction+0x189
(gdb) l *(create_reloc_root+0x468)
0xffffffff81930848 is in create_reloc_root (fs/btrfs/relocation.c:1503).
1498 btrfs_tree_unlock(eb);
1499 free_extent_buffer(eb);
1500
1501 ret = btrfs_insert_root(trans, fs_info->tree_root,
1502 &root_key, root_item);
1503 BUG_ON(ret);
1504 kfree(root_item);
1505
1506 reloc_root = btrfs_read_tree_root(fs_info->tree_root, &root_key);
1507 BUG_ON(IS_ERR(reloc_root));
(gdb) l *(btrfs_init_reloc_root+0x1b0)
0xffffffff81937db0 is in btrfs_init_reloc_root (fs/btrfs/relocation.c:1567).
1562 if (!trans->reloc_reserved) {
1563 rsv = trans->block_rsv;
1564 trans->block_rsv = rc->block_rsv;
1565 clear_rsv = 1;
1566 }
1567 reloc_root = create_reloc_root(trans, root, root->root_key.objectid);
1568 if (clear_rsv)
1569 trans->block_rsv = rsv;
1570
1571 ret = __add_reloc_root(reloc_root);
(gdb) l *(record_root_in_trans+0x18c)
0xffffffff81889bfc is in record_root_in_trans (./include/asm-generic/bitops/instrumented-atomic.h:41).
36 *
37 * This is a relaxed atomic operation (no implied memory barriers).
38 */
39 static inline void clear_bit(long nr, volatile unsigned long *addr)
40 {
41 kasan_check_write(addr + BIT_WORD(nr), sizeof(long));
42 arch_clear_bit(nr, addr);
43 }
44
45 /**
(gdb) l *(start_transaction+0x189)
0xffffffff8188f0d9 is in start_transaction (fs/btrfs/transaction.c:697).
692 * Thus it need to be called after current->journal_info initialized,
693 * or we can deadlock.
694 */
695 btrfs_record_root_in_trans(h, root);
696
697 return h;
698
699 join_fail:
700 if (type & __TRANS_FREEZABLE)
701 sb_end_intwrite(fs_info->sb);
(gdb) quit
It seems to be very early in the transaction. Is there anything to
output here? Or are we more interested in what is left over from
the previous transaction?
> But for the first step, we can hunt down the BUG_ON()s first and make it
> exit more gracefully.
>
> I'll try to spare some time to do this in the following week.
>
> Thanks,
> Qu
>
> On 2020/9/2 6:53 AM, Zygo Blaxell wrote:
> > On Fri, Aug 28, 2020 at 04:42:55PM -0400, Zygo Blaxell wrote:
> >> On Fri, Aug 28, 2020 at 09:34:31AM +0300, Nikolay Borisov wrote:
> >>> On 28.08.20 at 3:08, Zygo Blaxell wrote:
> >>>> On Thu, Aug 27, 2020 at 08:03:13PM -0400, Zygo Blaxell wrote:
> >>>>> On Tue, Aug 04, 2020 at 12:16:26PM -0400, Zygo Blaxell wrote:
> >>>
> >>> <snip>
> >>>
> >>> Since you can repro reliably could you modify the code in
> >>> create_reloc_root so it prints what's the returned error value, I'd
> >>> speculate it's EEXIST from
> >>>
> >>> btrfs_insert_root
> >>> btrfs_insert_item
> >>> btrfs_insert_empty_item
> >>> btrfs_insert_empty_items
> >>> btrfs_search_slot
> >>>
> >>> But better be sure.
> >>
> >> Here you go, EEXIST == 17:
> >>
> >> Aug 28 15:30:55 regress kernel: [18452.845182][T31546] BTRFS info (device dm-0): balance: start -dlimit=9
> >> Aug 28 15:30:55 regress kernel: [18452.874627][T31546] BTRFS info (device dm-0): relocating block group 15873413742592 flags data
> >> Aug 28 15:30:55 regress kernel: [18453.097516][ T2100] btrfs_search_slot ret = 0
> >> Aug 28 15:30:55 regress kernel: [18453.104865][ T2100] btrfs_search_slot ret = 0
> >> Aug 28 15:30:55 regress kernel: [18453.109751][ T2100] btrfs_search_slot ret = 0
> >> Aug 28 15:30:56 regress kernel: [18454.453135][ T2100] btrfs_search_slot ret = 0
> >> Aug 28 15:30:56 regress kernel: [18454.453955][ T2100] btrfs_insert_empty_item ret = -17
> >> Aug 28 15:30:56 regress kernel: [18454.455022][ T2100] btrfs_insert_root ret = -17
> >> Aug 28 15:30:56 regress kernel: [18454.456229][ T2100] ------------[ cut here ]------------
> >> Aug 28 15:30:56 regress kernel: [18454.457622][ T2100] kernel BUG at fs/btrfs/relocation.c:795!
> >
> > I did a low-resolution bisect for this issue. I dug up 5.4, 5.5, 5.6,
> > and 5.7 kernel sources, backported btrfs fixes from 5.4 to the obsolete
> > kernels, and ran the tests on each kernel. Results:
> >
> > 5.8: kernel BUG at fs/btrfs/relocation.c:794
> >
> > 5.7: kernel BUG (same code but different line number)
> >
> > 5.6: kernel BUG (same as the others)
> >
> > 5.5: assertion failure (stack trace below)
> >
> > 5.4: kernel BUG (!)
> >
> > The 5.4 result is interesting--I've been running 5.4 for some time and
> > not hit this before. So there are 3 possible theories:
> >
> > 1. It's because of sending signals to balance. That has been
> > added to my test workload after 5.7 was released, so earlier
> > tests on 5.4 would not have triggered it.
> >
> > 2. There's a regression in 5.4-stable, which I've cherry-picked
> > to all the other kernels during my test setup. (On the other
> > hand, if I don't backport some fixes, kernels 5.5..5.7 crash
> > before they get to this bug.)
> >
> > 3. There's something rotten in my test filesystem, and the
> > BUG will go away for a while if I do a mkfs. Qu asked for
> > a dump earlier in this thread, and I provided one.
> >
> > All three of these theories are testable to some extent, so I'll have
> > my test VM run some variations.
> >
> > The full test workload is:
> >
> > balance metadata or data at random intervals
> >
> > scrub, scrub cancel at random intervals
> >
> > 20x rsync updating files
> >
> > snapshot create, delete at random intervals
> >
> > bees dedupe (lots of EXTENT_SAME and LOGICAL_INO calls)
> >
> > balance cancel at random intervals
> >
> > kill -9 $(pidof btrfs balance) at random intervals (NEW - added
> > when 5.7 came out)
> >
> > This is the 5.5 root assertion failure:
> >
> > Sep 1 04:48:48 regress kernel: [10642.537825][T24161] assertion failed: root, in fs/btrfs/relocation.c:837
> > Sep 1 04:48:48 regress kernel: [10642.538911][T24161] ------------[ cut here ]------------
> > Sep 1 04:48:48 regress kernel: [10642.539704][T24161] kernel BUG at fs/btrfs/ctree.h:3125!
> > Sep 1 04:48:48 regress kernel: [10642.540621][T24161] invalid opcode: 0000 [#1] SMP KASAN PTI
> > Sep 1 04:48:48 regress kernel: [10642.540624][ T4639] irq event stamp: 49626809
> > Sep 1 04:48:48 regress kernel: [10642.540632][ T4639] hardirqs last enabled at (49626809): [<ffffffff8a00481a>] trace_hardirqs_on_thunk+0x1a/0x1c
> > Sep 1 04:48:48 regress kernel: [10642.541451][T24161] CPU: 1 PID: 24161 Comm: btrfs Tainted: G W 5.5.19-76348822ab91+ #14
> > Sep 1 04:48:48 regress kernel: [10642.542114][ T4639] hardirqs last disabled at (49626808): [<ffffffff8a004836>] trace_hardirqs_off_thunk+0x1a/0x1c
> > Sep 1 04:48:48 regress kernel: [10642.543693][T24161] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
> > Sep 1 04:48:48 regress kernel: [10642.545017][ T4639] softirqs last enabled at (49626726): [<ffffffff8bc0048b>] __do_softirq+0x48b/0x5be
> > Sep 1 04:48:48 regress kernel: [10642.545023][ T4639] softirqs last disabled at (49626715): [<ffffffff8a1258a2>] irq_exit+0x112/0x120
> > Sep 1 04:48:48 regress kernel: [10642.546536][T24161] RIP: 0010:assertfail.constprop.42+0x1c/0x1e
> > Sep 1 04:48:49 regress kernel: [10642.551589][T24161] Code: 48 c7 c6 c0 90 03 8c e8 89 29 f1 ff 0f 0b 55 89 f1 48 c7 c2 40 82 03 8c 48 89 fe 48 c7 c7 60 83 03 8c 48 89 e5 e8 41 b0 90 ff <0f> 0b 0f 1f 44 00 00 55 48 89 e5 41 54 49 89 f4 53 48 89 fb 48 83
> > Sep 1 04:48:49 regress kernel: [10642.554495][T24161] RSP: 0018:ffffc90002327150 EFLAGS: 00010282
> > Sep 1 04:48:49 regress kernel: [10642.555371][T24161] RAX: 0000000000000034 RBX: 000004513701c000 RCX: ffffffff8a264242
> > Sep 1 04:48:49 regress kernel: [10642.556515][T24161] RDX: 0000000000000000 RSI: 0000000000000008 RDI: ffff8881f580004c
> > Sep 1 04:48:49 regress kernel: [10642.557680][T24161] RBP: ffffc90002327150 R08: ffffed103eb017e1 R09: ffffed103eb017e1
> > Sep 1 04:48:49 regress kernel: [10642.558895][T24161] R10: ffffed103eb017e0 R11: ffff8881f580bf07 R12: ffff88800d1ea5c0
> > Sep 1 04:48:49 regress kernel: [10642.560139][T24161] R13: ffffc900023273e0 R14: 0000000000000000 R15: ffff8880bbf238f0
> > Sep 1 04:48:49 regress kernel: [10642.561391][T24161] FS: 00007f03f48488c0(0000) GS:ffff8881f5600000(0000) knlGS:0000000000000000
> > Sep 1 04:48:49 regress kernel: [10642.562779][T24161] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > Sep 1 04:48:49 regress kernel: [10642.563801][T24161] CR2: 00007f1cab76f718 CR3: 0000000189a5e004 CR4: 00000000001606e0
> > Sep 1 04:48:49 regress kernel: [10642.565046][T24161] Call Trace:
> > Sep 1 04:48:49 regress kernel: [10642.565565][T24161] build_backref_tree+0x186b/0x2590
> > Sep 1 04:48:49 regress kernel: [10642.566389][T24161] ? relocate_data_extent+0x1a0/0x1a0
> > Sep 1 04:48:49 regress kernel: [10642.567295][T24161] ? lock_downgrade+0x3d0/0x3d0
> > Sep 1 04:48:49 regress kernel: [10642.568142][T24161] ? match_held_lock+0x20/0x260
> > Sep 1 04:48:49 regress kernel: [10642.568925][T24161] ? do_raw_spin_unlock+0xa8/0x140
> > Sep 1 04:48:49 regress kernel: [10642.569765][T24161] ? _raw_spin_trylock_bh+0x60/0x80
> > Sep 1 04:48:49 regress kernel: [10642.570605][T24161] ? release_extent_buffer+0x13b/0x230
> > Sep 1 04:48:49 regress kernel: [10642.571480][T24161] ? free_extent_buffer.part.45+0xd7/0x140
> > Sep 1 04:48:49 regress kernel: [10642.572406][T24161] relocate_tree_blocks+0x204/0xa50
> > Sep 1 04:48:49 regress kernel: [10642.573244][T24161] ? build_backref_tree+0x2590/0x2590
> > Sep 1 04:48:49 regress kernel: [10642.574103][T24161] ? rb_insert_color+0x3af/0x400
> > Sep 1 04:48:49 regress kernel: [10642.574896][T24161] ? kmem_cache_alloc_trace+0x5af/0x740
> > Sep 1 04:48:49 regress kernel: [10642.575785][T24161] ? tree_insert+0x90/0xb0
> > Sep 1 04:48:49 regress kernel: [10642.576495][T24161] ? add_tree_block.isra.38+0x1d6/0x230
> > Sep 1 04:48:49 regress kernel: [10642.577387][T24161] relocate_block_group+0x528/0x9d0
> > Sep 1 04:48:49 regress kernel: [10642.578220][T24161] ? merge_reloc_roots+0x470/0x470
> > Sep 1 04:48:49 regress kernel: [10642.579047][T24161] btrfs_relocate_block_group+0x26e/0x4c0
> > Sep 1 04:48:49 regress kernel: [10642.579968][T24161] btrfs_relocate_chunk+0x52/0xf0
> > Sep 1 04:48:49 regress kernel: [10642.580773][T24161] btrfs_balance+0xe5b/0x1800
> > Sep 1 04:48:49 regress kernel: [10642.581542][T24161] ? btrfs_relocate_chunk+0xf0/0xf0
> > Sep 1 04:48:49 regress kernel: [10642.582381][T24161] ? kmem_cache_alloc_trace+0x5af/0x740
> > Sep 1 04:48:49 regress kernel: [10642.583270][T24161] ? _copy_from_user+0xaa/0xd0
> > Sep 1 04:48:49 regress kernel: [10642.584022][T24161] btrfs_ioctl_balance+0x3de/0x4c0
> > Sep 1 04:48:49 regress kernel: [10642.584819][T24161] btrfs_ioctl+0x3122/0x4470
> > Sep 1 04:48:49 regress kernel: [10642.585540][T24161] ? __asan_loadN+0xf/0x20
> > Sep 1 04:48:49 regress kernel: [10642.586229][T24161] ? __asan_loadN+0xf/0x20
> > Sep 1 04:48:49 regress kernel: [10642.586920][T24161] ? btrfs_ioctl_get_supported_features+0x30/0x30
> > Sep 1 04:48:49 regress kernel: [10642.587935][T24161] ? __asan_loadN+0xf/0x20
> > Sep 1 04:48:49 regress kernel: [10642.588649][T24161] ? pvclock_clocksource_read+0xeb/0x190
> > Sep 1 04:48:49 regress kernel: [10642.589566][T24161] ? __asan_loadN+0xf/0x20
> > Sep 1 04:48:49 regress kernel: [10642.590254][T24161] ? pvclock_clocksource_read+0xeb/0x190
> > Sep 1 04:48:49 regress kernel: [10642.591128][T24161] ? __kasan_check_read+0x11/0x20
> > Sep 1 04:48:49 regress kernel: [10642.591913][T24161] ? check_chain_key+0x1e6/0x2e0
> > Sep 1 04:48:49 regress kernel: [10642.592707][T24161] ? __asan_loadN+0xf/0x20
> > Sep 1 04:48:49 regress kernel: [10642.593409][T24161] ? pvclock_clocksource_read+0xeb/0x190
> > Sep 1 04:48:49 regress kernel: [10642.594312][T24161] ? kvm_sched_clock_read+0x18/0x30
> > Sep 1 04:48:49 regress kernel: [10642.595139][T24161] ? check_chain_key+0x1e6/0x2e0
> > Sep 1 04:48:49 regress kernel: [10642.595929][T24161] ? sched_clock_cpu+0x1b/0x120
> > Sep 1 04:48:49 regress kernel: [10642.596712][T24161] do_vfs_ioctl+0x13e/0xad0
> > Sep 1 04:48:49 regress kernel: [10642.597432][T24161] ? btrfs_ioctl_get_supported_features+0x30/0x30
> > Sep 1 04:48:49 regress kernel: [10642.598455][T24161] ? do_vfs_ioctl+0x13e/0xad0
> > Sep 1 04:48:49 regress kernel: [10642.599202][T24161] ? compat_ioctl_preallocate+0x170/0x170
> > Sep 1 04:48:49 regress kernel: [10642.600128][T24161] ? __kasan_check_write+0x14/0x20
> > Sep 1 04:48:49 regress kernel: [10642.600949][T24161] ? up_read+0x176/0x4f0
> > Sep 1 04:48:49 regress kernel: [10642.601648][T24161] ? down_write_nested+0x2d0/0x2d0
> > Sep 1 04:48:49 regress kernel: [10642.602476][T24161] ? handle_mm_fault+0x211/0x480
> > Sep 1 04:48:49 regress kernel: [10642.603263][T24161] ? __kasan_check_read+0x11/0x20
> > Sep 1 04:48:49 regress kernel: [10642.604062][T24161] ? __fget_light+0xb2/0x110
> > Sep 1 04:48:49 regress kernel: [10642.604805][T24161] ksys_ioctl+0x67/0x90
> > Sep 1 04:48:49 regress kernel: [10642.605471][T24161] __x64_sys_ioctl+0x43/0x50
> > Sep 1 04:48:49 regress kernel: [10642.606203][T24161] do_syscall_64+0x77/0x2d0
> > Sep 1 04:48:49 regress kernel: [10642.606933][T24161] entry_SYSCALL_64_after_hwframe+0x49/0xbe
> > Sep 1 04:48:49 regress kernel: [10642.607875][T24161] RIP: 0033:0x7f03f493b427
> > Sep 1 04:48:49 regress kernel: [10642.608588][T24161] Code: 00 00 90 48 8b 05 69 aa 0c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 39 aa 0c 00 f7 d8 64 89 01 48
> > Sep 1 04:48:49 regress kernel: [10642.611697][T24161] RSP: 002b:00007ffd6bd78fb8 EFLAGS: 00000206 ORIG_RAX: 0000000000000010
> > Sep 1 04:48:49 regress kernel: [10642.613035][T24161] RAX: ffffffffffffffda RBX: 00007ffd6bd79058 RCX: 00007f03f493b427
> > Sep 1 04:48:49 regress kernel: [10642.614313][T24161] RDX: 00007ffd6bd79058 RSI: 00000000c4009420 RDI: 0000000000000003
> > Sep 1 04:48:49 regress kernel: [10642.615605][T24161] RBP: 0000000000000003 R08: 0000000000000003 R09: 0000000000000078
> > Sep 1 04:48:49 regress kernel: [10642.616877][T24161] R10: fffffffffffff5ab R11: 0000000000000206 R12: 0000000000000001
> > Sep 1 04:48:49 regress kernel: [10642.618132][T24161] R13: 0000000000000000 R14: 00007ffd6bd7aa46 R15: 0000000000000001
> > Sep 1 04:48:49 regress kernel: [10642.619378][T24161] Modules linked in:
> > Sep 1 04:48:49 regress kernel: [10642.620153][T24161] ---[ end trace a33c17a7d43dd973 ]---
> >
* Re: BUG at fs/btrfs/relocation.c:794! Still happening on misc-next and 5.8.3
2020-09-02 0:14 ` Zygo Blaxell
@ 2020-09-02 1:46 ` Qu Wenruo
2020-09-04 15:54 ` Zygo Blaxell
0 siblings, 1 reply; 13+ messages in thread
From: Qu Wenruo @ 2020-09-02 1:46 UTC (permalink / raw)
To: Zygo Blaxell; +Cc: Nikolay Borisov, David Sterba, linux-btrfs, wqu
On 2020/9/2 8:14 AM, Zygo Blaxell wrote:
> On Wed, Sep 02, 2020 at 07:33:21AM +0800, Qu Wenruo wrote:
>> This looks like a race between some reloc tree creation from some other
>> part.
>>
>> If you could add debug output for create_reloc_root() and its callers,
>> we may have a chance to debug it.
>
> The callers are always the same:
>
> btrfs_init_reloc_root+0x1b0
> record_root_in_trans+0x18c
> record_root_in_trans+0x8b
> start_transaction+0x189
>
> (gdb) l *(create_reloc_root+0x468)
> 0xffffffff81930848 is in create_reloc_root (fs/btrfs/relocation.c:1503).
> 1498 btrfs_tree_unlock(eb);
> 1499 free_extent_buffer(eb);
> 1500
> 1501 ret = btrfs_insert_root(trans, fs_info->tree_root,
> 1502 &root_key, root_item);
> 1503 BUG_ON(ret);
> 1504 kfree(root_item);
> 1505
> 1506 reloc_root = btrfs_read_tree_root(fs_info->tree_root, &root_key);
> 1507 BUG_ON(IS_ERR(reloc_root));
> (gdb) l *(btrfs_init_reloc_root+0x1b0)
> 0xffffffff81937db0 is in btrfs_init_reloc_root (fs/btrfs/relocation.c:1567).
> 1562 if (!trans->reloc_reserved) {
> 1563 rsv = trans->block_rsv;
> 1564 trans->block_rsv = rc->block_rsv;
> 1565 clear_rsv = 1;
> 1566 }
> 1567 reloc_root = create_reloc_root(trans, root, root->root_key.objectid);
> 1568 if (clear_rsv)
> 1569 trans->block_rsv = rsv;
> 1570
> 1571 ret = __add_reloc_root(reloc_root);
> (gdb) l *(record_root_in_trans+0x18c)
> 0xffffffff81889bfc is in record_root_in_trans (./include/asm-generic/bitops/instrumented-atomic.h:41).
> 36 *
> 37 * This is a relaxed atomic operation (no implied memory barriers).
> 38 */
> 39 static inline void clear_bit(long nr, volatile unsigned long *addr)
> 40 {
> 41 kasan_check_write(addr + BIT_WORD(nr), sizeof(long));
> 42 arch_clear_bit(nr, addr);
> 43 }
> 44
> 45 /**
> (gdb) l *(start_transaction+0x189)
> 0xffffffff8188f0d9 is in start_transaction (fs/btrfs/transaction.c:697).
> 692 * Thus it need to be called after current->journal_info initialized,
> 693 * or we can deadlock.
> 694 */
> 695 btrfs_record_root_in_trans(h, root);
> 696
> 697 return h;
> 698
> 699 join_fail:
> 700 if (type & __TRANS_FREEZABLE)
> 701 sb_end_intwrite(fs_info->sb);
> (gdb) quit
>
> It seems to be very early in the transaction. Is there anything to
> output here? Or are we more interested in what is left over from
> the previous transaction?
What I mean is that I want to see who else created the reloc tree, not only
who hit the EEXIST BUG_ON().
That's why I'd like to add enough debug output (pr_info or similar) to
create_reloc_root(), so that we can catch the ordinary calls that seem
safe but may be unsafe for other callers.
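
The idea can be illustrated with a small userspace toy model. This is only
a sketch, not kernel code: struct root_tree, toy_insert_root() and
toy_create_reloc_root() are hypothetical names made up for the example, and
the "tree" is just a flat array of objectids. It shows why logging every
creation attempt with its caller, rather than only the failing one, makes
the earlier creator visible when a later insert returns -EEXIST (-17 on
Linux).

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

#define MAX_ROOTS 16

/* Toy stand-in for the root tree: a flat set of objectids (hypothetical). */
struct root_tree {
	unsigned long long keys[MAX_ROOTS];
	int nr;
};

/* Insert a key; a duplicate key is rejected with -EEXIST. */
static int toy_insert_root(struct root_tree *tree, unsigned long long objectid)
{
	for (int i = 0; i < tree->nr; i++)
		if (tree->keys[i] == objectid)
			return -EEXIST;
	if (tree->nr == MAX_ROOTS)
		return -ENOSPC;
	tree->keys[tree->nr++] = objectid;
	return 0;
}

/*
 * Log every creation attempt, successful or not, together with a label
 * naming the caller. The successful line is the interesting one: it tells
 * us who created the entry that a later caller collides with.
 */
static int toy_create_reloc_root(struct root_tree *tree,
				 unsigned long long objectid,
				 const char *caller)
{
	int ret;

	assert(tree && caller);
	ret = toy_insert_root(tree, objectid);
	fprintf(stderr, "create_reloc_root: objectid=%llu caller=%s ret=%d\n",
		objectid, caller, ret);
	return ret;
}
```

Calling toy_create_reloc_root() twice with the same objectid from two
different caller labels logs the first call with ret=0 and the second with
ret=-17, so the log names both sides of the collision. In the kernel the
equivalent output would come from pr_info() or similar, and the caller would
have to be identified some other way, for example with a stack dump.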
Thanks,
Qu
>
>> But for the first step, we can hunt down the BUG_ON()s first and make it
>> exit more gracefully.
>>
>> I'll try to spare some time to do this in the following week.
>>
>> Thanks,
>> Qu
>>
>> On 2020/9/2 6:53 AM, Zygo Blaxell wrote:
>>> On Fri, Aug 28, 2020 at 04:42:55PM -0400, Zygo Blaxell wrote:
>>>> On Fri, Aug 28, 2020 at 09:34:31AM +0300, Nikolay Borisov wrote:
>>>>> On 28.08.20 at 3:08, Zygo Blaxell wrote:
>>>>>> On Thu, Aug 27, 2020 at 08:03:13PM -0400, Zygo Blaxell wrote:
>>>>>>> On Tue, Aug 04, 2020 at 12:16:26PM -0400, Zygo Blaxell wrote:
>>>>>
>>>>> <snip>
>>>>>
>>>>> Since you can repro reliably could you modify the code in
>>>>> create_reloc_root so it prints what's the returned error value, I'd
>>>>> speculate it's EEXIST from
>>>>>
>>>>> btrfs_insert_root
>>>>> btrfs_insert_item
>>>>> btrfs_insert_empty_item
>>>>> btrfs_insert_empty_items
>>>>> btrfs_search_slot
>>>>>
>>>>> But better be sure.
>>>>
>>>> Here you go, EEXIST == 17:
>>>>
>>>> Aug 28 15:30:55 regress kernel: [18452.845182][T31546] BTRFS info (device dm-0): balance: start -dlimit=9
>>>> Aug 28 15:30:55 regress kernel: [18452.874627][T31546] BTRFS info (device dm-0): relocating block group 15873413742592 flags data
>>>> Aug 28 15:30:55 regress kernel: [18453.097516][ T2100] btrfs_search_slot ret = 0
>>>> Aug 28 15:30:55 regress kernel: [18453.104865][ T2100] btrfs_search_slot ret = 0
>>>> Aug 28 15:30:55 regress kernel: [18453.109751][ T2100] btrfs_search_slot ret = 0
>>>> Aug 28 15:30:56 regress kernel: [18454.453135][ T2100] btrfs_search_slot ret = 0
>>>> Aug 28 15:30:56 regress kernel: [18454.453955][ T2100] btrfs_insert_empty_item ret = -17
>>>> Aug 28 15:30:56 regress kernel: [18454.455022][ T2100] btrfs_insert_root ret = -17
>>>> Aug 28 15:30:56 regress kernel: [18454.456229][ T2100] ------------[ cut here ]------------
>>>> Aug 28 15:30:56 regress kernel: [18454.457622][ T2100] kernel BUG at fs/btrfs/relocation.c:795!
>>>
>>> I did a low-resolution bisect for this issue. I dug up 5.4, 5.5, 5.6,
>>> and 5.7 kernel sources, backported btrfs fixes from 5.4 to the obsolete
>>> kernels, and ran the tests on each kernel. Results:
>>>
>>> 5.8: kernel BUG at fs/btrfs/relocation.c:794
>>>
>>> 5.7: kernel BUG (same code but different line number)
>>>
>>> 5.6: kernel BUG (same as the others)
>>>
>>> 5.5: assertion failure (stack trace below)
>>>
>>> 5.4: kernel BUG (!)
>>>
>>> The 5.4 result is interesting--I've been running 5.4 for some time and
>>> not hit this before. So there are 3 possible theories:
>>>
>>> 1. It's because of sending signals to balance. That has been
>>> added to my test workload after 5.7 was released, so earlier
>>> tests on 5.4 would not have triggered it.
>>>
>>> 2. There's a regression in 5.4-stable, which I've cherry-picked
>>> to all the other kernels during my test setup. (On the other
>>> hand, if I don't backport some fixes, kernels 5.5..5.7 crash
>>> before they get to this bug.)
>>>
>>> 3. There's something rotten in my test filesystem, and the
>>> BUG will go away for a while if I do a mkfs. Qu asked for
>>> a dump earlier in this thread, and I provided one.
>>>
>>> All three of these theories are testable to some extent, so I'll have
>>> my test VM run some variations.
>>>
>>> The full test workload is:
>>>
>>> balance metadata or data at random intervals
>>>
>>> scrub, scrub cancel at random intervals
>>>
>>> 20x rsync updating files
>>>
>>> snapshot create, delete at random intervals
>>>
>>> bees dedupe (lots of EXTENT_SAME and LOGICAL_INO calls)
>>>
>>> balance cancel at random intervals
>>>
>>> kill -9 $(pidof btrfs balance) at random intervals (NEW - added
>>> when 5.7 came out)
>>>
>>> This is the 5.5 root assertion failure:
>>>
>>> Sep 1 04:48:48 regress kernel: [10642.537825][T24161] assertion failed: root, in fs/btrfs/relocation.c:837
>>> Sep 1 04:48:48 regress kernel: [10642.538911][T24161] ------------[ cut here ]------------
>>> Sep 1 04:48:48 regress kernel: [10642.539704][T24161] kernel BUG at fs/btrfs/ctree.h:3125!
>>> Sep 1 04:48:48 regress kernel: [10642.540621][T24161] invalid opcode: 0000 [#1] SMP KASAN PTI
>>> Sep 1 04:48:48 regress kernel: [10642.540624][ T4639] irq event stamp: 49626809
>>> Sep 1 04:48:48 regress kernel: [10642.540632][ T4639] hardirqs last enabled at (49626809): [<ffffffff8a00481a>] trace_hardirqs_on_thunk+0x1a/0x1c
>>> Sep 1 04:48:48 regress kernel: [10642.541451][T24161] CPU: 1 PID: 24161 Comm: btrfs Tainted: G W 5.5.19-76348822ab91+ #14
>>> Sep 1 04:48:48 regress kernel: [10642.542114][ T4639] hardirqs last disabled at (49626808): [<ffffffff8a004836>] trace_hardirqs_off_thunk+0x1a/0x1c
>>> Sep 1 04:48:48 regress kernel: [10642.543693][T24161] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
>>> Sep 1 04:48:48 regress kernel: [10642.545017][ T4639] softirqs last enabled at (49626726): [<ffffffff8bc0048b>] __do_softirq+0x48b/0x5be
>>> Sep 1 04:48:48 regress kernel: [10642.545023][ T4639] softirqs last disabled at (49626715): [<ffffffff8a1258a2>] irq_exit+0x112/0x120
>>> Sep 1 04:48:48 regress kernel: [10642.546536][T24161] RIP: 0010:assertfail.constprop.42+0x1c/0x1e
>>> Sep 1 04:48:49 regress kernel: [10642.551589][T24161] Code: 48 c7 c6 c0 90 03 8c e8 89 29 f1 ff 0f 0b 55 89 f1 48 c7 c2 40 82 03 8c 48 89 fe 48 c7 c7 60 83 03 8c 48 89 e5 e8 41 b0 90 ff <0f> 0b 0f 1f 44 00 00 55 48 89 e5 41 54 49 89 f4 53 48 89 fb 48 83
>>> Sep 1 04:48:49 regress kernel: [10642.554495][T24161] RSP: 0018:ffffc90002327150 EFLAGS: 00010282
>>> Sep 1 04:48:49 regress kernel: [10642.555371][T24161] RAX: 0000000000000034 RBX: 000004513701c000 RCX: ffffffff8a264242
>>> Sep 1 04:48:49 regress kernel: [10642.556515][T24161] RDX: 0000000000000000 RSI: 0000000000000008 RDI: ffff8881f580004c
>>> Sep 1 04:48:49 regress kernel: [10642.557680][T24161] RBP: ffffc90002327150 R08: ffffed103eb017e1 R09: ffffed103eb017e1
>>> Sep 1 04:48:49 regress kernel: [10642.558895][T24161] R10: ffffed103eb017e0 R11: ffff8881f580bf07 R12: ffff88800d1ea5c0
>>> Sep 1 04:48:49 regress kernel: [10642.560139][T24161] R13: ffffc900023273e0 R14: 0000000000000000 R15: ffff8880bbf238f0
>>> Sep 1 04:48:49 regress kernel: [10642.561391][T24161] FS: 00007f03f48488c0(0000) GS:ffff8881f5600000(0000) knlGS:0000000000000000
>>> Sep 1 04:48:49 regress kernel: [10642.562779][T24161] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> Sep 1 04:48:49 regress kernel: [10642.563801][T24161] CR2: 00007f1cab76f718 CR3: 0000000189a5e004 CR4: 00000000001606e0
>>> Sep 1 04:48:49 regress kernel: [10642.565046][T24161] Call Trace:
>>> Sep 1 04:48:49 regress kernel: [10642.565565][T24161] build_backref_tree+0x186b/0x2590
>>> Sep 1 04:48:49 regress kernel: [10642.566389][T24161] ? relocate_data_extent+0x1a0/0x1a0
>>> Sep 1 04:48:49 regress kernel: [10642.567295][T24161] ? lock_downgrade+0x3d0/0x3d0
>>> Sep 1 04:48:49 regress kernel: [10642.568142][T24161] ? match_held_lock+0x20/0x260
>>> Sep 1 04:48:49 regress kernel: [10642.568925][T24161] ? do_raw_spin_unlock+0xa8/0x140
>>> Sep 1 04:48:49 regress kernel: [10642.569765][T24161] ? _raw_spin_trylock_bh+0x60/0x80
>>> Sep 1 04:48:49 regress kernel: [10642.570605][T24161] ? release_extent_buffer+0x13b/0x230
>>> Sep 1 04:48:49 regress kernel: [10642.571480][T24161] ? free_extent_buffer.part.45+0xd7/0x140
>>> Sep 1 04:48:49 regress kernel: [10642.572406][T24161] relocate_tree_blocks+0x204/0xa50
>>> Sep 1 04:48:49 regress kernel: [10642.573244][T24161] ? build_backref_tree+0x2590/0x2590
>>> Sep 1 04:48:49 regress kernel: [10642.574103][T24161] ? rb_insert_color+0x3af/0x400
>>> Sep 1 04:48:49 regress kernel: [10642.574896][T24161] ? kmem_cache_alloc_trace+0x5af/0x740
>>> Sep 1 04:48:49 regress kernel: [10642.575785][T24161] ? tree_insert+0x90/0xb0
>>> Sep 1 04:48:49 regress kernel: [10642.576495][T24161] ? add_tree_block.isra.38+0x1d6/0x230
>>> Sep 1 04:48:49 regress kernel: [10642.577387][T24161] relocate_block_group+0x528/0x9d0
>>> Sep 1 04:48:49 regress kernel: [10642.578220][T24161] ? merge_reloc_roots+0x470/0x470
>>> Sep 1 04:48:49 regress kernel: [10642.579047][T24161] btrfs_relocate_block_group+0x26e/0x4c0
>>> Sep 1 04:48:49 regress kernel: [10642.579968][T24161] btrfs_relocate_chunk+0x52/0xf0
>>> Sep 1 04:48:49 regress kernel: [10642.580773][T24161] btrfs_balance+0xe5b/0x1800
>>> Sep 1 04:48:49 regress kernel: [10642.581542][T24161] ? btrfs_relocate_chunk+0xf0/0xf0
>>> Sep 1 04:48:49 regress kernel: [10642.582381][T24161] ? kmem_cache_alloc_trace+0x5af/0x740
>>> Sep 1 04:48:49 regress kernel: [10642.583270][T24161] ? _copy_from_user+0xaa/0xd0
>>> Sep 1 04:48:49 regress kernel: [10642.584022][T24161] btrfs_ioctl_balance+0x3de/0x4c0
>>> Sep 1 04:48:49 regress kernel: [10642.584819][T24161] btrfs_ioctl+0x3122/0x4470
>>> Sep 1 04:48:49 regress kernel: [10642.585540][T24161] ? __asan_loadN+0xf/0x20
>>> Sep 1 04:48:49 regress kernel: [10642.586229][T24161] ? __asan_loadN+0xf/0x20
>>> Sep 1 04:48:49 regress kernel: [10642.586920][T24161] ? btrfs_ioctl_get_supported_features+0x30/0x30
>>> Sep 1 04:48:49 regress kernel: [10642.587935][T24161] ? __asan_loadN+0xf/0x20
>>> Sep 1 04:48:49 regress kernel: [10642.588649][T24161] ? pvclock_clocksource_read+0xeb/0x190
>>> Sep 1 04:48:49 regress kernel: [10642.589566][T24161] ? __asan_loadN+0xf/0x20
>>> Sep 1 04:48:49 regress kernel: [10642.590254][T24161] ? pvclock_clocksource_read+0xeb/0x190
>>> Sep 1 04:48:49 regress kernel: [10642.591128][T24161] ? __kasan_check_read+0x11/0x20
>>> Sep 1 04:48:49 regress kernel: [10642.591913][T24161] ? check_chain_key+0x1e6/0x2e0
>>> Sep 1 04:48:49 regress kernel: [10642.592707][T24161] ? __asan_loadN+0xf/0x20
>>> Sep 1 04:48:49 regress kernel: [10642.593409][T24161] ? pvclock_clocksource_read+0xeb/0x190
>>> Sep 1 04:48:49 regress kernel: [10642.594312][T24161] ? kvm_sched_clock_read+0x18/0x30
>>> Sep 1 04:48:49 regress kernel: [10642.595139][T24161] ? check_chain_key+0x1e6/0x2e0
>>> Sep 1 04:48:49 regress kernel: [10642.595929][T24161] ? sched_clock_cpu+0x1b/0x120
>>> Sep 1 04:48:49 regress kernel: [10642.596712][T24161] do_vfs_ioctl+0x13e/0xad0
>>> Sep 1 04:48:49 regress kernel: [10642.597432][T24161] ? btrfs_ioctl_get_supported_features+0x30/0x30
>>> Sep 1 04:48:49 regress kernel: [10642.598455][T24161] ? do_vfs_ioctl+0x13e/0xad0
>>> Sep 1 04:48:49 regress kernel: [10642.599202][T24161] ? compat_ioctl_preallocate+0x170/0x170
>>> Sep 1 04:48:49 regress kernel: [10642.600128][T24161] ? __kasan_check_write+0x14/0x20
>>> Sep 1 04:48:49 regress kernel: [10642.600949][T24161] ? up_read+0x176/0x4f0
>>> Sep 1 04:48:49 regress kernel: [10642.601648][T24161] ? down_write_nested+0x2d0/0x2d0
>>> Sep 1 04:48:49 regress kernel: [10642.602476][T24161] ? handle_mm_fault+0x211/0x480
>>> Sep 1 04:48:49 regress kernel: [10642.603263][T24161] ? __kasan_check_read+0x11/0x20
>>> Sep 1 04:48:49 regress kernel: [10642.604062][T24161] ? __fget_light+0xb2/0x110
>>> Sep 1 04:48:49 regress kernel: [10642.604805][T24161] ksys_ioctl+0x67/0x90
>>> Sep 1 04:48:49 regress kernel: [10642.605471][T24161] __x64_sys_ioctl+0x43/0x50
>>> Sep 1 04:48:49 regress kernel: [10642.606203][T24161] do_syscall_64+0x77/0x2d0
>>> Sep 1 04:48:49 regress kernel: [10642.606933][T24161] entry_SYSCALL_64_after_hwframe+0x49/0xbe
>>> Sep 1 04:48:49 regress kernel: [10642.607875][T24161] RIP: 0033:0x7f03f493b427
>>> Sep 1 04:48:49 regress kernel: [10642.608588][T24161] Code: 00 00 90 48 8b 05 69 aa 0c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 39 aa 0c 00 f7 d8 64 89 01 48
>>> Sep 1 04:48:49 regress kernel: [10642.611697][T24161] RSP: 002b:00007ffd6bd78fb8 EFLAGS: 00000206 ORIG_RAX: 0000000000000010
>>> Sep 1 04:48:49 regress kernel: [10642.613035][T24161] RAX: ffffffffffffffda RBX: 00007ffd6bd79058 RCX: 00007f03f493b427
>>> Sep 1 04:48:49 regress kernel: [10642.614313][T24161] RDX: 00007ffd6bd79058 RSI: 00000000c4009420 RDI: 0000000000000003
>>> Sep 1 04:48:49 regress kernel: [10642.615605][T24161] RBP: 0000000000000003 R08: 0000000000000003 R09: 0000000000000078
>>> Sep 1 04:48:49 regress kernel: [10642.616877][T24161] R10: fffffffffffff5ab R11: 0000000000000206 R12: 0000000000000001
>>> Sep 1 04:48:49 regress kernel: [10642.618132][T24161] R13: 0000000000000000 R14: 00007ffd6bd7aa46 R15: 0000000000000001
>>> Sep 1 04:48:49 regress kernel: [10642.619378][T24161] Modules linked in:
>>> Sep 1 04:48:49 regress kernel: [10642.620153][T24161] ---[ end trace a33c17a7d43dd973 ]---
>>>
* Re: BUG at fs/btrfs/relocation.c:794! Still happening on misc-next and 5.8.3
2020-09-02 1:46 ` Qu Wenruo
@ 2020-09-04 15:54 ` Zygo Blaxell
0 siblings, 0 replies; 13+ messages in thread
From: Zygo Blaxell @ 2020-09-04 15:54 UTC (permalink / raw)
To: Qu Wenruo; +Cc: Nikolay Borisov, David Sterba, linux-btrfs, wqu
On Wed, Sep 02, 2020 at 09:46:29AM +0800, Qu Wenruo wrote:
> On 2020/9/2 8:14 AM, Zygo Blaxell wrote:
> > On Wed, Sep 02, 2020 at 07:33:21AM +0800, Qu Wenruo wrote:
> >> This looks like a race with reloc tree creation from some other
> >> part.
> >>
> >> If you could add debug output for create_reloc_root() and its callers,
> >> we may have a chance to debug it.
>
> What I mean is, I want to see who else created the reloc tree, not only
> who caused the EEXIST BUG_ON().
>
> That's why I hope to add enough debug pr_info or whatever for
> create_reloc_root(), so that we can catch the ordinary calls that seem
> safe but may be unsafe for other callers.

There doesn't appear to be a race with multiple instances of
create_reloc_root as nobody else seems to be calling it at the time
when it fails. On the other hand, it is a kworker thread, so it could
be racing with something else.

Sep 4 01:46:42 regress kernel: [12131.050264][ T5245] btrfs_search_slot ret = 0
Sep 4 01:46:51 regress kernel: [12140.058734][ T5245] btrfs_search_slot ret = 0
Sep 4 01:47:00 regress kernel: [12149.079892][ T5245] btrfs_search_slot ret = 0
Sep 4 01:47:09 regress kernel: [12158.091883][ T5245] btrfs_search_slot ret = 0
Sep 4 01:47:14 regress kernel: [12162.521167][ T2993] btrfs_search_slot ret = 0
Sep 4 01:47:14 regress kernel: [12162.823894][ T2993] btrfs_search_slot ret = 0
Sep 4 01:47:14 regress kernel: [12162.991624][ T2993] btrfs_search_slot ret = 0
Sep 4 01:47:14 regress kernel: [12162.999665][ T2993] btrfs_search_slot ret = 0
Sep 4 01:47:19 regress kernel: [12167.117620][ T5245] btrfs_search_slot ret = 0
Sep 4 01:47:28 regress kernel: [12176.232713][ T5245] btrfs_search_slot ret = 0
Sep 4 01:47:37 regress kernel: [12185.237905][ T5245] btrfs_search_slot ret = 0
Sep 4 01:47:50 regress kernel: [12199.005753][ T5245] btrfs_search_slot ret = 0
Sep 4 01:47:51 regress kernel: [12199.953977][T27716] BTRFS info (device dm-0): balance: start -dlimit=9
Sep 4 01:47:51 regress kernel: [12199.992918][T27716] BTRFS info (device dm-0): relocating block group 16502453436416 flags data
Sep 4 01:47:54 regress kernel: [12202.443621][T11829] root->root_key.objectid == 0, objectid = 10502
Sep 4 01:47:54 regress kernel: [12202.444916][T11829] CPU: 0 PID: 11829 Comm: kworker/u8:20 Tainted: G W 5.8.6-ce459d8ff170+ #8
Sep 4 01:47:54 regress kernel: [12202.446791][T11829] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
Sep 4 01:47:54 regress kernel: [12202.449187][T11829] Workqueue: btrfs-endio-write btrfs_work_helper
Sep 4 01:47:54 regress kernel: [12202.450355][T11829] Call Trace:
Sep 4 01:47:54 regress kernel: [12202.451580][T11829] dump_stack+0xc8/0x11a
Sep 4 01:47:54 regress kernel: [12202.452475][T11829] create_reloc_root.cold.42+0x60/0x4d9
Sep 4 01:47:54 regress kernel: [12202.453512][T11829] ? invalidate_extent_cache+0x2a0/0x2a0
Sep 4 01:47:54 regress kernel: [12202.454538][T11829] ? check_chain_key+0x1e6/0x2e0
Sep 4 01:47:54 regress kernel: [12202.455479][T11829] btrfs_init_reloc_root+0x2d7/0x310
Sep 4 01:47:54 regress kernel: [12202.456493][T11829] ? find_reloc_root+0x200/0x200
Sep 4 01:47:54 regress kernel: [12202.457510][T11829] ? do_raw_spin_unlock+0xa8/0x140
Sep 4 01:47:54 regress kernel: [12202.458446][T11829] record_root_in_trans+0x18c/0x1d0
Sep 4 01:47:54 regress kernel: [12202.459394][T11829] btrfs_record_root_in_trans+0x8b/0xc0
Sep 4 01:47:54 regress kernel: [12202.460673][T11829] start_transaction+0x16b/0x8f0
Sep 4 01:47:54 regress kernel: [12202.461595][T11829] btrfs_join_transaction+0x1d/0x20
Sep 4 01:47:54 regress kernel: [12202.462586][T11829] btrfs_finish_ordered_io+0x535/0xd10
Sep 4 01:47:54 regress kernel: [12202.463590][T11829] ? register_lock_class+0x900/0x900
Sep 4 01:47:54 regress kernel: [12202.464576][T11829] ? btrfs_update_inode_fallback+0x40/0x40
Sep 4 01:47:54 regress kernel: [12202.465584][T11829] ? rcu_read_lock_sched_held+0xa1/0xd0
Sep 4 01:47:54 regress kernel: [12202.466547][T11829] ? rcu_read_lock_bh_held+0xb0/0xb0
Sep 4 01:47:54 regress kernel: [12202.467463][T11829] ? lock_is_held_type+0xc9/0x100
Sep 4 01:47:54 regress kernel: [12202.468371][T11829] finish_ordered_fn+0x15/0x20
Sep 4 01:47:54 regress kernel: [12202.469224][T11829] btrfs_work_helper+0x118/0x920
Sep 4 01:47:54 regress kernel: [12202.470105][T11829] ? rcu_read_lock_bh_held+0xb0/0xb0
Sep 4 01:47:54 regress kernel: [12202.471082][T11829] ? trace_hardirqs_on+0x57/0x140
Sep 4 01:47:54 regress kernel: [12202.471998][T11829] process_one_work+0x507/0xa70
Sep 4 01:47:54 regress kernel: [12202.472885][T11829] ? pwq_dec_nr_in_flight+0x130/0x130
Sep 4 01:47:54 regress kernel: [12202.473816][T11829] ? do_raw_spin_lock+0x1e0/0x1e0
Sep 4 01:47:54 regress kernel: [12202.474716][T11829] worker_thread+0x63/0x5a0
Sep 4 01:47:54 regress kernel: [12202.475510][T11829] ? process_one_work+0xa70/0xa70
Sep 4 01:47:54 regress kernel: [12202.476428][T11829] kthread+0x20c/0x230
Sep 4 01:47:54 regress kernel: [12202.477137][T11829] ? kthread_create_worker_on_cpu+0xc0/0xc0
Sep 4 01:47:54 regress kernel: [12202.478152][T11829] ret_from_fork+0x22/0x30
Sep 4 01:47:54 regress kernel: [12202.480616][T11829] btrfs_search_slot ret = 0
Sep 4 01:47:54 regress kernel: [12202.482834][T11829] btrfs_insert_empty_item ret = -17
Sep 4 01:47:54 regress kernel: [12202.485447][T11829] btrfs_insert_root ret = -17
Sep 4 01:47:54 regress kernel: [12202.487775][T11829] ------------[ cut here ]------------
Sep 4 01:47:54 regress kernel: [12202.490086][T11829] kernel BUG at fs/btrfs/relocation.c:798!
Sep 4 01:47:54 regress kernel: [12202.491104][T11829] invalid opcode: 0000 [#1] SMP KASAN PTI
Sep 4 01:47:54 regress kernel: [12202.492056][T11829] CPU: 1 PID: 11829 Comm: kworker/u8:20 Tainted: G W 5.8.6-ce459d8ff170+ #8
Sep 4 01:47:54 regress kernel: [12202.493712][T11829] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
Sep 4 01:47:54 regress kernel: [12202.495311][T11829] Workqueue: btrfs-endio-write btrfs_work_helper
Sep 4 01:47:54 regress kernel: [12202.496424][T11829] RIP: 0010:create_reloc_root.cold.42+0x434/0x4d9
Sep 4 01:47:54 regress kernel: [12202.497550][T11829] Code: e8 6c d6 f3 ff 48 c7 c7 e0 98 03 8f 89 c6 89 85 30 ff ff ff e8 0d 53 8c ff 8b 95 30 ff ff ff 4c 8b 8d 28 ff ff ff 85 d2 74 02 <0f> 0b 4c 89 cf e8 fd 56 bc ff 4c 89 e7 e8 45 9c bc ff 49 8b 7f 20
Sep 4 01:47:54 regress kernel: [12202.501225][T11829] RSP: 0018:ffffc9000b80f920 EFLAGS: 00010282
Sep 4 01:47:54 regress kernel: [12202.503239][T11829] RAX: 000000000000001b RBX: 1ffff92001701f29 RCX: ffffffff8d273af2
Sep 4 01:47:54 regress kernel: [12202.504644][T11829] RDX: 00000000ffffffef RSI: 0000000000000008 RDI: ffff8881f59ff28c
Sep 4 01:47:54 regress kernel: [12202.507025][T11829] RBP: ffffc9000b80fa10 R08: ffffed103eb41645 R09: ffff8880a598b400
Sep 4 01:47:54 regress kernel: [12202.509429][T11829] R10: ffff8881f5a0b227 R11: ffffed103eb41644 R12: ffff8881c93e8020
Sep 4 01:47:54 regress kernel: [12202.510781][T11829] R13: ffff8881cbefd2a0 R14: ffffc9000b80f9a8 R15: ffff8881c93e8000
Sep 4 01:47:54 regress kernel: [12202.512142][T11829] FS: 0000000000000000(0000) GS:ffff8881f5800000(0000) knlGS:0000000000000000
Sep 4 01:47:54 regress kernel: [12202.513651][T11829] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Sep 4 01:47:54 regress kernel: [12202.514790][T11829] CR2: 00007fb4c11f0a68 CR3: 00000001dc604005 CR4: 00000000001606e0
Sep 4 01:47:54 regress kernel: [12202.516258][T11829] Call Trace:
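
To decode the numbers in the log above: the -17 returned by
btrfs_insert_empty_item and btrfs_insert_root is -EEXIST, and the same
value shows up as RDX: 00000000ffffffef in the register dump once the
low 32 bits are read as a signed int. Below is a minimal user-space
sketch of that conversion. reg_to_int and errno_name are hypothetical
helper names (not kernel functions), and the sketch assumes the register
holds an int return value.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>

/*
 * Reinterpret the low 32 bits of a dumped register as a signed int.
 * Assumption: the register carries an int return value, as RDX
 * appears to in the oops above.
 */
static int reg_to_int(uint64_t reg)
{
	uint32_t low = (uint32_t)reg;
	int32_t val;

	memcpy(&val, &low, sizeof(val));	/* avoid implementation-defined casts */
	return (int)val;
}

/* Name the one errno seen in this thread; anything else is "unknown". */
static const char *errno_name(int ret)
{
	assert(ret <= 0);			/* kernel-style negative errno or 0 */
	return (-ret == EEXIST) ? "EEXIST" : "unknown";
}
```

Calling errno_name(reg_to_int(0xffffffefULL)) on Linux names the same
error that the added printk lines report.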

For reference, here's my kernel logging so far:

diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index 82ab6e5a386d..b98b74397afc 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -4748,10 +4748,14 @@ int btrfs_insert_empty_items(struct btrfs_trans_handle *trans,
 
 	total_size = total_data + (nr * sizeof(struct btrfs_item));
 	ret = btrfs_search_slot(trans, root, cpu_key, path, total_size, 1);
-	if (ret == 0)
+	if (ret == 0) {
+		printk(KERN_ERR "btrfs_search_slot ret = %d\n", ret);
 		return -EEXIST;
-	if (ret < 0)
+	}
+	if (ret < 0) {
+		printk(KERN_ERR "btrfs_search_slot ret = %d\n", ret);
 		return ret;
+	}
 
 	slot = path->slots[0];
 	BUG_ON(slot < 0);
@@ -4775,14 +4779,18 @@ int btrfs_insert_item(struct btrfs_trans_handle *trans, struct btrfs_root *root,
 	unsigned long ptr;
 
 	path = btrfs_alloc_path();
-	if (!path)
+	if (!path) {
+		printk(KERN_ERR "btrfs_alloc_path ENOMEM\n");
 		return -ENOMEM;
+	}
 	ret = btrfs_insert_empty_item(trans, root, path, cpu_key, data_size);
 	if (!ret) {
 		leaf = path->nodes[0];
 		ptr = btrfs_item_ptr_offset(leaf, path->slots[0]);
 		write_extent_buffer(leaf, data, ptr, data_size);
 		btrfs_mark_buffer_dirty(leaf);
+	} else {
+		printk(KERN_ERR "btrfs_insert_empty_item ret = %d\n", ret);
 	}
 	btrfs_free_path(path);
 	return ret;
diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 350050b288e4..23fffd4bfd41 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -744,6 +744,9 @@ static struct btrfs_root *create_reloc_root(struct btrfs_trans_handle *trans,
 	root_key.type = BTRFS_ROOT_ITEM_KEY;
 	root_key.offset = objectid;
 
+	printk(KERN_ERR "root->root_key.objectid == %zu, objectid = %zu\n", ret, root->root_key.objectid, objectid);
+	dump_stack();
+
 	if (root->root_key.objectid == objectid) {
 		u64 commit_root_gen;
 
@@ -791,6 +794,7 @@ static struct btrfs_root *create_reloc_root(struct btrfs_trans_handle *trans,
 
 	ret = btrfs_insert_root(trans, fs_info->tree_root,
 				&root_key, root_item);
+	printk(KERN_ERR "btrfs_insert_root ret = %d\n", ret);
 	BUG_ON(ret);
 	kfree(root_item);
 
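
The contract that the logging above exercises can be modelled in user
space: the search returns 0 when the key is already present, and the
insert path turns that into -EEXIST, which is what BUG_ON(ret) then
trips over. The following is only a toy sketch under that assumption.
struct toy_tree, toy_search_slot and toy_insert are hypothetical names,
a sorted array stands in for the btree, and none of this is the btrfs
implementation.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_KEYS 16

/* Hypothetical stand-in for a tree: a small sorted array of keys. */
struct toy_tree {
	uint64_t keys[TOY_MAX_KEYS];
	size_t nr;
};

/*
 * Returns 0 and sets *slot when key is present; returns 1 and sets
 * *slot to the insertion position when it is absent.
 */
static int toy_search_slot(const struct toy_tree *t, uint64_t key, size_t *slot)
{
	size_t i;

	for (i = 0; i < t->nr; i++) {
		if (t->keys[i] == key) {
			*slot = i;
			return 0;
		}
		if (t->keys[i] > key)
			break;
	}
	*slot = i;
	return 1;
}

/* Insert key, refusing duplicates with -EEXIST. */
static int toy_insert(struct toy_tree *t, uint64_t key)
{
	size_t slot, i;

	assert(t != NULL);
	if (toy_search_slot(t, key, &slot) == 0)
		return -EEXIST;		/* key already in the tree */
	if (t->nr == TOY_MAX_KEYS)
		return -ENOSPC;
	for (i = t->nr; i > slot; i--)	/* shift the tail right by one */
		t->keys[i] = t->keys[i - 1];
	t->keys[slot] = key;
	t->nr++;
	return 0;
}
```

In this model, inserting key 10502 twice succeeds the first time and
returns -EEXIST the second time, without changing the tree.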
> >>> The 5.4 result is interesting--I've been running 5.4 for some time and
> >>> not hit this before. So there are 3 possible theories:
> >>>
> >>> 1. It's because of sending signals to balance. That has been
> >>> added to my test workload after 5.7 was released, so earlier
> >>> tests on 5.4 would not have triggered it.

This might be related. I removed 'kill the balance process' from my
test workload, and didn't have any BUG_ONs for two days. When I put
the kill-the-balance-process test back in the workload, it went back
to BUG_ON at fairly reliable 1-5 hour intervals. Of course that's just
correlation, and with random events at that, but so far the data supports
theory 1 and refutes theory 3.

> >>> 2. There's a regression in 5.4-stable, which I've cherry-picked
> >>> to all the other kernels during my test setup. (On the other
> >>> hand, if I don't backport some fixes, kernels 5.5..5.7 crash
> >>> before they get to this bug.)
> >>>
> >>> 3. There's something rotten in my test filesystem, and the
> >>> BUG will go away for a while if I do a mkfs. Qu asked for
> >>> a dump earlier in this thread, and I provided one.
> >>>
> >>> All three of these theories are testable to some extent, so I'll have
> >>> my test VM run some variations.
> >>>
> >>> The full test workload is:
> >>>
> >>> balance metadata or data at random intervals
> >>>
> >>> scrub, scrub cancel at random intervals
> >>>
> >>> 20x rsync updating files
> >>>
> >>> snapshot create, delete at random intervals
> >>>
> >>> bees dedupe (lots of EXTENT_SAME and LOGICAL_INO calls)
> >>>
> >>> balance cancel at random intervals
> >>>
> >>> kill -9 $(pidof btrfs balance) at random intervals (NEW - added
> >>> when 5.7 came out)
> >>>
> >>> This is the 5.5 root assertion failure:
> >>>
> >>> Sep 1 04:48:48 regress kernel: [10642.537825][T24161] assertion failed: root, in fs/btrfs/relocation.c:837
> >>> Sep 1 04:48:48 regress kernel: [10642.538911][T24161] ------------[ cut here ]------------
> >>> Sep 1 04:48:48 regress kernel: [10642.539704][T24161] kernel BUG at fs/btrfs/ctree.h:3125!
> >>> Sep 1 04:48:48 regress kernel: [10642.540621][T24161] invalid opcode: 0000 [#1] SMP KASAN PTI
> >>> Sep 1 04:48:48 regress kernel: [10642.540624][ T4639] irq event stamp: 49626809
> >>> Sep 1 04:48:48 regress kernel: [10642.540632][ T4639] hardirqs last enabled at (49626809): [<ffffffff8a00481a>] trace_hardirqs_on_thunk+0x1a/0x1c
> >>> Sep 1 04:48:48 regress kernel: [10642.541451][T24161] CPU: 1 PID: 24161 Comm: btrfs Tainted: G W 5.5.19-76348822ab91+ #14
> >>> Sep 1 04:48:48 regress kernel: [10642.542114][ T4639] hardirqs last disabled at (49626808): [<ffffffff8a004836>] trace_hardirqs_off_thunk+0x1a/0x1c
> >>> Sep 1 04:48:48 regress kernel: [10642.543693][T24161] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
> >>> Sep 1 04:48:48 regress kernel: [10642.545017][ T4639] softirqs last enabled at (49626726): [<ffffffff8bc0048b>] __do_softirq+0x48b/0x5be
> >>> Sep 1 04:48:48 regress kernel: [10642.545023][ T4639] softirqs last disabled at (49626715): [<ffffffff8a1258a2>] irq_exit+0x112/0x120
> >>> Sep 1 04:48:48 regress kernel: [10642.546536][T24161] RIP: 0010:assertfail.constprop.42+0x1c/0x1e
> >>> Sep 1 04:48:49 regress kernel: [10642.551589][T24161] Code: 48 c7 c6 c0 90 03 8c e8 89 29 f1 ff 0f 0b 55 89 f1 48 c7 c2 40 82 03 8c 48 89 fe 48 c7 c7 60 83 03 8c 48 89 e5 e8 41 b0 90 ff <0f> 0b 0f 1f 44 00 00 55 48 89 e5 41 54 49 89 f4 53 48 89 fb 48 83
> >>> Sep 1 04:48:49 regress kernel: [10642.554495][T24161] RSP: 0018:ffffc90002327150 EFLAGS: 00010282
> >>> Sep 1 04:48:49 regress kernel: [10642.555371][T24161] RAX: 0000000000000034 RBX: 000004513701c000 RCX: ffffffff8a264242
> >>> Sep 1 04:48:49 regress kernel: [10642.556515][T24161] RDX: 0000000000000000 RSI: 0000000000000008 RDI: ffff8881f580004c
> >>> Sep 1 04:48:49 regress kernel: [10642.557680][T24161] RBP: ffffc90002327150 R08: ffffed103eb017e1 R09: ffffed103eb017e1
> >>> Sep 1 04:48:49 regress kernel: [10642.558895][T24161] R10: ffffed103eb017e0 R11: ffff8881f580bf07 R12: ffff88800d1ea5c0
> >>> Sep 1 04:48:49 regress kernel: [10642.560139][T24161] R13: ffffc900023273e0 R14: 0000000000000000 R15: ffff8880bbf238f0
> >>> Sep 1 04:48:49 regress kernel: [10642.561391][T24161] FS: 00007f03f48488c0(0000) GS:ffff8881f5600000(0000) knlGS:0000000000000000
> >>> Sep 1 04:48:49 regress kernel: [10642.562779][T24161] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >>> Sep 1 04:48:49 regress kernel: [10642.563801][T24161] CR2: 00007f1cab76f718 CR3: 0000000189a5e004 CR4: 00000000001606e0
> >>> Sep 1 04:48:49 regress kernel: [10642.565046][T24161] Call Trace:
> >>> Sep 1 04:48:49 regress kernel: [10642.565565][T24161] build_backref_tree+0x186b/0x2590
> >>> Sep 1 04:48:49 regress kernel: [10642.566389][T24161] ? relocate_data_extent+0x1a0/0x1a0
> >>> Sep 1 04:48:49 regress kernel: [10642.567295][T24161] ? lock_downgrade+0x3d0/0x3d0
> >>> Sep 1 04:48:49 regress kernel: [10642.568142][T24161] ? match_held_lock+0x20/0x260
> >>> Sep 1 04:48:49 regress kernel: [10642.568925][T24161] ? do_raw_spin_unlock+0xa8/0x140
> >>> Sep 1 04:48:49 regress kernel: [10642.569765][T24161] ? _raw_spin_trylock_bh+0x60/0x80
> >>> Sep 1 04:48:49 regress kernel: [10642.570605][T24161] ? release_extent_buffer+0x13b/0x230
> >>> Sep 1 04:48:49 regress kernel: [10642.571480][T24161] ? free_extent_buffer.part.45+0xd7/0x140
> >>> Sep 1 04:48:49 regress kernel: [10642.572406][T24161] relocate_tree_blocks+0x204/0xa50
> >>> Sep 1 04:48:49 regress kernel: [10642.573244][T24161] ? build_backref_tree+0x2590/0x2590
> >>> Sep 1 04:48:49 regress kernel: [10642.574103][T24161] ? rb_insert_color+0x3af/0x400
> >>> Sep 1 04:48:49 regress kernel: [10642.574896][T24161] ? kmem_cache_alloc_trace+0x5af/0x740
> >>> Sep 1 04:48:49 regress kernel: [10642.575785][T24161] ? tree_insert+0x90/0xb0
> >>> Sep 1 04:48:49 regress kernel: [10642.576495][T24161] ? add_tree_block.isra.38+0x1d6/0x230
> >>> Sep 1 04:48:49 regress kernel: [10642.577387][T24161] relocate_block_group+0x528/0x9d0
> >>> Sep 1 04:48:49 regress kernel: [10642.578220][T24161] ? merge_reloc_roots+0x470/0x470
> >>> Sep 1 04:48:49 regress kernel: [10642.579047][T24161] btrfs_relocate_block_group+0x26e/0x4c0
> >>> Sep 1 04:48:49 regress kernel: [10642.579968][T24161] btrfs_relocate_chunk+0x52/0xf0
> >>> Sep 1 04:48:49 regress kernel: [10642.580773][T24161] btrfs_balance+0xe5b/0x1800
> >>> Sep 1 04:48:49 regress kernel: [10642.581542][T24161] ? btrfs_relocate_chunk+0xf0/0xf0
> >>> Sep 1 04:48:49 regress kernel: [10642.582381][T24161] ? kmem_cache_alloc_trace+0x5af/0x740
> >>> Sep 1 04:48:49 regress kernel: [10642.583270][T24161] ? _copy_from_user+0xaa/0xd0
> >>> Sep 1 04:48:49 regress kernel: [10642.584022][T24161] btrfs_ioctl_balance+0x3de/0x4c0
> >>> Sep 1 04:48:49 regress kernel: [10642.584819][T24161] btrfs_ioctl+0x3122/0x4470
> >>> Sep 1 04:48:49 regress kernel: [10642.585540][T24161] ? __asan_loadN+0xf/0x20
> >>> Sep 1 04:48:49 regress kernel: [10642.586229][T24161] ? __asan_loadN+0xf/0x20
> >>> Sep 1 04:48:49 regress kernel: [10642.586920][T24161] ? btrfs_ioctl_get_supported_features+0x30/0x30
> >>> Sep 1 04:48:49 regress kernel: [10642.587935][T24161] ? __asan_loadN+0xf/0x20
> >>> Sep 1 04:48:49 regress kernel: [10642.588649][T24161] ? pvclock_clocksource_read+0xeb/0x190
> >>> Sep 1 04:48:49 regress kernel: [10642.589566][T24161] ? __asan_loadN+0xf/0x20
> >>> Sep 1 04:48:49 regress kernel: [10642.590254][T24161] ? pvclock_clocksource_read+0xeb/0x190
> >>> Sep 1 04:48:49 regress kernel: [10642.591128][T24161] ? __kasan_check_read+0x11/0x20
> >>> Sep 1 04:48:49 regress kernel: [10642.591913][T24161] ? check_chain_key+0x1e6/0x2e0
> >>> Sep 1 04:48:49 regress kernel: [10642.592707][T24161] ? __asan_loadN+0xf/0x20
> >>> Sep 1 04:48:49 regress kernel: [10642.593409][T24161] ? pvclock_clocksource_read+0xeb/0x190
> >>> Sep 1 04:48:49 regress kernel: [10642.594312][T24161] ? kvm_sched_clock_read+0x18/0x30
> >>> Sep 1 04:48:49 regress kernel: [10642.595139][T24161] ? check_chain_key+0x1e6/0x2e0
> >>> Sep 1 04:48:49 regress kernel: [10642.595929][T24161] ? sched_clock_cpu+0x1b/0x120
> >>> Sep 1 04:48:49 regress kernel: [10642.596712][T24161] do_vfs_ioctl+0x13e/0xad0
> >>> Sep 1 04:48:49 regress kernel: [10642.597432][T24161] ? btrfs_ioctl_get_supported_features+0x30/0x30
> >>> Sep 1 04:48:49 regress kernel: [10642.598455][T24161] ? do_vfs_ioctl+0x13e/0xad0
> >>> Sep 1 04:48:49 regress kernel: [10642.599202][T24161] ? compat_ioctl_preallocate+0x170/0x170
> >>> Sep 1 04:48:49 regress kernel: [10642.600128][T24161] ? __kasan_check_write+0x14/0x20
> >>> Sep 1 04:48:49 regress kernel: [10642.600949][T24161] ? up_read+0x176/0x4f0
> >>> Sep 1 04:48:49 regress kernel: [10642.601648][T24161] ? down_write_nested+0x2d0/0x2d0
> >>> Sep 1 04:48:49 regress kernel: [10642.602476][T24161] ? handle_mm_fault+0x211/0x480
> >>> Sep 1 04:48:49 regress kernel: [10642.603263][T24161] ? __kasan_check_read+0x11/0x20
> >>> Sep 1 04:48:49 regress kernel: [10642.604062][T24161] ? __fget_light+0xb2/0x110
> >>> Sep 1 04:48:49 regress kernel: [10642.604805][T24161] ksys_ioctl+0x67/0x90
> >>> Sep 1 04:48:49 regress kernel: [10642.605471][T24161] __x64_sys_ioctl+0x43/0x50
> >>> Sep 1 04:48:49 regress kernel: [10642.606203][T24161] do_syscall_64+0x77/0x2d0
> >>> Sep 1 04:48:49 regress kernel: [10642.606933][T24161] entry_SYSCALL_64_after_hwframe+0x49/0xbe
> >>> Sep 1 04:48:49 regress kernel: [10642.607875][T24161] RIP: 0033:0x7f03f493b427
> >>> Sep 1 04:48:49 regress kernel: [10642.608588][T24161] Code: 00 00 90 48 8b 05 69 aa 0c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 39 aa 0c 00 f7 d8 64 89 01 48
> >>> Sep 1 04:48:49 regress kernel: [10642.611697][T24161] RSP: 002b:00007ffd6bd78fb8 EFLAGS: 00000206 ORIG_RAX: 0000000000000010
> >>> Sep 1 04:48:49 regress kernel: [10642.613035][T24161] RAX: ffffffffffffffda RBX: 00007ffd6bd79058 RCX: 00007f03f493b427
> >>> Sep 1 04:48:49 regress kernel: [10642.614313][T24161] RDX: 00007ffd6bd79058 RSI: 00000000c4009420 RDI: 0000000000000003
> >>> Sep 1 04:48:49 regress kernel: [10642.615605][T24161] RBP: 0000000000000003 R08: 0000000000000003 R09: 0000000000000078
> >>> Sep 1 04:48:49 regress kernel: [10642.616877][T24161] R10: fffffffffffff5ab R11: 0000000000000206 R12: 0000000000000001
> >>> Sep 1 04:48:49 regress kernel: [10642.618132][T24161] R13: 0000000000000000 R14: 00007ffd6bd7aa46 R15: 0000000000000001
> >>> Sep 1 04:48:49 regress kernel: [10642.619378][T24161] Modules linked in:
> >>> Sep 1 04:48:49 regress kernel: [10642.620153][T24161] ---[ end trace a33c17a7d43dd973 ]---
> >>>
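
As a side note on reading the user-space registers in the quoted trace:
ORIG_RAX 0x10 is the ioctl syscall on x86_64, so RSI (00000000c4009420)
is the ioctl request number. The sketch below splits it into its fields,
assuming the usual asm-generic layout (nr in bits 0-7, type in bits
8-15, size in bits 16-29, direction in bits 30-31). struct ioc_fields
and decode_ioctl are hypothetical names used only for illustration.

```c
#include <stdint.h>

/* Fields packed into a Linux ioctl request number (asm-generic layout). */
struct ioc_fields {
	unsigned dir;	/* 1 = write, 2 = read, 3 = read and write */
	unsigned size;	/* size of the argument struct in bytes */
	unsigned type;	/* driver "magic" byte */
	unsigned nr;	/* command number within that driver */
};

/* Split a request number into its fields using shifts and masks. */
static struct ioc_fields decode_ioctl(uint32_t cmd)
{
	struct ioc_fields f;

	f.nr   = cmd & 0xff;
	f.type = (cmd >> 8) & 0xff;
	f.size = (cmd >> 16) & 0x3fff;
	f.dir  = (cmd >> 30) & 0x3;
	return f;
}
```

For 0xc4009420 this gives type 0x94, nr 32, size 1024 and a read/write
direction, which is consistent with the btrfs_ioctl_balance frame in the
call trace.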
Thread overview: 13+ messages
2020-06-30 22:10 BUG at fs/btrfs/relocation.c:794! David Sterba
2020-07-23 21:56 ` Zygo Blaxell
2020-07-24 0:19 ` Qu Wenruo
2020-08-04 16:16 ` Zygo Blaxell
2020-08-28 0:03 ` BUG at fs/btrfs/relocation.c:794! Still happening on misc-next and 5.8.3 Zygo Blaxell
2020-08-28 0:08 ` Zygo Blaxell
2020-08-28 6:34 ` Nikolay Borisov
2020-08-28 20:42 ` Zygo Blaxell
2020-09-01 22:53 ` Zygo Blaxell
2020-09-01 23:33 ` Qu Wenruo
2020-09-02 0:14 ` Zygo Blaxell
2020-09-02 1:46 ` Qu Wenruo
2020-09-04 15:54 ` Zygo Blaxell