* BUG at fs/btrfs/relocation.c:794! @ 2020-06-30 22:10 David Sterba 2020-07-23 21:56 ` Zygo Blaxell 0 siblings, 1 reply; 13+ messages in thread From: David Sterba @ 2020-06-30 22:10 UTC (permalink / raw) To: linux-btrfs; +Cc: wqu Hi, I've hit a crash in relocation I've never seen before. [ 2129.210066] kernel BUG at fs/btrfs/relocation.c:794! [ 2129.215268] invalid opcode: 0000 [#1] PREEMPT SMP [ 2129.220114] CPU: 1 PID: 3303 Comm: btrfs Not tainted 5.8.0-rc3-git+ #638 [ 2129.220116] Hardware name: empty empty/S3993, BIOS PAQEX0-3 02/24/2008 [ 2129.220265] RIP: 0010:create_reloc_root+0x214/0x260 [btrfs] [ 2129.258760] RSP: 0018:ffffbe1e809b38b8 EFLAGS: 00010282 [ 2129.258763] RAX: 00000000ffffffef RBX: ffff988d577f9000 RCX: 0000000000000000 [ 2129.258765] RDX: 0000000000000001 RSI: ffffffff8e2a2580 RDI: ffff988d64aaa6a8 [ 2129.258766] RBP: ffff988d5dfcdc00 R08: 0000000000000000 R09: 0000000000000000 [ 2129.258767] R10: 0000000000000001 R11: 0000000000000000 R12: ffff988d0e02fa78 [ 2129.258769] R13: 0000000000000005 R14: ffff988d64fe8000 R15: ffff988d0e02fa78 [ 2129.258771] FS: 00007f82a612e8c0(0000) GS:ffff988d67000000(0000) knlGS:0000000000000000 [ 2129.258772] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 2129.258774] CR2: 000000000559d028 CR3: 000000020b289000 CR4: 00000000000006e0 [ 2129.258775] Call Trace: [ 2129.258825] btrfs_init_reloc_root+0xe8/0x120 [btrfs] [ 2129.258862] record_root_in_trans+0xae/0xd0 [btrfs] [ 2129.258901] btrfs_record_root_in_trans+0x51/0x70 [btrfs] [ 2129.340388] select_reloc_root+0x94/0x340 [btrfs] [ 2129.340433] do_relocation+0xda/0x7b0 [btrfs] [ 2129.349854] ? _raw_spin_unlock+0x1f/0x40 [ 2129.349898] relocate_tree_blocks+0x336/0x670 [btrfs] [ 2129.359325] relocate_block_group+0x2f6/0x600 [btrfs] [ 2129.359365] btrfs_relocate_block_group+0x15e/0x340 [btrfs] [ 2129.359408] btrfs_relocate_chunk+0x38/0x110 [btrfs] [ 2129.375494] __btrfs_balance+0x42c/0xce0 [btrfs] [ 2129.375553] btrfs_balance+0x66a/0xbe0 [btrfs] [ 2129.375562] ? 
kmem_cache_alloc_trace+0x19c/0x330
[ 2129.389852] btrfs_ioctl_balance+0x298/0x350 [btrfs]
[ 2129.389887] btrfs_ioctl+0x304/0x2490 [btrfs]
[ 2129.389898] ? do_user_addr_fault+0x221/0x49c
[ 2129.404070] ? sched_clock_cpu+0x15/0x140
[ 2129.404073] ? do_user_addr_fault+0x221/0x49c
[ 2129.404079] ? up_read+0x18/0x240
[ 2129.404086] ? ksys_ioctl+0x68/0xa0
[ 2129.404091] ksys_ioctl+0x68/0xa0
[ 2129.423308] __x64_sys_ioctl+0x16/0x20
[ 2129.423312] do_syscall_64+0x50/0xe0
[ 2129.423315] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 2129.423318] RIP: 0033:0x7f82a51c6327
[ 2129.423319] Code: Bad RIP value.
[ 2129.423348] RSP: 002b:00007ffd32cf6218 EFLAGS: 00000206 ORIG_RAX: 0000000000000010
[ 2129.423367] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f82a51c6327
[ 2129.423368] RDX: 00007ffd32cf62a0 RSI: 00000000c4009420 RDI: 0000000000000003
[ 2129.423372] RBP: 0000000000000003 R08: 0000000000000000 R09: 0000000000000000
[ 2129.423377] R10: 000000000fa99fa0 R11: 0000000000000206 R12: 00007ffd32cf8823
[ 2129.423379] R13: 00007ffd32cf62a0 R14: 0000000000000001 R15: 0000000000000000

Relevant code called from create_reloc_root:

	ret = btrfs_insert_root(trans, fs_info->tree_root,
				&root_key, root_item);
	BUG_ON(ret);

and according to EAX (0xffffffef), ret is -17, which is EEXIST.

I don't have a reproducer. The testing image was filled by random git
checkouts, deduplicated by BEES, then tons of snapshots were created until the
metadata got exhausted, followed by some file deletion and balances.

This is the same image that led to the patch "btrfs: allow use of global block
reserve for balance item deletion", so this could have left it in some
intermediate state where neither the balance item nor the reloc tree was
removed.

There were a few unsuccessful mounts due to relocation recovery, which I was
trying to debug, but then it started to work.
The error happened with this 'fi df' saved after the balance start:

  # btrfs fi df mnt
  Data, single: total=80.01GiB, used=38.67GiB
  System, single: total=4.00MiB, used=16.00KiB
  Metadata, single: total=19.99GiB, used=19.46GiB
  GlobalReserve, single: total=512.00MiB, used=44.00KiB

The error looks like a repeated relocation tree creation, which would point to
the unsuccessful balances or an inconsistent state (balance item, reloc trees).
It's not a "typical" mix of operations, but I'd appreciate any insights here.

^ permalink raw reply [flat|nested] 13+ messages in thread
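The step from the register dump to "ret is -17 which is EEXIST" can be checked mechanically. The following is a minimal sketch, not part of the original report: it assumes the return value sits in the low 32 bits of RAX (that is, EAX), and the helper name decode_kernel_ret is made up for illustration.

```python
import errno


def decode_kernel_ret(reg_value: int, bits: int = 32) -> int:
    """Interpret the low `bits` bits of a register value as a signed integer.

    Hypothetical helper for illustration; it only does two's-complement
    conversion and knows nothing about the kernel itself.
    """
    mask = (1 << bits) - 1
    value = reg_value & mask
    if value >= 1 << (bits - 1):
        value -= 1 << bits
    return value


# RAX as printed in the oops above: 00000000ffffffef
rax = 0x00000000FFFFFFEF
ret = decode_kernel_ret(rax)

# Kernel functions return negative errno values, so look up -ret.
print(ret, errno.errorcode[-ret])
```

On Linux, errno 17 is EEXIST, which matches the reading given in the report.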
* Re: BUG at fs/btrfs/relocation.c:794! 2020-06-30 22:10 BUG at fs/btrfs/relocation.c:794! David Sterba @ 2020-07-23 21:56 ` Zygo Blaxell 2020-07-24 0:19 ` Qu Wenruo 0 siblings, 1 reply; 13+ messages in thread From: Zygo Blaxell @ 2020-07-23 21:56 UTC (permalink / raw) To: David Sterba; +Cc: linux-btrfs, wqu On Wed, Jul 01, 2020 at 12:10:06AM +0200, David Sterba wrote: > Hi, > > I've hit a crash in relocation I've never seen before. > > [ 2129.210066] kernel BUG at fs/btrfs/relocation.c:794! I hit an issue yesterday that reminded me of this. > [ 2129.215268] invalid opcode: 0000 [#1] PREEMPT SMP > [ 2129.220114] CPU: 1 PID: 3303 Comm: btrfs Not tainted 5.8.0-rc3-git+ #638 > [ 2129.220116] Hardware name: empty empty/S3993, BIOS PAQEX0-3 02/24/2008 > [ 2129.220265] RIP: 0010:create_reloc_root+0x214/0x260 [btrfs] > [ 2129.258760] RSP: 0018:ffffbe1e809b38b8 EFLAGS: 00010282 > [ 2129.258763] RAX: 00000000ffffffef RBX: ffff988d577f9000 RCX: 0000000000000000 > [ 2129.258765] RDX: 0000000000000001 RSI: ffffffff8e2a2580 RDI: ffff988d64aaa6a8 > [ 2129.258766] RBP: ffff988d5dfcdc00 R08: 0000000000000000 R09: 0000000000000000 > [ 2129.258767] R10: 0000000000000001 R11: 0000000000000000 R12: ffff988d0e02fa78 > [ 2129.258769] R13: 0000000000000005 R14: ffff988d64fe8000 R15: ffff988d0e02fa78 > [ 2129.258771] FS: 00007f82a612e8c0(0000) GS:ffff988d67000000(0000) knlGS:0000000000000000 > [ 2129.258772] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > [ 2129.258774] CR2: 000000000559d028 CR3: 000000020b289000 CR4: 00000000000006e0 > [ 2129.258775] Call Trace: > [ 2129.258825] btrfs_init_reloc_root+0xe8/0x120 [btrfs] > [ 2129.258862] record_root_in_trans+0xae/0xd0 [btrfs] > [ 2129.258901] btrfs_record_root_in_trans+0x51/0x70 [btrfs] > [ 2129.340388] select_reloc_root+0x94/0x340 [btrfs] > [ 2129.340433] do_relocation+0xda/0x7b0 [btrfs] > [ 2129.349854] ? 
_raw_spin_unlock+0x1f/0x40 > [ 2129.349898] relocate_tree_blocks+0x336/0x670 [btrfs] > [ 2129.359325] relocate_block_group+0x2f6/0x600 [btrfs] > [ 2129.359365] btrfs_relocate_block_group+0x15e/0x340 [btrfs] > [ 2129.359408] btrfs_relocate_chunk+0x38/0x110 [btrfs] > [ 2129.375494] __btrfs_balance+0x42c/0xce0 [btrfs] > [ 2129.375553] btrfs_balance+0x66a/0xbe0 [btrfs] > [ 2129.375562] ? kmem_cache_alloc_trace+0x19c/0x330 > [ 2129.389852] btrfs_ioctl_balance+0x298/0x350 [btrfs] > [ 2129.389887] btrfs_ioctl+0x304/0x2490 [btrfs] > [ 2129.389898] ? do_user_addr_fault+0x221/0x49c > [ 2129.404070] ? sched_clock_cpu+0x15/0x140 > [ 2129.404073] ? do_user_addr_fault+0x221/0x49c > [ 2129.404079] ? up_read+0x18/0x240 > [ 2129.404086] ? ksys_ioctl+0x68/0xa0 > [ 2129.404091] ksys_ioctl+0x68/0xa0 > [ 2129.423308] __x64_sys_ioctl+0x16/0x20 > [ 2129.423312] do_syscall_64+0x50/0xe0 > [ 2129.423315] entry_SYSCALL_64_after_hwframe+0x44/0xa9 > [ 2129.423318] RIP: 0033:0x7f82a51c6327 > [ 2129.423319] Code: Bad RIP value. > [ 2129.423348] RSP: 002b:00007ffd32cf6218 EFLAGS: 00000206 ORIG_RAX: 0000000000000010 > [ 2129.423367] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f82a51c6327 > [ 2129.423368] RDX: 00007ffd32cf62a0 RSI: 00000000c4009420 RDI: 0000000000000003 > [ 2129.423372] RBP: 0000000000000003 R08: 0000000000000000 R09: 0000000000000000 > [ 2129.423377] R10: 000000000fa99fa0 R11: 0000000000000206 R12: 00007ffd32cf8823 > [ 2129.423379] R13: 00007ffd32cf62a0 R14: 0000000000000001 R15: 0000000000000000 > > Relevant code called from create_reloc_root: > > ret = btrfs_insert_root(trans, fs_info->tree_root, > &root_key, root_item); > BUG_ON(ret) > > and according to EAX, ret is -17 which is EEXIST. > > I don't have a reproducer, the testing image has been filled by random git > checkouts, deduplicated by BEES, then tons of snapshots created until the > metadata got exhausted, some file deletion and balances. Mine is rsync, bees, lots of snapshots, balances, scrubs. 
I recently also added random 'killall -INT btrfs' to send balance some fatal signals. > This is the same image that led to the patch "btrfs: allow use of global block > reserve for balance item deletion", so this could have left it in some > intermediate state where the balance item was not removed and the reloc tree as > well. > > There were a few unsuccessful mounts due to relocation recovery, that was > trying to debug but then it started to work. > > The error happened with this 'fi df' saved after the balance start: > > # btrfs fi df mnt > Data, single: total=80.01GiB, used=38.67GiB > System, single: total=4.00MiB, used=16.00KiB > Metadata, single: total=19.99GiB, used=19.46GiB > GlobalReserve, single: total=512.00MiB, used=44.00KiB Mine is: Data, single: total=1.75TiB, used=1.74TiB System, RAID1: total=32.00MiB, used=208.00KiB Metadata, RAID1: total=25.00GiB, used=22.89GiB GlobalReserve, single: total=512.00MiB, used=0.00B though this is some time after the failure (and a reboot). I do notice that there's lots of unallocated space, but metadata usage is close to allocated, and I have been experiencing a lot of EROFS events when that happens, even if there's gigabytes unallocated. 
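To make "metadata usage is close to allocated" concrete, here is a rough sketch that computes the used/total ratio per block group type from 'btrfs fi df' text. It assumes the line format shown in this thread; other btrfs-progs versions may print it differently, and the function name parse_fi_df is hypothetical.

```python
import re

# Binary unit suffixes as they appear in the output quoted in this thread.
UNITS = {"B": 1, "KiB": 1 << 10, "MiB": 1 << 20, "GiB": 1 << 30, "TiB": 1 << 40}

# Assumed line shape: "Metadata, RAID1: total=25.00GiB, used=22.89GiB"
LINE_RE = re.compile(
    r"^(?P<kind>\w+), (?P<profile>\w+): "
    r"total=(?P<total>[\d.]+)(?P<tu>(?:[KMGT]i)?B), "
    r"used=(?P<used>[\d.]+)(?P<uu>(?:[KMGT]i)?B)$"
)


def parse_fi_df(text):
    """Return {kind: (used_bytes, total_bytes)} for each recognised line."""
    usage = {}
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip prompts, blank lines, anything unexpected
        total = float(m["total"]) * UNITS[m["tu"]]
        used = float(m["used"]) * UNITS[m["uu"]]
        usage[m["kind"]] = (used, total)
    return usage


# The 'fi df' output quoted in this message.
sample = """\
Data, single: total=1.75TiB, used=1.74TiB
System, RAID1: total=32.00MiB, used=208.00KiB
Metadata, RAID1: total=25.00GiB, used=22.89GiB
GlobalReserve, single: total=512.00MiB, used=0.00B
"""

for kind, (used, total) in parse_fi_df(sample).items():
    print(f"{kind:14s} {100 * used / total:6.2f}% of allocated space used")
```

For the numbers above this puts metadata at roughly 91.6% of its allocated chunks, while data is above 99%.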
btrfs fi us:

Overall:
    Device size:                   2.00TiB
    Device allocated:              1.80TiB
    Device unallocated:          208.94GiB
    Device missing:                  0.00B
    Used:                          1.79TiB
    Free (estimated):            211.30GiB      (min: 106.83GiB)
    Data ratio:                       1.00
    Metadata ratio:                   2.00
    Global reserve:              512.00MiB      (used: 0.00B)

Data,single: Size:1.75TiB, Used:1.74TiB (99.87%)
   /dev/mapper/vgtest-tvdb       894.00GiB
   /dev/mapper/vgtest-tvdc       895.00GiB

Metadata,RAID1: Size:25.00GiB, Used:22.87GiB (91.47%)
   /dev/mapper/vgtest-tvdb        25.00GiB
   /dev/mapper/vgtest-tvdc        25.00GiB

System,RAID1: Size:32.00MiB, Used:208.00KiB (0.63%)
   /dev/mapper/vgtest-tvdb        32.00MiB
   /dev/mapper/vgtest-tvdc        32.00MiB

Unallocated:
   /dev/mapper/vgtest-tvdb       104.97GiB
   /dev/mapper/vgtest-tvdc       103.97GiB

> The error looks like a repeated relocation tree creation, which would point to
> the unsuccessful balances or inconsistent state (balance item, reloc trees).
> It's not a "typical" mix of operations but I'd appreciate any insights here.

I have the same line but different call stack, with misc-next
e3027d10af42d24940be74dabaf1550cd770bd48:

[ 9717.746937][T13609] BTRFS info (device dm-0): balance: start -mlimit=1 -slimit=1
[ 9717.765086][T13609] BTRFS info (device dm-0): relocating block group 10991411658752 flags metadata|raid1
[ 9718.511137][T13609] ------------[ cut here ]------------
[ 9718.512293][T13609] kernel BUG at fs/btrfs/relocation.c:794!
[ 9718.513421][T13609] invalid opcode: 0000 [#1] SMP KASAN PTI [ 9718.514590][T13609] CPU: 1 PID: 13609 Comm: btrfs Tainted: G W 5.8.0-6582a95aabfe+ #44 [ 9718.516178][T13609] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014 [ 9718.517750][T13609] RIP: 0010:create_reloc_root+0x468/0x480 [ 9718.518717][T13609] Code: e8 bd 5b bd ff 4d 8b 76 50 be 08 00 00 00 49 8d bc 24 f0 00 00 00 e8 c7 5b bd ff 4d 89 b4 24 f0 00 00 00 e9 ee fc ff ff 0f 0b <0f> 0b 0f 0b 0f 0b 0f 0b e8 9b df 07 01 66 66 2e 0f 1f 84 00 00 00 [ 9718.521995][T13609] RSP: 0018:ffffc900018e7018 EFLAGS: 00010282 [ 9718.522991][T13609] RAX: 00000000ffffffef RBX: ffff8881e103a400 RCX: 0000000000000000 [ 9718.524300][T13609] RDX: dffffc0000000000 RSI: 0000000000000000 RDI: 0000000000000246 [ 9718.525612][T13609] RBP: ffffc900018e7108 R08: 0000000000000000 R09: 0000000000000001 [ 9718.527056][T13609] R10: 0000000000000001 R11: fffffbfff3dfb081 R12: ffff8881f37c8020 [ 9718.528386][T13609] R13: ffff88801fbc5b28 R14: ffff8881f37c8000 R15: ffffc900018e70a0 [ 9718.529756][T13609] FS: 00007f9577d928c0(0000) GS:ffff8881f5800000(0000) knlGS:0000000000000000 [ 9718.531211][T13609] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 9718.532295][T13609] CR2: 00007f9823e35500 CR3: 00000000a52e0002 CR4: 00000000001606e0 [ 9718.533608][T13609] Call Trace: [ 9718.534151][T13609] ? update_backref_node+0xf0/0xf0 [ 9718.535137][T13609] ? check_chain_key+0x1e6/0x2e0 [ 9718.536057][T13609] btrfs_init_reloc_root+0x2d7/0x310 [ 9718.537016][T13609] ? find_reloc_root+0x200/0x200 [ 9718.537992][T13609] ? do_raw_spin_unlock+0xa8/0x140 [ 9718.538899][T13609] record_root_in_trans+0x18c/0x1d0 [ 9718.539848][T13609] btrfs_record_root_in_trans+0x8b/0xc0 [ 9718.540843][T13609] select_reloc_root+0x15f/0x6a0 [ 9718.541943][T13609] ? create_reloc_inode.isra.28+0x410/0x410 [ 9718.543066][T13609] ? rcu_read_lock_sched_held+0xa1/0xd0 [ 9718.544333][T13609] ? check_flags.part.44+0x86/0x220 [ 9718.545186][T13609] ? 
check_flags+0x26/0x30 [ 9718.545870][T13609] ? lock_is_held_type+0xc9/0x100 [ 9718.546651][T13609] do_relocation+0x242/0xc90 [ 9718.547372][T13609] ? select_reloc_root+0x6a0/0x6a0 [ 9718.548160][T13609] ? check_flags.part.44+0x86/0x220 [ 9718.548969][T13609] ? __kasan_check_read+0x11/0x20 [ 9718.549745][T13609] ? mark_lock+0xa8/0x440 [ 9718.550426][T13609] ? mark_held_locks+0x8d/0xb0 [ 9718.551165][T13609] ? btrfs_backref_cleanup_node+0x5c1/0x600 [ 9718.552079][T13609] ? memcpy+0x4d/0x60 [ 9718.552694][T13609] ? read_extent_buffer+0xcc/0x120 [ 9718.553478][T13609] relocate_tree_blocks+0xa29/0xb00 [ 9718.554255][T13609] ? do_relocation+0xc90/0xc90 [ 9718.554978][T13609] ? kmem_cache_alloc_trace+0x5af/0x740 [ 9718.555855][T13609] ? free_extent_buffer.part.46+0x90/0x140 [ 9718.556756][T13609] ? rb_insert_color+0x342/0x360 [ 9718.557581][T13609] ? free_extent_buffer+0x13/0x20 [ 9718.558445][T13609] ? add_tree_block.isra.34+0x236/0x2b0 [ 9718.559387][T13609] relocate_block_group+0x52e/0x830 [ 9718.560275][T13609] ? merge_reloc_roots+0x4b0/0x4b0 [ 9718.561137][T13609] btrfs_relocate_block_group+0x26e/0x4c0 [ 9718.562137][T13609] btrfs_relocate_chunk+0x52/0x120 [ 9718.562918][T13609] btrfs_balance+0xe22/0x1910 [ 9718.563605][T13609] ? check_chain_key+0x1e6/0x2e0 [ 9718.564331][T13609] ? btrfs_relocate_chunk+0x120/0x120 [ 9718.565126][T13609] ? kmem_cache_alloc_trace+0x5af/0x740 [ 9718.565943][T13609] ? _copy_from_user+0x95/0xd0 [ 9718.566649][T13609] btrfs_ioctl_balance+0x3de/0x4c0 [ 9718.567414][T13609] btrfs_ioctl+0x2385/0x4250 [ 9718.568090][T13609] ? __kasan_check_read+0x11/0x20 [ 9718.568830][T13609] ? check_chain_key+0x1e6/0x2e0 [ 9718.569619][T13609] ? btrfs_ioctl_get_supported_features+0x30/0x30 [ 9718.570658][T13609] ? kvm_sched_clock_read+0x18/0x30 [ 9718.571526][T13609] ? check_chain_key+0x1e6/0x2e0 [ 9718.572348][T13609] ? lock_downgrade+0x3e0/0x3e0 [ 9718.573121][T13609] ? do_vfs_ioctl+0xfc/0x9d0 [ 9718.573835][T13609] ? 
ioctl_file_clone+0xe0/0xe0 [ 9718.574637][T13609] ? check_flags.part.44+0x86/0x220 [ 9718.575472][T13609] ? check_flags+0x26/0x30 [ 9718.576190][T13609] ? lock_is_held_type+0xc9/0x100 [ 9718.576990][T13609] ? check_flags.part.44+0x86/0x220 [ 9718.577836][T13609] ? check_flags+0x26/0x30 [ 9718.578542][T13609] ? lock_is_held_type+0xc9/0x100 [ 9718.579403][T13609] ? __kasan_check_read+0x11/0x20 [ 9718.580225][T13609] ? __fget_light+0xae/0x110 [ 9718.580983][T13609] ksys_ioctl+0xa1/0xe0 [ 9718.581628][T13609] __x64_sys_ioctl+0x43/0x50 [ 9718.582334][T13609] do_syscall_64+0x60/0xf0 [ 9718.583285][T13609] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 9718.584378][T13609] RIP: 0033:0x7f9577e85427 [ 9718.585289][T13609] Code: Bad RIP value. [ 9718.586076][T13609] RSP: 002b:00007ffdc7b82548 EFLAGS: 00000206 ORIG_RAX: 0000000000000010 [ 9718.587896][T13609] RAX: ffffffffffffffda RBX: 00007ffdc7b825e8 RCX: 00007f9577e85427 [ 9718.589391][T13609] RDX: 00007ffdc7b825e8 RSI: 00000000c4009420 RDI: 0000000000000003 [ 9718.590817][T13609] RBP: 0000000000000003 R08: 0000000000000003 R09: 0000000000000078 [ 9718.592631][T13609] R10: fffffffffffff31c R11: 0000000000000206 R12: 0000000000000001 [ 9718.594405][T13609] R13: 0000000000000000 R14: 00007ffdc7b84a48 R15: 0000000000000001 [ 9718.596109][T13609] Modules linked in: [ 9718.597056][T13609] ---[ end trace 2cf173f8217fc093 ]--- [ 9718.598018][T13609] RIP: 0010:create_reloc_root+0x468/0x480 [ 9718.602850][T13609] Code: e8 bd 5b bd ff 4d 8b 76 50 be 08 00 00 00 49 8d bc 24 f0 00 00 00 e8 c7 5b bd ff 4d 89 b4 24 f0 00 00 00 e9 ee fc ff ff 0f 0b <0f> 0b 0f 0b 0f 0b 0f 0b e8 9b df 07 01 66 66 2e 0f 1f 84 00 00 00 [ 9718.613371][T13609] RSP: 0018:ffffc900018e7018 EFLAGS: 00010282 [ 9718.621286][T13609] RAX: 00000000ffffffef RBX: ffff8881e103a400 RCX: 0000000000000000 [ 9718.631255][T13609] RDX: dffffc0000000000 RSI: 0000000000000000 RDI: 0000000000000246 [ 9718.639764][T13609] RBP: ffffc900018e7108 R08: 0000000000000000 R09: 
0000000000000001
[ 9718.641533][T13609] R10: 0000000000000001 R11: fffffbfff3dfb081 R12: ffff8881f37c8020
[ 9718.643173][T13609] R13: ffff88801fbc5b28 R14: ffff8881f37c8000 R15: ffffc900018e70a0
[ 9718.644840][T13609] FS: 00007f9577d928c0(0000) GS:ffff8881f5800000(0000) knlGS:0000000000000000
[ 9718.646728][T13609] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 9718.648607][T13609] CR2: 00007f9823e35500 CR3: 00000000a52e0002 CR4: 00000000001606e0
[ 9718.869689][ T4545] ==================================================================

same line, different call stack:

0xffffffff81933dd8 is in create_reloc_root (fs/btrfs/relocation.c:794).
789		btrfs_tree_unlock(eb);
790		free_extent_buffer(eb);
791
792		ret = btrfs_insert_root(trans, fs_info->tree_root,
793					&root_key, root_item);
794		BUG_ON(ret);
795		kfree(root_item);
796
797		reloc_root = btrfs_read_tree_root(fs_info->tree_root, &root_key);
798		BUG_ON(IS_ERR(reloc_root));

followed by

[ 9718.869689][ T4545] ==================================================================
[ 9718.871333][ T4545] BUG: KASAN: use-after-free in __mutex_lock+0x202/0xce0
[ 9718.872483][ T4545] Read of size 4 at addr ffff888014e9402c by task crawl_28443/4545
[ 9718.873746][ T4545]
[ 9718.874106][ T4545] CPU: 1 PID: 4545 Comm: crawl_28443 Tainted: G D W 5.8.0-6582a95aabfe+ #44
[ 9718.875684][ T4545] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
[ 9718.877149][ T4545] Call Trace:
[ 9718.877655][ T4545] dump_stack+0xc8/0x11a
[ 9718.878317][ T4545] ? __mutex_lock+0x202/0xce0
[ 9718.879065][ T4545] print_address_description.constprop.8+0x1f/0x200
[ 9718.880167][ T4545] ? __mutex_lock+0x202/0xce0
[ 9718.880916][ T4545] ? __mutex_lock+0x202/0xce0
[ 9718.881666][ T4545] kasan_report.cold.11+0x20/0x3e
[ 9718.882483][ T4545] ? __mutex_lock+0x202/0xce0
[ 9718.883229][ T4545] __asan_load4+0x69/0x90
[ 9718.883920][ T4545] __mutex_lock+0x202/0xce0
[ 9718.884651][ T4545] ?
wait_current_trans+0xb7/0x230 [ 9718.885465][ T4545] ? btrfs_record_root_in_trans+0x7e/0xc0 [ 9718.886388][ T4545] ? mutex_lock_io_nested+0xc20/0xc20 [ 9718.887246][ T4545] ? __kasan_check_read+0x11/0x20 [ 9718.888035][ T4545] ? join_transaction+0x32/0x6f0 [ 9718.888854][ T4545] ? join_transaction+0x1a6/0x6f0 [ 9718.889679][ T4545] ? lock_downgrade+0x3e0/0x3e0 [ 9718.890496][ T4545] ? __kasan_check_write+0x14/0x20 [ 9718.891308][ T4545] ? lock_contended+0x720/0x720 [ 9718.892093][ T4545] ? do_raw_spin_lock+0x1e0/0x1e0 [ 9718.892912][ T4545] ? wait_current_trans+0xb7/0x230 [ 9718.893705][ T4545] mutex_lock_nested+0x1b/0x20 [ 9718.894494][ T4545] ? mutex_lock_nested+0x1b/0x20 [ 9718.895317][ T4545] btrfs_record_root_in_trans+0x7e/0xc0 [ 9718.896245][ T4545] start_transaction+0x189/0x8f0 [ 9718.897081][ T4545] btrfs_start_transaction+0x1e/0x20 [ 9718.897941][ T4545] btrfs_cont_expand+0x549/0x7a0 [ 9718.898805][ T4545] ? btrfs_truncate_block+0x930/0x930 [ 9718.899665][ T4545] ? inode_newsize_ok+0x75/0xc0 [ 9718.900438][ T4545] ? setattr_prepare+0x9c/0x310 [ 9718.901242][ T4545] btrfs_setattr+0x514/0x850 [ 9718.902035][ T4545] ? current_time+0x8c/0xe0 [ 9718.902799][ T4545] notify_change+0x4ec/0x700 [ 9718.903584][ T4545] ? do_sys_ftruncate+0x108/0x220 [ 9718.904459][ T4545] do_truncate+0xe4/0x160 [ 9718.905200][ T4545] ? __x64_sys_openat2+0x170/0x170 [ 9718.906116][ T4545] ? __sb_start_write+0x1a1/0x270 [ 9718.906954][ T4545] do_sys_ftruncate+0x1b8/0x220 [ 9718.907759][ T4545] __x64_sys_ftruncate+0x36/0x40 [ 9718.908577][ T4545] do_syscall_64+0x60/0xf0 [ 9718.909292][ T4545] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 9718.910521][ T4545] RIP: 0033:0x7f201fcab947 [ 9718.911247][ T4545] Code: Bad RIP value. 
[ 9718.911915][ T4545] RSP: 002b:00007f201d3abeb8 EFLAGS: 00000202 ORIG_RAX: 000000000000004d [ 9718.913285][ T4545] RAX: ffffffffffffffda RBX: 00007f201d3abfa0 RCX: 00007f201fcab947 [ 9718.914613][ T4545] RDX: 000000005f18a6d2 RSI: 0000000000286000 RDI: 0000000000000ec1 [ 9718.915921][ T4545] RBP: 00007f1fb01c2f00 R08: 00007ffe1e345080 R09: 00000000011b1f78 [ 9718.917236][ T4545] R10: 00000000011b1f78 R11: 0000000000000202 R12: 00007f201d3abf20 [ 9718.918556][ T4545] R13: 00007f201d3abef0 R14: 00007f201d3abf50 R15: 00007f201d3abed0 [ 9718.919882][ T4545] [ 9718.920268][ T4545] Allocated by task 6732: [ 9718.920973][ T4545] save_stack+0x21/0x50 [ 9718.921648][ T4545] __kasan_kmalloc.constprop.17+0xc1/0xd0 [ 9718.922580][ T4545] kasan_slab_alloc+0x12/0x20 [ 9718.923345][ T4545] kmem_cache_alloc_node+0x113/0x720 [ 9718.924203][ T4545] copy_process+0x357/0x3680 [ 9718.924955][ T4545] _do_fork+0xed/0x880 [ 9718.925622][ T4545] __do_sys_clone+0xee/0x130 [ 9718.926369][ T4545] __x64_sys_clone+0x67/0x80 [ 9718.927119][ T4545] do_syscall_64+0x60/0xf0 [ 9718.927848][ T4545] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 9718.928812][ T4545] [ 9718.929173][ T4545] Freed by task 24: [ 9718.929787][ T4545] save_stack+0x21/0x50 [ 9718.930453][ T4545] __kasan_slab_free+0x118/0x170 [ 9718.931242][ T4545] kasan_slab_free+0xe/0x10 [ 9718.931970][ T4545] kmem_cache_free+0x5f/0x280 [ 9718.932730][ T4545] free_task+0x73/0x90 [ 9718.933391][ T4545] __put_task_struct+0x199/0x1d0 [ 9718.934187][ T4545] delayed_put_task_struct+0x124/0x1b0 [ 9718.935071][ T4545] rcu_core+0x3b0/0xeb0 [ 9718.935758][ T4545] rcu_core_si+0xe/0x10 [ 9718.936433][ T4545] __do_softirq+0x120/0x5e3 [ 9718.937165][ T4545] [ 9718.937545][ T4545] The buggy address belongs to the object at ffff888014e94000 [ 9718.937545][ T4545] which belongs to the cache task_struct(168:screen-wrapper.service) of size 11072 [ 9718.940391][ T4545] The buggy address is located 44 bytes inside of [ 9718.940391][ T4545] 11072-byte region 
[ffff888014e94000, ffff888014e96b40) [ 9718.942559][ T4545] The buggy address belongs to the page: [ 9718.943454][ T4545] page:ffffea000053a500 refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff888014e97fff head:ffffea000053a500 order:2 compound_mapcount:0 compound_pincount:0 [ 9718.946072][ T4545] flags: 0xfffe0000010200(slab|head) [ 9718.946958][ T4545] raw: 00fffe0000010200 ffffea00011ab108 ffffea0001d6f108 ffff8881eabd9700 [ 9718.948406][ T4545] raw: ffff888014e97fff ffff888014e94000 0000000100000001 0000000000000000 [ 9718.949889][ T4545] page dumped because: kasan: bad access detected [ 9718.950977][ T4545] [ 9718.951354][ T4545] Memory state around the buggy address: [ 9718.952296][ T4545] ffff888014e93f00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 [ 9718.953641][ T4545] ffff888014e93f80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 [ 9718.955004][ T4545] >ffff888014e94000: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 9718.956366][ T4545] ^ [ 9718.957258][ T4545] ffff888014e94080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 9718.958653][ T4545] ffff888014e94100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 9718.960034][ T4545] ================================================================== ^ permalink raw reply [flat|nested] 13+ messages in thread
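The address arithmetic in the KASAN report above ("located 44 bytes inside of 11072-byte region") can be reproduced from the three addresses it prints. This is only a small sanity-check sketch; the helper name locate_in_object is hypothetical and it draws no conclusion about which task_struct field was being read.

```python
def locate_in_object(fault_addr: int, obj_start: int, obj_end: int):
    """Return (offset, size) of fault_addr within [obj_start, obj_end).

    Hypothetical helper: plain pointer arithmetic on the values that the
    KASAN report prints, nothing more.
    """
    if not obj_start <= fault_addr < obj_end:
        raise ValueError("address lies outside the reported object")
    return fault_addr - obj_start, obj_end - obj_start


# Values copied from the report: "Read of size 4 at addr ffff888014e9402c"
# and "region [ffff888014e94000, ffff888014e96b40)".
offset, size = locate_in_object(
    fault_addr=0xFFFF888014E9402C,
    obj_start=0xFFFF888014E94000,
    obj_end=0xFFFF888014E96B40,
)
print(f"offset {offset} (0x{offset:x}) inside a {size}-byte object")
```

Both values agree with the report: offset 44 (0x2c) in an object of 11072 bytes from the task_struct cache.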
* Re: BUG at fs/btrfs/relocation.c:794! 2020-07-23 21:56 ` Zygo Blaxell @ 2020-07-24 0:19 ` Qu Wenruo 2020-08-04 16:16 ` Zygo Blaxell 0 siblings, 1 reply; 13+ messages in thread From: Qu Wenruo @ 2020-07-24 0:19 UTC (permalink / raw) To: Zygo Blaxell, David Sterba; +Cc: linux-btrfs, wqu [-- Attachment #1.1: Type: text/plain, Size: 21223 bytes --] On 2020/7/24 上午5:56, Zygo Blaxell wrote: > On Wed, Jul 01, 2020 at 12:10:06AM +0200, David Sterba wrote: >> Hi, >> >> I've hit a crash in relocation I've never seen before. >> >> [ 2129.210066] kernel BUG at fs/btrfs/relocation.c:794! > > I hit an issue yesterday that reminded me of this. > >> [ 2129.215268] invalid opcode: 0000 [#1] PREEMPT SMP >> [ 2129.220114] CPU: 1 PID: 3303 Comm: btrfs Not tainted 5.8.0-rc3-git+ #638 >> [ 2129.220116] Hardware name: empty empty/S3993, BIOS PAQEX0-3 02/24/2008 >> [ 2129.220265] RIP: 0010:create_reloc_root+0x214/0x260 [btrfs] >> [ 2129.258760] RSP: 0018:ffffbe1e809b38b8 EFLAGS: 00010282 >> [ 2129.258763] RAX: 00000000ffffffef RBX: ffff988d577f9000 RCX: 0000000000000000 >> [ 2129.258765] RDX: 0000000000000001 RSI: ffffffff8e2a2580 RDI: ffff988d64aaa6a8 >> [ 2129.258766] RBP: ffff988d5dfcdc00 R08: 0000000000000000 R09: 0000000000000000 >> [ 2129.258767] R10: 0000000000000001 R11: 0000000000000000 R12: ffff988d0e02fa78 >> [ 2129.258769] R13: 0000000000000005 R14: ffff988d64fe8000 R15: ffff988d0e02fa78 >> [ 2129.258771] FS: 00007f82a612e8c0(0000) GS:ffff988d67000000(0000) knlGS:0000000000000000 >> [ 2129.258772] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 >> [ 2129.258774] CR2: 000000000559d028 CR3: 000000020b289000 CR4: 00000000000006e0 >> [ 2129.258775] Call Trace: >> [ 2129.258825] btrfs_init_reloc_root+0xe8/0x120 [btrfs] >> [ 2129.258862] record_root_in_trans+0xae/0xd0 [btrfs] >> [ 2129.258901] btrfs_record_root_in_trans+0x51/0x70 [btrfs] >> [ 2129.340388] select_reloc_root+0x94/0x340 [btrfs] >> [ 2129.340433] do_relocation+0xda/0x7b0 [btrfs] >> [ 2129.349854] ? 
_raw_spin_unlock+0x1f/0x40 >> [ 2129.349898] relocate_tree_blocks+0x336/0x670 [btrfs] >> [ 2129.359325] relocate_block_group+0x2f6/0x600 [btrfs] >> [ 2129.359365] btrfs_relocate_block_group+0x15e/0x340 [btrfs] >> [ 2129.359408] btrfs_relocate_chunk+0x38/0x110 [btrfs] >> [ 2129.375494] __btrfs_balance+0x42c/0xce0 [btrfs] >> [ 2129.375553] btrfs_balance+0x66a/0xbe0 [btrfs] >> [ 2129.375562] ? kmem_cache_alloc_trace+0x19c/0x330 >> [ 2129.389852] btrfs_ioctl_balance+0x298/0x350 [btrfs] >> [ 2129.389887] btrfs_ioctl+0x304/0x2490 [btrfs] >> [ 2129.389898] ? do_user_addr_fault+0x221/0x49c >> [ 2129.404070] ? sched_clock_cpu+0x15/0x140 >> [ 2129.404073] ? do_user_addr_fault+0x221/0x49c >> [ 2129.404079] ? up_read+0x18/0x240 >> [ 2129.404086] ? ksys_ioctl+0x68/0xa0 >> [ 2129.404091] ksys_ioctl+0x68/0xa0 >> [ 2129.423308] __x64_sys_ioctl+0x16/0x20 >> [ 2129.423312] do_syscall_64+0x50/0xe0 >> [ 2129.423315] entry_SYSCALL_64_after_hwframe+0x44/0xa9 >> [ 2129.423318] RIP: 0033:0x7f82a51c6327 >> [ 2129.423319] Code: Bad RIP value. >> [ 2129.423348] RSP: 002b:00007ffd32cf6218 EFLAGS: 00000206 ORIG_RAX: 0000000000000010 >> [ 2129.423367] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f82a51c6327 >> [ 2129.423368] RDX: 00007ffd32cf62a0 RSI: 00000000c4009420 RDI: 0000000000000003 >> [ 2129.423372] RBP: 0000000000000003 R08: 0000000000000000 R09: 0000000000000000 >> [ 2129.423377] R10: 000000000fa99fa0 R11: 0000000000000206 R12: 00007ffd32cf8823 >> [ 2129.423379] R13: 00007ffd32cf62a0 R14: 0000000000000001 R15: 0000000000000000 >> >> Relevant code called from create_reloc_root: >> >> ret = btrfs_insert_root(trans, fs_info->tree_root, >> &root_key, root_item); >> BUG_ON(ret) >> >> and according to EAX, ret is -17 which is EEXIST. >> >> I don't have a reproducer, the testing image has been filled by random git >> checkouts, deduplicated by BEES, then tons of snapshots created until the >> metadata got exhausted, some file deletion and balances. 
> > Mine is rsync, bees, lots of snapshots, balances, scrubs. I recently also > added random 'killall -INT btrfs' to send balance some fatal signals. > >> This is the same image that led to the patch "btrfs: allow use of global block >> reserve for balance item deletion", so this could have left it in some >> intermediate state where the balance item was not removed and the reloc tree as >> well. >> >> There were a few unsuccessful mounts due to relocation recovery, that was >> trying to debug but then it started to work. >> >> The error happened with this 'fi df' saved after the balance start: >> >> # btrfs fi df mnt >> Data, single: total=80.01GiB, used=38.67GiB >> System, single: total=4.00MiB, used=16.00KiB >> Metadata, single: total=19.99GiB, used=19.46GiB >> GlobalReserve, single: total=512.00MiB, used=44.00KiB > > Mine is: > > Data, single: total=1.75TiB, used=1.74TiB > System, RAID1: total=32.00MiB, used=208.00KiB > Metadata, RAID1: total=25.00GiB, used=22.89GiB > GlobalReserve, single: total=512.00MiB, used=0.00B > > though this is some time after the failure (and a reboot). I do notice > that there's lots of unallocated space, but metadata usage is close > to allocated, and I have been experiencing a lot of EROFS events when > that happens, even if there's gigabytes unallocated. 
>
> btrfs fi us:
>
> Overall:
>     Device size:                   2.00TiB
>     Device allocated:              1.80TiB
>     Device unallocated:          208.94GiB
>     Device missing:                  0.00B
>     Used:                          1.79TiB
>     Free (estimated):            211.30GiB      (min: 106.83GiB)
>     Data ratio:                       1.00
>     Metadata ratio:                   2.00
>     Global reserve:              512.00MiB      (used: 0.00B)
>
> Data,single: Size:1.75TiB, Used:1.74TiB (99.87%)
>    /dev/mapper/vgtest-tvdb       894.00GiB
>    /dev/mapper/vgtest-tvdc       895.00GiB
>
> Metadata,RAID1: Size:25.00GiB, Used:22.87GiB (91.47%)
>    /dev/mapper/vgtest-tvdb        25.00GiB
>    /dev/mapper/vgtest-tvdc        25.00GiB
>
> System,RAID1: Size:32.00MiB, Used:208.00KiB (0.63%)
>    /dev/mapper/vgtest-tvdb        32.00MiB
>    /dev/mapper/vgtest-tvdc        32.00MiB
>
> Unallocated:
>    /dev/mapper/vgtest-tvdb       104.97GiB
>    /dev/mapper/vgtest-tvdc       103.97GiB
>
>> The error looks like a repeated relocation tree creation, which would point to
>> the unsuccessful balances or inconsistent state (balance item, reloc trees).
>> It's not a "typical" mix of operations but I'd appreciate any insights here.
>
> I have the same line but different call stack, with misc-next
> e3027d10af42d24940be74dabaf1550cd770bd48:
>
> [ 9717.746937][T13609] BTRFS info (device dm-0): balance: start -mlimit=1 -slimit=1
> [ 9717.765086][T13609] BTRFS info (device dm-0): relocating block group 10991411658752 flags metadata|raid1
> [ 9718.511137][T13609] ------------[ cut here ]------------
> [ 9718.512293][T13609] kernel BUG at fs/btrfs/relocation.c:794!
> [ 9718.513421][T13609] invalid opcode: 0000 [#1] SMP KASAN PTI
> [ 9718.514590][T13609] CPU: 1 PID: 13609 Comm: btrfs Tainted: G W 5.8.0-6582a95aabfe+ #44
> [ 9718.516178][T13609] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
> [ 9718.517750][T13609] RIP: 0010:create_reloc_root+0x468/0x480
> [ 9718.518717][T13609] Code: e8 bd 5b bd ff 4d 8b 76 50 be 08 00 00 00 49 8d bc 24 f0 00 00 00 e8 c7 5b bd ff 4d 89 b4 24 f0 00 00 00 e9 ee fc ff ff 0f 0b <0f> 0b 0f 0b 0f 0b 0f 0b e8 9b df 07 01 66 66 2e 0f 1f 84 00 00 00
> [ 9718.521995][T13609] RSP: 0018:ffffc900018e7018 EFLAGS: 00010282
> [ 9718.522991][T13609] RAX: 00000000ffffffef RBX: ffff8881e103a400 RCX: 0000000000000000
> [ 9718.524300][T13609] RDX: dffffc0000000000 RSI: 0000000000000000 RDI: 0000000000000246
> [ 9718.525612][T13609] RBP: ffffc900018e7108 R08: 0000000000000000 R09: 0000000000000001
> [ 9718.527056][T13609] R10: 0000000000000001 R11: fffffbfff3dfb081 R12: ffff8881f37c8020
> [ 9718.528386][T13609] R13: ffff88801fbc5b28 R14: ffff8881f37c8000 R15: ffffc900018e70a0
> [ 9718.529756][T13609] FS: 00007f9577d928c0(0000) GS:ffff8881f5800000(0000) knlGS:0000000000000000
> [ 9718.531211][T13609] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 9718.532295][T13609] CR2: 00007f9823e35500 CR3: 00000000a52e0002 CR4: 00000000001606e0
> [ 9718.533608][T13609] Call Trace:
> [ 9718.534151][T13609] ? update_backref_node+0xf0/0xf0
> [ 9718.535137][T13609] ? check_chain_key+0x1e6/0x2e0
> [ 9718.536057][T13609] btrfs_init_reloc_root+0x2d7/0x310

That's the same problem: btrfs_init_reloc_root() got -EEXIST, which
triggered the BUG_ON().

That means some reloc trees were not cleaned up.

Would you mind providing the "btrfs ins dump-tree -t root" dump for
that fs if the problem still happens?

Thanks,
Qu

> [ 9718.537016][T13609] ? find_reloc_root+0x200/0x200
> [ 9718.537992][T13609] ?
do_raw_spin_unlock+0xa8/0x140 > [ 9718.538899][T13609] record_root_in_trans+0x18c/0x1d0 > [ 9718.539848][T13609] btrfs_record_root_in_trans+0x8b/0xc0 > [ 9718.540843][T13609] select_reloc_root+0x15f/0x6a0 > [ 9718.541943][T13609] ? create_reloc_inode.isra.28+0x410/0x410 > [ 9718.543066][T13609] ? rcu_read_lock_sched_held+0xa1/0xd0 > [ 9718.544333][T13609] ? check_flags.part.44+0x86/0x220 > [ 9718.545186][T13609] ? check_flags+0x26/0x30 > [ 9718.545870][T13609] ? lock_is_held_type+0xc9/0x100 > [ 9718.546651][T13609] do_relocation+0x242/0xc90 > [ 9718.547372][T13609] ? select_reloc_root+0x6a0/0x6a0 > [ 9718.548160][T13609] ? check_flags.part.44+0x86/0x220 > [ 9718.548969][T13609] ? __kasan_check_read+0x11/0x20 > [ 9718.549745][T13609] ? mark_lock+0xa8/0x440 > [ 9718.550426][T13609] ? mark_held_locks+0x8d/0xb0 > [ 9718.551165][T13609] ? btrfs_backref_cleanup_node+0x5c1/0x600 > [ 9718.552079][T13609] ? memcpy+0x4d/0x60 > [ 9718.552694][T13609] ? read_extent_buffer+0xcc/0x120 > [ 9718.553478][T13609] relocate_tree_blocks+0xa29/0xb00 > [ 9718.554255][T13609] ? do_relocation+0xc90/0xc90 > [ 9718.554978][T13609] ? kmem_cache_alloc_trace+0x5af/0x740 > [ 9718.555855][T13609] ? free_extent_buffer.part.46+0x90/0x140 > [ 9718.556756][T13609] ? rb_insert_color+0x342/0x360 > [ 9718.557581][T13609] ? free_extent_buffer+0x13/0x20 > [ 9718.558445][T13609] ? add_tree_block.isra.34+0x236/0x2b0 > [ 9718.559387][T13609] relocate_block_group+0x52e/0x830 > [ 9718.560275][T13609] ? merge_reloc_roots+0x4b0/0x4b0 > [ 9718.561137][T13609] btrfs_relocate_block_group+0x26e/0x4c0 > [ 9718.562137][T13609] btrfs_relocate_chunk+0x52/0x120 > [ 9718.562918][T13609] btrfs_balance+0xe22/0x1910 > [ 9718.563605][T13609] ? check_chain_key+0x1e6/0x2e0 > [ 9718.564331][T13609] ? btrfs_relocate_chunk+0x120/0x120 > [ 9718.565126][T13609] ? kmem_cache_alloc_trace+0x5af/0x740 > [ 9718.565943][T13609] ? 
_copy_from_user+0x95/0xd0 > [ 9718.566649][T13609] btrfs_ioctl_balance+0x3de/0x4c0 > [ 9718.567414][T13609] btrfs_ioctl+0x2385/0x4250 > [ 9718.568090][T13609] ? __kasan_check_read+0x11/0x20 > [ 9718.568830][T13609] ? check_chain_key+0x1e6/0x2e0 > [ 9718.569619][T13609] ? btrfs_ioctl_get_supported_features+0x30/0x30 > [ 9718.570658][T13609] ? kvm_sched_clock_read+0x18/0x30 > [ 9718.571526][T13609] ? check_chain_key+0x1e6/0x2e0 > [ 9718.572348][T13609] ? lock_downgrade+0x3e0/0x3e0 > [ 9718.573121][T13609] ? do_vfs_ioctl+0xfc/0x9d0 > [ 9718.573835][T13609] ? ioctl_file_clone+0xe0/0xe0 > [ 9718.574637][T13609] ? check_flags.part.44+0x86/0x220 > [ 9718.575472][T13609] ? check_flags+0x26/0x30 > [ 9718.576190][T13609] ? lock_is_held_type+0xc9/0x100 > [ 9718.576990][T13609] ? check_flags.part.44+0x86/0x220 > [ 9718.577836][T13609] ? check_flags+0x26/0x30 > [ 9718.578542][T13609] ? lock_is_held_type+0xc9/0x100 > [ 9718.579403][T13609] ? __kasan_check_read+0x11/0x20 > [ 9718.580225][T13609] ? __fget_light+0xae/0x110 > [ 9718.580983][T13609] ksys_ioctl+0xa1/0xe0 > [ 9718.581628][T13609] __x64_sys_ioctl+0x43/0x50 > [ 9718.582334][T13609] do_syscall_64+0x60/0xf0 > [ 9718.583285][T13609] entry_SYSCALL_64_after_hwframe+0x44/0xa9 > [ 9718.584378][T13609] RIP: 0033:0x7f9577e85427 > [ 9718.585289][T13609] Code: Bad RIP value. 
> [ 9718.586076][T13609] RSP: 002b:00007ffdc7b82548 EFLAGS: 00000206 ORIG_RAX: 0000000000000010 > [ 9718.587896][T13609] RAX: ffffffffffffffda RBX: 00007ffdc7b825e8 RCX: 00007f9577e85427 > [ 9718.589391][T13609] RDX: 00007ffdc7b825e8 RSI: 00000000c4009420 RDI: 0000000000000003 > [ 9718.590817][T13609] RBP: 0000000000000003 R08: 0000000000000003 R09: 0000000000000078 > [ 9718.592631][T13609] R10: fffffffffffff31c R11: 0000000000000206 R12: 0000000000000001 > [ 9718.594405][T13609] R13: 0000000000000000 R14: 00007ffdc7b84a48 R15: 0000000000000001 > [ 9718.596109][T13609] Modules linked in: > [ 9718.597056][T13609] ---[ end trace 2cf173f8217fc093 ]--- > [ 9718.598018][T13609] RIP: 0010:create_reloc_root+0x468/0x480 > [ 9718.602850][T13609] Code: e8 bd 5b bd ff 4d 8b 76 50 be 08 00 00 00 49 8d bc 24 f0 00 00 00 e8 c7 5b bd ff 4d 89 b4 24 f0 00 00 00 e9 ee fc ff ff 0f 0b <0f> 0b 0f 0b 0f 0b 0f 0b e8 9b df 07 01 66 66 2e 0f 1f 84 00 00 00 > [ 9718.613371][T13609] RSP: 0018:ffffc900018e7018 EFLAGS: 00010282 > [ 9718.621286][T13609] RAX: 00000000ffffffef RBX: ffff8881e103a400 RCX: 0000000000000000 > [ 9718.631255][T13609] RDX: dffffc0000000000 RSI: 0000000000000000 RDI: 0000000000000246 > [ 9718.639764][T13609] RBP: ffffc900018e7108 R08: 0000000000000000 R09: 0000000000000001 > [ 9718.641533][T13609] R10: 0000000000000001 R11: fffffbfff3dfb081 R12: ffff8881f37c8020 > [ 9718.643173][T13609] R13: ffff88801fbc5b28 R14: ffff8881f37c8000 R15: ffffc900018e70a0 > [ 9718.644840][T13609] FS: 00007f9577d928c0(0000) GS:ffff8881f5800000(0000) knlGS:0000000000000000 > [ 9718.646728][T13609] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > [ 9718.648607][T13609] CR2: 00007f9823e35500 CR3: 00000000a52e0002 CR4: 00000000001606e0 > [ 9718.869689][ T4545] ================================================================== > > same line, different call stack: > > 0xffffffff81933dd8 is in create_reloc_root (fs/btrfs/relocation.c:794). 
> 789 btrfs_tree_unlock(eb); > 790 free_extent_buffer(eb); > 791 > 792 ret = btrfs_insert_root(trans, fs_info->tree_root, > 793 &root_key, root_item); > 794 BUG_ON(ret); > 795 kfree(root_item); > 796 > 797 reloc_root = btrfs_read_tree_root(fs_info->tree_root, &root_key); > 798 BUG_ON(IS_ERR(reloc_root)); > > followed by > > [ 9718.869689][ T4545] ================================================================== > [ 9718.871333][ T4545] BUG: KASAN: use-after-free in __mutex_lock+0x202/0xce0 > [ 9718.872483][ T4545] Read of size 4 at addr ffff888014e9402c by task crawl_28443/4545 > [ 9718.873746][ T4545] > [ 9718.874106][ T4545] CPU: 1 PID: 4545 Comm: crawl_28443 Tainted: G D W 5.8.0-6582a95aabfe+ #44 > [ 9718.875684][ T4545] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014 > [ 9718.877149][ T4545] Call Trace: > [ 9718.877655][ T4545] dump_stack+0xc8/0x11a > [ 9718.878317][ T4545] ? __mutex_lock+0x202/0xce0 > [ 9718.879065][ T4545] print_address_description.constprop.8+0x1f/0x200 > [ 9718.880167][ T4545] ? __mutex_lock+0x202/0xce0 > [ 9718.880916][ T4545] ? __mutex_lock+0x202/0xce0 > [ 9718.881666][ T4545] kasan_report.cold.11+0x20/0x3e > [ 9718.882483][ T4545] ? __mutex_lock+0x202/0xce0 > [ 9718.883229][ T4545] __asan_load4+0x69/0x90 > [ 9718.883920][ T4545] __mutex_lock+0x202/0xce0 > [ 9718.884651][ T4545] ? wait_current_trans+0xb7/0x230 > [ 9718.885465][ T4545] ? btrfs_record_root_in_trans+0x7e/0xc0 > [ 9718.886388][ T4545] ? mutex_lock_io_nested+0xc20/0xc20 > [ 9718.887246][ T4545] ? __kasan_check_read+0x11/0x20 > [ 9718.888035][ T4545] ? join_transaction+0x32/0x6f0 > [ 9718.888854][ T4545] ? join_transaction+0x1a6/0x6f0 > [ 9718.889679][ T4545] ? lock_downgrade+0x3e0/0x3e0 > [ 9718.890496][ T4545] ? __kasan_check_write+0x14/0x20 > [ 9718.891308][ T4545] ? lock_contended+0x720/0x720 > [ 9718.892093][ T4545] ? do_raw_spin_lock+0x1e0/0x1e0 > [ 9718.892912][ T4545] ? 
wait_current_trans+0xb7/0x230 > [ 9718.893705][ T4545] mutex_lock_nested+0x1b/0x20 > [ 9718.894494][ T4545] ? mutex_lock_nested+0x1b/0x20 > [ 9718.895317][ T4545] btrfs_record_root_in_trans+0x7e/0xc0 > [ 9718.896245][ T4545] start_transaction+0x189/0x8f0 > [ 9718.897081][ T4545] btrfs_start_transaction+0x1e/0x20 > [ 9718.897941][ T4545] btrfs_cont_expand+0x549/0x7a0 > [ 9718.898805][ T4545] ? btrfs_truncate_block+0x930/0x930 > [ 9718.899665][ T4545] ? inode_newsize_ok+0x75/0xc0 > [ 9718.900438][ T4545] ? setattr_prepare+0x9c/0x310 > [ 9718.901242][ T4545] btrfs_setattr+0x514/0x850 > [ 9718.902035][ T4545] ? current_time+0x8c/0xe0 > [ 9718.902799][ T4545] notify_change+0x4ec/0x700 > [ 9718.903584][ T4545] ? do_sys_ftruncate+0x108/0x220 > [ 9718.904459][ T4545] do_truncate+0xe4/0x160 > [ 9718.905200][ T4545] ? __x64_sys_openat2+0x170/0x170 > [ 9718.906116][ T4545] ? __sb_start_write+0x1a1/0x270 > [ 9718.906954][ T4545] do_sys_ftruncate+0x1b8/0x220 > [ 9718.907759][ T4545] __x64_sys_ftruncate+0x36/0x40 > [ 9718.908577][ T4545] do_syscall_64+0x60/0xf0 > [ 9718.909292][ T4545] entry_SYSCALL_64_after_hwframe+0x44/0xa9 > [ 9718.910521][ T4545] RIP: 0033:0x7f201fcab947 > [ 9718.911247][ T4545] Code: Bad RIP value. 
> [ 9718.911915][ T4545] RSP: 002b:00007f201d3abeb8 EFLAGS: 00000202 ORIG_RAX: 000000000000004d > [ 9718.913285][ T4545] RAX: ffffffffffffffda RBX: 00007f201d3abfa0 RCX: 00007f201fcab947 > [ 9718.914613][ T4545] RDX: 000000005f18a6d2 RSI: 0000000000286000 RDI: 0000000000000ec1 > [ 9718.915921][ T4545] RBP: 00007f1fb01c2f00 R08: 00007ffe1e345080 R09: 00000000011b1f78 > [ 9718.917236][ T4545] R10: 00000000011b1f78 R11: 0000000000000202 R12: 00007f201d3abf20 > [ 9718.918556][ T4545] R13: 00007f201d3abef0 R14: 00007f201d3abf50 R15: 00007f201d3abed0 > [ 9718.919882][ T4545] > [ 9718.920268][ T4545] Allocated by task 6732: > [ 9718.920973][ T4545] save_stack+0x21/0x50 > [ 9718.921648][ T4545] __kasan_kmalloc.constprop.17+0xc1/0xd0 > [ 9718.922580][ T4545] kasan_slab_alloc+0x12/0x20 > [ 9718.923345][ T4545] kmem_cache_alloc_node+0x113/0x720 > [ 9718.924203][ T4545] copy_process+0x357/0x3680 > [ 9718.924955][ T4545] _do_fork+0xed/0x880 > [ 9718.925622][ T4545] __do_sys_clone+0xee/0x130 > [ 9718.926369][ T4545] __x64_sys_clone+0x67/0x80 > [ 9718.927119][ T4545] do_syscall_64+0x60/0xf0 > [ 9718.927848][ T4545] entry_SYSCALL_64_after_hwframe+0x44/0xa9 > [ 9718.928812][ T4545] > [ 9718.929173][ T4545] Freed by task 24: > [ 9718.929787][ T4545] save_stack+0x21/0x50 > [ 9718.930453][ T4545] __kasan_slab_free+0x118/0x170 > [ 9718.931242][ T4545] kasan_slab_free+0xe/0x10 > [ 9718.931970][ T4545] kmem_cache_free+0x5f/0x280 > [ 9718.932730][ T4545] free_task+0x73/0x90 > [ 9718.933391][ T4545] __put_task_struct+0x199/0x1d0 > [ 9718.934187][ T4545] delayed_put_task_struct+0x124/0x1b0 > [ 9718.935071][ T4545] rcu_core+0x3b0/0xeb0 > [ 9718.935758][ T4545] rcu_core_si+0xe/0x10 > [ 9718.936433][ T4545] __do_softirq+0x120/0x5e3 > [ 9718.937165][ T4545] > [ 9718.937545][ T4545] The buggy address belongs to the object at ffff888014e94000 > [ 9718.937545][ T4545] which belongs to the cache task_struct(168:screen-wrapper.service) of size 11072 > [ 9718.940391][ T4545] The buggy address is 
located 44 bytes inside of > [ 9718.940391][ T4545] 11072-byte region [ffff888014e94000, ffff888014e96b40) > [ 9718.942559][ T4545] The buggy address belongs to the page: > [ 9718.943454][ T4545] page:ffffea000053a500 refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff888014e97fff head:ffffea000053a500 order:2 compound_mapcount:0 compound_pincount:0 > [ 9718.946072][ T4545] flags: 0xfffe0000010200(slab|head) > [ 9718.946958][ T4545] raw: 00fffe0000010200 ffffea00011ab108 ffffea0001d6f108 ffff8881eabd9700 > [ 9718.948406][ T4545] raw: ffff888014e97fff ffff888014e94000 0000000100000001 0000000000000000 > [ 9718.949889][ T4545] page dumped because: kasan: bad access detected > [ 9718.950977][ T4545] > [ 9718.951354][ T4545] Memory state around the buggy address: > [ 9718.952296][ T4545] ffff888014e93f00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 > [ 9718.953641][ T4545] ffff888014e93f80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 > [ 9718.955004][ T4545] >ffff888014e94000: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb > [ 9718.956366][ T4545] ^ > [ 9718.957258][ T4545] ffff888014e94080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb > [ 9718.958653][ T4545] ffff888014e94100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb > [ 9718.960034][ T4545] ================================================================== > [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 488 bytes --] ^ permalink raw reply [flat|nested] 13+ messages in thread
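For anyone following along, here is a minimal sketch of how the RAX value in the traces above (00000000ffffffef) maps to the -EEXIST mentioned in the reply. It assumes the failing call's return value is a C int held in the low 32 bits of RAX and that errno numbers are the Linux ones; the helper name decode_ret is made up for illustration and is not part of any tool.

```python
import errno


def decode_ret(rax_hex):
    """Interpret the low 32 bits of a RAX dump as a signed C int return value."""
    low32 = int(rax_hex, 16) & 0xFFFFFFFF
    # Two's complement: a set top bit means the 32-bit value is negative.
    ret = low32 - (1 << 32) if low32 & 0x80000000 else low32
    # Kernel functions conventionally return -errno on failure.
    if ret < 0:
        name = errno.errorcode.get(-ret, "unknown")
    else:
        name = "no error"
    return ret, name


if __name__ == "__main__":
    # RAX exactly as printed in the oops
    print(decode_ret("00000000ffffffef"))
```

This only decodes the register; it says nothing about why btrfs_insert_root() found an existing item.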
* Re: BUG at fs/btrfs/relocation.c:794! 2020-07-24 0:19 ` Qu Wenruo @ 2020-08-04 16:16 ` Zygo Blaxell 2020-08-28 0:03 ` BUG at fs/btrfs/relocation.c:794! Still happening on misc-next and 5.8.3 Zygo Blaxell 0 siblings, 1 reply; 13+ messages in thread From: Zygo Blaxell @ 2020-08-04 16:16 UTC (permalink / raw) To: Qu Wenruo; +Cc: David Sterba, linux-btrfs, wqu On Fri, Jul 24, 2020 at 08:19:36AM +0800, Qu Wenruo wrote: > > > On 2020/7/24 上午5:56, Zygo Blaxell wrote: > > On Wed, Jul 01, 2020 at 12:10:06AM +0200, David Sterba wrote: > >> Hi, > >> > >> I've hit a crash in relocation I've never seen before. > >> > >> [ 2129.210066] kernel BUG at fs/btrfs/relocation.c:794! > > > > I hit an issue yesterday that reminded me of this. > > > >> [ 2129.215268] invalid opcode: 0000 [#1] PREEMPT SMP > >> [ 2129.220114] CPU: 1 PID: 3303 Comm: btrfs Not tainted 5.8.0-rc3-git+ #638 > >> [ 2129.220116] Hardware name: empty empty/S3993, BIOS PAQEX0-3 02/24/2008 > >> [ 2129.220265] RIP: 0010:create_reloc_root+0x214/0x260 [btrfs] > >> [ 2129.258760] RSP: 0018:ffffbe1e809b38b8 EFLAGS: 00010282 > >> [ 2129.258763] RAX: 00000000ffffffef RBX: ffff988d577f9000 RCX: 0000000000000000 > >> [ 2129.258765] RDX: 0000000000000001 RSI: ffffffff8e2a2580 RDI: ffff988d64aaa6a8 > >> [ 2129.258766] RBP: ffff988d5dfcdc00 R08: 0000000000000000 R09: 0000000000000000 > >> [ 2129.258767] R10: 0000000000000001 R11: 0000000000000000 R12: ffff988d0e02fa78 > >> [ 2129.258769] R13: 0000000000000005 R14: ffff988d64fe8000 R15: ffff988d0e02fa78 > >> [ 2129.258771] FS: 00007f82a612e8c0(0000) GS:ffff988d67000000(0000) knlGS:0000000000000000 > >> [ 2129.258772] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > >> [ 2129.258774] CR2: 000000000559d028 CR3: 000000020b289000 CR4: 00000000000006e0 > >> [ 2129.258775] Call Trace: > >> [ 2129.258825] btrfs_init_reloc_root+0xe8/0x120 [btrfs] > >> [ 2129.258862] record_root_in_trans+0xae/0xd0 [btrfs] > >> [ 2129.258901] btrfs_record_root_in_trans+0x51/0x70 [btrfs] > >> [ 
2129.340388] select_reloc_root+0x94/0x340 [btrfs] > >> [ 2129.340433] do_relocation+0xda/0x7b0 [btrfs] > >> [ 2129.349854] ? _raw_spin_unlock+0x1f/0x40 > >> [ 2129.349898] relocate_tree_blocks+0x336/0x670 [btrfs] > >> [ 2129.359325] relocate_block_group+0x2f6/0x600 [btrfs] > >> [ 2129.359365] btrfs_relocate_block_group+0x15e/0x340 [btrfs] > >> [ 2129.359408] btrfs_relocate_chunk+0x38/0x110 [btrfs] > >> [ 2129.375494] __btrfs_balance+0x42c/0xce0 [btrfs] > >> [ 2129.375553] btrfs_balance+0x66a/0xbe0 [btrfs] > >> [ 2129.375562] ? kmem_cache_alloc_trace+0x19c/0x330 > >> [ 2129.389852] btrfs_ioctl_balance+0x298/0x350 [btrfs] > >> [ 2129.389887] btrfs_ioctl+0x304/0x2490 [btrfs] > >> [ 2129.389898] ? do_user_addr_fault+0x221/0x49c > >> [ 2129.404070] ? sched_clock_cpu+0x15/0x140 > >> [ 2129.404073] ? do_user_addr_fault+0x221/0x49c > >> [ 2129.404079] ? up_read+0x18/0x240 > >> [ 2129.404086] ? ksys_ioctl+0x68/0xa0 > >> [ 2129.404091] ksys_ioctl+0x68/0xa0 > >> [ 2129.423308] __x64_sys_ioctl+0x16/0x20 > >> [ 2129.423312] do_syscall_64+0x50/0xe0 > >> [ 2129.423315] entry_SYSCALL_64_after_hwframe+0x44/0xa9 > >> [ 2129.423318] RIP: 0033:0x7f82a51c6327 > >> [ 2129.423319] Code: Bad RIP value. > >> [ 2129.423348] RSP: 002b:00007ffd32cf6218 EFLAGS: 00000206 ORIG_RAX: 0000000000000010 > >> [ 2129.423367] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f82a51c6327 > >> [ 2129.423368] RDX: 00007ffd32cf62a0 RSI: 00000000c4009420 RDI: 0000000000000003 > >> [ 2129.423372] RBP: 0000000000000003 R08: 0000000000000000 R09: 0000000000000000 > >> [ 2129.423377] R10: 000000000fa99fa0 R11: 0000000000000206 R12: 00007ffd32cf8823 > >> [ 2129.423379] R13: 00007ffd32cf62a0 R14: 0000000000000001 R15: 0000000000000000 > >> > >> Relevant code called from create_reloc_root: > >> > >> ret = btrfs_insert_root(trans, fs_info->tree_root, > >> &root_key, root_item); > >> BUG_ON(ret) > >> > >> and according to EAX, ret is -17 which is EEXIST. 
> >> > >> I don't have a reproducer, the testing image has been filled by random git > >> checkouts, deduplicated by BEES, then tons of snapshots created until the > >> metadata got exhausted, some file deletion and balances. > > > > Mine is rsync, bees, lots of snapshots, balances, scrubs. I recently also > > added random 'killall -INT btrfs' to send balance some fatal signals. > > > >> This is the same image that led to the patch "btrfs: allow use of global block > >> reserve for balance item deletion", so this could have left it in some > >> intermediate state where the balance item was not removed and the reloc tree as > >> well. > >> > >> There were a few unsuccessful mounts due to relocation recovery, that was > >> trying to debug but then it started to work. > >> > >> The error happened with this 'fi df' saved after the balance start: > >> > >> # btrfs fi df mnt > >> Data, single: total=80.01GiB, used=38.67GiB > >> System, single: total=4.00MiB, used=16.00KiB > >> Metadata, single: total=19.99GiB, used=19.46GiB > >> GlobalReserve, single: total=512.00MiB, used=44.00KiB > > > > Mine is: > > > > Data, single: total=1.75TiB, used=1.74TiB > > System, RAID1: total=32.00MiB, used=208.00KiB > > Metadata, RAID1: total=25.00GiB, used=22.89GiB > > GlobalReserve, single: total=512.00MiB, used=0.00B > > > > though this is some time after the failure (and a reboot). I do notice > > that there's lots of unallocated space, but metadata usage is close > > to allocated, and I have been experiencing a lot of EROFS events when > > that happens, even if there's gigabytes unallocated. 
> > > > btrfs fi us: > > > > Overall: > > Device size: 2.00TiB > > Device allocated: 1.80TiB > > Device unallocated: 208.94GiB > > Device missing: 0.00B > > Used: 1.79TiB > > Free (estimated): 211.30GiB (min: 106.83GiB) > > Data ratio: 1.00 > > Metadata ratio: 2.00 > > Global reserve: 512.00MiB (used: 0.00B) > > > > Data,single: Size:1.75TiB, Used:1.74TiB (99.87%) > > /dev/mapper/vgtest-tvdb 894.00GiB > > /dev/mapper/vgtest-tvdc 895.00GiB > > > > Metadata,RAID1: Size:25.00GiB, Used:22.87GiB (91.47%) > > /dev/mapper/vgtest-tvdb 25.00GiB > > /dev/mapper/vgtest-tvdc 25.00GiB > > > > System,RAID1: Size:32.00MiB, Used:208.00KiB (0.63%) > > /dev/mapper/vgtest-tvdb 32.00MiB > > /dev/mapper/vgtest-tvdc 32.00MiB > > > > Unallocated: > > /dev/mapper/vgtest-tvdb 104.97GiB > > /dev/mapper/vgtest-tvdc 103.97GiB > > > >> The error looks like a repeated relocation tree creation, which would point to > >> the unsuccessful balances or inconsistent state (balance item, reloc trees). > >> It's not a "typical" mix of operations but I'd appreciate any insights here. > > > > I have the same line but different call stack, with misc-next > > e3027d10af42d24940be74dabaf1550cd770bd48: > > > > [ 9717.746937][T13609] BTRFS info (device dm-0): balance: start -mlimit=1 -slimit=1 > > [ 9717.765086][T13609] BTRFS info (device dm-0): relocating block group 10991411658752 flags metadata|raid1 > > [ 9718.511137][T13609] ------------[ cut here ]------------ > > [ 9718.512293][T13609] kernel BUG at fs/btrfs/relocation.c:794!
> > [ 9718.513421][T13609] invalid opcode: 0000 [#1] SMP KASAN PTI > > [ 9718.514590][T13609] CPU: 1 PID: 13609 Comm: btrfs Tainted: G W 5.8.0-6582a95aabfe+ #44 > > [ 9718.516178][T13609] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014 > > [ 9718.517750][T13609] RIP: 0010:create_reloc_root+0x468/0x480 > > [ 9718.518717][T13609] Code: e8 bd 5b bd ff 4d 8b 76 50 be 08 00 00 00 49 8d bc 24 f0 00 00 00 e8 c7 5b bd ff 4d 89 b4 24 f0 00 00 00 e9 ee fc ff ff 0f 0b <0f> 0b 0f 0b 0f 0b 0f 0b e8 9b df 07 01 66 66 2e 0f 1f 84 00 00 00 > > [ 9718.521995][T13609] RSP: 0018:ffffc900018e7018 EFLAGS: 00010282 > > [ 9718.522991][T13609] RAX: 00000000ffffffef RBX: ffff8881e103a400 RCX: 0000000000000000 > > [ 9718.524300][T13609] RDX: dffffc0000000000 RSI: 0000000000000000 RDI: 0000000000000246 > > [ 9718.525612][T13609] RBP: ffffc900018e7108 R08: 0000000000000000 R09: 0000000000000001 > > [ 9718.527056][T13609] R10: 0000000000000001 R11: fffffbfff3dfb081 R12: ffff8881f37c8020 > > [ 9718.528386][T13609] R13: ffff88801fbc5b28 R14: ffff8881f37c8000 R15: ffffc900018e70a0 > > [ 9718.529756][T13609] FS: 00007f9577d928c0(0000) GS:ffff8881f5800000(0000) knlGS:0000000000000000 > > [ 9718.531211][T13609] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > [ 9718.532295][T13609] CR2: 00007f9823e35500 CR3: 00000000a52e0002 CR4: 00000000001606e0 > > [ 9718.533608][T13609] Call Trace: > > [ 9718.534151][T13609] ? update_backref_node+0xf0/0xf0 > > [ 9718.535137][T13609] ? check_chain_key+0x1e6/0x2e0 > > [ 9718.536057][T13609] btrfs_init_reloc_root+0x2d7/0x310 > > That's the same problem. > > btrfs_init_reloc_root() got -EEXIST, which triggers the BUG_ON(). > > That means some reloc trees were not cleaned up. > > Would you mind providing the "btrfs ins dump-tree -t root" dump for > that fs if the problem still happens?
http://furryterror.org/~zblaxell/tmp/.tvdb/tvdb.txt

The problem is now happening multiple times per day, starting with kdave's misc-next

	e3027d10af42d24940be74dabaf1550cd770bd48
	Date: Thu Jul 23 00:18:04 2020 +0900

and continuing on v5.8.0. The previous misc-next (that I have test data for),

	cb799f0a0bb372f37f96893d2e80c1dc2f5206da
	Date: Thu Jul 16 13:29:46 2020 -0700

does not have this problem. These commit hashes are from https://gitlab.com/kdave/btrfs-devel.

> Thanks,
> Qu
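A rough way to check a root tree dump, such as the one requested in this thread, for reloc tree items is sketched below. It assumes that `btrfs inspect-internal dump-tree -t root` prints reloc root items with keys of the form `(TREE_RELOC ROOT_ITEM <subvol id>)`; that format, the helper name leftover_reloc_roots, and the sample text are assumptions for illustration, so verify them against a real dump. Reloc roots are expected while a balance is running or paused, so a hit only looks suspicious when no balance is in progress.

```python
import re

# Assumed shape of a reloc root key line in dump-tree output (not verified here).
RELOC_KEY = re.compile(r"key \(TREE_RELOC ROOT_ITEM (\d+)\)")


def leftover_reloc_roots(dump_text):
    """Return the sorted subvolume ids that have a reloc tree root item."""
    return sorted({int(m.group(1)) for m in RELOC_KEY.finditer(dump_text)})


if __name__ == "__main__":
    # Made-up sample lines in the assumed format, not taken from the linked dump.
    sample = (
        "\titem 3 key (FS_TREE ROOT_ITEM 0) itemoff 14000 itemsize 439\n"
        "\titem 9 key (TREE_RELOC ROOT_ITEM 257) itemoff 11000 itemsize 439\n"
        "\titem 10 key (TREE_RELOC ROOT_ITEM 5) itemoff 10500 itemsize 439\n"
    )
    print(leftover_reloc_roots(sample))
```

To use it on a saved dump, read the file and pass its contents, for example `leftover_reloc_roots(open("tvdb.txt").read())`.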
* Re: BUG at fs/btrfs/relocation.c:794! Still happening on misc-next and 5.8.3 2020-08-04 16:16 ` Zygo Blaxell @ 2020-08-28 0:03 ` Zygo Blaxell 2020-08-28 0:08 ` Zygo Blaxell 0 siblings, 1 reply; 13+ messages in thread From: Zygo Blaxell @ 2020-08-28 0:03 UTC (permalink / raw) To: Qu Wenruo; +Cc: David Sterba, linux-btrfs, wqu On Tue, Aug 04, 2020 at 12:16:26PM -0400, Zygo Blaxell wrote: > On Fri, Jul 24, 2020 at 08:19:36AM +0800, Qu Wenruo wrote: > > > > > > On 2020/7/24 上午5:56, Zygo Blaxell wrote: > > > On Wed, Jul 01, 2020 at 12:10:06AM +0200, David Sterba wrote: > > >> Hi, > > >> > > >> I've hit a crash in relocation I've never seen before. > > >> > > >> [ 2129.210066] kernel BUG at fs/btrfs/relocation.c:794! > > > > > > I hit an issue yesterday that reminded me of this. > > > > > >> [ 2129.215268] invalid opcode: 0000 [#1] PREEMPT SMP > > >> [ 2129.220114] CPU: 1 PID: 3303 Comm: btrfs Not tainted 5.8.0-rc3-git+ #638 > > >> [ 2129.220116] Hardware name: empty empty/S3993, BIOS PAQEX0-3 02/24/2008 > > >> [ 2129.220265] RIP: 0010:create_reloc_root+0x214/0x260 [btrfs] > > >> [ 2129.258760] RSP: 0018:ffffbe1e809b38b8 EFLAGS: 00010282 > > >> [ 2129.258763] RAX: 00000000ffffffef RBX: ffff988d577f9000 RCX: 0000000000000000 > > >> [ 2129.258765] RDX: 0000000000000001 RSI: ffffffff8e2a2580 RDI: ffff988d64aaa6a8 > > >> [ 2129.258766] RBP: ffff988d5dfcdc00 R08: 0000000000000000 R09: 0000000000000000 > > >> [ 2129.258767] R10: 0000000000000001 R11: 0000000000000000 R12: ffff988d0e02fa78 > > >> [ 2129.258769] R13: 0000000000000005 R14: ffff988d64fe8000 R15: ffff988d0e02fa78 > > >> [ 2129.258771] FS: 00007f82a612e8c0(0000) GS:ffff988d67000000(0000) knlGS:0000000000000000 > > >> [ 2129.258772] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > >> [ 2129.258774] CR2: 000000000559d028 CR3: 000000020b289000 CR4: 00000000000006e0 > > >> [ 2129.258775] Call Trace: > > >> [ 2129.258825] btrfs_init_reloc_root+0xe8/0x120 [btrfs] > > >> [ 2129.258862] record_root_in_trans+0xae/0xd0 
[btrfs] > > >> [ 2129.258901] btrfs_record_root_in_trans+0x51/0x70 [btrfs] > > >> [ 2129.340388] select_reloc_root+0x94/0x340 [btrfs] > > >> [ 2129.340433] do_relocation+0xda/0x7b0 [btrfs] > > >> [ 2129.349854] ? _raw_spin_unlock+0x1f/0x40 > > >> [ 2129.349898] relocate_tree_blocks+0x336/0x670 [btrfs] > > >> [ 2129.359325] relocate_block_group+0x2f6/0x600 [btrfs] > > >> [ 2129.359365] btrfs_relocate_block_group+0x15e/0x340 [btrfs] > > >> [ 2129.359408] btrfs_relocate_chunk+0x38/0x110 [btrfs] > > >> [ 2129.375494] __btrfs_balance+0x42c/0xce0 [btrfs] > > >> [ 2129.375553] btrfs_balance+0x66a/0xbe0 [btrfs] > > >> [ 2129.375562] ? kmem_cache_alloc_trace+0x19c/0x330 > > >> [ 2129.389852] btrfs_ioctl_balance+0x298/0x350 [btrfs] > > >> [ 2129.389887] btrfs_ioctl+0x304/0x2490 [btrfs] > > >> [ 2129.389898] ? do_user_addr_fault+0x221/0x49c > > >> [ 2129.404070] ? sched_clock_cpu+0x15/0x140 > > >> [ 2129.404073] ? do_user_addr_fault+0x221/0x49c > > >> [ 2129.404079] ? up_read+0x18/0x240 > > >> [ 2129.404086] ? ksys_ioctl+0x68/0xa0 > > >> [ 2129.404091] ksys_ioctl+0x68/0xa0 > > >> [ 2129.423308] __x64_sys_ioctl+0x16/0x20 > > >> [ 2129.423312] do_syscall_64+0x50/0xe0 > > >> [ 2129.423315] entry_SYSCALL_64_after_hwframe+0x44/0xa9 > > >> [ 2129.423318] RIP: 0033:0x7f82a51c6327 > > >> [ 2129.423319] Code: Bad RIP value. 
> > >> [ 2129.423348] RSP: 002b:00007ffd32cf6218 EFLAGS: 00000206 ORIG_RAX: 0000000000000010 > > >> [ 2129.423367] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f82a51c6327 > > >> [ 2129.423368] RDX: 00007ffd32cf62a0 RSI: 00000000c4009420 RDI: 0000000000000003 > > >> [ 2129.423372] RBP: 0000000000000003 R08: 0000000000000000 R09: 0000000000000000 > > >> [ 2129.423377] R10: 000000000fa99fa0 R11: 0000000000000206 R12: 00007ffd32cf8823 > > >> [ 2129.423379] R13: 00007ffd32cf62a0 R14: 0000000000000001 R15: 0000000000000000 > > >> > > >> Relevant code called from create_reloc_root: > > >> > > >> ret = btrfs_insert_root(trans, fs_info->tree_root, > > >> &root_key, root_item); > > >> BUG_ON(ret) > > >> > > >> and according to EAX, ret is -17 which is EEXIST. > > >> > > >> I don't have a reproducer, the testing image has been filled by random git > > >> checkouts, deduplicated by BEES, then tons of snapshots created until the > > >> metadata got exhausted, some file deletion and balances. > > > > > > Mine is rsync, bees, lots of snapshots, balances, scrubs. I recently also > > > added random 'killall -INT btrfs' to send balance some fatal signals. > > > > > >> This is the same image that led to the patch "btrfs: allow use of global block > > >> reserve for balance item deletion", so this could have left it in some > > >> intermediate state where the balance item was not removed and the reloc tree as > > >> well. > > >> > > >> There were a few unsuccessful mounts due to relocation recovery, that was > > >> trying to debug but then it started to work. 
> > >> > > >> The error happened with this 'fi df' saved after the balance start: > > >> > > >> # btrfs fi df mnt > > >> Data, single: total=80.01GiB, used=38.67GiB > > >> System, single: total=4.00MiB, used=16.00KiB > > >> Metadata, single: total=19.99GiB, used=19.46GiB > > >> GlobalReserve, single: total=512.00MiB, used=44.00KiB > > > > > > Mine is: > > > > > > Data, single: total=1.75TiB, used=1.74TiB > > > System, RAID1: total=32.00MiB, used=208.00KiB > > > Metadata, RAID1: total=25.00GiB, used=22.89GiB > > > GlobalReserve, single: total=512.00MiB, used=0.00B > > > > > > though this is some time after the failure (and a reboot). I do notice > > > that there's lots of unallocated space, but metadata usage is close > > > to allocated, and I have been experiencing a lot of EROFS events when > > > that happens, even if there's gigabytes unallocated. > > > > > > btrfs fi us: > > > > > > Overall: > > > Device size: 2.00TiB > > > Device allocated: 1.80TiB > > > Device unallocated: 208.94GiB > > > Device missing: 0.00B > > > Used: 1.79TiB > > > Free (estimated): 211.30GiB (min: 106.83GiB) > > > Data ratio: 1.00 > > > Metadata ratio: 2.00 > > > Global reserve: 512.00MiB (used: 0.00B) > > > > > > Data,single: Size:1.75TiB, Used:1.74TiB (99.87%) > > > /dev/mapper/vgtest-tvdb 894.00GiB > > > /dev/mapper/vgtest-tvdc 895.00GiB > > > > > > Metadata,RAID1: Size:25.00GiB, Used:22.87GiB (91.47%) > > > /dev/mapper/vgtest-tvdb 25.00GiB > > > /dev/mapper/vgtest-tvdc 25.00GiB > > > > > > System,RAID1: Size:32.00MiB, Used:208.00KiB (0.63%) > > > /dev/mapper/vgtest-tvdb 32.00MiB > > > /dev/mapper/vgtest-tvdc 32.00MiB > > > > > > Unallocated: > > > /dev/mapper/vgtest-tvdb 104.97GiB > > > /dev/mapper/vgtest-tvdc 103.97GiB > > > > > >> The error looks like a repeated relocation tree creation, which would point to > > >> the unsuccesful balances or inconsistent state (balance item, reloc trees). > > >> It's not a "typical" mix of operations but I'd appreciate any insights here. 
> > > > > > I have the same line but different call stack, with misc-next > > > e3027d10af42d24940be74dabaf1550cd770bd48: > > > > > > [ 9717.746937][T13609] BTRFS info (device dm-0): balance: start -mlimit=1 -slimit=1 > > > [ 9717.765086][T13609] BTRFS info (device dm-0): relocating block group 10991411658752 flags metadata|raid1 > > > [ 9718.511137][T13609] ------------[ cut here ]------------ > > > [ 9718.512293][T13609] kernel BUG at fs/btrfs/relocation.c:794! > > > [ 9718.513421][T13609] invalid opcode: 0000 [#1] SMP KASAN PTI > > > [ 9718.514590][T13609] CPU: 1 PID: 13609 Comm: btrfs Tainted: G W 5.8.0-6582a95aabfe+ #44 > > > [ 9718.516178][T13609] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014 > > > [ 9718.517750][T13609] RIP: 0010:create_reloc_root+0x468/0x480 > > > [ 9718.518717][T13609] Code: e8 bd 5b bd ff 4d 8b 76 50 be 08 00 00 00 49 8d bc 24 f0 00 00 00 e8 c7 5b bd ff 4d 89 b4 24 f0 00 00 00 e9 ee fc ff ff 0f 0b <0f> 0b 0f 0b 0f 0b 0f 0b > > > e8 9b df 07 01 66 66 2e 0f 1f 84 00 00 00 > > > [ 9718.521995][T13609] RSP: 0018:ffffc900018e7018 EFLAGS: 00010282 > > > [ 9718.522991][T13609] RAX: 00000000ffffffef RBX: ffff8881e103a400 RCX: 0000000000000000 > > > [ 9718.524300][T13609] RDX: dffffc0000000000 RSI: 0000000000000000 RDI: 0000000000000246 > > > [ 9718.525612][T13609] RBP: ffffc900018e7108 R08: 0000000000000000 R09: 0000000000000001 > > > [ 9718.527056][T13609] R10: 0000000000000001 R11: fffffbfff3dfb081 R12: ffff8881f37c8020 > > > [ 9718.528386][T13609] R13: ffff88801fbc5b28 R14: ffff8881f37c8000 R15: ffffc900018e70a0 > > > [ 9718.529756][T13609] FS: 00007f9577d928c0(0000) GS:ffff8881f5800000(0000) knlGS:0000000000000000 > > > [ 9718.531211][T13609] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > > [ 9718.532295][T13609] CR2: 00007f9823e35500 CR3: 00000000a52e0002 CR4: 00000000001606e0 > > > [ 9718.533608][T13609] Call Trace: > > > [ 9718.534151][T13609] ? 
update_backref_node+0xf0/0xf0 > > > [ 9718.535137][T13609] ? check_chain_key+0x1e6/0x2e0 > > > [ 9718.536057][T13609] btrfs_init_reloc_root+0x2d7/0x310 > > > > That's the same problem. > > > > Btrfs_init_reloc_root() got -EEXIST and triggering BUG_ON(). > > > > In that case, that means there are some reloc trees not cleaned up. > > > > Would you mind to provide the "btrfs ins dump-tree -t root" dump for > > that fs if the problem still happens? > > http://furryterror.org/~zblaxell/tmp/.tvdb/tvdb.txt > > The problem is now happening multiple times per day, starting with > kdave's misc-next e3027d10af42d24940be74dabaf1550cd770bd48 Date: Thu > Jul 23 00:18:04 2020 +0900 and continuing on v5.8.0. > > The previous misc-next (that I have test data for), > cb799f0a0bb372f37f96893d2e80c1dc2f5206da Date: Thu Jul 16 13:29:46 2020 > -0700 does not have this problem. > > These commit hashes are from https://gitlab.com/kdave/btrfs-devel. Still hitting this bug every few hours on all 5.8.x so far, and misc-next. There is a strong correlation between hitting the bug and starting a metadata block group in balance, and a weaker correlation with data balances: Aug 23 05:04:05 regress kernel: [53458.128928][ T9737] BTRFS info (device dm-0): relocating block group 14939862335488 flags metadata|raid1 Aug 23 05:04:05 regress kernel: [53458.999342][ T9737] ------------[ cut here ]------------ Aug 23 05:04:05 regress kernel: [53459.000275][ T9737] kernel BUG at fs/btrfs/relocation.c:794! Aug 24 01:23:52 regress kernel: [58662.545914][T17474] BTRFS info (device dm-0): relocating block group 15083978620928 flags metadata|raid1 Aug 24 01:23:54 regress kernel: [58664.778274][T17474] ------------[ cut here ]------------ Aug 24 01:23:54 regress kernel: [58664.782182][T17474] kernel BUG at fs/btrfs/relocation.c:794!
Aug 24 07:17:07 regress kernel: [21068.421134][T29457] BTRFS info (device dm-0): relocating block group 15160784715776 flags metadata|raid1 Aug 24 07:17:08 regress kernel: [21069.307661][ T5176] ------------[ cut here ]------------ Aug 24 07:17:08 regress kernel: [21069.309195][ T5176] kernel BUG at fs/btrfs/relocation.c:794! Aug 25 18:58:26 regress kernel: [22013.457555][ T2164] BTRFS info (device dm-0): relocating block group 15530051239936 flags metadata|raid1 Aug 25 18:58:27 regress kernel: [22014.460689][ T4939] ------------[ cut here ]------------ Aug 25 18:58:27 regress kernel: [22014.461653][ T4939] kernel BUG at fs/btrfs/relocation.c:794! Aug 26 03:39:20 regress kernel: [31172.016638][T30882] BTRFS info (device dm-0): relocating block group 15576759009280 flags metadata|raid1 Aug 26 03:39:21 regress kernel: [31173.329719][T12663] ------------[ cut here ]------------ Aug 26 03:39:21 regress kernel: [31173.330682][T12663] kernel BUG at fs/btrfs/relocation.c:794! Aug 26 16:00:02 regress kernel: [44334.231395][T25917] BTRFS info (device dm-0): relocating block group 15631888941056 flags data Aug 26 16:00:04 regress kernel: [44336.800710][T26519] ------------[ cut here ]------------ Aug 26 16:00:04 regress kernel: [44336.802888][T26519] kernel BUG at fs/btrfs/relocation.c:794! Aug 27 15:45:29 regress kernel: [55423.626717][ T5878] BTRFS info (device dm-0): relocating block group 15820229967872 flags metadata|raid1 Aug 27 15:45:29 regress kernel: [55423.798584][T15744] ------------[ cut here ]------------ Aug 27 15:45:29 regress kernel: [55423.802581][T15744] kernel BUG at fs/btrfs/relocation.c:794! Aug 27 17:35:26 regress kernel: [ 6459.129124][T21053] BTRFS info (device dm-0): relocating block group 15831168712704 flags metadata|raid1 Aug 27 17:35:43 regress kernel: [ 6475.931029][T25720] ------------[ cut here ]------------ Aug 27 17:35:43 regress kernel: [ 6475.932403][T25720] kernel BUG at fs/btrfs/relocation.c:794! 
There don't seem to be any instances of the BUG that did not occur within 30 seconds of starting a balance. The on-disk data is fine. After a reboot the same block group can be successfully balanced. > > > Thanks, > > Qu > > > [ 9718.537016][T13609] ? find_reloc_root+0x200/0x200 > > > [ 9718.537992][T13609] ? do_raw_spin_unlock+0xa8/0x140 > > > [ 9718.538899][T13609] record_root_in_trans+0x18c/0x1d0 > > > [ 9718.539848][T13609] btrfs_record_root_in_trans+0x8b/0xc0 > > > [ 9718.540843][T13609] select_reloc_root+0x15f/0x6a0 > > > [ 9718.541943][T13609] ? create_reloc_inode.isra.28+0x410/0x410 > > > [ 9718.543066][T13609] ? rcu_read_lock_sched_held+0xa1/0xd0 > > > [ 9718.544333][T13609] ? check_flags.part.44+0x86/0x220 > > > [ 9718.545186][T13609] ? check_flags+0x26/0x30 > > > [ 9718.545870][T13609] ? lock_is_held_type+0xc9/0x100 > > > [ 9718.546651][T13609] do_relocation+0x242/0xc90 > > > [ 9718.547372][T13609] ? select_reloc_root+0x6a0/0x6a0 > > > [ 9718.548160][T13609] ? check_flags.part.44+0x86/0x220 > > > [ 9718.548969][T13609] ? __kasan_check_read+0x11/0x20 > > > [ 9718.549745][T13609] ? mark_lock+0xa8/0x440 > > > [ 9718.550426][T13609] ? mark_held_locks+0x8d/0xb0 > > > [ 9718.551165][T13609] ? btrfs_backref_cleanup_node+0x5c1/0x600 > > > [ 9718.552079][T13609] ? memcpy+0x4d/0x60 > > > [ 9718.552694][T13609] ? read_extent_buffer+0xcc/0x120 > > > [ 9718.553478][T13609] relocate_tree_blocks+0xa29/0xb00 > > > [ 9718.554255][T13609] ? do_relocation+0xc90/0xc90 > > > [ 9718.554978][T13609] ? kmem_cache_alloc_trace+0x5af/0x740 > > > [ 9718.555855][T13609] ? free_extent_buffer.part.46+0x90/0x140 > > > [ 9718.556756][T13609] ? rb_insert_color+0x342/0x360 > > > [ 9718.557581][T13609] ? free_extent_buffer+0x13/0x20 > > > [ 9718.558445][T13609] ? add_tree_block.isra.34+0x236/0x2b0 > > > [ 9718.559387][T13609] relocate_block_group+0x52e/0x830 > > > [ 9718.560275][T13609] ? 
merge_reloc_roots+0x4b0/0x4b0 > > > [ 9718.561137][T13609] btrfs_relocate_block_group+0x26e/0x4c0 > > > [ 9718.562137][T13609] btrfs_relocate_chunk+0x52/0x120 > > > [ 9718.562918][T13609] btrfs_balance+0xe22/0x1910 > > > [ 9718.563605][T13609] ? check_chain_key+0x1e6/0x2e0 > > > [ 9718.564331][T13609] ? btrfs_relocate_chunk+0x120/0x120 > > > [ 9718.565126][T13609] ? kmem_cache_alloc_trace+0x5af/0x740 > > > [ 9718.565943][T13609] ? _copy_from_user+0x95/0xd0 > > > [ 9718.566649][T13609] btrfs_ioctl_balance+0x3de/0x4c0 > > > [ 9718.567414][T13609] btrfs_ioctl+0x2385/0x4250 > > > [ 9718.568090][T13609] ? __kasan_check_read+0x11/0x20 > > > [ 9718.568830][T13609] ? check_chain_key+0x1e6/0x2e0 > > > [ 9718.569619][T13609] ? btrfs_ioctl_get_supported_features+0x30/0x30 > > > [ 9718.570658][T13609] ? kvm_sched_clock_read+0x18/0x30 > > > [ 9718.571526][T13609] ? check_chain_key+0x1e6/0x2e0 > > > [ 9718.572348][T13609] ? lock_downgrade+0x3e0/0x3e0 > > > [ 9718.573121][T13609] ? do_vfs_ioctl+0xfc/0x9d0 > > > [ 9718.573835][T13609] ? ioctl_file_clone+0xe0/0xe0 > > > [ 9718.574637][T13609] ? check_flags.part.44+0x86/0x220 > > > [ 9718.575472][T13609] ? check_flags+0x26/0x30 > > > [ 9718.576190][T13609] ? lock_is_held_type+0xc9/0x100 > > > [ 9718.576990][T13609] ? check_flags.part.44+0x86/0x220 > > > [ 9718.577836][T13609] ? check_flags+0x26/0x30 > > > [ 9718.578542][T13609] ? lock_is_held_type+0xc9/0x100 > > > [ 9718.579403][T13609] ? __kasan_check_read+0x11/0x20 > > > [ 9718.580225][T13609] ? __fget_light+0xae/0x110 > > > [ 9718.580983][T13609] ksys_ioctl+0xa1/0xe0 > > > [ 9718.581628][T13609] __x64_sys_ioctl+0x43/0x50 > > > [ 9718.582334][T13609] do_syscall_64+0x60/0xf0 > > > [ 9718.583285][T13609] entry_SYSCALL_64_after_hwframe+0x44/0xa9 > > > [ 9718.584378][T13609] RIP: 0033:0x7f9577e85427 > > > [ 9718.585289][T13609] Code: Bad RIP value. 
> > > [ 9718.586076][T13609] RSP: 002b:00007ffdc7b82548 EFLAGS: 00000206 ORIG_RAX: 0000000000000010 > > > [ 9718.587896][T13609] RAX: ffffffffffffffda RBX: 00007ffdc7b825e8 RCX: 00007f9577e85427 > > > [ 9718.589391][T13609] RDX: 00007ffdc7b825e8 RSI: 00000000c4009420 RDI: 0000000000000003 > > > [ 9718.590817][T13609] RBP: 0000000000000003 R08: 0000000000000003 R09: 0000000000000078 > > > [ 9718.592631][T13609] R10: fffffffffffff31c R11: 0000000000000206 R12: 0000000000000001 > > > [ 9718.594405][T13609] R13: 0000000000000000 R14: 00007ffdc7b84a48 R15: 0000000000000001 > > > [ 9718.596109][T13609] Modules linked in: > > > [ 9718.597056][T13609] ---[ end trace 2cf173f8217fc093 ]--- > > > [ 9718.598018][T13609] RIP: 0010:create_reloc_root+0x468/0x480 > > > [ 9718.602850][T13609] Code: e8 bd 5b bd ff 4d 8b 76 50 be 08 00 00 00 49 8d bc 24 f0 00 00 00 e8 c7 5b bd ff 4d 89 b4 24 f0 00 00 00 e9 ee fc ff ff 0f 0b <0f> 0b 0f 0b 0f 0b 0f 0b e8 9b df 07 01 66 66 2e 0f 1f 84 00 00 00 > > > [ 9718.613371][T13609] RSP: 0018:ffffc900018e7018 EFLAGS: 00010282 > > > [ 9718.621286][T13609] RAX: 00000000ffffffef RBX: ffff8881e103a400 RCX: 0000000000000000 > > > [ 9718.631255][T13609] RDX: dffffc0000000000 RSI: 0000000000000000 RDI: 0000000000000246 > > > [ 9718.639764][T13609] RBP: ffffc900018e7108 R08: 0000000000000000 R09: 0000000000000001 > > > [ 9718.641533][T13609] R10: 0000000000000001 R11: fffffbfff3dfb081 R12: ffff8881f37c8020 > > > [ 9718.643173][T13609] R13: ffff88801fbc5b28 R14: ffff8881f37c8000 R15: ffffc900018e70a0 > > > [ 9718.644840][T13609] FS: 00007f9577d928c0(0000) GS:ffff8881f5800000(0000) knlGS:0000000000000000 > > > [ 9718.646728][T13609] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > > [ 9718.648607][T13609] CR2: 00007f9823e35500 CR3: 00000000a52e0002 CR4: 00000000001606e0 > > > [ 9718.869689][ T4545] ================================================================== > > > > > > same line, different call stack: > > > > > > 0xffffffff81933dd8 is in 
create_reloc_root (fs/btrfs/relocation.c:794). > > > 789 btrfs_tree_unlock(eb); > > > 790 free_extent_buffer(eb); > > > 791 > > > 792 ret = btrfs_insert_root(trans, fs_info->tree_root, > > > 793 &root_key, root_item); > > > 794 BUG_ON(ret); > > > 795 kfree(root_item); > > > 796 > > > 797 reloc_root = btrfs_read_tree_root(fs_info->tree_root, &root_key); > > > 798 BUG_ON(IS_ERR(reloc_root)); > > > > > > followed by > > > > > > [ 9718.869689][ T4545] ================================================================== > > > [ 9718.871333][ T4545] BUG: KASAN: use-after-free in __mutex_lock+0x202/0xce0 > > > [ 9718.872483][ T4545] Read of size 4 at addr ffff888014e9402c by task crawl_28443/4545 > > > [ 9718.873746][ T4545] > > > [ 9718.874106][ T4545] CPU: 1 PID: 4545 Comm: crawl_28443 Tainted: G D W 5.8.0-6582a95aabfe+ #44 > > > [ 9718.875684][ T4545] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014 > > > [ 9718.877149][ T4545] Call Trace: > > > [ 9718.877655][ T4545] dump_stack+0xc8/0x11a > > > [ 9718.878317][ T4545] ? __mutex_lock+0x202/0xce0 > > > [ 9718.879065][ T4545] print_address_description.constprop.8+0x1f/0x200 > > > [ 9718.880167][ T4545] ? __mutex_lock+0x202/0xce0 > > > [ 9718.880916][ T4545] ? __mutex_lock+0x202/0xce0 > > > [ 9718.881666][ T4545] kasan_report.cold.11+0x20/0x3e > > > [ 9718.882483][ T4545] ? __mutex_lock+0x202/0xce0 > > > [ 9718.883229][ T4545] __asan_load4+0x69/0x90 > > > [ 9718.883920][ T4545] __mutex_lock+0x202/0xce0 > > > [ 9718.884651][ T4545] ? wait_current_trans+0xb7/0x230 > > > [ 9718.885465][ T4545] ? btrfs_record_root_in_trans+0x7e/0xc0 > > > [ 9718.886388][ T4545] ? mutex_lock_io_nested+0xc20/0xc20 > > > [ 9718.887246][ T4545] ? __kasan_check_read+0x11/0x20 > > > [ 9718.888035][ T4545] ? join_transaction+0x32/0x6f0 > > > [ 9718.888854][ T4545] ? join_transaction+0x1a6/0x6f0 > > > [ 9718.889679][ T4545] ? lock_downgrade+0x3e0/0x3e0 > > > [ 9718.890496][ T4545] ? 
__kasan_check_write+0x14/0x20 > > > [ 9718.891308][ T4545] ? lock_contended+0x720/0x720 > > > [ 9718.892093][ T4545] ? do_raw_spin_lock+0x1e0/0x1e0 > > > [ 9718.892912][ T4545] ? wait_current_trans+0xb7/0x230 > > > [ 9718.893705][ T4545] mutex_lock_nested+0x1b/0x20 > > > [ 9718.894494][ T4545] ? mutex_lock_nested+0x1b/0x20 > > > [ 9718.895317][ T4545] btrfs_record_root_in_trans+0x7e/0xc0 > > > [ 9718.896245][ T4545] start_transaction+0x189/0x8f0 > > > [ 9718.897081][ T4545] btrfs_start_transaction+0x1e/0x20 > > > [ 9718.897941][ T4545] btrfs_cont_expand+0x549/0x7a0 > > > [ 9718.898805][ T4545] ? btrfs_truncate_block+0x930/0x930 > > > [ 9718.899665][ T4545] ? inode_newsize_ok+0x75/0xc0 > > > [ 9718.900438][ T4545] ? setattr_prepare+0x9c/0x310 > > > [ 9718.901242][ T4545] btrfs_setattr+0x514/0x850 > > > [ 9718.902035][ T4545] ? current_time+0x8c/0xe0 > > > [ 9718.902799][ T4545] notify_change+0x4ec/0x700 > > > [ 9718.903584][ T4545] ? do_sys_ftruncate+0x108/0x220 > > > [ 9718.904459][ T4545] do_truncate+0xe4/0x160 > > > [ 9718.905200][ T4545] ? __x64_sys_openat2+0x170/0x170 > > > [ 9718.906116][ T4545] ? __sb_start_write+0x1a1/0x270 > > > [ 9718.906954][ T4545] do_sys_ftruncate+0x1b8/0x220 > > > [ 9718.907759][ T4545] __x64_sys_ftruncate+0x36/0x40 > > > [ 9718.908577][ T4545] do_syscall_64+0x60/0xf0 > > > [ 9718.909292][ T4545] entry_SYSCALL_64_after_hwframe+0x44/0xa9 > > > [ 9718.910521][ T4545] RIP: 0033:0x7f201fcab947 > > > [ 9718.911247][ T4545] Code: Bad RIP value. 
> > > [ 9718.911915][ T4545] RSP: 002b:00007f201d3abeb8 EFLAGS: 00000202 ORIG_RAX: 000000000000004d > > > [ 9718.913285][ T4545] RAX: ffffffffffffffda RBX: 00007f201d3abfa0 RCX: 00007f201fcab947 > > > [ 9718.914613][ T4545] RDX: 000000005f18a6d2 RSI: 0000000000286000 RDI: 0000000000000ec1 > > > [ 9718.915921][ T4545] RBP: 00007f1fb01c2f00 R08: 00007ffe1e345080 R09: 00000000011b1f78 > > > [ 9718.917236][ T4545] R10: 00000000011b1f78 R11: 0000000000000202 R12: 00007f201d3abf20 > > > [ 9718.918556][ T4545] R13: 00007f201d3abef0 R14: 00007f201d3abf50 R15: 00007f201d3abed0 > > > [ 9718.919882][ T4545] > > > [ 9718.920268][ T4545] Allocated by task 6732: > > > [ 9718.920973][ T4545] save_stack+0x21/0x50 > > > [ 9718.921648][ T4545] __kasan_kmalloc.constprop.17+0xc1/0xd0 > > > [ 9718.922580][ T4545] kasan_slab_alloc+0x12/0x20 > > > [ 9718.923345][ T4545] kmem_cache_alloc_node+0x113/0x720 > > > [ 9718.924203][ T4545] copy_process+0x357/0x3680 > > > [ 9718.924955][ T4545] _do_fork+0xed/0x880 > > > [ 9718.925622][ T4545] __do_sys_clone+0xee/0x130 > > > [ 9718.926369][ T4545] __x64_sys_clone+0x67/0x80 > > > [ 9718.927119][ T4545] do_syscall_64+0x60/0xf0 > > > [ 9718.927848][ T4545] entry_SYSCALL_64_after_hwframe+0x44/0xa9 > > > [ 9718.928812][ T4545] > > > [ 9718.929173][ T4545] Freed by task 24: > > > [ 9718.929787][ T4545] save_stack+0x21/0x50 > > > [ 9718.930453][ T4545] __kasan_slab_free+0x118/0x170 > > > [ 9718.931242][ T4545] kasan_slab_free+0xe/0x10 > > > [ 9718.931970][ T4545] kmem_cache_free+0x5f/0x280 > > > [ 9718.932730][ T4545] free_task+0x73/0x90 > > > [ 9718.933391][ T4545] __put_task_struct+0x199/0x1d0 > > > [ 9718.934187][ T4545] delayed_put_task_struct+0x124/0x1b0 > > > [ 9718.935071][ T4545] rcu_core+0x3b0/0xeb0 > > > [ 9718.935758][ T4545] rcu_core_si+0xe/0x10 > > > [ 9718.936433][ T4545] __do_softirq+0x120/0x5e3 > > > [ 9718.937165][ T4545] > > > [ 9718.937545][ T4545] The buggy address belongs to the object at ffff888014e94000 > > > [ 9718.937545][ T4545] 
which belongs to the cache task_struct(168:screen-wrapper.service) of size 11072 > > > [ 9718.940391][ T4545] The buggy address is located 44 bytes inside of > > > [ 9718.940391][ T4545] 11072-byte region [ffff888014e94000, ffff888014e96b40) > > > [ 9718.942559][ T4545] The buggy address belongs to the page: > > > [ 9718.943454][ T4545] page:ffffea000053a500 refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff888014e97fff head:ffffea000053a500 order:2 compound_mapcount:0 compound_pincount:0 > > > [ 9718.946072][ T4545] flags: 0xfffe0000010200(slab|head) > > > [ 9718.946958][ T4545] raw: 00fffe0000010200 ffffea00011ab108 ffffea0001d6f108 ffff8881eabd9700 > > > [ 9718.948406][ T4545] raw: ffff888014e97fff ffff888014e94000 0000000100000001 0000000000000000 > > > [ 9718.949889][ T4545] page dumped because: kasan: bad access detected > > > [ 9718.950977][ T4545] > > > [ 9718.951354][ T4545] Memory state around the buggy address: > > > [ 9718.952296][ T4545] ffff888014e93f00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 > > > [ 9718.953641][ T4545] ffff888014e93f80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 > > > [ 9718.955004][ T4545] >ffff888014e94000: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb > > > [ 9718.956366][ T4545] ^ > > > [ 9718.957258][ T4545] ffff888014e94080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb > > > [ 9718.958653][ T4545] ffff888014e94100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb > > > [ 9718.960034][ T4545] ================================================================== > > > > > > > > ^ permalink raw reply [flat|nested] 13+ messages in thread
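The reports quoted above read the failing return value straight out of the register dump: RAX is 00000000ffffffef at the BUG_ON(ret), and the low 32 bits of that value, taken as a signed int, are -17, i.e. -EEXIST. A minimal sketch of that decoding step follows. It is an illustration only, not code from fs/btrfs/relocation.c; the helper names reg_to_int and reg_is_eexist are hypothetical, and it assumes the Linux errno numbering where EEXIST is 17.

```c
#include <errno.h>
#include <stdint.h>

/*
 * Hypothetical helper (not kernel code): turn a 64-bit register value from
 * an oops dump into the signed 32-bit int a C function returned in EAX.
 * Only the low 32 bits are significant for an `int` return value.
 */
static int reg_to_int(uint64_t rax)
{
	uint32_t low = (uint32_t)rax;

	if (low <= (uint32_t)INT32_MAX)
		return (int)low;
	/* Two's complement negative value, computed without signed overflow. */
	return -(int)(UINT32_MAX - low) - 1;
}

/* Hypothetical helper: nonzero if the register holds -EEXIST. */
static int reg_is_eexist(uint64_t rax)
{
	return reg_to_int(rax) == -EEXIST;
}
```

With the value from both oopses in this thread, reg_to_int(0x00000000ffffffef) evaluates to -17, which is consistent with btrfs_insert_root() having found an already existing root item for the reloc tree.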
* Re: BUG at fs/btrfs/relocation.c:794! Still happening on misc-next and 5.8.3 2020-08-28 0:03 ` BUG at fs/btrfs/relocation.c:794! Still happening on misc-next and 5.8.3 Zygo Blaxell @ 2020-08-28 0:08 ` Zygo Blaxell 2020-08-28 6:34 ` Nikolay Borisov 0 siblings, 1 reply; 13+ messages in thread From: Zygo Blaxell @ 2020-08-28 0:08 UTC (permalink / raw) To: Qu Wenruo; +Cc: David Sterba, linux-btrfs, wqu On Thu, Aug 27, 2020 at 08:03:13PM -0400, Zygo Blaxell wrote: > On Tue, Aug 04, 2020 at 12:16:26PM -0400, Zygo Blaxell wrote: > > On Fri, Jul 24, 2020 at 08:19:36AM +0800, Qu Wenruo wrote: > > > > > > > > > On 2020/7/24 上午5:56, Zygo Blaxell wrote: > > > > On Wed, Jul 01, 2020 at 12:10:06AM +0200, David Sterba wrote: > > > >> Hi, > > > >> > > > >> I've hit a crash in relocation I've never seen before. > > > >> > > > >> [ 2129.210066] kernel BUG at fs/btrfs/relocation.c:794! > > > > > > > > I hit an issue yesterday that reminded me of this. > > > > > > > >> [ 2129.215268] invalid opcode: 0000 [#1] PREEMPT SMP > > > >> [ 2129.220114] CPU: 1 PID: 3303 Comm: btrfs Not tainted 5.8.0-rc3-git+ #638 > > > >> [ 2129.220116] Hardware name: empty empty/S3993, BIOS PAQEX0-3 02/24/2008 > > > >> [ 2129.220265] RIP: 0010:create_reloc_root+0x214/0x260 [btrfs] > > > >> [ 2129.258760] RSP: 0018:ffffbe1e809b38b8 EFLAGS: 00010282 > > > >> [ 2129.258763] RAX: 00000000ffffffef RBX: ffff988d577f9000 RCX: 0000000000000000 > > > >> [ 2129.258765] RDX: 0000000000000001 RSI: ffffffff8e2a2580 RDI: ffff988d64aaa6a8 > > > >> [ 2129.258766] RBP: ffff988d5dfcdc00 R08: 0000000000000000 R09: 0000000000000000 > > > >> [ 2129.258767] R10: 0000000000000001 R11: 0000000000000000 R12: ffff988d0e02fa78 > > > >> [ 2129.258769] R13: 0000000000000005 R14: ffff988d64fe8000 R15: ffff988d0e02fa78 > > > >> [ 2129.258771] FS: 00007f82a612e8c0(0000) GS:ffff988d67000000(0000) knlGS:0000000000000000 > > > >> [ 2129.258772] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > > >> [ 2129.258774] CR2: 000000000559d028 
CR3: 000000020b289000 CR4: 00000000000006e0 > > > >> [ 2129.258775] Call Trace: > > > >> [ 2129.258825] btrfs_init_reloc_root+0xe8/0x120 [btrfs] > > > >> [ 2129.258862] record_root_in_trans+0xae/0xd0 [btrfs] > > > >> [ 2129.258901] btrfs_record_root_in_trans+0x51/0x70 [btrfs] > > > >> [ 2129.340388] select_reloc_root+0x94/0x340 [btrfs] > > > >> [ 2129.340433] do_relocation+0xda/0x7b0 [btrfs] > > > >> [ 2129.349854] ? _raw_spin_unlock+0x1f/0x40 > > > >> [ 2129.349898] relocate_tree_blocks+0x336/0x670 [btrfs] > > > >> [ 2129.359325] relocate_block_group+0x2f6/0x600 [btrfs] > > > >> [ 2129.359365] btrfs_relocate_block_group+0x15e/0x340 [btrfs] > > > >> [ 2129.359408] btrfs_relocate_chunk+0x38/0x110 [btrfs] > > > >> [ 2129.375494] __btrfs_balance+0x42c/0xce0 [btrfs] > > > >> [ 2129.375553] btrfs_balance+0x66a/0xbe0 [btrfs] > > > >> [ 2129.375562] ? kmem_cache_alloc_trace+0x19c/0x330 > > > >> [ 2129.389852] btrfs_ioctl_balance+0x298/0x350 [btrfs] > > > >> [ 2129.389887] btrfs_ioctl+0x304/0x2490 [btrfs] > > > >> [ 2129.389898] ? do_user_addr_fault+0x221/0x49c > > > >> [ 2129.404070] ? sched_clock_cpu+0x15/0x140 > > > >> [ 2129.404073] ? do_user_addr_fault+0x221/0x49c > > > >> [ 2129.404079] ? up_read+0x18/0x240 > > > >> [ 2129.404086] ? ksys_ioctl+0x68/0xa0 > > > >> [ 2129.404091] ksys_ioctl+0x68/0xa0 > > > >> [ 2129.423308] __x64_sys_ioctl+0x16/0x20 > > > >> [ 2129.423312] do_syscall_64+0x50/0xe0 > > > >> [ 2129.423315] entry_SYSCALL_64_after_hwframe+0x44/0xa9 > > > >> [ 2129.423318] RIP: 0033:0x7f82a51c6327 > > > >> [ 2129.423319] Code: Bad RIP value. 
> > > >> [ 2129.423348] RSP: 002b:00007ffd32cf6218 EFLAGS: 00000206 ORIG_RAX: 0000000000000010 > > > >> [ 2129.423367] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f82a51c6327 > > > >> [ 2129.423368] RDX: 00007ffd32cf62a0 RSI: 00000000c4009420 RDI: 0000000000000003 > > > >> [ 2129.423372] RBP: 0000000000000003 R08: 0000000000000000 R09: 0000000000000000 > > > >> [ 2129.423377] R10: 000000000fa99fa0 R11: 0000000000000206 R12: 00007ffd32cf8823 > > > >> [ 2129.423379] R13: 00007ffd32cf62a0 R14: 0000000000000001 R15: 0000000000000000 > > > >> > > > >> Relevant code called from create_reloc_root: > > > >> > > > >> ret = btrfs_insert_root(trans, fs_info->tree_root, > > > >> &root_key, root_item); > > > >> BUG_ON(ret) > > > >> > > > >> and according to EAX, ret is -17 which is EEXIST. > > > >> > > > >> I don't have a reproducer, the testing image has been filled by random git > > > >> checkouts, deduplicated by BEES, then tons of snapshots created until the > > > >> metadata got exhausted, some file deletion and balances. > > > > > > > > Mine is rsync, bees, lots of snapshots, balances, scrubs. I recently also > > > > added random 'killall -INT btrfs' to send balance some fatal signals. > > > > > > > >> This is the same image that led to the patch "btrfs: allow use of global block > > > >> reserve for balance item deletion", so this could have left it in some > > > >> intermediate state where the balance item was not removed and the reloc tree as > > > >> well. > > > >> > > > >> There were a few unsuccessful mounts due to relocation recovery, that was > > > >> trying to debug but then it started to work. 
> > > >> > > > >> The error happened with this 'fi df' saved after the balance start: > > > >> > > > >> # btrfs fi df mnt > > > >> Data, single: total=80.01GiB, used=38.67GiB > > > >> System, single: total=4.00MiB, used=16.00KiB > > > >> Metadata, single: total=19.99GiB, used=19.46GiB > > > >> GlobalReserve, single: total=512.00MiB, used=44.00KiB > > > > > > > > Mine is: > > > > > > > > Data, single: total=1.75TiB, used=1.74TiB > > > > System, RAID1: total=32.00MiB, used=208.00KiB > > > > Metadata, RAID1: total=25.00GiB, used=22.89GiB > > > > GlobalReserve, single: total=512.00MiB, used=0.00B > > > > > > > > though this is some time after the failure (and a reboot). I do notice > > > > that there's lots of unallocated space, but metadata usage is close > > > > to allocated, and I have been experiencing a lot of EROFS events when > > > > that happens, even if there's gigabytes unallocated. > > > > > > > > btrfs fi us: > > > > > > > > Overall: > > > > Device size: 2.00TiB > > > > Device allocated: 1.80TiB > > > > Device unallocated: 208.94GiB > > > > Device missing: 0.00B > > > > Used: 1.79TiB > > > > Free (estimated): 211.30GiB (min: 106.83GiB) > > > > Data ratio: 1.00 > > > > Metadata ratio: 2.00 > > > > Global reserve: 512.00MiB (used: 0.00B) > > > > > > > > Data,single: Size:1.75TiB, Used:1.74TiB (99.87%) > > > > /dev/mapper/vgtest-tvdb 894.00GiB > > > > /dev/mapper/vgtest-tvdc 895.00GiB > > > > > > > > Metadata,RAID1: Size:25.00GiB, Used:22.87GiB (91.47%) > > > > /dev/mapper/vgtest-tvdb 25.00GiB > > > > /dev/mapper/vgtest-tvdc 25.00GiB > > > > > > > > System,RAID1: Size:32.00MiB, Used:208.00KiB (0.63%) > > > > /dev/mapper/vgtest-tvdb 32.00MiB > > > > /dev/mapper/vgtest-tvdc 32.00MiB > > > > > > > > Unallocated: > > > > /dev/mapper/vgtest-tvdb 104.97GiB > > > > /dev/mapper/vgtest-tvdc 103.97GiB > > > > > > > >> The error looks like a repeated relocation tree creation, which would point to > > > >> the unsuccesful balances or inconsistent state (balance item, 
reloc trees). > > > >> It's not a "typical" mix of operations but I'd appreciate any insights here. > > > > > > > > I have the same line but different call stack, with misc-next > > > > e3027d10af42d24940be74dabaf1550cd770bd48: > > > > > > > > [ 9717.746937][T13609] BTRFS info (device dm-0): balance: start -mlimit=1 -slimit=1 > > > > [ 9717.765086][T13609] BTRFS info (device dm-0): relocating block group 10991411658752 flags metadata|raid1 > > > > [ 9718.511137][T13609] ------------[ cut here ]------------ > > > > [ 9718.512293][T13609] kernel BUG at fs/btrfs/relocation.c:794! > > > > [ 9718.513421][T13609] invalid opcode: 0000 [#1] SMP KASAN PTI > > > > [ 9718.514590][T13609] CPU: 1 PID: 13609 Comm: btrfs Tainted: G W 5.8.0-6582a95aabfe+ #44 > > > > [ 9718.516178][T13609] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014 > > > > [ 9718.517750][T13609] RIP: 0010:create_reloc_root+0x468/0x480 > > > > [ 9718.518717][T13609] Code: e8 bd 5b bd ff 4d 8b 76 50 be 08 00 00 00 49 8d bc 24 f0 00 00 00 e8 c7 5b bd ff 4d 89 b4 24 f0 00 00 00 e9 ee fc ff ff 0f 0b <0f> 0b 0f 0b 0f 0b 0f 0b > > > > e8 9b df 07 01 66 66 2e 0f 1f 84 00 00 00 > > > > [ 9718.521995][T13609] RSP: 0018:ffffc900018e7018 EFLAGS: 00010282 > > > > [ 9718.522991][T13609] RAX: 00000000ffffffef RBX: ffff8881e103a400 RCX: 0000000000000000 > > > > [ 9718.524300][T13609] RDX: dffffc0000000000 RSI: 0000000000000000 RDI: 0000000000000246 > > > > [ 9718.525612][T13609] RBP: ffffc900018e7108 R08: 0000000000000000 R09: 0000000000000001 > > > > [ 9718.527056][T13609] R10: 0000000000000001 R11: fffffbfff3dfb081 R12: ffff8881f37c8020 > > > > [ 9718.528386][T13609] R13: ffff88801fbc5b28 R14: ffff8881f37c8000 R15: ffffc900018e70a0 > > > > [ 9718.529756][T13609] FS: 00007f9577d928c0(0000) GS:ffff8881f5800000(0000) knlGS:0000000000000000 > > > > [ 9718.531211][T13609] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > > > [ 9718.532295][T13609] CR2: 00007f9823e35500 CR3: 00000000a52e0002 
CR4: 00000000001606e0 > > > > [ 9718.533608][T13609] Call Trace: > > > > [ 9718.534151][T13609] ? update_backref_node+0xf0/0xf0 > > > > [ 9718.535137][T13609] ? check_chain_key+0x1e6/0x2e0 > > > > [ 9718.536057][T13609] btrfs_init_reloc_root+0x2d7/0x310 > > > > > > That's the same problem. > > > > > > btrfs_init_reloc_root() got -EEXIST, triggering the BUG_ON(). > > > > > > In that case, that means there are some reloc trees not cleaned up. > > > > > > Would you mind providing the "btrfs ins dump-tree -t root" dump for > > > that fs if the problem still happens? > > > > http://furryterror.org/~zblaxell/tmp/.tvdb/tvdb.txt > > > > The problem is now happening multiple times per day, starting with > > kdave's misc-next e3027d10af42d24940be74dabaf1550cd770bd48 Date: Thu > > Jul 23 00:18:04 2020 +0900 and continuing on v5.8.0. > > > > The previous misc-next (that I have test data for), > > cb799f0a0bb372f37f96893d2e80c1dc2f5206da Date: Thu Jul 16 13:29:46 2020 > > -0700 does not have this problem. > > > > These commit hashes are from https://gitlab.com/kdave/btrfs-devel. > > Still hitting this bug every few hours on all 5.8.x so far, and misc-next. > > There is a strong correlation between hitting the bug and starting a metadata > block group in balance, and a weaker correlation with data balances: > > Aug 23 05:04:05 regress kernel: [53458.128928][ T9737] BTRFS info (device dm-0): relocating block group 14939862335488 flags metadata|raid1 > Aug 23 05:04:05 regress kernel: [53458.999342][ T9737] ------------[ cut here ]------------ > Aug 23 05:04:05 regress kernel: [53459.000275][ T9737] kernel BUG at fs/btrfs/relocation.c:794! > > Aug 24 01:23:52 regress kernel: [58662.545914][T17474] BTRFS info (device dm-0): relocating block group 15083978620928 flags metadata|raid1 > Aug 24 01:23:54 regress kernel: [58664.778274][T17474] ------------[ cut here ]------------ > Aug 24 01:23:54 regress kernel: [58664.782182][T17474] kernel BUG at fs/btrfs/relocation.c:794!
> > Aug 24 07:17:07 regress kernel: [21068.421134][T29457] BTRFS info (device dm-0): relocating block group 15160784715776 flags metadata|raid1 > Aug 24 07:17:08 regress kernel: [21069.307661][ T5176] ------------[ cut here ]------------ > Aug 24 07:17:08 regress kernel: [21069.309195][ T5176] kernel BUG at fs/btrfs/relocation.c:794! > > Aug 25 18:58:26 regress kernel: [22013.457555][ T2164] BTRFS info (device dm-0): relocating block group 15530051239936 flags metadata|raid1 > Aug 25 18:58:27 regress kernel: [22014.460689][ T4939] ------------[ cut here ]------------ > Aug 25 18:58:27 regress kernel: [22014.461653][ T4939] kernel BUG at fs/btrfs/relocation.c:794! > > Aug 26 03:39:20 regress kernel: [31172.016638][T30882] BTRFS info (device dm-0): relocating block group 15576759009280 flags metadata|raid1 > Aug 26 03:39:21 regress kernel: [31173.329719][T12663] ------------[ cut here ]------------ > Aug 26 03:39:21 regress kernel: [31173.330682][T12663] kernel BUG at fs/btrfs/relocation.c:794! > > Aug 26 16:00:02 regress kernel: [44334.231395][T25917] BTRFS info (device dm-0): relocating block group 15631888941056 flags data > Aug 26 16:00:04 regress kernel: [44336.800710][T26519] ------------[ cut here ]------------ > Aug 26 16:00:04 regress kernel: [44336.802888][T26519] kernel BUG at fs/btrfs/relocation.c:794! > > Aug 27 15:45:29 regress kernel: [55423.626717][ T5878] BTRFS info (device dm-0): relocating block group 15820229967872 flags metadata|raid1 > Aug 27 15:45:29 regress kernel: [55423.798584][T15744] ------------[ cut here ]------------ > Aug 27 15:45:29 regress kernel: [55423.802581][T15744] kernel BUG at fs/btrfs/relocation.c:794! 
> > Aug 27 17:35:26 regress kernel: [ 6459.129124][T21053] BTRFS info (device dm-0): relocating block group 15831168712704 flags metadata|raid1 > Aug 27 17:35:43 regress kernel: [ 6475.931029][T25720] ------------[ cut here ]------------ > Aug 27 17:35:43 regress kernel: [ 6475.932403][T25720] kernel BUG at fs/btrfs/relocation.c:794! > > There don't seem to be any instances of the BUG that did not occur > within 30 seconds of starting a balance. > > The on-disk data is fine. After a reboot the same block group can be > successfully balanced. Forgot to mention the failure rate: 8 crashes (listed above) among 1492 block groups balanced over the same 4-day period. > > > > > Thanks, > > > Qu > > > > [ 9718.537016][T13609] ? find_reloc_root+0x200/0x200 > > > > [ 9718.537992][T13609] ? do_raw_spin_unlock+0xa8/0x140 > > > > [ 9718.538899][T13609] record_root_in_trans+0x18c/0x1d0 > > > > [ 9718.539848][T13609] btrfs_record_root_in_trans+0x8b/0xc0 > > > > [ 9718.540843][T13609] select_reloc_root+0x15f/0x6a0 > > > > [ 9718.541943][T13609] ? create_reloc_inode.isra.28+0x410/0x410 > > > > [ 9718.543066][T13609] ? rcu_read_lock_sched_held+0xa1/0xd0 > > > > [ 9718.544333][T13609] ? check_flags.part.44+0x86/0x220 > > > > [ 9718.545186][T13609] ? check_flags+0x26/0x30 > > > > [ 9718.545870][T13609] ? lock_is_held_type+0xc9/0x100 > > > > [ 9718.546651][T13609] do_relocation+0x242/0xc90 > > > > [ 9718.547372][T13609] ? select_reloc_root+0x6a0/0x6a0 > > > > [ 9718.548160][T13609] ? check_flags.part.44+0x86/0x220 > > > > [ 9718.548969][T13609] ? __kasan_check_read+0x11/0x20 > > > > [ 9718.549745][T13609] ? mark_lock+0xa8/0x440 > > > > [ 9718.550426][T13609] ? mark_held_locks+0x8d/0xb0 > > > > [ 9718.551165][T13609] ? btrfs_backref_cleanup_node+0x5c1/0x600 > > > > [ 9718.552079][T13609] ? memcpy+0x4d/0x60 > > > > [ 9718.552694][T13609] ? read_extent_buffer+0xcc/0x120 > > > > [ 9718.553478][T13609] relocate_tree_blocks+0xa29/0xb00 > > > > [ 9718.554255][T13609] ? 
do_relocation+0xc90/0xc90 > > > > [ 9718.554978][T13609] ? kmem_cache_alloc_trace+0x5af/0x740 > > > > [ 9718.555855][T13609] ? free_extent_buffer.part.46+0x90/0x140 > > > > [ 9718.556756][T13609] ? rb_insert_color+0x342/0x360 > > > > [ 9718.557581][T13609] ? free_extent_buffer+0x13/0x20 > > > > [ 9718.558445][T13609] ? add_tree_block.isra.34+0x236/0x2b0 > > > > [ 9718.559387][T13609] relocate_block_group+0x52e/0x830 > > > > [ 9718.560275][T13609] ? merge_reloc_roots+0x4b0/0x4b0 > > > > [ 9718.561137][T13609] btrfs_relocate_block_group+0x26e/0x4c0 > > > > [ 9718.562137][T13609] btrfs_relocate_chunk+0x52/0x120 > > > > [ 9718.562918][T13609] btrfs_balance+0xe22/0x1910 > > > > [ 9718.563605][T13609] ? check_chain_key+0x1e6/0x2e0 > > > > [ 9718.564331][T13609] ? btrfs_relocate_chunk+0x120/0x120 > > > > [ 9718.565126][T13609] ? kmem_cache_alloc_trace+0x5af/0x740 > > > > [ 9718.565943][T13609] ? _copy_from_user+0x95/0xd0 > > > > [ 9718.566649][T13609] btrfs_ioctl_balance+0x3de/0x4c0 > > > > [ 9718.567414][T13609] btrfs_ioctl+0x2385/0x4250 > > > > [ 9718.568090][T13609] ? __kasan_check_read+0x11/0x20 > > > > [ 9718.568830][T13609] ? check_chain_key+0x1e6/0x2e0 > > > > [ 9718.569619][T13609] ? btrfs_ioctl_get_supported_features+0x30/0x30 > > > > [ 9718.570658][T13609] ? kvm_sched_clock_read+0x18/0x30 > > > > [ 9718.571526][T13609] ? check_chain_key+0x1e6/0x2e0 > > > > [ 9718.572348][T13609] ? lock_downgrade+0x3e0/0x3e0 > > > > [ 9718.573121][T13609] ? do_vfs_ioctl+0xfc/0x9d0 > > > > [ 9718.573835][T13609] ? ioctl_file_clone+0xe0/0xe0 > > > > [ 9718.574637][T13609] ? check_flags.part.44+0x86/0x220 > > > > [ 9718.575472][T13609] ? check_flags+0x26/0x30 > > > > [ 9718.576190][T13609] ? lock_is_held_type+0xc9/0x100 > > > > [ 9718.576990][T13609] ? check_flags.part.44+0x86/0x220 > > > > [ 9718.577836][T13609] ? check_flags+0x26/0x30 > > > > [ 9718.578542][T13609] ? lock_is_held_type+0xc9/0x100 > > > > [ 9718.579403][T13609] ? 
__kasan_check_read+0x11/0x20 > > > > [ 9718.580225][T13609] ? __fget_light+0xae/0x110 > > > > [ 9718.580983][T13609] ksys_ioctl+0xa1/0xe0 > > > > [ 9718.581628][T13609] __x64_sys_ioctl+0x43/0x50 > > > > [ 9718.582334][T13609] do_syscall_64+0x60/0xf0 > > > > [ 9718.583285][T13609] entry_SYSCALL_64_after_hwframe+0x44/0xa9 > > > > [ 9718.584378][T13609] RIP: 0033:0x7f9577e85427 > > > > [ 9718.585289][T13609] Code: Bad RIP value. > > > > [ 9718.586076][T13609] RSP: 002b:00007ffdc7b82548 EFLAGS: 00000206 ORIG_RAX: 0000000000000010 > > > > [ 9718.587896][T13609] RAX: ffffffffffffffda RBX: 00007ffdc7b825e8 RCX: 00007f9577e85427 > > > > [ 9718.589391][T13609] RDX: 00007ffdc7b825e8 RSI: 00000000c4009420 RDI: 0000000000000003 > > > > [ 9718.590817][T13609] RBP: 0000000000000003 R08: 0000000000000003 R09: 0000000000000078 > > > > [ 9718.592631][T13609] R10: fffffffffffff31c R11: 0000000000000206 R12: 0000000000000001 > > > > [ 9718.594405][T13609] R13: 0000000000000000 R14: 00007ffdc7b84a48 R15: 0000000000000001 > > > > [ 9718.596109][T13609] Modules linked in: > > > > [ 9718.597056][T13609] ---[ end trace 2cf173f8217fc093 ]--- > > > > [ 9718.598018][T13609] RIP: 0010:create_reloc_root+0x468/0x480 > > > > [ 9718.602850][T13609] Code: e8 bd 5b bd ff 4d 8b 76 50 be 08 00 00 00 49 8d bc 24 f0 00 00 00 e8 c7 5b bd ff 4d 89 b4 24 f0 00 00 00 e9 ee fc ff ff 0f 0b <0f> 0b 0f 0b 0f 0b 0f 0b e8 9b df 07 01 66 66 2e 0f 1f 84 00 00 00 > > > > [ 9718.613371][T13609] RSP: 0018:ffffc900018e7018 EFLAGS: 00010282 > > > > [ 9718.621286][T13609] RAX: 00000000ffffffef RBX: ffff8881e103a400 RCX: 0000000000000000 > > > > [ 9718.631255][T13609] RDX: dffffc0000000000 RSI: 0000000000000000 RDI: 0000000000000246 > > > > [ 9718.639764][T13609] RBP: ffffc900018e7108 R08: 0000000000000000 R09: 0000000000000001 > > > > [ 9718.641533][T13609] R10: 0000000000000001 R11: fffffbfff3dfb081 R12: ffff8881f37c8020 > > > > [ 9718.643173][T13609] R13: ffff88801fbc5b28 R14: ffff8881f37c8000 R15: ffffc900018e70a0 > 
> > > [ 9718.644840][T13609] FS: 00007f9577d928c0(0000) GS:ffff8881f5800000(0000) knlGS:0000000000000000 > > > > [ 9718.646728][T13609] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > > > [ 9718.648607][T13609] CR2: 00007f9823e35500 CR3: 00000000a52e0002 CR4: 00000000001606e0 > > > > [ 9718.869689][ T4545] ================================================================== > > > > > > > > same line, different call stack: > > > > > > > > 0xffffffff81933dd8 is in create_reloc_root (fs/btrfs/relocation.c:794). > > > > 789 btrfs_tree_unlock(eb); > > > > 790 free_extent_buffer(eb); > > > > 791 > > > > 792 ret = btrfs_insert_root(trans, fs_info->tree_root, > > > > 793 &root_key, root_item); > > > > 794 BUG_ON(ret); > > > > 795 kfree(root_item); > > > > 796 > > > > 797 reloc_root = btrfs_read_tree_root(fs_info->tree_root, &root_key); > > > > 798 BUG_ON(IS_ERR(reloc_root)); > > > > > > > > followed by > > > > > > > > [ 9718.869689][ T4545] ================================================================== > > > > [ 9718.871333][ T4545] BUG: KASAN: use-after-free in __mutex_lock+0x202/0xce0 > > > > [ 9718.872483][ T4545] Read of size 4 at addr ffff888014e9402c by task crawl_28443/4545 > > > > [ 9718.873746][ T4545] > > > > [ 9718.874106][ T4545] CPU: 1 PID: 4545 Comm: crawl_28443 Tainted: G D W 5.8.0-6582a95aabfe+ #44 > > > > [ 9718.875684][ T4545] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014 > > > > [ 9718.877149][ T4545] Call Trace: > > > > [ 9718.877655][ T4545] dump_stack+0xc8/0x11a > > > > [ 9718.878317][ T4545] ? __mutex_lock+0x202/0xce0 > > > > [ 9718.879065][ T4545] print_address_description.constprop.8+0x1f/0x200 > > > > [ 9718.880167][ T4545] ? __mutex_lock+0x202/0xce0 > > > > [ 9718.880916][ T4545] ? __mutex_lock+0x202/0xce0 > > > > [ 9718.881666][ T4545] kasan_report.cold.11+0x20/0x3e > > > > [ 9718.882483][ T4545] ? 
__mutex_lock+0x202/0xce0 > > > > [ 9718.883229][ T4545] __asan_load4+0x69/0x90 > > > > [ 9718.883920][ T4545] __mutex_lock+0x202/0xce0 > > > > [ 9718.884651][ T4545] ? wait_current_trans+0xb7/0x230 > > > > [ 9718.885465][ T4545] ? btrfs_record_root_in_trans+0x7e/0xc0 > > > > [ 9718.886388][ T4545] ? mutex_lock_io_nested+0xc20/0xc20 > > > > [ 9718.887246][ T4545] ? __kasan_check_read+0x11/0x20 > > > > [ 9718.888035][ T4545] ? join_transaction+0x32/0x6f0 > > > > [ 9718.888854][ T4545] ? join_transaction+0x1a6/0x6f0 > > > > [ 9718.889679][ T4545] ? lock_downgrade+0x3e0/0x3e0 > > > > [ 9718.890496][ T4545] ? __kasan_check_write+0x14/0x20 > > > > [ 9718.891308][ T4545] ? lock_contended+0x720/0x720 > > > > [ 9718.892093][ T4545] ? do_raw_spin_lock+0x1e0/0x1e0 > > > > [ 9718.892912][ T4545] ? wait_current_trans+0xb7/0x230 > > > > [ 9718.893705][ T4545] mutex_lock_nested+0x1b/0x20 > > > > [ 9718.894494][ T4545] ? mutex_lock_nested+0x1b/0x20 > > > > [ 9718.895317][ T4545] btrfs_record_root_in_trans+0x7e/0xc0 > > > > [ 9718.896245][ T4545] start_transaction+0x189/0x8f0 > > > > [ 9718.897081][ T4545] btrfs_start_transaction+0x1e/0x20 > > > > [ 9718.897941][ T4545] btrfs_cont_expand+0x549/0x7a0 > > > > [ 9718.898805][ T4545] ? btrfs_truncate_block+0x930/0x930 > > > > [ 9718.899665][ T4545] ? inode_newsize_ok+0x75/0xc0 > > > > [ 9718.900438][ T4545] ? setattr_prepare+0x9c/0x310 > > > > [ 9718.901242][ T4545] btrfs_setattr+0x514/0x850 > > > > [ 9718.902035][ T4545] ? current_time+0x8c/0xe0 > > > > [ 9718.902799][ T4545] notify_change+0x4ec/0x700 > > > > [ 9718.903584][ T4545] ? do_sys_ftruncate+0x108/0x220 > > > > [ 9718.904459][ T4545] do_truncate+0xe4/0x160 > > > > [ 9718.905200][ T4545] ? __x64_sys_openat2+0x170/0x170 > > > > [ 9718.906116][ T4545] ? 
__sb_start_write+0x1a1/0x270 > > > > [ 9718.906954][ T4545] do_sys_ftruncate+0x1b8/0x220 > > > > [ 9718.907759][ T4545] __x64_sys_ftruncate+0x36/0x40 > > > > [ 9718.908577][ T4545] do_syscall_64+0x60/0xf0 > > > > [ 9718.909292][ T4545] entry_SYSCALL_64_after_hwframe+0x44/0xa9 > > > > [ 9718.910521][ T4545] RIP: 0033:0x7f201fcab947 > > > > [ 9718.911247][ T4545] Code: Bad RIP value. > > > > [ 9718.911915][ T4545] RSP: 002b:00007f201d3abeb8 EFLAGS: 00000202 ORIG_RAX: 000000000000004d > > > > [ 9718.913285][ T4545] RAX: ffffffffffffffda RBX: 00007f201d3abfa0 RCX: 00007f201fcab947 > > > > [ 9718.914613][ T4545] RDX: 000000005f18a6d2 RSI: 0000000000286000 RDI: 0000000000000ec1 > > > > [ 9718.915921][ T4545] RBP: 00007f1fb01c2f00 R08: 00007ffe1e345080 R09: 00000000011b1f78 > > > > [ 9718.917236][ T4545] R10: 00000000011b1f78 R11: 0000000000000202 R12: 00007f201d3abf20 > > > > [ 9718.918556][ T4545] R13: 00007f201d3abef0 R14: 00007f201d3abf50 R15: 00007f201d3abed0 > > > > [ 9718.919882][ T4545] > > > > [ 9718.920268][ T4545] Allocated by task 6732: > > > > [ 9718.920973][ T4545] save_stack+0x21/0x50 > > > > [ 9718.921648][ T4545] __kasan_kmalloc.constprop.17+0xc1/0xd0 > > > > [ 9718.922580][ T4545] kasan_slab_alloc+0x12/0x20 > > > > [ 9718.923345][ T4545] kmem_cache_alloc_node+0x113/0x720 > > > > [ 9718.924203][ T4545] copy_process+0x357/0x3680 > > > > [ 9718.924955][ T4545] _do_fork+0xed/0x880 > > > > [ 9718.925622][ T4545] __do_sys_clone+0xee/0x130 > > > > [ 9718.926369][ T4545] __x64_sys_clone+0x67/0x80 > > > > [ 9718.927119][ T4545] do_syscall_64+0x60/0xf0 > > > > [ 9718.927848][ T4545] entry_SYSCALL_64_after_hwframe+0x44/0xa9 > > > > [ 9718.928812][ T4545] > > > > [ 9718.929173][ T4545] Freed by task 24: > > > > [ 9718.929787][ T4545] save_stack+0x21/0x50 > > > > [ 9718.930453][ T4545] __kasan_slab_free+0x118/0x170 > > > > [ 9718.931242][ T4545] kasan_slab_free+0xe/0x10 > > > > [ 9718.931970][ T4545] kmem_cache_free+0x5f/0x280 > > > > [ 9718.932730][ T4545] 
free_task+0x73/0x90 > > > > [ 9718.933391][ T4545] __put_task_struct+0x199/0x1d0 > > > > [ 9718.934187][ T4545] delayed_put_task_struct+0x124/0x1b0 > > > > [ 9718.935071][ T4545] rcu_core+0x3b0/0xeb0 > > > > [ 9718.935758][ T4545] rcu_core_si+0xe/0x10 > > > > [ 9718.936433][ T4545] __do_softirq+0x120/0x5e3 > > > > [ 9718.937165][ T4545] > > > > [ 9718.937545][ T4545] The buggy address belongs to the object at ffff888014e94000 > > > > [ 9718.937545][ T4545] which belongs to the cache task_struct(168:screen-wrapper.service) of size 11072 > > > > [ 9718.940391][ T4545] The buggy address is located 44 bytes inside of > > > > [ 9718.940391][ T4545] 11072-byte region [ffff888014e94000, ffff888014e96b40) > > > > [ 9718.942559][ T4545] The buggy address belongs to the page: > > > > [ 9718.943454][ T4545] page:ffffea000053a500 refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff888014e97fff head:ffffea000053a500 order:2 compound_mapcount:0 compound_pincount:0 > > > > [ 9718.946072][ T4545] flags: 0xfffe0000010200(slab|head) > > > > [ 9718.946958][ T4545] raw: 00fffe0000010200 ffffea00011ab108 ffffea0001d6f108 ffff8881eabd9700 > > > > [ 9718.948406][ T4545] raw: ffff888014e97fff ffff888014e94000 0000000100000001 0000000000000000 > > > > [ 9718.949889][ T4545] page dumped because: kasan: bad access detected > > > > [ 9718.950977][ T4545] > > > > [ 9718.951354][ T4545] Memory state around the buggy address: > > > > [ 9718.952296][ T4545] ffff888014e93f00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 > > > > [ 9718.953641][ T4545] ffff888014e93f80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 > > > > [ 9718.955004][ T4545] >ffff888014e94000: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb > > > > [ 9718.956366][ T4545] ^ > > > > [ 9718.957258][ T4545] ffff888014e94080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb > > > > [ 9718.958653][ T4545] ffff888014e94100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb > > > > [ 9718.960034][ T4545] 
================================================================== > > > > > > > > > > > > > ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: BUG at fs/btrfs/relocation.c:794! Still happening on misc-next and 5.8.3 2020-08-28 0:08 ` Zygo Blaxell @ 2020-08-28 6:34 ` Nikolay Borisov 2020-08-28 20:42 ` Zygo Blaxell 0 siblings, 1 reply; 13+ messages in thread From: Nikolay Borisov @ 2020-08-28 6:34 UTC (permalink / raw) To: Zygo Blaxell, Qu Wenruo; +Cc: David Sterba, linux-btrfs, wqu On 28.08.20 г. 3:08 ч., Zygo Blaxell wrote: > On Thu, Aug 27, 2020 at 08:03:13PM -0400, Zygo Blaxell wrote: >> On Tue, Aug 04, 2020 at 12:16:26PM -0400, Zygo Blaxell wrote: <snip> >> >> Aug 23 05:04:05 regress kernel: [53458.128928][ T9737] BTRFS info (device dm-0): relocating block group 14939862335488 flags metadata|raid1 >> Aug 23 05:04:05 regress kernel: [53458.999342][ T9737] ------------[ cut here ]------------ >> Aug 23 05:04:05 regress kernel: [53459.000275][ T9737] kernel BUG at fs/btrfs/relocation.c:794! >> >> Aug 24 01:23:52 regress kernel: [58662.545914][T17474] BTRFS info (device dm-0): relocating block group 15083978620928 flags metadata|raid1 >> Aug 24 01:23:54 regress kernel: [58664.778274][T17474] ------------[ cut here ]------------ >> Aug 24 01:23:54 regress kernel: [58664.782182][T17474] kernel BUG at fs/btrfs/relocation.c:794! >> >> Aug 24 07:17:07 regress kernel: [21068.421134][T29457] BTRFS info (device dm-0): relocating block group 15160784715776 flags metadata|raid1 >> Aug 24 07:17:08 regress kernel: [21069.307661][ T5176] ------------[ cut here ]------------ >> Aug 24 07:17:08 regress kernel: [21069.309195][ T5176] kernel BUG at fs/btrfs/relocation.c:794! >> >> Aug 25 18:58:26 regress kernel: [22013.457555][ T2164] BTRFS info (device dm-0): relocating block group 15530051239936 flags metadata|raid1 >> Aug 25 18:58:27 regress kernel: [22014.460689][ T4939] ------------[ cut here ]------------ >> Aug 25 18:58:27 regress kernel: [22014.461653][ T4939] kernel BUG at fs/btrfs/relocation.c:794! 
>> >> Aug 26 03:39:20 regress kernel: [31172.016638][T30882] BTRFS info (device dm-0): relocating block group 15576759009280 flags metadata|raid1 >> Aug 26 03:39:21 regress kernel: [31173.329719][T12663] ------------[ cut here ]------------ >> Aug 26 03:39:21 regress kernel: [31173.330682][T12663] kernel BUG at fs/btrfs/relocation.c:794! >> >> Aug 26 16:00:02 regress kernel: [44334.231395][T25917] BTRFS info (device dm-0): relocating block group 15631888941056 flags data >> Aug 26 16:00:04 regress kernel: [44336.800710][T26519] ------------[ cut here ]------------ >> Aug 26 16:00:04 regress kernel: [44336.802888][T26519] kernel BUG at fs/btrfs/relocation.c:794! >> >> Aug 27 15:45:29 regress kernel: [55423.626717][ T5878] BTRFS info (device dm-0): relocating block group 15820229967872 flags metadata|raid1 >> Aug 27 15:45:29 regress kernel: [55423.798584][T15744] ------------[ cut here ]------------ >> Aug 27 15:45:29 regress kernel: [55423.802581][T15744] kernel BUG at fs/btrfs/relocation.c:794! >> >> Aug 27 17:35:26 regress kernel: [ 6459.129124][T21053] BTRFS info (device dm-0): relocating block group 15831168712704 flags metadata|raid1 >> Aug 27 17:35:43 regress kernel: [ 6475.931029][T25720] ------------[ cut here ]------------ >> Aug 27 17:35:43 regress kernel: [ 6475.932403][T25720] kernel BUG at fs/btrfs/relocation.c:794! >> >> There don't seem to be any instances of the BUG that did not occur >> within 30 seconds of starting a balance. >> >> The on-disk data is fine. After a reboot the same block group can be >> successfully balanced. > > Forgot to mention the failure rate: 8 crashes (listed above) among 1492 > block groups balanced over the same 4-day period. Since you can repro reliably could you modify the code in create_reloc_root so it prints what's the returned error value, I'd speculate it's EEXIST from btrfs_insert_root btrfs_insert_item btrfs_insert_empty_item btrfs_insert_empty_items btrfs_search_slot But better be sure. 
> ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: BUG at fs/btrfs/relocation.c:794! Still happening on misc-next and 5.8.3 2020-08-28 6:34 ` Nikolay Borisov @ 2020-08-28 20:42 ` Zygo Blaxell 2020-09-01 22:53 ` Zygo Blaxell 0 siblings, 1 reply; 13+ messages in thread From: Zygo Blaxell @ 2020-08-28 20:42 UTC (permalink / raw) To: Nikolay Borisov; +Cc: Qu Wenruo, David Sterba, linux-btrfs, wqu On Fri, Aug 28, 2020 at 09:34:31AM +0300, Nikolay Borisov wrote: > On 28.08.20 г. 3:08 ч., Zygo Blaxell wrote: > > On Thu, Aug 27, 2020 at 08:03:13PM -0400, Zygo Blaxell wrote: > >> On Tue, Aug 04, 2020 at 12:16:26PM -0400, Zygo Blaxell wrote: > > <snip> > > Since you can repro reliably could you modify the code in > create_reloc_root so it prints what's the returned error value, I'd > speculate it's EEXIST from > > btrfs_insert_root > btrfs_insert_item > btrfs_insert_empty_item > btrfs_insert_empty_items > btrfs_search_slot > > But better be sure. Here you go, EEXIST == 17: Aug 28 15:30:55 regress kernel: [18452.845182][T31546] BTRFS info (device dm-0): balance: start -dlimit=9 Aug 28 15:30:55 regress kernel: [18452.874627][T31546] BTRFS info (device dm-0): relocating block group 15873413742592 flags data Aug 28 15:30:55 regress kernel: [18453.097516][ T2100] btrfs_search_slot ret = 0 Aug 28 15:30:55 regress kernel: [18453.104865][ T2100] btrfs_search_slot ret = 0 Aug 28 15:30:55 regress kernel: [18453.109751][ T2100] btrfs_search_slot ret = 0 Aug 28 15:30:56 regress kernel: [18454.453135][ T2100] btrfs_search_slot ret = 0 Aug 28 15:30:56 regress kernel: [18454.453955][ T2100] btrfs_insert_empty_item ret = -17 Aug 28 15:30:56 regress kernel: [18454.455022][ T2100] btrfs_insert_root ret = -17 Aug 28 15:30:56 regress kernel: [18454.456229][ T2100] ------------[ cut here ]------------ Aug 28 15:30:56 regress kernel: [18454.457622][ T2100] kernel BUG at fs/btrfs/relocation.c:795! 
Aug 28 15:30:56 regress kernel: [18454.459006][ T2100] invalid opcode: 0000 [#1] SMP KASAN PTI Aug 28 15:30:56 regress kernel: [18454.460356][ T2100] CPU: 2 PID: 2100 Comm: rsync Tainted: G W 5.8.5-8de74804e45b+ #6 Aug 28 15:30:57 regress kernel: [18454.462324][ T2100] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014 Aug 28 15:30:57 regress kernel: [18454.464289][ T2100] RIP: 0010:create_reloc_root+0x47a/0x490 Aug 28 15:30:57 regress kernel: [18454.465507][ T2100] Code: e8 5b 3b bd ff 4d 8b 76 50 be 08 00 00 00 49 8d bc 24 f0 00 00 00 e8 65 3b bd ff 4d 89 b4 24 f0 00 00 00 e9 dc fc ff ff 0f 0b <0f> 0b 0f 0b 0f 0b 0f 0b e8 b9 90 07 01 66 0f 1f 84 00 00 00 00 00 Aug 28 15:30:57 regress kernel: [18454.468861][ T2100] RSP: 0018:ffffc90000c777d0 EFLAGS: 00010282 Aug 28 15:30:57 regress kernel: [18454.469787][ T2100] RAX: 000000000000001b RBX: ffff88817cbc9400 RCX: ffffffffa5273b42 Aug 28 15:30:57 regress kernel: [18454.471005][ T2100] RDX: 0000000000000000 RSI: 0000000000000008 RDI: ffff8881f5dff28c Aug 28 15:30:57 regress kernel: [18454.472278][ T2100] RBP: ffffc90000c778c0 R08: ffffed103ebc1645 R09: ffffed103ebc1645 Aug 28 15:30:57 regress kernel: [18454.473547][ T2100] R10: ffff8881f5e0b227 R11: ffffed103ebc1644 R12: ffff8881cb710020 Aug 28 15:30:57 regress kernel: [18454.474949][ T2100] R13: ffff888118800a80 R14: 00000000ffffffef R15: ffffc90000c77858 Aug 28 15:30:57 regress kernel: [18454.476224][ T2100] FS: 00007f1b8f7d9b80(0000) GS:ffff8881f5c00000(0000) knlGS:0000000000000000 Aug 28 15:30:57 regress kernel: [18454.477635][ T2100] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 Aug 28 15:30:57 regress kernel: [18454.478661][ T2100] CR2: 00007fc1d25e7100 CR3: 0000000120a8e006 CR4: 00000000001606e0 Aug 28 15:30:57 regress kernel: [18454.479894][ T2100] Call Trace: Aug 28 15:30:57 regress kernel: [18454.480416][ T2100] ? update_backref_node+0xf0/0xf0 Aug 28 15:30:57 regress kernel: [18454.481209][ T2100] ? 
check_chain_key+0x1e6/0x2e0 Aug 28 15:30:57 regress kernel: [18454.482012][ T2100] btrfs_init_reloc_root+0x1b0/0x310 Aug 28 15:30:57 regress kernel: [18454.482859][ T2100] ? find_reloc_root+0x200/0x200 Aug 28 15:30:57 regress kernel: [18454.483661][ T2100] ? do_raw_spin_unlock+0xa8/0x140 Aug 28 15:30:57 regress kernel: [18454.484482][ T2100] record_root_in_trans+0x18c/0x1d0 Aug 28 15:30:57 regress kernel: [18454.485435][ T2100] btrfs_record_root_in_trans+0x8b/0xc0 Aug 28 15:30:57 regress kernel: [18454.486301][ T2100] start_transaction+0x16b/0x8f0 Aug 28 15:30:57 regress kernel: [18454.487082][ T2100] btrfs_start_transaction+0x1e/0x20 Aug 28 15:30:57 regress kernel: [18454.487905][ T2100] btrfs_cont_expand+0x549/0x7a0 Aug 28 15:30:57 regress kernel: [18454.488680][ T2100] ? btrfs_truncate_block+0x970/0x970 Aug 28 15:30:57 regress kernel: [18454.489527][ T2100] ? timestamp_truncate+0x180/0x180 Aug 28 15:30:57 regress kernel: [18454.490344][ T2100] ? check_chain_key+0x1e6/0x2e0 Aug 28 15:30:57 regress kernel: [18454.491117][ T2100] btrfs_file_write_iter+0x7ae/0x957 Aug 28 15:30:57 regress kernel: [18454.491938][ T2100] ? btrfs_sync_file+0x7c0/0x7c0 Aug 28 15:30:57 regress kernel: [18454.492710][ T2100] ? iov_iter_init+0x99/0xd0 Aug 28 15:30:57 regress kernel: [18454.493426][ T2100] new_sync_write+0x2ad/0x3f0 Aug 28 15:30:57 regress kernel: [18454.494153][ T2100] ? new_sync_read+0x3e0/0x3e0 Aug 28 15:30:57 regress kernel: [18454.494890][ T2100] ? check_flags+0x26/0x30 Aug 28 15:30:57 regress kernel: [18454.495582][ T2100] ? lock_is_held_type+0xc9/0x100 Aug 28 15:30:57 regress kernel: [18454.496365][ T2100] ? rcu_read_lock_any_held+0xd2/0x100 Aug 28 15:30:57 regress kernel: [18454.497211][ T2100] ? rcu_read_lock_held+0xb0/0xb0 Aug 28 15:30:57 regress kernel: [18454.497985][ T2100] ? 
__sb_start_write+0x1a1/0x270 Aug 28 15:30:57 regress kernel: [18454.498768][ T2100] vfs_write+0x2d2/0x300 Aug 28 15:30:57 regress kernel: [18454.499417][ T2100] ksys_write+0xcc/0x170 Aug 28 15:30:57 regress kernel: [18454.500064][ T2100] ? __ia32_sys_read+0x50/0x50 Aug 28 15:30:57 regress kernel: [18454.500783][ T2100] ? entry_SYSCALL_64_after_hwframe+0x44/0xa9 Aug 28 15:30:57 regress kernel: [18454.501704][ T2100] __x64_sys_write+0x43/0x50 Aug 28 15:30:57 regress kernel: [18454.502403][ T2100] do_syscall_64+0x60/0xf0 Aug 28 15:30:57 regress kernel: [18454.503079][ T2100] entry_SYSCALL_64_after_hwframe+0x44/0xa9 Aug 28 15:30:57 regress kernel: [18454.503971][ T2100] RIP: 0033:0x7f1b8f8c5504 Aug 28 15:30:57 regress kernel: [18454.504644][ T2100] Code: 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b3 0f 1f 80 00 00 00 00 48 8d 05 f9 61 0d 00 8b 00 85 c0 75 13 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 54 c3 0f 1f 00 41 54 49 89 d4 55 48 89 f5 53 Aug 28 15:30:57 regress kernel: [18454.507565][ T2100] RSP: 002b:00007fff3419eaa8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 Aug 28 15:30:57 regress kernel: [18454.508800][ T2100] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f1b8f8c5504 Aug 28 15:30:57 regress kernel: [18454.509982][ T2100] RDX: 0000000000000400 RSI: 000055e56f375bb0 RDI: 0000000000000001 Aug 28 15:30:57 regress kernel: [18454.511153][ T2100] RBP: 0000000000000400 R08: 0000000000000400 R09: 000000002c4a4095 Aug 28 15:30:57 regress kernel: [18454.512325][ T2100] R10: 000000000a7b98fd R11: 0000000000000246 R12: 000055e56f375bb0 Aug 28 15:30:57 regress kernel: [18454.513503][ T2100] R13: 000055e56f375bb0 R14: 0000000000008000 R15: 0000000000000400 Aug 28 15:30:57 regress kernel: [18454.514685][ T2100] Modules linked in: Aug 28 15:30:57 regress kernel: [18454.515321][ T2100] ---[ end trace dc1ad17026339b11 ]--- Aug 28 15:30:57 regress kernel: [18454.516184][ T2100] RIP: 0010:create_reloc_root+0x47a/0x490 Aug 28 15:30:57 regress kernel: [18454.517085][ 
T2100] Code: e8 5b 3b bd ff 4d 8b 76 50 be 08 00 00 00 49 8d bc 24 f0 00 00 00 e8 65 3b bd ff 4d 89 b4 24 f0 00 00 00 e9 dc fc ff ff 0f 0b <0f> 0b 0f 0b 0f 0b 0f 0b e8 b9 90 07 01 66 0f 1f 84 00 00 00 00 00 Aug 28 15:30:57 regress kernel: [18454.520010][ T2100] RSP: 0018:ffffc90000c777d0 EFLAGS: 00010282 Aug 28 15:30:57 regress kernel: [18454.520935][ T2100] RAX: 000000000000001b RBX: ffff88817cbc9400 RCX: ffffffffa5273b42 Aug 28 15:30:57 regress kernel: [18454.522172][ T2100] RDX: 0000000000000000 RSI: 0000000000000008 RDI: ffff8881f5dff28c Aug 28 15:30:57 regress kernel: [18454.523567][ T2100] RBP: ffffc90000c778c0 R08: ffffed103ebc1645 R09: ffffed103ebc1645 Aug 28 15:30:57 regress kernel: [18454.524985][ T2100] R10: ffff8881f5e0b227 R11: ffffed103ebc1644 R12: ffff8881cb710020 Aug 28 15:30:57 regress kernel: [18454.526404][ T2100] R13: ffff888118800a80 R14: 00000000ffffffef R15: ffffc90000c77858 Aug 28 15:30:57 regress kernel: [18454.527887][ T2100] FS: 00007f1b8f7d9b80(0000) GS:ffff8881f5c00000(0000) knlGS:0000000000000000 Aug 28 15:30:57 regress kernel: [18454.529576][ T2100] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 Aug 28 15:30:57 regress kernel: [18454.530845][ T2100] CR2: 00007fc1d25e7100 CR3: 0000000120a8e006 CR4: 00000000001606e0 Aug 28 15:30:57 regress kernel: [18454.821401][T32222] ================================================================== Aug 28 15:30:57 regress kernel: [18454.822634][T32222] BUG: KASAN: use-after-free in __mutex_lock+0x202/0xce0 Aug 28 15:30:57 regress kernel: [18454.823654][T32222] Read of size 4 at addr ffff88811329c02c by task mkdir/32222 Aug 28 15:30:57 regress kernel: [18454.824781][T32222] Aug 28 15:30:57 regress kernel: [18454.825148][T32222] CPU: 1 PID: 32222 Comm: mkdir Tainted: G D W 5.8.5-8de74804e45b+ #6 Aug 28 15:30:57 regress kernel: [18454.826616][T32222] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014 Aug 28 15:30:57 regress kernel: [18454.828088][T32222] Call Trace: Aug 
28 15:30:57 regress kernel: [18454.828607][T32222] dump_stack+0xc8/0x11a Aug 28 15:30:57 regress kernel: [18454.829297][T32222] ? __mutex_lock+0x202/0xce0 Aug 28 15:30:57 regress kernel: [18454.830033][T32222] print_address_description.constprop.8+0x1f/0x200 Aug 28 15:30:57 regress kernel: [18454.831062][T32222] ? __mutex_lock+0x202/0xce0 Aug 28 15:30:57 regress kernel: [18454.831783][T32222] ? __mutex_lock+0x202/0xce0 Aug 28 15:30:57 regress kernel: [18454.832537][T32222] kasan_report.cold.11+0x20/0x3e Aug 28 15:30:57 regress kernel: [18454.833323][T32222] ? __mutex_lock+0x202/0xce0 Aug 28 15:30:57 regress kernel: [18454.834056][T32222] __asan_load4+0x69/0x90 Aug 28 15:30:57 regress kernel: [18454.834754][T32222] __mutex_lock+0x202/0xce0 Aug 28 15:30:57 regress kernel: [18454.835475][T32222] ? wait_current_trans+0xb7/0x230 Aug 28 15:30:57 regress kernel: [18454.836295][T32222] ? btrfs_record_root_in_trans+0x7e/0xc0 Aug 28 15:30:57 regress kernel: [18454.837206][T32222] ? mutex_lock_io_nested+0xc20/0xc20 Aug 28 15:30:57 regress kernel: [18454.838064][T32222] ? __kasan_check_read+0x11/0x20 Aug 28 15:30:57 regress kernel: [18454.838860][T32222] ? join_transaction+0x32/0x6f0 Aug 28 15:30:57 regress kernel: [18454.839653][T32222] ? join_transaction+0x1a6/0x6f0 Aug 28 15:30:57 regress kernel: [18454.840592][T32222] ? lock_downgrade+0x3e0/0x3e0 Aug 28 15:30:57 regress kernel: [18454.841401][T32222] ? __kasan_check_write+0x14/0x20 Aug 28 15:30:57 regress kernel: [18454.842165][T32222] ? lock_contended+0x720/0x720 Aug 28 15:30:57 regress kernel: [18454.842883][T32222] ? do_raw_spin_lock+0x1e0/0x1e0 Aug 28 15:30:57 regress kernel: [18454.843629][T32222] ? wait_current_trans+0xb7/0x230 Aug 28 15:30:57 regress kernel: [18454.844409][T32222] mutex_lock_nested+0x1b/0x20 Aug 28 15:30:57 regress kernel: [18454.845121][T32222] ? 
mutex_lock_nested+0x1b/0x20 Aug 28 15:30:57 regress kernel: [18454.845867][T32222] btrfs_record_root_in_trans+0x7e/0xc0 Aug 28 15:30:57 regress kernel: [18454.846694][T32222] start_transaction+0x16b/0x8f0 Aug 28 15:30:57 regress kernel: [18454.847438][T32222] btrfs_start_transaction+0x1e/0x20 Aug 28 15:30:57 regress kernel: [18454.848223][T32222] btrfs_mkdir+0xf5/0x3b0 Aug 28 15:30:57 regress kernel: [18454.848863][T32222] ? make_kprojid+0x20/0x20 Aug 28 15:30:57 regress kernel: [18454.849533][T32222] ? putname+0x6b/0x80 Aug 28 15:30:57 regress kernel: [18454.850141][T32222] ? btrfs_rename2+0x2b20/0x2b20 Aug 28 15:30:57 regress kernel: [18454.850866][T32222] ? generic_permission+0x58/0x250 Aug 28 15:30:57 regress kernel: [18454.851753][T32222] ? security_inode_permission+0x1d/0x70 Aug 28 15:30:57 regress kernel: [18454.852598][T32222] ? inode_permission+0x7a/0x1f0 Aug 28 15:30:57 regress kernel: [18454.853343][T32222] vfs_mkdir+0x1e1/0x2f0 Aug 28 15:30:57 regress kernel: [18454.853971][T32222] do_mkdirat+0x192/0x1c0 Aug 28 15:30:57 regress kernel: [18454.854620][T32222] ? __ia32_sys_mknod+0x50/0x50 Aug 28 15:30:57 regress kernel: [18454.855357][T32222] ? 
trace_hardirqs_on_prepare+0x35/0x170 Aug 28 15:30:57 regress kernel: [18454.856239][T32222] __x64_sys_mkdir+0x37/0x40 Aug 28 15:30:57 regress kernel: [18454.856951][T32222] do_syscall_64+0x60/0xf0 Aug 28 15:30:57 regress kernel: [18454.857645][T32222] entry_SYSCALL_64_after_hwframe+0x44/0xa9 Aug 28 15:30:57 regress kernel: [18454.858609][T32222] RIP: 0033:0x7f36074470d7 Aug 28 15:30:57 regress kernel: [18454.859287][T32222] Code: 1f 40 00 48 8b 05 b9 0d 0d 00 64 c7 00 5f 00 00 00 b8 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 b8 53 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 89 0d 0d 00 f7 d8 64 89 01 48 Aug 28 15:30:57 regress kernel: [18454.862597][T32222] RSP: 002b:00007ffc5c8419e8 EFLAGS: 00000206 ORIG_RAX: 0000000000000053 Aug 28 15:30:57 regress kernel: [18454.863874][T32222] RAX: ffffffffffffffda RBX: 00007ffc5c842bc8 RCX: 00007f36074470d7 Aug 28 15:30:57 regress kernel: [18454.865087][T32222] RDX: 0000000000000000 RSI: 00000000000001ff RDI: 00007ffc5c842bc8 Aug 28 15:30:57 regress kernel: [18454.866297][T32222] RBP: 00007ffc5c842bc8 R08: 00000000000001ff R09: 0000557174728c00 Aug 28 15:30:57 regress kernel: [18454.867501][T32222] R10: fffffffffffff35a R11: 0000000000000206 R12: 00000000000001ff Aug 28 15:30:57 regress kernel: [18454.868709][T32222] R13: 0000000000000000 R14: 00007ffc5c841b60 R15: 00007ffc5c841cf0 Aug 28 15:30:57 regress kernel: [18454.869923][T32222] Aug 28 15:30:57 regress kernel: [18454.870296][T32222] Allocated by task 2066: Aug 28 15:30:57 regress kernel: [18454.870939][T32222] save_stack+0x21/0x50 Aug 28 15:30:57 regress kernel: [18454.871572][T32222] __kasan_kmalloc.constprop.17+0xc1/0xd0 Aug 28 15:30:57 regress kernel: [18454.872434][T32222] kasan_slab_alloc+0x12/0x20 Aug 28 15:30:57 regress kernel: [18454.873133][T32222] kmem_cache_alloc_node+0x113/0x720 Aug 28 15:30:57 regress kernel: [18454.873914][T32222] copy_process+0x357/0x3680 Aug 28 15:30:57 regress kernel: [18454.874653][T32222] _do_fork+0xed/0x880 Aug 28 
15:30:57 regress kernel: [18454.875353][T32222] __do_sys_clone+0xee/0x130 Aug 28 15:30:57 regress kernel: [18454.876057][T32222] __x64_sys_clone+0x67/0x80 Aug 28 15:30:57 regress kernel: [18454.876782][T32222] do_syscall_64+0x60/0xf0 Aug 28 15:30:57 regress kernel: [18454.877476][T32222] entry_SYSCALL_64_after_hwframe+0x44/0xa9 Aug 28 15:30:57 regress kernel: [18454.878398][T32222] Aug 28 15:30:57 regress kernel: [18454.878760][T32222] Freed by task 3558: Aug 28 15:30:57 regress kernel: [18454.879384][T32222] save_stack+0x21/0x50 Aug 28 15:30:57 regress kernel: [18454.880038][T32222] __kasan_slab_free+0x118/0x170 Aug 28 15:30:57 regress kernel: [18454.880855][T32222] kasan_slab_free+0xe/0x10 Aug 28 15:30:57 regress kernel: [18454.881565][T32222] kmem_cache_free+0x5f/0x280 Aug 28 15:30:57 regress kernel: [18454.882297][T32222] free_task+0x73/0x90 Aug 28 15:30:57 regress kernel: [18454.882928][T32222] __put_task_struct+0x199/0x1d0 Aug 28 15:30:57 regress kernel: [18454.883699][T32222] delayed_put_task_struct+0x124/0x1b0 Aug 28 15:30:57 regress kernel: [18454.884615][T32222] rcu_core+0x3b0/0xea0 Aug 28 15:30:57 regress kernel: [18454.885273][T32222] rcu_core_si+0xe/0x10 Aug 28 15:30:57 regress kernel: [18454.886251][T32222] __do_softirq+0x120/0x5e3 Aug 28 15:30:57 regress kernel: [18454.886964][T32222] Aug 28 15:30:57 regress kernel: [18454.887332][T32222] The buggy address belongs to the object at ffff88811329c000 Aug 28 15:30:57 regress kernel: [18454.887332][T32222] which belongs to the cache task_struct(192:ssh.service) of size 11072 Aug 28 15:30:57 regress kernel: [18454.889771][T32222] The buggy address is located 44 bytes inside of Aug 28 15:30:57 regress kernel: [18454.889771][T32222] 11072-byte region [ffff88811329c000, ffff88811329eb40) Aug 28 15:30:57 regress kernel: [18454.891843][T32222] The buggy address belongs to the page: Aug 28 15:30:57 regress kernel: [18454.892718][T32222] page:ffffea00044ca700 refcount:1 mapcount:0 mapping:0000000000000000 
index:0xffff88811329ffff head:ffffea00044ca700 order:2 compound_mapcount:0 compound_pincount:0 Aug 28 15:30:57 regress kernel: [18454.895303][T32222] flags: 0x17ffe0000010200(slab|head) Aug 28 15:30:57 regress kernel: [18454.896186][T32222] raw: 017ffe0000010200 ffffea0001a49908 ffff8881f5b36498 ffff8881eb5a1380 Aug 28 15:30:57 regress kernel: [18454.897618][T32222] raw: ffff88811329ffff ffff88811329c000 0000000100000001 0000000000000000 Aug 28 15:30:57 regress kernel: [18454.899016][T32222] page dumped because: kasan: bad access detected Aug 28 15:30:57 regress kernel: [18454.900061][T32222] Aug 28 15:30:57 regress kernel: [18454.900439][T32222] Memory state around the buggy address: Aug 28 15:30:57 regress kernel: [18454.901364][T32222] ffff88811329bf00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 Aug 28 15:30:57 regress kernel: [18454.902699][T32222] ffff88811329bf80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 Aug 28 15:30:57 regress kernel: [18454.904052][T32222] >ffff88811329c000: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb Aug 28 15:30:57 regress kernel: [18454.905345][T32222] ^ Aug 28 15:30:57 regress kernel: [18454.906245][T32222] ffff88811329c080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb Aug 28 15:30:57 regress kernel: [18454.907675][T32222] ffff88811329c100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb Aug 28 15:30:57 regress kernel: [18454.909247][T32222] ==================================================================
* Re: BUG at fs/btrfs/relocation.c:794! Still happening on misc-next and 5.8.3
  2020-08-28 20:42 ` Zygo Blaxell
@ 2020-09-01 22:53 ` Zygo Blaxell
  2020-09-01 23:33 ` Qu Wenruo
  0 siblings, 1 reply; 13+ messages in thread
From: Zygo Blaxell @ 2020-09-01 22:53 UTC
To: Nikolay Borisov; +Cc: Qu Wenruo, David Sterba, linux-btrfs, wqu

On Fri, Aug 28, 2020 at 04:42:55PM -0400, Zygo Blaxell wrote:
> On Fri, Aug 28, 2020 at 09:34:31AM +0300, Nikolay Borisov wrote:
> > On 28.08.20 at 3:08, Zygo Blaxell wrote:
> > > On Thu, Aug 27, 2020 at 08:03:13PM -0400, Zygo Blaxell wrote:
> > >> On Tue, Aug 04, 2020 at 12:16:26PM -0400, Zygo Blaxell wrote:
> >
> > <snip>
> >
> > Since you can repro reliably could you modify the code in
> > create_reloc_root so it prints what's the returned error value, I'd
> > speculate it's EEXIST from
> >
> > btrfs_insert_root
> > btrfs_insert_item
> > btrfs_insert_empty_item
> > btrfs_insert_empty_items
> > btrfs_search_slot
> >
> > But better be sure.
>
> Here you go, EEXIST == 17:
>
> Aug 28 15:30:55 regress kernel: [18452.845182][T31546] BTRFS info (device dm-0): balance: start -dlimit=9
> Aug 28 15:30:55 regress kernel: [18452.874627][T31546] BTRFS info (device dm-0): relocating block group 15873413742592 flags data
> Aug 28 15:30:55 regress kernel: [18453.097516][ T2100] btrfs_search_slot ret = 0
> Aug 28 15:30:55 regress kernel: [18453.104865][ T2100] btrfs_search_slot ret = 0
> Aug 28 15:30:55 regress kernel: [18453.109751][ T2100] btrfs_search_slot ret = 0
> Aug 28 15:30:56 regress kernel: [18454.453135][ T2100] btrfs_search_slot ret = 0
> Aug 28 15:30:56 regress kernel: [18454.453955][ T2100] btrfs_insert_empty_item ret = -17
> Aug 28 15:30:56 regress kernel: [18454.455022][ T2100] btrfs_insert_root ret = -17
> Aug 28 15:30:56 regress kernel: [18454.456229][ T2100] ------------[ cut here ]------------
> Aug 28 15:30:56 regress kernel: [18454.457622][ T2100] kernel BUG at fs/btrfs/relocation.c:795!
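
For readers cross-checking the traces: the "ret = -17" printed by the debug output above is the same value that appears as 00000000ffffffef in the register dumps at the BUG (RAX in the original report, R14 in the 5.8.5 trace, presumably holding ret). A minimal sketch of that decoding, assuming the error code sits in the low 32 bits of the register as a signed int. The helper name reg_to_errno is made up for illustration and is not part of any kernel tooling:

```python
import errno


def reg_to_errno(reg_hex: str) -> int:
    """Interpret the low 32 bits of a hex register value as a signed int."""
    value = int(reg_hex, 16) & 0xFFFFFFFF
    # Two's complement: a set top bit means the 32-bit value is negative.
    if value & 0x80000000:
        value -= 1 << 32
    return value


# Register value copied from the oops output in this thread.
ret = reg_to_errno("00000000ffffffef")
# errno.errorcode maps positive errno numbers to their symbolic names.
name = errno.errorcode.get(-ret, "unknown")
print(f"ret = {ret} ({name})")
```

On Linux this should print "ret = -17 (EEXIST)", which agrees with the btrfs_insert_root return value in the log.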

I did a low-resolution bisect for this issue. I dug up 5.4, 5.5, 5.6,
and 5.7 kernel sources, backported btrfs fixes from 5.4 to the obsolete
kernels, and ran the tests on each kernel. Results:

        5.8: kernel BUG at fs/btrfs/relocation.c:794

        5.7: kernel BUG (same code but different line number)

        5.6: kernel BUG (same as the others)

        5.5: assertion failure (stack trace below)

        5.4: kernel BUG (!)

The 5.4 result is interesting--I've been running 5.4 for some time and
not hit this before. So there are 3 possible theories:

        1. It's because of sending signals to balance. That has been
        added to my test workload after 5.7 was released, so earlier
        tests on 5.4 would not have triggered it.

        2. There's a regression in 5.4-stable, which I've cherry-picked
        to all the other kernels during my test setup. (On the other
        hand, if I don't backport some fixes, kernels 5.5..5.7 crash
        before they get to this bug.)

        3. There's something rotten in my test filesystem, and the
        BUG will go away for a while if I do a mkfs. Qu asked for
        a dump earlier in this thread, and I provided one.

All three of these theories are testable to some extent, so I'll have
my test VM run some variations.

The full test workload is:

        balance metadata or data at random intervals

        scrub, scrub cancel at random intervals

        20x rsync updating files

        snapshot create, delete at random intervals

        bees dedupe (lots of EXTENT_SAME and LOGICAL_INO calls)

        balance cancel at random intervals

        kill -9 $(pidof btrfs balance) at random intervals (NEW - added
        when 5.7 came out)

This is the 5.5 root assertion failure:

Sep 1 04:48:48 regress kernel: [10642.537825][T24161] assertion failed: root, in fs/btrfs/relocation.c:837
Sep 1 04:48:48 regress kernel: [10642.538911][T24161] ------------[ cut here ]------------
Sep 1 04:48:48 regress kernel: [10642.539704][T24161] kernel BUG at fs/btrfs/ctree.h:3125!
Sep 1 04:48:48 regress kernel: [10642.540621][T24161] invalid opcode: 0000 [#1] SMP KASAN PTI Sep 1 04:48:48 regress kernel: [10642.540624][ T4639] irq event stamp: 49626809 Sep 1 04:48:48 regress kernel: [10642.540632][ T4639] hardirqs last enabled at (49626809): [<ffffffff8a00481a>] trace_hardirqs_on_thunk+0x1a/0x1c Sep 1 04:48:48 regress kernel: [10642.541451][T24161] CPU: 1 PID: 24161 Comm: btrfs Tainted: G W 5.5.19-76348822ab91+ #14 Sep 1 04:48:48 regress kernel: [10642.542114][ T4639] hardirqs last disabled at (49626808): [<ffffffff8a004836>] trace_hardirqs_off_thunk+0x1a/0x1c Sep 1 04:48:48 regress kernel: [10642.543693][T24161] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014 Sep 1 04:48:48 regress kernel: [10642.545017][ T4639] softirqs last enabled at (49626726): [<ffffffff8bc0048b>] __do_softirq+0x48b/0x5be Sep 1 04:48:48 regress kernel: [10642.545023][ T4639] softirqs last disabled at (49626715): [<ffffffff8a1258a2>] irq_exit+0x112/0x120 Sep 1 04:48:48 regress kernel: [10642.546536][T24161] RIP: 0010:assertfail.constprop.42+0x1c/0x1e Sep 1 04:48:49 regress kernel: [10642.551589][T24161] Code: 48 c7 c6 c0 90 03 8c e8 89 29 f1 ff 0f 0b 55 89 f1 48 c7 c2 40 82 03 8c 48 89 fe 48 c7 c7 60 83 03 8c 48 89 e5 e8 41 b0 90 ff <0f> 0b 0f 1f 44 00 00 55 48 89 e5 41 54 49 89 f4 53 48 89 fb 48 83 Sep 1 04:48:49 regress kernel: [10642.554495][T24161] RSP: 0018:ffffc90002327150 EFLAGS: 00010282 Sep 1 04:48:49 regress kernel: [10642.555371][T24161] RAX: 0000000000000034 RBX: 000004513701c000 RCX: ffffffff8a264242 Sep 1 04:48:49 regress kernel: [10642.556515][T24161] RDX: 0000000000000000 RSI: 0000000000000008 RDI: ffff8881f580004c Sep 1 04:48:49 regress kernel: [10642.557680][T24161] RBP: ffffc90002327150 R08: ffffed103eb017e1 R09: ffffed103eb017e1 Sep 1 04:48:49 regress kernel: [10642.558895][T24161] R10: ffffed103eb017e0 R11: ffff8881f580bf07 R12: ffff88800d1ea5c0 Sep 1 04:48:49 regress kernel: [10642.560139][T24161] R13: 
ffffc900023273e0 R14: 0000000000000000 R15: ffff8880bbf238f0 Sep 1 04:48:49 regress kernel: [10642.561391][T24161] FS: 00007f03f48488c0(0000) GS:ffff8881f5600000(0000) knlGS:0000000000000000 Sep 1 04:48:49 regress kernel: [10642.562779][T24161] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 Sep 1 04:48:49 regress kernel: [10642.563801][T24161] CR2: 00007f1cab76f718 CR3: 0000000189a5e004 CR4: 00000000001606e0 Sep 1 04:48:49 regress kernel: [10642.565046][T24161] Call Trace: Sep 1 04:48:49 regress kernel: [10642.565565][T24161] build_backref_tree+0x186b/0x2590 Sep 1 04:48:49 regress kernel: [10642.566389][T24161] ? relocate_data_extent+0x1a0/0x1a0 Sep 1 04:48:49 regress kernel: [10642.567295][T24161] ? lock_downgrade+0x3d0/0x3d0 Sep 1 04:48:49 regress kernel: [10642.568142][T24161] ? match_held_lock+0x20/0x260 Sep 1 04:48:49 regress kernel: [10642.568925][T24161] ? do_raw_spin_unlock+0xa8/0x140 Sep 1 04:48:49 regress kernel: [10642.569765][T24161] ? _raw_spin_trylock_bh+0x60/0x80 Sep 1 04:48:49 regress kernel: [10642.570605][T24161] ? release_extent_buffer+0x13b/0x230 Sep 1 04:48:49 regress kernel: [10642.571480][T24161] ? free_extent_buffer.part.45+0xd7/0x140 Sep 1 04:48:49 regress kernel: [10642.572406][T24161] relocate_tree_blocks+0x204/0xa50 Sep 1 04:48:49 regress kernel: [10642.573244][T24161] ? build_backref_tree+0x2590/0x2590 Sep 1 04:48:49 regress kernel: [10642.574103][T24161] ? rb_insert_color+0x3af/0x400 Sep 1 04:48:49 regress kernel: [10642.574896][T24161] ? kmem_cache_alloc_trace+0x5af/0x740 Sep 1 04:48:49 regress kernel: [10642.575785][T24161] ? tree_insert+0x90/0xb0 Sep 1 04:48:49 regress kernel: [10642.576495][T24161] ? add_tree_block.isra.38+0x1d6/0x230 Sep 1 04:48:49 regress kernel: [10642.577387][T24161] relocate_block_group+0x528/0x9d0 Sep 1 04:48:49 regress kernel: [10642.578220][T24161] ? 
merge_reloc_roots+0x470/0x470 Sep 1 04:48:49 regress kernel: [10642.579047][T24161] btrfs_relocate_block_group+0x26e/0x4c0 Sep 1 04:48:49 regress kernel: [10642.579968][T24161] btrfs_relocate_chunk+0x52/0xf0 Sep 1 04:48:49 regress kernel: [10642.580773][T24161] btrfs_balance+0xe5b/0x1800 Sep 1 04:48:49 regress kernel: [10642.581542][T24161] ? btrfs_relocate_chunk+0xf0/0xf0 Sep 1 04:48:49 regress kernel: [10642.582381][T24161] ? kmem_cache_alloc_trace+0x5af/0x740 Sep 1 04:48:49 regress kernel: [10642.583270][T24161] ? _copy_from_user+0xaa/0xd0 Sep 1 04:48:49 regress kernel: [10642.584022][T24161] btrfs_ioctl_balance+0x3de/0x4c0 Sep 1 04:48:49 regress kernel: [10642.584819][T24161] btrfs_ioctl+0x3122/0x4470 Sep 1 04:48:49 regress kernel: [10642.585540][T24161] ? __asan_loadN+0xf/0x20 Sep 1 04:48:49 regress kernel: [10642.586229][T24161] ? __asan_loadN+0xf/0x20 Sep 1 04:48:49 regress kernel: [10642.586920][T24161] ? btrfs_ioctl_get_supported_features+0x30/0x30 Sep 1 04:48:49 regress kernel: [10642.587935][T24161] ? __asan_loadN+0xf/0x20 Sep 1 04:48:49 regress kernel: [10642.588649][T24161] ? pvclock_clocksource_read+0xeb/0x190 Sep 1 04:48:49 regress kernel: [10642.589566][T24161] ? __asan_loadN+0xf/0x20 Sep 1 04:48:49 regress kernel: [10642.590254][T24161] ? pvclock_clocksource_read+0xeb/0x190 Sep 1 04:48:49 regress kernel: [10642.591128][T24161] ? __kasan_check_read+0x11/0x20 Sep 1 04:48:49 regress kernel: [10642.591913][T24161] ? check_chain_key+0x1e6/0x2e0 Sep 1 04:48:49 regress kernel: [10642.592707][T24161] ? __asan_loadN+0xf/0x20 Sep 1 04:48:49 regress kernel: [10642.593409][T24161] ? pvclock_clocksource_read+0xeb/0x190 Sep 1 04:48:49 regress kernel: [10642.594312][T24161] ? kvm_sched_clock_read+0x18/0x30 Sep 1 04:48:49 regress kernel: [10642.595139][T24161] ? check_chain_key+0x1e6/0x2e0 Sep 1 04:48:49 regress kernel: [10642.595929][T24161] ? 
sched_clock_cpu+0x1b/0x120 Sep 1 04:48:49 regress kernel: [10642.596712][T24161] do_vfs_ioctl+0x13e/0xad0 Sep 1 04:48:49 regress kernel: [10642.597432][T24161] ? btrfs_ioctl_get_supported_features+0x30/0x30 Sep 1 04:48:49 regress kernel: [10642.598455][T24161] ? do_vfs_ioctl+0x13e/0xad0 Sep 1 04:48:49 regress kernel: [10642.599202][T24161] ? compat_ioctl_preallocate+0x170/0x170 Sep 1 04:48:49 regress kernel: [10642.600128][T24161] ? __kasan_check_write+0x14/0x20 Sep 1 04:48:49 regress kernel: [10642.600949][T24161] ? up_read+0x176/0x4f0 Sep 1 04:48:49 regress kernel: [10642.601648][T24161] ? down_write_nested+0x2d0/0x2d0 Sep 1 04:48:49 regress kernel: [10642.602476][T24161] ? handle_mm_fault+0x211/0x480 Sep 1 04:48:49 regress kernel: [10642.603263][T24161] ? __kasan_check_read+0x11/0x20 Sep 1 04:48:49 regress kernel: [10642.604062][T24161] ? __fget_light+0xb2/0x110 Sep 1 04:48:49 regress kernel: [10642.604805][T24161] ksys_ioctl+0x67/0x90 Sep 1 04:48:49 regress kernel: [10642.605471][T24161] __x64_sys_ioctl+0x43/0x50 Sep 1 04:48:49 regress kernel: [10642.606203][T24161] do_syscall_64+0x77/0x2d0 Sep 1 04:48:49 regress kernel: [10642.606933][T24161] entry_SYSCALL_64_after_hwframe+0x49/0xbe Sep 1 04:48:49 regress kernel: [10642.607875][T24161] RIP: 0033:0x7f03f493b427 Sep 1 04:48:49 regress kernel: [10642.608588][T24161] Code: 00 00 90 48 8b 05 69 aa 0c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 39 aa 0c 00 f7 d8 64 89 01 48 Sep 1 04:48:49 regress kernel: [10642.611697][T24161] RSP: 002b:00007ffd6bd78fb8 EFLAGS: 00000206 ORIG_RAX: 0000000000000010 Sep 1 04:48:49 regress kernel: [10642.613035][T24161] RAX: ffffffffffffffda RBX: 00007ffd6bd79058 RCX: 00007f03f493b427 Sep 1 04:48:49 regress kernel: [10642.614313][T24161] RDX: 00007ffd6bd79058 RSI: 00000000c4009420 RDI: 0000000000000003 Sep 1 04:48:49 regress kernel: [10642.615605][T24161] RBP: 0000000000000003 R08: 
0000000000000003 R09: 0000000000000078 Sep 1 04:48:49 regress kernel: [10642.616877][T24161] R10: fffffffffffff5ab R11: 0000000000000206 R12: 0000000000000001 Sep 1 04:48:49 regress kernel: [10642.618132][T24161] R13: 0000000000000000 R14: 00007ffd6bd7aa46 R15: 0000000000000001 Sep 1 04:48:49 regress kernel: [10642.619378][T24161] Modules linked in: Sep 1 04:48:49 regress kernel: [10642.620153][T24161] ---[ end trace a33c17a7d43dd973 ]---
* Re: BUG at fs/btrfs/relocation.c:794! Still happening on misc-next and 5.8.3
  2020-09-01 22:53 ` Zygo Blaxell
@ 2020-09-01 23:33 ` Qu Wenruo
  2020-09-02 0:14 ` Zygo Blaxell
  0 siblings, 1 reply; 13+ messages in thread
From: Qu Wenruo @ 2020-09-01 23:33 UTC
To: Zygo Blaxell, Nikolay Borisov; +Cc: David Sterba, linux-btrfs, wqu

This looks like a race between some reloc tree creation from some other
part.

If you could add debug output for create_reloc_root() and its callers,
we may have a chance to debug it.

But for the first step, we can hunt down the BUG_ON()s first and make it
exit more gracefully.

I'll try to spare some time to do this in the following week.

Thanks,
Qu

On 2020/9/2 6:53 AM, Zygo Blaxell wrote:
> On Fri, Aug 28, 2020 at 04:42:55PM -0400, Zygo Blaxell wrote:
>> On Fri, Aug 28, 2020 at 09:34:31AM +0300, Nikolay Borisov wrote:
>>> On 28.08.20 at 3:08, Zygo Blaxell wrote:
>>>> On Thu, Aug 27, 2020 at 08:03:13PM -0400, Zygo Blaxell wrote:
>>>>> On Tue, Aug 04, 2020 at 12:16:26PM -0400, Zygo Blaxell wrote:
>>>
>>> <snip>
>>>
>>> Since you can repro reliably could you modify the code in
>>> create_reloc_root so it prints what's the returned error value, I'd
>>> speculate it's EEXIST from
>>>
>>> btrfs_insert_root
>>> btrfs_insert_item
>>> btrfs_insert_empty_item
>>> btrfs_insert_empty_items
>>> btrfs_search_slot
>>>
>>> But better be sure.
>>
>> Here you go, EEXIST == 17:
>>
>> Aug 28 15:30:55 regress kernel: [18452.845182][T31546] BTRFS info (device dm-0): balance: start -dlimit=9
>> Aug 28 15:30:55 regress kernel: [18452.874627][T31546] BTRFS info (device dm-0): relocating block group 15873413742592 flags data
>> Aug 28 15:30:55 regress kernel: [18453.097516][ T2100] btrfs_search_slot ret = 0
>> Aug 28 15:30:55 regress kernel: [18453.104865][ T2100] btrfs_search_slot ret = 0
>> Aug 28 15:30:55 regress kernel: [18453.109751][ T2100] btrfs_search_slot ret = 0
>> Aug 28 15:30:56 regress kernel: [18454.453135][ T2100] btrfs_search_slot ret = 0
>> Aug 28 15:30:56 regress kernel: [18454.453955][ T2100] btrfs_insert_empty_item ret = -17
>> Aug 28 15:30:56 regress kernel: [18454.455022][ T2100] btrfs_insert_root ret = -17
>> Aug 28 15:30:56 regress kernel: [18454.456229][ T2100] ------------[ cut here ]------------
>> Aug 28 15:30:56 regress kernel: [18454.457622][ T2100] kernel BUG at fs/btrfs/relocation.c:795!
>
> I did a low-resolution bisect for this issue. I dug up 5.4, 5.5, 5.6,
> and 5.7 kernel sources, backported btrfs fixes from 5.4 to the obsolete
> kernels, and ran the tests on each kernel. Results:
>
>         5.8: kernel BUG at fs/btrfs/relocation.c:794
>
>         5.7: kernel BUG (same code but different line number)
>
>         5.6: kernel BUG (same as the others)
>
>         5.5: assertion failure (stack trace below)
>
>         5.4: kernel BUG (!)
>
> The 5.4 result is interesting--I've been running 5.4 for some time and
> not hit this before. So there are 3 possible theories:
>
>         1. It's because of sending signals to balance. That has been
>         added to my test workload after 5.7 was released, so earlier
>         tests on 5.4 would not have triggered it.
>
>         2. There's a regression in 5.4-stable, which I've cherry-picked
>         to all the other kernels during my test setup. (On the other
>         hand, if I don't backport some fixes, kernels 5.5..5.7 crash
>         before they get to this bug.)
>
>         3. There's something rotten in my test filesystem, and the
>         BUG will go away for a while if I do a mkfs. Qu asked for
>         a dump earlier in this thread, and I provided one.
>
> All three of these theories are testable to some extent, so I'll have
> my test VM run some variations.
>
> The full test workload is:
>
>         balance metadata or data at random intervals
>
>         scrub, scrub cancel at random intervals
>
>         20x rsync updating files
>
>         snapshot create, delete at random intervals
>
>         bees dedupe (lots of EXTENT_SAME and LOGICAL_INO calls)
>
>         balance cancel at random intervals
>
>         kill -9 $(pidof btrfs balance) at random intervals (NEW - added
>         when 5.7 came out)
>
> This is the 5.5 root assertion failure:
>
> Sep 1 04:48:48 regress kernel: [10642.537825][T24161] assertion failed: root, in fs/btrfs/relocation.c:837
> Sep 1 04:48:48 regress kernel: [10642.538911][T24161] ------------[ cut here ]------------
> Sep 1 04:48:48 regress kernel: [10642.539704][T24161] kernel BUG at fs/btrfs/ctree.h:3125!
> Sep 1 04:48:48 regress kernel: [10642.540621][T24161] invalid opcode: 0000 [#1] SMP KASAN PTI > Sep 1 04:48:48 regress kernel: [10642.540624][ T4639] irq event stamp: 49626809 > Sep 1 04:48:48 regress kernel: [10642.540632][ T4639] hardirqs last enabled at (49626809): [<ffffffff8a00481a>] trace_hardirqs_on_thunk+0x1a/0x1c > Sep 1 04:48:48 regress kernel: [10642.541451][T24161] CPU: 1 PID: 24161 Comm: btrfs Tainted: G W 5.5.19-76348822ab91+ #14 > Sep 1 04:48:48 regress kernel: [10642.542114][ T4639] hardirqs last disabled at (49626808): [<ffffffff8a004836>] trace_hardirqs_off_thunk+0x1a/0x1c > Sep 1 04:48:48 regress kernel: [10642.543693][T24161] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014 > Sep 1 04:48:48 regress kernel: [10642.545017][ T4639] softirqs last enabled at (49626726): [<ffffffff8bc0048b>] __do_softirq+0x48b/0x5be > Sep 1 04:48:48 regress kernel: [10642.545023][ T4639] softirqs last disabled at (49626715): [<ffffffff8a1258a2>] irq_exit+0x112/0x120 > Sep 1 04:48:48 regress kernel: [10642.546536][T24161] RIP: 0010:assertfail.constprop.42+0x1c/0x1e > Sep 1 04:48:49 regress kernel: [10642.551589][T24161] Code: 48 c7 c6 c0 90 03 8c e8 89 29 f1 ff 0f 0b 55 89 f1 48 c7 c2 40 82 03 8c 48 89 fe 48 c7 c7 60 83 03 8c 48 89 e5 e8 41 b0 90 ff <0f> 0b 0f 1f 44 00 00 55 48 89 e5 41 54 49 89 f4 53 48 89 fb 48 83 > Sep 1 04:48:49 regress kernel: [10642.554495][T24161] RSP: 0018:ffffc90002327150 EFLAGS: 00010282 > Sep 1 04:48:49 regress kernel: [10642.555371][T24161] RAX: 0000000000000034 RBX: 000004513701c000 RCX: ffffffff8a264242 > Sep 1 04:48:49 regress kernel: [10642.556515][T24161] RDX: 0000000000000000 RSI: 0000000000000008 RDI: ffff8881f580004c > Sep 1 04:48:49 regress kernel: [10642.557680][T24161] RBP: ffffc90002327150 R08: ffffed103eb017e1 R09: ffffed103eb017e1 > Sep 1 04:48:49 regress kernel: [10642.558895][T24161] R10: ffffed103eb017e0 R11: ffff8881f580bf07 R12: ffff88800d1ea5c0 > Sep 1 04:48:49 regress kernel: 
[10642.560139][T24161] R13: ffffc900023273e0 R14: 0000000000000000 R15: ffff8880bbf238f0 > Sep 1 04:48:49 regress kernel: [10642.561391][T24161] FS: 00007f03f48488c0(0000) GS:ffff8881f5600000(0000) knlGS:0000000000000000 > Sep 1 04:48:49 regress kernel: [10642.562779][T24161] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > Sep 1 04:48:49 regress kernel: [10642.563801][T24161] CR2: 00007f1cab76f718 CR3: 0000000189a5e004 CR4: 00000000001606e0 > Sep 1 04:48:49 regress kernel: [10642.565046][T24161] Call Trace: > Sep 1 04:48:49 regress kernel: [10642.565565][T24161] build_backref_tree+0x186b/0x2590 > Sep 1 04:48:49 regress kernel: [10642.566389][T24161] ? relocate_data_extent+0x1a0/0x1a0 > Sep 1 04:48:49 regress kernel: [10642.567295][T24161] ? lock_downgrade+0x3d0/0x3d0 > Sep 1 04:48:49 regress kernel: [10642.568142][T24161] ? match_held_lock+0x20/0x260 > Sep 1 04:48:49 regress kernel: [10642.568925][T24161] ? do_raw_spin_unlock+0xa8/0x140 > Sep 1 04:48:49 regress kernel: [10642.569765][T24161] ? _raw_spin_trylock_bh+0x60/0x80 > Sep 1 04:48:49 regress kernel: [10642.570605][T24161] ? release_extent_buffer+0x13b/0x230 > Sep 1 04:48:49 regress kernel: [10642.571480][T24161] ? free_extent_buffer.part.45+0xd7/0x140 > Sep 1 04:48:49 regress kernel: [10642.572406][T24161] relocate_tree_blocks+0x204/0xa50 > Sep 1 04:48:49 regress kernel: [10642.573244][T24161] ? build_backref_tree+0x2590/0x2590 > Sep 1 04:48:49 regress kernel: [10642.574103][T24161] ? rb_insert_color+0x3af/0x400 > Sep 1 04:48:49 regress kernel: [10642.574896][T24161] ? kmem_cache_alloc_trace+0x5af/0x740 > Sep 1 04:48:49 regress kernel: [10642.575785][T24161] ? tree_insert+0x90/0xb0 > Sep 1 04:48:49 regress kernel: [10642.576495][T24161] ? add_tree_block.isra.38+0x1d6/0x230 > Sep 1 04:48:49 regress kernel: [10642.577387][T24161] relocate_block_group+0x528/0x9d0 > Sep 1 04:48:49 regress kernel: [10642.578220][T24161] ? 
merge_reloc_roots+0x470/0x470 > Sep 1 04:48:49 regress kernel: [10642.579047][T24161] btrfs_relocate_block_group+0x26e/0x4c0 > Sep 1 04:48:49 regress kernel: [10642.579968][T24161] btrfs_relocate_chunk+0x52/0xf0 > Sep 1 04:48:49 regress kernel: [10642.580773][T24161] btrfs_balance+0xe5b/0x1800 > Sep 1 04:48:49 regress kernel: [10642.581542][T24161] ? btrfs_relocate_chunk+0xf0/0xf0 > Sep 1 04:48:49 regress kernel: [10642.582381][T24161] ? kmem_cache_alloc_trace+0x5af/0x740 > Sep 1 04:48:49 regress kernel: [10642.583270][T24161] ? _copy_from_user+0xaa/0xd0 > Sep 1 04:48:49 regress kernel: [10642.584022][T24161] btrfs_ioctl_balance+0x3de/0x4c0 > Sep 1 04:48:49 regress kernel: [10642.584819][T24161] btrfs_ioctl+0x3122/0x4470 > Sep 1 04:48:49 regress kernel: [10642.585540][T24161] ? __asan_loadN+0xf/0x20 > Sep 1 04:48:49 regress kernel: [10642.586229][T24161] ? __asan_loadN+0xf/0x20 > Sep 1 04:48:49 regress kernel: [10642.586920][T24161] ? btrfs_ioctl_get_supported_features+0x30/0x30 > Sep 1 04:48:49 regress kernel: [10642.587935][T24161] ? __asan_loadN+0xf/0x20 > Sep 1 04:48:49 regress kernel: [10642.588649][T24161] ? pvclock_clocksource_read+0xeb/0x190 > Sep 1 04:48:49 regress kernel: [10642.589566][T24161] ? __asan_loadN+0xf/0x20 > Sep 1 04:48:49 regress kernel: [10642.590254][T24161] ? pvclock_clocksource_read+0xeb/0x190 > Sep 1 04:48:49 regress kernel: [10642.591128][T24161] ? __kasan_check_read+0x11/0x20 > Sep 1 04:48:49 regress kernel: [10642.591913][T24161] ? check_chain_key+0x1e6/0x2e0 > Sep 1 04:48:49 regress kernel: [10642.592707][T24161] ? __asan_loadN+0xf/0x20 > Sep 1 04:48:49 regress kernel: [10642.593409][T24161] ? pvclock_clocksource_read+0xeb/0x190 > Sep 1 04:48:49 regress kernel: [10642.594312][T24161] ? kvm_sched_clock_read+0x18/0x30 > Sep 1 04:48:49 regress kernel: [10642.595139][T24161] ? check_chain_key+0x1e6/0x2e0 > Sep 1 04:48:49 regress kernel: [10642.595929][T24161] ? 
sched_clock_cpu+0x1b/0x120 > Sep 1 04:48:49 regress kernel: [10642.596712][T24161] do_vfs_ioctl+0x13e/0xad0 > Sep 1 04:48:49 regress kernel: [10642.597432][T24161] ? btrfs_ioctl_get_supported_features+0x30/0x30 > Sep 1 04:48:49 regress kernel: [10642.598455][T24161] ? do_vfs_ioctl+0x13e/0xad0 > Sep 1 04:48:49 regress kernel: [10642.599202][T24161] ? compat_ioctl_preallocate+0x170/0x170 > Sep 1 04:48:49 regress kernel: [10642.600128][T24161] ? __kasan_check_write+0x14/0x20 > Sep 1 04:48:49 regress kernel: [10642.600949][T24161] ? up_read+0x176/0x4f0 > Sep 1 04:48:49 regress kernel: [10642.601648][T24161] ? down_write_nested+0x2d0/0x2d0 > Sep 1 04:48:49 regress kernel: [10642.602476][T24161] ? handle_mm_fault+0x211/0x480 > Sep 1 04:48:49 regress kernel: [10642.603263][T24161] ? __kasan_check_read+0x11/0x20 > Sep 1 04:48:49 regress kernel: [10642.604062][T24161] ? __fget_light+0xb2/0x110 > Sep 1 04:48:49 regress kernel: [10642.604805][T24161] ksys_ioctl+0x67/0x90 > Sep 1 04:48:49 regress kernel: [10642.605471][T24161] __x64_sys_ioctl+0x43/0x50 > Sep 1 04:48:49 regress kernel: [10642.606203][T24161] do_syscall_64+0x77/0x2d0 > Sep 1 04:48:49 regress kernel: [10642.606933][T24161] entry_SYSCALL_64_after_hwframe+0x49/0xbe > Sep 1 04:48:49 regress kernel: [10642.607875][T24161] RIP: 0033:0x7f03f493b427 > Sep 1 04:48:49 regress kernel: [10642.608588][T24161] Code: 00 00 90 48 8b 05 69 aa 0c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 39 aa 0c 00 f7 d8 64 89 01 48 > Sep 1 04:48:49 regress kernel: [10642.611697][T24161] RSP: 002b:00007ffd6bd78fb8 EFLAGS: 00000206 ORIG_RAX: 0000000000000010 > Sep 1 04:48:49 regress kernel: [10642.613035][T24161] RAX: ffffffffffffffda RBX: 00007ffd6bd79058 RCX: 00007f03f493b427 > Sep 1 04:48:49 regress kernel: [10642.614313][T24161] RDX: 00007ffd6bd79058 RSI: 00000000c4009420 RDI: 0000000000000003 > Sep 1 04:48:49 regress kernel: [10642.615605][T24161] 
RBP: 0000000000000003 R08: 0000000000000003 R09: 0000000000000078 > Sep 1 04:48:49 regress kernel: [10642.616877][T24161] R10: fffffffffffff5ab R11: 0000000000000206 R12: 0000000000000001 > Sep 1 04:48:49 regress kernel: [10642.618132][T24161] R13: 0000000000000000 R14: 00007ffd6bd7aa46 R15: 0000000000000001 > Sep 1 04:48:49 regress kernel: [10642.619378][T24161] Modules linked in: > Sep 1 04:48:49 regress kernel: [10642.620153][T24161] ---[ end trace a33c17a7d43dd973 ]---
* Re: BUG at fs/btrfs/relocation.c:794! Still happening on misc-next and 5.8.3 2020-09-01 23:33 ` Qu Wenruo @ 2020-09-02 0:14 ` Zygo Blaxell 2020-09-02 1:46 ` Qu Wenruo 0 siblings, 1 reply; 13+ messages in thread From: Zygo Blaxell @ 2020-09-02 0:14 UTC (permalink / raw) To: Qu Wenruo; +Cc: Nikolay Borisov, David Sterba, linux-btrfs, wqu On Wed, Sep 02, 2020 at 07:33:21AM +0800, Qu Wenruo wrote: > This looks like a race between some reloc tree creation from some other > part. > > If you could add debug output for create_reloc_root() and its callers, > we may have a chance to debug it. The callers are always the same: btrfs_init_reloc_root+0x1b0 record_root_in_trans+0x18c record_root_in_trans+0x8b start_transaction+0x189 (gdb) l *(create_reloc_root+0x468) 0xffffffff81930848 is in create_reloc_root (fs/btrfs/relocation.c:1503). 1498 btrfs_tree_unlock(eb); 1499 free_extent_buffer(eb); 1500 1501 ret = btrfs_insert_root(trans, fs_info->tree_root, 1502 &root_key, root_item); 1503 BUG_ON(ret); 1504 kfree(root_item); 1505 1506 reloc_root = btrfs_read_tree_root(fs_info->tree_root, &root_key); 1507 BUG_ON(IS_ERR(reloc_root)); (gdb) l *(btrfs_init_reloc_root+0x1b0) 0xffffffff81937db0 is in btrfs_init_reloc_root (fs/btrfs/relocation.c:1567). 1562 if (!trans->reloc_reserved) { 1563 rsv = trans->block_rsv; 1564 trans->block_rsv = rc->block_rsv; 1565 clear_rsv = 1; 1566 } 1567 reloc_root = create_reloc_root(trans, root, root->root_key.objectid); 1568 if (clear_rsv) 1569 trans->block_rsv = rsv; 1570 1571 ret = __add_reloc_root(reloc_root); (gdb) l *(record_root_in_trans+0x18c) 0xffffffff81889bfc is in record_root_in_trans (./include/asm-generic/bitops/instrumented-atomic.h:41). 36 * 37 * This is a relaxed atomic operation (no implied memory barriers). 
38 */ 39 static inline void clear_bit(long nr, volatile unsigned long *addr) 40 { 41 kasan_check_write(addr + BIT_WORD(nr), sizeof(long)); 42 arch_clear_bit(nr, addr); 43 } 44 45 /** (gdb) l *(start_transaction+0x189) 0xffffffff8188f0d9 is in start_transaction (fs/btrfs/transaction.c:697). 692 * Thus it need to be called after current->journal_info initialized, 693 * or we can deadlock. 694 */ 695 btrfs_record_root_in_trans(h, root); 696 697 return h; 698 699 join_fail: 700 if (type & __TRANS_FREEZABLE) 701 sb_end_intwrite(fs_info->sb); (gdb) quit It seems to be very early in the transaction. Is there anything to output here? Or are we more interested in what is left over from the previous transaction? > But for the first step, we can hunt down the BUG_ON()s first and make it > exist more gracefully. > > I'll try to spare some time to do this in the following week. > > Thanks, > Qu > > On 2020/9/2 上午6:53, Zygo Blaxell wrote: > > On Fri, Aug 28, 2020 at 04:42:55PM -0400, Zygo Blaxell wrote: > >> On Fri, Aug 28, 2020 at 09:34:31AM +0300, Nikolay Borisov wrote: > >>> On 28.08.20 г. 3:08 ч., Zygo Blaxell wrote: > >>>> On Thu, Aug 27, 2020 at 08:03:13PM -0400, Zygo Blaxell wrote: > >>>>> On Tue, Aug 04, 2020 at 12:16:26PM -0400, Zygo Blaxell wrote: > >>> > >>> <snip> > >>> > >>> Since you can repro reliably could you modify the code in > >>> create_reloc_root so it prints what's the returned error value, I'd > >>> speculate it's EEXIST from > >>> > >>> btrfs_insert_root > >>> btrfs_insert_item > >>> btrfs_insert_empty_item > >>> btrfs_insert_empty_items > >>> btrfs_search_slot > >>> > >>> But better be sure. 
> >> > >> Here you go, EEXIST == 17: > >> > >> Aug 28 15:30:55 regress kernel: [18452.845182][T31546] BTRFS info (device dm-0): balance: start -dlimit=9 > >> Aug 28 15:30:55 regress kernel: [18452.874627][T31546] BTRFS info (device dm-0): relocating block group 15873413742592 flags data > >> Aug 28 15:30:55 regress kernel: [18453.097516][ T2100] btrfs_search_slot ret = 0 > >> Aug 28 15:30:55 regress kernel: [18453.104865][ T2100] btrfs_search_slot ret = 0 > >> Aug 28 15:30:55 regress kernel: [18453.109751][ T2100] btrfs_search_slot ret = 0 > >> Aug 28 15:30:56 regress kernel: [18454.453135][ T2100] btrfs_search_slot ret = 0 > >> Aug 28 15:30:56 regress kernel: [18454.453955][ T2100] btrfs_insert_empty_item ret = -17 > >> Aug 28 15:30:56 regress kernel: [18454.455022][ T2100] btrfs_insert_root ret = -17 > >> Aug 28 15:30:56 regress kernel: [18454.456229][ T2100] ------------[ cut here ]------------ > >> Aug 28 15:30:56 regress kernel: [18454.457622][ T2100] kernel BUG at fs/btrfs/relocation.c:795! > > > > I did a low-resolution bisect for this issue. I dug up 5.4, 5.5, 5.6, > > and 5.7 kernel sources, backported btrfs fixes from 5.4 to the obsolete > > kernels, and ran the tests on each kernel. Results: > > > > 5.8: kernel BUG at fs/btrfs/relocation.c:794 > > > > 5.7: kernel BUG (same code but different line number) > > > > 5.6: kernel BUG (same as the others) > > > > 5.5: assertion failure (stack trace below) > > > > 5.4: kernel BUG (!) > > > > The 5.4 result is interesting--I've been running 5.4 for some time and > > not hit this before. So there are 3 possible theories: > > > > 1. It's because of sending signals to balance. That has been > > added to my test workload after 5.7 was released, so earlier > > tests on 5.4 would not have triggered it. > > > > 2. There's a regression in 5.4-stable, which I've cherry-picked > > to all the other kernels during my test setup. 
(On the other > > hand, if I don't backport some fixes, kernels 5.5..5.7 crash > > before they get to this bug.) > > > > 3. There's something rotten in my test filesystem, and the > > BUG will go away for a while if I do a mkfs. Qu asked for > > a dump earlier in this thread, and I provided one. > > > > All three of these theories are testable to some extent, so I'll have > > my test VM run some variations. > > > > The full test workload is: > > > > balance metadata or data at random intervals > > > > scrub, scrub cancel at random intervals > > > > 20x rsync updating files > > > > snapshot create, delete at random intervals > > > > bees dedupe (lots of EXTENT_SAME and LOGICAL_INO calls) > > > > balance cancel at random intervals > > > > kill -9 $(pidof btrfs balance) at random intervals (NEW - added > > when 5.7 came out) > > > > This is the 5.5 root assertion failure: > > > > Sep 1 04:48:48 regress kernel: [10642.537825][T24161] assertion failed: root, in fs/btrfs/relocation.c:837 > > Sep 1 04:48:48 regress kernel: [10642.538911][T24161] ------------[ cut here ]------------ > > Sep 1 04:48:48 regress kernel: [10642.539704][T24161] kernel BUG at fs/btrfs/ctree.h:3125! 
> > Sep 1 04:48:48 regress kernel: [10642.540621][T24161] invalid opcode: 0000 [#1] SMP KASAN PTI > > Sep 1 04:48:48 regress kernel: [10642.540624][ T4639] irq event stamp: 49626809 > > Sep 1 04:48:48 regress kernel: [10642.540632][ T4639] hardirqs last enabled at (49626809): [<ffffffff8a00481a>] trace_hardirqs_on_thunk+0x1a/0x1c > > Sep 1 04:48:48 regress kernel: [10642.541451][T24161] CPU: 1 PID: 24161 Comm: btrfs Tainted: G W 5.5.19-76348822ab91+ #14 > > Sep 1 04:48:48 regress kernel: [10642.542114][ T4639] hardirqs last disabled at (49626808): [<ffffffff8a004836>] trace_hardirqs_off_thunk+0x1a/0x1c > > Sep 1 04:48:48 regress kernel: [10642.543693][T24161] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014 > > Sep 1 04:48:48 regress kernel: [10642.545017][ T4639] softirqs last enabled at (49626726): [<ffffffff8bc0048b>] __do_softirq+0x48b/0x5be > > Sep 1 04:48:48 regress kernel: [10642.545023][ T4639] softirqs last disabled at (49626715): [<ffffffff8a1258a2>] irq_exit+0x112/0x120 > > Sep 1 04:48:48 regress kernel: [10642.546536][T24161] RIP: 0010:assertfail.constprop.42+0x1c/0x1e > > Sep 1 04:48:49 regress kernel: [10642.551589][T24161] Code: 48 c7 c6 c0 90 03 8c e8 89 29 f1 ff 0f 0b 55 89 f1 48 c7 c2 40 82 03 8c 48 89 fe 48 c7 c7 60 83 03 8c 48 89 e5 e8 41 b0 90 ff <0f> 0b 0f 1f 44 00 00 55 48 89 e5 41 54 49 89 f4 53 48 89 fb 48 83 > > Sep 1 04:48:49 regress kernel: [10642.554495][T24161] RSP: 0018:ffffc90002327150 EFLAGS: 00010282 > > Sep 1 04:48:49 regress kernel: [10642.555371][T24161] RAX: 0000000000000034 RBX: 000004513701c000 RCX: ffffffff8a264242 > > Sep 1 04:48:49 regress kernel: [10642.556515][T24161] RDX: 0000000000000000 RSI: 0000000000000008 RDI: ffff8881f580004c > > Sep 1 04:48:49 regress kernel: [10642.557680][T24161] RBP: ffffc90002327150 R08: ffffed103eb017e1 R09: ffffed103eb017e1 > > Sep 1 04:48:49 regress kernel: [10642.558895][T24161] R10: ffffed103eb017e0 R11: ffff8881f580bf07 R12: ffff88800d1ea5c0 > > Sep 1 
04:48:49 regress kernel: [10642.560139][T24161] R13: ffffc900023273e0 R14: 0000000000000000 R15: ffff8880bbf238f0 > > Sep 1 04:48:49 regress kernel: [10642.561391][T24161] FS: 00007f03f48488c0(0000) GS:ffff8881f5600000(0000) knlGS:0000000000000000 > > Sep 1 04:48:49 regress kernel: [10642.562779][T24161] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > Sep 1 04:48:49 regress kernel: [10642.563801][T24161] CR2: 00007f1cab76f718 CR3: 0000000189a5e004 CR4: 00000000001606e0 > > Sep 1 04:48:49 regress kernel: [10642.565046][T24161] Call Trace: > > Sep 1 04:48:49 regress kernel: [10642.565565][T24161] build_backref_tree+0x186b/0x2590 > > Sep 1 04:48:49 regress kernel: [10642.566389][T24161] ? relocate_data_extent+0x1a0/0x1a0 > > Sep 1 04:48:49 regress kernel: [10642.567295][T24161] ? lock_downgrade+0x3d0/0x3d0 > > Sep 1 04:48:49 regress kernel: [10642.568142][T24161] ? match_held_lock+0x20/0x260 > > Sep 1 04:48:49 regress kernel: [10642.568925][T24161] ? do_raw_spin_unlock+0xa8/0x140 > > Sep 1 04:48:49 regress kernel: [10642.569765][T24161] ? _raw_spin_trylock_bh+0x60/0x80 > > Sep 1 04:48:49 regress kernel: [10642.570605][T24161] ? release_extent_buffer+0x13b/0x230 > > Sep 1 04:48:49 regress kernel: [10642.571480][T24161] ? free_extent_buffer.part.45+0xd7/0x140 > > Sep 1 04:48:49 regress kernel: [10642.572406][T24161] relocate_tree_blocks+0x204/0xa50 > > Sep 1 04:48:49 regress kernel: [10642.573244][T24161] ? build_backref_tree+0x2590/0x2590 > > Sep 1 04:48:49 regress kernel: [10642.574103][T24161] ? rb_insert_color+0x3af/0x400 > > Sep 1 04:48:49 regress kernel: [10642.574896][T24161] ? kmem_cache_alloc_trace+0x5af/0x740 > > Sep 1 04:48:49 regress kernel: [10642.575785][T24161] ? tree_insert+0x90/0xb0 > > Sep 1 04:48:49 regress kernel: [10642.576495][T24161] ? add_tree_block.isra.38+0x1d6/0x230 > > Sep 1 04:48:49 regress kernel: [10642.577387][T24161] relocate_block_group+0x528/0x9d0 > > Sep 1 04:48:49 regress kernel: [10642.578220][T24161] ? 
merge_reloc_roots+0x470/0x470 > > Sep 1 04:48:49 regress kernel: [10642.579047][T24161] btrfs_relocate_block_group+0x26e/0x4c0 > > Sep 1 04:48:49 regress kernel: [10642.579968][T24161] btrfs_relocate_chunk+0x52/0xf0 > > Sep 1 04:48:49 regress kernel: [10642.580773][T24161] btrfs_balance+0xe5b/0x1800 > > Sep 1 04:48:49 regress kernel: [10642.581542][T24161] ? btrfs_relocate_chunk+0xf0/0xf0 > > Sep 1 04:48:49 regress kernel: [10642.582381][T24161] ? kmem_cache_alloc_trace+0x5af/0x740 > > Sep 1 04:48:49 regress kernel: [10642.583270][T24161] ? _copy_from_user+0xaa/0xd0 > > Sep 1 04:48:49 regress kernel: [10642.584022][T24161] btrfs_ioctl_balance+0x3de/0x4c0 > > Sep 1 04:48:49 regress kernel: [10642.584819][T24161] btrfs_ioctl+0x3122/0x4470 > > Sep 1 04:48:49 regress kernel: [10642.585540][T24161] ? __asan_loadN+0xf/0x20 > > Sep 1 04:48:49 regress kernel: [10642.586229][T24161] ? __asan_loadN+0xf/0x20 > > Sep 1 04:48:49 regress kernel: [10642.586920][T24161] ? btrfs_ioctl_get_supported_features+0x30/0x30 > > Sep 1 04:48:49 regress kernel: [10642.587935][T24161] ? __asan_loadN+0xf/0x20 > > Sep 1 04:48:49 regress kernel: [10642.588649][T24161] ? pvclock_clocksource_read+0xeb/0x190 > > Sep 1 04:48:49 regress kernel: [10642.589566][T24161] ? __asan_loadN+0xf/0x20 > > Sep 1 04:48:49 regress kernel: [10642.590254][T24161] ? pvclock_clocksource_read+0xeb/0x190 > > Sep 1 04:48:49 regress kernel: [10642.591128][T24161] ? __kasan_check_read+0x11/0x20 > > Sep 1 04:48:49 regress kernel: [10642.591913][T24161] ? check_chain_key+0x1e6/0x2e0 > > Sep 1 04:48:49 regress kernel: [10642.592707][T24161] ? __asan_loadN+0xf/0x20 > > Sep 1 04:48:49 regress kernel: [10642.593409][T24161] ? pvclock_clocksource_read+0xeb/0x190 > > Sep 1 04:48:49 regress kernel: [10642.594312][T24161] ? kvm_sched_clock_read+0x18/0x30 > > Sep 1 04:48:49 regress kernel: [10642.595139][T24161] ? check_chain_key+0x1e6/0x2e0 > > Sep 1 04:48:49 regress kernel: [10642.595929][T24161] ? 
sched_clock_cpu+0x1b/0x120 > > Sep 1 04:48:49 regress kernel: [10642.596712][T24161] do_vfs_ioctl+0x13e/0xad0 > > Sep 1 04:48:49 regress kernel: [10642.597432][T24161] ? btrfs_ioctl_get_supported_features+0x30/0x30 > > Sep 1 04:48:49 regress kernel: [10642.598455][T24161] ? do_vfs_ioctl+0x13e/0xad0 > > Sep 1 04:48:49 regress kernel: [10642.599202][T24161] ? compat_ioctl_preallocate+0x170/0x170 > > Sep 1 04:48:49 regress kernel: [10642.600128][T24161] ? __kasan_check_write+0x14/0x20 > > Sep 1 04:48:49 regress kernel: [10642.600949][T24161] ? up_read+0x176/0x4f0 > > Sep 1 04:48:49 regress kernel: [10642.601648][T24161] ? down_write_nested+0x2d0/0x2d0 > > Sep 1 04:48:49 regress kernel: [10642.602476][T24161] ? handle_mm_fault+0x211/0x480 > > Sep 1 04:48:49 regress kernel: [10642.603263][T24161] ? __kasan_check_read+0x11/0x20 > > Sep 1 04:48:49 regress kernel: [10642.604062][T24161] ? __fget_light+0xb2/0x110 > > Sep 1 04:48:49 regress kernel: [10642.604805][T24161] ksys_ioctl+0x67/0x90 > > Sep 1 04:48:49 regress kernel: [10642.605471][T24161] __x64_sys_ioctl+0x43/0x50 > > Sep 1 04:48:49 regress kernel: [10642.606203][T24161] do_syscall_64+0x77/0x2d0 > > Sep 1 04:48:49 regress kernel: [10642.606933][T24161] entry_SYSCALL_64_after_hwframe+0x49/0xbe > > Sep 1 04:48:49 regress kernel: [10642.607875][T24161] RIP: 0033:0x7f03f493b427 > > Sep 1 04:48:49 regress kernel: [10642.608588][T24161] Code: 00 00 90 48 8b 05 69 aa 0c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 39 aa 0c 00 f7 d8 64 89 01 48 > > Sep 1 04:48:49 regress kernel: [10642.611697][T24161] RSP: 002b:00007ffd6bd78fb8 EFLAGS: 00000206 ORIG_RAX: 0000000000000010 > > Sep 1 04:48:49 regress kernel: [10642.613035][T24161] RAX: ffffffffffffffda RBX: 00007ffd6bd79058 RCX: 00007f03f493b427 > > Sep 1 04:48:49 regress kernel: [10642.614313][T24161] RDX: 00007ffd6bd79058 RSI: 00000000c4009420 RDI: 0000000000000003 > > Sep 1 04:48:49 
regress kernel: [10642.615605][T24161] RBP: 0000000000000003 R08: 0000000000000003 R09: 0000000000000078 > > Sep 1 04:48:49 regress kernel: [10642.616877][T24161] R10: fffffffffffff5ab R11: 0000000000000206 R12: 0000000000000001 > > Sep 1 04:48:49 regress kernel: [10642.618132][T24161] R13: 0000000000000000 R14: 00007ffd6bd7aa46 R15: 0000000000000001 > > Sep 1 04:48:49 regress kernel: [10642.619378][T24161] Modules linked in: > > Sep 1 04:48:49 regress kernel: [10642.620153][T24161] ---[ end trace a33c17a7d43dd973 ]---
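For readers following the debug output quoted above: the "ret = -17" values are negated errno codes, and 17 is EEXIST on Linux. A minimal userspace sketch of that convention follows, assuming a Linux errno table. ret_is_eexist() and ret_to_text() are hypothetical helper names made up for illustration, not kernel functions.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/* Hypothetical helper: kernel-style return values are 0 or -errno. */
static int ret_is_eexist(int ret)
{
	return ret == -EEXIST;
}

/* Hypothetical helper: map a kernel-style return value to text. */
static const char *ret_to_text(int ret)
{
	assert(ret <= 0);	/* positive values are not error returns */
	return ret == 0 ? "success" : strerror(-ret);
}
```

With these, ret_is_eexist(-17) is true on Linux, which matches the "btrfs_insert_root ret = -17" line in the quoted log.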
* Re: BUG at fs/btrfs/relocation.c:794! Still happening on misc-next and 5.8.3 2020-09-02 0:14 ` Zygo Blaxell @ 2020-09-02 1:46 ` Qu Wenruo 2020-09-04 15:54 ` Zygo Blaxell 0 siblings, 1 reply; 13+ messages in thread From: Qu Wenruo @ 2020-09-02 1:46 UTC (permalink / raw) To: Zygo Blaxell; +Cc: Nikolay Borisov, David Sterba, linux-btrfs, wqu On 2020/9/2 上午8:14, Zygo Blaxell wrote: > On Wed, Sep 02, 2020 at 07:33:21AM +0800, Qu Wenruo wrote: >> This looks like a race between some reloc tree creation from some other >> part. >> >> If you could add debug output for create_reloc_root() and its callers, >> we may have a chance to debug it. > > The callers are always the same: > > btrfs_init_reloc_root+0x1b0 > record_root_in_trans+0x18c > record_root_in_trans+0x8b > start_transaction+0x189 > > (gdb) l *(create_reloc_root+0x468) > 0xffffffff81930848 is in create_reloc_root (fs/btrfs/relocation.c:1503). > 1498 btrfs_tree_unlock(eb); > 1499 free_extent_buffer(eb); > 1500 > 1501 ret = btrfs_insert_root(trans, fs_info->tree_root, > 1502 &root_key, root_item); > 1503 BUG_ON(ret); > 1504 kfree(root_item); > 1505 > 1506 reloc_root = btrfs_read_tree_root(fs_info->tree_root, &root_key); > 1507 BUG_ON(IS_ERR(reloc_root)); > (gdb) l *(btrfs_init_reloc_root+0x1b0) > 0xffffffff81937db0 is in btrfs_init_reloc_root (fs/btrfs/relocation.c:1567). > 1562 if (!trans->reloc_reserved) { > 1563 rsv = trans->block_rsv; > 1564 trans->block_rsv = rc->block_rsv; > 1565 clear_rsv = 1; > 1566 } > 1567 reloc_root = create_reloc_root(trans, root, root->root_key.objectid); > 1568 if (clear_rsv) > 1569 trans->block_rsv = rsv; > 1570 > 1571 ret = __add_reloc_root(reloc_root); > (gdb) l *(record_root_in_trans+0x18c) > 0xffffffff81889bfc is in record_root_in_trans (./include/asm-generic/bitops/instrumented-atomic.h:41). > 36 * > 37 * This is a relaxed atomic operation (no implied memory barriers). 
> 38 */ > 39 static inline void clear_bit(long nr, volatile unsigned long *addr) > 40 { > 41 kasan_check_write(addr + BIT_WORD(nr), sizeof(long)); > 42 arch_clear_bit(nr, addr); > 43 } > 44 > 45 /** > (gdb) l *(start_transaction+0x189) > 0xffffffff8188f0d9 is in start_transaction (fs/btrfs/transaction.c:697). > 692 * Thus it need to be called after current->journal_info initialized, > 693 * or we can deadlock. > 694 */ > 695 btrfs_record_root_in_trans(h, root); > 696 > 697 return h; > 698 > 699 join_fail: > 700 if (type & __TRANS_FREEZABLE) > 701 sb_end_intwrite(fs_info->sb); > (gdb) quit > > It seems to be very early in the transaction. Is there anything to > output here? Or are we more interested in what is left over from > the previous transaction? What I mean is, I want to see who else created the reloc tree, not only who caused the EEXIST BUG_ON(). That's why I hope to add enough debug pr_info or whatever for create_reloc_root(), so that we can catch the ordinary calls that seem safe but may be unsafe for other callers. Thanks, Qu > >> But for the first step, we can hunt down the BUG_ON()s first and make it >> exist more gracefully. >> >> I'll try to spare some time to do this in the following week. >> >> Thanks, >> Qu >> >> On 2020/9/2 上午6:53, Zygo Blaxell wrote: >>> On Fri, Aug 28, 2020 at 04:42:55PM -0400, Zygo Blaxell wrote: >>>> On Fri, Aug 28, 2020 at 09:34:31AM +0300, Nikolay Borisov wrote: >>>>> On 28.08.20 г. 3:08 ч., Zygo Blaxell wrote: >>>>>> On Thu, Aug 27, 2020 at 08:03:13PM -0400, Zygo Blaxell wrote: >>>>>>> On Tue, Aug 04, 2020 at 12:16:26PM -0400, Zygo Blaxell wrote: >>>>> >>>>> <snip> >>>>> >>>>> Since you can repro reliably could you modify the code in >>>>> create_reloc_root so it prints what's the returned error value, I'd >>>>> speculate it's EEXIST from >>>>> >>>>> btrfs_insert_root >>>>> btrfs_insert_item >>>>> btrfs_insert_empty_item >>>>> btrfs_insert_empty_items >>>>> btrfs_search_slot >>>>> >>>>> But better be sure.
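The request in this message, to log every call that creates a reloc root rather than only the one that hits -EEXIST, can be sketched in userspace as below. This is only a model under stated assumptions: struct root_slot, find_creator() and insert_root_logged() are hypothetical names, the fixed-size array stands in for the root tree, and none of this is the btrfs implementation. In the kernel the equivalent would presumably be a pr_info() plus dump_stack() near the btrfs_insert_root() call.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

#define MAX_ROOTS 16

/* Hypothetical stand-in for one root item: key plus who inserted it. */
struct root_slot {
	unsigned long long objectid;
	const char *creator;
};

static struct root_slot slots[MAX_ROOTS];
static int nr_slots;

/* Return the recorded creator of objectid, or NULL if never inserted. */
static const char *find_creator(unsigned long long objectid)
{
	for (int i = 0; i < nr_slots; i++)
		if (slots[i].objectid == objectid)
			return slots[i].creator;
	return NULL;
}

/*
 * Hypothetical model of a logged insert: every call is printed, so the
 * log shows the earlier, successful caller as well as the one that
 * gets -EEXIST.
 */
static int insert_root_logged(unsigned long long objectid, const char *caller)
{
	const char *prev;

	assert(caller != NULL);
	prev = find_creator(objectid);
	if (prev) {
		fprintf(stderr, "insert objectid=%llu by %s: exists, created by %s\n",
			objectid, caller, prev);
		return -EEXIST;
	}
	if (nr_slots == MAX_ROOTS)
		return -ENOSPC;
	slots[nr_slots].objectid = objectid;
	slots[nr_slots].creator = caller;
	nr_slots++;
	fprintf(stderr, "insert objectid=%llu by %s: ok\n", objectid, caller);
	return 0;
}
```

Calling insert_root_logged() twice with the same objectid and two different caller labels (the labels are illustrative, not a claim about which kernel paths actually race) returns 0 and then -EEXIST, and the second log line names the first caller. That first caller is the piece of information the BUG_ON() trace alone does not show.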
* Re: BUG at fs/btrfs/relocation.c:794! Still happening on misc-next and 5.8.3 2020-09-02 1:46 ` Qu Wenruo @ 2020-09-04 15:54 ` Zygo Blaxell 0 siblings, 0 replies; 13+ messages in thread From: Zygo Blaxell @ 2020-09-04 15:54 UTC (permalink / raw) To: Qu Wenruo; +Cc: Nikolay Borisov, David Sterba, linux-btrfs, wqu On Wed, Sep 02, 2020 at 09:46:29AM +0800, Qu Wenruo wrote: > On 2020/9/2 上午8:14, Zygo Blaxell wrote: > > On Wed, Sep 02, 2020 at 07:33:21AM +0800, Qu Wenruo wrote: > >> This looks like a race between some reloc tree creation from some other > >> part. > >> > >> If you could add debug output for create_reloc_root() and its callers, > >> we may have a chance to debug it. > > What I mean is, I want to see who else created the reloc tree, not only > who caused the EEXIST BUG_ON(). > > That's why I hope to add enough debug pr_info or whatever for > create_reloc_root(), so that we can catch the ordinary calls that seems > safe but may be unsafe for other callers. There doesn't appear to be a race with multiple instances of create_reloc_root as nobody else seems to be calling it at the time when it fails. On the other hand, it is a kworker thread, so it could be racing with something else. 

Sep 4 01:46:42 regress kernel: [12131.050264][ T5245] btrfs_search_slot ret = 0
Sep 4 01:46:51 regress kernel: [12140.058734][ T5245] btrfs_search_slot ret = 0
Sep 4 01:47:00 regress kernel: [12149.079892][ T5245] btrfs_search_slot ret = 0
Sep 4 01:47:09 regress kernel: [12158.091883][ T5245] btrfs_search_slot ret = 0
Sep 4 01:47:14 regress kernel: [12162.521167][ T2993] btrfs_search_slot ret = 0
Sep 4 01:47:14 regress kernel: [12162.823894][ T2993] btrfs_search_slot ret = 0
Sep 4 01:47:14 regress kernel: [12162.991624][ T2993] btrfs_search_slot ret = 0
Sep 4 01:47:14 regress kernel: [12162.999665][ T2993] btrfs_search_slot ret = 0
Sep 4 01:47:19 regress kernel: [12167.117620][ T5245] btrfs_search_slot ret = 0
Sep 4 01:47:28 regress kernel: [12176.232713][ T5245] btrfs_search_slot ret = 0
Sep 4 01:47:37 regress kernel: [12185.237905][ T5245] btrfs_search_slot ret = 0
Sep 4 01:47:50 regress kernel: [12199.005753][ T5245] btrfs_search_slot ret = 0
Sep 4 01:47:51 regress kernel: [12199.953977][T27716] BTRFS info (device dm-0): balance: start -dlimit=9
Sep 4 01:47:51 regress kernel: [12199.992918][T27716] BTRFS info (device dm-0): relocating block group 16502453436416 flags data
Sep 4 01:47:54 regress kernel: [12202.443621][T11829] root->root_key.objectid == 0, objectid = 10502
Sep 4 01:47:54 regress kernel: [12202.444916][T11829] CPU: 0 PID: 11829 Comm: kworker/u8:20 Tainted: G W 5.8.6-ce459d8ff170+ #8
Sep 4 01:47:54 regress kernel: [12202.446791][T11829] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
Sep 4 01:47:54 regress kernel: [12202.449187][T11829] Workqueue: btrfs-endio-write btrfs_work_helper
Sep 4 01:47:54 regress kernel: [12202.450355][T11829] Call Trace:
Sep 4 01:47:54 regress kernel: [12202.451580][T11829] dump_stack+0xc8/0x11a
Sep 4 01:47:54 regress kernel: [12202.452475][T11829] create_reloc_root.cold.42+0x60/0x4d9
Sep 4 01:47:54 regress kernel: [12202.453512][T11829] ? invalidate_extent_cache+0x2a0/0x2a0
Sep 4 01:47:54 regress kernel: [12202.454538][T11829] ? check_chain_key+0x1e6/0x2e0
Sep 4 01:47:54 regress kernel: [12202.455479][T11829] btrfs_init_reloc_root+0x2d7/0x310
Sep 4 01:47:54 regress kernel: [12202.456493][T11829] ? find_reloc_root+0x200/0x200
Sep 4 01:47:54 regress kernel: [12202.457510][T11829] ? do_raw_spin_unlock+0xa8/0x140
Sep 4 01:47:54 regress kernel: [12202.458446][T11829] record_root_in_trans+0x18c/0x1d0
Sep 4 01:47:54 regress kernel: [12202.459394][T11829] btrfs_record_root_in_trans+0x8b/0xc0
Sep 4 01:47:54 regress kernel: [12202.460673][T11829] start_transaction+0x16b/0x8f0
Sep 4 01:47:54 regress kernel: [12202.461595][T11829] btrfs_join_transaction+0x1d/0x20
Sep 4 01:47:54 regress kernel: [12202.462586][T11829] btrfs_finish_ordered_io+0x535/0xd10
Sep 4 01:47:54 regress kernel: [12202.463590][T11829] ? register_lock_class+0x900/0x900
Sep 4 01:47:54 regress kernel: [12202.464576][T11829] ? btrfs_update_inode_fallback+0x40/0x40
Sep 4 01:47:54 regress kernel: [12202.465584][T11829] ? rcu_read_lock_sched_held+0xa1/0xd0
Sep 4 01:47:54 regress kernel: [12202.466547][T11829] ? rcu_read_lock_bh_held+0xb0/0xb0
Sep 4 01:47:54 regress kernel: [12202.467463][T11829] ? lock_is_held_type+0xc9/0x100
Sep 4 01:47:54 regress kernel: [12202.468371][T11829] finish_ordered_fn+0x15/0x20
Sep 4 01:47:54 regress kernel: [12202.469224][T11829] btrfs_work_helper+0x118/0x920
Sep 4 01:47:54 regress kernel: [12202.470105][T11829] ? rcu_read_lock_bh_held+0xb0/0xb0
Sep 4 01:47:54 regress kernel: [12202.471082][T11829] ? trace_hardirqs_on+0x57/0x140
Sep 4 01:47:54 regress kernel: [12202.471998][T11829] process_one_work+0x507/0xa70
Sep 4 01:47:54 regress kernel: [12202.472885][T11829] ? pwq_dec_nr_in_flight+0x130/0x130
Sep 4 01:47:54 regress kernel: [12202.473816][T11829] ? do_raw_spin_lock+0x1e0/0x1e0
Sep 4 01:47:54 regress kernel: [12202.474716][T11829] worker_thread+0x63/0x5a0
Sep 4 01:47:54 regress kernel: [12202.475510][T11829] ? process_one_work+0xa70/0xa70
Sep 4 01:47:54 regress kernel: [12202.476428][T11829] kthread+0x20c/0x230
Sep 4 01:47:54 regress kernel: [12202.477137][T11829] ? kthread_create_worker_on_cpu+0xc0/0xc0
Sep 4 01:47:54 regress kernel: [12202.478152][T11829] ret_from_fork+0x22/0x30
Sep 4 01:47:54 regress kernel: [12202.480616][T11829] btrfs_search_slot ret = 0
Sep 4 01:47:54 regress kernel: [12202.482834][T11829] btrfs_insert_empty_item ret = -17
Sep 4 01:47:54 regress kernel: [12202.485447][T11829] btrfs_insert_root ret = -17
Sep 4 01:47:54 regress kernel: [12202.487775][T11829] ------------[ cut here ]------------
Sep 4 01:47:54 regress kernel: [12202.490086][T11829] kernel BUG at fs/btrfs/relocation.c:798!
Sep 4 01:47:54 regress kernel: [12202.491104][T11829] invalid opcode: 0000 [#1] SMP KASAN PTI
Sep 4 01:47:54 regress kernel: [12202.492056][T11829] CPU: 1 PID: 11829 Comm: kworker/u8:20 Tainted: G W 5.8.6-ce459d8ff170+ #8
Sep 4 01:47:54 regress kernel: [12202.493712][T11829] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
Sep 4 01:47:54 regress kernel: [12202.495311][T11829] Workqueue: btrfs-endio-write btrfs_work_helper
Sep 4 01:47:54 regress kernel: [12202.496424][T11829] RIP: 0010:create_reloc_root.cold.42+0x434/0x4d9
Sep 4 01:47:54 regress kernel: [12202.497550][T11829] Code: e8 6c d6 f3 ff 48 c7 c7 e0 98 03 8f 89 c6 89 85 30 ff ff ff e8 0d 53 8c ff 8b 95 30 ff ff ff 4c 8b 8d 28 ff ff ff 85 d2 74 02 <0f> 0b 4c 89 cf e8 fd 56 bc ff 4c 89 e7 e8 45 9c bc ff 49 8b 7f 20
Sep 4 01:47:54 regress kernel: [12202.501225][T11829] RSP: 0018:ffffc9000b80f920 EFLAGS: 00010282
Sep 4 01:47:54 regress kernel: [12202.503239][T11829] RAX: 000000000000001b RBX: 1ffff92001701f29 RCX: ffffffff8d273af2
Sep 4 01:47:54 regress kernel: [12202.504644][T11829] RDX: 00000000ffffffef RSI: 0000000000000008 RDI: ffff8881f59ff28c
Sep 4 01:47:54 regress kernel: [12202.507025][T11829] RBP: ffffc9000b80fa10 R08: ffffed103eb41645 R09: ffff8880a598b400
Sep 4 01:47:54 regress kernel: [12202.509429][T11829] R10: ffff8881f5a0b227 R11: ffffed103eb41644 R12: ffff8881c93e8020
Sep 4 01:47:54 regress kernel: [12202.510781][T11829] R13: ffff8881cbefd2a0 R14: ffffc9000b80f9a8 R15: ffff8881c93e8000
Sep 4 01:47:54 regress kernel: [12202.512142][T11829] FS: 0000000000000000(0000) GS:ffff8881f5800000(0000) knlGS:0000000000000000
Sep 4 01:47:54 regress kernel: [12202.513651][T11829] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Sep 4 01:47:54 regress kernel: [12202.514790][T11829] CR2: 00007fb4c11f0a68 CR3: 00000001dc604005 CR4: 00000000001606e0
Sep 4 01:47:54 regress kernel: [12202.516258][T11829] Call Trace:

For reference, here's my kernel logging so far:

diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index 82ab6e5a386d..b98b74397afc 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -4748,10 +4748,14 @@ int btrfs_insert_empty_items(struct btrfs_trans_handle *trans,
 
 	total_size = total_data + (nr * sizeof(struct btrfs_item));
 	ret = btrfs_search_slot(trans, root, cpu_key, path, total_size, 1);
-	if (ret == 0)
+	if (ret == 0) {
+		printk(KERN_ERR "btrfs_search_slot ret = %d\n", ret);
 		return -EEXIST;
-	if (ret < 0)
+	}
+	if (ret < 0) {
+		printk(KERN_ERR "btrfs_search_slot ret = %d\n", ret);
 		return ret;
+	}
 
 	slot = path->slots[0];
 	BUG_ON(slot < 0);
@@ -4775,14 +4779,18 @@ int btrfs_insert_item(struct btrfs_trans_handle *trans, struct btrfs_root *root,
 	unsigned long ptr;
 
 	path = btrfs_alloc_path();
-	if (!path)
+	if (!path) {
+		printk(KERN_ERR "btrfs_alloc_path ENOMEM\n");
 		return -ENOMEM;
+	}
 	ret = btrfs_insert_empty_item(trans, root, path, cpu_key, data_size);
 	if (!ret) {
 		leaf = path->nodes[0];
 		ptr = btrfs_item_ptr_offset(leaf, path->slots[0]);
 		write_extent_buffer(leaf, data, ptr, data_size);
 		btrfs_mark_buffer_dirty(leaf);
+	} else {
+		printk(KERN_ERR "btrfs_insert_empty_item ret = %d\n", ret);
 	}
 	btrfs_free_path(path);
 	return ret;
diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 350050b288e4..23fffd4bfd41 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -744,6 +744,9 @@ static struct btrfs_root *create_reloc_root(struct btrfs_trans_handle *trans,
 	root_key.type = BTRFS_ROOT_ITEM_KEY;
 	root_key.offset = objectid;
 
+	printk(KERN_ERR "root->root_key.objectid == %zu, objectid = %zu\n", ret, root->root_key.objectid, objectid);
+	dump_stack();
+
 	if (root->root_key.objectid == objectid) {
 		u64 commit_root_gen;
 
@@ -791,6 +794,7 @@ static struct btrfs_root *create_reloc_root(struct btrfs_trans_handle *trans,
 
 	ret = btrfs_insert_root(trans, fs_info->tree_root,
 				&root_key, root_item);
+	printk(KERN_ERR "btrfs_insert_root ret = %d\n", ret);
 	BUG_ON(ret);
 	kfree(root_item);
 

> >>> The 5.4 result is interesting--I've been running 5.4 for some time and
> >>> not hit this before. So there are 3 possible theories:
> >>>
> >>> 1. It's because of sending signals to balance. That has been
> >>> added to my test workload after 5.7 was released, so earlier
> >>> tests on 5.4 would not have triggered it.

This might be related. I removed 'kill the balance process' from my test
workload, and didn't have any BUG_ONs for two days. When I put the
kill-the-balance-process test back in the workload, it went back to
BUG_ON at fairly reliable 1-5 hour intervals.

Of course that's just correlation, and with random events at that, but
so far the data supports theory 1 and refutes theory 3.

> >>> 2. There's a regression in 5.4-stable, which I've cherry-picked
> >>> to all the other kernels during my test setup. (On the other
> >>> hand, if I don't backport some fixes, kernels 5.5..5.7 crash
> >>> before they get to this bug.)
> >>>
> >>> 3. There's something rotten in my test filesystem, and the
> >>> BUG will go away for a while if I do a mkfs. Qu asked for
> >>> a dump earlier in this thread, and I provided one.
> >>>
> >>> All three of these theories are testable to some extent, so I'll have
> >>> my test VM run some variations.
> >>>
> >>> The full test workload is:
> >>>
> >>> balance metadata or data at random intervals
> >>>
> >>> scrub, scrub cancel at random intervals
> >>>
> >>> 20x rsync updating files
> >>>
> >>> snapshot create, delete at random intervals
> >>>
> >>> bees dedupe (lots of EXTENT_SAME and LOGICAL_INO calls)
> >>>
> >>> balance cancel at random intervals
> >>>
> >>> kill -9 $(pidof btrfs balance) at random intervals (NEW - added
> >>> when 5.7 came out)
> >>>
> >>> This is the 5.5 root assertion failure:
> >>>
> >>> Sep 1 04:48:48 regress kernel: [10642.537825][T24161] assertion failed: root, in fs/btrfs/relocation.c:837
> >>> Sep 1 04:48:48 regress kernel: [10642.538911][T24161] ------------[ cut here ]------------
> >>> Sep 1 04:48:48 regress kernel: [10642.539704][T24161] kernel BUG at fs/btrfs/ctree.h:3125!
> >>> Sep 1 04:48:48 regress kernel: [10642.540621][T24161] invalid opcode: 0000 [#1] SMP KASAN PTI
> >>> Sep 1 04:48:48 regress kernel: [10642.540624][ T4639] irq event stamp: 49626809
> >>> Sep 1 04:48:48 regress kernel: [10642.540632][ T4639] hardirqs last enabled at (49626809): [<ffffffff8a00481a>] trace_hardirqs_on_thunk+0x1a/0x1c
> >>> Sep 1 04:48:48 regress kernel: [10642.541451][T24161] CPU: 1 PID: 24161 Comm: btrfs Tainted: G W 5.5.19-76348822ab91+ #14
> >>> Sep 1 04:48:48 regress kernel: [10642.542114][ T4639] hardirqs last disabled at (49626808): [<ffffffff8a004836>] trace_hardirqs_off_thunk+0x1a/0x1c
> >>> Sep 1 04:48:48 regress kernel: [10642.543693][T24161] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
> >>> Sep 1 04:48:48 regress kernel: [10642.545017][ T4639] softirqs last enabled at (49626726): [<ffffffff8bc0048b>] __do_softirq+0x48b/0x5be
> >>> Sep 1 04:48:48 regress kernel: [10642.545023][ T4639] softirqs last disabled at (49626715): [<ffffffff8a1258a2>] irq_exit+0x112/0x120
> >>> Sep 1 04:48:48 regress kernel: [10642.546536][T24161] RIP: 0010:assertfail.constprop.42+0x1c/0x1e
> >>> Sep 1 04:48:49 regress kernel: [10642.551589][T24161] Code: 48 c7 c6 c0 90 03 8c e8 89 29 f1 ff 0f 0b 55 89 f1 48 c7 c2 40 82 03 8c 48 89 fe 48 c7 c7 60 83 03 8c 48 89 e5 e8 41 b0 90 ff <0f> 0b 0f 1f 44 00 00 55 48 89 e5 41 54 49 89 f4 53 48 89 fb 48 83
> >>> Sep 1 04:48:49 regress kernel: [10642.554495][T24161] RSP: 0018:ffffc90002327150 EFLAGS: 00010282
> >>> Sep 1 04:48:49 regress kernel: [10642.555371][T24161] RAX: 0000000000000034 RBX: 000004513701c000 RCX: ffffffff8a264242
> >>> Sep 1 04:48:49 regress kernel: [10642.556515][T24161] RDX: 0000000000000000 RSI: 0000000000000008 RDI: ffff8881f580004c
> >>> Sep 1 04:48:49 regress kernel: [10642.557680][T24161] RBP: ffffc90002327150 R08: ffffed103eb017e1 R09: ffffed103eb017e1
> >>> Sep 1 04:48:49 regress kernel: [10642.558895][T24161] R10: ffffed103eb017e0 R11: ffff8881f580bf07 R12: ffff88800d1ea5c0
> >>> Sep 1 04:48:49 regress kernel: [10642.560139][T24161] R13: ffffc900023273e0 R14: 0000000000000000 R15: ffff8880bbf238f0
> >>> Sep 1 04:48:49 regress kernel: [10642.561391][T24161] FS: 00007f03f48488c0(0000) GS:ffff8881f5600000(0000) knlGS:0000000000000000
> >>> Sep 1 04:48:49 regress kernel: [10642.562779][T24161] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >>> Sep 1 04:48:49 regress kernel: [10642.563801][T24161] CR2: 00007f1cab76f718 CR3: 0000000189a5e004 CR4: 00000000001606e0
> >>> Sep 1 04:48:49 regress kernel: [10642.565046][T24161] Call Trace:
> >>> Sep 1 04:48:49 regress kernel: [10642.565565][T24161] build_backref_tree+0x186b/0x2590
> >>> Sep 1 04:48:49 regress kernel: [10642.566389][T24161] ? relocate_data_extent+0x1a0/0x1a0
> >>> Sep 1 04:48:49 regress kernel: [10642.567295][T24161] ? lock_downgrade+0x3d0/0x3d0
> >>> Sep 1 04:48:49 regress kernel: [10642.568142][T24161] ? match_held_lock+0x20/0x260
> >>> Sep 1 04:48:49 regress kernel: [10642.568925][T24161] ? do_raw_spin_unlock+0xa8/0x140
> >>> Sep 1 04:48:49 regress kernel: [10642.569765][T24161] ? _raw_spin_trylock_bh+0x60/0x80
> >>> Sep 1 04:48:49 regress kernel: [10642.570605][T24161] ? release_extent_buffer+0x13b/0x230
> >>> Sep 1 04:48:49 regress kernel: [10642.571480][T24161] ? free_extent_buffer.part.45+0xd7/0x140
> >>> Sep 1 04:48:49 regress kernel: [10642.572406][T24161] relocate_tree_blocks+0x204/0xa50
> >>> Sep 1 04:48:49 regress kernel: [10642.573244][T24161] ? build_backref_tree+0x2590/0x2590
> >>> Sep 1 04:48:49 regress kernel: [10642.574103][T24161] ? rb_insert_color+0x3af/0x400
> >>> Sep 1 04:48:49 regress kernel: [10642.574896][T24161] ? kmem_cache_alloc_trace+0x5af/0x740
> >>> Sep 1 04:48:49 regress kernel: [10642.575785][T24161] ? tree_insert+0x90/0xb0
> >>> Sep 1 04:48:49 regress kernel: [10642.576495][T24161] ? add_tree_block.isra.38+0x1d6/0x230
> >>> Sep 1 04:48:49 regress kernel: [10642.577387][T24161] relocate_block_group+0x528/0x9d0
> >>> Sep 1 04:48:49 regress kernel: [10642.578220][T24161] ? merge_reloc_roots+0x470/0x470
> >>> Sep 1 04:48:49 regress kernel: [10642.579047][T24161] btrfs_relocate_block_group+0x26e/0x4c0
> >>> Sep 1 04:48:49 regress kernel: [10642.579968][T24161] btrfs_relocate_chunk+0x52/0xf0
> >>> Sep 1 04:48:49 regress kernel: [10642.580773][T24161] btrfs_balance+0xe5b/0x1800
> >>> Sep 1 04:48:49 regress kernel: [10642.581542][T24161] ? btrfs_relocate_chunk+0xf0/0xf0
> >>> Sep 1 04:48:49 regress kernel: [10642.582381][T24161] ? kmem_cache_alloc_trace+0x5af/0x740
> >>> Sep 1 04:48:49 regress kernel: [10642.583270][T24161] ? _copy_from_user+0xaa/0xd0
> >>> Sep 1 04:48:49 regress kernel: [10642.584022][T24161] btrfs_ioctl_balance+0x3de/0x4c0
> >>> Sep 1 04:48:49 regress kernel: [10642.584819][T24161] btrfs_ioctl+0x3122/0x4470
> >>> Sep 1 04:48:49 regress kernel: [10642.585540][T24161] ? __asan_loadN+0xf/0x20
> >>> Sep 1 04:48:49 regress kernel: [10642.586229][T24161] ? __asan_loadN+0xf/0x20
> >>> Sep 1 04:48:49 regress kernel: [10642.586920][T24161] ? btrfs_ioctl_get_supported_features+0x30/0x30
> >>> Sep 1 04:48:49 regress kernel: [10642.587935][T24161] ? __asan_loadN+0xf/0x20
> >>> Sep 1 04:48:49 regress kernel: [10642.588649][T24161] ? pvclock_clocksource_read+0xeb/0x190
> >>> Sep 1 04:48:49 regress kernel: [10642.589566][T24161] ? __asan_loadN+0xf/0x20
> >>> Sep 1 04:48:49 regress kernel: [10642.590254][T24161] ? pvclock_clocksource_read+0xeb/0x190
> >>> Sep 1 04:48:49 regress kernel: [10642.591128][T24161] ? __kasan_check_read+0x11/0x20
> >>> Sep 1 04:48:49 regress kernel: [10642.591913][T24161] ? check_chain_key+0x1e6/0x2e0
> >>> Sep 1 04:48:49 regress kernel: [10642.592707][T24161] ? __asan_loadN+0xf/0x20
> >>> Sep 1 04:48:49 regress kernel: [10642.593409][T24161] ? pvclock_clocksource_read+0xeb/0x190
> >>> Sep 1 04:48:49 regress kernel: [10642.594312][T24161] ? kvm_sched_clock_read+0x18/0x30
> >>> Sep 1 04:48:49 regress kernel: [10642.595139][T24161] ? check_chain_key+0x1e6/0x2e0
> >>> Sep 1 04:48:49 regress kernel: [10642.595929][T24161] ? sched_clock_cpu+0x1b/0x120
> >>> Sep 1 04:48:49 regress kernel: [10642.596712][T24161] do_vfs_ioctl+0x13e/0xad0
> >>> Sep 1 04:48:49 regress kernel: [10642.597432][T24161] ? btrfs_ioctl_get_supported_features+0x30/0x30
> >>> Sep 1 04:48:49 regress kernel: [10642.598455][T24161] ? do_vfs_ioctl+0x13e/0xad0
> >>> Sep 1 04:48:49 regress kernel: [10642.599202][T24161] ? compat_ioctl_preallocate+0x170/0x170
> >>> Sep 1 04:48:49 regress kernel: [10642.600128][T24161] ? __kasan_check_write+0x14/0x20
> >>> Sep 1 04:48:49 regress kernel: [10642.600949][T24161] ? up_read+0x176/0x4f0
> >>> Sep 1 04:48:49 regress kernel: [10642.601648][T24161] ? down_write_nested+0x2d0/0x2d0
> >>> Sep 1 04:48:49 regress kernel: [10642.602476][T24161] ? handle_mm_fault+0x211/0x480
> >>> Sep 1 04:48:49 regress kernel: [10642.603263][T24161] ? __kasan_check_read+0x11/0x20
> >>> Sep 1 04:48:49 regress kernel: [10642.604062][T24161] ? __fget_light+0xb2/0x110
> >>> Sep 1 04:48:49 regress kernel: [10642.604805][T24161] ksys_ioctl+0x67/0x90
> >>> Sep 1 04:48:49 regress kernel: [10642.605471][T24161] __x64_sys_ioctl+0x43/0x50
> >>> Sep 1 04:48:49 regress kernel: [10642.606203][T24161] do_syscall_64+0x77/0x2d0
> >>> Sep 1 04:48:49 regress kernel: [10642.606933][T24161] entry_SYSCALL_64_after_hwframe+0x49/0xbe
> >>> Sep 1 04:48:49 regress kernel: [10642.607875][T24161] RIP: 0033:0x7f03f493b427
> >>> Sep 1 04:48:49 regress kernel: [10642.608588][T24161] Code: 00 00 90 48 8b 05 69 aa 0c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 39 aa 0c 00 f7 d8 64 89 01 48
> >>> Sep 1 04:48:49 regress kernel: [10642.611697][T24161] RSP: 002b:00007ffd6bd78fb8 EFLAGS: 00000206 ORIG_RAX: 0000000000000010
> >>> Sep 1 04:48:49 regress kernel: [10642.613035][T24161] RAX: ffffffffffffffda RBX: 00007ffd6bd79058 RCX: 00007f03f493b427
> >>> Sep 1 04:48:49 regress kernel: [10642.614313][T24161] RDX: 00007ffd6bd79058 RSI: 00000000c4009420 RDI: 0000000000000003
> >>> Sep 1 04:48:49 regress kernel: [10642.615605][T24161] RBP: 0000000000000003 R08: 0000000000000003 R09: 0000000000000078
> >>> Sep 1 04:48:49 regress kernel: [10642.616877][T24161] R10: fffffffffffff5ab R11: 0000000000000206 R12: 0000000000000001
> >>> Sep 1 04:48:49 regress kernel: [10642.618132][T24161] R13: 0000000000000000 R14: 00007ffd6bd7aa46 R15: 0000000000000001
> >>> Sep 1 04:48:49 regress kernel: [10642.619378][T24161] Modules linked in:
> >>> Sep 1 04:48:49 regress kernel: [10642.620153][T24161] ---[ end trace a33c17a7d43dd973 ]---
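
A note for readers of the Sep 4 debug output in the message above. In the relocation.c hunk, the first printk passes three arguments (ret, root->root_key.objectid, objectid) to a format string with only two %zu conversions, so the logged "root->root_key.objectid == 0, objectid = 10502" may actually be showing ret and root->root_key.objectid rather than the two object IDs. Separately, the register dump shows RDX: 00000000ffffffef, which matches the logged "btrfs_insert_root ret = -17" when the low 32 bits are read as a signed int. The following is only a userspace sketch of both points, not kernel code and not part of the posted patch; reg_to_int, describe_ret and format_reloc_debug are hypothetical helper names, and using %llu for the 64-bit IDs is an assumption about what a corrected format might look like.

```c
#include <errno.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Read the low 32 bits of a 64-bit register value as a signed int, the
 * way an `int ret` left in RDX/RAX would be interpreted. Written without
 * relying on implementation-defined narrowing casts.
 */
static int reg_to_int(uint64_t reg)
{
	uint32_t low = (uint32_t)reg;

	if (low > (uint32_t)INT32_MAX)
		return -(int)(UINT32_MAX - low) - 1;
	return (int)low;
}

/* Map a negative errno-style return value to its message text. */
static const char *describe_ret(int ret)
{
	return ret < 0 ? strerror(-ret) : "success";
}

/*
 * Hypothetical analogue of the debug print with one conversion per
 * argument: %d for the int, %llu for each 64-bit object ID.
 */
static int format_reloc_debug(char *buf, size_t len, int ret,
			      uint64_t root_objectid, uint64_t objectid)
{
	return snprintf(buf, len,
			"ret = %d, root objectid = %llu, objectid = %llu",
			ret, (unsigned long long)root_objectid,
			(unsigned long long)objectid);
}
```

Calling reg_to_int(0x00000000ffffffefULL) yields -17, which is -EEXIST on Linux, consistent with the btrfs_insert_empty_item and btrfs_insert_root lines in the log. Whether the two object IDs were really equal at the time of the failure cannot be read off the original message, since the mismatched format shifts which value lands in which field.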
end of thread

Thread overview: 13+ messages
2020-06-30 22:10 BUG at fs/btrfs/relocation.c:794! David Sterba
2020-07-23 21:56 ` Zygo Blaxell
2020-07-24 0:19 ` Qu Wenruo
2020-08-04 16:16 ` Zygo Blaxell
2020-08-28 0:03 ` BUG at fs/btrfs/relocation.c:794! Still happening on misc-next and 5.8.3 Zygo Blaxell
2020-08-28 0:08 ` Zygo Blaxell
2020-08-28 6:34 ` Nikolay Borisov
2020-08-28 20:42 ` Zygo Blaxell
2020-09-01 22:53 ` Zygo Blaxell
2020-09-01 23:33 ` Qu Wenruo
2020-09-02 0:14 ` Zygo Blaxell
2020-09-02 1:46 ` Qu Wenruo
2020-09-04 15:54 ` Zygo Blaxell