* misc-next and for-next: kernel BUG at fs/btrfs/extent_io.c:2350! during raid5 recovery
@ 2022-08-09  3:31 Zygo Blaxell
  2022-08-09  4:36 ` Qu Wenruo
                   ` (2 more replies)
From: Zygo Blaxell @ 2022-08-09  3:31 UTC (permalink / raw)
  To: linux-btrfs

Test case is:

	- start with a -draid5 -mraid1 filesystem on 2 disks

	- run assorted IO with a mix of reads and writes (randomly
	run rsync, bees, snapshot create/delete, balance, scrub, start
	replacing one of the disks...)

	- cat /dev/zero > /dev/vdb (device 1) in the VM guest, or run
	blkdiscard on the underlying SSD in the VM host, to simulate
	single-disk data corruption

	- repeat until something goes badly wrong, like unrecoverable
	read error or crash

This test case always failed quickly before (corruption was rarely
if ever fully repaired on btrfs raid5 data), and it still doesn't work
now, but now it doesn't work for a new reason.  Progress?

There is now a BUG_ON arising from this test case:

	[  241.051326][   T45] btrfs_print_data_csum_error: 156 callbacks suppressed
	[  241.100910][   T45] ------------[ cut here ]------------
	[  241.102531][   T45] kernel BUG at fs/btrfs/extent_io.c:2350!
	[  241.103261][   T45] invalid opcode: 0000 [#2] PREEMPT SMP PTI
	[  241.104044][   T45] CPU: 2 PID: 45 Comm: kworker/u8:4 Tainted: G      D           5.19.0-466d9d7ea677-for-next+ #85 89955463945a81b56a449b1f12383cf0d5e6b898
	[  241.105652][   T45] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
	[  241.106726][   T45] Workqueue: btrfs-endio-raid56 raid_recover_end_io_work
	[  241.107716][   T45] RIP: 0010:repair_io_failure+0x359/0x4b0
	[  241.108569][   T45] Code: 2b e8 cb 12 79 ff 48 c7 c6 20 23 ac 85 48 c7 c7 00 b9 14 88 e8 d8 e3 72 ff 48 8d bd 48 ff ff ff e8 5c 7e 26 00 e9 f6 fd ff ff <0f> 0b e8 60 d1 5e 01 85 c0 74 cc 48 c
	7 c7 b0 1d 45 88 e8 d0 8e 98
	[  241.111990][   T45] RSP: 0018:ffffbca9009f7a08 EFLAGS: 00010246
	[  241.112911][   T45] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
	[  241.115676][   T45] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
	[  241.118009][   T45] RBP: ffffbca9009f7b00 R08: 0000000000000000 R09: 0000000000000000
	[  241.119484][   T45] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9cd1b9da4000
	[  241.120717][   T45] R13: 0000000000000000 R14: ffffe60cc81a4200 R15: ffff9cd235b4dfa4
	[  241.122594][   T45] FS:  0000000000000000(0000) GS:ffff9cd2b7600000(0000) knlGS:0000000000000000
	[  241.123831][   T45] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
	[  241.125003][   T45] CR2: 00007fbb76b1a738 CR3: 0000000109c26001 CR4: 0000000000170ee0
	[  241.126226][   T45] Call Trace:
	[  241.126646][   T45]  <TASK>
	[  241.127165][   T45]  ? __bio_clone+0x1c0/0x1c0
	[  241.128354][   T45]  clean_io_failure+0x21a/0x260
	[  241.128384][   T45]  end_compressed_bio_read+0x2a9/0x470
	[  241.128411][   T45]  bio_endio+0x361/0x3c0
	[  241.128427][   T45]  rbio_orig_end_io+0x127/0x1c0
	[  241.128447][   T45]  __raid_recover_end_io+0x405/0x8f0
	[  241.128477][   T45]  raid_recover_end_io_work+0x8c/0xb0
	[  241.128494][   T45]  process_one_work+0x4e5/0xaa0
	[  241.128528][   T45]  worker_thread+0x32e/0x720
	[  241.128541][   T45]  ? _raw_spin_unlock_irqrestore+0x7d/0xa0
	[  241.128573][   T45]  ? process_one_work+0xaa0/0xaa0
	[  241.128588][   T45]  kthread+0x1ab/0x1e0
	[  241.128600][   T45]  ? kthread_complete_and_exit+0x40/0x40
	[  241.128628][   T45]  ret_from_fork+0x22/0x30
	[  241.128659][   T45]  </TASK>
	[  241.128667][   T45] Modules linked in:
	[  241.129700][   T45] ---[ end trace 0000000000000000 ]---
	[  241.152310][   T45] RIP: 0010:repair_io_failure+0x359/0x4b0
	[  241.153328][   T45] Code: 2b e8 cb 12 79 ff 48 c7 c6 20 23 ac 85 48 c7 c7 00 b9 14 88 e8 d8 e3 72 ff 48 8d bd 48 ff ff ff e8 5c 7e 26 00 e9 f6 fd ff ff <0f> 0b e8 60 d1 5e 01 85 c0 74 cc 48 c
	7 c7 b0 1d 45 88 e8 d0 8e 98
	[  241.156882][   T45] RSP: 0018:ffffbca902487a08 EFLAGS: 00010246
	[  241.158103][   T45] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
	[  241.160072][   T45] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
	[  241.161984][   T45] RBP: ffffbca902487b00 R08: 0000000000000000 R09: 0000000000000000
	[  241.164067][   T45] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9cd1b9da4000
	[  241.165979][   T45] R13: 0000000000000000 R14: ffffe60cc7589740 R15: ffff9cd1f45495e4
	[  241.167928][   T45] FS:  0000000000000000(0000) GS:ffff9cd2b7600000(0000) knlGS:0000000000000000
	[  241.169978][   T45] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
	[  241.171649][   T45] CR2: 00007fbb76b1a738 CR3: 0000000109c26001 CR4: 0000000000170ee0

KFENCE and UBSAN aren't reporting anything before the BUG_ON.

KCSAN complains about a lot of stuff as usual, including several issues
in the btrfs allocator, but it doesn't look like anything that would
mess with a bio.

	$ git log --no-walk --oneline FETCH_HEAD
	6130a25681d4 (kdave/for-next) Merge branch 'for-next-next-v5.20-20220804' into for-next-20220804

	repair_io_failure at fs/btrfs/extent_io.c:2350 (discriminator 1)
	 2345           u64 sector;
	 2346           struct btrfs_io_context *bioc = NULL;
	 2347           int ret = 0;
	 2348   
	 2349           ASSERT(!(fs_info->sb->s_flags & SB_RDONLY));
	>2350<          BUG_ON(!mirror_num);
	 2351   
	 2352           if (btrfs_repair_one_zone(fs_info, logical))
	 2353                   return 0;
	 2354   
	 2355           map_length = length;


* Re: misc-next and for-next: kernel BUG at fs/btrfs/extent_io.c:2350! during raid5 recovery
  2022-08-09  3:31 misc-next and for-next: kernel BUG at fs/btrfs/extent_io.c:2350! during raid5 recovery Zygo Blaxell
@ 2022-08-09  4:36 ` Qu Wenruo
  2022-08-09 19:46   ` Zygo Blaxell
  2022-08-09  7:35 ` Qu Wenruo
  2022-08-09  8:29 ` Christoph Hellwig
From: Qu Wenruo @ 2022-08-09  4:36 UTC (permalink / raw)
  To: Zygo Blaxell, linux-btrfs, Christoph Hellwig



On 2022/8/9 11:31, Zygo Blaxell wrote:
> Test case is:
>
> 	- start with a -draid5 -mraid1 filesystem on 2 disks
>
> 	- run assorted IO with a mix of reads and writes (randomly
> 	run rsync, bees, snapshot create/delete, balance, scrub, start
> 	replacing one of the disks...)
>
> 	- cat /dev/zero > /dev/vdb (device 1) in the VM guest, or run
> 	blkdiscard on the underlying SSD in the VM host, to simulate
> 	single-disk data corruption
>
> 	- repeat until something goes badly wrong, like unrecoverable
> 	read error or crash
>
> This test case always failed quickly before (corruption was rarely
> if ever fully repaired on btrfs raid5 data), and it still doesn't work
> now, but now it doesn't work for a new reason.  Progress?

This looks like the new read repair work for compressed extents; adding HCH to the thread.

But just curious, have you tested without compression?

Thanks,
Qu
>
> There is now a BUG_ON arising from this test case:
>
> 	[  241.051326][   T45] btrfs_print_data_csum_error: 156 callbacks suppressed
> 	[  241.100910][   T45] ------------[ cut here ]------------
> 	[  241.102531][   T45] kernel BUG at fs/btrfs/extent_io.c:2350!
> 	[  241.103261][   T45] invalid opcode: 0000 [#2] PREEMPT SMP PTI
> 	[  241.104044][   T45] CPU: 2 PID: 45 Comm: kworker/u8:4 Tainted: G      D           5.19.0-466d9d7ea677-for-next+ #85 89955463945a81b56a449b1f12383cf0d5e6b898
> 	[  241.105652][   T45] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
> 	[  241.106726][   T45] Workqueue: btrfs-endio-raid56 raid_recover_end_io_work
> 	[  241.107716][   T45] RIP: 0010:repair_io_failure+0x359/0x4b0
> 	[  241.108569][   T45] Code: 2b e8 cb 12 79 ff 48 c7 c6 20 23 ac 85 48 c7 c7 00 b9 14 88 e8 d8 e3 72 ff 48 8d bd 48 ff ff ff e8 5c 7e 26 00 e9 f6 fd ff ff <0f> 0b e8 60 d1 5e 01 85 c0 74 cc 48 c
> 	7 c7 b0 1d 45 88 e8 d0 8e 98
> 	[  241.111990][   T45] RSP: 0018:ffffbca9009f7a08 EFLAGS: 00010246
> 	[  241.112911][   T45] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
> 	[  241.115676][   T45] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
> 	[  241.118009][   T45] RBP: ffffbca9009f7b00 R08: 0000000000000000 R09: 0000000000000000
> 	[  241.119484][   T45] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9cd1b9da4000
> 	[  241.120717][   T45] R13: 0000000000000000 R14: ffffe60cc81a4200 R15: ffff9cd235b4dfa4
> 	[  241.122594][   T45] FS:  0000000000000000(0000) GS:ffff9cd2b7600000(0000) knlGS:0000000000000000
> 	[  241.123831][   T45] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> 	[  241.125003][   T45] CR2: 00007fbb76b1a738 CR3: 0000000109c26001 CR4: 0000000000170ee0
> 	[  241.126226][   T45] Call Trace:
> 	[  241.126646][   T45]  <TASK>
> 	[  241.127165][   T45]  ? __bio_clone+0x1c0/0x1c0
> 	[  241.128354][   T45]  clean_io_failure+0x21a/0x260
> 	[  241.128384][   T45]  end_compressed_bio_read+0x2a9/0x470
> 	[  241.128411][   T45]  bio_endio+0x361/0x3c0
> 	[  241.128427][   T45]  rbio_orig_end_io+0x127/0x1c0
> 	[  241.128447][   T45]  __raid_recover_end_io+0x405/0x8f0
> 	[  241.128477][   T45]  raid_recover_end_io_work+0x8c/0xb0
> 	[  241.128494][   T45]  process_one_work+0x4e5/0xaa0
> 	[  241.128528][   T45]  worker_thread+0x32e/0x720
> 	[  241.128541][   T45]  ? _raw_spin_unlock_irqrestore+0x7d/0xa0
> 	[  241.128573][   T45]  ? process_one_work+0xaa0/0xaa0
> 	[  241.128588][   T45]  kthread+0x1ab/0x1e0
> 	[  241.128600][   T45]  ? kthread_complete_and_exit+0x40/0x40
> 	[  241.128628][   T45]  ret_from_fork+0x22/0x30
> 	[  241.128659][   T45]  </TASK>
> 	[  241.128667][   T45] Modules linked in:
> 	[  241.129700][   T45] ---[ end trace 0000000000000000 ]---
> 	[  241.152310][   T45] RIP: 0010:repair_io_failure+0x359/0x4b0
> 	[  241.153328][   T45] Code: 2b e8 cb 12 79 ff 48 c7 c6 20 23 ac 85 48 c7 c7 00 b9 14 88 e8 d8 e3 72 ff 48 8d bd 48 ff ff ff e8 5c 7e 26 00 e9 f6 fd ff ff <0f> 0b e8 60 d1 5e 01 85 c0 74 cc 48 c
> 	7 c7 b0 1d 45 88 e8 d0 8e 98
> 	[  241.156882][   T45] RSP: 0018:ffffbca902487a08 EFLAGS: 00010246
> 	[  241.158103][   T45] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
> 	[  241.160072][   T45] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
> 	[  241.161984][   T45] RBP: ffffbca902487b00 R08: 0000000000000000 R09: 0000000000000000
> 	[  241.164067][   T45] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9cd1b9da4000
> 	[  241.165979][   T45] R13: 0000000000000000 R14: ffffe60cc7589740 R15: ffff9cd1f45495e4
> 	[  241.167928][   T45] FS:  0000000000000000(0000) GS:ffff9cd2b7600000(0000) knlGS:0000000000000000
> 	[  241.169978][   T45] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> 	[  241.171649][   T45] CR2: 00007fbb76b1a738 CR3: 0000000109c26001 CR4: 0000000000170ee0
>
> KFENCE and UBSAN aren't reporting anything before the BUG_ON.
>
> KCSAN complains about a lot of stuff as usual, including several issues
> in the btrfs allocator, but it doesn't look like anything that would
> mess with a bio.
>
> 	$ git log --no-walk --oneline FETCH_HEAD
> 	6130a25681d4 (kdave/for-next) Merge branch 'for-next-next-v5.20-20220804' into for-next-20220804
>
> 	repair_io_failure at fs/btrfs/extent_io.c:2350 (discriminator 1)
> 	 2345           u64 sector;
> 	 2346           struct btrfs_io_context *bioc = NULL;
> 	 2347           int ret = 0;
> 	 2348
> 	 2349           ASSERT(!(fs_info->sb->s_flags & SB_RDONLY));
> 	>2350<          BUG_ON(!mirror_num);
> 	 2351
> 	 2352           if (btrfs_repair_one_zone(fs_info, logical))
> 	 2353                   return 0;
> 	 2354
> 	 2355           map_length = length;


* Re: misc-next and for-next: kernel BUG at fs/btrfs/extent_io.c:2350! during raid5 recovery
  2022-08-09  3:31 misc-next and for-next: kernel BUG at fs/btrfs/extent_io.c:2350! during raid5 recovery Zygo Blaxell
  2022-08-09  4:36 ` Qu Wenruo
@ 2022-08-09  7:35 ` Qu Wenruo
  2022-08-09 19:29   ` Zygo Blaxell
  2022-08-09  8:29 ` Christoph Hellwig
From: Qu Wenruo @ 2022-08-09  7:35 UTC (permalink / raw)
  To: Zygo Blaxell, linux-btrfs



On 2022/8/9 11:31, Zygo Blaxell wrote:
> Test case is:
>
> 	- start with a -draid5 -mraid1 filesystem on 2 disks
>
> 	- run assorted IO with a mix of reads and writes (randomly
> 	run rsync, bees, snapshot create/delete, balance, scrub, start
> 	replacing one of the disks...)
>
> 	- cat /dev/zero > /dev/vdb (device 1) in the VM guest, or run
> 	blkdiscard on the underlying SSD in the VM host, to simulate
> 	single-disk data corruption

One thing to mention is that this is going to cause destructive RMW to happen.

Currently a sub-stripe write will not verify whether the on-disk data stripe
matches its csum.

Thus if the wipeout happens while the above workload is still running, it's
going to corrupt data eventually.
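
A minimal illustration of that destructive RMW (a hypothetical, self-contained
sketch using plain XOR parity across two data stripes; this is not btrfs code):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint8_t d1 = 0xAA, d2 = 0x55;	/* original data stripes */
	uint8_t p  = d1 ^ d2;		/* parity written while both were good */
	uint8_t d2_disk = 0x00;		/* d2 later wiped/corrupted on disk */

	/* Recovery is still possible here: d1 ^ p reconstructs the old d2. */
	printf("recoverable d2: 0x%02x\n", d1 ^ p);

	/* A sub-stripe write into d1 that does not verify d2 against its
	 * csum regenerates parity from the bad on-disk d2: */
	uint8_t d1_new = 0xBB;
	p = d1_new ^ d2_disk;

	/* Now d1_new ^ p only yields the corrupted d2; the good copy is gone
	 * and the csum mismatch can no longer be repaired. */
	printf("d2 after destructive RMW: 0x%02x\n", d1_new ^ p);
	return 0;
}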

Thanks,
Qu
>
> 	- repeat until something goes badly wrong, like unrecoverable
> 	read error or crash
>
> This test case always failed quickly before (corruption was rarely
> if ever fully repaired on btrfs raid5 data), and it still doesn't work
> now, but now it doesn't work for a new reason.  Progress?
>
> There is now a BUG_ON arising from this test case:
>
> 	[  241.051326][   T45] btrfs_print_data_csum_error: 156 callbacks suppressed
> 	[  241.100910][   T45] ------------[ cut here ]------------
> 	[  241.102531][   T45] kernel BUG at fs/btrfs/extent_io.c:2350!
> 	[  241.103261][   T45] invalid opcode: 0000 [#2] PREEMPT SMP PTI
> 	[  241.104044][   T45] CPU: 2 PID: 45 Comm: kworker/u8:4 Tainted: G      D           5.19.0-466d9d7ea677-for-next+ #85 89955463945a81b56a449b1f12383cf0d5e6b898
> 	[  241.105652][   T45] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
> 	[  241.106726][   T45] Workqueue: btrfs-endio-raid56 raid_recover_end_io_work
> 	[  241.107716][   T45] RIP: 0010:repair_io_failure+0x359/0x4b0
> 	[  241.108569][   T45] Code: 2b e8 cb 12 79 ff 48 c7 c6 20 23 ac 85 48 c7 c7 00 b9 14 88 e8 d8 e3 72 ff 48 8d bd 48 ff ff ff e8 5c 7e 26 00 e9 f6 fd ff ff <0f> 0b e8 60 d1 5e 01 85 c0 74 cc 48 c
> 	7 c7 b0 1d 45 88 e8 d0 8e 98
> 	[  241.111990][   T45] RSP: 0018:ffffbca9009f7a08 EFLAGS: 00010246
> 	[  241.112911][   T45] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
> 	[  241.115676][   T45] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
> 	[  241.118009][   T45] RBP: ffffbca9009f7b00 R08: 0000000000000000 R09: 0000000000000000
> 	[  241.119484][   T45] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9cd1b9da4000
> 	[  241.120717][   T45] R13: 0000000000000000 R14: ffffe60cc81a4200 R15: ffff9cd235b4dfa4
> 	[  241.122594][   T45] FS:  0000000000000000(0000) GS:ffff9cd2b7600000(0000) knlGS:0000000000000000
> 	[  241.123831][   T45] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> 	[  241.125003][   T45] CR2: 00007fbb76b1a738 CR3: 0000000109c26001 CR4: 0000000000170ee0
> 	[  241.126226][   T45] Call Trace:
> 	[  241.126646][   T45]  <TASK>
> 	[  241.127165][   T45]  ? __bio_clone+0x1c0/0x1c0
> 	[  241.128354][   T45]  clean_io_failure+0x21a/0x260
> 	[  241.128384][   T45]  end_compressed_bio_read+0x2a9/0x470
> 	[  241.128411][   T45]  bio_endio+0x361/0x3c0
> 	[  241.128427][   T45]  rbio_orig_end_io+0x127/0x1c0
> 	[  241.128447][   T45]  __raid_recover_end_io+0x405/0x8f0
> 	[  241.128477][   T45]  raid_recover_end_io_work+0x8c/0xb0
> 	[  241.128494][   T45]  process_one_work+0x4e5/0xaa0
> 	[  241.128528][   T45]  worker_thread+0x32e/0x720
> 	[  241.128541][   T45]  ? _raw_spin_unlock_irqrestore+0x7d/0xa0
> 	[  241.128573][   T45]  ? process_one_work+0xaa0/0xaa0
> 	[  241.128588][   T45]  kthread+0x1ab/0x1e0
> 	[  241.128600][   T45]  ? kthread_complete_and_exit+0x40/0x40
> 	[  241.128628][   T45]  ret_from_fork+0x22/0x30
> 	[  241.128659][   T45]  </TASK>
> 	[  241.128667][   T45] Modules linked in:
> 	[  241.129700][   T45] ---[ end trace 0000000000000000 ]---
> 	[  241.152310][   T45] RIP: 0010:repair_io_failure+0x359/0x4b0
> 	[  241.153328][   T45] Code: 2b e8 cb 12 79 ff 48 c7 c6 20 23 ac 85 48 c7 c7 00 b9 14 88 e8 d8 e3 72 ff 48 8d bd 48 ff ff ff e8 5c 7e 26 00 e9 f6 fd ff ff <0f> 0b e8 60 d1 5e 01 85 c0 74 cc 48 c
> 	7 c7 b0 1d 45 88 e8 d0 8e 98
> 	[  241.156882][   T45] RSP: 0018:ffffbca902487a08 EFLAGS: 00010246
> 	[  241.158103][   T45] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
> 	[  241.160072][   T45] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
> 	[  241.161984][   T45] RBP: ffffbca902487b00 R08: 0000000000000000 R09: 0000000000000000
> 	[  241.164067][   T45] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9cd1b9da4000
> 	[  241.165979][   T45] R13: 0000000000000000 R14: ffffe60cc7589740 R15: ffff9cd1f45495e4
> 	[  241.167928][   T45] FS:  0000000000000000(0000) GS:ffff9cd2b7600000(0000) knlGS:0000000000000000
> 	[  241.169978][   T45] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> 	[  241.171649][   T45] CR2: 00007fbb76b1a738 CR3: 0000000109c26001 CR4: 0000000000170ee0
>
> KFENCE and UBSAN aren't reporting anything before the BUG_ON.
>
> KCSAN complains about a lot of stuff as usual, including several issues
> in the btrfs allocator, but it doesn't look like anything that would
> mess with a bio.
>
> 	$ git log --no-walk --oneline FETCH_HEAD
> 	6130a25681d4 (kdave/for-next) Merge branch 'for-next-next-v5.20-20220804' into for-next-20220804
>
> 	repair_io_failure at fs/btrfs/extent_io.c:2350 (discriminator 1)
> 	 2345           u64 sector;
> 	 2346           struct btrfs_io_context *bioc = NULL;
> 	 2347           int ret = 0;
> 	 2348
> 	 2349           ASSERT(!(fs_info->sb->s_flags & SB_RDONLY));
> 	>2350<          BUG_ON(!mirror_num);
> 	 2351
> 	 2352           if (btrfs_repair_one_zone(fs_info, logical))
> 	 2353                   return 0;
> 	 2354
> 	 2355           map_length = length;


* Re: misc-next and for-next: kernel BUG at fs/btrfs/extent_io.c:2350! during raid5 recovery
  2022-08-09  3:31 misc-next and for-next: kernel BUG at fs/btrfs/extent_io.c:2350! during raid5 recovery Zygo Blaxell
  2022-08-09  4:36 ` Qu Wenruo
  2022-08-09  7:35 ` Qu Wenruo
@ 2022-08-09  8:29 ` Christoph Hellwig
  2022-08-09 19:24   ` Zygo Blaxell
From: Christoph Hellwig @ 2022-08-09  8:29 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: linux-btrfs

On Mon, Aug 08, 2022 at 11:31:51PM -0400, Zygo Blaxell wrote:
> There is now a BUG_ON arising from this test case:
> 
> 	[  241.051326][   T45] btrfs_print_data_csum_error: 156 callbacks suppressed
> 	[  241.100910][   T45] ------------[ cut here ]------------
> 	[  241.102531][   T45] kernel BUG at fs/btrfs/extent_io.c:2350!

This

        BUG_ON(!mirror_num);

so repair_io_failure gets called with a mirror_num of 0..

> 	[  241.128354][   T45]  clean_io_failure+0x21a/0x260

.. from clean_io_failure.  Which starts from failrec->this_mirror and
tries to go back to failrec->failed_mirror using the prev_mirror
helper.  prev_mirror looks like:

static int prev_mirror(const struct io_failure_record *failrec, int cur_mirror)
{
        if (cur_mirror == 1)
		return failrec->num_copies;
	return cur_mirror - 1;
}

So the only way we could end up with a mirror = 0 is if
failrec->num_copies is 0.  failrec->num_copies is initialized
in btrfs_get_io_failure_record by doing:

        failrec->num_copies = btrfs_num_copies(fs_info, failrec->logical, sectorsize);

just after allocating the failrec.  I can't see any obvious way how
btrfs_num_copies would return 0, though, as for raid5 it just copies
from btrfs_raid_array.
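
For illustration, a simplified sketch of that walk (made-up stand-in code, not
the actual clean_io_failure implementation), showing how num_copies == 0 hands
mirror 0 straight to repair_io_failure:

struct io_failure_record {
	int this_mirror;
	int failed_mirror;
	int num_copies;		/* a value of 0 is what ends up triggering the BUG_ON */
};

static int prev_mirror(const struct io_failure_record *failrec, int cur_mirror)
{
	if (cur_mirror == 1)
		return failrec->num_copies;	/* returns 0 if num_copies == 0 */
	return cur_mirror - 1;
}

/* walk_mirrors() is a made-up name for the repair walk clean_io_failure runs. */
static void walk_mirrors(const struct io_failure_record *failrec)
{
	int mirror = failrec->this_mirror;

	/* Walk back from this_mirror toward failed_mirror, repairing each
	 * mirror on the way.  With num_copies == 0 and this_mirror == 1,
	 * the very first step yields mirror == 0, which repair_io_failure()
	 * rejects with BUG_ON(!mirror_num). */
	do {
		mirror = prev_mirror(failrec, mirror);
		/* repair_io_failure(fs_info, ..., mirror); */
	} while (mirror != failrec->failed_mirror);
}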

Any chance you could share a script for your reproducer?



* Re: misc-next and for-next: kernel BUG at fs/btrfs/extent_io.c:2350! during raid5 recovery
  2022-08-09  8:29 ` Christoph Hellwig
@ 2022-08-09 19:24   ` Zygo Blaxell
  2022-08-12  2:58     ` Wang Yugui
  2022-08-13  1:50     ` Qu Wenruo
From: Zygo Blaxell @ 2022-08-09 19:24 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-btrfs

On Tue, Aug 09, 2022 at 01:29:59AM -0700, Christoph Hellwig wrote:
> On Mon, Aug 08, 2022 at 11:31:51PM -0400, Zygo Blaxell wrote:
> > There is now a BUG_ON arising from this test case:
> > 
> > 	[  241.051326][   T45] btrfs_print_data_csum_error: 156 callbacks suppressed
> > 	[  241.100910][   T45] ------------[ cut here ]------------
> > 	[  241.102531][   T45] kernel BUG at fs/btrfs/extent_io.c:2350!
> 
> This
> 
>         BUG_ON(!mirror_num);
> 
> so repair_io_failure gets called with a mirror_num of 0..
> 
> > 	[  241.128354][   T45]  clean_io_failure+0x21a/0x260
> 
> .. from clean_io_failure.  Which starts from failrec->this_mirror and
> tries to go back to failrec->failed_mirror using the prev_mirror
> helper.  prev_mirror looks like:
> 
> static int prev_mirror(const struct io_failure_record *failrec, int cur_mirror)
> {
>         if (cur_mirror == 1)
> 		return failrec->num_copies;
> 	return cur_mirror - 1;
> }
> 
> So the only way we could end up with a mirror = 0 is if
> failrec->num_copies is 0.  -failrec->num_copies is initialized
> in btrfs_get_io_failure_record by doing:
> 
>         failrec->num_copies = btrfs_num_copies(fs_info, failrec->logical, sectorsize);
> 
> just adter allocating the failrec.  I can't see any obvious way how
> btrfs_num_copies would return 0, though, as for raid5 it just copies
> from btrfs_raid_array.

Judging from prior raid5 testing behavior, it looks like there's a race
condition specific to btrfs raid5 IO.  Previous kernel versions have had
assorted UAF bugs from time to time that KASAN tripped over, and btrfs
replace almost never works on the first try on a real raid5 array.
These issues were reported years ago, but nobody seems to have been
working on them until recently.

A similar test setup previously produced data corruption during raid5
recovery, even on an otherwise idle filesystem, at a low rate (~1
error per 100 GB).  I expect whatever bug was leading to that hasn't
been entirely fixed yet.

> Any chance you could share a script for your reproducer?

The simplest reproducer is some variant of:

	mkfs.btrfs -draid5 -mraid1 /dev/vdb /dev/vdc /dev/vdd
	mount /dev/vdb /mnt -ocompress=zstd,noatime
	cd /mnt
	cp -a /40gb-test-data .
	sync
	while true; do
		find -type f -exec cat {} + > /dev/null
	done &
	while true; do
		cat /dev/zero > /dev/vdb
	done &
	while true; do
		btrfs scrub start -Bd /mnt
	done &
	wait

but it can take a long time to hit a failure with something that gentle.
I throw on some extra test workload (e.g. lots of rsyncs) to keep the
page cache full and under memory pressure, which seems to speed up the
failure rate to once every few hours.


* Re: misc-next and for-next: kernel BUG at fs/btrfs/extent_io.c:2350! during raid5 recovery
  2022-08-09  7:35 ` Qu Wenruo
@ 2022-08-09 19:29   ` Zygo Blaxell
  2022-08-09 21:50     ` Qu Wenruo
From: Zygo Blaxell @ 2022-08-09 19:29 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Tue, Aug 09, 2022 at 03:35:25PM +0800, Qu Wenruo wrote:
> 
> 
> On 2022/8/9 11:31, Zygo Blaxell wrote:
> > Test case is:
> > 
> > 	- start with a -draid5 -mraid1 filesystem on 2 disks
> > 
> > 	- run assorted IO with a mix of reads and writes (randomly
> > 	run rsync, bees, snapshot create/delete, balance, scrub, start
> > 	replacing one of the disks...)
> > 
> > 	- cat /dev/zero > /dev/vdb (device 1) in the VM guest, or run
> > 	blkdiscard on the underlying SSD in the VM host, to simulate
> > 	single-disk data corruption
> 
> One thing to mention is, this is going to cause destructive RMW to happen.
> 
> As currently substripe write will not verify if the on-disk data stripe
> matches its csum.
> 
> Thus if the wipeout happens while above workload is still running, it's
> going to corrupt data eventually.

That would be a btrfs raid5 design bug, since disks don't schedule their
corruptions during idle periods; however, I can modify the test design
to avoid this case (i.e. only corrupt disks while idle or reading).

A BUG_ON is not expected, though, no matter what the disk does.

> Thanks,
> Qu
> > 
> > 	- repeat until something goes badly wrong, like unrecoverable
> > 	read error or crash
> > 
> > This test case always failed quickly before (corruption was rarely
> > if ever fully repaired on btrfs raid5 data), and it still doesn't work
> > now, but now it doesn't work for a new reason.  Progress?
> > 
> > There is now a BUG_ON arising from this test case:
> > 
> > 	[  241.051326][   T45] btrfs_print_data_csum_error: 156 callbacks suppressed
> > 	[  241.100910][   T45] ------------[ cut here ]------------
> > 	[  241.102531][   T45] kernel BUG at fs/btrfs/extent_io.c:2350!
> > 	[  241.103261][   T45] invalid opcode: 0000 [#2] PREEMPT SMP PTI
> > 	[  241.104044][   T45] CPU: 2 PID: 45 Comm: kworker/u8:4 Tainted: G      D           5.19.0-466d9d7ea677-for-next+ #85 89955463945a81b56a449b1f12383cf0d5e6b898
> > 	[  241.105652][   T45] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
> > 	[  241.106726][   T45] Workqueue: btrfs-endio-raid56 raid_recover_end_io_work
> > 	[  241.107716][   T45] RIP: 0010:repair_io_failure+0x359/0x4b0
> > 	[  241.108569][   T45] Code: 2b e8 cb 12 79 ff 48 c7 c6 20 23 ac 85 48 c7 c7 00 b9 14 88 e8 d8 e3 72 ff 48 8d bd 48 ff ff ff e8 5c 7e 26 00 e9 f6 fd ff ff <0f> 0b e8 60 d1 5e 01 85 c0 74 cc 48 c
> > 	7 c7 b0 1d 45 88 e8 d0 8e 98
> > 	[  241.111990][   T45] RSP: 0018:ffffbca9009f7a08 EFLAGS: 00010246
> > 	[  241.112911][   T45] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
> > 	[  241.115676][   T45] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
> > 	[  241.118009][   T45] RBP: ffffbca9009f7b00 R08: 0000000000000000 R09: 0000000000000000
> > 	[  241.119484][   T45] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9cd1b9da4000
> > 	[  241.120717][   T45] R13: 0000000000000000 R14: ffffe60cc81a4200 R15: ffff9cd235b4dfa4
> > 	[  241.122594][   T45] FS:  0000000000000000(0000) GS:ffff9cd2b7600000(0000) knlGS:0000000000000000
> > 	[  241.123831][   T45] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > 	[  241.125003][   T45] CR2: 00007fbb76b1a738 CR3: 0000000109c26001 CR4: 0000000000170ee0
> > 	[  241.126226][   T45] Call Trace:
> > 	[  241.126646][   T45]  <TASK>
> > 	[  241.127165][   T45]  ? __bio_clone+0x1c0/0x1c0
> > 	[  241.128354][   T45]  clean_io_failure+0x21a/0x260
> > 	[  241.128384][   T45]  end_compressed_bio_read+0x2a9/0x470
> > 	[  241.128411][   T45]  bio_endio+0x361/0x3c0
> > 	[  241.128427][   T45]  rbio_orig_end_io+0x127/0x1c0
> > 	[  241.128447][   T45]  __raid_recover_end_io+0x405/0x8f0
> > 	[  241.128477][   T45]  raid_recover_end_io_work+0x8c/0xb0
> > 	[  241.128494][   T45]  process_one_work+0x4e5/0xaa0
> > 	[  241.128528][   T45]  worker_thread+0x32e/0x720
> > 	[  241.128541][   T45]  ? _raw_spin_unlock_irqrestore+0x7d/0xa0
> > 	[  241.128573][   T45]  ? process_one_work+0xaa0/0xaa0
> > 	[  241.128588][   T45]  kthread+0x1ab/0x1e0
> > 	[  241.128600][   T45]  ? kthread_complete_and_exit+0x40/0x40
> > 	[  241.128628][   T45]  ret_from_fork+0x22/0x30
> > 	[  241.128659][   T45]  </TASK>
> > 	[  241.128667][   T45] Modules linked in:
> > 	[  241.129700][   T45] ---[ end trace 0000000000000000 ]---
> > 	[  241.152310][   T45] RIP: 0010:repair_io_failure+0x359/0x4b0
> > 	[  241.153328][   T45] Code: 2b e8 cb 12 79 ff 48 c7 c6 20 23 ac 85 48 c7 c7 00 b9 14 88 e8 d8 e3 72 ff 48 8d bd 48 ff ff ff e8 5c 7e 26 00 e9 f6 fd ff ff <0f> 0b e8 60 d1 5e 01 85 c0 74 cc 48 c
> > 	7 c7 b0 1d 45 88 e8 d0 8e 98
> > 	[  241.156882][   T45] RSP: 0018:ffffbca902487a08 EFLAGS: 00010246
> > 	[  241.158103][   T45] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
> > 	[  241.160072][   T45] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
> > 	[  241.161984][   T45] RBP: ffffbca902487b00 R08: 0000000000000000 R09: 0000000000000000
> > 	[  241.164067][   T45] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9cd1b9da4000
> > 	[  241.165979][   T45] R13: 0000000000000000 R14: ffffe60cc7589740 R15: ffff9cd1f45495e4
> > 	[  241.167928][   T45] FS:  0000000000000000(0000) GS:ffff9cd2b7600000(0000) knlGS:0000000000000000
> > 	[  241.169978][   T45] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > 	[  241.171649][   T45] CR2: 00007fbb76b1a738 CR3: 0000000109c26001 CR4: 0000000000170ee0
> > 
> > KFENCE and UBSAN aren't reporting anything before the BUG_ON.
> > 
> > KCSAN complains about a lot of stuff as usual, including several issues
> > in the btrfs allocator, but it doesn't look like anything that would
> > mess with a bio.
> > 
> > 	$ git log --no-walk --oneline FETCH_HEAD
> > 	6130a25681d4 (kdave/for-next) Merge branch 'for-next-next-v5.20-20220804' into for-next-20220804
> > 
> > 	repair_io_failure at fs/btrfs/extent_io.c:2350 (discriminator 1)
> > 	 2345           u64 sector;
> > 	 2346           struct btrfs_io_context *bioc = NULL;
> > 	 2347           int ret = 0;
> > 	 2348
> > 	 2349           ASSERT(!(fs_info->sb->s_flags & SB_RDONLY));
> > 	>2350<          BUG_ON(!mirror_num);
> > 	 2351
> > 	 2352           if (btrfs_repair_one_zone(fs_info, logical))
> > 	 2353                   return 0;
> > 	 2354
> > 	 2355           map_length = length;


* Re: misc-next and for-next: kernel BUG at fs/btrfs/extent_io.c:2350! during raid5 recovery
  2022-08-09  4:36 ` Qu Wenruo
@ 2022-08-09 19:46   ` Zygo Blaxell
  2022-08-10  7:17     ` Qu Wenruo
  2022-08-14  4:52     ` Qu Wenruo
From: Zygo Blaxell @ 2022-08-09 19:46 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs, Christoph Hellwig

On Tue, Aug 09, 2022 at 12:36:44PM +0800, Qu Wenruo wrote:
> 
> 
> On 2022/8/9 11:31, Zygo Blaxell wrote:
> > Test case is:
> > 
> > 	- start with a -draid5 -mraid1 filesystem on 2 disks
> > 
> > 	- run assorted IO with a mix of reads and writes (randomly
> > 	run rsync, bees, snapshot create/delete, balance, scrub, start
> > 	replacing one of the disks...)
> > 
> > 	- cat /dev/zero > /dev/vdb (device 1) in the VM guest, or run
> > 	blkdiscard on the underlying SSD in the VM host, to simulate
> > 	single-disk data corruption
> > 
> > 	- repeat until something goes badly wrong, like unrecoverable
> > 	read error or crash
> > 
> > This test case always failed quickly before (corruption was rarely
> > if ever fully repaired on btrfs raid5 data), and it still doesn't work
> > now, but now it doesn't work for a new reason.  Progress?
> 
> The new read repair work for compressed extents, adding HCH to the thread.
> 
> But just curious, have you tested without compression?

All of the ~200 BUG_ON stack traces in my logs have the same list of
functions as above.  If the bug affected uncompressed data, I'd expect
to see two different stack traces.  It's a fairly decent sample size,
so I'd say it's most likely not happening with uncompressed extents.

All the production workloads have compression enabled, so we don't
normally test with compression disabled.  I can run a separate test for
that if you'd like.

> Thanks,
> Qu
> > 
> > There is now a BUG_ON arising from this test case:
> > 
> > 	[  241.051326][   T45] btrfs_print_data_csum_error: 156 callbacks suppressed
> > 	[  241.100910][   T45] ------------[ cut here ]------------
> > 	[  241.102531][   T45] kernel BUG at fs/btrfs/extent_io.c:2350!
> > 	[  241.103261][   T45] invalid opcode: 0000 [#2] PREEMPT SMP PTI
> > 	[  241.104044][   T45] CPU: 2 PID: 45 Comm: kworker/u8:4 Tainted: G      D           5.19.0-466d9d7ea677-for-next+ #85 89955463945a81b56a449b1f12383cf0d5e6b898
> > 	[  241.105652][   T45] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
> > 	[  241.106726][   T45] Workqueue: btrfs-endio-raid56 raid_recover_end_io_work
> > 	[  241.107716][   T45] RIP: 0010:repair_io_failure+0x359/0x4b0
> > 	[  241.108569][   T45] Code: 2b e8 cb 12 79 ff 48 c7 c6 20 23 ac 85 48 c7 c7 00 b9 14 88 e8 d8 e3 72 ff 48 8d bd 48 ff ff ff e8 5c 7e 26 00 e9 f6 fd ff ff <0f> 0b e8 60 d1 5e 01 85 c0 74 cc 48 c
> > 	7 c7 b0 1d 45 88 e8 d0 8e 98
> > 	[  241.111990][   T45] RSP: 0018:ffffbca9009f7a08 EFLAGS: 00010246
> > 	[  241.112911][   T45] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
> > 	[  241.115676][   T45] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
> > 	[  241.118009][   T45] RBP: ffffbca9009f7b00 R08: 0000000000000000 R09: 0000000000000000
> > 	[  241.119484][   T45] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9cd1b9da4000
> > 	[  241.120717][   T45] R13: 0000000000000000 R14: ffffe60cc81a4200 R15: ffff9cd235b4dfa4
> > 	[  241.122594][   T45] FS:  0000000000000000(0000) GS:ffff9cd2b7600000(0000) knlGS:0000000000000000
> > 	[  241.123831][   T45] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > 	[  241.125003][   T45] CR2: 00007fbb76b1a738 CR3: 0000000109c26001 CR4: 0000000000170ee0
> > 	[  241.126226][   T45] Call Trace:
> > 	[  241.126646][   T45]  <TASK>
> > 	[  241.127165][   T45]  ? __bio_clone+0x1c0/0x1c0
> > 	[  241.128354][   T45]  clean_io_failure+0x21a/0x260
> > 	[  241.128384][   T45]  end_compressed_bio_read+0x2a9/0x470
> > 	[  241.128411][   T45]  bio_endio+0x361/0x3c0
> > 	[  241.128427][   T45]  rbio_orig_end_io+0x127/0x1c0
> > 	[  241.128447][   T45]  __raid_recover_end_io+0x405/0x8f0
> > 	[  241.128477][   T45]  raid_recover_end_io_work+0x8c/0xb0
> > 	[  241.128494][   T45]  process_one_work+0x4e5/0xaa0
> > 	[  241.128528][   T45]  worker_thread+0x32e/0x720
> > 	[  241.128541][   T45]  ? _raw_spin_unlock_irqrestore+0x7d/0xa0
> > 	[  241.128573][   T45]  ? process_one_work+0xaa0/0xaa0
> > 	[  241.128588][   T45]  kthread+0x1ab/0x1e0
> > 	[  241.128600][   T45]  ? kthread_complete_and_exit+0x40/0x40
> > 	[  241.128628][   T45]  ret_from_fork+0x22/0x30
> > 	[  241.128659][   T45]  </TASK>
> > 	[  241.128667][   T45] Modules linked in:
> > 	[  241.129700][   T45] ---[ end trace 0000000000000000 ]---
> > 	[  241.152310][   T45] RIP: 0010:repair_io_failure+0x359/0x4b0
> > 	[  241.153328][   T45] Code: 2b e8 cb 12 79 ff 48 c7 c6 20 23 ac 85 48 c7 c7 00 b9 14 88 e8 d8 e3 72 ff 48 8d bd 48 ff ff ff e8 5c 7e 26 00 e9 f6 fd ff ff <0f> 0b e8 60 d1 5e 01 85 c0 74 cc 48 c
> > 	7 c7 b0 1d 45 88 e8 d0 8e 98
> > 	[  241.156882][   T45] RSP: 0018:ffffbca902487a08 EFLAGS: 00010246
> > 	[  241.158103][   T45] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
> > 	[  241.160072][   T45] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
> > 	[  241.161984][   T45] RBP: ffffbca902487b00 R08: 0000000000000000 R09: 0000000000000000
> > 	[  241.164067][   T45] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9cd1b9da4000
> > 	[  241.165979][   T45] R13: 0000000000000000 R14: ffffe60cc7589740 R15: ffff9cd1f45495e4
> > 	[  241.167928][   T45] FS:  0000000000000000(0000) GS:ffff9cd2b7600000(0000) knlGS:0000000000000000
> > 	[  241.169978][   T45] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > 	[  241.171649][   T45] CR2: 00007fbb76b1a738 CR3: 0000000109c26001 CR4: 0000000000170ee0
> > 
> > KFENCE and UBSAN aren't reporting anything before the BUG_ON.
> > 
> > KCSAN complains about a lot of stuff as usual, including several issues
> > in the btrfs allocator, but it doesn't look like anything that would
> > mess with a bio.
> > 
> > 	$ git log --no-walk --oneline FETCH_HEAD
> > 	6130a25681d4 (kdave/for-next) Merge branch 'for-next-next-v5.20-20220804' into for-next-20220804
> > 
> > 	repair_io_failure at fs/btrfs/extent_io.c:2350 (discriminator 1)
> > 	 2345           u64 sector;
> > 	 2346           struct btrfs_io_context *bioc = NULL;
> > 	 2347           int ret = 0;
> > 	 2348
> > 	 2349           ASSERT(!(fs_info->sb->s_flags & SB_RDONLY));
> > 	>2350<          BUG_ON(!mirror_num);
> > 	 2351
> > 	 2352           if (btrfs_repair_one_zone(fs_info, logical))
> > 	 2353                   return 0;
> > 	 2354
> > 	 2355           map_length = length;


* Re: misc-next and for-next: kernel BUG at fs/btrfs/extent_io.c:2350! during raid5 recovery
  2022-08-09 19:29   ` Zygo Blaxell
@ 2022-08-09 21:50     ` Qu Wenruo
  2022-08-10  8:08       ` Goffredo Baroncelli
From: Qu Wenruo @ 2022-08-09 21:50 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: linux-btrfs



On 2022/8/10 03:29, Zygo Blaxell wrote:
> On Tue, Aug 09, 2022 at 03:35:25PM +0800, Qu Wenruo wrote:
>>
>>
>> On 2022/8/9 11:31, Zygo Blaxell wrote:
>>> Test case is:
>>>
>>> 	- start with a -draid5 -mraid1 filesystem on 2 disks
>>>
>>> 	- run assorted IO with a mix of reads and writes (randomly
>>> 	run rsync, bees, snapshot create/delete, balance, scrub, start
>>> 	replacing one of the disks...)
>>>
>>> 	- cat /dev/zero > /dev/vdb (device 1) in the VM guest, or run
>>> 	blkdiscard on the underlying SSD in the VM host, to simulate
>>> 	single-disk data corruption
>>
>> One thing to mention is, this is going to cause destructive RMW to happen.
>>
>> As currently substripe write will not verify if the on-disk data stripe
>> matches its csum.
>>
>> Thus if the wipeout happens while above workload is still running, it's
>> going to corrupt data eventually.
>
> That would be a btrfs raid5 design bug,

That's a problem all RAID5 designs would have, not just btrfs.

Any P/Q based profile will have the problem.

> since disks don't schedule their
> corruptions during idle periods; however, I can modify the test design
> to avoid this case (i.e. only corrupt disks while idle or reading).

Then it's not much different than this new test case, except the
compression part:
https://patchwork.kernel.org/project/linux-btrfs/patch/20220727054148.73405-1-wqu@suse.com/

Thanks,
Qu
>
> A BUG_ON is not expected, though, no matter what the disk does.
>
>> Thanks,
>> Qu
>>>
>>> 	- repeat until something goes badly wrong, like unrecoverable
>>> 	read error or crash
>>>
>>> This test case always failed quickly before (corruption was rarely
>>> if ever fully repaired on btrfs raid5 data), and it still doesn't work
>>> now, but now it doesn't work for a new reason.  Progress?
>>>
>>> There is now a BUG_ON arising from this test case:
>>>
>>> 	[  241.051326][   T45] btrfs_print_data_csum_error: 156 callbacks suppressed
>>> 	[  241.100910][   T45] ------------[ cut here ]------------
>>> 	[  241.102531][   T45] kernel BUG at fs/btrfs/extent_io.c:2350!
>>> 	[  241.103261][   T45] invalid opcode: 0000 [#2] PREEMPT SMP PTI
>>> 	[  241.104044][   T45] CPU: 2 PID: 45 Comm: kworker/u8:4 Tainted: G      D           5.19.0-466d9d7ea677-for-next+ #85 89955463945a81b56a449b1f12383cf0d5e6b898
>>> 	[  241.105652][   T45] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
>>> 	[  241.106726][   T45] Workqueue: btrfs-endio-raid56 raid_recover_end_io_work
>>> 	[  241.107716][   T45] RIP: 0010:repair_io_failure+0x359/0x4b0
>>> 	[  241.108569][   T45] Code: 2b e8 cb 12 79 ff 48 c7 c6 20 23 ac 85 48 c7 c7 00 b9 14 88 e8 d8 e3 72 ff 48 8d bd 48 ff ff ff e8 5c 7e 26 00 e9 f6 fd ff ff <0f> 0b e8 60 d1 5e 01 85 c0 74 cc 48 c
>>> 	7 c7 b0 1d 45 88 e8 d0 8e 98
>>> 	[  241.111990][   T45] RSP: 0018:ffffbca9009f7a08 EFLAGS: 00010246
>>> 	[  241.112911][   T45] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
>>> 	[  241.115676][   T45] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
>>> 	[  241.118009][   T45] RBP: ffffbca9009f7b00 R08: 0000000000000000 R09: 0000000000000000
>>> 	[  241.119484][   T45] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9cd1b9da4000
>>> 	[  241.120717][   T45] R13: 0000000000000000 R14: ffffe60cc81a4200 R15: ffff9cd235b4dfa4
>>> 	[  241.122594][   T45] FS:  0000000000000000(0000) GS:ffff9cd2b7600000(0000) knlGS:0000000000000000
>>> 	[  241.123831][   T45] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> 	[  241.125003][   T45] CR2: 00007fbb76b1a738 CR3: 0000000109c26001 CR4: 0000000000170ee0
>>> 	[  241.126226][   T45] Call Trace:
>>> 	[  241.126646][   T45]  <TASK>
>>> 	[  241.127165][   T45]  ? __bio_clone+0x1c0/0x1c0
>>> 	[  241.128354][   T45]  clean_io_failure+0x21a/0x260
>>> 	[  241.128384][   T45]  end_compressed_bio_read+0x2a9/0x470
>>> 	[  241.128411][   T45]  bio_endio+0x361/0x3c0
>>> 	[  241.128427][   T45]  rbio_orig_end_io+0x127/0x1c0
>>> 	[  241.128447][   T45]  __raid_recover_end_io+0x405/0x8f0
>>> 	[  241.128477][   T45]  raid_recover_end_io_work+0x8c/0xb0
>>> 	[  241.128494][   T45]  process_one_work+0x4e5/0xaa0
>>> 	[  241.128528][   T45]  worker_thread+0x32e/0x720
>>> 	[  241.128541][   T45]  ? _raw_spin_unlock_irqrestore+0x7d/0xa0
>>> 	[  241.128573][   T45]  ? process_one_work+0xaa0/0xaa0
>>> 	[  241.128588][   T45]  kthread+0x1ab/0x1e0
>>> 	[  241.128600][   T45]  ? kthread_complete_and_exit+0x40/0x40
>>> 	[  241.128628][   T45]  ret_from_fork+0x22/0x30
>>> 	[  241.128659][   T45]  </TASK>
>>> 	[  241.128667][   T45] Modules linked in:
>>> 	[  241.129700][   T45] ---[ end trace 0000000000000000 ]---
>>> 	[  241.152310][   T45] RIP: 0010:repair_io_failure+0x359/0x4b0
>>> 	[  241.153328][   T45] Code: 2b e8 cb 12 79 ff 48 c7 c6 20 23 ac 85 48 c7 c7 00 b9 14 88 e8 d8 e3 72 ff 48 8d bd 48 ff ff ff e8 5c 7e 26 00 e9 f6 fd ff ff <0f> 0b e8 60 d1 5e 01 85 c0 74 cc 48 c
>>> 	7 c7 b0 1d 45 88 e8 d0 8e 98
>>> 	[  241.156882][   T45] RSP: 0018:ffffbca902487a08 EFLAGS: 00010246
>>> 	[  241.158103][   T45] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
>>> 	[  241.160072][   T45] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
>>> 	[  241.161984][   T45] RBP: ffffbca902487b00 R08: 0000000000000000 R09: 0000000000000000
>>> 	[  241.164067][   T45] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9cd1b9da4000
>>> 	[  241.165979][   T45] R13: 0000000000000000 R14: ffffe60cc7589740 R15: ffff9cd1f45495e4
>>> 	[  241.167928][   T45] FS:  0000000000000000(0000) GS:ffff9cd2b7600000(0000) knlGS:0000000000000000
>>> 	[  241.169978][   T45] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> 	[  241.171649][   T45] CR2: 00007fbb76b1a738 CR3: 0000000109c26001 CR4: 0000000000170ee0
>>>
>>> KFENCE and UBSAN aren't reporting anything before the BUG_ON.
>>>
>>> KCSAN complains about a lot of stuff as usual, including several issues
>>> in the btrfs allocator, but it doesn't look like anything that would
>>> mess with a bio.
>>>
>>> 	$ git log --no-walk --oneline FETCH_HEAD
>>> 	6130a25681d4 (kdave/for-next) Merge branch 'for-next-next-v5.20-20220804' into for-next-20220804
>>>
>>> 	repair_io_failure at fs/btrfs/extent_io.c:2350 (discriminator 1)
>>> 	 2345           u64 sector;
>>> 	 2346           struct btrfs_io_context *bioc = NULL;
>>> 	 2347           int ret = 0;
>>> 	 2348
>>> 	 2349           ASSERT(!(fs_info->sb->s_flags & SB_RDONLY));
>>> 	>2350<          BUG_ON(!mirror_num);
>>> 	 2351
>>> 	 2352           if (btrfs_repair_one_zone(fs_info, logical))
>>> 	 2353                   return 0;
>>> 	 2354
>>> 	 2355           map_length = length;


* Re: misc-next and for-next: kernel BUG at fs/btrfs/extent_io.c:2350! during raid5 recovery
  2022-08-09 19:46   ` Zygo Blaxell
@ 2022-08-10  7:17     ` Qu Wenruo
  2022-08-14  4:52     ` Qu Wenruo
From: Qu Wenruo @ 2022-08-10  7:17 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: linux-btrfs, Christoph Hellwig



On 2022/8/10 03:46, Zygo Blaxell wrote:
> On Tue, Aug 09, 2022 at 12:36:44PM +0800, Qu Wenruo wrote:
>>
>>
>> On 2022/8/9 11:31, Zygo Blaxell wrote:
>>> Test case is:
>>>
>>> 	- start with a -draid5 -mraid1 filesystem on 2 disks
>>>
>>> 	- run assorted IO with a mix of reads and writes (randomly
>>> 	run rsync, bees, snapshot create/delete, balance, scrub, start
>>> 	replacing one of the disks...)
>>>
>>> 	- cat /dev/zero > /dev/vdb (device 1) in the VM guest, or run
>>> 	blkdiscard on the underlying SSD in the VM host, to simulate
>>> 	single-disk data corruption
>>>
>>> 	- repeat until something goes badly wrong, like unrecoverable
>>> 	read error or crash
>>>
>>> This test case always failed quickly before (corruption was rarely
>>> if ever fully repaired on btrfs raid5 data), and it still doesn't work
>>> now, but now it doesn't work for a new reason.  Progress?
>>
>> The new read repair work for compressed extents, adding HCH to the thread.
>>
>> But just curious, have you tested without compression?
>
> All of the ~200 BUG_ON stack traces in my logs have the same list of
> functions as above.  If the bug affected uncompressed data, I'd expect
> to see two different stack traces.  It's a fairly decent sample size,
> so I'd say it's most likely not happening with uncompressed extents.
>
> All the production workloads have compression enabled, so we don't
> normally test with compression disabled.  I can run a separate test for
> that if you'd like.

Please do an uncompressed run, as we want to isolate the problem first.

Thanks,
Qu

>
>> Thanks,
>> Qu
>>>
>>> There is now a BUG_ON arising from this test case:
>>>
>>> 	[  241.051326][   T45] btrfs_print_data_csum_error: 156 callbacks suppressed
>>> 	[  241.100910][   T45] ------------[ cut here ]------------
>>> 	[  241.102531][   T45] kernel BUG at fs/btrfs/extent_io.c:2350!
>>> 	[  241.103261][   T45] invalid opcode: 0000 [#2] PREEMPT SMP PTI
>>> 	[  241.104044][   T45] CPU: 2 PID: 45 Comm: kworker/u8:4 Tainted: G      D           5.19.0-466d9d7ea677-for-next+ #85 89955463945a81b56a449b1f12383cf0d5e6b898
>>> 	[  241.105652][   T45] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
>>> 	[  241.106726][   T45] Workqueue: btrfs-endio-raid56 raid_recover_end_io_work
>>> 	[  241.107716][   T45] RIP: 0010:repair_io_failure+0x359/0x4b0
>>> 	[  241.108569][   T45] Code: 2b e8 cb 12 79 ff 48 c7 c6 20 23 ac 85 48 c7 c7 00 b9 14 88 e8 d8 e3 72 ff 48 8d bd 48 ff ff ff e8 5c 7e 26 00 e9 f6 fd ff ff <0f> 0b e8 60 d1 5e 01 85 c0 74 cc 48 c
>>> 	7 c7 b0 1d 45 88 e8 d0 8e 98
>>> 	[  241.111990][   T45] RSP: 0018:ffffbca9009f7a08 EFLAGS: 00010246
>>> 	[  241.112911][   T45] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
>>> 	[  241.115676][   T45] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
>>> 	[  241.118009][   T45] RBP: ffffbca9009f7b00 R08: 0000000000000000 R09: 0000000000000000
>>> 	[  241.119484][   T45] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9cd1b9da4000
>>> 	[  241.120717][   T45] R13: 0000000000000000 R14: ffffe60cc81a4200 R15: ffff9cd235b4dfa4
>>> 	[  241.122594][   T45] FS:  0000000000000000(0000) GS:ffff9cd2b7600000(0000) knlGS:0000000000000000
>>> 	[  241.123831][   T45] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> 	[  241.125003][   T45] CR2: 00007fbb76b1a738 CR3: 0000000109c26001 CR4: 0000000000170ee0
>>> 	[  241.126226][   T45] Call Trace:
>>> 	[  241.126646][   T45]  <TASK>
>>> 	[  241.127165][   T45]  ? __bio_clone+0x1c0/0x1c0
>>> 	[  241.128354][   T45]  clean_io_failure+0x21a/0x260
>>> 	[  241.128384][   T45]  end_compressed_bio_read+0x2a9/0x470
>>> 	[  241.128411][   T45]  bio_endio+0x361/0x3c0
>>> 	[  241.128427][   T45]  rbio_orig_end_io+0x127/0x1c0
>>> 	[  241.128447][   T45]  __raid_recover_end_io+0x405/0x8f0
>>> 	[  241.128477][   T45]  raid_recover_end_io_work+0x8c/0xb0
>>> 	[  241.128494][   T45]  process_one_work+0x4e5/0xaa0
>>> 	[  241.128528][   T45]  worker_thread+0x32e/0x720
>>> 	[  241.128541][   T45]  ? _raw_spin_unlock_irqrestore+0x7d/0xa0
>>> 	[  241.128573][   T45]  ? process_one_work+0xaa0/0xaa0
>>> 	[  241.128588][   T45]  kthread+0x1ab/0x1e0
>>> 	[  241.128600][   T45]  ? kthread_complete_and_exit+0x40/0x40
>>> 	[  241.128628][   T45]  ret_from_fork+0x22/0x30
>>> 	[  241.128659][   T45]  </TASK>
>>> 	[  241.128667][   T45] Modules linked in:
>>> 	[  241.129700][   T45] ---[ end trace 0000000000000000 ]---
>>> 	[  241.152310][   T45] RIP: 0010:repair_io_failure+0x359/0x4b0
>>> 	[  241.153328][   T45] Code: 2b e8 cb 12 79 ff 48 c7 c6 20 23 ac 85 48 c7 c7 00 b9 14 88 e8 d8 e3 72 ff 48 8d bd 48 ff ff ff e8 5c 7e 26 00 e9 f6 fd ff ff <0f> 0b e8 60 d1 5e 01 85 c0 74 cc 48 c
>>> 	7 c7 b0 1d 45 88 e8 d0 8e 98
>>> 	[  241.156882][   T45] RSP: 0018:ffffbca902487a08 EFLAGS: 00010246
>>> 	[  241.158103][   T45] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
>>> 	[  241.160072][   T45] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
>>> 	[  241.161984][   T45] RBP: ffffbca902487b00 R08: 0000000000000000 R09: 0000000000000000
>>> 	[  241.164067][   T45] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9cd1b9da4000
>>> 	[  241.165979][   T45] R13: 0000000000000000 R14: ffffe60cc7589740 R15: ffff9cd1f45495e4
>>> 	[  241.167928][   T45] FS:  0000000000000000(0000) GS:ffff9cd2b7600000(0000) knlGS:0000000000000000
>>> 	[  241.169978][   T45] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> 	[  241.171649][   T45] CR2: 00007fbb76b1a738 CR3: 0000000109c26001 CR4: 0000000000170ee0
>>>
>>> KFENCE and UBSAN aren't reporting anything before the BUG_ON.
>>>
>>> KCSAN complains about a lot of stuff as usual, including several issues
>>> in the btrfs allocator, but it doesn't look like anything that would
>>> mess with a bio.
>>>
>>> 	$ git log --no-walk --oneline FETCH_HEAD
>>> 	6130a25681d4 (kdave/for-next) Merge branch 'for-next-next-v5.20-20220804' into for-next-20220804
>>>
>>> 	repair_io_failure at fs/btrfs/extent_io.c:2350 (discriminator 1)
>>> 	 2345           u64 sector;
>>> 	 2346           struct btrfs_io_context *bioc = NULL;
>>> 	 2347           int ret = 0;
>>> 	 2348
>>> 	 2349           ASSERT(!(fs_info->sb->s_flags & SB_RDONLY));
>>> 	>2350<          BUG_ON(!mirror_num);
>>> 	 2351
>>> 	 2352           if (btrfs_repair_one_zone(fs_info, logical))
>>> 	 2353                   return 0;
>>> 	 2354
>>> 	 2355           map_length = length;


* Re: misc-next and for-next: kernel BUG at fs/btrfs/extent_io.c:2350! during raid5 recovery
  2022-08-09 21:50     ` Qu Wenruo
@ 2022-08-10  8:08       ` Goffredo Baroncelli
  2022-08-10  8:24         ` Qu Wenruo
From: Goffredo Baroncelli @ 2022-08-10  8:08 UTC (permalink / raw)
  To: Qu Wenruo, Zygo Blaxell; +Cc: linux-btrfs

On 09/08/2022 23.50, Qu Wenruo wrote:
>>> n 2022/8/9 11:31, Zygo Blaxell wrote:
>>>> Test case is:
>>>>
>>>>     - start with a -draid5 -mraid1 filesystem on 2 disks
>>>>
>>>>     - run assorted IO with a mix of reads and writes (randomly
>>>>     run rsync, bees, snapshot create/delete, balance, scrub, start
>>>>     replacing one of the disks...)
>>>>
>>>>     - cat /dev/zero > /dev/vdb (device 1) in the VM guest, or run
>>>>     blkdiscard on the underlying SSD in the VM host, to simulate
>>>>     single-disk data corruption
>>>
>>> One thing to mention is, this is going to cause destructive RMW to happen.
>>>
>>> As currently substripe write will not verify if the on-disk data stripe
>>> matches its csum.
>>>
>>> Thus if the wipeout happens while above workload is still running, it's
>>> going to corrupt data eventually.
>>
>> That would be a btrfs raid5 design bug,
> 
> That's something all RAID5 design would have the problem, not just btrfs.
> 
> Any P/Q based profile will have the problem.


Hi Qu,

I looked at your description of 'destructive RMW' cycle:

----
Test case btrfs/125 (and above workload) always has its trouble with
the destructive read-modify-write (RMW) cycle:

         0       32K     64K
Data1:  | Good  | Good  |
Data2:  | Bad   | Bad   |
Parity: | Good  | Good  |

In above case, if we trigger any write into Data1, we will use the bad
data in Data2 to re-generate parity, killing the only chance to recovery
Data2, thus Data2 is lost forever.
----

What I haven't understood is whether we have an "implementation problem" or an
intrinsic problem of raid56...

To calculate parity we need to know:
	- data1 (in ram)
	- data2 (not cached, bad on disk)

So, first, we need to "read data2", then calculate the parity, and then write data1.

The key factor is "read data", where we can face three cases:
1) the data is referenced and has a checksum: we can check against the checksum and, if the checksum doesn't match, we should perform a recovery (on the basis of the data stored on the disk)
2) the data is referenced but doesn't have a checksum (nocow): we cannot detect corruption of the data if checksums are not enabled; we can only ensure the availability of the data (which may be corrupted)
3) the data is not referenced: so the data is good.

So in effect, for case 2) the data may be corrupted and not recoverable (but this is true in any case); but for case 1), from a theoretical point of view, it seems recoverable. Of course this has a cost: you need to read the stripe and its checksums (doing a recovery if needed) before updating any part of the stripe itself, maintaining a strict order between the read and the write.
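
Roughly, the ordering I have in mind, as a hypothetical sketch (made-up helpers,
plain XOR parity on a two-data-stripe layout; this is not btrfs code):

#include <stdbool.h>
#include <stdint.h>

#define STRIPE_LEN (64 * 1024)

/* Made-up helpers standing in for csum lookup, device I/O and recovery. */
bool csum_matches(const uint8_t *data, uint64_t logical);
void read_stripe(uint8_t *buf, int stripe_nr);
void write_stripe(const uint8_t *buf, int stripe_nr);

/* Sub-stripe write into data stripe 0; stripe 1 is untouched, stripe 2 is P. */
void substripe_write(const uint8_t *new_d1, uint64_t d2_logical)
{
	static uint8_t old_d1[STRIPE_LEN], d2[STRIPE_LEN], parity[STRIPE_LEN];
	size_t i;

	read_stripe(old_d1, 0);
	read_stripe(d2, 1);
	read_stripe(parity, 2);

	if (!csum_matches(d2, d2_logical)) {
		/* Case 1): d2 is bad on disk.  Rebuild it from the old d1 and
		 * the old parity now, while they still describe the good d2. */
		for (i = 0; i < STRIPE_LEN; i++)
			d2[i] = old_d1[i] ^ parity[i];
		write_stripe(d2, 1);
	}

	/* Only now is it safe to regenerate parity for the new write;
	 * doing this before the check is the destructive RMW. */
	for (i = 0; i < STRIPE_LEN; i++)
		parity[i] = new_d1[i] ^ d2[i];

	write_stripe(new_d1, 0);
	write_stripe(parity, 2);
}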


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5



* Re: misc-next and for-next: kernel BUG at fs/btrfs/extent_io.c:2350! during raid5 recovery
  2022-08-10  8:08       ` Goffredo Baroncelli
@ 2022-08-10  8:24         ` Qu Wenruo
  2022-08-10  8:45           ` Goffredo Baroncelli
From: Qu Wenruo @ 2022-08-10  8:24 UTC (permalink / raw)
  To: kreijack, Zygo Blaxell; +Cc: linux-btrfs



On 2022/8/10 16:08, Goffredo Baroncelli wrote:
> On 09/08/2022 23.50, Qu Wenruo wrote:
>>>> n 2022/8/9 11:31, Zygo Blaxell wrote:
>>>>> Test case is:
>>>>>
>>>>>     - start with a -draid5 -mraid1 filesystem on 2 disks
>>>>>
>>>>>     - run assorted IO with a mix of reads and writes (randomly
>>>>>     run rsync, bees, snapshot create/delete, balance, scrub, start
>>>>>     replacing one of the disks...)
>>>>>
>>>>>     - cat /dev/zero > /dev/vdb (device 1) in the VM guest, or run
>>>>>     blkdiscard on the underlying SSD in the VM host, to simulate
>>>>>     single-disk data corruption
>>>>
>>>> One thing to mention is, this is going to cause destructive RMW to
>>>> happen.
>>>>
>>>> As currently substripe write will not verify if the on-disk data stripe
>>>> matches its csum.
>>>>
>>>> Thus if the wipeout happens while above workload is still running, it's
>>>> going to corrupt data eventually.
>>>
>>> That would be a btrfs raid5 design bug,
>>
>> That's something all RAID5 design would have the problem, not just btrfs.
>>
>> Any P/Q based profile will have the problem.
>
>
> Hi Qu,
>
> I looked at your description of 'destructive RMW' cyle:
>
> ----
> Test case btrfs/125 (and above workload) always has its trouble with
> the destructive read-modify-write (RMW) cycle:
>
>          0       32K     64K
> Data1:  | Good  | Good  |
> Data2:  | Bad   | Bad   |
> Parity: | Good  | Good  |
>
> In above case, if we trigger any write into Data1, we will use the bad
> data in Data2 to re-generate parity, killing the only chance to recovery
> Data2, thus Data2 is lost forever.
> ----
>
> What I don't understood if we have a "implementation problem" or an
> intrinsic problem of raid56...
>
> To calculate parity we need to know:
>      - data1 (in ram)
>      - data2 (not cached, bad on disk)
>
> So, first, we need to "read data2" then to calculate the parity and then
> to write data1.

First things first: the idea of RAID5/6 itself doesn't cover how to verify
whether the data is correct or not.

Btrfs is better off because it has the extra csum ability.

Thus for the bad on-disk data case, it's even worse for all the other
RAID5/6 implementations, which don't have csums for their data.

>
> The key factor is "read data", where we can face three cases:
> 1) the data is referenced and has a checksum: we can check against the
> checksum and if the checksum doesn't match we should perform a recover
> (on the basis of the data stored on the disk)

Then let's talk about the detailed problems in the btrfs-specific areas.

Yes, it's possible that we can try to check the csums when doing RMW for
sub-stripe writes.

But there are problems:

=== csum tree race ===

1) Data writes (with csum) into some data 1 stripes finished

2) By somehow the just written data is corrupted

3) RAID56 RMW triggered for write into data 2

4) We search the csum tree and find nothing, as the csum insert hasn't
    happened yet

5) Csum insert happens for data write in 1)

Then we still have the destructive RMW.
Yes, we can argue that 2) is so rare that we shouldn't really bother, but
in theory this can still happen, all just because the csum tree search
and csum insert have a race window.
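
To illustrate why that window matters, here is a purely sequential
user-space sketch of steps 1) to 5) above (one byte per block, a fake
"committed csum" flag instead of a real csum tree; nothing here is btrfs
code).  Because the csum of data 1 is not committed yet at RMW time, the
RMW regenerates parity from the corrupted block, and a later rebuild can
only reproduce the corruption:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

static uint8_t csum(uint8_t b) { return b ^ 0x5a; }	/* toy checksum */

int main(void)
{
	uint8_t data1, data2, parity;
	uint8_t committed_csum1 = 0;	/* stand-in for the csum tree item */
	bool csum1_committed = false;	/* the insert is still pending */

	/* 1) data write into data 1 finished (csum insert not done yet) */
	data1 = 0x11;
	data2 = 0x22;
	parity = data1 ^ data2;

	/* 2) the just-written data 1 gets corrupted on disk */
	data1 = 0x99;

	/* 3) + 4) RMW for a write into data 2: we search the "csum tree",
	 * find nothing for data 1, so we cannot verify it and just use it */
	if (csum1_committed && csum(data1) != committed_csum1)
		printf("would recover data 1 here\n");	/* never reached */
	data2 = 0x33;			/* the new write */
	parity = data1 ^ data2;		/* parity now built from garbage */

	/* 5) csum insert for the write in 1) finally happens */
	committed_csum1 = csum(0x11);
	csum1_committed = true;

	/* a later read of data 1 fails its csum and rebuilds from parity */
	uint8_t rebuilt = parity ^ data2;
	printf("rebuilt data 1 = 0x%02x (expected 0x11): %s\n", rebuilt,
	       csum(rebuilt) == committed_csum1 ? "ok" : "still corrupted");
	return 0;
}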

=== performance ===

This means that every time we do an RMW, we have to do a full csum tree and
extent tree search.
And this also means we have to read the full stripe, including the range
we're going to write data into.

AKA, this is full scrub level check.

This is definitely not a small cost, and it will further slow down
RAID56 writes.

Thus even if we're going to implement such a scrub-level check, I'm not
going to make it the default option, but rather make it opt-in.

That said, since I've recently been working on RAID56, I believe it's
possible to implement, unless there is some possible deadlock.

Maybe in the near future we may see some prototypes to try to address
this problem.

Thanks,
Qu


> 2) the data is referenced but doesn't have a checksum (nocow): we cannot
> ensure the corruption of the data if checksum is not enabled. We can
> only ensure the availability of the data (which may be corrupted)
> 3) the data is not referenced: so the data is good.
>
> So in effect for the case 2) the data may be corrupted and not
> recoverable (but this is true in any case); but for the case 1) from a
> theoretical point of view it seems recoverable. Of course this has a
> cost: you need to read the stripe and their checksum (doing a recovery
> if needed) before updating any part of the stripe itself, maintaining a
> strict order between the read and the writing.
>
>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: misc-next and for-next: kernel BUG at fs/btrfs/extent_io.c:2350! during raid5 recovery
  2022-08-10  8:24         ` Qu Wenruo
@ 2022-08-10  8:45           ` Goffredo Baroncelli
  2022-08-10  9:14             ` Qu Wenruo
  0 siblings, 1 reply; 19+ messages in thread
From: Goffredo Baroncelli @ 2022-08-10  8:45 UTC (permalink / raw)
  To: Qu Wenruo, Zygo Blaxell; +Cc: linux-btrfs

On 10/08/2022 10.24, Qu Wenruo wrote:
> 
> 
> On 2022/8/10 16:08, Goffredo Baroncelli wrote:
>> On 09/08/2022 23.50, Qu Wenruo wrote:
>>>>> On 2022/8/9 11:31, Zygo Blaxell wrote:
>>>>>> Test case is:
>>>>>>
>>>>>>     - start with a -draid5 -mraid1 filesystem on 2 disks
>>>>>>
>>>>>>     - run assorted IO with a mix of reads and writes (randomly
>>>>>>     run rsync, bees, snapshot create/delete, balance, scrub, start
>>>>>>     replacing one of the disks...)
>>>>>>
>>>>>>     - cat /dev/zero > /dev/vdb (device 1) in the VM guest, or run
>>>>>>     blkdiscard on the underlying SSD in the VM host, to simulate
>>>>>>     single-disk data corruption
>>>>>
>>>>> One thing to mention is, this is going to cause destructive RMW to
>>>>> happen.
>>>>>
>>>>> As currently substripe write will not verify if the on-disk data stripe
>>>>> matches its csum.
>>>>>
>>>>> Thus if the wipeout happens while above workload is still running, it's
>>>>> going to corrupt data eventually.
>>>>
>>>> That would be a btrfs raid5 design bug,
>>>
>>> That's something all RAID5 design would have the problem, not just btrfs.
>>>
>>> Any P/Q based profile will have the problem.
>>
>>
>> Hi Qu,
>>
>> I looked at your description of 'destructive RMW' cycle:
>>
>> ----
>> Test case btrfs/125 (and above workload) always has its trouble with
>> the destructive read-modify-write (RMW) cycle:
>>
>>          0       32K     64K
>> Data1:  | Good  | Good  |
>> Data2:  | Bad   | Bad   |
>> Parity: | Good  | Good  |
>>
>> In above case, if we trigger any write into Data1, we will use the bad
>> data in Data2 to re-generate parity, killing the only chance to recovery
>> Data2, thus Data2 is lost forever.
>> ----
>>
>> What I don't understood if we have a "implementation problem" or an
>> intrinsic problem of raid56...
>>
>> To calculate parity we need to know:
>>      - data1 (in ram)
>>      - data2 (not cached, bad on disk)
>>
>> So, first, we need to "read data2" then to calculate the parity and then
>> to write data1.
> 
> First thing first, the idea of RAID5/6 itself doesn't involve how to
> verify the data is correct or not.
> 
> Btrfs is better because it has extra csum ability.
> 
> Thus for bad on-disk data case, it's even worse for all the other
> RAID5/6 implementations which don't have csum for its data.
> 
>>
>> The key factor is "read data", where we can face three cases:
>> 1) the data is referenced and has a checksum: we can check against the
>> checksum and if the checksum doesn't match we should perform a recover
>> (on the basis of the data stored on the disk)
> 
> Then let's talk about the detail problems in the btrfs specific areas.
> 
> Yes, it's possible that we can try to check the csums when doing RMW for
> sub-stripe writes.
> 
> But there are problems:
> 
> === csum tree race ===
> 
> 1) Data writes (with csum) into some data 1 stripes finished
> 
> 2) By somehow the just written data is corrupted
> 
> 3) RAID56 RMW triggered for write into data 2
> 
> 4) We search csum tree, and found nothing, as the csum insert hasn't
>     happen yet
> 
This is the point. Theoretically this is doable, but it would require a "strict"
order not only inside the stripe but also between all the data and metadata.
This would have an impact on performance; however, I think that is not
the biggest problem. The biggest problem is the increase in code complexity.

It seems that the ZFS approach (parity inside the extent) should not have this problem.


> 5) Csum insert happens for data write in 1)
> 
> Then we still have the destructive RMW.
> Yes we can argue that 2) is so rare that we shouldn't really bother, but
> in theory this can still happen, all just because the csum tree search
> and csum insert has race window.
> 
> === performance ===
> 
> This means, every time we do a RMW, we have to do a full csum tree and
> extent tree search.
> And this also mean, we have to read the full stripe, including the range
> we're going to write data into.

Likely we need to read the checksums in any case, because we need to update
the page that hosts them for the first write.
Also we need to read the full stripe to compute the parity; only for
raid5 can we shortcut it by reading just the old data and the old parity:

	new_parity = old_parity ^ old_data ^ new_data
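
For what it's worth, a tiny user-space check (one byte standing in for a
whole block, arbitrary values, nothing btrfs-specific) that the shortcut
gives the same parity as recomputing over the full stripe:

#include <assert.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint8_t d0 = 0xa5, d1 = 0x3c, d2 = 0x0f;	/* three data "blocks" */
	uint8_t old_parity = d0 ^ d1 ^ d2;		/* full-stripe parity */

	uint8_t new_d1 = 0x77;				/* sub-stripe RMW of d1 only */
	uint8_t shortcut = old_parity ^ d1 ^ new_d1;	/* old_parity ^ old_data ^ new_data */
	uint8_t full = d0 ^ new_d1 ^ d2;		/* recompute from the whole stripe */

	assert(shortcut == full);
	printf("shortcut parity 0x%02x == full recompute 0x%02x\n", shortcut, full);
	return 0;
}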

> 
> AKA, this is full scrub level check.
It is the usual cost that we need to pay when we read data from the disk.

> 
> This is definitely not a small cost, which will further slowdown the
> RAID56 write.
> 
> Thus even if we're going to implement such scrub level check, I'm not
> going to make it the default option, but make it opt-in.
> 
> Although recently since I'm working on RAID56, unless there are some
> possible dead-lock, I believe it's possible to implement.
> 
> Maybe in the near future we may see some prototypes to try to address
> this problem.
> 
> Thanks,
> Qu
> 
> 
>> 2) the data is referenced but doesn't have a checksum (nocow): we cannot
>> ensure the corruption of the data if checksum is not enabled. We can
>> only ensure the availability of the data (which may be corrupted)
>> 3) the data is not referenced: so the data is good.
>>
>> So in effect for the case 2) the data may be corrupted and not
>> recoverable (but this is true in any case); but for the case 1) from a
>> theoretical point of view it seems recoverable. Of course this has a
>> cost: you need to read the stripe and their checksum (doing a recovery
>> if needed) before updating any part of the stripe itself, maintaining a
>> strict order between the read and the writing.
>>
>>

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: misc-next and for-next: kernel BUG at fs/btrfs/extent_io.c:2350! during raid5 recovery
  2022-08-10  8:45           ` Goffredo Baroncelli
@ 2022-08-10  9:14             ` Qu Wenruo
  0 siblings, 0 replies; 19+ messages in thread
From: Qu Wenruo @ 2022-08-10  9:14 UTC (permalink / raw)
  To: kreijack, Zygo Blaxell; +Cc: linux-btrfs



On 2022/8/10 16:45, Goffredo Baroncelli wrote:
[...]
>>
>>>
>>> The key factor is "read data", where we can face three cases:
>>> 1) the data is referenced and has a checksum: we can check against the
>>> checksum and if the checksum doesn't match we should perform a recover
>>> (on the basis of the data stored on the disk)
>>
>> Then let's talk about the detail problems in the btrfs specific areas.
>>
>> Yes, it's possible that we can try to check the csums when doing RMW for
>> sub-stripe writes.
>>
>> But there are problems:
>>
>> === csum tree race ===
>>
>> 1) Data writes (with csum) into some data 1 stripes finished
>>
>> 2) By somehow the just written data is corrupted
>>
>> 3) RAID56 RMW triggered for write into data 2
>>
>> 4) We search csum tree, and found nothing, as the csum insert hasn't
>>     happen yet
>>
> This is the point. Theoretically this is doable, but this would require
> a "strict"
> order not only inside the stripe but also between all the data and
> metadata.
> This would have an impact on the performance; however I think that this
> is not
> the highest problem. The highest problem is increase of the complexity
> of the code.
>
> It seems that the ZFS approach (parity inside the extent) should not
> have this problem.
>
>
>> 5) Csum insert happens for data write in 1)
>>
>> Then we still have the destructive RMW.
>> Yes we can argue that 2) is so rare that we shouldn't really bother, but
>> in theory this can still happen, all just because the csum tree search
>> and csum insert has race window.
>>
>> === performance ===
>>
>> This means, every time we do a RMW, we have to do a full csum tree and
>> extent tree search.
>> And this also mean, we have to read the full stripe, including the range
>> we're going to write data into.
>
> Likely we need to read the checksums in any case, because we need to update
> the page that hosts it for the first write.

Checksum reads and full stripe reads are completely different things.

A checksum read involves several tree block reads by itself, and
the timing is very tricky.

For a normal data read, we search the csum tree beforehand, but for this
RAID56 RMW, we must search the csum tree at the very last stage before we
submit the write.

I'm not 100% confident if the timing is fine.

And for a normal write, even with RAID56 RMW, we never trigger reads of the
csum tree; we just read the full stripe at most.

Thus even though the code change may not be that complex, the timing change
can be very tricky.

> Also we need to read the full stripe to compute the parity; only for
> raid5 can we shortcut it by reading just the old data and the old parity:
>
>      new_parity = old_parity ^ old_data ^ new_data
>
>>
>> AKA, this is full scrub level check.
> Is the usual cost that we need to pay when we read a data from the disk.

Not the same. Scrub has extra setup, like marking the whole block group
RO, to avoid a lot of races.

For a regular file read, we don't need to worry about races for the range we
want to read; we have the page locked and no race can happen.

But this is no longer true for RAID56; at the RAID56 level, we have nothing
like the page cache to help us.

Thanks,
Qu

>
>>
>> This is definitely not a small cost, which will further slowdown the
>> RAID56 write.
>>
>> Thus even if we're going to implement such scrub level check, I'm not
>> going to make it the default option, but make it opt-in.
>>
>> Although recently since I'm working on RAID56, unless there are some
>> possible dead-lock, I believe it's possible to implement.
>>
>> Maybe in the near future we may see some prototypes to try to address
>> this problem.
>>
>> Thanks,
>> Qu
>>
>>
>>> 2) the data is referenced but doesn't have a checksum (nocow): we cannot
>>> ensure the corruption of the data if checksum is not enabled. We can
>>> only ensure the availability of the data (which may be corrupted)
>>> 3) the data is not referenced: so the data is good.
>>>
>>> So in effect for the case 2) the data may be corrupted and not
>>> recoverable (but this is true in any case); but for the case 1) from a
>>> theoretical point of view it seems recoverable. Of course this has a
>>> cost: you need to read the stripe and their checksum (doing a recovery
>>> if needed) before updating any part of the stripe itself, maintaining a
>>> strict order between the read and the writing.
>>>
>>>
>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: misc-next and for-next: kernel BUG at fs/btrfs/extent_io.c:2350! during raid5 recovery
  2022-08-09 19:24   ` Zygo Blaxell
@ 2022-08-12  2:58     ` Wang Yugui
  2022-08-12 22:47       ` Wang Yugui
  2022-08-13  1:50     ` Qu Wenruo
  1 sibling, 1 reply; 19+ messages in thread
From: Wang Yugui @ 2022-08-12  2:58 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: Christoph Hellwig, linux-btrfs

Hi,

With a modified script based on this reproducer,
'uncorrectable errors' are easy to reproduce.

kernel:
	5.20.0(2022/08/09) + some patches in misc-next
	0*-btrfs-scrub-properly-report-super-block-errors-in-sy.patch
	0*-btrfs-scrub-try-to-fix-super-block-errors.patch
	0*-btrfs-fix-lost-error-handling-when-looking-up-extend.patch

script:
	source ~/xfstests/local.config

	SCRATCH_DEV_ARRAY=($SCRATCH_DEV_POOL)
	umount $SCRATCH_MNT
	set -e
	uname -a

	mkfs.btrfs -f -draid5 -mraid1 ${SCRATCH_DEV_ARRAY[@]}
	mount ${SCRATCH_DEV_ARRAY[0]} $SCRATCH_MNT # -o compress=zstd,noatime
	mkdir -p $SCRATCH_MNT/dir1

	/bin/cp -a /usr/hpc-bio $SCRATCH_MNT/dir1/
	sync
	du -sh $SCRATCH_MNT
	cd $SCRATCH_MNT
	while true; do
		find -type f -exec cat {} + > /dev/null
	done &

	for((i=0;i>=0;++i)); do

		sync; sleep 2; sync; sleep 4; sync; sleep 20; # change the device to discard in every loop
		j=$(( i % ${#SCRATCH_DEV_ARRAY[@]} ))
		/usr/sbin/blkdiscard -f ${SCRATCH_DEV_ARRAY[$j]} >/dev/null 2>&1

		btrfs scrub start -Bd $SCRATCH_MNT | grep 'summary\|Uncorrectable'
	done &
	wait

Result:
	'uncorrectable errors' is reported in 2nd loop.

Is this a problem with the test script, or with the btrfs kernel?

Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2022/08/12


> > Any chance you could share a script for your reproducer?
> 
> The simplest reproducer is some variant of:
> 
> 	mkfs.btrfs -draid5 -mraid1 /dev/vdb /dev/vdc /dev/vdd
> 	mount /dev/vdb /mnt -ocompress=zstd,noatime
> 	cd /mnt
> 	cp -a /40gb-test-data .
> 	sync
> 	while true; do
> 		find -type f -exec cat {} + > /dev/null
> 	done &
> 	while true; do
> 		cat /dev/zero > /dev/vdb
> 	done &
> 	while true; do
> 		btrfs scrub start -Bd /mnt
> 	done &
> 	wait
> 
> but it can take a long time to hit a failure with something that gentle.
> I throw on some extra test workload (e.g. lots of rsyncs) to keep the
> page cache full and under memory pressure, which seems to speed up the
> failure rate to once every few hours.





^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: misc-next and for-next: kernel BUG at fs/btrfs/extent_io.c:2350! during raid5 recovery
  2022-08-12  2:58     ` Wang Yugui
@ 2022-08-12 22:47       ` Wang Yugui
  0 siblings, 0 replies; 19+ messages in thread
From: Wang Yugui @ 2022-08-12 22:47 UTC (permalink / raw)
  To: Zygo Blaxell, Christoph Hellwig, linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 2382 bytes --]

Hi,

If we add 'echo 3 > /proc/sys/vm/drop_caches; free -h' before
'blkdiscard', the 'uncorrectable errors' will happen in the 1st loop.

Please see the attached file for the v2 script.

Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2022/08/13

> Hi,
> 
> With a modified script based on this reproducer,
> 'uncorrectable errors' is easy to reproduce
> 
> kernel:
> 	5.20.0(2022/08/09) + some patches in misc-next
> 	0*-btrfs-scrub-properly-report-super-block-errors-in-sy.patch
> 	0*-btrfs-scrub-try-to-fix-super-block-errors.patch
> 	0*-btrfs-fix-lost-error-handling-when-looking-up-extend.patch
> 
> script:
> 	source ~/xfstests/local.config
> 
> 	SCRATCH_DEV_ARRAY=($SCRATCH_DEV_POOL)
> 	umount $SCRATCH_MNT
> 	set -e
> 	uname -a
> 
> 	mkfs.btrfs -f -draid5 -mraid1 ${SCRATCH_DEV_ARRAY[@]}
> 	mount ${SCRATCH_DEV_ARRAY[0]} $SCRATCH_MNT # -o
> 	compress=zstd,noatime
> 	mkdir -p $SCRATCH_MNT/dir1
> 
> 	/bin/cp -a /usr/hpc-bio $SCRATCH_MNT/dir1/
> 	sync
> 	du -sh $SCRATCH_MNT
> 	cd $SCRATCH_MNT
> 	while true; do
> 		find -type f -exec cat {} + > /dev/null
> 	done &
> 
> 	for((i=0;i>=0;++i)); do
> 
> 		sync; sleep 2; sync; sleep 4; sync; sleep 20; # change the
> 	device to discard in every loop
> 		j=$(( i % ${#SCRATCH_DEV_ARRAY[@]} ))
> 		/usr/sbin/blkdiscard -f ${SCRATCH_DEV_ARRAY[$j]} >/dev/null 2>&1
> 
> 		btrfs scrub start -Bd $SCRATCH_MNT | grep
> 'summary\|Uncorrectable'
> done &
> wait
> 
> Result:
> 	'uncorrectable errors' is reported in 2nd loop.
> 
> Is this a problem of this test script or btrfs kernel?
> 
> Best Regards
> Wang Yugui (wangyugui@e16-tech.com)
> 2022/08/12
> 
> 
> > > Any chance you could share a script for your reproducer?
> > 
> > The simplest reproducer is some variant of:
> > 
> > 	mkfs.btrfs -draid5 -mraid1 /dev/vdb /dev/vdc /dev/vdd
> > 	mount /dev/vdb /mnt -ocompress=zstd,noatime
> > 	cd /mnt
> > 	cp -a /40gb-test-data .
> > 	sync
> > 	while true; do
> > 		find -type f -exec cat {} + > /dev/null
> > 	done &
> > 	while true; do
> > 		cat /dev/zero > /dev/vdb
> > 	done &
> > 	while true; do
> > 		btrfs scrub start -Bd /mnt
> > 	done &
> > 	wait
> > 
> > but it can take a long time to hit a failure with something that gentle.
> > I throw on some extra test workload (e.g. lots of rsyncs) to keep the
> > page cache full and under memory pressure, which seems to speed up the
> > failure rate to once every few hours.
> 
> 
> 


[-- Attachment #2: raid5.disk.fail-v2.sh --]
[-- Type: application/octet-stream, Size: 788 bytes --]

#!/bin/bash
set -ux -o pipefail

source ~/xfstests/local.config

SCRATCH_DEV_ARRAY=($SCRATCH_DEV_POOL)
umount $SCRATCH_MNT
set -e
uname -a

mkfs.btrfs -f -draid5 -mraid1 ${SCRATCH_DEV_ARRAY[@]}
mount ${SCRATCH_DEV_ARRAY[0]} $SCRATCH_MNT # -o compress=zstd,noatime
mkdir -p $SCRATCH_MNT/dir1

/bin/cp -a /usr/hpc-bio $SCRATCH_MNT/dir1/
sync
du -sh $SCRATCH_MNT
cd $SCRATCH_MNT
while true; do
	find -type f -exec cat {} + > /dev/null
done &

for((i=1;i>=0;++i)); do

	sync; sleep 2; sync; sleep 4; sync; sleep 20; # change the device to discard in every loop
	echo 3 > /proc/sys/vm/drop_caches; free -h
	j=$(( i % ${#SCRATCH_DEV_ARRAY[@]} ))
	/usr/sbin/blkdiscard -f ${SCRATCH_DEV_ARRAY[$j]} >/dev/null 2>&1

	btrfs scrub start -Bd $SCRATCH_MNT | grep 'summary\|Uncorrectable'
done &
wait


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: misc-next and for-next: kernel BUG at fs/btrfs/extent_io.c:2350! during raid5 recovery
  2022-08-09 19:24   ` Zygo Blaxell
  2022-08-12  2:58     ` Wang Yugui
@ 2022-08-13  1:50     ` Qu Wenruo
  1 sibling, 0 replies; 19+ messages in thread
From: Qu Wenruo @ 2022-08-13  1:50 UTC (permalink / raw)
  To: Zygo Blaxell, Christoph Hellwig; +Cc: linux-btrfs



On 2022/8/10 03:24, Zygo Blaxell wrote:
> On Tue, Aug 09, 2022 at 01:29:59AM -0700, Christoph Hellwig wrote:
>> On Mon, Aug 08, 2022 at 11:31:51PM -0400, Zygo Blaxell wrote:
>>> There is now a BUG_ON arising from this test case:
>>>
>>> 	[  241.051326][   T45] btrfs_print_data_csum_error: 156 callbacks suppressed
>>> 	[  241.100910][   T45] ------------[ cut here ]------------
>>> 	[  241.102531][   T45] kernel BUG at fs/btrfs/extent_io.c:2350!
>>
>> This
>>
>>          BUG_ON(!mirror_num);
>>
>> so repair_io_failure gets called with a mirror_num of 0..
>>
>>> 	[  241.128354][   T45]  clean_io_failure+0x21a/0x260
>>
>> .. from clean_io_failure.  Which starts from failrec->this_mirror and
>> tries to go back to failrec->failed_mirror using the prev_mirror
>> helper.  prev_mirror looks like:
>>
>> static int prev_mirror(const struct io_failure_record *failrec, int cur_mirror)
>> {
>>          if (cur_mirror == 1)
>> 		return failrec->num_copies;
>> 	return cur_mirror - 1;
>> }
>>
>> So the only way we could end up with a mirror = 0 is if
>> failrec->num_copies is 0.  -failrec->num_copies is initialized
>> in btrfs_get_io_failure_record by doing:
>>
>>          failrec->num_copies = btrfs_num_copies(fs_info, failrec->logical, sectorsize);
>>
>> just adter allocating the failrec.  I can't see any obvious way how
>> btrfs_num_copies would return 0, though, as for raid5 it just copies
>> from btrfs_raid_array.
>
> Judging from prior raid5 testing behavior, it looks like there's a race
> condition specific to btrfs raid5 IO.  Previous kernel versions have had
> assorted UAF bugs from time to time that KASAN tripped over, and btrfs
> replace almost never works on the first try on a real raid5 array.
> These issues were reported years ago, but nobody seems to have been
> working on them until recently.
>
> A similar test setup previously produced data corruption during raid5
> recovery, even on an otherwise idle filesystem, at a low rate (~1
> error per 100 GB).  I expect whatever bug was leading to that hasn't
> been entirely fixed yet.
>
>> Any chance you could share a script for your reproducer?
>
> The simplest reproducer is some variant of:
>
> 	mkfs.btrfs -draid5 -mraid1 /dev/vdb /dev/vdc /dev/vdd
> 	mount /dev/vdb /mnt -ocompress=zstd,noatime
> 	cd /mnt
> 	cp -a /40gb-test-data .
> 	sync
> 	while true; do
> 		find -type f -exec cat {} + > /dev/null
> 	done &
> 	while true; do
> 		cat /dev/zero > /dev/vdb
> 	done &
> 	while true; do
> 		btrfs scrub start -Bd /mnt
> 	done &
> 	wait
>
> but it can take a long time to hit a failure with something that gentle.
> I throw on some extra test workload (e.g. lots of rsyncs) to keep the
> page cache full and under memory pressure, which seems to speed up the
> failure rate to once every few hours.

I got it reproduced; although it's a different crash, this is the most minimal
workload so far, and it doesn't even need the race of corrupting the
disk on the fly:

         mkfs.btrfs -f -d raid5 -m raid5 $dev1 $dev2 $dev3 -b 1G > /dev/null
         mount $dev1 $mnt
         $fsstress -w -d $mnt -n 100 -s 1660337237
         sync
         $fssum -A -f -w /tmp/fssum.saved $mnt
         umount $mnt

         xfs_io -c "pwrite -S 0x0 1m 1023m" $dev1
         mount $dev1 $mnt
         $fssum -r /tmp/fssum.saved $mnt > /dev/null # <<< CRASH here
         umount $mnt

The point here is that, before another BUG_ON() triggered, we got the
following "unable to find logical" messages:

[  173.089275] BTRFS critical (device dm-1): unable to find logical 18446744073709551613 length 4096
[  173.089825] BTRFS critical (device dm-1): unable to find logical 4093 length 4096
[  173.090224] BTRFS critical (device dm-1): unable to find logical 18446744073709551613 length 4096
[  173.090657] BTRFS critical (device dm-1): unable to find logical 4093 length 4096
[  173.091451] assertion failed: failrec->this_mirror == bbio->mirror_num, in fs/btrfs/extent_io.c:2566

I have even pinned down the offending commit: "btrfs: pass a btrfs_bio
to btrfs_repair_one_sector". (not RAID56 though)

The "unable to find logical" is caused by the fact that, we're repairing
HOLE extent, which should not happen at all.

Unfortunately I cannot find out why that seemingly harmless refactor is
causing the problem.
My initial guess is that bbio->file_offset is incorrect; I need to do
more debugging with the above minimal script to find out why.

Thanks,
Qu

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: misc-next and for-next: kernel BUG at fs/btrfs/extent_io.c:2350! during raid5 recovery
  2022-08-09 19:46   ` Zygo Blaxell
  2022-08-10  7:17     ` Qu Wenruo
@ 2022-08-14  4:52     ` Qu Wenruo
  2022-08-16  1:01       ` Zygo Blaxell
  1 sibling, 1 reply; 19+ messages in thread
From: Qu Wenruo @ 2022-08-14  4:52 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: linux-btrfs, Christoph Hellwig

Hi Zygo,

We have pinned down the root cause of the crash, and got a quick fix for it.

https://lore.kernel.org/linux-btrfs/1d9b69af6ce0a79e54fbaafcc65ead8f71b54b60.1660377678.git.wqu@suse.com/

Would you mind testing the above patch to see if it solves your crash?
(Don't expect this to improve RAID56 recovery though; the bug itself
is not in RAID56.)

For the RAID56 recovery, unfortunately we don't have any better way to
enhance it during writes yet.

So your tests will still lead to data corruption anyway.

Thanks,
Qu

On 2022/8/10 03:46, Zygo Blaxell wrote:
> On Tue, Aug 09, 2022 at 12:36:44PM +0800, Qu Wenruo wrote:
>>
>>
>> On 2022/8/9 11:31, Zygo Blaxell wrote:
>>> Test case is:
>>>
>>> 	- start with a -draid5 -mraid1 filesystem on 2 disks
>>>
>>> 	- run assorted IO with a mix of reads and writes (randomly
>>> 	run rsync, bees, snapshot create/delete, balance, scrub, start
>>> 	replacing one of the disks...)
>>>
>>> 	- cat /dev/zero > /dev/vdb (device 1) in the VM guest, or run
>>> 	blkdiscard on the underlying SSD in the VM host, to simulate
>>> 	single-disk data corruption
>>>
>>> 	- repeat until something goes badly wrong, like unrecoverable
>>> 	read error or crash
>>>
>>> This test case always failed quickly before (corruption was rarely
>>> if ever fully repaired on btrfs raid5 data), and it still doesn't work
>>> now, but now it doesn't work for a new reason.  Progress?
>>
>> The new read repair work for compressed extents, adding HCH to the thread.
>>
>> But just curious, have you tested without compression?
>
> All of the ~200 BUG_ON stack traces in my logs have the same list of
> functions as above.  If the bug affected uncompressed data, I'd expect
> to see two different stack traces.  It's a fairly decent sample size,
> so I'd say it's most likely not happening with uncompressed extents.
>
> All the production workloads have compression enabled, so we don't
> normally test with compression disabled.  I can run a separate test for
> that if you'd like.
>
>> Thanks,
>> Qu
>>>
>>> There is now a BUG_ON arising from this test case:
>>>
>>> 	[  241.051326][   T45] btrfs_print_data_csum_error: 156 callbacks suppressed
>>> 	[  241.100910][   T45] ------------[ cut here ]------------
>>> 	[  241.102531][   T45] kernel BUG at fs/btrfs/extent_io.c:2350!
>>> 	[  241.103261][   T45] invalid opcode: 0000 [#2] PREEMPT SMP PTI
>>> 	[  241.104044][   T45] CPU: 2 PID: 45 Comm: kworker/u8:4 Tainted: G      D           5.19.0-466d9d7ea677-for-next+ #85 89955463945a81b56a449b1f12383cf0d5e6b898
>>> 	[  241.105652][   T45] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
>>> 	[  241.106726][   T45] Workqueue: btrfs-endio-raid56 raid_recover_end_io_work
>>> 	[  241.107716][   T45] RIP: 0010:repair_io_failure+0x359/0x4b0
>>> 	[  241.108569][   T45] Code: 2b e8 cb 12 79 ff 48 c7 c6 20 23 ac 85 48 c7 c7 00 b9 14 88 e8 d8 e3 72 ff 48 8d bd 48 ff ff ff e8 5c 7e 26 00 e9 f6 fd ff ff <0f> 0b e8 60 d1 5e 01 85 c0 74 cc 48 c
>>> 	7 c7 b0 1d 45 88 e8 d0 8e 98
>>> 	[  241.111990][   T45] RSP: 0018:ffffbca9009f7a08 EFLAGS: 00010246
>>> 	[  241.112911][   T45] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
>>> 	[  241.115676][   T45] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
>>> 	[  241.118009][   T45] RBP: ffffbca9009f7b00 R08: 0000000000000000 R09: 0000000000000000
>>> 	[  241.119484][   T45] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9cd1b9da4000
>>> 	[  241.120717][   T45] R13: 0000000000000000 R14: ffffe60cc81a4200 R15: ffff9cd235b4dfa4
>>> 	[  241.122594][   T45] FS:  0000000000000000(0000) GS:ffff9cd2b7600000(0000) knlGS:0000000000000000
>>> 	[  241.123831][   T45] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> 	[  241.125003][   T45] CR2: 00007fbb76b1a738 CR3: 0000000109c26001 CR4: 0000000000170ee0
>>> 	[  241.126226][   T45] Call Trace:
>>> 	[  241.126646][   T45]  <TASK>
>>> 	[  241.127165][   T45]  ? __bio_clone+0x1c0/0x1c0
>>> 	[  241.128354][   T45]  clean_io_failure+0x21a/0x260
>>> 	[  241.128384][   T45]  end_compressed_bio_read+0x2a9/0x470
>>> 	[  241.128411][   T45]  bio_endio+0x361/0x3c0
>>> 	[  241.128427][   T45]  rbio_orig_end_io+0x127/0x1c0
>>> 	[  241.128447][   T45]  __raid_recover_end_io+0x405/0x8f0
>>> 	[  241.128477][   T45]  raid_recover_end_io_work+0x8c/0xb0
>>> 	[  241.128494][   T45]  process_one_work+0x4e5/0xaa0
>>> 	[  241.128528][   T45]  worker_thread+0x32e/0x720
>>> 	[  241.128541][   T45]  ? _raw_spin_unlock_irqrestore+0x7d/0xa0
>>> 	[  241.128573][   T45]  ? process_one_work+0xaa0/0xaa0
>>> 	[  241.128588][   T45]  kthread+0x1ab/0x1e0
>>> 	[  241.128600][   T45]  ? kthread_complete_and_exit+0x40/0x40
>>> 	[  241.128628][   T45]  ret_from_fork+0x22/0x30
>>> 	[  241.128659][   T45]  </TASK>
>>> 	[  241.128667][   T45] Modules linked in:
>>> 	[  241.129700][   T45] ---[ end trace 0000000000000000 ]---
>>> 	[  241.152310][   T45] RIP: 0010:repair_io_failure+0x359/0x4b0
>>> 	[  241.153328][   T45] Code: 2b e8 cb 12 79 ff 48 c7 c6 20 23 ac 85 48 c7 c7 00 b9 14 88 e8 d8 e3 72 ff 48 8d bd 48 ff ff ff e8 5c 7e 26 00 e9 f6 fd ff ff <0f> 0b e8 60 d1 5e 01 85 c0 74 cc 48 c
>>> 	7 c7 b0 1d 45 88 e8 d0 8e 98
>>> 	[  241.156882][   T45] RSP: 0018:ffffbca902487a08 EFLAGS: 00010246
>>> 	[  241.158103][   T45] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
>>> 	[  241.160072][   T45] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
>>> 	[  241.161984][   T45] RBP: ffffbca902487b00 R08: 0000000000000000 R09: 0000000000000000
>>> 	[  241.164067][   T45] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9cd1b9da4000
>>> 	[  241.165979][   T45] R13: 0000000000000000 R14: ffffe60cc7589740 R15: ffff9cd1f45495e4
>>> 	[  241.167928][   T45] FS:  0000000000000000(0000) GS:ffff9cd2b7600000(0000) knlGS:0000000000000000
>>> 	[  241.169978][   T45] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> 	[  241.171649][   T45] CR2: 00007fbb76b1a738 CR3: 0000000109c26001 CR4: 0000000000170ee0
>>>
>>> KFENCE and UBSAN aren't reporting anything before the BUG_ON.
>>>
>>> KCSAN complains about a lot of stuff as usual, including several issues
>>> in the btrfs allocator, but it doesn't look like anything that would
>>> mess with a bio.
>>>
>>> 	$ git log --no-walk --oneline FETCH_HEAD
>>> 	6130a25681d4 (kdave/for-next) Merge branch 'for-next-next-v5.20-20220804' into for-next-20220804
>>>
>>> 	repair_io_failure at fs/btrfs/extent_io.c:2350 (discriminator 1)
>>> 	 2345           u64 sector;
>>> 	 2346           struct btrfs_io_context *bioc = NULL;
>>> 	 2347           int ret = 0;
>>> 	 2348
>>> 	 2349           ASSERT(!(fs_info->sb->s_flags & SB_RDONLY));
>>> 	>2350<          BUG_ON(!mirror_num);
>>> 	 2351
>>> 	 2352           if (btrfs_repair_one_zone(fs_info, logical))
>>> 	 2353                   return 0;
>>> 	 2354
>>> 	 2355           map_length = length;

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: misc-next and for-next: kernel BUG at fs/btrfs/extent_io.c:2350! during raid5 recovery
  2022-08-14  4:52     ` Qu Wenruo
@ 2022-08-16  1:01       ` Zygo Blaxell
  2022-08-16  1:25         ` Qu Wenruo
  0 siblings, 1 reply; 19+ messages in thread
From: Zygo Blaxell @ 2022-08-16  1:01 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs, Christoph Hellwig

On Sun, Aug 14, 2022 at 12:52:24PM +0800, Qu Wenruo wrote:
> Hi Zygo,
> 
> We have pinned down the root cause of the crash, and got a quick fix for it.
> 
> https://lore.kernel.org/linux-btrfs/1d9b69af6ce0a79e54fbaafcc65ead8f71b54b60.1660377678.git.wqu@suse.com/
> 
> Mind to test above patch to see if this can solve your crash?

Without the patch, tests using compressed data and uncompressed data
both lead to crashes:

	Compressed data hits the BUG_ON from end_compressed_bio_read
	as previously reported, in less than 30 minutes.

	Uncompressed data hits a BUG at fs/btrfs/tree-mod-log.c:675 about
	once every 2 hours.

Since applying the patch, neither of these crashes has occurred so far
(37 hours).

The BUG in tree-mod-log doesn't seem to be related to the raid5 corruption
issues.  The BUG starts happening before I corrupt any of the disk data.
It might have been present before, but obscured by the much more frequent
end_compressed_bio_read crashes.

I find it very interesting that the tree-mod-log BUG seems to have
stopped immediately after applying this patch.  Could they have the same
root cause, or a related cause?

I'll treat the tree-mod-log thing as a separate bug for now, and start
a new thread if I can still reproduce it on up-to-date for-next or
misc-next.

> For the RAID56 recovery, unfortunately we don't have any better way to
> enhance it during writes yet.
> 
> So your tests will still lead to data corruption anyway.

One bug at a time...  ;)

I've adapted my setup for future tests to stop writes, sync, inject the
corruption on one drive, then either run scrub or a readonly test to
correct errors, before resuming writes.  If I understand the constraints
correctly, all errors should be recoverable.

Thanks

> Thanks,
> Qu
> 
> On 2022/8/10 03:46, Zygo Blaxell wrote:
> > On Tue, Aug 09, 2022 at 12:36:44PM +0800, Qu Wenruo wrote:
> > > 
> > > 
> > > On 2022/8/9 11:31, Zygo Blaxell wrote:
> > > > Test case is:
> > > > 
> > > > 	- start with a -draid5 -mraid1 filesystem on 2 disks
> > > > 
> > > > 	- run assorted IO with a mix of reads and writes (randomly
> > > > 	run rsync, bees, snapshot create/delete, balance, scrub, start
> > > > 	replacing one of the disks...)
> > > > 
> > > > 	- cat /dev/zero > /dev/vdb (device 1) in the VM guest, or run
> > > > 	blkdiscard on the underlying SSD in the VM host, to simulate
> > > > 	single-disk data corruption
> > > > 
> > > > 	- repeat until something goes badly wrong, like unrecoverable
> > > > 	read error or crash
> > > > 
> > > > This test case always failed quickly before (corruption was rarely
> > > > if ever fully repaired on btrfs raid5 data), and it still doesn't work
> > > > now, but now it doesn't work for a new reason.  Progress?
> > > 
> > > The new read repair work for compressed extents, adding HCH to the thread.
> > > 
> > > But just curious, have you tested without compression?
> > 
> > All of the ~200 BUG_ON stack traces in my logs have the same list of
> > functions as above.  If the bug affected uncompressed data, I'd expect
> > to see two different stack traces.  It's a fairly decent sample size,
> > so I'd say it's most likely not happening with uncompressed extents.
> > 
> > All the production workloads have compression enabled, so we don't
> > normally test with compression disabled.  I can run a separate test for
> > that if you'd like.
> > 
> > > Thanks,
> > > Qu
> > > > 
> > > > There is now a BUG_ON arising from this test case:
> > > > 
> > > > 	[  241.051326][   T45] btrfs_print_data_csum_error: 156 callbacks suppressed
> > > > 	[  241.100910][   T45] ------------[ cut here ]------------
> > > > 	[  241.102531][   T45] kernel BUG at fs/btrfs/extent_io.c:2350!
> > > > 	[  241.103261][   T45] invalid opcode: 0000 [#2] PREEMPT SMP PTI
> > > > 	[  241.104044][   T45] CPU: 2 PID: 45 Comm: kworker/u8:4 Tainted: G      D           5.19.0-466d9d7ea677-for-next+ #85 89955463945a81b56a449b1f12383cf0d5e6b898
> > > > 	[  241.105652][   T45] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
> > > > 	[  241.106726][   T45] Workqueue: btrfs-endio-raid56 raid_recover_end_io_work
> > > > 	[  241.107716][   T45] RIP: 0010:repair_io_failure+0x359/0x4b0
> > > > 	[  241.108569][   T45] Code: 2b e8 cb 12 79 ff 48 c7 c6 20 23 ac 85 48 c7 c7 00 b9 14 88 e8 d8 e3 72 ff 48 8d bd 48 ff ff ff e8 5c 7e 26 00 e9 f6 fd ff ff <0f> 0b e8 60 d1 5e 01 85 c0 74 cc 48 c
> > > > 	7 c7 b0 1d 45 88 e8 d0 8e 98
> > > > 	[  241.111990][   T45] RSP: 0018:ffffbca9009f7a08 EFLAGS: 00010246
> > > > 	[  241.112911][   T45] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
> > > > 	[  241.115676][   T45] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
> > > > 	[  241.118009][   T45] RBP: ffffbca9009f7b00 R08: 0000000000000000 R09: 0000000000000000
> > > > 	[  241.119484][   T45] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9cd1b9da4000
> > > > 	[  241.120717][   T45] R13: 0000000000000000 R14: ffffe60cc81a4200 R15: ffff9cd235b4dfa4
> > > > 	[  241.122594][   T45] FS:  0000000000000000(0000) GS:ffff9cd2b7600000(0000) knlGS:0000000000000000
> > > > 	[  241.123831][   T45] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > > 	[  241.125003][   T45] CR2: 00007fbb76b1a738 CR3: 0000000109c26001 CR4: 0000000000170ee0
> > > > 	[  241.126226][   T45] Call Trace:
> > > > 	[  241.126646][   T45]  <TASK>
> > > > 	[  241.127165][   T45]  ? __bio_clone+0x1c0/0x1c0
> > > > 	[  241.128354][   T45]  clean_io_failure+0x21a/0x260
> > > > 	[  241.128384][   T45]  end_compressed_bio_read+0x2a9/0x470
> > > > 	[  241.128411][   T45]  bio_endio+0x361/0x3c0
> > > > 	[  241.128427][   T45]  rbio_orig_end_io+0x127/0x1c0
> > > > 	[  241.128447][   T45]  __raid_recover_end_io+0x405/0x8f0
> > > > 	[  241.128477][   T45]  raid_recover_end_io_work+0x8c/0xb0
> > > > 	[  241.128494][   T45]  process_one_work+0x4e5/0xaa0
> > > > 	[  241.128528][   T45]  worker_thread+0x32e/0x720
> > > > 	[  241.128541][   T45]  ? _raw_spin_unlock_irqrestore+0x7d/0xa0
> > > > 	[  241.128573][   T45]  ? process_one_work+0xaa0/0xaa0
> > > > 	[  241.128588][   T45]  kthread+0x1ab/0x1e0
> > > > 	[  241.128600][   T45]  ? kthread_complete_and_exit+0x40/0x40
> > > > 	[  241.128628][   T45]  ret_from_fork+0x22/0x30
> > > > 	[  241.128659][   T45]  </TASK>
> > > > 	[  241.128667][   T45] Modules linked in:
> > > > 	[  241.129700][   T45] ---[ end trace 0000000000000000 ]---
> > > > 	[  241.152310][   T45] RIP: 0010:repair_io_failure+0x359/0x4b0
> > > > 	[  241.153328][   T45] Code: 2b e8 cb 12 79 ff 48 c7 c6 20 23 ac 85 48 c7 c7 00 b9 14 88 e8 d8 e3 72 ff 48 8d bd 48 ff ff ff e8 5c 7e 26 00 e9 f6 fd ff ff <0f> 0b e8 60 d1 5e 01 85 c0 74 cc 48 c
> > > > 	7 c7 b0 1d 45 88 e8 d0 8e 98
> > > > 	[  241.156882][   T45] RSP: 0018:ffffbca902487a08 EFLAGS: 00010246
> > > > 	[  241.158103][   T45] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
> > > > 	[  241.160072][   T45] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
> > > > 	[  241.161984][   T45] RBP: ffffbca902487b00 R08: 0000000000000000 R09: 0000000000000000
> > > > 	[  241.164067][   T45] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9cd1b9da4000
> > > > 	[  241.165979][   T45] R13: 0000000000000000 R14: ffffe60cc7589740 R15: ffff9cd1f45495e4
> > > > 	[  241.167928][   T45] FS:  0000000000000000(0000) GS:ffff9cd2b7600000(0000) knlGS:0000000000000000
> > > > 	[  241.169978][   T45] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > > 	[  241.171649][   T45] CR2: 00007fbb76b1a738 CR3: 0000000109c26001 CR4: 0000000000170ee0
> > > > 
> > > > KFENCE and UBSAN aren't reporting anything before the BUG_ON.
> > > > 
> > > > KCSAN complains about a lot of stuff as usual, including several issues
> > > > in the btrfs allocator, but it doesn't look like anything that would
> > > > mess with a bio.
> > > > 
> > > > 	$ git log --no-walk --oneline FETCH_HEAD
> > > > 	6130a25681d4 (kdave/for-next) Merge branch 'for-next-next-v5.20-20220804' into for-next-20220804
> > > > 
> > > > 	repair_io_failure at fs/btrfs/extent_io.c:2350 (discriminator 1)
> > > > 	 2345           u64 sector;
> > > > 	 2346           struct btrfs_io_context *bioc = NULL;
> > > > 	 2347           int ret = 0;
> > > > 	 2348
> > > > 	 2349           ASSERT(!(fs_info->sb->s_flags & SB_RDONLY));
> > > > 	>2350<          BUG_ON(!mirror_num);
> > > > 	 2351
> > > > 	 2352           if (btrfs_repair_one_zone(fs_info, logical))
> > > > 	 2353                   return 0;
> > > > 	 2354
> > > > 	 2355           map_length = length;

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: misc-next and for-next: kernel BUG at fs/btrfs/extent_io.c:2350! during raid5 recovery
  2022-08-16  1:01       ` Zygo Blaxell
@ 2022-08-16  1:25         ` Qu Wenruo
  0 siblings, 0 replies; 19+ messages in thread
From: Qu Wenruo @ 2022-08-16  1:25 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: linux-btrfs, Christoph Hellwig



On 2022/8/16 09:01, Zygo Blaxell wrote:
> On Sun, Aug 14, 2022 at 12:52:24PM +0800, Qu Wenruo wrote:
>> Hi Zygo,
>>
>> We have pinned down the root cause of the crash, and got a quick fix for it.
>>
>> https://lore.kernel.org/linux-btrfs/1d9b69af6ce0a79e54fbaafcc65ead8f71b54b60.1660377678.git.wqu@suse.com/
>>
>> Mind to test above patch to see if this can solve your crash?
>
> Without the patch, tests using compressed data and uncompressed data
> both lead to crashes:
>
> 	Compressed data hits the BUG_ON from end_compressed_bio_read
> 	as previously reported, in less than 30 minutes.
>
> 	Uncompressed data hits a BUG at fs/btrfs/tree-mod-log.c:675 about
> 	once every 2 hours.
>
> Since applying the patch, neither of these crashes has occurred so far
> (37 hours).
>
> The BUG in tree-mod-log doesn't seem to be related to the raid5 corruption
> issues.

To be more clear, the crash on end_compressed_bio_read() is not even
related to btrfs RAID56.

In theory other RAID profiles with extra duplication should also hit that.

>  The BUG starts happening before I corrupt any of the disk data.
> It might have been present before, but obscured by the much more frequent
> end_compressed_bio_read crashes.
>
> I find it very interesting that the tree-mod-log BUG seems to have
> stopped immediately after applying this patch.  Could they have the same
> root cause, or a related cause?

For the tree-mod-log one, I have no clue at all why the fix can even help...

>
> I'll treat the tree-mod-log thing as a separate bug for now, and start
> a new thread if I can still reproduce it on up-to-date for-next or
> misc-next.

That would help a lot.

>
>> For the RAID56 recovery, unfortunately we don't have any better way to
>> enhance it during writes yet.
>>
>> So your tests will still lead to data corruption anyway.
>
> One bug at a time...  ;)
>
> I've adapted my setup for future tests to stop writes, sync, inject the
> corruption on one drive, then either run scrub or a readonly test to
> correct errors, before resuming writes.  If I understand the constraints
> correctly, all errors should be recoverable.

Yes, that's correct.

If you can still hit unrecoverable errors with that, I'll spare no effort
to fix those bugs.

Thanks,
Qu

>
> Thanks
>
>> Thanks,
>> Qu
>>
>> On 2022/8/10 03:46, Zygo Blaxell wrote:
>>> On Tue, Aug 09, 2022 at 12:36:44PM +0800, Qu Wenruo wrote:
>>>>
>>>>
>>>> On 2022/8/9 11:31, Zygo Blaxell wrote:
>>>>> Test case is:
>>>>>
>>>>> 	- start with a -draid5 -mraid1 filesystem on 2 disks
>>>>>
>>>>> 	- run assorted IO with a mix of reads and writes (randomly
>>>>> 	run rsync, bees, snapshot create/delete, balance, scrub, start
>>>>> 	replacing one of the disks...)
>>>>>
>>>>> 	- cat /dev/zero > /dev/vdb (device 1) in the VM guest, or run
>>>>> 	blkdiscard on the underlying SSD in the VM host, to simulate
>>>>> 	single-disk data corruption
>>>>>
>>>>> 	- repeat until something goes badly wrong, like unrecoverable
>>>>> 	read error or crash
>>>>>
>>>>> This test case always failed quickly before (corruption was rarely
>>>>> if ever fully repaired on btrfs raid5 data), and it still doesn't work
>>>>> now, but now it doesn't work for a new reason.  Progress?
>>>>
>>>> The new read repair work for compressed extents, adding HCH to the thread.
>>>>
>>>> But just curious, have you tested without compression?
>>>
>>> All of the ~200 BUG_ON stack traces in my logs have the same list of
>>> functions as above.  If the bug affected uncompressed data, I'd expect
>>> to see two different stack traces.  It's a fairly decent sample size,
>>> so I'd say it's most likely not happening with uncompressed extents.
>>>
>>> All the production workloads have compression enabled, so we don't
>>> normally test with compression disabled.  I can run a separate test for
>>> that if you'd like.
>>>
>>>> Thanks,
>>>> Qu
>>>>>
>>>>> There is now a BUG_ON arising from this test case:
>>>>>
>>>>> 	[  241.051326][   T45] btrfs_print_data_csum_error: 156 callbacks suppressed
>>>>> 	[  241.100910][   T45] ------------[ cut here ]------------
>>>>> 	[  241.102531][   T45] kernel BUG at fs/btrfs/extent_io.c:2350!
>>>>> 	[  241.103261][   T45] invalid opcode: 0000 [#2] PREEMPT SMP PTI
>>>>> 	[  241.104044][   T45] CPU: 2 PID: 45 Comm: kworker/u8:4 Tainted: G      D           5.19.0-466d9d7ea677-for-next+ #85 89955463945a81b56a449b1f12383cf0d5e6b898
>>>>> 	[  241.105652][   T45] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
>>>>> 	[  241.106726][   T45] Workqueue: btrfs-endio-raid56 raid_recover_end_io_work
>>>>> 	[  241.107716][   T45] RIP: 0010:repair_io_failure+0x359/0x4b0
>>>>> 	[  241.108569][   T45] Code: 2b e8 cb 12 79 ff 48 c7 c6 20 23 ac 85 48 c7 c7 00 b9 14 88 e8 d8 e3 72 ff 48 8d bd 48 ff ff ff e8 5c 7e 26 00 e9 f6 fd ff ff <0f> 0b e8 60 d1 5e 01 85 c0 74 cc 48 c
>>>>> 	7 c7 b0 1d 45 88 e8 d0 8e 98
>>>>> 	[  241.111990][   T45] RSP: 0018:ffffbca9009f7a08 EFLAGS: 00010246
>>>>> 	[  241.112911][   T45] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
>>>>> 	[  241.115676][   T45] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
>>>>> 	[  241.118009][   T45] RBP: ffffbca9009f7b00 R08: 0000000000000000 R09: 0000000000000000
>>>>> 	[  241.119484][   T45] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9cd1b9da4000
>>>>> 	[  241.120717][   T45] R13: 0000000000000000 R14: ffffe60cc81a4200 R15: ffff9cd235b4dfa4
>>>>> 	[  241.122594][   T45] FS:  0000000000000000(0000) GS:ffff9cd2b7600000(0000) knlGS:0000000000000000
>>>>> 	[  241.123831][   T45] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>> 	[  241.125003][   T45] CR2: 00007fbb76b1a738 CR3: 0000000109c26001 CR4: 0000000000170ee0
>>>>> 	[  241.126226][   T45] Call Trace:
>>>>> 	[  241.126646][   T45]  <TASK>
>>>>> 	[  241.127165][   T45]  ? __bio_clone+0x1c0/0x1c0
>>>>> 	[  241.128354][   T45]  clean_io_failure+0x21a/0x260
>>>>> 	[  241.128384][   T45]  end_compressed_bio_read+0x2a9/0x470
>>>>> 	[  241.128411][   T45]  bio_endio+0x361/0x3c0
>>>>> 	[  241.128427][   T45]  rbio_orig_end_io+0x127/0x1c0
>>>>> 	[  241.128447][   T45]  __raid_recover_end_io+0x405/0x8f0
>>>>> 	[  241.128477][   T45]  raid_recover_end_io_work+0x8c/0xb0
>>>>> 	[  241.128494][   T45]  process_one_work+0x4e5/0xaa0
>>>>> 	[  241.128528][   T45]  worker_thread+0x32e/0x720
>>>>> 	[  241.128541][   T45]  ? _raw_spin_unlock_irqrestore+0x7d/0xa0
>>>>> 	[  241.128573][   T45]  ? process_one_work+0xaa0/0xaa0
>>>>> 	[  241.128588][   T45]  kthread+0x1ab/0x1e0
>>>>> 	[  241.128600][   T45]  ? kthread_complete_and_exit+0x40/0x40
>>>>> 	[  241.128628][   T45]  ret_from_fork+0x22/0x30
>>>>> 	[  241.128659][   T45]  </TASK>
>>>>> 	[  241.128667][   T45] Modules linked in:
>>>>> 	[  241.129700][   T45] ---[ end trace 0000000000000000 ]---
>>>>> 	[  241.152310][   T45] RIP: 0010:repair_io_failure+0x359/0x4b0
>>>>> 	[  241.153328][   T45] Code: 2b e8 cb 12 79 ff 48 c7 c6 20 23 ac 85 48 c7 c7 00 b9 14 88 e8 d8 e3 72 ff 48 8d bd 48 ff ff ff e8 5c 7e 26 00 e9 f6 fd ff ff <0f> 0b e8 60 d1 5e 01 85 c0 74 cc 48 c
>>>>> 	7 c7 b0 1d 45 88 e8 d0 8e 98
>>>>> 	[  241.156882][   T45] RSP: 0018:ffffbca902487a08 EFLAGS: 00010246
>>>>> 	[  241.158103][   T45] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
>>>>> 	[  241.160072][   T45] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
>>>>> 	[  241.161984][   T45] RBP: ffffbca902487b00 R08: 0000000000000000 R09: 0000000000000000
>>>>> 	[  241.164067][   T45] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9cd1b9da4000
>>>>> 	[  241.165979][   T45] R13: 0000000000000000 R14: ffffe60cc7589740 R15: ffff9cd1f45495e4
>>>>> 	[  241.167928][   T45] FS:  0000000000000000(0000) GS:ffff9cd2b7600000(0000) knlGS:0000000000000000
>>>>> 	[  241.169978][   T45] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>> 	[  241.171649][   T45] CR2: 00007fbb76b1a738 CR3: 0000000109c26001 CR4: 0000000000170ee0
>>>>>
>>>>> KFENCE and UBSAN aren't reporting anything before the BUG_ON.
>>>>>
>>>>> KCSAN complains about a lot of stuff as usual, including several issues
>>>>> in the btrfs allocator, but it doesn't look like anything that would
>>>>> mess with a bio.
>>>>>
>>>>> 	$ git log --no-walk --oneline FETCH_HEAD
>>>>> 	6130a25681d4 (kdave/for-next) Merge branch 'for-next-next-v5.20-20220804' into for-next-20220804
>>>>>
>>>>> 	repair_io_failure at fs/btrfs/extent_io.c:2350 (discriminator 1)
>>>>> 	 2345           u64 sector;
>>>>> 	 2346           struct btrfs_io_context *bioc = NULL;
>>>>> 	 2347           int ret = 0;
>>>>> 	 2348
>>>>> 	 2349           ASSERT(!(fs_info->sb->s_flags & SB_RDONLY));
>>>>> 	>2350<          BUG_ON(!mirror_num);
>>>>> 	 2351
>>>>> 	 2352           if (btrfs_repair_one_zone(fs_info, logical))
>>>>> 	 2353                   return 0;
>>>>> 	 2354
>>>>> 	 2355           map_length = length;
>>
>> enhance it during writes yet.
>>
>> So your tests will still lead to data corruption anyway.
>>
>> Thanks,
>> Qu
>>
>> On 2022/8/10 03:46, Zygo Blaxell wrote:
>>> On Tue, Aug 09, 2022 at 12:36:44PM +0800, Qu Wenruo wrote:
>>>>
>>>>
>>>> On 2022/8/9 11:31, Zygo Blaxell wrote:
>>>>> Test case is:
>>>>>
>>>>> 	- start with a -draid5 -mraid1 filesystem on 2 disks
>>>>>
>>>>> 	- run assorted IO with a mix of reads and writes (randomly
>>>>> 	run rsync, bees, snapshot create/delete, balance, scrub, start
>>>>> 	replacing one of the disks...)
>>>>>
>>>>> 	- cat /dev/zero > /dev/vdb (device 1) in the VM guest, or run
>>>>> 	blkdiscard on the underlying SSD in the VM host, to simulate
>>>>> 	single-disk data corruption
>>>>>
>>>>> 	- repeat until something goes badly wrong, like unrecoverable
>>>>> 	read error or crash
>>>>>
>>>>> This test case always failed quickly before (corruption was rarely
>>>>> if ever fully repaired on btrfs raid5 data), and it still doesn't work
>>>>> now, but now it doesn't work for a new reason.  Progress?
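(A minimal sketch of the corruption-and-repair loop described above, for anyone trying to reproduce it; the device names, mount point, and zstd compression are assumptions for illustration, and the real harness also mixes in bees, snapshot create/delete, balance, and device replace.)

	# two-device raid5 data / raid1 metadata filesystem (assumed devices)
	mkfs.btrfs -f -draid5 -mraid1 /dev/vdb /dev/vdc
	mount -o compress=zstd /dev/vdb /mnt/test

	# keep a mixed read/write workload running in the background
	rsync -a /usr /mnt/test/ &

	# simulate single-disk data corruption on device 1
	cat /dev/zero > /dev/vdb || true

	# force reads of the damaged copies and attempt repair
	btrfs scrub start -B /mnt/test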
>>>>
>>>> This looks related to the new read repair work for compressed extents; adding HCH to the thread.
>>>>
>>>> But just curious, have you tested without compression?
>>>
>>> All of the ~200 BUG_ON stack traces in my logs have the same list of
>>> functions as above.  If the bug affected uncompressed data, I'd expect
>>> to see two different stack traces.  It's a fairly decent sample size,
>>> so I'd say it's most likely not happening with uncompressed extents.
>>>
>>> All the production workloads have compression enabled, so we don't
>>> normally test with compression disabled.  I can run a separate test for
>>> that if you'd like.
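(For a compression-disabled comparison run, the same loop could be repeated with the compress mount option turned off so newly written extents stay uncompressed; a sketch under the same assumed device layout as above.)

	# fresh filesystem so no previously compressed extents remain
	mkfs.btrfs -f -draid5 -mraid1 /dev/vdb /dev/vdc
	mount -o compress=no /dev/vdb /mnt/test
	rsync -a /usr /mnt/test/ &
	cat /dev/zero > /dev/vdb || true
	btrfs scrub start -B /mnt/test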
>>>
>>>> Thanks,
>>>> Qu

Thread overview: 19+ messages
2022-08-09  3:31 misc-next and for-next: kernel BUG at fs/btrfs/extent_io.c:2350! during raid5 recovery Zygo Blaxell
2022-08-09  4:36 ` Qu Wenruo
2022-08-09 19:46   ` Zygo Blaxell
2022-08-10  7:17     ` Qu Wenruo
2022-08-14  4:52     ` Qu Wenruo
2022-08-16  1:01       ` Zygo Blaxell
2022-08-16  1:25         ` Qu Wenruo
2022-08-09  7:35 ` Qu Wenruo
2022-08-09 19:29   ` Zygo Blaxell
2022-08-09 21:50     ` Qu Wenruo
2022-08-10  8:08       ` Goffredo Baroncelli
2022-08-10  8:24         ` Qu Wenruo
2022-08-10  8:45           ` Goffredo Baroncelli
2022-08-10  9:14             ` Qu Wenruo
2022-08-09  8:29 ` Christoph Hellwig
2022-08-09 19:24   ` Zygo Blaxell
2022-08-12  2:58     ` Wang Yugui
2022-08-12 22:47       ` Wang Yugui
2022-08-13  1:50     ` Qu Wenruo
