* misc-next and for-next: kernel BUG at fs/btrfs/extent_io.c:2350! during raid5 recovery
@ 2022-08-09  3:31 Zygo Blaxell
  2022-08-09  4:36 ` Qu Wenruo
                   ` (2 more replies)
From: Zygo Blaxell @ 2022-08-09  3:31 UTC (permalink / raw)
  To: linux-btrfs

Test case is:

	- start with a -draid5 -mraid1 filesystem on 2 disks

	- run assorted IO with a mix of reads and writes (randomly
	run rsync, bees, snapshot create/delete, balance, scrub, start
	replacing one of the disks...)

	- cat /dev/zero > /dev/vdb (device 1) in the VM guest, or run
	blkdiscard on the underlying SSD in the VM host, to simulate
	single-disk data corruption

	- repeat until something goes badly wrong, like unrecoverable
	read error or crash

This test case always failed quickly before (corruption was rarely
if ever fully repaired on btrfs raid5 data), and it still doesn't work
now, but now it doesn't work for a new reason.  Progress?

There is now a BUG_ON arising from this test case:

	[  241.051326][   T45] btrfs_print_data_csum_error: 156 callbacks suppressed
	[  241.100910][   T45] ------------[ cut here ]------------
	[  241.102531][   T45] kernel BUG at fs/btrfs/extent_io.c:2350!
	[  241.103261][   T45] invalid opcode: 0000 [#2] PREEMPT SMP PTI
	[  241.104044][   T45] CPU: 2 PID: 45 Comm: kworker/u8:4 Tainted: G      D           5.19.0-466d9d7ea677-for-next+ #85 89955463945a81b56a449b1f12383cf0d5e6b898
	[  241.105652][   T45] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
	[  241.106726][   T45] Workqueue: btrfs-endio-raid56 raid_recover_end_io_work
	[  241.107716][   T45] RIP: 0010:repair_io_failure+0x359/0x4b0
	[  241.108569][   T45] Code: 2b e8 cb 12 79 ff 48 c7 c6 20 23 ac 85 48 c7 c7 00 b9 14 88 e8 d8 e3 72 ff 48 8d bd 48 ff ff ff e8 5c 7e 26 00 e9 f6 fd ff ff <0f> 0b e8 60 d1 5e 01 85 c0 74 cc 48 c
	7 c7 b0 1d 45 88 e8 d0 8e 98
	[  241.111990][   T45] RSP: 0018:ffffbca9009f7a08 EFLAGS: 00010246
	[  241.112911][   T45] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
	[  241.115676][   T45] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
	[  241.118009][   T45] RBP: ffffbca9009f7b00 R08: 0000000000000000 R09: 0000000000000000
	[  241.119484][   T45] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9cd1b9da4000
	[  241.120717][   T45] R13: 0000000000000000 R14: ffffe60cc81a4200 R15: ffff9cd235b4dfa4
	[  241.122594][   T45] FS:  0000000000000000(0000) GS:ffff9cd2b7600000(0000) knlGS:0000000000000000
	[  241.123831][   T45] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
	[  241.125003][   T45] CR2: 00007fbb76b1a738 CR3: 0000000109c26001 CR4: 0000000000170ee0
	[  241.126226][   T45] Call Trace:
	[  241.126646][   T45]  <TASK>
	[  241.127165][   T45]  ? __bio_clone+0x1c0/0x1c0
	[  241.128354][   T45]  clean_io_failure+0x21a/0x260
	[  241.128384][   T45]  end_compressed_bio_read+0x2a9/0x470
	[  241.128411][   T45]  bio_endio+0x361/0x3c0
	[  241.128427][   T45]  rbio_orig_end_io+0x127/0x1c0
	[  241.128447][   T45]  __raid_recover_end_io+0x405/0x8f0
	[  241.128477][   T45]  raid_recover_end_io_work+0x8c/0xb0
	[  241.128494][   T45]  process_one_work+0x4e5/0xaa0
	[  241.128528][   T45]  worker_thread+0x32e/0x720
	[  241.128541][   T45]  ? _raw_spin_unlock_irqrestore+0x7d/0xa0
	[  241.128573][   T45]  ? process_one_work+0xaa0/0xaa0
	[  241.128588][   T45]  kthread+0x1ab/0x1e0
	[  241.128600][   T45]  ? kthread_complete_and_exit+0x40/0x40
	[  241.128628][   T45]  ret_from_fork+0x22/0x30
	[  241.128659][   T45]  </TASK>
	[  241.128667][   T45] Modules linked in:
	[  241.129700][   T45] ---[ end trace 0000000000000000 ]---
	[  241.152310][   T45] RIP: 0010:repair_io_failure+0x359/0x4b0
	[  241.153328][   T45] Code: 2b e8 cb 12 79 ff 48 c7 c6 20 23 ac 85 48 c7 c7 00 b9 14 88 e8 d8 e3 72 ff 48 8d bd 48 ff ff ff e8 5c 7e 26 00 e9 f6 fd ff ff <0f> 0b e8 60 d1 5e 01 85 c0 74 cc 48 c
	7 c7 b0 1d 45 88 e8 d0 8e 98
	[  241.156882][   T45] RSP: 0018:ffffbca902487a08 EFLAGS: 00010246
	[  241.158103][   T45] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
	[  241.160072][   T45] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
	[  241.161984][   T45] RBP: ffffbca902487b00 R08: 0000000000000000 R09: 0000000000000000
	[  241.164067][   T45] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9cd1b9da4000
	[  241.165979][   T45] R13: 0000000000000000 R14: ffffe60cc7589740 R15: ffff9cd1f45495e4
	[  241.167928][   T45] FS:  0000000000000000(0000) GS:ffff9cd2b7600000(0000) knlGS:0000000000000000
	[  241.169978][   T45] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
	[  241.171649][   T45] CR2: 00007fbb76b1a738 CR3: 0000000109c26001 CR4: 0000000000170ee0

KFENCE and UBSAN aren't reporting anything before the BUG_ON.

KCSAN complains about a lot of stuff as usual, including several issues
in the btrfs allocator, but it doesn't look like anything that would
mess with a bio.

	$ git log --no-walk --oneline FETCH_HEAD
	6130a25681d4 (kdave/for-next) Merge branch 'for-next-next-v5.20-20220804' into for-next-20220804

	repair_io_failure at fs/btrfs/extent_io.c:2350 (discriminator 1)
	 2345           u64 sector;
	 2346           struct btrfs_io_context *bioc = NULL;
	 2347           int ret = 0;
	 2348   
	 2349           ASSERT(!(fs_info->sb->s_flags & SB_RDONLY));
	>2350<          BUG_ON(!mirror_num);
	 2351   
	 2352           if (btrfs_repair_one_zone(fs_info, logical))
	 2353                   return 0;
	 2354   
	 2355           map_length = length;


* Re: misc-next and for-next: kernel BUG at fs/btrfs/extent_io.c:2350! during raid5 recovery
  2022-08-09  3:31 misc-next and for-next: kernel BUG at fs/btrfs/extent_io.c:2350! during raid5 recovery Zygo Blaxell
@ 2022-08-09  4:36 ` Qu Wenruo
  2022-08-09 19:46   ` Zygo Blaxell
  2022-08-09  7:35 ` Qu Wenruo
  2022-08-09  8:29 ` Christoph Hellwig
From: Qu Wenruo @ 2022-08-09  4:36 UTC (permalink / raw)
  To: Zygo Blaxell, linux-btrfs, Christoph Hellwig



On 2022/8/9 11:31, Zygo Blaxell wrote:
> Test case is:
>
> 	- start with a -draid5 -mraid1 filesystem on 2 disks
>
> 	- run assorted IO with a mix of reads and writes (randomly
> 	run rsync, bees, snapshot create/delete, balance, scrub, start
> 	replacing one of the disks...)
>
> 	- cat /dev/zero > /dev/vdb (device 1) in the VM guest, or run
> 	blkdiscard on the underlying SSD in the VM host, to simulate
> 	single-disk data corruption
>
> 	- repeat until something goes badly wrong, like unrecoverable
> 	read error or crash
>
> This test case always failed quickly before (corruption was rarely
> if ever fully repaired on btrfs raid5 data), and it still doesn't work
> now, but now it doesn't work for a new reason.  Progress?

This looks like the new read repair work for compressed extents; adding HCH to the thread.

But just curious, have you tested without compression?

Thanks,
Qu
>
> There is now a BUG_ON arising from this test case:
>
> 	[  241.051326][   T45] btrfs_print_data_csum_error: 156 callbacks suppressed
> 	[  241.100910][   T45] ------------[ cut here ]------------
> 	[  241.102531][   T45] kernel BUG at fs/btrfs/extent_io.c:2350!
> 	[  241.103261][   T45] invalid opcode: 0000 [#2] PREEMPT SMP PTI
> 	[  241.104044][   T45] CPU: 2 PID: 45 Comm: kworker/u8:4 Tainted: G      D           5.19.0-466d9d7ea677-for-next+ #85 89955463945a81b56a449b1f12383cf0d5e6b898
> 	[  241.105652][   T45] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
> 	[  241.106726][   T45] Workqueue: btrfs-endio-raid56 raid_recover_end_io_work
> 	[  241.107716][   T45] RIP: 0010:repair_io_failure+0x359/0x4b0
> 	[  241.108569][   T45] Code: 2b e8 cb 12 79 ff 48 c7 c6 20 23 ac 85 48 c7 c7 00 b9 14 88 e8 d8 e3 72 ff 48 8d bd 48 ff ff ff e8 5c 7e 26 00 e9 f6 fd ff ff <0f> 0b e8 60 d1 5e 01 85 c0 74 cc 48 c
> 	7 c7 b0 1d 45 88 e8 d0 8e 98
> 	[  241.111990][   T45] RSP: 0018:ffffbca9009f7a08 EFLAGS: 00010246
> 	[  241.112911][   T45] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
> 	[  241.115676][   T45] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
> 	[  241.118009][   T45] RBP: ffffbca9009f7b00 R08: 0000000000000000 R09: 0000000000000000
> 	[  241.119484][   T45] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9cd1b9da4000
> 	[  241.120717][   T45] R13: 0000000000000000 R14: ffffe60cc81a4200 R15: ffff9cd235b4dfa4
> 	[  241.122594][   T45] FS:  0000000000000000(0000) GS:ffff9cd2b7600000(0000) knlGS:0000000000000000
> 	[  241.123831][   T45] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> 	[  241.125003][   T45] CR2: 00007fbb76b1a738 CR3: 0000000109c26001 CR4: 0000000000170ee0
> 	[  241.126226][   T45] Call Trace:
> 	[  241.126646][   T45]  <TASK>
> 	[  241.127165][   T45]  ? __bio_clone+0x1c0/0x1c0
> 	[  241.128354][   T45]  clean_io_failure+0x21a/0x260
> 	[  241.128384][   T45]  end_compressed_bio_read+0x2a9/0x470
> 	[  241.128411][   T45]  bio_endio+0x361/0x3c0
> 	[  241.128427][   T45]  rbio_orig_end_io+0x127/0x1c0
> 	[  241.128447][   T45]  __raid_recover_end_io+0x405/0x8f0
> 	[  241.128477][   T45]  raid_recover_end_io_work+0x8c/0xb0
> 	[  241.128494][   T45]  process_one_work+0x4e5/0xaa0
> 	[  241.128528][   T45]  worker_thread+0x32e/0x720
> 	[  241.128541][   T45]  ? _raw_spin_unlock_irqrestore+0x7d/0xa0
> 	[  241.128573][   T45]  ? process_one_work+0xaa0/0xaa0
> 	[  241.128588][   T45]  kthread+0x1ab/0x1e0
> 	[  241.128600][   T45]  ? kthread_complete_and_exit+0x40/0x40
> 	[  241.128628][   T45]  ret_from_fork+0x22/0x30
> 	[  241.128659][   T45]  </TASK>
> 	[  241.128667][   T45] Modules linked in:
> 	[  241.129700][   T45] ---[ end trace 0000000000000000 ]---
> 	[  241.152310][   T45] RIP: 0010:repair_io_failure+0x359/0x4b0
> 	[  241.153328][   T45] Code: 2b e8 cb 12 79 ff 48 c7 c6 20 23 ac 85 48 c7 c7 00 b9 14 88 e8 d8 e3 72 ff 48 8d bd 48 ff ff ff e8 5c 7e 26 00 e9 f6 fd ff ff <0f> 0b e8 60 d1 5e 01 85 c0 74 cc 48 c
> 	7 c7 b0 1d 45 88 e8 d0 8e 98
> 	[  241.156882][   T45] RSP: 0018:ffffbca902487a08 EFLAGS: 00010246
> 	[  241.158103][   T45] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
> 	[  241.160072][   T45] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
> 	[  241.161984][   T45] RBP: ffffbca902487b00 R08: 0000000000000000 R09: 0000000000000000
> 	[  241.164067][   T45] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9cd1b9da4000
> 	[  241.165979][   T45] R13: 0000000000000000 R14: ffffe60cc7589740 R15: ffff9cd1f45495e4
> 	[  241.167928][   T45] FS:  0000000000000000(0000) GS:ffff9cd2b7600000(0000) knlGS:0000000000000000
> 	[  241.169978][   T45] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> 	[  241.171649][   T45] CR2: 00007fbb76b1a738 CR3: 0000000109c26001 CR4: 0000000000170ee0
>
> KFENCE and UBSAN aren't reporting anything before the BUG_ON.
>
> KCSAN complains about a lot of stuff as usual, including several issues
> in the btrfs allocator, but it doesn't look like anything that would
> mess with a bio.
>
> 	$ git log --no-walk --oneline FETCH_HEAD
> 	6130a25681d4 (kdave/for-next) Merge branch 'for-next-next-v5.20-20220804' into for-next-20220804
>
> 	repair_io_failure at fs/btrfs/extent_io.c:2350 (discriminator 1)
> 	 2345           u64 sector;
> 	 2346           struct btrfs_io_context *bioc = NULL;
> 	 2347           int ret = 0;
> 	 2348
> 	 2349           ASSERT(!(fs_info->sb->s_flags & SB_RDONLY));
> 	>2350<          BUG_ON(!mirror_num);
> 	 2351
> 	 2352           if (btrfs_repair_one_zone(fs_info, logical))
> 	 2353                   return 0;
> 	 2354
> 	 2355           map_length = length;


* Re: misc-next and for-next: kernel BUG at fs/btrfs/extent_io.c:2350! during raid5 recovery
  2022-08-09  3:31 misc-next and for-next: kernel BUG at fs/btrfs/extent_io.c:2350! during raid5 recovery Zygo Blaxell
  2022-08-09  4:36 ` Qu Wenruo
@ 2022-08-09  7:35 ` Qu Wenruo
  2022-08-09 19:29   ` Zygo Blaxell
  2022-08-09  8:29 ` Christoph Hellwig
From: Qu Wenruo @ 2022-08-09  7:35 UTC (permalink / raw)
  To: Zygo Blaxell, linux-btrfs



On 2022/8/9 11:31, Zygo Blaxell wrote:
> Test case is:
>
> 	- start with a -draid5 -mraid1 filesystem on 2 disks
>
> 	- run assorted IO with a mix of reads and writes (randomly
> 	run rsync, bees, snapshot create/delete, balance, scrub, start
> 	replacing one of the disks...)
>
> 	- cat /dev/zero > /dev/vdb (device 1) in the VM guest, or run
> 	blkdiscard on the underlying SSD in the VM host, to simulate
> 	single-disk data corruption

One thing to mention is that this is going to cause destructive RMW to happen.

Currently a sub-stripe write will not verify whether the on-disk data stripe
matches its csum.

Thus if the wipeout happens while the above workload is still running, it's
going to corrupt data eventually.
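
A minimal illustration of that destructive RMW (a hypothetical, self-contained
sketch using plain XOR parity across two data stripes; this is not btrfs code):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint8_t d1 = 0xAA, d2 = 0x55;	/* original data stripes */
	uint8_t p  = d1 ^ d2;		/* parity written while both were good */
	uint8_t d2_disk = 0x00;		/* d2 later wiped/corrupted on disk */

	/* Recovery is still possible here: d1 ^ p reconstructs the old d2. */
	printf("recoverable d2: 0x%02x\n", d1 ^ p);

	/* A sub-stripe write into d1 that does not verify d2 against its
	 * csum regenerates parity from the bad on-disk d2: */
	uint8_t d1_new = 0xBB;
	p = d1_new ^ d2_disk;

	/* Now d1_new ^ p only yields the corrupted d2; the good copy is gone
	 * and the csum mismatch can no longer be repaired. */
	printf("d2 after destructive RMW: 0x%02x\n", d1_new ^ p);
	return 0;
}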

Thanks,
Qu
>
> 	- repeat until something goes badly wrong, like unrecoverable
> 	read error or crash
>
> This test case always failed quickly before (corruption was rarely
> if ever fully repaired on btrfs raid5 data), and it still doesn't work
> now, but now it doesn't work for a new reason.  Progress?
>
> There is now a BUG_ON arising from this test case:
>
> 	[  241.051326][   T45] btrfs_print_data_csum_error: 156 callbacks suppressed
> 	[  241.100910][   T45] ------------[ cut here ]------------
> 	[  241.102531][   T45] kernel BUG at fs/btrfs/extent_io.c:2350!
> 	[  241.103261][   T45] invalid opcode: 0000 [#2] PREEMPT SMP PTI
> 	[  241.104044][   T45] CPU: 2 PID: 45 Comm: kworker/u8:4 Tainted: G      D           5.19.0-466d9d7ea677-for-next+ #85 89955463945a81b56a449b1f12383cf0d5e6b898
> 	[  241.105652][   T45] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
> 	[  241.106726][   T45] Workqueue: btrfs-endio-raid56 raid_recover_end_io_work
> 	[  241.107716][   T45] RIP: 0010:repair_io_failure+0x359/0x4b0
> 	[  241.108569][   T45] Code: 2b e8 cb 12 79 ff 48 c7 c6 20 23 ac 85 48 c7 c7 00 b9 14 88 e8 d8 e3 72 ff 48 8d bd 48 ff ff ff e8 5c 7e 26 00 e9 f6 fd ff ff <0f> 0b e8 60 d1 5e 01 85 c0 74 cc 48 c
> 	7 c7 b0 1d 45 88 e8 d0 8e 98
> 	[  241.111990][   T45] RSP: 0018:ffffbca9009f7a08 EFLAGS: 00010246
> 	[  241.112911][   T45] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
> 	[  241.115676][   T45] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
> 	[  241.118009][   T45] RBP: ffffbca9009f7b00 R08: 0000000000000000 R09: 0000000000000000
> 	[  241.119484][   T45] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9cd1b9da4000
> 	[  241.120717][   T45] R13: 0000000000000000 R14: ffffe60cc81a4200 R15: ffff9cd235b4dfa4
> 	[  241.122594][   T45] FS:  0000000000000000(0000) GS:ffff9cd2b7600000(0000) knlGS:0000000000000000
> 	[  241.123831][   T45] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> 	[  241.125003][   T45] CR2: 00007fbb76b1a738 CR3: 0000000109c26001 CR4: 0000000000170ee0
> 	[  241.126226][   T45] Call Trace:
> 	[  241.126646][   T45]  <TASK>
> 	[  241.127165][   T45]  ? __bio_clone+0x1c0/0x1c0
> 	[  241.128354][   T45]  clean_io_failure+0x21a/0x260
> 	[  241.128384][   T45]  end_compressed_bio_read+0x2a9/0x470
> 	[  241.128411][   T45]  bio_endio+0x361/0x3c0
> 	[  241.128427][   T45]  rbio_orig_end_io+0x127/0x1c0
> 	[  241.128447][   T45]  __raid_recover_end_io+0x405/0x8f0
> 	[  241.128477][   T45]  raid_recover_end_io_work+0x8c/0xb0
> 	[  241.128494][   T45]  process_one_work+0x4e5/0xaa0
> 	[  241.128528][   T45]  worker_thread+0x32e/0x720
> 	[  241.128541][   T45]  ? _raw_spin_unlock_irqrestore+0x7d/0xa0
> 	[  241.128573][   T45]  ? process_one_work+0xaa0/0xaa0
> 	[  241.128588][   T45]  kthread+0x1ab/0x1e0
> 	[  241.128600][   T45]  ? kthread_complete_and_exit+0x40/0x40
> 	[  241.128628][   T45]  ret_from_fork+0x22/0x30
> 	[  241.128659][   T45]  </TASK>
> 	[  241.128667][   T45] Modules linked in:
> 	[  241.129700][   T45] ---[ end trace 0000000000000000 ]---
> 	[  241.152310][   T45] RIP: 0010:repair_io_failure+0x359/0x4b0
> 	[  241.153328][   T45] Code: 2b e8 cb 12 79 ff 48 c7 c6 20 23 ac 85 48 c7 c7 00 b9 14 88 e8 d8 e3 72 ff 48 8d bd 48 ff ff ff e8 5c 7e 26 00 e9 f6 fd ff ff <0f> 0b e8 60 d1 5e 01 85 c0 74 cc 48 c
> 	7 c7 b0 1d 45 88 e8 d0 8e 98
> 	[  241.156882][   T45] RSP: 0018:ffffbca902487a08 EFLAGS: 00010246
> 	[  241.158103][   T45] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
> 	[  241.160072][   T45] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
> 	[  241.161984][   T45] RBP: ffffbca902487b00 R08: 0000000000000000 R09: 0000000000000000
> 	[  241.164067][   T45] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9cd1b9da4000
> 	[  241.165979][   T45] R13: 0000000000000000 R14: ffffe60cc7589740 R15: ffff9cd1f45495e4
> 	[  241.167928][   T45] FS:  0000000000000000(0000) GS:ffff9cd2b7600000(0000) knlGS:0000000000000000
> 	[  241.169978][   T45] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> 	[  241.171649][   T45] CR2: 00007fbb76b1a738 CR3: 0000000109c26001 CR4: 0000000000170ee0
>
> KFENCE and UBSAN aren't reporting anything before the BUG_ON.
>
> KCSAN complains about a lot of stuff as usual, including several issues
> in the btrfs allocator, but it doesn't look like anything that would
> mess with a bio.
>
> 	$ git log --no-walk --oneline FETCH_HEAD
> 	6130a25681d4 (kdave/for-next) Merge branch 'for-next-next-v5.20-20220804' into for-next-20220804
>
> 	repair_io_failure at fs/btrfs/extent_io.c:2350 (discriminator 1)
> 	 2345           u64 sector;
> 	 2346           struct btrfs_io_context *bioc = NULL;
> 	 2347           int ret = 0;
> 	 2348
> 	 2349           ASSERT(!(fs_info->sb->s_flags & SB_RDONLY));
> 	>2350<          BUG_ON(!mirror_num);
> 	 2351
> 	 2352           if (btrfs_repair_one_zone(fs_info, logical))
> 	 2353                   return 0;
> 	 2354
> 	 2355           map_length = length;


* Re: misc-next and for-next: kernel BUG at fs/btrfs/extent_io.c:2350! during raid5 recovery
  2022-08-09  3:31 misc-next and for-next: kernel BUG at fs/btrfs/extent_io.c:2350! during raid5 recovery Zygo Blaxell
  2022-08-09  4:36 ` Qu Wenruo
  2022-08-09  7:35 ` Qu Wenruo
@ 2022-08-09  8:29 ` Christoph Hellwig
  2022-08-09 19:24   ` Zygo Blaxell
From: Christoph Hellwig @ 2022-08-09  8:29 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: linux-btrfs

On Mon, Aug 08, 2022 at 11:31:51PM -0400, Zygo Blaxell wrote:
> There is now a BUG_ON arising from this test case:
> 
> 	[  241.051326][   T45] btrfs_print_data_csum_error: 156 callbacks suppressed
> 	[  241.100910][   T45] ------------[ cut here ]------------
> 	[  241.102531][   T45] kernel BUG at fs/btrfs/extent_io.c:2350!

This

        BUG_ON(!mirror_num);

so repair_io_failure gets called with a mirror_num of 0..

> 	[  241.128354][   T45]  clean_io_failure+0x21a/0x260

.. from clean_io_failure.  Which starts from failrec->this_mirror and
tries to go back to failrec->failed_mirror using the prev_mirror
helper.  prev_mirror looks like:

static int prev_mirror(const struct io_failure_record *failrec, int cur_mirror)
{
        if (cur_mirror == 1)
		return failrec->num_copies;
	return cur_mirror - 1;
}

So the only way we could end up with a mirror = 0 is if
failrec->num_copies is 0.  failrec->num_copies is initialized
in btrfs_get_io_failure_record by doing:

        failrec->num_copies = btrfs_num_copies(fs_info, failrec->logical, sectorsize);

just after allocating the failrec.  I can't see any obvious way how
btrfs_num_copies would return 0, though, as for raid5 it just copies
from btrfs_raid_array.
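
For illustration, a simplified sketch of that walk (made-up stand-in code, not
the actual clean_io_failure implementation), showing how num_copies == 0 hands
mirror 0 straight to repair_io_failure:

struct io_failure_record {
	int this_mirror;
	int failed_mirror;
	int num_copies;		/* a value of 0 is what ends up triggering the BUG_ON */
};

static int prev_mirror(const struct io_failure_record *failrec, int cur_mirror)
{
	if (cur_mirror == 1)
		return failrec->num_copies;	/* returns 0 if num_copies == 0 */
	return cur_mirror - 1;
}

/* walk_mirrors() is a made-up name for the repair walk clean_io_failure runs. */
static void walk_mirrors(const struct io_failure_record *failrec)
{
	int mirror = failrec->this_mirror;

	/* Walk back from this_mirror toward failed_mirror, repairing each
	 * mirror on the way.  With num_copies == 0 and this_mirror == 1,
	 * the very first step yields mirror == 0, which repair_io_failure()
	 * rejects with BUG_ON(!mirror_num). */
	do {
		mirror = prev_mirror(failrec, mirror);
		/* repair_io_failure(fs_info, ..., mirror); */
	} while (mirror != failrec->failed_mirror);
}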

Any chance you could share a script for your reproducer?



* Re: misc-next and for-next: kernel BUG at fs/btrfs/extent_io.c:2350! during raid5 recovery
  2022-08-09  8:29 ` Christoph Hellwig
@ 2022-08-09 19:24   ` Zygo Blaxell
  2022-08-12  2:58     ` Wang Yugui
  2022-08-13  1:50     ` Qu Wenruo
From: Zygo Blaxell @ 2022-08-09 19:24 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-btrfs

On Tue, Aug 09, 2022 at 01:29:59AM -0700, Christoph Hellwig wrote:
> On Mon, Aug 08, 2022 at 11:31:51PM -0400, Zygo Blaxell wrote:
> > There is now a BUG_ON arising from this test case:
> > 
> > 	[  241.051326][   T45] btrfs_print_data_csum_error: 156 callbacks suppressed
> > 	[  241.100910][   T45] ------------[ cut here ]------------
> > 	[  241.102531][   T45] kernel BUG at fs/btrfs/extent_io.c:2350!
> 
> This
> 
>         BUG_ON(!mirror_num);
> 
> so repair_io_failure gets called with a mirror_num of 0..
> 
> > 	[  241.128354][   T45]  clean_io_failure+0x21a/0x260
> 
> .. from clean_io_failure.  Which starts from failrec->this_mirror and
> tries to go back to failrec->failed_mirror using the prev_mirror
> helper.  prev_mirror looks like:
> 
> static int prev_mirror(const struct io_failure_record *failrec, int cur_mirror)
> {
>         if (cur_mirror == 1)
> 		return failrec->num_copies;
> 	return cur_mirror - 1;
> }
> 
> So the only way we could end up with a mirror = 0 is if
> failrec->num_copies is 0.  -failrec->num_copies is initialized
> in btrfs_get_io_failure_record by doing:
> 
>         failrec->num_copies = btrfs_num_copies(fs_info, failrec->logical, sectorsize);
> 
> just adter allocating the failrec.  I can't see any obvious way how
> btrfs_num_copies would return 0, though, as for raid5 it just copies
> from btrfs_raid_array.

Judging from prior raid5 testing behavior, it looks like there's a race
condition specific to btrfs raid5 IO.  Previous kernel versions have had
assorted UAF bugs from time to time that KASAN tripped over, and btrfs
replace almost never works on the first try on a real raid5 array.
These issues were reported years ago, but nobody seems to have been
working on them until recently.

A similar test setup previously produced data corruption during raid5
recovery, even on an otherwise idle filesystem, at a low rate (~1
error per 100 GB).  I expect whatever bug was leading to that hasn't
been entirely fixed yet.

> Any chance you could share a script for your reproducer?

The simplest reproducer is some variant of:

	mkfs.btrfs -draid5 -mraid1 /dev/vdb /dev/vdc /dev/vdd
	mount /dev/vdb /mnt -ocompress=zstd,noatime
	cd /mnt
	cp -a /40gb-test-data .
	sync
	while true; do
		find -type f -exec cat {} + > /dev/null
	done &
	while true; do
		cat /dev/zero > /dev/vdb
	done &
	while true; do
		btrfs scrub start -Bd /mnt
	done &
	wait

but it can take a long time to hit a failure with something that gentle.
I throw on some extra test workload (e.g. lots of rsyncs) to keep the
page cache full and under memory pressure, which seems to speed up the
failure rate to once every few hours.


* Re: misc-next and for-next: kernel BUG at fs/btrfs/extent_io.c:2350! during raid5 recovery
  2022-08-09  7:35 ` Qu Wenruo
@ 2022-08-09 19:29   ` Zygo Blaxell
  2022-08-09 21:50     ` Qu Wenruo
From: Zygo Blaxell @ 2022-08-09 19:29 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Tue, Aug 09, 2022 at 03:35:25PM +0800, Qu Wenruo wrote:
> 
> 
> On 2022/8/9 11:31, Zygo Blaxell wrote:
> > Test case is:
> > 
> > 	- start with a -draid5 -mraid1 filesystem on 2 disks
> > 
> > 	- run assorted IO with a mix of reads and writes (randomly
> > 	run rsync, bees, snapshot create/delete, balance, scrub, start
> > 	replacing one of the disks...)
> > 
> > 	- cat /dev/zero > /dev/vdb (device 1) in the VM guest, or run
> > 	blkdiscard on the underlying SSD in the VM host, to simulate
> > 	single-disk data corruption
> 
> One thing to mention is, this is going to cause destructive RMW to happen.
> 
> As currently substripe write will not verify if the on-disk data stripe
> matches its csum.
> 
> Thus if the wipeout happens while above workload is still running, it's
> going to corrupt data eventually.

That would be a btrfs raid5 design bug, since disks don't schedule their
corruptions during idle periods; however, I can modify the test design
to avoid this case (i.e. only corrupt disks while idle or reading).

A BUG_ON is not expected, though, no matter what the disk does.

> Thanks,
> Qu
> > 
> > 	- repeat until something goes badly wrong, like unrecoverable
> > 	read error or crash
> > 
> > This test case always failed quickly before (corruption was rarely
> > if ever fully repaired on btrfs raid5 data), and it still doesn't work
> > now, but now it doesn't work for a new reason.  Progress?
> > 
> > There is now a BUG_ON arising from this test case:
> > 
> > 	[  241.051326][   T45] btrfs_print_data_csum_error: 156 callbacks suppressed
> > 	[  241.100910][   T45] ------------[ cut here ]------------
> > 	[  241.102531][   T45] kernel BUG at fs/btrfs/extent_io.c:2350!
> > 	[  241.103261][   T45] invalid opcode: 0000 [#2] PREEMPT SMP PTI
> > 	[  241.104044][   T45] CPU: 2 PID: 45 Comm: kworker/u8:4 Tainted: G      D           5.19.0-466d9d7ea677-for-next+ #85 89955463945a81b56a449b1f12383cf0d5e6b898
> > 	[  241.105652][   T45] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
> > 	[  241.106726][   T45] Workqueue: btrfs-endio-raid56 raid_recover_end_io_work
> > 	[  241.107716][   T45] RIP: 0010:repair_io_failure+0x359/0x4b0
> > 	[  241.108569][   T45] Code: 2b e8 cb 12 79 ff 48 c7 c6 20 23 ac 85 48 c7 c7 00 b9 14 88 e8 d8 e3 72 ff 48 8d bd 48 ff ff ff e8 5c 7e 26 00 e9 f6 fd ff ff <0f> 0b e8 60 d1 5e 01 85 c0 74 cc 48 c
> > 	7 c7 b0 1d 45 88 e8 d0 8e 98
> > 	[  241.111990][   T45] RSP: 0018:ffffbca9009f7a08 EFLAGS: 00010246
> > 	[  241.112911][   T45] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
> > 	[  241.115676][   T45] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
> > 	[  241.118009][   T45] RBP: ffffbca9009f7b00 R08: 0000000000000000 R09: 0000000000000000
> > 	[  241.119484][   T45] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9cd1b9da4000
> > 	[  241.120717][   T45] R13: 0000000000000000 R14: ffffe60cc81a4200 R15: ffff9cd235b4dfa4
> > 	[  241.122594][   T45] FS:  0000000000000000(0000) GS:ffff9cd2b7600000(0000) knlGS:0000000000000000
> > 	[  241.123831][   T45] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > 	[  241.125003][   T45] CR2: 00007fbb76b1a738 CR3: 0000000109c26001 CR4: 0000000000170ee0
> > 	[  241.126226][   T45] Call Trace:
> > 	[  241.126646][   T45]  <TASK>
> > 	[  241.127165][   T45]  ? __bio_clone+0x1c0/0x1c0
> > 	[  241.128354][   T45]  clean_io_failure+0x21a/0x260
> > 	[  241.128384][   T45]  end_compressed_bio_read+0x2a9/0x470
> > 	[  241.128411][   T45]  bio_endio+0x361/0x3c0
> > 	[  241.128427][   T45]  rbio_orig_end_io+0x127/0x1c0
> > 	[  241.128447][   T45]  __raid_recover_end_io+0x405/0x8f0
> > 	[  241.128477][   T45]  raid_recover_end_io_work+0x8c/0xb0
> > 	[  241.128494][   T45]  process_one_work+0x4e5/0xaa0
> > 	[  241.128528][   T45]  worker_thread+0x32e/0x720
> > 	[  241.128541][   T45]  ? _raw_spin_unlock_irqrestore+0x7d/0xa0
> > 	[  241.128573][   T45]  ? process_one_work+0xaa0/0xaa0
> > 	[  241.128588][   T45]  kthread+0x1ab/0x1e0
> > 	[  241.128600][   T45]  ? kthread_complete_and_exit+0x40/0x40
> > 	[  241.128628][   T45]  ret_from_fork+0x22/0x30
> > 	[  241.128659][   T45]  </TASK>
> > 	[  241.128667][   T45] Modules linked in:
> > 	[  241.129700][   T45] ---[ end trace 0000000000000000 ]---
> > 	[  241.152310][   T45] RIP: 0010:repair_io_failure+0x359/0x4b0
> > 	[  241.153328][   T45] Code: 2b e8 cb 12 79 ff 48 c7 c6 20 23 ac 85 48 c7 c7 00 b9 14 88 e8 d8 e3 72 ff 48 8d bd 48 ff ff ff e8 5c 7e 26 00 e9 f6 fd ff ff <0f> 0b e8 60 d1 5e 01 85 c0 74 cc 48 c
> > 	7 c7 b0 1d 45 88 e8 d0 8e 98
> > 	[  241.156882][   T45] RSP: 0018:ffffbca902487a08 EFLAGS: 00010246
> > 	[  241.158103][   T45] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
> > 	[  241.160072][   T45] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
> > 	[  241.161984][   T45] RBP: ffffbca902487b00 R08: 0000000000000000 R09: 0000000000000000
> > 	[  241.164067][   T45] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9cd1b9da4000
> > 	[  241.165979][   T45] R13: 0000000000000000 R14: ffffe60cc7589740 R15: ffff9cd1f45495e4
> > 	[  241.167928][   T45] FS:  0000000000000000(0000) GS:ffff9cd2b7600000(0000) knlGS:0000000000000000
> > 	[  241.169978][   T45] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > 	[  241.171649][   T45] CR2: 00007fbb76b1a738 CR3: 0000000109c26001 CR4: 0000000000170ee0
> > 
> > KFENCE and UBSAN aren't reporting anything before the BUG_ON.
> > 
> > KCSAN complains about a lot of stuff as usual, including several issues
> > in the btrfs allocator, but it doesn't look like anything that would
> > mess with a bio.
> > 
> > 	$ git log --no-walk --oneline FETCH_HEAD
> > 	6130a25681d4 (kdave/for-next) Merge branch 'for-next-next-v5.20-20220804' into for-next-20220804
> > 
> > 	repair_io_failure at fs/btrfs/extent_io.c:2350 (discriminator 1)
> > 	 2345           u64 sector;
> > 	 2346           struct btrfs_io_context *bioc = NULL;
> > 	 2347           int ret = 0;
> > 	 2348
> > 	 2349           ASSERT(!(fs_info->sb->s_flags & SB_RDONLY));
> > 	>2350<          BUG_ON(!mirror_num);
> > 	 2351
> > 	 2352           if (btrfs_repair_one_zone(fs_info, logical))
> > 	 2353                   return 0;
> > 	 2354
> > 	 2355           map_length = length;


* Re: misc-next and for-next: kernel BUG at fs/btrfs/extent_io.c:2350! during raid5 recovery
  2022-08-09  4:36 ` Qu Wenruo
@ 2022-08-09 19:46   ` Zygo Blaxell
  2022-08-10  7:17     ` Qu Wenruo
  2022-08-14  4:52     ` Qu Wenruo
From: Zygo Blaxell @ 2022-08-09 19:46 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs, Christoph Hellwig

On Tue, Aug 09, 2022 at 12:36:44PM +0800, Qu Wenruo wrote:
> 
> 
> On 2022/8/9 11:31, Zygo Blaxell wrote:
> > Test case is:
> > 
> > 	- start with a -draid5 -mraid1 filesystem on 2 disks
> > 
> > 	- run assorted IO with a mix of reads and writes (randomly
> > 	run rsync, bees, snapshot create/delete, balance, scrub, start
> > 	replacing one of the disks...)
> > 
> > 	- cat /dev/zero > /dev/vdb (device 1) in the VM guest, or run
> > 	blkdiscard on the underlying SSD in the VM host, to simulate
> > 	single-disk data corruption
> > 
> > 	- repeat until something goes badly wrong, like unrecoverable
> > 	read error or crash
> > 
> > This test case always failed quickly before (corruption was rarely
> > if ever fully repaired on btrfs raid5 data), and it still doesn't work
> > now, but now it doesn't work for a new reason.  Progress?
> 
> The new read repair work for compressed extents, adding HCH to the thread.
> 
> But just curious, have you tested without compression?

All of the ~200 BUG_ON stack traces in my logs have the same list of
functions as above.  If the bug affected uncompressed data, I'd expect
to see two different stack traces.  It's a fairly decent sample size,
so I'd say it's most likely not happening with uncompressed extents.

All the production workloads have compression enabled, so we don't
normally test with compression disabled.  I can run a separate test for
that if you'd like.

> Thanks,
> Qu
> > 
> > There is now a BUG_ON arising from this test case:
> > 
> > 	[  241.051326][   T45] btrfs_print_data_csum_error: 156 callbacks suppressed
> > 	[  241.100910][   T45] ------------[ cut here ]------------
> > 	[  241.102531][   T45] kernel BUG at fs/btrfs/extent_io.c:2350!
> > 	[  241.103261][   T45] invalid opcode: 0000 [#2] PREEMPT SMP PTI
> > 	[  241.104044][   T45] CPU: 2 PID: 45 Comm: kworker/u8:4 Tainted: G      D           5.19.0-466d9d7ea677-for-next+ #85 89955463945a81b56a449b1f12383cf0d5e6b898
> > 	[  241.105652][   T45] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
> > 	[  241.106726][   T45] Workqueue: btrfs-endio-raid56 raid_recover_end_io_work
> > 	[  241.107716][   T45] RIP: 0010:repair_io_failure+0x359/0x4b0
> > 	[  241.108569][   T45] Code: 2b e8 cb 12 79 ff 48 c7 c6 20 23 ac 85 48 c7 c7 00 b9 14 88 e8 d8 e3 72 ff 48 8d bd 48 ff ff ff e8 5c 7e 26 00 e9 f6 fd ff ff <0f> 0b e8 60 d1 5e 01 85 c0 74 cc 48 c
> > 	7 c7 b0 1d 45 88 e8 d0 8e 98
> > 	[  241.111990][   T45] RSP: 0018:ffffbca9009f7a08 EFLAGS: 00010246
> > 	[  241.112911][   T45] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
> > 	[  241.115676][   T45] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
> > 	[  241.118009][   T45] RBP: ffffbca9009f7b00 R08: 0000000000000000 R09: 0000000000000000
> > 	[  241.119484][   T45] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9cd1b9da4000
> > 	[  241.120717][   T45] R13: 0000000000000000 R14: ffffe60cc81a4200 R15: ffff9cd235b4dfa4
> > 	[  241.122594][   T45] FS:  0000000000000000(0000) GS:ffff9cd2b7600000(0000) knlGS:0000000000000000
> > 	[  241.123831][   T45] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > 	[  241.125003][   T45] CR2: 00007fbb76b1a738 CR3: 0000000109c26001 CR4: 0000000000170ee0
> > 	[  241.126226][   T45] Call Trace:
> > 	[  241.126646][   T45]  <TASK>
> > 	[  241.127165][   T45]  ? __bio_clone+0x1c0/0x1c0
> > 	[  241.128354][   T45]  clean_io_failure+0x21a/0x260
> > 	[  241.128384][   T45]  end_compressed_bio_read+0x2a9/0x470
> > 	[  241.128411][   T45]  bio_endio+0x361/0x3c0
> > 	[  241.128427][   T45]  rbio_orig_end_io+0x127/0x1c0
> > 	[  241.128447][   T45]  __raid_recover_end_io+0x405/0x8f0
> > 	[  241.128477][   T45]  raid_recover_end_io_work+0x8c/0xb0
> > 	[  241.128494][   T45]  process_one_work+0x4e5/0xaa0
> > 	[  241.128528][   T45]  worker_thread+0x32e/0x720
> > 	[  241.128541][   T45]  ? _raw_spin_unlock_irqrestore+0x7d/0xa0
> > 	[  241.128573][   T45]  ? process_one_work+0xaa0/0xaa0
> > 	[  241.128588][   T45]  kthread+0x1ab/0x1e0
> > 	[  241.128600][   T45]  ? kthread_complete_and_exit+0x40/0x40
> > 	[  241.128628][   T45]  ret_from_fork+0x22/0x30
> > 	[  241.128659][   T45]  </TASK>
> > 	[  241.128667][   T45] Modules linked in:
> > 	[  241.129700][   T45] ---[ end trace 0000000000000000 ]---
> > 	[  241.152310][   T45] RIP: 0010:repair_io_failure+0x359/0x4b0
> > 	[  241.153328][   T45] Code: 2b e8 cb 12 79 ff 48 c7 c6 20 23 ac 85 48 c7 c7 00 b9 14 88 e8 d8 e3 72 ff 48 8d bd 48 ff ff ff e8 5c 7e 26 00 e9 f6 fd ff ff <0f> 0b e8 60 d1 5e 01 85 c0 74 cc 48 c
> > 	7 c7 b0 1d 45 88 e8 d0 8e 98
> > 	[  241.156882][   T45] RSP: 0018:ffffbca902487a08 EFLAGS: 00010246
> > 	[  241.158103][   T45] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
> > 	[  241.160072][   T45] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
> > 	[  241.161984][   T45] RBP: ffffbca902487b00 R08: 0000000000000000 R09: 0000000000000000
> > 	[  241.164067][   T45] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9cd1b9da4000
> > 	[  241.165979][   T45] R13: 0000000000000000 R14: ffffe60cc7589740 R15: ffff9cd1f45495e4
> > 	[  241.167928][   T45] FS:  0000000000000000(0000) GS:ffff9cd2b7600000(0000) knlGS:0000000000000000
> > 	[  241.169978][   T45] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > 	[  241.171649][   T45] CR2: 00007fbb76b1a738 CR3: 0000000109c26001 CR4: 0000000000170ee0
> > 
> > KFENCE and UBSAN aren't reporting anything before the BUG_ON.
> > 
> > KCSAN complains about a lot of stuff as usual, including several issues
> > in the btrfs allocator, but it doesn't look like anything that would
> > mess with a bio.
> > 
> > 	$ git log --no-walk --oneline FETCH_HEAD
> > 	6130a25681d4 (kdave/for-next) Merge branch 'for-next-next-v5.20-20220804' into for-next-20220804
> > 
> > 	repair_io_failure at fs/btrfs/extent_io.c:2350 (discriminator 1)
> > 	 2345           u64 sector;
> > 	 2346           struct btrfs_io_context *bioc = NULL;
> > 	 2347           int ret = 0;
> > 	 2348
> > 	 2349           ASSERT(!(fs_info->sb->s_flags & SB_RDONLY));
> > 	>2350<          BUG_ON(!mirror_num);
> > 	 2351
> > 	 2352           if (btrfs_repair_one_zone(fs_info, logical))
> > 	 2353                   return 0;
> > 	 2354
> > 	 2355           map_length = length;


* Re: misc-next and for-next: kernel BUG at fs/btrfs/extent_io.c:2350! during raid5 recovery
  2022-08-09 19:29   ` Zygo Blaxell
@ 2022-08-09 21:50     ` Qu Wenruo
  2022-08-10  8:08       ` Goffredo Baroncelli
From: Qu Wenruo @ 2022-08-09 21:50 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: linux-btrfs



On 2022/8/10 03:29, Zygo Blaxell wrote:
> On Tue, Aug 09, 2022 at 03:35:25PM +0800, Qu Wenruo wrote:
>>
>>
>> On 2022/8/9 11:31, Zygo Blaxell wrote:
>>> Test case is:
>>>
>>> 	- start with a -draid5 -mraid1 filesystem on 2 disks
>>>
>>> 	- run assorted IO with a mix of reads and writes (randomly
>>> 	run rsync, bees, snapshot create/delete, balance, scrub, start
>>> 	replacing one of the disks...)
>>>
>>> 	- cat /dev/zero > /dev/vdb (device 1) in the VM guest, or run
>>> 	blkdiscard on the underlying SSD in the VM host, to simulate
>>> 	single-disk data corruption
>>
>> One thing to mention is, this is going to cause destructive RMW to happen.
>>
>> As currently substripe write will not verify if the on-disk data stripe
>> matches its csum.
>>
>> Thus if the wipeout happens while above workload is still running, it's
>> going to corrupt data eventually.
>
> That would be a btrfs raid5 design bug,

That's a problem all RAID5 designs would have, not just btrfs.

Any P/Q based profile will have the problem.

> since disks don't schedule their
> corruptions during idle periods; however, I can modify the test design
> to avoid this case (i.e. only corrupt disks while idle or reading).

Then it's not much different than this new test case, except the
compression part:
https://patchwork.kernel.org/project/linux-btrfs/patch/20220727054148.73405-1-wqu@suse.com/

Thanks,
Qu
>
> A BUG_ON is not expected, though, no matter what the disk does.
>
>> Thanks,
>> Qu
>>>
>>> 	- repeat until something goes badly wrong, like unrecoverable
>>> 	read error or crash
>>>
>>> This test case always failed quickly before (corruption was rarely
>>> if ever fully repaired on btrfs raid5 data), and it still doesn't work
>>> now, but now it doesn't work for a new reason.  Progress?
>>>
>>> There is now a BUG_ON arising from this test case:
>>>
>>> 	[  241.051326][   T45] btrfs_print_data_csum_error: 156 callbacks suppressed
>>> 	[  241.100910][   T45] ------------[ cut here ]------------
>>> 	[  241.102531][   T45] kernel BUG at fs/btrfs/extent_io.c:2350!
>>> 	[  241.103261][   T45] invalid opcode: 0000 [#2] PREEMPT SMP PTI
>>> 	[  241.104044][   T45] CPU: 2 PID: 45 Comm: kworker/u8:4 Tainted: G      D           5.19.0-466d9d7ea677-for-next+ #85 89955463945a81b56a449b1f12383cf0d5e6b898
>>> 	[  241.105652][   T45] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
>>> 	[  241.106726][   T45] Workqueue: btrfs-endio-raid56 raid_recover_end_io_work
>>> 	[  241.107716][   T45] RIP: 0010:repair_io_failure+0x359/0x4b0
>>> 	[  241.108569][   T45] Code: 2b e8 cb 12 79 ff 48 c7 c6 20 23 ac 85 48 c7 c7 00 b9 14 88 e8 d8 e3 72 ff 48 8d bd 48 ff ff ff e8 5c 7e 26 00 e9 f6 fd ff ff <0f> 0b e8 60 d1 5e 01 85 c0 74 cc 48 c
>>> 	7 c7 b0 1d 45 88 e8 d0 8e 98
>>> 	[  241.111990][   T45] RSP: 0018:ffffbca9009f7a08 EFLAGS: 00010246
>>> 	[  241.112911][   T45] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
>>> 	[  241.115676][   T45] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
>>> 	[  241.118009][   T45] RBP: ffffbca9009f7b00 R08: 0000000000000000 R09: 0000000000000000
>>> 	[  241.119484][   T45] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9cd1b9da4000
>>> 	[  241.120717][   T45] R13: 0000000000000000 R14: ffffe60cc81a4200 R15: ffff9cd235b4dfa4
>>> 	[  241.122594][   T45] FS:  0000000000000000(0000) GS:ffff9cd2b7600000(0000) knlGS:0000000000000000
>>> 	[  241.123831][   T45] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> 	[  241.125003][   T45] CR2: 00007fbb76b1a738 CR3: 0000000109c26001 CR4: 0000000000170ee0
>>> 	[  241.126226][   T45] Call Trace:
>>> 	[  241.126646][   T45]  <TASK>
>>> 	[  241.127165][   T45]  ? __bio_clone+0x1c0/0x1c0
>>> 	[  241.128354][   T45]  clean_io_failure+0x21a/0x260
>>> 	[  241.128384][   T45]  end_compressed_bio_read+0x2a9/0x470
>>> 	[  241.128411][   T45]  bio_endio+0x361/0x3c0
>>> 	[  241.128427][   T45]  rbio_orig_end_io+0x127/0x1c0
>>> 	[  241.128447][   T45]  __raid_recover_end_io+0x405/0x8f0
>>> 	[  241.128477][   T45]  raid_recover_end_io_work+0x8c/0xb0
>>> 	[  241.128494][   T45]  process_one_work+0x4e5/0xaa0
>>> 	[  241.128528][   T45]  worker_thread+0x32e/0x720
>>> 	[  241.128541][   T45]  ? _raw_spin_unlock_irqrestore+0x7d/0xa0
>>> 	[  241.128573][   T45]  ? process_one_work+0xaa0/0xaa0
>>> 	[  241.128588][   T45]  kthread+0x1ab/0x1e0
>>> 	[  241.128600][   T45]  ? kthread_complete_and_exit+0x40/0x40
>>> 	[  241.128628][   T45]  ret_from_fork+0x22/0x30
>>> 	[  241.128659][   T45]  </TASK>
>>> 	[  241.128667][   T45] Modules linked in:
>>> 	[  241.129700][   T45] ---[ end trace 0000000000000000 ]---
>>> 	[  241.152310][   T45] RIP: 0010:repair_io_failure+0x359/0x4b0
>>> 	[  241.153328][   T45] Code: 2b e8 cb 12 79 ff 48 c7 c6 20 23 ac 85 48 c7 c7 00 b9 14 88 e8 d8 e3 72 ff 48 8d bd 48 ff ff ff e8 5c 7e 26 00 e9 f6 fd ff ff <0f> 0b e8 60 d1 5e 01 85 c0 74 cc 48 c
>>> 	7 c7 b0 1d 45 88 e8 d0 8e 98
>>> 	[  241.156882][   T45] RSP: 0018:ffffbca902487a08 EFLAGS: 00010246
>>> 	[  241.158103][   T45] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
>>> 	[  241.160072][   T45] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
>>> 	[  241.161984][   T45] RBP: ffffbca902487b00 R08: 0000000000000000 R09: 0000000000000000
>>> 	[  241.164067][   T45] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9cd1b9da4000
>>> 	[  241.165979][   T45] R13: 0000000000000000 R14: ffffe60cc7589740 R15: ffff9cd1f45495e4
>>> 	[  241.167928][   T45] FS:  0000000000000000(0000) GS:ffff9cd2b7600000(0000) knlGS:0000000000000000
>>> 	[  241.169978][   T45] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> 	[  241.171649][   T45] CR2: 00007fbb76b1a738 CR3: 0000000109c26001 CR4: 0000000000170ee0
>>>
>>> KFENCE and UBSAN aren't reporting anything before the BUG_ON.
>>>
>>> KCSAN complains about a lot of stuff as usual, including several issues
>>> in the btrfs allocator, but it doesn't look like anything that would
>>> mess with a bio.
>>>
>>> 	$ git log --no-walk --oneline FETCH_HEAD
>>> 	6130a25681d4 (kdave/for-next) Merge branch 'for-next-next-v5.20-20220804' into for-next-20220804
>>>
>>> 	repair_io_failure at fs/btrfs/extent_io.c:2350 (discriminator 1)
>>> 	 2345           u64 sector;
>>> 	 2346           struct btrfs_io_context *bioc = NULL;
>>> 	 2347           int ret = 0;
>>> 	 2348
>>> 	 2349           ASSERT(!(fs_info->sb->s_flags & SB_RDONLY));
>>> 	>2350<          BUG_ON(!mirror_num);
>>> 	 2351
>>> 	 2352           if (btrfs_repair_one_zone(fs_info, logical))
>>> 	 2353                   return 0;
>>> 	 2354
>>> 	 2355           map_length = length;


* Re: misc-next and for-next: kernel BUG at fs/btrfs/extent_io.c:2350! during raid5 recovery
  2022-08-09 19:46   ` Zygo Blaxell
@ 2022-08-10  7:17     ` Qu Wenruo
  2022-08-14  4:52     ` Qu Wenruo
From: Qu Wenruo @ 2022-08-10  7:17 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: linux-btrfs, Christoph Hellwig



On 2022/8/10 03:46, Zygo Blaxell wrote:
> On Tue, Aug 09, 2022 at 12:36:44PM +0800, Qu Wenruo wrote:
>>
>>
>> On 2022/8/9 11:31, Zygo Blaxell wrote:
>>> Test case is:
>>>
>>> 	- start with a -draid5 -mraid1 filesystem on 2 disks
>>>
>>> 	- run assorted IO with a mix of reads and writes (randomly
>>> 	run rsync, bees, snapshot create/delete, balance, scrub, start
>>> 	replacing one of the disks...)
>>>
>>> 	- cat /dev/zero > /dev/vdb (device 1) in the VM guest, or run
>>> 	blkdiscard on the underlying SSD in the VM host, to simulate
>>> 	single-disk data corruption
>>>
>>> 	- repeat until something goes badly wrong, like unrecoverable
>>> 	read error or crash
>>>
>>> This test case always failed quickly before (corruption was rarely
>>> if ever fully repaired on btrfs raid5 data), and it still doesn't work
>>> now, but now it doesn't work for a new reason.  Progress?
>>
>> The new read repair work for compressed extents, adding HCH to the thread.
>>
>> But just curious, have you tested without compression?
>
> All of the ~200 BUG_ON stack traces in my logs have the same list of
> functions as above.  If the bug affected uncompressed data, I'd expect
> to see two different stack traces.  It's a fairly decent sample size,
> so I'd say it's most likely not happening with uncompressed extents.
>
> All the production workloads have compression enabled, so we don't
> normally test with compression disabled.  I can run a separate test for
> that if you'd like.

Please do an uncompressed run, as we want to isolate the problem first.

Thanks,
Qu

>
>> Thanks,
>> Qu
>>>
>>> There is now a BUG_ON arising from this test case:
>>>
>>> 	[  241.051326][   T45] btrfs_print_data_csum_error: 156 callbacks suppressed
>>> 	[  241.100910][   T45] ------------[ cut here ]------------
>>> 	[  241.102531][   T45] kernel BUG at fs/btrfs/extent_io.c:2350!
>>> 	[  241.103261][   T45] invalid opcode: 0000 [#2] PREEMPT SMP PTI
>>> 	[  241.104044][   T45] CPU: 2 PID: 45 Comm: kworker/u8:4 Tainted: G      D           5.19.0-466d9d7ea677-for-next+ #85 89955463945a81b56a449b1f12383cf0d5e6b898
>>> 	[  241.105652][   T45] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
>>> 	[  241.106726][   T45] Workqueue: btrfs-endio-raid56 raid_recover_end_io_work
>>> 	[  241.107716][   T45] RIP: 0010:repair_io_failure+0x359/0x4b0
>>> 	[  241.108569][   T45] Code: 2b e8 cb 12 79 ff 48 c7 c6 20 23 ac 85 48 c7 c7 00 b9 14 88 e8 d8 e3 72 ff 48 8d bd 48 ff ff ff e8 5c 7e 26 00 e9 f6 fd ff ff <0f> 0b e8 60 d1 5e 01 85 c0 74 cc 48 c
>>> 	7 c7 b0 1d 45 88 e8 d0 8e 98
>>> 	[  241.111990][   T45] RSP: 0018:ffffbca9009f7a08 EFLAGS: 00010246
>>> 	[  241.112911][   T45] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
>>> 	[  241.115676][   T45] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
>>> 	[  241.118009][   T45] RBP: ffffbca9009f7b00 R08: 0000000000000000 R09: 0000000000000000
>>> 	[  241.119484][   T45] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9cd1b9da4000
>>> 	[  241.120717][   T45] R13: 0000000000000000 R14: ffffe60cc81a4200 R15: ffff9cd235b4dfa4
>>> 	[  241.122594][   T45] FS:  0000000000000000(0000) GS:ffff9cd2b7600000(0000) knlGS:0000000000000000
>>> 	[  241.123831][   T45] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> 	[  241.125003][   T45] CR2: 00007fbb76b1a738 CR3: 0000000109c26001 CR4: 0000000000170ee0
>>> 	[  241.126226][   T45] Call Trace:
>>> 	[  241.126646][   T45]  <TASK>
>>> 	[  241.127165][   T45]  ? __bio_clone+0x1c0/0x1c0
>>> 	[  241.128354][   T45]  clean_io_failure+0x21a/0x260
>>> 	[  241.128384][   T45]  end_compressed_bio_read+0x2a9/0x470
>>> 	[  241.128411][   T45]  bio_endio+0x361/0x3c0
>>> 	[  241.128427][   T45]  rbio_orig_end_io+0x127/0x1c0
>>> 	[  241.128447][   T45]  __raid_recover_end_io+0x405/0x8f0
>>> 	[  241.128477][   T45]  raid_recover_end_io_work+0x8c/0xb0
>>> 	[  241.128494][   T45]  process_one_work+0x4e5/0xaa0
>>> 	[  241.128528][   T45]  worker_thread+0x32e/0x720
>>> 	[  241.128541][   T45]  ? _raw_spin_unlock_irqrestore+0x7d/0xa0
>>> 	[  241.128573][   T45]  ? process_one_work+0xaa0/0xaa0
>>> 	[  241.128588][   T45]  kthread+0x1ab/0x1e0
>>> 	[  241.128600][   T45]  ? kthread_complete_and_exit+0x40/0x40
>>> 	[  241.128628][   T45]  ret_from_fork+0x22/0x30
>>> 	[  241.128659][   T45]  </TASK>
>>> 	[  241.128667][   T45] Modules linked in:
>>> 	[  241.129700][   T45] ---[ end trace 0000000000000000 ]---
>>> 	[  241.152310][   T45] RIP: 0010:repair_io_failure+0x359/0x4b0
>>> 	[  241.153328][   T45] Code: 2b e8 cb 12 79 ff 48 c7 c6 20 23 ac 85 48 c7 c7 00 b9 14 88 e8 d8 e3 72 ff 48 8d bd 48 ff ff ff e8 5c 7e 26 00 e9 f6 fd ff ff <0f> 0b e8 60 d1 5e 01 85 c0 74 cc 48 c
>>> 	7 c7 b0 1d 45 88 e8 d0 8e 98
>>> 	[  241.156882][   T45] RSP: 0018:ffffbca902487a08 EFLAGS: 00010246
>>> 	[  241.158103][   T45] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
>>> 	[  241.160072][   T45] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
>>> 	[  241.161984][   T45] RBP: ffffbca902487b00 R08: 0000000000000000 R09: 0000000000000000
>>> 	[  241.164067][   T45] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9cd1b9da4000
>>> 	[  241.165979][   T45] R13: 0000000000000000 R14: ffffe60cc7589740 R15: ffff9cd1f45495e4
>>> 	[  241.167928][   T45] FS:  0000000000000000(0000) GS:ffff9cd2b7600000(0000) knlGS:0000000000000000
>>> 	[  241.169978][   T45] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> 	[  241.171649][   T45] CR2: 00007fbb76b1a738 CR3: 0000000109c26001 CR4: 0000000000170ee0
>>>
>>> KFENCE and UBSAN aren't reporting anything before the BUG_ON.
>>>
>>> KCSAN complains about a lot of stuff as usual, including several issues
>>> in the btrfs allocator, but it doesn't look like anything that would
>>> mess with a bio.
>>>
>>> 	$ git log --no-walk --oneline FETCH_HEAD
>>> 	6130a25681d4 (kdave/for-next) Merge branch 'for-next-next-v5.20-20220804' into for-next-20220804
>>>
>>> 	repair_io_failure at fs/btrfs/extent_io.c:2350 (discriminator 1)
>>> 	 2345           u64 sector;
>>> 	 2346           struct btrfs_io_context *bioc = NULL;
>>> 	 2347           int ret = 0;
>>> 	 2348
>>> 	 2349           ASSERT(!(fs_info->sb->s_flags & SB_RDONLY));
>>> 	>2350<          BUG_ON(!mirror_num);
>>> 	 2351
>>> 	 2352           if (btrfs_repair_one_zone(fs_info, logical))
>>> 	 2353                   return 0;
>>> 	 2354
>>> 	 2355           map_length = length;


* Re: misc-next and for-next: kernel BUG at fs/btrfs/extent_io.c:2350! during raid5 recovery
  2022-08-09 21:50     ` Qu Wenruo
@ 2022-08-10  8:08       ` Goffredo Baroncelli
  2022-08-10  8:24         ` Qu Wenruo
From: Goffredo Baroncelli @ 2022-08-10  8:08 UTC (permalink / raw)
  To: Qu Wenruo, Zygo Blaxell; +Cc: linux-btrfs

On 09/08/2022 23.50, Qu Wenruo wrote:
>>> n 2022/8/9 11:31, Zygo Blaxell wrote:
>>>> Test case is:
>>>>
>>>>     - start with a -draid5 -mraid1 filesystem on 2 disks
>>>>
>>>>     - run assorted IO with a mix of reads and writes (randomly
>>>>     run rsync, bees, snapshot create/delete, balance, scrub, start
>>>>     replacing one of the disks...)
>>>>
>>>>     - cat /dev/zero > /dev/vdb (device 1) in the VM guest, or run
>>>>     blkdiscard on the underlying SSD in the VM host, to simulate
>>>>     single-disk data corruption
>>>
>>> One thing to mention is, this is going to cause destructive RMW to happen.
>>>
>>> As currently substripe write will not verify if the on-disk data stripe
>>> matches its csum.
>>>
>>> Thus if the wipeout happens while above workload is still running, it's
>>> going to corrupt data eventually.
>>
>> That would be a btrfs raid5 design bug,
> 
> That's something all RAID5 design would have the problem, not just btrfs.
> 
> Any P/Q based profile will have the problem.


Hi Qu,

I looked at your description of 'destructive RMW' cycle:

----
Test case btrfs/125 (and above workload) always has its trouble with
the destructive read-modify-write (RMW) cycle:

         0       32K     64K
Data1:  | Good  | Good  |
Data2:  | Bad   | Bad   |
Parity: | Good  | Good  |

In above case, if we trigger any write into Data1, we will use the bad
data in Data2 to re-generate parity, killing the only chance to recovery
Data2, thus Data2 is lost forever.
----

What I haven't understood is whether we have an "implementation problem" or an
intrinsic problem of raid56...

To calculate parity we need to know:
	- data1 (in ram)
	- data2 (not cached, bad on disk)

So, first, we need to "read data2", then calculate the parity, and then write data1.

The key factor is "read data", where we can face three cases:
1) the data is referenced and has a checksum: we can check against the checksum and, if the checksum doesn't match, we should perform a recovery (on the basis of the data stored on the disk)
2) the data is referenced but doesn't have a checksum (nocow): we cannot detect corruption of the data if checksums are not enabled; we can only ensure the availability of the data (which may be corrupted)
3) the data is not referenced: so the data is good.

So in effect, for case 2) the data may be corrupted and not recoverable (but this is true in any case); but for case 1), from a theoretical point of view, it seems recoverable. Of course this has a cost: you need to read the stripe and its checksums (doing a recovery if needed) before updating any part of the stripe itself, maintaining a strict order between the read and the write.
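
Roughly, the ordering I have in mind, as a hypothetical sketch (made-up helpers,
plain XOR parity on a two-data-stripe layout; this is not btrfs code):

#include <stdbool.h>
#include <stdint.h>

#define STRIPE_LEN (64 * 1024)

/* Made-up helpers standing in for csum lookup, device I/O and recovery. */
bool csum_matches(const uint8_t *data, uint64_t logical);
void read_stripe(uint8_t *buf, int stripe_nr);
void write_stripe(const uint8_t *buf, int stripe_nr);

/* Sub-stripe write into data stripe 0; stripe 1 is untouched, stripe 2 is P. */
void substripe_write(const uint8_t *new_d1, uint64_t d2_logical)
{
	static uint8_t old_d1[STRIPE_LEN], d2[STRIPE_LEN], parity[STRIPE_LEN];
	size_t i;

	read_stripe(old_d1, 0);
	read_stripe(d2, 1);
	read_stripe(parity, 2);

	if (!csum_matches(d2, d2_logical)) {
		/* Case 1): d2 is bad on disk.  Rebuild it from the old d1 and
		 * the old parity now, while they still describe the good d2. */
		for (i = 0; i < STRIPE_LEN; i++)
			d2[i] = old_d1[i] ^ parity[i];
		write_stripe(d2, 1);
	}

	/* Only now is it safe to regenerate parity for the new write;
	 * doing this before the check is the destructive RMW. */
	for (i = 0; i < STRIPE_LEN; i++)
		parity[i] = new_d1[i] ^ d2[i];

	write_stripe(new_d1, 0);
	write_stripe(parity, 2);
}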


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5



* Re: misc-next and for-next: kernel BUG at fs/btrfs/extent_io.c:2350! during raid5 recovery
  2022-08-10  8:08       ` Goffredo Baroncelli
@ 2022-08-10  8:24         ` Qu Wenruo
  2022-08-10  8:45           ` Goffredo Baroncelli
From: Qu Wenruo @ 2022-08-10  8:24 UTC (permalink / raw)
  To: kreijack, Zygo Blaxell; +Cc: linux-btrfs



On 2022/8/10 16:08, Goffredo Baroncelli wrote:
> On 09/08/2022 23.50, Qu Wenruo wrote:
>>>> n 2022/8/9 11:31, Zygo Blaxell wrote:
>>>>> Test case is:
>>>>>
>>>>>     - start with a -draid5 -mraid1 filesystem on 2 disks
>>>>>
>>>>>     - run assorted IO with a mix of reads and writes (randomly
>>>>>     run rsync, bees, snapshot create/delete, balance, scrub, start
>>>>>     replacing one of the disks...)
>>>>>
>>>>>     - cat /dev/zero > /dev/vdb (device 1) in the VM guest, or run
>>>>>     blkdiscard on the underlying SSD in the VM host, to simulate
>>>>>     single-disk data corruption
>>>>
>>>> One thing to mention is, this is going to cause destructive RMW to
>>>> happen.
>>>>
>>>> As currently substripe write will not verify if the on-disk data stripe
>>>> matches its csum.
>>>>
>>>> Thus if the wipeout happens while above workload is still running, it's
>>>> going to corrupt data eventually.
>>>
>>> That would be a btrfs raid5 design bug,
>>
>> That's something all RAID5 design would have the problem, not just btrfs.
>>
>> Any P/Q based profile will have the problem.
>
>
> Hi Qu,
>
> I looked at your description of 'destructive RMW' cyle:
>
> ----
> Test case btrfs/125 (and above workload) always has its trouble with
> the destructive read-modify-write (RMW) cycle:
>
>          0       32K     64K
> Data1:  | Good  | Good  |
> Data2:  | Bad   | Bad   |
> Parity: | Good  | Good  |
>
> In above case, if we trigger any write into Data1, we will use the bad
> data in Data2 to re-generate parity, killing the only chance to recovery
> Data2, thus Data2 is lost forever.
> ----
>
> What I don't understood if we have a "implementation problem" or an
> intrinsic problem of raid56...
>
> To calculate parity we need to know:
>      - data1 (in ram)
>      - data2 (not cached, bad on disk)
>
> So, first, we need to "read data2" then to calculate the parity and then
> to write data1.

First things first: the idea of RAID5/6 itself doesn't cover how to verify
whether the data is correct or not.

Btrfs is better off because it has the extra csum ability.

Thus for the bad on-disk data case, it's even worse for all the other
RAID5/6 implementations, which don't have csums for their data.

>
> The key factor is "read data", where we can face three cases:
> 1) the data is referenced and has a checksum: we can check against the
> checksum and if the checksum doesn't match we should perform a recover
> (on the basis of the data stored on the disk)

Then let's talk about the detailed problems in the btrfs-specific areas.

Yes, it's possible that we can try to check the csums when doing RMW for
sub-stripe writes.

But there are problems:

=== csum tree race ===

1) Data writes (with csum) into some data 1 stripes finished

2) By somehow the just written data is corrupted

3) RAID56 RMW triggered for write into data 2

4) We search the csum tree and find nothing, as the csum insert hasn't
    happened yet

5) Csum insert happens for data write in 1)

Then we still have the destructive RMW.
Yes, we can argue that 2) is so rare that we shouldn't really bother, but
in theory this can still happen, all just because the csum tree search
and csum insert have a race window.
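
To illustrate why that window matters, here is a purely sequential
user-space sketch of steps 1) to 5) above (one byte per block, a fake
"committed csum" flag instead of a real csum tree; nothing here is btrfs
code).  Because the csum of data 1 is not committed yet at RMW time, the
RMW regenerates parity from the corrupted block, and a later rebuild can
only reproduce the corruption:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

static uint8_t csum(uint8_t b) { return b ^ 0x5a; }	/* toy checksum */

int main(void)
{
	uint8_t data1, data2, parity;
	uint8_t committed_csum1 = 0;	/* stand-in for the csum tree item */
	bool csum1_committed = false;	/* the insert is still pending */

	/* 1) data write into data 1 finished (csum insert not done yet) */
	data1 = 0x11;
	data2 = 0x22;
	parity = data1 ^ data2;

	/* 2) the just-written data 1 gets corrupted on disk */
	data1 = 0x99;

	/* 3) + 4) RMW for a write into data 2: we search the "csum tree",
	 * find nothing for data 1, so we cannot verify it and just use it */
	if (csum1_committed && csum(data1) != committed_csum1)
		printf("would recover data 1 here\n");	/* never reached */
	data2 = 0x33;			/* the new write */
	parity = data1 ^ data2;		/* parity now built from garbage */

	/* 5) csum insert for the write in 1) finally happens */
	committed_csum1 = csum(0x11);
	csum1_committed = true;

	/* a later read of data 1 fails its csum and rebuilds from parity */
	uint8_t rebuilt = parity ^ data2;
	printf("rebuilt data 1 = 0x%02x (expected 0x11): %s\n", rebuilt,
	       csum(rebuilt) == committed_csum1 ? "ok" : "still corrupted");
	return 0;
}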

=== performance ===

This means that every time we do an RMW, we have to do a full csum tree and
extent tree search.
And this also means we have to read the full stripe, including the range
we're going to write data into.

AKA, this is full scrub level check.

This is definitely not a small cost, and it will further slow down
RAID56 writes.

Thus even if we're going to implement such a scrub-level check, I'm not
going to make it the default option, but rather make it opt-in.

That said, since I've recently been working on RAID56, I believe it's
possible to implement, unless there is some possible deadlock.

Maybe in the near future we may see some prototypes to try to address
this problem.

Thanks,
Qu


> 2) the data is referenced but doesn't have a checksum (nocow): we cannot
> ensure the corruption of the data if checksum is not enabled. We can
> only ensure the availability of the data (which may be corrupted)
> 3) the data is not referenced: so the data is good.
>
> So in effect for the case 2) the data may be corrupted and not
> recoverable (but this is true in any case); but for the case 1) from a
> theoretical point of view it seems recoverable. Of course this has a
> cost: you need to read the stripe and their checksum (doing a recovery
> if needed) before updating any part of the stripe itself, maintaining a
> strict order between the read and the writing.
>
>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: misc-next and for-next: kernel BUG at fs/btrfs/extent_io.c:2350! during raid5 recovery
  2022-08-10  8:24         ` Qu Wenruo
@ 2022-08-10  8:45           ` Goffredo Baroncelli
  2022-08-10  9:14             ` Qu Wenruo
  0 siblings, 1 reply; 19+ messages in thread
From: Goffredo Baroncelli @ 2022-08-10  8:45 UTC (permalink / raw)
  To: Qu Wenruo, Zygo Blaxell; +Cc: linux-btrfs

On 10/08/2022 10.24, Qu Wenruo wrote:
> 
> 
> On 2022/8/10 16:08, Goffredo Baroncelli wrote:
>> On 09/08/2022 23.50, Qu Wenruo wrote:
>>>>> On 2022/8/9 11:31, Zygo Blaxell wrote:
>>>>>> Test case is:
>>>>>>
>>>>>>     - start with a -draid5 -mraid1 filesystem on 2 disks
>>>>>>
>>>>>>     - run assorted IO with a mix of reads and writes (randomly
>>>>>>     run rsync, bees, snapshot create/delete, balance, scrub, start
>>>>>>     replacing one of the disks...)
>>>>>>
>>>>>>     - cat /dev/zero > /dev/vdb (device 1) in the VM guest, or run
>>>>>>     blkdiscard on the underlying SSD in the VM host, to simulate
>>>>>>     single-disk data corruption
>>>>>
>>>>> One thing to mention is, this is going to cause destructive RMW to
>>>>> happen.
>>>>>
>>>>> As currently substripe write will not verify if the on-disk data stripe
>>>>> matches its csum.
>>>>>
>>>>> Thus if the wipeout happens while above workload is still running, it's
>>>>> going to corrupt data eventually.
>>>>
>>>> That would be a btrfs raid5 design bug,
>>>
>>> That's something all RAID5 design would have the problem, not just btrfs.
>>>
>>> Any P/Q based profile will have the problem.
>>
>>
>> Hi Qu,
>>
>> I looked at your description of 'destructive RMW' cycle:
>>
>> ----
>> Test case btrfs/125 (and above workload) always has its trouble with
>> the destructive read-modify-write (RMW) cycle:
>>
>>          0       32K     64K
>> Data1:  | Good  | Good  |
>> Data2:  | Bad   | Bad   |
>> Parity: | Good  | Good  |
>>
>> In above case, if we trigger any write into Data1, we will use the bad
>> data in Data2 to re-generate parity, killing the only chance to recovery
>> Data2, thus Data2 is lost forever.
>> ----
>>
>> What I don't understood if we have a "implementation problem" or an
>> intrinsic problem of raid56...
>>
>> To calculate parity we need to know:
>>      - data1 (in ram)
>>      - data2 (not cached, bad on disk)
>>
>> So, first, we need to "read data2" then to calculate the parity and then
>> to write data1.
> 
> First thing first, the idea of RAID5/6 itself doesn't involve how to
> verify the data is correct or not.
> 
> Btrfs is better because it has extra csum ability.
> 
> Thus for bad on-disk data case, it's even worse for all the other
> RAID5/6 implementations which don't have csum for its data.
> 
>>
>> The key factor is "read data", where we can face three cases:
>> 1) the data is referenced and has a checksum: we can check against the
>> checksum and if the checksum doesn't match we should perform a recover
>> (on the basis of the data stored on the disk)
> 
> Then let's talk about the detail problems in the btrfs specific areas.
> 
> Yes, it's possible that we can try to check the csums when doing RMW for
> sub-stripe writes.
> 
> But there are problems:
> 
> === csum tree race ===
> 
> 1) Data writes (with csum) into some data 1 stripes finished
> 
> 2) By somehow the just written data is corrupted
> 
> 3) RAID56 RMW triggered for write into data 2
> 
> 4) We search csum tree, and found nothing, as the csum insert hasn't
>     happen yet
> 
This is the point. Theoretically this is doable, but it would require a "strict"
order not only inside the stripe but also between all the data and metadata.
This would have an impact on performance; however, I think that is not
the biggest problem. The biggest problem is the increase in code complexity.

It seems that the ZFS approach (parity inside the extent) should not have this problem.


> 5) Csum insert happens for data write in 1)
> 
> Then we still have the destructive RMW.
> Yes we can argue that 2) is so rare that we shouldn't really bother, but
> in theory this can still happen, all just because the csum tree search
> and csum insert has race window.
> 
> === performance ===
> 
> This means, every time we do a RMW, we have to do a full csum tree and
> extent tree search.
> And this also mean, we have to read the full stripe, including the range
> we're going to write data into.

Likely we need to read the checksums in any case, because we need to update
the page that hosts them for the first write.
Also we need to read the full stripe to compute the parity; only for
raid5 can we shortcut it by reading just the old data and the old parity:

	new_parity = old_parity ^ old_data ^ new_data
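
For what it's worth, a tiny user-space check (one byte standing in for a
whole block, arbitrary values, nothing btrfs-specific) that the shortcut
gives the same parity as recomputing over the full stripe:

#include <assert.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint8_t d0 = 0xa5, d1 = 0x3c, d2 = 0x0f;	/* three data "blocks" */
	uint8_t old_parity = d0 ^ d1 ^ d2;		/* full-stripe parity */

	uint8_t new_d1 = 0x77;				/* sub-stripe RMW of d1 only */
	uint8_t shortcut = old_parity ^ d1 ^ new_d1;	/* old_parity ^ old_data ^ new_data */
	uint8_t full = d0 ^ new_d1 ^ d2;		/* recompute from the whole stripe */

	assert(shortcut == full);
	printf("shortcut parity 0x%02x == full recompute 0x%02x\n", shortcut, full);
	return 0;
}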

> 
> AKA, this is full scrub level check.
It is the usual cost that we need to pay when we read data from the disk.

> 
> This is definitely not a small cost, which will further slowdown the
> RAID56 write.
> 
> Thus even if we're going to implement such scrub level check, I'm not
> going to make it the default option, but make it opt-in.
> 
> Although recently since I'm working on RAID56, unless there are some
> possible dead-lock, I believe it's possible to implement.
> 
> Maybe in the near future we may see some prototypes to try to address
> this problem.
> 
> Thanks,
> Qu
> 
> 
>> 2) the data is referenced but doesn't have a checksum (nocow): we cannot
>> ensure the corruption of the data if checksum is not enabled. We can
>> only ensure the availability of the data (which may be corrupted)
>> 3) the data is not referenced: so the data is good.
>>
>> So in effect for the case 2) the data may be corrupted and not
>> recoverable (but this is true in any case); but for the case 1) from a
>> theoretical point of view it seems recoverable. Of course this has a
>> cost: you need to read the stripe and their checksum (doing a recovery
>> if needed) before updating any part of the stripe itself, maintaining a
>> strict order between the read and the writing.
>>
>>

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: misc-next and for-next: kernel BUG at fs/btrfs/extent_io.c:2350! during raid5 recovery
  2022-08-10  8:45           ` Goffredo Baroncelli
@ 2022-08-10  9:14             ` Qu Wenruo
  0 siblings, 0 replies; 19+ messages in thread
From: Qu Wenruo @ 2022-08-10  9:14 UTC (permalink / raw)
  To: kreijack, Zygo Blaxell; +Cc: linux-btrfs



On 2022/8/10 16:45, Goffredo Baroncelli wrote:
[...]
>>
>>>
>>> The key factor is "read data", where we can face three cases:
>>> 1) the data is referenced and has a checksum: we can check against the
>>> checksum and if the checksum doesn't match we should perform a recover
>>> (on the basis of the data stored on the disk)
>>
>> Then let's talk about the detail problems in the btrfs specific areas.
>>
>> Yes, it's possible that we can try to check the csums when doing RMW for
>> sub-stripe writes.
>>
>> But there are problems:
>>
>> === csum tree race ===
>>
>> 1) Data writes (with csum) into some data 1 stripes finished
>>
>> 2) By somehow the just written data is corrupted
>>
>> 3) RAID56 RMW triggered for write into data 2
>>
>> 4) We search csum tree, and found nothing, as the csum insert hasn't
>>     happen yet
>>
> This is the point. Theoretically this is doable, but this would require
> a "strict"
> order not only inside the stripe but also between all the data and
> metadata.
> This would have an impact on the performance; however I think that this
> is not
> the highest problem. The highest problem is increase of the complexity
> of the code.
>
> It seems that the ZFS approach (parity inside the extent) should not
> have this problem.
>
>
>> 5) Csum insert happens for data write in 1)
>>
>> Then we still have the destructive RMW.
>> Yes we can argue that 2) is so rare that we shouldn't really bother, but
>> in theory this can still happen, all just because the csum tree search
>> and csum insert has race window.
>>
>> === performance ===
>>
>> This means, every time we do a RMW, we have to do a full csum tree and
>> extent tree search.
>> And this also mean, we have to read the full stripe, including the range
>> we're going to write data into.
>
> Likely we need to read the checksums in any case, because we need to update
> the page that hosts it for the first write.

Checksum reads and full stripe reads are completely different things.

A checksum read involves several tree block reads by itself, and
the timing is very tricky.

For a normal data read, we search the csum tree beforehand, but for this
RAID56 RMW, we must search the csum tree at the very last stage before we
submit the write.

I'm not 100% confident if the timing is fine.

And for a normal write, even with RAID56 RMW, we never trigger reads of the
csum tree; we just read the full stripe at most.

Thus even though the code change may not be that complex, the timing change
can be very tricky.

> Also we need to read the full stripe to compute the parity; only for
> raid5 can we shortcut it by reading just the old data and the old parity:
>
>      new_parity = old_parity ^ old_data ^ new_data
>
>>
>> AKA, this is full scrub level check.
> Is the usual cost that we need to pay when we read a data from the disk.

Not the same. Scrub has extra setup, like marking the whole block group
RO, to avoid a lot of races.

For a regular file read, we don't need to worry about races for the range we
want to read; we have the page locked and no race can happen.

But this is no longer true for RAID56; at the RAID56 level, we have nothing
like the page cache to help us.

Thanks,
Qu

>
>>
>> This is definitely not a small cost, which will further slowdown the
>> RAID56 write.
>>
>> Thus even if we're going to implement such scrub level check, I'm not
>> going to make it the default option, but make it opt-in.
>>
>> Although recently since I'm working on RAID56, unless there are some
>> possible dead-lock, I believe it's possible to implement.
>>
>> Maybe in the near future we may see some prototypes to try to address
>> this problem.
>>
>> Thanks,
>> Qu
>>
>>
>>> 2) the data is referenced but doesn't have a checksum (nocow): we cannot
>>> ensure the corruption of the data if checksum is not enabled. We can
>>> only ensure the availability of the data (which may be corrupted)
>>> 3) the data is not referenced: so the data is good.
>>>
>>> So in effect for the case 2) the data may be corrupted and not
>>> recoverable (but this is true in any case); but for the case 1) from a
>>> theoretical point of view it seems recoverable. Of course this has a
>>> cost: you need to read the stripe and their checksum (doing a recovery
>>> if needed) before updating any part of the stripe itself, maintaining a
>>> strict order between the read and the writing.
>>>
>>>
>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: misc-next and for-next: kernel BUG at fs/btrfs/extent_io.c:2350! during raid5 recovery
  2022-08-09 19:24   ` Zygo Blaxell
@ 2022-08-12  2:58     ` Wang Yugui
  2022-08-12 22:47       ` Wang Yugui
  2022-08-13  1:50     ` Qu Wenruo
  1 sibling, 1 reply; 19+ messages in thread
From: Wang Yugui @ 2022-08-12  2:58 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: Christoph Hellwig, linux-btrfs

Hi,

With a modified script based on this reproducer,
'uncorrectable errors' are easy to reproduce.

kernel:
	5.20.0(2022/08/09) + some patches in misc-next
	0*-btrfs-scrub-properly-report-super-block-errors-in-sy.patch
	0*-btrfs-scrub-try-to-fix-super-block-errors.patch
	0*-btrfs-fix-lost-error-handling-when-looking-up-extend.patch

script:
	source ~/xfstests/local.config

	SCRATCH_DEV_ARRAY=($SCRATCH_DEV_POOL)
	umount $SCRATCH_MNT
	set -e
	uname -a

	mkfs.btrfs -f -draid5 -mraid1 ${SCRATCH_DEV_ARRAY[@]}
	mount ${SCRATCH_DEV_ARRAY[0]} $SCRATCH_MNT # -o compress=zstd,noatime
	mkdir -p $SCRATCH_MNT/dir1

	/bin/cp -a /usr/hpc-bio $SCRATCH_MNT/dir1/
	sync
	du -sh $SCRATCH_MNT
	cd $SCRATCH_MNT
	while true; do
		find -type f -exec cat {} + > /dev/null
	done &

	for((i=0;i>=0;++i)); do

		sync; sleep 2; sync; sleep 4; sync; sleep 20; # change the device to discard in every loop
		j=$(( i % ${#SCRATCH_DEV_ARRAY[@]} ))
		/usr/sbin/blkdiscard -f ${SCRATCH_DEV_ARRAY[$j]} >/dev/null 2>&1

		btrfs scrub start -Bd $SCRATCH_MNT | grep 'summary\|Uncorrectable'
	done &
	wait

Result:
	'uncorrectable errors' is reported in 2nd loop.

Is this a problem with the test script, or with the btrfs kernel?

Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2022/08/12


> > Any chance you could share a script for your reproducer?
> 
> The simplest reproducer is some variant of:
> 
> 	mkfs.btrfs -draid5 -mraid1 /dev/vdb /dev/vdc /dev/vdd
> 	mount /dev/vdb /mnt -ocompress=zstd,noatime
> 	cd /mnt
> 	cp -a /40gb-test-data .
> 	sync
> 	while true; do
> 		find -type f -exec cat {} + > /dev/null
> 	done &
> 	while true; do
> 		cat /dev/zero > /dev/vdb
> 	done &
> 	while true; do
> 		btrfs scrub start -Bd /mnt
> 	done &
> 	wait
> 
> but it can take a long time to hit a failure with something that gentle.
> I throw on some extra test workload (e.g. lots of rsyncs) to keep the
> page cache full and under memory pressure, which seems to speed up the
> failure rate to once every few hours.





^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: misc-next and for-next: kernel BUG at fs/btrfs/extent_io.c:2350! during raid5 recovery
  2022-08-12  2:58     ` Wang Yugui
@ 2022-08-12 22:47       ` Wang Yugui
  0 siblings, 0 replies; 19+ messages in thread
From: Wang Yugui @ 2022-08-12 22:47 UTC (permalink / raw)
  To: Zygo Blaxell, Christoph Hellwig, linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 2382 bytes --]

Hi,

If we add 'echo 3 > /proc/sys/vm/drop_caches; free -h' before
'blkdiscard', the 'uncorrectable errors' will happen in the 1st loop.

Please see the attached file for the v2 script.

Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2022/08/13

> Hi,
> 
> With a modified script based on this reproducer,
> 'uncorrectable errors' is easy to reproduce
> 
> kernel:
> 	5.20.0(2022/08/09) + some patches in misc-next
> 	0*-btrfs-scrub-properly-report-super-block-errors-in-sy.patch
> 	0*-btrfs-scrub-try-to-fix-super-block-errors.patch
> 	0*-btrfs-fix-lost-error-handling-when-looking-up-extend.patch
> 
> script:
> 	source ~/xfstests/local.config
> 
> 	SCRATCH_DEV_ARRAY=($SCRATCH_DEV_POOL)
> 	umount $SCRATCH_MNT
> 	set -e
> 	uname -a
> 
> 	mkfs.btrfs -f -draid5 -mraid1 ${SCRATCH_DEV_ARRAY[@]}
> 	mount ${SCRATCH_DEV_ARRAY[0]} $SCRATCH_MNT # -o
> 	compress=zstd,noatime
> 	mkdir -p $SCRATCH_MNT/dir1
> 
> 	/bin/cp -a /usr/hpc-bio $SCRATCH_MNT/dir1/
> 	sync
> 	du -sh $SCRATCH_MNT
> 	cd $SCRATCH_MNT
> 	while true; do
> 		find -type f -exec cat {} + > /dev/null
> 	done &
> 
> 	for((i=0;i>=0;++i)); do
> 
> 		sync; sleep 2; sync; sleep 4; sync; sleep 20; # change the
> 	device to discard in every loop
> 		j=$(( i % ${#SCRATCH_DEV_ARRAY[@]} ))
> 		/usr/sbin/blkdiscard -f ${SCRATCH_DEV_ARRAY[$j]} >/dev/null 2>&1
> 
> 		btrfs scrub start -Bd $SCRATCH_MNT | grep
> 'summary\|Uncorrectable'
> done &
> wait
> 
> Result:
> 	'uncorrectable errors' is reported in 2nd loop.
> 
> Is this a problem of this test script or btrfs kernel?
> 
> Best Regards
> Wang Yugui (wangyugui@e16-tech.com)
> 2022/08/12
> 
> 
> > > Any chance you could share a script for your reproducer?
> > 
> > The simplest reproducer is some variant of:
> > 
> > 	mkfs.btrfs -draid5 -mraid1 /dev/vdb /dev/vdc /dev/vdd
> > 	mount /dev/vdb /mnt -ocompress=zstd,noatime
> > 	cd /mnt
> > 	cp -a /40gb-test-data .
> > 	sync
> > 	while true; do
> > 		find -type f -exec cat {} + > /dev/null
> > 	done &
> > 	while true; do
> > 		cat /dev/zero > /dev/vdb
> > 	done &
> > 	while true; do
> > 		btrfs scrub start -Bd /mnt
> > 	done &
> > 	wait
> > 
> > but it can take a long time to hit a failure with something that gentle.
> > I throw on some extra test workload (e.g. lots of rsyncs) to keep the
> > page cache full and under memory pressure, which seems to speed up the
> > failure rate to once every few hours.
> 
> 
> 


[-- Attachment #2: raid5.disk.fail-v2.sh --]
[-- Type: application/octet-stream, Size: 788 bytes --]

#!/bin/bash
set -ux -o pipefail

source ~/xfstests/local.config

SCRATCH_DEV_ARRAY=($SCRATCH_DEV_POOL)
umount $SCRATCH_MNT
set -e
uname -a

mkfs.btrfs -f -draid5 -mraid1 ${SCRATCH_DEV_ARRAY[@]}
mount ${SCRATCH_DEV_ARRAY[0]} $SCRATCH_MNT # -o compress=zstd,noatime
mkdir -p $SCRATCH_MNT/dir1

/bin/cp -a /usr/hpc-bio $SCRATCH_MNT/dir1/
sync
du -sh $SCRATCH_MNT
cd $SCRATCH_MNT
while true; do
	find -type f -exec cat {} + > /dev/null
done &

for((i=1;i>=0;++i)); do

	sync; sleep 2; sync; sleep 4; sync; sleep 20; # change the device to discard in every loop
	echo 3 > /proc/sys/vm/drop_caches; free -h
	j=$(( i % ${#SCRATCH_DEV_ARRAY[@]} ))
	/usr/sbin/blkdiscard -f ${SCRATCH_DEV_ARRAY[$j]} >/dev/null 2>&1

	btrfs scrub start -Bd $SCRATCH_MNT | grep 'summary\|Uncorrectable'
done &
wait


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: misc-next and for-next: kernel BUG at fs/btrfs/extent_io.c:2350! during raid5 recovery
  2022-08-09 19:24   ` Zygo Blaxell
  2022-08-12  2:58     ` Wang Yugui
@ 2022-08-13  1:50     ` Qu Wenruo
  1 sibling, 0 replies; 19+ messages in thread
From: Qu Wenruo @ 2022-08-13  1:50 UTC (permalink / raw)
  To: Zygo Blaxell, Christoph Hellwig; +Cc: linux-btrfs



On 2022/8/10 03:24, Zygo Blaxell wrote:
> On Tue, Aug 09, 2022 at 01:29:59AM -0700, Christoph Hellwig wrote:
>> On Mon, Aug 08, 2022 at 11:31:51PM -0400, Zygo Blaxell wrote:
>>> There is now a BUG_ON arising from this test case:
>>>
>>> 	[  241.051326][   T45] btrfs_print_data_csum_error: 156 callbacks suppressed
>>> 	[  241.100910][   T45] ------------[ cut here ]------------
>>> 	[  241.102531][   T45] kernel BUG at fs/btrfs/extent_io.c:2350!
>>
>> This
>>
>>          BUG_ON(!mirror_num);
>>
>> so repair_io_failure gets called with a mirror_num of 0..
>>
>>> 	[  241.128354][   T45]  clean_io_failure+0x21a/0x260
>>
>> .. from clean_io_failure.  Which starts from failrec->this_mirror and
>> tries to go back to failrec->failed_mirror using the prev_mirror
>> helper.  prev_mirror looks like:
>>
>> static int prev_mirror(const struct io_failure_record *failrec, int cur_mirror)
>> {
>>          if (cur_mirror == 1)
>> 		return failrec->num_copies;
>> 	return cur_mirror - 1;
>> }
>>
>> So the only way we could end up with a mirror = 0 is if
>> failrec->num_copies is 0.  -failrec->num_copies is initialized
>> in btrfs_get_io_failure_record by doing:
>>
>>          failrec->num_copies = btrfs_num_copies(fs_info, failrec->logical, sectorsize);
>>
>> just adter allocating the failrec.  I can't see any obvious way how
>> btrfs_num_copies would return 0, though, as for raid5 it just copies
>> from btrfs_raid_array.
>
> Judging from prior raid5 testing behavior, it looks like there's a race
> condition specific to btrfs raid5 IO.  Previous kernel versions have had
> assorted UAF bugs from time to time that KASAN tripped over, and btrfs
> replace almost never works on the first try on a real raid5 array.
> These issues were reported years ago, but nobody seems to have been
> working on them until recently.
>
> A similar test setup previously produced data corruption during raid5
> recovery, even on an otherwise idle filesystem, at a low rate (~1
> error per 100 GB).  I expect whatever bug was leading to that hasn't
> been entirely fixed yet.
>
>> Any chance you could share a script for your reproducer?
>
> The simplest reproducer is some variant of:
>
> 	mkfs.btrfs -draid5 -mraid1 /dev/vdb /dev/vdc /dev/vdd
> 	mount /dev/vdb /mnt -ocompress=zstd,noatime
> 	cd /mnt
> 	cp -a /40gb-test-data .
> 	sync
> 	while true; do
> 		find -type f -exec cat {} + > /dev/null
> 	done &
> 	while true; do
> 		cat /dev/zero > /dev/vdb
> 	done &
> 	while true; do
> 		btrfs scrub start -Bd /mnt
> 	done &
> 	wait
>
> but it can take a long time to hit a failure with something that gentle.
> I throw on some extra test workload (e.g. lots of rsyncs) to keep the
> page cache full and under memory pressure, which seems to speed up the
> failure rate to once every few hours.

I got it reproduced; although it's a different crash, this is the most minimal
workload so far, and it doesn't even need the race of corrupting the
disk on the fly:

         mkfs.btrfs -f -d raid5 -m raid5 $dev1 $dev2 $dev3 -b 1G > /dev/null
         mount $dev1 $mnt
         $fsstress -w -d $mnt -n 100 -s 1660337237
         sync
         $fssum -A -f -w /tmp/fssum.saved $mnt
         umount $mnt

         xfs_io -c "pwrite -S 0x0 1m 1023m" $dev1
         mount $dev1 $mnt
         $fssum -r /tmp/fssum.saved $mnt > /dev/null # <<< CRASH here
         umount $mnt

The point here is that, before another BUG_ON() triggered, we got the
following "unable to find logical" messages:

[  173.089275] BTRFS critical (device dm-1): unable to find logical 18446744073709551613 length 4096
[  173.089825] BTRFS critical (device dm-1): unable to find logical 4093 length 4096
[  173.090224] BTRFS critical (device dm-1): unable to find logical 18446744073709551613 length 4096
[  173.090657] BTRFS critical (device dm-1): unable to find logical 4093 length 4096
[  173.091451] assertion failed: failrec->this_mirror == bbio->mirror_num, in fs/btrfs/extent_io.c:2566

I have even pinned down the offending commit: "btrfs: pass a btrfs_bio
to btrfs_repair_one_sector". (not RAID56 though)

The "unable to find logical" is caused by the fact that, we're repairing
HOLE extent, which should not happen at all.

Unfortunately I cannot find out why that seemingly harmless refactor is
causing the problem.
My initial guess is that bbio->file_offset is incorrect; I need to do
more debugging with the above minimal script to find out why.

Thanks,
Qu

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: misc-next and for-next: kernel BUG at fs/btrfs/extent_io.c:2350! during raid5 recovery
  2022-08-09 19:46   ` Zygo Blaxell
  2022-08-10  7:17     ` Qu Wenruo
@ 2022-08-14  4:52     ` Qu Wenruo
  2022-08-16  1:01       ` Zygo Blaxell
  1 sibling, 1 reply; 19+ messages in thread
From: Qu Wenruo @ 2022-08-14  4:52 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: linux-btrfs, Christoph Hellwig

Hi Zygo,

We have pinned down the root cause of the crash, and got a quick fix for it.

https://lore.kernel.org/linux-btrfs/1d9b69af6ce0a79e54fbaafcc65ead8f71b54b60.1660377678.git.wqu@suse.com/

Would you mind testing the above patch to see if it solves your crash?
(Don't expect this to improve RAID56 recovery though; the bug itself
is not in RAID56.)

For the RAID56 recovery, unfortunately we don't have any better way to
enhance it during writes yet.

So your tests will still lead to data corruption anyway.

Thanks,
Qu

On 2022/8/10 03:46, Zygo Blaxell wrote:
> On Tue, Aug 09, 2022 at 12:36:44PM +0800, Qu Wenruo wrote:
>>
>>
>> On 2022/8/9 11:31, Zygo Blaxell wrote:
>>> Test case is:
>>>
>>> 	- start with a -draid5 -mraid1 filesystem on 2 disks
>>>
>>> 	- run assorted IO with a mix of reads and writes (randomly
>>> 	run rsync, bees, snapshot create/delete, balance, scrub, start
>>> 	replacing one of the disks...)
>>>
>>> 	- cat /dev/zero > /dev/vdb (device 1) in the VM guest, or run
>>> 	blkdiscard on the underlying SSD in the VM host, to simulate
>>> 	single-disk data corruption
>>>
>>> 	- repeat until something goes badly wrong, like unrecoverable
>>> 	read error or crash
>>>
>>> This test case always failed quickly before (corruption was rarely
>>> if ever fully repaired on btrfs raid5 data), and it still doesn't work
>>> now, but now it doesn't work for a new reason.  Progress?
>>
>> The new read repair work for compressed extents, adding HCH to the thread.
>>
>> But just curious, have you tested without compression?
>
> All of the ~200 BUG_ON stack traces in my logs have the same list of
> functions as above.  If the bug affected uncompressed data, I'd expect
> to see two different stack traces.  It's a fairly decent sample size,
> so I'd say it's most likely not happening with uncompressed extents.
>
> All the production workloads have compression enabled, so we don't
> normally test with compression disabled.  I can run a separate test for
> that if you'd like.
>
>> Thanks,
>> Qu
>>>
>>> There is now a BUG_ON arising from this test case:
>>>
>>> 	[  241.051326][   T45] btrfs_print_data_csum_error: 156 callbacks suppressed
>>> 	[  241.100910][   T45] ------------[ cut here ]------------
>>> 	[  241.102531][   T45] kernel BUG at fs/btrfs/extent_io.c:2350!
>>> 	[  241.103261][   T45] invalid opcode: 0000 [#2] PREEMPT SMP PTI
>>> 	[  241.104044][   T45] CPU: 2 PID: 45 Comm: kworker/u8:4 Tainted: G      D           5.19.0-466d9d7ea677-for-next+ #85 89955463945a81b56a449b1f12383cf0d5e6b898
>>> 	[  241.105652][   T45] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
>>> 	[  241.106726][   T45] Workqueue: btrfs-endio-raid56 raid_recover_end_io_work
>>> 	[  241.107716][   T45] RIP: 0010:repair_io_failure+0x359/0x4b0
>>> 	[  241.108569][   T45] Code: 2b e8 cb 12 79 ff 48 c7 c6 20 23 ac 85 48 c7 c7 00 b9 14 88 e8 d8 e3 72 ff 48 8d bd 48 ff ff ff e8 5c 7e 26 00 e9 f6 fd ff ff <0f> 0b e8 60 d1 5e 01 85 c0 74 cc 48 c
>>> 	7 c7 b0 1d 45 88 e8 d0 8e 98
>>> 	[  241.111990][   T45] RSP: 0018:ffffbca9009f7a08 EFLAGS: 00010246
>>> 	[  241.112911][   T45] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
>>> 	[  241.115676][   T45] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
>>> 	[  241.118009][   T45] RBP: ffffbca9009f7b00 R08: 0000000000000000 R09: 0000000000000000
>>> 	[  241.119484][   T45] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9cd1b9da4000
>>> 	[  241.120717][   T45] R13: 0000000000000000 R14: ffffe60cc81a4200 R15: ffff9cd235b4dfa4
>>> 	[  241.122594][   T45] FS:  0000000000000000(0000) GS:ffff9cd2b7600000(0000) knlGS:0000000000000000
>>> 	[  241.123831][   T45] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> 	[  241.125003][   T45] CR2: 00007fbb76b1a738 CR3: 0000000109c26001 CR4: 0000000000170ee0
>>> 	[  241.126226][   T45] Call Trace:
>>> 	[  241.126646][   T45]  <TASK>
>>> 	[  241.127165][   T45]  ? __bio_clone+0x1c0/0x1c0
>>> 	[  241.128354][   T45]  clean_io_failure+0x21a/0x260
>>> 	[  241.128384][   T45]  end_compressed_bio_read+0x2a9/0x470
>>> 	[  241.128411][   T45]  bio_endio+0x361/0x3c0
>>> 	[  241.128427][   T45]  rbio_orig_end_io+0x127/0x1c0
>>> 	[  241.128447][   T45]  __raid_recover_end_io+0x405/0x8f0
>>> 	[  241.128477][   T45]  raid_recover_end_io_work+0x8c/0xb0
>>> 	[  241.128494][   T45]  process_one_work+0x4e5/0xaa0
>>> 	[  241.128528][   T45]  worker_thread+0x32e/0x720
>>> 	[  241.128541][   T45]  ? _raw_spin_unlock_irqrestore+0x7d/0xa0
>>> 	[  241.128573][   T45]  ? process_one_work+0xaa0/0xaa0
>>> 	[  241.128588][   T45]  kthread+0x1ab/0x1e0
>>> 	[  241.128600][   T45]  ? kthread_complete_and_exit+0x40/0x40
>>> 	[  241.128628][   T45]  ret_from_fork+0x22/0x30
>>> 	[  241.128659][   T45]  </TASK>
>>> 	[  241.128667][   T45] Modules linked in:
>>> 	[  241.129700][   T45] ---[ end trace 0000000000000000 ]---
>>> 	[  241.152310][   T45] RIP: 0010:repair_io_failure+0x359/0x4b0
>>> 	[  241.153328][   T45] Code: 2b e8 cb 12 79 ff 48 c7 c6 20 23 ac 85 48 c7 c7 00 b9 14 88 e8 d8 e3 72 ff 48 8d bd 48 ff ff ff e8 5c 7e 26 00 e9 f6 fd ff ff <0f> 0b e8 60 d1 5e 01 85 c0 74 cc 48 c
>>> 	7 c7 b0 1d 45 88 e8 d0 8e 98
>>> 	[  241.156882][   T45] RSP: 0018:ffffbca902487a08 EFLAGS: 00010246
>>> 	[  241.158103][   T45] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
>>> 	[  241.160072][   T45] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
>>> 	[  241.161984][   T45] RBP: ffffbca902487b00 R08: 0000000000000000 R09: 0000000000000000
>>> 	[  241.164067][   T45] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9cd1b9da4000
>>> 	[  241.165979][   T45] R13: 0000000000000000 R14: ffffe60cc7589740 R15: ffff9cd1f45495e4
>>> 	[  241.167928][   T45] FS:  0000000000000000(0000) GS:ffff9cd2b7600000(0000) knlGS:0000000000000000
>>> 	[  241.169978][   T45] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> 	[  241.171649][   T45] CR2: 00007fbb76b1a738 CR3: 0000000109c26001 CR4: 0000000000170ee0
>>>
>>> KFENCE and UBSAN aren't reporting anything before the BUG_ON.
>>>
>>> KCSAN complains about a lot of stuff as usual, including several issues
>>> in the btrfs allocator, but it doesn't look like anything that would
>>> mess with a bio.
>>>
>>> 	$ git log --no-walk --oneline FETCH_HEAD
>>> 	6130a25681d4 (kdave/for-next) Merge branch 'for-next-next-v5.20-20220804' into for-next-20220804
>>>
>>> 	repair_io_failure at fs/btrfs/extent_io.c:2350 (discriminator 1)
>>> 	 2345           u64 sector;
>>> 	 2346           struct btrfs_io_context *bioc = NULL;
>>> 	 2347           int ret = 0;
>>> 	 2348
>>> 	 2349           ASSERT(!(fs_info->sb->s_flags & SB_RDONLY));
>>> 	>2350<          BUG_ON(!mirror_num);
>>> 	 2351
>>> 	 2352           if (btrfs_repair_one_zone(fs_info, logical))
>>> 	 2353                   return 0;
>>> 	 2354
>>> 	 2355           map_length = length;

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: misc-next and for-next: kernel BUG at fs/btrfs/extent_io.c:2350! during raid5 recovery
  2022-08-14  4:52     ` Qu Wenruo
@ 2022-08-16  1:01       ` Zygo Blaxell
  2022-08-16  1:25         ` Qu Wenruo
  0 siblings, 1 reply; 19+ messages in thread
From: Zygo Blaxell @ 2022-08-16  1:01 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs, Christoph Hellwig

On Sun, Aug 14, 2022 at 12:52:24PM +0800, Qu Wenruo wrote:
> Hi Zygo,
> 
> We have pinned down the root cause of the crash, and got a quick fix for it.
> 
> https://lore.kernel.org/linux-btrfs/1d9b69af6ce0a79e54fbaafcc65ead8f71b54b60.1660377678.git.wqu@suse.com/
> 
> Mind to test above patch to see if this can solve your crash?

Without the patch, tests using compressed data and uncompressed data
both lead to crashes:

	Compressed data hits the BUG_ON from end_compressed_bio_read
	as previously reported, in less than 30 minutes.

	Uncompressed data hits a BUG at fs/btrfs/tree-mod-log.c:675 about
	once every 2 hours.

Since applying the patch, neither of these crashes has occurred so far
(37 hours).

The BUG in tree-mod-log doesn't seem to be related to the raid5 corruption
issues.  The BUG starts happening before I corrupt any of the disk data.
It might have been present before, but obscured by the much more frequent
end_compressed_bio_read crashes.

I find it very interesting that the tree-mod-log BUG seems to have
stopped immediately after applying this patch.  Could they have the same
root cause, or a related cause?

I'll treat the tree-mod-log thing as a separate bug for now, and start
a new thread if I can still reproduce it on up-to-date for-next or
misc-next.

> For the RAID56 recovery, unfortunately we don't have any better way to
> enhance it during writes yet.
> 
> So your tests will still lead to data corruption anyway.

One bug at a time...  ;)

I've adapted my setup for future tests to stop writes, sync, inject the
corruption on one drive, then either run scrub or a readonly test to
correct errors, before resuming writes.  If I understand the constraints
correctly, all errors should be recoverable.

Thanks

> Thanks,
> Qu
> 
> On 2022/8/10 03:46, Zygo Blaxell wrote:
> > On Tue, Aug 09, 2022 at 12:36:44PM +0800, Qu Wenruo wrote:
> > > 
> > > 
> > > On 2022/8/9 11:31, Zygo Blaxell wrote:
> > > > Test case is:
> > > > 
> > > > 	- start with a -draid5 -mraid1 filesystem on 2 disks
> > > > 
> > > > 	- run assorted IO with a mix of reads and writes (randomly
> > > > 	run rsync, bees, snapshot create/delete, balance, scrub, start
> > > > 	replacing one of the disks...)
> > > > 
> > > > 	- cat /dev/zero > /dev/vdb (device 1) in the VM guest, or run
> > > > 	blkdiscard on the underlying SSD in the VM host, to simulate
> > > > 	single-disk data corruption
> > > > 
> > > > 	- repeat until something goes badly wrong, like unrecoverable
> > > > 	read error or crash
> > > > 
> > > > This test case always failed quickly before (corruption was rarely
> > > > if ever fully repaired on btrfs raid5 data), and it still doesn't work
> > > > now, but now it doesn't work for a new reason.  Progress?
> > > 
> > > The new read repair work for compressed extents, adding HCH to the thread.
> > > 
> > > But just curious, have you tested without compression?
> > 
> > All of the ~200 BUG_ON stack traces in my logs have the same list of
> > functions as above.  If the bug affected uncompressed data, I'd expect
> > to see two different stack traces.  It's a fairly decent sample size,
> > so I'd say it's most likely not happening with uncompressed extents.
> > 
> > All the production workloads have compression enabled, so we don't
> > normally test with compression disabled.  I can run a separate test for
> > that if you'd like.
> > 
> > > Thanks,
> > > Qu
> > > > 
> > > > There is now a BUG_ON arising from this test case:
> > > > 
> > > > 	[  241.051326][   T45] btrfs_print_data_csum_error: 156 callbacks suppressed
> > > > 	[  241.100910][   T45] ------------[ cut here ]------------
> > > > 	[  241.102531][   T45] kernel BUG at fs/btrfs/extent_io.c:2350!
> > > > 	[  241.103261][   T45] invalid opcode: 0000 [#2] PREEMPT SMP PTI
> > > > 	[  241.104044][   T45] CPU: 2 PID: 45 Comm: kworker/u8:4 Tainted: G      D           5.19.0-466d9d7ea677-for-next+ #85 89955463945a81b56a449b1f12383cf0d5e6b898
> > > > 	[  241.105652][   T45] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
> > > > 	[  241.106726][   T45] Workqueue: btrfs-endio-raid56 raid_recover_end_io_work
> > > > 	[  241.107716][   T45] RIP: 0010:repair_io_failure+0x359/0x4b0
> > > > 	[  241.108569][   T45] Code: 2b e8 cb 12 79 ff 48 c7 c6 20 23 ac 85 48 c7 c7 00 b9 14 88 e8 d8 e3 72 ff 48 8d bd 48 ff ff ff e8 5c 7e 26 00 e9 f6 fd ff ff <0f> 0b e8 60 d1 5e 01 85 c0 74 cc 48 c
> > > > 	7 c7 b0 1d 45 88 e8 d0 8e 98
> > > > 	[  241.111990][   T45] RSP: 0018:ffffbca9009f7a08 EFLAGS: 00010246
> > > > 	[  241.112911][   T45] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
> > > > 	[  241.115676][   T45] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
> > > > 	[  241.118009][   T45] RBP: ffffbca9009f7b00 R08: 0000000000000000 R09: 0000000000000000
> > > > 	[  241.119484][   T45] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9cd1b9da4000
> > > > 	[  241.120717][   T45] R13: 0000000000000000 R14: ffffe60cc81a4200 R15: ffff9cd235b4dfa4
> > > > 	[  241.122594][   T45] FS:  0000000000000000(0000) GS:ffff9cd2b7600000(0000) knlGS:0000000000000000
> > > > 	[  241.123831][   T45] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > > 	[  241.125003][   T45] CR2: 00007fbb76b1a738 CR3: 0000000109c26001 CR4: 0000000000170ee0
> > > > 	[  241.126226][   T45] Call Trace:
> > > > 	[  241.126646][   T45]  <TASK>
> > > > 	[  241.127165][   T45]  ? __bio_clone+0x1c0/0x1c0
> > > > 	[  241.128354][   T45]  clean_io_failure+0x21a/0x260
> > > > 	[  241.128384][   T45]  end_compressed_bio_read+0x2a9/0x470
> > > > 	[  241.128411][   T45]  bio_endio+0x361/0x3c0
> > > > 	[  241.128427][   T45]  rbio_orig_end_io+0x127/0x1c0
> > > > 	[  241.128447][   T45]  __raid_recover_end_io+0x405/0x8f0
> > > > 	[  241.128477][   T45]  raid_recover_end_io_work+0x8c/0xb0
> > > > 	[  241.128494][   T45]  process_one_work+0x4e5/0xaa0
> > > > 	[  241.128528][   T45]  worker_thread+0x32e/0x720
> > > > 	[  241.128541][   T45]  ? _raw_spin_unlock_irqrestore+0x7d/0xa0
> > > > 	[  241.128573][   T45]  ? process_one_work+0xaa0/0xaa0
> > > > 	[  241.128588][   T45]  kthread+0x1ab/0x1e0
> > > > 	[  241.128600][   T45]  ? kthread_complete_and_exit+0x40/0x40
> > > > 	[  241.128628][   T45]  ret_from_fork+0x22/0x30
> > > > 	[  241.128659][   T45]  </TASK>
> > > > 	[  241.128667][   T45] Modules linked in:
> > > > 	[  241.129700][   T45] ---[ end trace 0000000000000000 ]---
> > > > 	[  241.152310][   T45] RIP: 0010:repair_io_failure+0x359/0x4b0
> > > > 	[  241.153328][   T45] Code: 2b e8 cb 12 79 ff 48 c7 c6 20 23 ac 85 48 c7 c7 00 b9 14 88 e8 d8 e3 72 ff 48 8d bd 48 ff ff ff e8 5c 7e 26 00 e9 f6 fd ff ff <0f> 0b e8 60 d1 5e 01 85 c0 74 cc 48 c
> > > > 	7 c7 b0 1d 45 88 e8 d0 8e 98
> > > > 	[  241.156882][   T45] RSP: 0018:ffffbca902487a08 EFLAGS: 00010246
> > > > 	[  241.158103][   T45] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
> > > > 	[  241.160072][   T45] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
> > > > 	[  241.161984][   T45] RBP: ffffbca902487b00 R08: 0000000000000000 R09: 0000000000000000
> > > > 	[  241.164067][   T45] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9cd1b9da4000
> > > > 	[  241.165979][   T45] R13: 0000000000000000 R14: ffffe60cc7589740 R15: ffff9cd1f45495e4
> > > > 	[  241.167928][   T45] FS:  0000000000000000(0000) GS:ffff9cd2b7600000(0000) knlGS:0000000000000000
> > > > 	[  241.169978][   T45] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > > 	[  241.171649][   T45] CR2: 00007fbb76b1a738 CR3: 0000000109c26001 CR4: 0000000000170ee0
> > > > 
> > > > KFENCE and UBSAN aren't reporting anything before the BUG_ON.
> > > > 
> > > > KCSAN complains about a lot of stuff as usual, including several issues
> > > > in the btrfs allocator, but it doesn't look like anything that would
> > > > mess with a bio.
> > > > 
> > > > 	$ git log --no-walk --oneline FETCH_HEAD
> > > > 	6130a25681d4 (kdave/for-next) Merge branch 'for-next-next-v5.20-20220804' into for-next-20220804
> > > > 
> > > > 	repair_io_failure at fs/btrfs/extent_io.c:2350 (discriminator 1)
> > > > 	 2345           u64 sector;
> > > > 	 2346           struct btrfs_io_context *bioc = NULL;
> > > > 	 2347           int ret = 0;
> > > > 	 2348
> > > > 	 2349           ASSERT(!(fs_info->sb->s_flags & SB_RDONLY));
> > > > 	>2350<          BUG_ON(!mirror_num);
> > > > 	 2351
> > > > 	 2352           if (btrfs_repair_one_zone(fs_info, logical))
> > > > 	 2353                   return 0;
> > > > 	 2354
> > > > 	 2355           map_length = length;

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: misc-next and for-next: kernel BUG at fs/btrfs/extent_io.c:2350! during raid5 recovery
  2022-08-16  1:01       ` Zygo Blaxell
@ 2022-08-16  1:25         ` Qu Wenruo
  0 siblings, 0 replies; 19+ messages in thread
From: Qu Wenruo @ 2022-08-16  1:25 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: linux-btrfs, Christoph Hellwig



On 2022/8/16 09:01, Zygo Blaxell wrote:
> On Sun, Aug 14, 2022 at 12:52:24PM +0800, Qu Wenruo wrote:
>> Hi Zygo,
>>
>> We have pinned down the root cause of the crash, and got a quick fix for it.
>>
>> https://lore.kernel.org/linux-btrfs/1d9b69af6ce0a79e54fbaafcc65ead8f71b54b60.1660377678.git.wqu@suse.com/
>>
>> Mind to test above patch to see if this can solve your crash?
>
> Without the patch, tests using compressed data and uncompressed data
> both lead to crashes:
>
> 	Compressed data hits the BUG_ON from end_compressed_bio_read
> 	as previously reported, in less than 30 minutes.
>
> 	Uncompressed data hits a BUG at fs/btrfs/tree-mod-log.c:675 about
> 	once every 2 hours.
>
> Since applying the patch, neither of these crashes has occurred so far
> (37 hours).
>
> The BUG in tree-mod-log doesn't seem to be related to the raid5 corruption
> issues.

To be more clear, the crash on end_compressed_bio_read() is not even
related to btrfs RAID56.

In theory other RAID profiles with extra duplication should also hit that.

>  The BUG starts happening before I corrupt any of the disk data.
> It might have been present before, but obscured by the much more frequent
> end_compressed_bio_read crashes.
>
> I find it very interesting that the tree-mod-log BUG seems to have
> stopped immediately after applying this patch.  Could they have the same
> root cause, or a related cause?

For the tree-mod-log one, I have no clue at all why the fix can even help...

>
> I'll treat the tree-mod-log thing as a separate bug for now, and start
> a new thread if I can still reproduce it on up-to-date for-next or
> misc-next.

That would help a lot.

>
>> For the RAID56 recovery, unfortunately we don't have any better way to
>> enhance it during writes yet.
>>
>> So your tests will still lead to data corruption anyway.
>
> One bug at a time...  ;)
>
> I've adapted my setup for future tests to stop writes, sync, inject the
> corruption on one drive, then either run scrub or a readonly test to
> correct errors, before resuming writes.  If I understand the constraints
> correctly, all errors should be recoverable.

Yes, that's correct.

If you can still hit unrecoverable errors with that, I'll spare no effort
to fix those bugs.

Thanks,
Qu

>
> Thanks
>
>> Thanks,
>> Qu
>>
>> On 2022/8/10 03:46, Zygo Blaxell wrote:
>>> On Tue, Aug 09, 2022 at 12:36:44PM +0800, Qu Wenruo wrote:
>>>>
>>>>
>>>> On 2022/8/9 11:31, Zygo Blaxell wrote:
>>>>> Test case is:
>>>>>
>>>>> 	- start with a -draid5 -mraid1 filesystem on 2 disks
>>>>>
>>>>> 	- run assorted IO with a mix of reads and writes (randomly
>>>>> 	run rsync, bees, snapshot create/delete, balance, scrub, start
>>>>> 	replacing one of the disks...)
>>>>>
>>>>> 	- cat /dev/zero > /dev/vdb (device 1) in the VM guest, or run
>>>>> 	blkdiscard on the underlying SSD in the VM host, to simulate
>>>>> 	single-disk data corruption
>>>>>
>>>>> 	- repeat until something goes badly wrong, like unrecoverable
>>>>> 	read error or crash
>>>>>
>>>>> This test case always failed quickly before (corruption was rarely
>>>>> if ever fully repaired on btrfs raid5 data), and it still doesn't work
>>>>> now, but now it doesn't work for a new reason.  Progress?
>>>>
>>>> The new read repair work for compressed extents, adding HCH to the thread.
>>>>
>>>> But just curious, have you tested without compression?
>>>
>>> All of the ~200 BUG_ON stack traces in my logs have the same list of
>>> functions as above.  If the bug affected uncompressed data, I'd expect
>>> to see two different stack traces.  It's a fairly decent sample size,
>>> so I'd say it's most likely not happening with uncompressed extents.
>>>
>>> All the production workloads have compression enabled, so we don't
>>> normally test with compression disabled.  I can run a separate test for
>>> that if you'd like.
>>>
>>>> Thanks,
>>>> Qu
>>>>>
>>>>> There is now a BUG_ON arising from this test case:
>>>>>
>>>>> 	[  241.051326][   T45] btrfs_print_data_csum_error: 156 callbacks suppressed
>>>>> 	[  241.100910][   T45] ------------[ cut here ]------------
>>>>> 	[  241.102531][   T45] kernel BUG at fs/btrfs/extent_io.c:2350!
>>>>> 	[  241.103261][   T45] invalid opcode: 0000 [#2] PREEMPT SMP PTI
>>>>> 	[  241.104044][   T45] CPU: 2 PID: 45 Comm: kworker/u8:4 Tainted: G      D           5.19.0-466d9d7ea677-for-next+ #85 89955463945a81b56a449b1f12383cf0d5e6b898
>>>>> 	[  241.105652][   T45] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
>>>>> 	[  241.106726][   T45] Workqueue: btrfs-endio-raid56 raid_recover_end_io_work
>>>>> 	[  241.107716][   T45] RIP: 0010:repair_io_failure+0x359/0x4b0
>>>>> 	[  241.108569][   T45] Code: 2b e8 cb 12 79 ff 48 c7 c6 20 23 ac 85 48 c7 c7 00 b9 14 88 e8 d8 e3 72 ff 48 8d bd 48 ff ff ff e8 5c 7e 26 00 e9 f6 fd ff ff <0f> 0b e8 60 d1 5e 01 85 c0 74 cc 48 c
>>>>> 	7 c7 b0 1d 45 88 e8 d0 8e 98
>>>>> 	[  241.111990][   T45] RSP: 0018:ffffbca9009f7a08 EFLAGS: 00010246
>>>>> 	[  241.112911][   T45] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
>>>>> 	[  241.115676][   T45] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
>>>>> 	[  241.118009][   T45] RBP: ffffbca9009f7b00 R08: 0000000000000000 R09: 0000000000000000
>>>>> 	[  241.119484][   T45] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9cd1b9da4000
>>>>> 	[  241.120717][   T45] R13: 0000000000000000 R14: ffffe60cc81a4200 R15: ffff9cd235b4dfa4
>>>>> 	[  241.122594][   T45] FS:  0000000000000000(0000) GS:ffff9cd2b7600000(0000) knlGS:0000000000000000
>>>>> 	[  241.123831][   T45] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>> 	[  241.125003][   T45] CR2: 00007fbb76b1a738 CR3: 0000000109c26001 CR4: 0000000000170ee0
>>>>> 	[  241.126226][   T45] Call Trace:
>>>>> 	[  241.126646][   T45]  <TASK>
>>>>> 	[  241.127165][   T45]  ? __bio_clone+0x1c0/0x1c0
>>>>> 	[  241.128354][   T45]  clean_io_failure+0x21a/0x260
>>>>> 	[  241.128384][   T45]  end_compressed_bio_read+0x2a9/0x470
>>>>> 	[  241.128411][   T45]  bio_endio+0x361/0x3c0
>>>>> 	[  241.128427][   T45]  rbio_orig_end_io+0x127/0x1c0
>>>>> 	[  241.128447][   T45]  __raid_recover_end_io+0x405/0x8f0
>>>>> 	[  241.128477][   T45]  raid_recover_end_io_work+0x8c/0xb0
>>>>> 	[  241.128494][   T45]  process_one_work+0x4e5/0xaa0
>>>>> 	[  241.128528][   T45]  worker_thread+0x32e/0x720
>>>>> 	[  241.128541][   T45]  ? _raw_spin_unlock_irqrestore+0x7d/0xa0
>>>>> 	[  241.128573][   T45]  ? process_one_work+0xaa0/0xaa0
>>>>> 	[  241.128588][   T45]  kthread+0x1ab/0x1e0
>>>>> 	[  241.128600][   T45]  ? kthread_complete_and_exit+0x40/0x40
>>>>> 	[  241.128628][   T45]  ret_from_fork+0x22/0x30
>>>>> 	[  241.128659][   T45]  </TASK>
>>>>> 	[  241.128667][   T45] Modules linked in:
>>>>> 	[  241.129700][   T45] ---[ end trace 0000000000000000 ]---
>>>>> 	[  241.152310][   T45] RIP: 0010:repair_io_failure+0x359/0x4b0
>>>>> 	[  241.153328][   T45] Code: 2b e8 cb 12 79 ff 48 c7 c6 20 23 ac 85 48 c7 c7 00 b9 14 88 e8 d8 e3 72 ff 48 8d bd 48 ff ff ff e8 5c 7e 26 00 e9 f6 fd ff ff <0f> 0b e8 60 d1 5e 01 85 c0 74 cc 48 c
>>>>> 	7 c7 b0 1d 45 88 e8 d0 8e 98
>>>>> 	[  241.156882][   T45] RSP: 0018:ffffbca902487a08 EFLAGS: 00010246
>>>>> 	[  241.158103][   T45] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
>>>>> 	[  241.160072][   T45] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
>>>>> 	[  241.161984][   T45] RBP: ffffbca902487b00 R08: 0000000000000000 R09: 0000000000000000
>>>>> 	[  241.164067][   T45] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9cd1b9da4000
>>>>> 	[  241.165979][   T45] R13: 0000000000000000 R14: ffffe60cc7589740 R15: ffff9cd1f45495e4
>>>>> 	[  241.167928][   T45] FS:  0000000000000000(0000) GS:ffff9cd2b7600000(0000) knlGS:0000000000000000
>>>>> 	[  241.169978][   T45] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>> 	[  241.171649][   T45] CR2: 00007fbb76b1a738 CR3: 0000000109c26001 CR4: 0000000000170ee0
>>>>>
>>>>> KFENCE and UBSAN aren't reporting anything before the BUG_ON.
>>>>>
>>>>> KCSAN complains about a lot of stuff as usual, including several issues
>>>>> in the btrfs allocator, but it doesn't look like anything that would
>>>>> mess with a bio.
>>>>>
>>>>> 	$ git log --no-walk --oneline FETCH_HEAD
>>>>> 	6130a25681d4 (kdave/for-next) Merge branch 'for-next-next-v5.20-20220804' into for-next-20220804
>>>>>
>>>>> 	repair_io_failure at fs/btrfs/extent_io.c:2350 (discriminator 1)
>>>>> 	 2345           u64 sector;
>>>>> 	 2346           struct btrfs_io_context *bioc = NULL;
>>>>> 	 2347           int ret = 0;
>>>>> 	 2348
>>>>> 	 2349           ASSERT(!(fs_info->sb->s_flags & SB_RDONLY));
>>>>> 	>2350<          BUG_ON(!mirror_num);
>>>>> 	 2351
>>>>> 	 2352           if (btrfs_repair_one_zone(fs_info, logical))
>>>>> 	 2353                   return 0;
>>>>> 	 2354
>>>>> 	 2355           map_length = length;
>>
>> enhance it during writes yet.
>>
>> So your tests will still lead to data corruption anyway.
>>
>> Thanks,
>> Qu
>>
>> On 2022/8/10 03:46, Zygo Blaxell wrote:
>>> On Tue, Aug 09, 2022 at 12:36:44PM +0800, Qu Wenruo wrote:
>>>>
>>>>
>>>> On 2022/8/9 11:31, Zygo Blaxell wrote:
>>>>> Test case is:
>>>>>
>>>>> 	- start with a -draid5 -mraid1 filesystem on 2 disks
>>>>>
>>>>> 	- run assorted IO with a mix of reads and writes (randomly
>>>>> 	run rsync, bees, snapshot create/delete, balance, scrub, start
>>>>> 	replacing one of the disks...)
>>>>>
>>>>> 	- cat /dev/zero > /dev/vdb (device 1) in the VM guest, or run
>>>>> 	blkdiscard on the underlying SSD in the VM host, to simulate
>>>>> 	single-disk data corruption
>>>>>
>>>>> 	- repeat until something goes badly wrong, like unrecoverable
>>>>> 	read error or crash
>>>>>
>>>>> This test case always failed quickly before (corruption was rarely
>>>>> if ever fully repaired on btrfs raid5 data), and it still doesn't work
>>>>> now, but now it doesn't work for a new reason.  Progress?
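(A minimal sketch of the corruption-and-repair loop described above, for anyone trying to reproduce it; the device names, mount point, and zstd compression are assumptions for illustration, and the real harness also mixes in bees, snapshot create/delete, balance, and device replace.)

	# two-device raid5 data / raid1 metadata filesystem (assumed devices)
	mkfs.btrfs -f -draid5 -mraid1 /dev/vdb /dev/vdc
	mount -o compress=zstd /dev/vdb /mnt/test

	# keep a mixed read/write workload running in the background
	rsync -a /usr /mnt/test/ &

	# simulate single-disk data corruption on device 1
	cat /dev/zero > /dev/vdb || true

	# force reads of the damaged copies and attempt repair
	btrfs scrub start -B /mnt/test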
>>>>
>>>> This looks related to the new read repair work for compressed extents; adding HCH to the thread.
>>>>
>>>> But just curious, have you tested without compression?
>>>
>>> All of the ~200 BUG_ON stack traces in my logs have the same list of
>>> functions as above.  If the bug affected uncompressed data, I'd expect
>>> to see two different stack traces.  It's a fairly decent sample size,
>>> so I'd say it's most likely not happening with uncompressed extents.
>>>
>>> All the production workloads have compression enabled, so we don't
>>> normally test with compression disabled.  I can run a separate test for
>>> that if you'd like.
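(For a compression-disabled comparison run, the same loop could be repeated with the compress mount option turned off so newly written extents stay uncompressed; a sketch under the same assumed device layout as above.)

	# fresh filesystem so no previously compressed extents remain
	mkfs.btrfs -f -draid5 -mraid1 /dev/vdb /dev/vdc
	mount -o compress=no /dev/vdb /mnt/test
	rsync -a /usr /mnt/test/ &
	cat /dev/zero > /dev/vdb || true
	btrfs scrub start -B /mnt/test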
>>>
>>>> Thanks,
>>>> Qu

Thread overview: 19+ messages
2022-08-09  3:31 misc-next and for-next: kernel BUG at fs/btrfs/extent_io.c:2350! during raid5 recovery Zygo Blaxell
2022-08-09  4:36 ` Qu Wenruo
2022-08-09 19:46   ` Zygo Blaxell
2022-08-10  7:17     ` Qu Wenruo
2022-08-14  4:52     ` Qu Wenruo
2022-08-16  1:01       ` Zygo Blaxell
2022-08-16  1:25         ` Qu Wenruo
2022-08-09  7:35 ` Qu Wenruo
2022-08-09 19:29   ` Zygo Blaxell
2022-08-09 21:50     ` Qu Wenruo
2022-08-10  8:08       ` Goffredo Baroncelli
2022-08-10  8:24         ` Qu Wenruo
2022-08-10  8:45           ` Goffredo Baroncelli
2022-08-10  9:14             ` Qu Wenruo
2022-08-09  8:29 ` Christoph Hellwig
2022-08-09 19:24   ` Zygo Blaxell
2022-08-12  2:58     ` Wang Yugui
2022-08-12 22:47       ` Wang Yugui
2022-08-13  1:50     ` Qu Wenruo
