linux-btrfs.vger.kernel.org archive mirror
* bisected: btrfs dedupe regression in v5.11-rc1: 3078d85c9a10 vfs: verify source area in vfs_dedupe_file_range_one()
@ 2021-12-10 18:34 Zygo Blaxell
  2021-12-12 10:03 ` Thorsten Leemhuis
                   ` (2 more replies)
  0 siblings, 3 replies; 22+ messages in thread
From: Zygo Blaxell @ 2021-12-10 18:34 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Miklos Szeredi

I've been getting deadlocks in dedupe on btrfs since kernel 5.11, and
some bees users have reported it as well.  I bisected to this commit:

	3078d85c9a10 vfs: verify source area in vfs_dedupe_file_range_one()

These kernels work for at least 18 hours:

	5.10.83 (months)
	5.11.22 with 3078d85c9a10 reverted (36 hours)
	btrfs misc-next 66dc4de326b0 with 3078d85c9a10 reverted

These kernels lock up in 3 hours or less:

	5.11.22
	5.12.19
	5.14.21
	5.15.6
	btrfs for-next 279373dee83e

All of the failing kernels include this commit; none of the non-failing
kernels do.
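
For context, the failing workload is many bees threads issuing
FIDEDUPERANGE (ioctl 0xc0189436, which is the RSI value in the register
dumps below) against files that share extents.  A stripped-down,
hypothetical Python sketch of one such call -- struct layout from
include/uapi/linux/fs.h, not bees' actual code:

	import fcntl, struct

	FIDEDUPERANGE = 0xC0189436  # _IOWR(0x94, 54, 24-byte header)

	def dedupe_range(src_fd, src_off, length, dst_fd, dst_off):
	    # struct file_dedupe_range header:
	    #   u64 src_offset, src_length; u16 dest_count; u16+u32 reserved
	    arg = bytearray(struct.pack("=QQHHI", src_off, length, 1, 0, 0))
	    # one struct file_dedupe_range_info follows: s64 dest_fd;
	    #   u64 dest_offset; u64 bytes_deduped (out); s32 status (out);
	    #   u32 reserved
	    arg += struct.pack("=qQQiI", dst_fd, dst_off, 0, 0, 0)
	    fcntl.ioctl(src_fd, FIDEDUPERANGE, arg)  # kernel mutates arg
	    status, = struct.unpack_from("=i", arg, 48)  # info.status
	    return status  # 0 == FILE_DEDUPE_RANGE_SAME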

Kernel logs from the lockup:

	[19647.696042][ T3721] sysrq: Show Blocked State
	[19647.697024][ T3721] task:btrfs-transacti state:D stack:    0 pid: 6161 ppid:     2 flags:0x00004000
	[19647.698203][ T3721] Call Trace:
	[19647.698608][ T3721]  __schedule+0x388/0xaf0
	[19647.699125][ T3721]  schedule+0x68/0xe0
	[19647.699615][ T3721]  btrfs_commit_transaction+0x97c/0xbf0
	[19647.700276][ T3721]  ? start_transaction+0xd5/0x6f0
	[19647.700897][ T3721]  ? do_wait_intr_irq+0xd0/0xd0
	[19647.701507][ T3721]  transaction_kthread+0x138/0x1b0
	[19647.702154][ T3721]  kthread+0x151/0x170
	[19647.702651][ T3721]  ? btrfs_cleanup_transaction.isra.0+0x620/0x620
	[19647.703404][ T3721]  ? kthread_create_worker_on_cpu+0x70/0x70
	[19647.704119][ T3721]  ret_from_fork+0x22/0x30
	[19647.704679][ T3721] task:crawl_257_291   state:D stack:    0 pid: 6494 ppid:  6435 flags:0x00000000
	[19647.705797][ T3721] Call Trace:
	[19647.706188][ T3721]  __schedule+0x388/0xaf0
	[19647.706723][ T3721]  ? rwsem_down_write_slowpath+0x35f/0x770
	[19647.707414][ T3721]  schedule+0x68/0xe0
	[19647.707905][ T3721]  rwsem_down_write_slowpath+0x39f/0x770
	[19647.708597][ T3721]  down_write_nested+0xc1/0x130
	[19647.709167][ T3721]  lock_two_nondirectories+0x59/0x70
	[19647.709831][ T3721]  btrfs_remap_file_range+0x54/0x3c0
	[19647.710505][ T3721]  vfs_dedupe_file_range_one+0x117/0x180
	[19647.711197][ T3721]  vfs_dedupe_file_range+0x159/0x1e0
	[19647.711902][ T3721]  do_vfs_ioctl+0x551/0x720
	[19647.712434][ T3721]  ? __fget_files+0x109/0x1d0
	[19647.713010][ T3721]  __x64_sys_ioctl+0x6f/0xc0
	[19647.713544][ T3721]  do_syscall_64+0x38/0x90
	[19647.714098][ T3721]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
	[19647.714818][ T3721] RIP: 0033:0x7f0ab2429cc7
	[19647.715327][ T3721] RSP: 002b:00007f0ab03241d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
	[19647.716326][ T3721] RAX: ffffffffffffffda RBX: 00007f0ab0324420 RCX: 00007f0ab2429cc7
	[19647.717264][ T3721] RDX: 00007f0a981afe50 RSI: 00000000c0189436 RDI: 0000000000000012
	[19647.718203][ T3721] RBP: 00007f0a981afe50 R08: 00007f0a983759e0 R09: 0000000000000000
	[19647.719151][ T3721] R10: 00007ffc0f9f2080 R11: 0000000000000246 R12: 00007f0a981afe50
	[19647.720087][ T3721] R13: 00007f0ab0324428 R14: 0000000000000000 R15: 00007f0ab0324448
	[19647.721041][ T3721] task:crawl_256_295   state:D stack:    0 pid: 6496 ppid:  6435 flags:0x00000000
	[19647.722126][ T3721] Call Trace:
	[19647.722517][ T3721]  __schedule+0x388/0xaf0
	[19647.723060][ T3721]  schedule+0x68/0xe0
	[19647.723524][ T3721]  wait_current_trans+0xed/0x150
	[19647.724117][ T3721]  ? do_wait_intr_irq+0xd0/0xd0
	[19647.724697][ T3721]  start_transaction+0x37e/0x6f0
	[19647.725273][ T3721]  ? btrfs_inode_flags_to_xflags+0x50/0x50
	[19647.725969][ T3721]  btrfs_attach_transaction+0x1d/0x20
	[19647.726625][ T3721]  iterate_extent_inodes+0x7b/0x270
	[19647.727236][ T3721]  iterate_inodes_from_logical+0x9f/0xe0
	[19647.727912][ T3721]  ? btrfs_inode_flags_to_xflags+0x50/0x50
	[19647.728599][ T3721]  btrfs_ioctl_logical_to_ino+0x183/0x210
	[19647.729263][ T3721]  btrfs_ioctl+0xa83/0x2fe0
	[19647.729818][ T3721]  ? kvm_sched_clock_read+0x18/0x30
	[19647.730430][ T3721]  ? sched_clock+0x9/0x10
	[19647.730976][ T3721]  ? __fget_files+0xe6/0x1d0
	[19647.731521][ T3721]  ? __fget_files+0x109/0x1d0
	[19647.732096][ T3721]  __x64_sys_ioctl+0x91/0xc0
	[19647.732647][ T3721]  ? __x64_sys_ioctl+0x91/0xc0
	[19647.733209][ T3721]  do_syscall_64+0x38/0x90
	[19647.733750][ T3721]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
	[19647.734438][ T3721] RIP: 0033:0x7f0ab2429cc7
	[19647.734992][ T3721] RSP: 002b:00007f0aaf322378 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
	[19647.735994][ T3721] RAX: ffffffffffffffda RBX: 0000558110d25dd0 RCX: 00007f0ab2429cc7
	[19647.736949][ T3721] RDX: 00007f0aaf322608 RSI: 00000000c038943b RDI: 0000000000000003
	[19647.737895][ T3721] RBP: 00007f0aaf322550 R08: 0000000000000000 R09: 00007f0aaf3227e0
	[19647.738862][ T3721] R10: 000019f231c642a4 R11: 0000000000000246 R12: 00007f0aaf322600
	[19647.739799][ T3721] R13: 0000000000000003 R14: 00007f0aaf322608 R15: 00007f0aaf3225e0
	[19647.740748][ T3721] task:crawl_257_292   state:D stack:    0 pid: 6502 ppid:  6435 flags:0x00000000
	[19647.741832][ T3721] Call Trace:
	[19647.742216][ T3721]  __schedule+0x388/0xaf0
	[19647.742761][ T3721]  schedule+0x68/0xe0
	[19647.743225][ T3721]  wait_current_trans+0xed/0x150
	[19647.743825][ T3721]  ? do_wait_intr_irq+0xd0/0xd0
	[19647.744399][ T3721]  start_transaction+0x587/0x6f0
	[19647.745003][ T3721]  btrfs_start_transaction+0x1e/0x20
	[19647.745638][ T3721]  btrfs_replace_file_extents+0x135/0x8d0
	[19647.746305][ T3721]  ? release_extent_buffer+0xae/0xf0
	[19647.746973][ T3721]  btrfs_clone+0x828/0x8c0
	[19647.747513][ T3721]  btrfs_extent_same_range+0x75/0xa0
	[19647.748152][ T3721]  btrfs_remap_file_range+0x354/0x3c0
	[19647.748802][ T3721]  vfs_dedupe_file_range_one+0x117/0x180
	[19647.749460][ T3721]  vfs_dedupe_file_range+0x159/0x1e0
	[19647.750098][ T3721]  do_vfs_ioctl+0x551/0x720
	[19647.750666][ T3721]  ? __fget_files+0x109/0x1d0
	[19647.751216][ T3721]  __x64_sys_ioctl+0x6f/0xc0
	[19647.751777][ T3721]  do_syscall_64+0x38/0x90
	[19647.752290][ T3721]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
	[19647.752992][ T3721] RIP: 0033:0x7f0ab2429cc7
	[19647.753511][ T3721] RSP: 002b:00007f0aac31c1d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
	[19647.754500][ T3721] RAX: ffffffffffffffda RBX: 00007f0aac31c420 RCX: 00007f0ab2429cc7
	[19647.755460][ T3721] RDX: 00007f0a787fc690 RSI: 00000000c0189436 RDI: 0000000000000012
	[19647.756406][ T3721] RBP: 00007f0a787fc690 R08: 00007f0a78f4fd40 R09: 0000000000000000
	[19647.757340][ T3721] R10: 00007ffc0f9f2080 R11: 0000000000000246 R12: 00007f0a787fc690
	[19647.758291][ T3721] R13: 00007f0aac31c428 R14: 0000000000000000 R15: 00007f0aac31c448
	[19647.759261][ T3721] task:crawl_257_293   state:D stack:    0 pid: 6503 ppid:  6435 flags:0x00000000
	[19647.760362][ T3721] Call Trace:
	[19647.760761][ T3721]  __schedule+0x388/0xaf0
	[19647.761261][ T3721]  ? rwsem_down_write_slowpath+0x35f/0x770
	[19647.761958][ T3721]  schedule+0x68/0xe0
	[19647.762425][ T3721]  rwsem_down_write_slowpath+0x39f/0x770
	[19647.763126][ T3721]  down_write+0xbd/0x120
	[19647.763643][ T3721]  btrfs_remap_file_range+0x2eb/0x3c0
	[19647.764278][ T3721]  vfs_dedupe_file_range_one+0x117/0x180
	[19647.764954][ T3721]  vfs_dedupe_file_range+0x159/0x1e0
	[19647.765589][ T3721]  do_vfs_ioctl+0x551/0x720
	[19647.766115][ T3721]  ? __fget_files+0x109/0x1d0
	[19647.766700][ T3721]  __x64_sys_ioctl+0x6f/0xc0
	[19647.767238][ T3721]  do_syscall_64+0x38/0x90
	[19647.767772][ T3721]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
	[19647.768458][ T3721] RIP: 0033:0x7f0ab2429cc7
	[19647.768990][ T3721] RSP: 002b:00007f0aabb1b1d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
	[19647.769986][ T3721] RAX: ffffffffffffffda RBX: 00007f0aabb1b420 RCX: 00007f0ab2429cc7
	[19647.770951][ T3721] RDX: 00007f0a6c1f5ac0 RSI: 00000000c0189436 RDI: 0000000000000012
	[19647.771894][ T3721] RBP: 00007f0a6c1f5ac0 R08: 00007f0a6d00da10 R09: 0000000000000000
	[19647.772839][ T3721] R10: 00007ffc0f9f2080 R11: 0000000000000246 R12: 00007f0a6c1f5ac0
	[19647.773789][ T3721] R13: 00007f0aabb1b428 R14: 0000000000000000 R15: 00007f0aabb1b448
	[19647.774784][ T3721] task:crawl_256_289   state:D stack:    0 pid: 6504 ppid:  6435 flags:0x00000000
	[19647.775882][ T3721] Call Trace:
	[19647.776271][ T3721]  __schedule+0x388/0xaf0
	[19647.776807][ T3721]  ? rwsem_down_write_slowpath+0x35f/0x770
	[19647.777499][ T3721]  schedule+0x68/0xe0
	[19647.777994][ T3721]  rwsem_down_write_slowpath+0x39f/0x770
	[19647.778712][ T3721]  down_write_nested+0xc1/0x130
	[19647.779285][ T3721]  lock_two_nondirectories+0x59/0x70
	[19647.779931][ T3721]  btrfs_remap_file_range+0x54/0x3c0
	[19647.780564][ T3721]  vfs_dedupe_file_range_one+0x117/0x180
	[19647.781366][ T3721]  vfs_dedupe_file_range+0x159/0x1e0
	[19647.782009][ T3721]  do_vfs_ioctl+0x551/0x720
	[19647.782554][ T3721]  ? __fget_files+0x109/0x1d0
	[19647.783230][ T3721]  __x64_sys_ioctl+0x6f/0xc0
	[19647.783804][ T3721]  do_syscall_64+0x38/0x90
	[19647.784319][ T3721]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
	[19647.785031][ T3721] RIP: 0033:0x7f0ab2429cc7
	[19647.785548][ T3721] RSP: 002b:00007f0aab31a1d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
	[19647.786660][ T3721] RAX: ffffffffffffffda RBX: 00007f0aab31a420 RCX: 00007f0ab2429cc7
	[19647.787609][ T3721] RDX: 00007f0a70008f00 RSI: 00000000c0189436 RDI: 0000000000000012
	[19647.788551][ T3721] RBP: 00007f0a70008f00 R08: 00007f0a708e66e0 R09: 0000000000000000
	[19647.789533][ T3721] R10: 00007ffc0f9f2080 R11: 0000000000000246 R12: 00007f0a70008f00
	[19647.790482][ T3721] R13: 00007f0aab31a428 R14: 0000000000000000 R15: 00007f0aab31a448
	[32609.668575][ T3721] sysrq: Show Locks Held
	[32609.673950][ T3721] 
	[32609.673950][ T3721] Showing all locks held in the system:
	[32609.675276][ T3721] 1 lock held by in:imklog/3603:
	[32609.675885][ T3721] 1 lock held by dmesg/3720:
	[32609.676432][ T3721]  #0: ffff8a1406ac80e0 (&user->lock){+.+.}-{3:3}, at: devkmsg_read+0x4d/0x320
	[32609.678403][ T3721] 3 locks held by bash/3721:
	[32609.678972][ T3721]  #0: ffff8a142a589498 (sb_writers#4){.+.+}-{0:0}, at: ksys_write+0x70/0xf0
	[32609.680364][ T3721]  #1: ffffffff98f199a0 (rcu_read_lock){....}-{1:2}, at: __handle_sysrq+0x5/0xa0
	[32609.682046][ T3721]  #2: ffffffff98f199a0 (rcu_read_lock){....}-{1:2}, at: debug_show_all_locks+0x23/0x187
	[32609.683847][ T3721] 1 lock held by btrfs-transacti/6161:
	[32609.684498][ T3721]  #0: ffff8a14e0178850 (&fs_info->transaction_kthread_mutex){+.+.}-{3:3}, at: transaction_kthread+0x5a/0x1b0
	[32609.686275][ T3721] 3 locks held by crawl_257_265/6491:
	[32609.686937][ T3721] 3 locks held by crawl_257_291/6494:
	[32609.687578][ T3721]  #0: ffff8a14bd092498 (sb_writers#12){.+.+}-{0:0}, at: vfs_dedupe_file_range_one+0x3b/0x180
	[32609.689126][ T3721]  #1: ffff8a1410d7c848 (&sb->s_type->i_mutex_key#17){+.+.}-{3:3}, at: lock_two_nondirectories+0x6b/0x70
	[32609.690694][ T3721]  #2: ffff8a14161a39c8 (&sb->s_type->i_mutex_key#17/4){+.+.}-{3:3}, at: lock_two_nondirectories+0x59/0x70
	[32609.692091][ T3721] 4 locks held by crawl_257_292/6502:
	[32609.692763][ T3721]  #0: ffff8a14bd092498 (sb_writers#12){.+.+}-{0:0}, at: vfs_dedupe_file_range_one+0x3b/0x180
	[32609.694014][ T3721]  #1: ffff8a131637a908 (&sb->s_type->i_mutex_key#17){+.+.}-{3:3}, at: lock_two_nondirectories+0x6b/0x70
	[32609.695377][ T3721]  #2: ffff8a14161a39c8 (&sb->s_type->i_mutex_key#17/4){+.+.}-{3:3}, at: lock_two_nondirectories+0x59/0x70
	[32609.696764][ T3721]  #3: ffff8a14bd0926b8 (sb_internal#2){.+.+}-{0:0}, at: btrfs_start_transaction+0x1e/0x20
	[32609.697986][ T3721] 2 locks held by crawl_257_293/6503:
	[32609.698629][ T3721]  #0: ffff8a14bd092498 (sb_writers#12){.+.+}-{0:0}, at: vfs_dedupe_file_range_one+0x3b/0x180
	[32609.699882][ T3721]  #1: ffff8a14161a39c8 (&sb->s_type->i_mutex_key#17){+.+.}-{3:3}, at: btrfs_remap_file_range+0x2eb/0x3c0
	[32609.701674][ T3721] 3 locks held by crawl_256_289/6504:
	[32609.702443][ T3721]  #0: ffff8a14bd092498 (sb_writers#12){.+.+}-{0:0}, at: vfs_dedupe_file_range_one+0x3b/0x180
	[32609.703927][ T3721]  #1: ffff8a140f2c4748 (&sb->s_type->i_mutex_key#17){+.+.}-{3:3}, at: lock_two_nondirectories+0x6b/0x70
	[32609.705444][ T3721]  #2: ffff8a14161a39c8 (&sb->s_type->i_mutex_key#17/4){+.+.}-{3:3}, at: lock_two_nondirectories+0x59/0x70
	[32609.706899][ T3721] 
	[32609.707177][ T3721] =============================================

There is also a severe performance regression (50% slower) in btrfs
dedupe starting in 5.11, but reverting this commit doesn't make that
regression go away.  I'll follow up on it separately.


* Re: bisected: btrfs dedupe regression in v5.11-rc1: 3078d85c9a10 vfs: verify source area in vfs_dedupe_file_range_one()
  2021-12-10 18:34 bisected: btrfs dedupe regression in v5.11-rc1: 3078d85c9a10 vfs: verify source area in vfs_dedupe_file_range_one() Zygo Blaxell
@ 2021-12-12 10:03 ` Thorsten Leemhuis
  2021-12-13 13:28 ` Nikolay Borisov
  2022-06-13  8:38 ` Libor Klepáč
  2 siblings, 0 replies; 22+ messages in thread
From: Thorsten Leemhuis @ 2021-12-12 10:03 UTC (permalink / raw)
  To: Zygo Blaxell, linux-btrfs; +Cc: Miklos Szeredi, regressions

[TLDR: adding this regression to regzbot; most of this mail is compiled
from a few template paragraphs some of you might have seen already.]

Hi, this is your Linux kernel regression tracker speaking.

Top-posting for once, to make this easily accessible to everyone.

Thanks for the report.

Adding the regression mailing list to the list of recipients, as it
should be in the loop for all regressions, as explained here:
https://www.kernel.org/doc/html/latest/admin-guide/reporting-issues.html

To be sure this issue doesn't fall through the cracks unnoticed, I'm
adding it to regzbot, my Linux kernel regression tracking bot:

#regzbot ^introduced 3078d85c9a10
#regzbot title btrfs: deadlocks in dedupe
#regzbot ignore-activity

Reminder: when fixing the issue, please add a 'Link:' tag with the URL
to the report (the parent of this mail), then regzbot will automatically
mark the regression as resolved once the fix lands in the appropriate
tree. For more details about regzbot see footer.

Sending this to everyone that got the initial report, to make everyone
aware of the tracking. I also hope that messages like this motivate
people to directly get at least the regression mailing list, and
ideally even regzbot, involved when dealing with regressions, as
messages like this wouldn't be needed then.

Don't worry, I'll send further messages wrt this regression just to the
lists (with a tag in the subject so people can filter them out), as
long as they are intended just for regzbot. With a bit of luck no such
messages will be needed anyway.

Ciao, Thorsten (wearing his 'Linux kernel regression tracker' hat).

P.S.: As a Linux kernel regression tracker I'm getting a lot of reports
on my table. I can only look briefly into most of them. Unfortunately
I will therefore sometimes get things wrong or miss something important.
I hope that's not the case here; if you think it is, don't hesitate to
tell me about it in a public reply. That's in everyone's interest, as
what I wrote above might be misleading to everyone reading this; any
suggestion I gave might thus send someone down the wrong rabbit hole,
which none of us wants.

BTW, I have no personal interest in this issue, which is tracked using
regzbot, my Linux kernel regression tracking bot
(https://linux-regtracking.leemhuis.info/regzbot/). I'm only posting
this mail to get things rolling again and hence don't need to be CCed
on all further activities wrt this regression.

On 10.12.21 19:34, Zygo Blaxell wrote:
> I've been getting deadlocks in dedupe on btrfs since kernel 5.11, and
> some bees users have reported it as well.  I bisected to this commit:
> 
> 	3078d85c9a10 vfs: verify source area in vfs_dedupe_file_range_one()
<snip>



* Re: bisected: btrfs dedupe regression in v5.11-rc1: 3078d85c9a10 vfs: verify source area in vfs_dedupe_file_range_one()
  2021-12-10 18:34 bisected: btrfs dedupe regression in v5.11-rc1: 3078d85c9a10 vfs: verify source area in vfs_dedupe_file_range_one() Zygo Blaxell
  2021-12-12 10:03 ` Thorsten Leemhuis
@ 2021-12-13 13:28 ` Nikolay Borisov
  2021-12-13 23:12   ` Zygo Blaxell
  2022-06-13  8:38 ` Libor Klepáč
  2 siblings, 1 reply; 22+ messages in thread
From: Nikolay Borisov @ 2021-12-13 13:28 UTC (permalink / raw)
  To: Zygo Blaxell, linux-btrfs; +Cc: Miklos Szeredi



On 10.12.21 20:34, Zygo Blaxell wrote:
> I've been getting deadlocks in dedupe on btrfs since kernel 5.11, and
> some bees users have reported it as well.  I bisected to this commit:
> 
> 	3078d85c9a10 vfs: verify source area in vfs_dedupe_file_range_one()
> 
<snip>
> 
> Kernel logs from the lockup:
> 
> 	[19647.696042][ T3721] sysrq: Show Blocked State
> 	[19647.697024][ T3721] task:btrfs-transacti state:D stack:    0 pid: 6161 ppid:     2 flags:0x00004000
> 	[19647.698203][ T3721] Call Trace:
> 	[19647.698608][ T3721]  __schedule+0x388/0xaf0
> 	[19647.699125][ T3721]  schedule+0x68/0xe0
> 	[19647.699615][ T3721]  btrfs_commit_transaction+0x97c/0xbf0

Can you run this through the symbolize script? I'd like to understand
where in the transaction commit the sleep is happening. The picture
painted by the presented logs is that PID 6502 has already locked two
inodes and is trying to get a handle on the currently running
transaction via btrfs_start_transaction, but has to wait for it to
commit. Subsequently there are 3 more crawl processes (I assume those
come from bees?), PIDs 6494, 6503 and 6504, which are trying to lock an
inode as part of the remap but have to wait, since 6502 likely has
those inodes locked.

So the real reason for the hang is that the transaction commit is stuck
for some reason, and we have to figure out why that is.

<snip>


* Re: bisected: btrfs dedupe regression in v5.11-rc1: 3078d85c9a10 vfs: verify source area in vfs_dedupe_file_range_one()
  2021-12-13 13:28 ` Nikolay Borisov
@ 2021-12-13 23:12   ` Zygo Blaxell
  2021-12-14 11:11     ` Nikolay Borisov
  0 siblings, 1 reply; 22+ messages in thread
From: Zygo Blaxell @ 2021-12-13 23:12 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: linux-btrfs, Miklos Szeredi

On Mon, Dec 13, 2021 at 03:28:26PM +0200, Nikolay Borisov wrote:
> On 10.12.21 20:34, Zygo Blaxell wrote:
> > I've been getting deadlocks in dedupe on btrfs since kernel 5.11, and
> > some bees users have reported it as well.  I bisected to this commit:
> > 
> > 	3078d85c9a10 vfs: verify source area in vfs_dedupe_file_range_one()
> > 
<snip>
> > 
> > Kernel logs from the lockup:
> > 
> > 	[19647.696042][ T3721] sysrq: Show Blocked State
> > 	[19647.697024][ T3721] task:btrfs-transacti state:D stack:    0 pid: 6161 ppid:     2 flags:0x00004000
> > 	[19647.698203][ T3721] Call Trace:
> > 	[19647.698608][ T3721]  __schedule+0x388/0xaf0
> > 	[19647.699125][ T3721]  schedule+0x68/0xe0
> > 	[19647.699615][ T3721]  btrfs_commit_transaction+0x97c/0xbf0
> 
> Can you run this through the symbolize script? I'd like to understand
> where in the transaction commit the sleep is happening.

	btrfs_commit_transaction+0x97c/0xbf0:

	btrfs_commit_transaction at fs/btrfs/transaction.c:2159 (discriminator 9)
	 2154
	 2155           ret = btrfs_run_delayed_items(trans);
	 2156           if (ret)
	 2157                   goto cleanup_transaction;
	 2158
	>2159<          wait_event(cur_trans->writer_wait,
	 2160                      extwriter_counter_read(cur_trans) == 0);
	 2161
	 2162           /* some pending stuffs might be added after the previous flush. */
	 2163           ret = btrfs_run_delayed_items(trans);
	 2164           if (ret)

> The picture painted
> by the presented logs is that PID 6502 has already locked two
> inodes and is trying to get a handle on the currently running
> transaction via btrfs_start_transaction, but has to wait for it to
> commit. Subsequently there are 3 more crawl processes (I assume those
> come from bees?),

They are bees threads.

> PIDs
> 6494, 6503 and 6504, which are trying to lock an inode as part of the
> remap but have to wait, since 6502 likely has those inodes locked.

That is correct.  I have the filenames from the test log, and they're
all trying to use the same inode as dedupe src:

	6502 is deduping sub1/dz offset 0xab20000 -> sub1/dy offset 0x28ccd000
	logical extent bytenr src 0x73552c000 -> dst 0x735a9a000

85 ms later, the next three threads start dedupe in the same millisecond,
referencing the same src and dst extent, but each thread modifies a
different inode:

	6494 is deduping sub1/dz offset 0xab57000 -> sub1/dx offset 0x2839d000
	logical extent bytenr src 0x735e62000 -> dst 0x735f45000

	6504 is deduping sub1/dz offset 0xab57000 -> sub2/dv offset 0x3faa9000
	logical extent bytenr src 0x735e62000 -> dst 0x735f45000

	6503 is deduping sub1/dz offset 0xab57000 -> sub1/dz offset 0xab58000
	logical extent bytenr src 0x735e62000 -> dst 0x735f45000

but 6502 has already hung by that point.

There are some other threads that are blocked on LOGICAL_INO:

	6491 started LOGICAL_INO 26.231 seconds earlier than 6502, on
	logical 0x732071000, but doesn't appear in the sysrq-w output.
	It does appear in the lockdep output, but with no locks?  These
	lines are consecutive in the log:

	[32609.686275][ T3721] 3 locks held by crawl_257_265/6491:
	[32609.686937][ T3721] 3 locks held by crawl_257_291/6494:

	6496 started LOGICAL_INO 103 ms after 6502, and does appear in
	the sysrq-w output, on logical 0x734b79000.  6496 is the last
	thread to make any forward progress before 6502 takes the
	transaction lock.
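
For reference, LOGICAL_INO here is BTRFS_IOC_LOGICAL_INO_V2 (ioctl
0xc038943b, the RSI value in 6496's register dump above).  A rough,
hypothetical Python sketch of the call -- struct layouts from
include/uapi/linux/btrfs.h, not bees' actual code:

	import ctypes, fcntl, struct

	BTRFS_IOC_LOGICAL_INO_V2 = 0xC038943B  # _IOWR(0x94, 59, 56-byte args)

	def logical_to_ino(fd, logical, bufsize=65536):
	    # fd: any fd on the filesystem; the ioctl needs CAP_SYS_ADMIN.
	    # The result buffer is a struct btrfs_data_container:
	    #   u32 bytes_left, bytes_missing, elem_cnt, elem_missed; u64 val[]
	    buf = ctypes.create_string_buffer(bufsize)
	    # struct btrfs_ioctl_logical_ino_args:
	    #   u64 logical, size, reserved[3], flags, inodes (user pointer)
	    args = bytearray(struct.pack("=7Q", logical, bufsize, 0, 0, 0,
	                                 0, ctypes.addressof(buf)))
	    fcntl.ioctl(fd, BTRFS_IOC_LOGICAL_INO_V2, args)
	    elem_cnt = struct.unpack_from("=I", buf, 8)[0]
	    vals = struct.unpack_from("=%dQ" % elem_cnt, buf, 16)
	    # val[] holds (inum, offset, root) triples
	    return [vals[i:i + 3] for i in range(0, elem_cnt, 3)]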

> So the real reason for the hang is that the transaction commit is
> stuck for some reason, and we have to figure out why that is.

In other news, I was trying to reproduce the btrfs replace hang issue
and found that on 5.11 and later, btrfs replace can't run for more than
a few minutes at the same time as bees without hanging.  I've had to
test btrfs replace on 5.10.84 instead, but on that kernel it's hard to
reproduce any failures at all.  So it's possible we are looking at a
more generic lockup issue than dedupe.

<snip>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: bisected: btrfs dedupe regression in v5.11-rc1: 3078d85c9a10 vfs: verify source area in vfs_dedupe_file_range_one()
  2021-12-13 23:12   ` Zygo Blaxell
@ 2021-12-14 11:11     ` Nikolay Borisov
  2021-12-14 19:50       ` Zygo Blaxell
  0 siblings, 1 reply; 22+ messages in thread
From: Nikolay Borisov @ 2021-12-14 11:11 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 2648 bytes --]



On 14.12.21 г. 1:12, Zygo Blaxell wrote:
> On Mon, Dec 13, 2021 at 03:28:26PM +0200, Nikolay Borisov wrote:
>> On 10.12.21 г. 20:34, Zygo Blaxell wrote:
>>> I've been getting deadlocks in dedupe on btrfs since kernel 5.11, and
>>> some bees users have reported it as well.  I bisected to this commit:
>>>
>>> 	3078d85c9a10 vfs: verify source area in vfs_dedupe_file_range_one()
>>>
>>> These kernels work for at least 18 hours:
>>>
>>> 	5.10.83 (months)
>>> 	5.11.22 with 3078d85c9a10 reverted (36 hours)
>>> 	btrfs misc-next 66dc4de326b0 with 3078d85c9a10 reverted
>>>
>>> These kernels lock up in 3 hours or less:
>>>
>>> 	5.11.22
>>> 	5.12.19
>>> 	5.14.21
>>> 	5.15.6
>>> 	btrfs for-next 279373dee83e
>>>
>>> All of the failing kernels include this commit, none of the non-failing
>>> kernels include the commit.
>>>
>>> Kernel logs from the lockup:
>>>
>>> 	[19647.696042][ T3721] sysrq: Show Blocked State
>>> 	[19647.697024][ T3721] task:btrfs-transacti state:D stack:    0 pid: 6161 ppid:     2 flags:0x00004000
>>> 	[19647.698203][ T3721] Call Trace:
>>> 	[19647.698608][ T3721]  __schedule+0x388/0xaf0
>>> 	[19647.699125][ T3721]  schedule+0x68/0xe0
>>> 	[19647.699615][ T3721]  btrfs_commit_transaction+0x97c/0xbf0
>>
>> Can you run this through the symbolize script, as I'd like to understand
>> where in the transaction commit the sleep is happening.
> 
> 	btrfs_commit_transaction+0x97c/0xbf0:
> 
> 	btrfs_commit_transaction at fs/btrfs/transaction.c:2159 (discriminator 9)
> 	 2154
> 	 2155           ret = btrfs_run_delayed_items(trans);
> 	 2156           if (ret)
> 	 2157                   goto cleanup_transaction;
> 	 2158
> 	>2159<          wait_event(cur_trans->writer_wait,
> 	 2160                      extwriter_counter_read(cur_trans) == 0);
> 	 2161
> 	 2162           /* some pending stuffs might be added after the previous flush. */
> 	 2163           ret = btrfs_run_delayed_items(trans);
> 	 2164           if (ret)
> 

So it seems there is an open transaction handle, thus the commit can't
continue and everything is stalled behind it. Would you be able to run
the attached python script on a host which is stuck? It requires having
debug symbols for the kernel installed, as well as
https://github.com/osandov/drgn/, which is a scriptable debugger. The
easiest way would be to follow the instructions at
https://drgn.readthedocs.io/en/latest/installation.html and just get it
via pip.


Once you have it installed, run it by doing:

"sudo drgn get-num-extwriters.py 310dd372-0fd1-4496-a232-0fb46ca4afd6"

where 310dd372-0fd1-4496-a232-0fb46ca4afd6 is the fsid of the wedged fs,
as taken from 'blkid'.
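
If blkid isn't handy, the candidate fsids can also be listed from sysfs.
Rough sketch (I'm assuming /sys/fs/btrfs has one directory per mounted
filesystem):

	#!/usr/bin/env python3
	# list the fsids of btrfs filesystems the kernel currently knows about
	import os, uuid

	for entry in os.listdir("/sys/fs/btrfs"):
	    try:
	        print(uuid.UUID(entry))  # skips 'features' and other non-UUID entries
	    except ValueError:
	        pass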



<snip>

[-- Attachment #2: get-num-extwriters.py --]
[-- Type: text/x-python, Size: 738 bytes --]

#!/usr/bin/env python3

# Walk the kernel's global list of btrfs filesystems (fs_uuids), find the
# one whose fsid matches the UUID given on the command line, and print
# num_extwriters of its currently running transaction.  The 'prog' global
# is injected by drgn when the script is run under it.

import uuid, sys
from drgn.helpers.linux import list_for_each_entry

if len(sys.argv) != 2:
    print("Run with 'sudo drgn {} UUID'".format(sys.argv[0]))
    sys.exit(1)

fsid = sys.argv[1]
user_fsid = uuid.UUID(fsid)
found = False

btrfs_fs = prog['fs_uuids']
for fs in list_for_each_entry("struct btrfs_fs_devices", btrfs_fs.address_of_(), "fs_list"):
    current_fsid = uuid.UUID(bytes=fs.fsid.string_())
    if current_fsid.int == user_fsid.int:
        transaction = fs.fs_info.running_transaction
        found = True
        # num_extwriters is an atomic_t, hence the ["counter"] lookup
        print("[{}] num_extwriters is: {}".format(fsid, transaction.num_extwriters.value_()["counter"]))

if not found:
    print("Couldn't find matching UUID belonging to a BTRFS filesystem")

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: bisected: btrfs dedupe regression in v5.11-rc1: 3078d85c9a10 vfs: verify source area in vfs_dedupe_file_range_one()
  2021-12-14 11:11     ` Nikolay Borisov
@ 2021-12-14 19:50       ` Zygo Blaxell
  2021-12-14 22:25         ` Nikolay Borisov
  0 siblings, 1 reply; 22+ messages in thread
From: Zygo Blaxell @ 2021-12-14 19:50 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: linux-btrfs

On Tue, Dec 14, 2021 at 01:11:24PM +0200, Nikolay Borisov wrote:
> 
> 
> On 14.12.21 г. 1:12, Zygo Blaxell wrote:
> > On Mon, Dec 13, 2021 at 03:28:26PM +0200, Nikolay Borisov wrote:
> >> On 10.12.21 г. 20:34, Zygo Blaxell wrote:
> >>> I've been getting deadlocks in dedupe on btrfs since kernel 5.11, and
> >>> some bees users have reported it as well.  I bisected to this commit:
> >>>
> >>> 	3078d85c9a10 vfs: verify source area in vfs_dedupe_file_range_one()
> >>>
> >>> These kernels work for at least 18 hours:
> >>>
> >>> 	5.10.83 (months)
> >>> 	5.11.22 with 3078d85c9a10 reverted (36 hours)
> >>> 	btrfs misc-next 66dc4de326b0 with 3078d85c9a10 reverted
> >>>
> >>> These kernels lock up in 3 hours or less:
> >>>
> >>> 	5.11.22
> >>> 	5.12.19
> >>> 	5.14.21
> >>> 	5.15.6
> >>> 	btrfs for-next 279373dee83e
> >>>
> >>> All of the failing kernels include this commit, none of the non-failing
> >>> kernels include the commit.
> >>>
> >>> Kernel logs from the lockup:
> >>>
> >>> 	[19647.696042][ T3721] sysrq: Show Blocked State
> >>> 	[19647.697024][ T3721] task:btrfs-transacti state:D stack:    0 pid: 6161 ppid:     2 flags:0x00004000
> >>> 	[19647.698203][ T3721] Call Trace:
> >>> 	[19647.698608][ T3721]  __schedule+0x388/0xaf0
> >>> 	[19647.699125][ T3721]  schedule+0x68/0xe0
> >>> 	[19647.699615][ T3721]  btrfs_commit_transaction+0x97c/0xbf0
> >>
> >> Can you run this through the symbolize script, as I'd like to understand
> >> where in the transaction commit the sleep is happening.
> > 
> > 	btrfs_commit_transaction+0x97c/0xbf0:
> > 
> > 	btrfs_commit_transaction at fs/btrfs/transaction.c:2159 (discriminator 9)
> > 	 2154
> > 	 2155           ret = btrfs_run_delayed_items(trans);
> > 	 2156           if (ret)
> > 	 2157                   goto cleanup_transaction;
> > 	 2158
> > 	>2159<          wait_event(cur_trans->writer_wait,
> > 	 2160                      extwriter_counter_read(cur_trans) == 0);
> > 	 2161
> > 	 2162           /* some pending stuffs might be added after the previous flush. */
> > 	 2163           ret = btrfs_run_delayed_items(trans);
> > 	 2164           if (ret)
> > 
> 
> So it seems there is an open transaction handle, thus the commit can't
> continue and everything is stalled behind it. Would you be able to run
> the attached python script on a host which is stuck? It requires having
> debug symbols for the kernel installed, as well as
> https://github.com/osandov/drgn/, which is a scriptable debugger. The
> easiest way would be to follow the instructions at
> https://drgn.readthedocs.io/en/latest/installation.html and just get it
> via pip.
> 
> 
> Once you have it installed, run it by doing:
> 
> "sudo drgn get-num-extwriters.py 310dd372-0fd1-4496-a232-0fb46ca4afd6"
> 
> where 310dd372-0fd1-4496-a232-0fb46ca4afd6 is the fsid of the wedged fs,
> as taken from 'blkid'.

[drum roll noises...]

	[f79c1081-d81d-4abc-8b47-3b15bf2f93c5] num_extwriters is: 1

> <snip>

> #!/usr/bin/env python3
> 
> # Walk the kernel's global list of btrfs filesystems (fs_uuids), find the
> # one whose fsid matches the UUID given on the command line, and print
> # num_extwriters of its currently running transaction.  The 'prog' global
> # is injected by drgn when the script is run under it.
> 
> import uuid, sys
> from drgn.helpers.linux import list_for_each_entry
> 
> if len(sys.argv) != 2:
>     print("Run with 'sudo drgn {} UUID'".format(sys.argv[0]))
>     sys.exit(1)
> 
> fsid = sys.argv[1]
> user_fsid = uuid.UUID(fsid)
> found = False
> 
> btrfs_fs = prog['fs_uuids']
> for fs in list_for_each_entry("struct btrfs_fs_devices", btrfs_fs.address_of_(), "fs_list"):
>     current_fsid = uuid.UUID(bytes=fs.fsid.string_())
>     if current_fsid.int == user_fsid.int:
>         transaction = fs.fs_info.running_transaction
>         found = True
>         # num_extwriters is an atomic_t, hence the ["counter"] lookup
>         print("[{}] num_extwriters is: {}".format(fsid, transaction.num_extwriters.value_()["counter"]))
> 
> if not found:
>     print("Couldn't find matching UUID belonging to a BTRFS filesystem")


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: bisected: btrfs dedupe regression in v5.11-rc1: 3078d85c9a10 vfs: verify source area in vfs_dedupe_file_range_one()
  2021-12-14 19:50       ` Zygo Blaxell
@ 2021-12-14 22:25         ` Nikolay Borisov
  2021-12-16  5:33           ` Zygo Blaxell
  0 siblings, 1 reply; 22+ messages in thread
From: Nikolay Borisov @ 2021-12-14 22:25 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: linux-btrfs



On 14.12.21 г. 21:50, Zygo Blaxell wrote:
> On Tue, Dec 14, 2021 at 01:11:24PM +0200, Nikolay Borisov wrote:
>>
>>
>> On 14.12.21 г. 1:12, Zygo Blaxell wrote:
>>> On Mon, Dec 13, 2021 at 03:28:26PM +0200, Nikolay Borisov wrote:
>>>> On 10.12.21 г. 20:34, Zygo Blaxell wrote:
>>>>> I've been getting deadlocks in dedupe on btrfs since kernel 5.11, and
>>>>> some bees users have reported it as well.  I bisected to this commit:
>>>>>
>>>>> 	3078d85c9a10 vfs: verify source area in vfs_dedupe_file_range_one()
>>>>>
>>>>> These kernels work for at least 18 hours:
>>>>>
>>>>> 	5.10.83 (months)
>>>>> 	5.11.22 with 3078d85c9a10 reverted (36 hours)
>>>>> 	btrfs misc-next 66dc4de326b0 with 3078d85c9a10 reverted
>>>>>
>>>>> These kernels lock up in 3 hours or less:
>>>>>
>>>>> 	5.11.22
>>>>> 	5.12.19
>>>>> 	5.14.21
>>>>> 	5.15.6
>>>>> 	btrfs for-next 279373dee83e
>>>>>
>>>>> All of the failing kernels include this commit, none of the non-failing
>>>>> kernels include the commit.
>>>>>
>>>>> Kernel logs from the lockup:
>>>>>
>>>>> 	[19647.696042][ T3721] sysrq: Show Blocked State
>>>>> 	[19647.697024][ T3721] task:btrfs-transacti state:D stack:    0 pid: 6161 ppid:     2 flags:0x00004000
>>>>> 	[19647.698203][ T3721] Call Trace:
>>>>> 	[19647.698608][ T3721]  __schedule+0x388/0xaf0
>>>>> 	[19647.699125][ T3721]  schedule+0x68/0xe0
>>>>> 	[19647.699615][ T3721]  btrfs_commit_transaction+0x97c/0xbf0
>>>>
>>>> Can you run this through the symbolize script, as I'd like to understand
>>>> where in the transaction commit the sleep is happening.
>>>
>>> 	btrfs_commit_transaction+0x97c/0xbf0:
>>>
>>> 	btrfs_commit_transaction at fs/btrfs/transaction.c:2159 (discriminator 9)
>>> 	 2154
>>> 	 2155           ret = btrfs_run_delayed_items(trans);
>>> 	 2156           if (ret)
>>> 	 2157                   goto cleanup_transaction;
>>> 	 2158
>>> 	>2159<          wait_event(cur_trans->writer_wait,
>>> 	 2160                      extwriter_counter_read(cur_trans) == 0);
>>> 	 2161
>>> 	 2162           /* some pending stuffs might be added after the previous flush. */
>>> 	 2163           ret = btrfs_run_delayed_items(trans);
>>> 	 2164           if (ret)
>>>
>>
>> So it seems there is an open transaction handle, thus the commit can't
>> continue and everything is stalled behind it. Would you be able to run
>> the attached python script on a host which is stuck? It requires having
>> debug symbols for the kernel installed, as well as
>> https://github.com/osandov/drgn/, which is a scriptable debugger. The
>> easiest way would be to follow the instructions at
>> https://drgn.readthedocs.io/en/latest/installation.html and just get it
>> via pip.
>>
>>
>> Once you have it installed, run it by doing:
>>
>> "sudo drgn get-num-extwriters.py 310dd372-0fd1-4496-a232-0fb46ca4afd6"
>>
>> where 310dd372-0fd1-4496-a232-0fb46ca4afd6 is the fsid of the wedged fs,
>> as taken from 'blkid'.
> 
> [drum roll noises...]
> 
> 	[f79c1081-d81d-4abc-8b47-3b15bf2f93c5] num_extwriters is: 1

Huhz, this means there is an open transaction handle somewhere o_O. I
checked back through the stacktraces in your original email but couldn't
see where that might be coming from. I.e., all processes are waiting in
wait_current_trans, and this happens _before_ the transaction handle is
opened, hence num_extwriters can't have been incremented by them.

When the fs wedges again and you get num_extwriters, can you provide
the output of "echo w > /proc/sysrq-trigger"?
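
(If you want to script the capture, the trigger is just a write to
procfs; trivial sketch, needs root:)

	#!/usr/bin/env python3
	# same effect as `echo w > /proc/sysrq-trigger`: dump blocked tasks
	with open("/proc/sysrq-trigger", "w") as f:
	    f.write("w")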

<snip>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: bisected: btrfs dedupe regression in v5.11-rc1: 3078d85c9a10 vfs: verify source area in vfs_dedupe_file_range_one()
  2021-12-14 22:25         ` Nikolay Borisov
@ 2021-12-16  5:33           ` Zygo Blaxell
  2021-12-16 21:29             ` Nikolay Borisov
  0 siblings, 1 reply; 22+ messages in thread
From: Zygo Blaxell @ 2021-12-16  5:33 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: linux-btrfs

On Wed, Dec 15, 2021 at 12:25:04AM +0200, Nikolay Borisov wrote:
> Huhz, this means there is an open transaction handle somewhere o_O. I
> checked back through the stacktraces in your original email but couldn't
> see where that might be coming from. I.e., all processes are waiting in
> wait_current_trans, and this happens _before_ the transaction handle is
> opened, hence num_extwriters can't have been incremented by them.
> 
> When the fs wedges again and you get num_extwriters, can you provide
> the output of "echo w > /proc/sysrq-trigger"?

Here you go...

	# drgn -s vmlinux ~/bin/get-num-extwriters.py f79c1081-d81d-4abc-8b47-3b15bf2f93c5
	[f79c1081-d81d-4abc-8b47-3b15bf2f93c5] num_extwriters is: 1

	[30557.917447][ T3230] sysrq: Show Blocked State
	[30557.925266][ T3230] task:btrfs-transacti state:D stack:    0 pid:10552 ppid:     2 flags:0x00004000
	[30557.926479][ T3230] Call Trace:
	[30557.928165][ T3230]  __schedule+0x351/0xaa0
	[30557.928775][ T3230]  schedule+0x68/0xe0
	[30557.930092][ T3230]  btrfs_commit_transaction+0x814/0xbb0

	0xffffffff81576bf4 is in btrfs_commit_transaction (fs/btrfs/transaction.c:2164).
	2159
	2160            ret = btrfs_run_delayed_items(trans);
	2161            if (ret)
	2162                    goto cleanup_transaction;
	2163
	2164            wait_event(cur_trans->writer_wait,
	2165                       extwriter_counter_read(cur_trans) == 0);
	2166
	2167            /* some pending stuffs might be added after the previous flush. */
	2168            ret = btrfs_run_delayed_items(trans);

	[30557.930795][ T3230]  ? start_transaction+0xd5/0x6f0
	[30557.932200][ T3230]  ? add_wait_queue_exclusive+0x80/0x80
	[30557.932933][ T3230]  transaction_kthread+0x138/0x1b0
	[30557.933653][ T3230]  kthread+0x151/0x170
	[30557.934248][ T3230]  ? btrfs_cleanup_transaction.isra.0+0x620/0x620
	[30557.935096][ T3230]  ? kthread_create_worker_on_cpu+0x70/0x70
	[30557.935911][ T3230]  ret_from_fork+0x22/0x30
	[30557.936476][ T3230] task:crawl_5_263     state:D stack:    0 pid:10874 ppid: 10762 flags:0x00000000
	[30557.937707][ T3230] Call Trace:
	[30557.938180][ T3230]  __schedule+0x351/0xaa0
	[30557.938714][ T3230]  schedule+0x68/0xe0
	[30557.939224][ T3230]  wait_current_trans+0xed/0x150

	wait_current_trans at fs/btrfs/transaction.c:526 (discriminator 19)
	 521            cur_trans = fs_info->running_transaction;
	 522            if (cur_trans && is_transaction_blocked(cur_trans)) {
	 523                    refcount_inc(&cur_trans->use_count);
	 524                    spin_unlock(&fs_info->trans_lock);
	 525    
	>526<                   wait_event(fs_info->transaction_wait,
	 527                               cur_trans->state >= TRANS_STATE_UNBLOCKED ||
	 528                               TRANS_ABORTED(cur_trans));
	 529                    btrfs_put_transaction(cur_trans);
	 530            } else {
	 531                    spin_unlock(&fs_info->trans_lock);

	[30557.939834][ T3230]  ? add_wait_queue_exclusive+0x80/0x80
	[30557.940502][ T3230]  start_transaction+0x587/0x6f0
	[30557.941170][ T3230]  btrfs_start_transaction+0x1e/0x20
	[30557.941905][ T3230]  btrfs_replace_file_extents+0x12a/0x8d0
	[30557.942586][ T3230]  ? release_extent_buffer+0xae/0xf0
	[30557.943258][ T3230]  ? do_raw_spin_unlock+0x57/0xb0
	[30557.944211][ T3230]  btrfs_clone+0x7ea/0x880
	[30557.944857][ T3230]  btrfs_extent_same_range+0x75/0xa0
	[30557.945511][ T3230]  btrfs_remap_file_range+0x354/0x3c0
	[30557.946859][ T3230]  vfs_dedupe_file_range_one.part.0+0xc9/0x140
	[30557.947626][ T3230]  vfs_dedupe_file_range+0x186/0x210
	[30557.948434][ T3230]  do_vfs_ioctl+0x551/0x720
	[30557.949073][ T3230]  ? __fget_files+0x10c/0x1e0
	[30557.949636][ T3230]  __x64_sys_ioctl+0x6f/0xc0
	[30557.950382][ T3230]  do_syscall_64+0x38/0x90
	[30557.951024][ T3230]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
	[30557.951788][ T3230] RIP: 0033:0x7f0bd832ccc7
	[30557.952439][ T3230] RSP: 002b:00007f0bd5a26018 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
	[30557.953828][ T3230] RAX: ffffffffffffffda RBX: 00007f0bd5a26280 RCX: 00007f0bd832ccc7
	[30557.955254][ T3230] RDX: 00007f0bb401a920 RSI: 00000000c0189436 RDI: 000000000000000d
	[30557.956277][ T3230] RBP: 00007f0bb401a920 R08: 0000000000000000 R09: 00007f0bd5a26690
	[30557.957374][ T3230] R10: 00007ffd6836b080 R11: 0000000000000246 R12: 00007f0bb401a920
	[30557.958502][ T3230] R13: 00007f0bd5a26288 R14: 0000000000000000 R15: 00007f0bd5a262a8
	[30557.959603][ T3230] task:crawl_363_264   state:D stack:    0 pid:10876 ppid: 10762 flags:0x00000000
	[30557.960873][ T3230] Call Trace:
	[30557.961324][ T3230]  __schedule+0x351/0xaa0
	[30557.961947][ T3230]  schedule+0x68/0xe0
	[30557.962515][ T3230]  wait_current_trans+0xed/0x150
	[30557.963180][ T3230]  ? add_wait_queue_exclusive+0x80/0x80
	[30557.963884][ T3230]  start_transaction+0x37e/0x6f0
	[30557.964699][ T3230]  ? btrfs_inode_flags_to_xflags+0x50/0x50
	[30557.965469][ T3230]  btrfs_attach_transaction+0x1d/0x20
	[30557.966158][ T3230]  iterate_extent_inodes+0x7b/0x270
	[30557.966781][ T3230]  iterate_inodes_from_logical+0x9f/0xe0
	[30557.967568][ T3230]  ? btrfs_inode_flags_to_xflags+0x50/0x50
	[30557.968395][ T3230]  btrfs_ioctl_logical_to_ino+0x183/0x210
	[30557.969213][ T3230]  btrfs_ioctl+0xa81/0x2fb0
	[30557.970074][ T3230]  ? kvm_sched_clock_read+0x18/0x30
	[30557.971034][ T3230]  ? sched_clock+0x9/0x10
	[30557.971586][ T3230]  ? __fget_files+0xed/0x1e0
	[30557.972214][ T3230]  ? __fget_files+0x10c/0x1e0
	[30557.972929][ T3230]  __x64_sys_ioctl+0x91/0xc0
	[30557.973527][ T3230]  ? __x64_sys_ioctl+0x91/0xc0
	[30557.974147][ T3230]  do_syscall_64+0x38/0x90
	[30557.974689][ T3230]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
	[30557.975440][ T3230] RIP: 0033:0x7f0bd832ccc7
	[30557.976059][ T3230] RSP: 002b:00007f0bd4a24378 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
	[30557.977226][ T3230] RAX: ffffffffffffffda RBX: 000055ca79b80de0 RCX: 00007f0bd832ccc7
	[30557.978272][ T3230] RDX: 00007f0bd4a24718 RSI: 00000000c038943b RDI: 0000000000000003
	[30557.979266][ T3230] RBP: 00007f0bd4a24550 R08: 0000000000000000 R09: 00007f0bd4a24800
	[30557.980262][ T3230] R10: 0000071ac32defbe R11: 0000000000000246 R12: 00007f0bd4a24710
	[30557.981255][ T3230] R13: 0000000000000003 R14: 00007f0bd4a24718 R15: 00007f0bd4a24a58
	[30557.982262][ T3230] task:crawl_5_264     state:D stack:    0 pid:10878 ppid: 10762 flags:0x00000000
	[30557.983449][ T3230] Call Trace:
	[30557.983985][ T3230]  __schedule+0x351/0xaa0
	[30557.984522][ T3230]  ? rwsem_down_write_slowpath+0x254/0x510
	[30557.985336][ T3230]  schedule+0x68/0xe0
	[30557.985912][ T3230]  rwsem_down_write_slowpath+0x280/0x510
	[30557.986598][ T3230]  down_write+0xbd/0x120
	[30557.987132][ T3230]  lock_two_nondirectories+0x6b/0x70
	[30557.987786][ T3230]  btrfs_remap_file_range+0x54/0x3c0
	[30557.988452][ T3230]  vfs_dedupe_file_range_one.part.0+0xc9/0x140
	[30557.989231][ T3230]  vfs_dedupe_file_range+0x186/0x210
	[30557.989884][ T3230]  do_vfs_ioctl+0x551/0x720
	[30557.990452][ T3230]  ? __fget_files+0x10c/0x1e0
	[30557.991046][ T3230]  __x64_sys_ioctl+0x6f/0xc0
	[30557.991624][ T3230]  do_syscall_64+0x38/0x90
	[30557.992182][ T3230]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
	[30557.992933][ T3230] RIP: 0033:0x7f0bd832ccc7
	[30557.993455][ T3230] RSP: 002b:00007f0bd3a22018 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
	[30557.994531][ T3230] RAX: ffffffffffffffda RBX: 00007f0bd3a22280 RCX: 00007f0bd832ccc7
	[30557.995523][ T3230] RDX: 00007f0ba4454810 RSI: 00000000c0189436 RDI: 000000000000000d
	[30557.996500][ T3230] RBP: 00007f0ba4454810 R08: 0000000000000000 R09: 00007f0bd3a22690
	[30557.997492][ T3230] R10: 00007ffd6836b080 R11: 0000000000000246 R12: 00007f0ba4454810
	[30557.998465][ T3230] R13: 00007f0bd3a22288 R14: 0000000000000000 R15: 00007f0bd3a222a8
	[30557.999462][ T3230] task:crawl_362_264   state:D stack:    0 pid:10881 ppid: 10762 flags:0x00000000
	[30558.000568][ T3230] Call Trace:
	[30558.001025][ T3230]  __schedule+0x351/0xaa0
	[30558.001617][ T3230]  ? rwsem_down_write_slowpath+0x254/0x510
	[30558.002355][ T3230]  schedule+0x68/0xe0
	[30558.002887][ T3230]  rwsem_down_write_slowpath+0x280/0x510
	[30558.003589][ T3230]  down_write+0xbd/0x120
	[30558.004129][ T3230]  lock_two_nondirectories+0x6b/0x70
	[30558.004849][ T3230]  btrfs_remap_file_range+0x54/0x3c0
	[30558.005603][ T3230]  vfs_dedupe_file_range_one.part.0+0xc9/0x140
	[30558.006421][ T3230]  vfs_dedupe_file_range+0x186/0x210
	[30558.007126][ T3230]  do_vfs_ioctl+0x551/0x720
	[30558.007720][ T3230]  ? __fget_files+0x10c/0x1e0
	[30558.008371][ T3230]  __x64_sys_ioctl+0x6f/0xc0
	[30558.008977][ T3230]  do_syscall_64+0x38/0x90
	[30558.009507][ T3230]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
	[30558.010279][ T3230] RIP: 0033:0x7f0bd832ccc7
	[30558.010856][ T3230] RSP: 002b:00007f0bd2a20018 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
	[30558.011897][ T3230] RAX: ffffffffffffffda RBX: 00007f0bd2a20280 RCX: 00007f0bd832ccc7
	[30558.012897][ T3230] RDX: 00007f0b9c001630 RSI: 00000000c0189436 RDI: 000000000000000d
	[30558.013862][ T3230] RBP: 00007f0b9c001630 R08: 0000000000000000 R09: 00007f0bd2a20690
	[30558.014858][ T3230] R10: 00007ffd6836b080 R11: 0000000000000246 R12: 00007f0b9c001630
	[30558.015811][ T3230] R13: 00007f0bd2a20288 R14: 0000000000000000 R15: 00007f0bd2a202a8
	[30558.016794][ T3230] task:crawl_362_271   state:D stack:    0 pid:10882 ppid: 10762 flags:0x00000000
	[30558.017953][ T3230] Call Trace:
	[30558.018350][ T3230]  __schedule+0x351/0xaa0
	[30558.018913][ T3230]  schedule+0x68/0xe0
	[30558.019415][ T3230]  wait_current_trans+0xed/0x150
	[30558.020030][ T3230]  ? add_wait_queue_exclusive+0x80/0x80
	[30558.020717][ T3230]  start_transaction+0x587/0x6f0
	[30558.021345][ T3230]  btrfs_start_transaction+0x1e/0x20
	[30558.022018][ T3230]  btrfs_replace_file_extents+0x12a/0x8d0
	[30558.022698][ T3230]  ? release_extent_buffer+0xae/0xf0
	[30558.023351][ T3230]  ? do_raw_spin_unlock+0x57/0xb0
	[30558.024012][ T3230]  btrfs_clone+0x7ea/0x880
	[30558.024563][ T3230]  btrfs_extent_same_range+0x75/0xa0
	[30558.025231][ T3230]  btrfs_remap_file_range+0x354/0x3c0
	[30558.025935][ T3230]  vfs_dedupe_file_range_one.part.0+0xc9/0x140
	[30558.026670][ T3230]  vfs_dedupe_file_range+0x186/0x210
	[30558.027330][ T3230]  do_vfs_ioctl+0x551/0x720
	[30558.027916][ T3230]  ? __fget_files+0x10c/0x1e0
	[30558.028490][ T3230]  __x64_sys_ioctl+0x6f/0xc0
	[30558.029076][ T3230]  do_syscall_64+0x38/0x90
	[30558.029635][ T3230]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
	[30558.030379][ T3230] RIP: 0033:0x7f0bd832ccc7
	[30558.030960][ T3230] RSP: 002b:00007f0bd221f018 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
	[30558.031994][ T3230] RAX: ffffffffffffffda RBX: 00007f0bd221f280 RCX: 00007f0bd832ccc7
	[30558.032999][ T3230] RDX: 00007f0ba00d8e40 RSI: 00000000c0189436 RDI: 000000000000002b
	[30558.033992][ T3230] RBP: 00007f0ba00d8e40 R08: 0000000000000000 R09: 00007f0bd221f690
	[30558.035085][ T3230] R10: 00007ffd6836b080 R11: 0000000000000246 R12: 00007f0ba00d8e40
	[30558.036082][ T3230] R13: 00007f0bd221f288 R14: 0000000000000000 R15: 00007f0bd221f2a8
	[30560.726926][ T3230] sysrq: Show Locks Held
	[30560.727574][ T3230] 
	[30560.727574][ T3230] Showing all locks held in the system:
	[30560.728542][ T3230] 1 lock held by in:imklog/3104:
	[30560.729188][ T3230]  #0: ffff93d190807580 (&f->f_pos_lock){+.+.}-{3:3}, at: __fdget_pos+0x4e/0x60
	[30560.730302][ T3230] 1 lock held by dmesg/3228:
	[30560.730882][ T3230]  #0: ffff93d185b780e0 (&user->lock){+.+.}-{3:3}, at: devkmsg_read+0x4d/0x330
	[30560.731996][ T3230] 3 locks held by bash/3230:
	[30560.732557][ T3230]  #0: ffff93d1b2d73498 (sb_writers#4){.+.+}-{0:0}, at: ksys_write+0x70/0xf0
	[30560.733866][ T3230]  #1: ffffffffbcf189a0 (rcu_read_lock){....}-{1:2}, at: __handle_sysrq+0x5/0xa0
	[30560.735609][ T3230]  #2: ffffffffbcf189a0 (rcu_read_lock){....}-{1:2}, at: debug_show_all_locks+0x23/0x187
	[30560.737510][ T3230] 1 lock held by btrfs-transacti/10552:
	[30560.738261][ T3230]  #0: ffff93d195e94850 (&fs_info->transaction_kthread_mutex){+.+.}-{3:3}, at: transaction_kthread+0x5a/0x1b0
	[30560.740014][ T3230] 4 locks held by crawl_5_263/10874:
	[30560.740854][ T3230]  #0: ffff93d258223498 (sb_writers#12){.+.+}-{0:0}, at: vfs_dedupe_file_range+0x162/0x210
	[30560.742405][ T3230]  #1: ffff93d241751c88 (&sb->s_type->i_mutex_key#17){+.+.}-{3:3}, at: lock_two_nondirectories+0x6b/0x70
	[30560.744153][ T3230]  #2: ffff93d2f3354988 (&sb->s_type->i_mutex_key#17/4){+.+.}-{3:3}, at: lock_two_nondirectories+0x59/0x70
	[30560.745953][ T3230]  #3: ffff93d2582236b8 (sb_internal#2){.+.+}-{0:0}, at: btrfs_start_transaction+0x1e/0x20
	[30560.747557][ T3230] 2 locks held by crawl_5_264/10878:
	[30560.748420][ T3230]  #0: ffff93d258223498 (sb_writers#12){.+.+}-{0:0}, at: vfs_dedupe_file_range+0x162/0x210
	[30560.750060][ T3230]  #1: ffff93d241751c88 (&sb->s_type->i_mutex_key#17){+.+.}-{3:3}, at: lock_two_nondirectories+0x6b/0x70
	[30560.751892][ T3230] 2 locks held by crawl_362_264/10881:
	[30560.752761][ T3230]  #0: ffff93d258223498 (sb_writers#12){.+.+}-{0:0}, at: vfs_dedupe_file_range+0x162/0x210
	[30560.754424][ T3230]  #1: ffff93d241751c88 (&sb->s_type->i_mutex_key#17){+.+.}-{3:3}, at: lock_two_nondirectories+0x6b/0x70
	[30560.756327][ T3230] 4 locks held by crawl_362_271/10882:
	[30560.757268][ T3230]  #0: ffff93d258223498 (sb_writers#12){.+.+}-{0:0}, at: vfs_dedupe_file_range+0x162/0x210
	[30560.758971][ T3230]  #1: ffff93d1159c5648 (&sb->s_type->i_mutex_key#17){+.+.}-{3:3}, at: lock_two_nondirectories+0x6b/0x70
	[30560.760890][ T3230]  #2: ffff93d1356ef688 (&sb->s_type->i_mutex_key#17/4){+.+.}-{3:3}, at: lock_two_nondirectories+0x59/0x70
	[30560.762839][ T3230]  #3: ffff93d2582236b8 (sb_internal#2){.+.+}-{0:0}, at: btrfs_start_transaction+0x1e/0x20
	[30560.764554][ T3230] 3 locks held by crawl_362_267/10883:

	[30560.765496][ T3230] 
	[30560.765893][ T3230] =============================================
	[30560.765893][ T3230] 

Again we have "3 locks held" but no list of locks.  WTF is 10883 doing?
Well, first of all it's using 100% CPU in the kernel.  Some samples of
kernel stacks:

	# cat /proc/*/task/10883/stack
	[<0>] down_read_nested+0x32/0x140
	[<0>] __btrfs_tree_read_lock+0x2d/0x110
	[<0>] btrfs_tree_read_lock+0x10/0x20
	[<0>] btrfs_search_old_slot+0x627/0x8a0
	[<0>] btrfs_next_old_leaf+0xcb/0x340
	[<0>] find_parent_nodes+0xcd7/0x1c40
	[<0>] btrfs_find_all_leafs+0x63/0xb0
	[<0>] iterate_extent_inodes+0xc8/0x270
	[<0>] iterate_inodes_from_logical+0x9f/0xe0
	[<0>] btrfs_ioctl_logical_to_ino+0x183/0x210
	[<0>] btrfs_ioctl+0xa81/0x2fb0
	[<0>] __x64_sys_ioctl+0x91/0xc0
	[<0>] do_syscall_64+0x38/0x90
	[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
	# cat /proc/*/task/10883/stack
	# cat /proc/*/task/10883/stack
	[<0>] __tree_mod_log_rewind+0x57/0x250
	# cat /proc/*/task/10883/stack
	[<0>] __tree_mod_log_rewind+0x57/0x250
	# cat /proc/*/task/10883/stack
	# cat /proc/*/task/10883/stack
	[<0>] __tree_mod_log_rewind+0x57/0x250
	# cat /proc/*/task/10883/stack
	[<0>] __tree_mod_log_rewind+0x57/0x250
	# cat /proc/*/task/10883/stack
	[<0>] __tree_mod_log_rewind+0x57/0x250
	# cat /proc/*/task/10883/stack
	[<0>] __tree_mod_log_rewind+0x57/0x250
	# cat /proc/*/task/10883/stack
	# cat /proc/*/task/10883/stack
	[<0>] __tree_mod_log_rewind+0x57/0x250
	# cat /proc/*/task/10883/stack
	# cat /proc/*/task/10883/stack
	[<0>] __tree_mod_log_rewind+0x57/0x250
	# cat /proc/*/task/10883/stack
	[<0>] __tree_mod_log_rewind+0x57/0x250
	# cat /proc/*/task/10883/stack
	# cat /proc/*/task/10883/stack
	# cat /proc/*/task/10883/stack
	[<0>] free_extent_buffer.part.0+0x51/0xa0
	# cat /proc/*/task/10883/stack
	[<0>] find_held_lock+0x38/0x90
	[<0>] kmem_cache_alloc+0x22d/0x360
	[<0>] __alloc_extent_buffer+0x2a/0xa0
	[<0>] btrfs_clone_extent_buffer+0x42/0x130
	[<0>] btrfs_search_old_slot+0x660/0x8a0
	[<0>] btrfs_next_old_leaf+0xcb/0x340
	[<0>] find_parent_nodes+0xcd7/0x1c40
	[<0>] btrfs_find_all_leafs+0x63/0xb0
	[<0>] iterate_extent_inodes+0xc8/0x270
	[<0>] iterate_inodes_from_logical+0x9f/0xe0
	[<0>] btrfs_ioctl_logical_to_ino+0x183/0x210
	[<0>] btrfs_ioctl+0xa81/0x2fb0
	[<0>] __x64_sys_ioctl+0x91/0xc0
	[<0>] do_syscall_64+0x38/0x90
	[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9

So it looks like tree mod log is doing some infinite (or very large
finite) looping in the LOGICAL_INO ioctl.  That ioctl holds a transaction
open while it runs, but it's not blocked per se, so it doesn't show up
in SysRq-W output.
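
For reference, the samples above are just repeated cats of the stack
file; a minimal sketch of the same sampling loop, assuming root:

	#!/usr/bin/env python3
	# poll a task's kernel stack to see where it spins; pid in argv[1]
	import sys, time

	pid = sys.argv[1]
	for _ in range(20):
	    with open("/proc/{}/stack".format(pid)) as f:
	        lines = f.read().splitlines()
	    # print only the top frame; empty output usually means the task
	    # is on-CPU at that instant
	    print(lines[0].strip() if lines else "(on CPU)")
	    time.sleep(0.5)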

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: bisected: btrfs dedupe regression in v5.11-rc1: 3078d85c9a10 vfs: verify source area in vfs_dedupe_file_range_one()
  2021-12-16  5:33           ` Zygo Blaxell
@ 2021-12-16 21:29             ` Nikolay Borisov
  2021-12-16 22:07               ` Josef Bacik
  2021-12-17  5:38               ` bisected: btrfs dedupe regression in v5.11-rc1: 3078d85c9a10 vfs: verify source area in vfs_dedupe_file_range_one() Zygo Blaxell
  0 siblings, 2 replies; 22+ messages in thread
From: Nikolay Borisov @ 2021-12-16 21:29 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: linux-btrfs, Josef Bacik



On 16.12.21 г. 7:33, Zygo Blaxell wrote:
> On Wed, Dec 15, 2021 at 12:25:04AM +0200, Nikolay Borisov wrote:
>> Huhz, this means there is an open transaction handle somewhere o_O. I
>> checked back through the stacktraces in your original email but couldn't
>> see where that might be coming from. I.e., all processes are waiting in
>> wait_current_trans, and this happens _before_ the transaction handle is
>> opened, hence num_extwriters can't have been incremented by them.
>>
>> When the fs wedges again and you get num_extwriters, can you provide
>> the output of "echo w > /proc/sysrq-trigger"?
> 
> Here you go...

<snip>

> 
> Again we have "3 locks held" but no list of locks.  WTF is 10883 doing?
> Well, first of all it's using 100% CPU in the kernel.  Some samples of
> kernel stacks:
> 
> 	# cat /proc/*/task/10883/stack
> 	[<0>] down_read_nested+0x32/0x140
> 	[<0>] __btrfs_tree_read_lock+0x2d/0x110
> 	[<0>] btrfs_tree_read_lock+0x10/0x20
> 	[<0>] btrfs_search_old_slot+0x627/0x8a0
> 	[<0>] btrfs_next_old_leaf+0xcb/0x340
> 	[<0>] find_parent_nodes+0xcd7/0x1c40
> 	[<0>] btrfs_find_all_leafs+0x63/0xb0
> 	[<0>] iterate_extent_inodes+0xc8/0x270
> 	[<0>] iterate_inodes_from_logical+0x9f/0xe0

That's the real culprit: in this case we are not searching the commit
root, hence we've attached to the transaction. So we are doing backref
resolution, which either:

a) Hits some pathological case and loops for a very long time; backref
resolution is known to take a lot of time.

b) Hits a bug in backref resolution and loops forever, which again
results in the transaction being kept open.

Now I wonder why you were able to bisect this to the seemingly unrelated
commit in the vfs code.

Josef, any ideas how to proceed further to debug why backref resolution
takes a long time, and whether it's just an infinite loop?

> 	[<0>] btrfs_ioctl_logical_to_ino+0x183/0x210
> 	[<0>] btrfs_ioctl+0xa81/0x2fb0
> 	[<0>] __x64_sys_ioctl+0x91/0xc0
> 	[<0>] do_syscall_64+0x38/0x90
> 	[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
> 	# cat /proc/*/task/10883/stack
> 	# cat /proc/*/task/10883/stack
> 	[<0>] __tree_mod_log_rewind+0x57/0x250
> 	# cat /proc/*/task/10883/stack
> 	[<0>] __tree_mod_log_rewind+0x57/0x250
> 	# cat /proc/*/task/10883/stack
> 	# cat /proc/*/task/10883/stack
> 	[<0>] __tree_mod_log_rewind+0x57/0x250
> 	# cat /proc/*/task/10883/stack
> 	[<0>] __tree_mod_log_rewind+0x57/0x250
> 	# cat /proc/*/task/10883/stack
> 	[<0>] __tree_mod_log_rewind+0x57/0x250
> 	# cat /proc/*/task/10883/stack
> 	[<0>] __tree_mod_log_rewind+0x57/0x250
> 	# cat /proc/*/task/10883/stack
> 	# cat /proc/*/task/10883/stack
> 	[<0>] __tree_mod_log_rewind+0x57/0x250
> 	# cat /proc/*/task/10883/stack
> 	# cat /proc/*/task/10883/stack
> 	[<0>] __tree_mod_log_rewind+0x57/0x250
> 	# cat /proc/*/task/10883/stack
> 	[<0>] __tree_mod_log_rewind+0x57/0x250
> 	# cat /proc/*/task/10883/stack
> 	# cat /proc/*/task/10883/stack
> 	# cat /proc/*/task/10883/stack
> 	[<0>] free_extent_buffer.part.0+0x51/0xa0
> 	# cat /proc/*/task/10883/stack
> 	[<0>] find_held_lock+0x38/0x90
> 	[<0>] kmem_cache_alloc+0x22d/0x360
> 	[<0>] __alloc_extent_buffer+0x2a/0xa0
> 	[<0>] btrfs_clone_extent_buffer+0x42/0x130
> 	[<0>] btrfs_search_old_slot+0x660/0x8a0
> 	[<0>] btrfs_next_old_leaf+0xcb/0x340
> 	[<0>] find_parent_nodes+0xcd7/0x1c40
> 	[<0>] btrfs_find_all_leafs+0x63/0xb0
> 	[<0>] iterate_extent_inodes+0xc8/0x270
> 	[<0>] iterate_inodes_from_logical+0x9f/0xe0
> 	[<0>] btrfs_ioctl_logical_to_ino+0x183/0x210
> 	[<0>] btrfs_ioctl+0xa81/0x2fb0
> 	[<0>] __x64_sys_ioctl+0x91/0xc0
> 	[<0>] do_syscall_64+0x38/0x90
> 	[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
> 
> So it looks like tree mod log is doing some infinite (or very large
> finite) looping in the LOGICAL_INO ioctl.  That ioctl holds a transaction
> open while it runs, but it's not blocked per se, so it doesn't show up
> in SysRq-W output.
> 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: bisected: btrfs dedupe regression in v5.11-rc1: 3078d85c9a10 vfs: verify source area in vfs_dedupe_file_range_one()
  2021-12-16 21:29             ` Nikolay Borisov
@ 2021-12-16 22:07               ` Josef Bacik
  2021-12-17 20:50                 ` Zygo Blaxell
  2021-12-17  5:38               ` bisected: btrfs dedupe regression in v5.11-rc1: 3078d85c9a10 vfs: verify source area in vfs_dedupe_file_range_one() Zygo Blaxell
  1 sibling, 1 reply; 22+ messages in thread
From: Josef Bacik @ 2021-12-16 22:07 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: Zygo Blaxell, linux-btrfs

On Thu, Dec 16, 2021 at 11:29:06PM +0200, Nikolay Borisov wrote:
> 
> 
> On 16.12.21 г. 7:33, Zygo Blaxell wrote:
> > On Wed, Dec 15, 2021 at 12:25:04AM +0200, Nikolay Borisov wrote:
> >> Huhz, this means there is an open transaction handle somewhere o_O. I
> >> checked back through the stacktraces in your original email but couldn't
> >> see where that might be coming from. I.e., all processes are waiting in
> >> wait_current_trans, and this happens _before_ the transaction handle is
> >> opened, hence num_extwriters can't have been incremented by them.
> >>
> >> When the fs wedges again and you get num_extwriters, can you provide
> >> the output of "echo w > /proc/sysrq-trigger"?
> > 
> > Here you go...
> 
> <snip>
> 
> > 
> > Again we have "3 locks held" but no list of locks.  WTF is 10883 doing?
> > Well, first of all it's using 100% CPU in the kernel.  Some samples of
> > kernel stacks:
> > 
> > 	# cat /proc/*/task/10883/stack
> > 	[<0>] down_read_nested+0x32/0x140
> > 	[<0>] __btrfs_tree_read_lock+0x2d/0x110
> > 	[<0>] btrfs_tree_read_lock+0x10/0x20
> > 	[<0>] btrfs_search_old_slot+0x627/0x8a0
> > 	[<0>] btrfs_next_old_leaf+0xcb/0x340
> > 	[<0>] find_parent_nodes+0xcd7/0x1c40
> > 	[<0>] btrfs_find_all_leafs+0x63/0xb0
> > 	[<0>] iterate_extent_inodes+0xc8/0x270
> > 	[<0>] iterate_inodes_from_logical+0x9f/0xe0
> 
> That's the real culprit: in this case we are not searching the commit
> root, hence we've attached to the transaction. So we are doing backref
> resolution, which either:
>
> a) Hits some pathological case and loops for a very long time; backref
> resolution is known to take a lot of time.
>
> b) Hits a bug in backref resolution and loops forever, which again
> results in the transaction being kept open.
>
> Now I wonder why you were able to bisect this to the seemingly unrelated
> commit in the vfs code.
>
> Josef, any ideas how to proceed further to debug why backref resolution
> takes a long time, and whether it's just an infinite loop?
> 

It's probably an infinite loop; I'd just start with something like this:

bpftrace -e 'tracepoint:btrfs:btrfs_prelim_ref_insert { printf("bytenr is %llu", args->bytenr); }'

and see if it's spitting out the same shit over and over again.  If it is, can I
get a btrfs inspect-internal dump-tree -e on the device along with the bytenr
it's hung up on so I can figure out wtf it's tripping over?
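
If bpftrace isn't available, something like this should be roughly
equivalent (sketch; assumes the tracepoint is exposed under tracefs at
the usual debugfs path, and needs root):

	#!/usr/bin/env python3
	# enable the btrfs_prelim_ref_insert tracepoint, then tail the pipe
	base = "/sys/kernel/debug/tracing"
	with open(base + "/events/btrfs/btrfs_prelim_ref_insert/enable", "w") as f:
	    f.write("1")
	with open(base + "/trace_pipe") as pipe:
	    for line in pipe:
	        print(line, end="")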

If it's not looping there, it may be looping higher up, but I don't see where it
would be doing that.  Let's start here and work our way up if we need to.
Thanks,

Josef

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: bisected: btrfs dedupe regression in v5.11-rc1: 3078d85c9a10 vfs: verify source area in vfs_dedupe_file_range_one()
  2021-12-16 21:29             ` Nikolay Borisov
  2021-12-16 22:07               ` Josef Bacik
@ 2021-12-17  5:38               ` Zygo Blaxell
  1 sibling, 0 replies; 22+ messages in thread
From: Zygo Blaxell @ 2021-12-17  5:38 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: linux-btrfs, Josef Bacik

On Thu, Dec 16, 2021 at 11:29:06PM +0200, Nikolay Borisov wrote:
> 
> 
> On 16.12.21 г. 7:33, Zygo Blaxell wrote:
> > On Wed, Dec 15, 2021 at 12:25:04AM +0200, Nikolay Borisov wrote:
> >> Huhz, this means there is an open transaction handle somewhere o_O. I
> >> checked back through the stacktraces in your original email but couldn't
> >> see where that might be coming from. I.e., all processes are waiting in
> >> wait_current_trans, and this happens _before_ the transaction handle is
> >> opened, hence num_extwriters can't have been incremented by them.
> >>
> >> When the fs wedges again and you get num_extwriters, can you provide
> >> the output of "echo w > /proc/sysrq-trigger"?
> > 
> > Here you go...
> 
> <snip>
> 
> > 
> > Again we have "3 locks held" but no list of locks.  WTF is 10883 doing?
> > Well, first of all it's using 100% CPU in the kernel.  Some samples of
> > kernel stacks:
> > 
> > 	# cat /proc/*/task/10883/stack
> > 	[<0>] down_read_nested+0x32/0x140
> > 	[<0>] __btrfs_tree_read_lock+0x2d/0x110
> > 	[<0>] btrfs_tree_read_lock+0x10/0x20
> > 	[<0>] btrfs_search_old_slot+0x627/0x8a0
> > 	[<0>] btrfs_next_old_leaf+0xcb/0x340
> > 	[<0>] find_parent_nodes+0xcd7/0x1c40
> > 	[<0>] btrfs_find_all_leafs+0x63/0xb0
> > 	[<0>] iterate_extent_inodes+0xc8/0x270
> > 	[<0>] iterate_inodes_from_logical+0x9f/0xe0
> 
> That's the real culprit: in this case we are not searching the commit
> root, hence we've attached to the transaction. So we are doing backref
> resolution, which either:
>
> a) Hits some pathological case and loops for a very long time; backref
> resolution is known to take a lot of time.

Backref resolution is known to take a long time, which is why bees
measures that time and avoids extents where the kernel CPU time in
LOGICAL_INO starts climbing up the exponential curve.  In older kernels,
performance got worse as more extent refs were added, but it did so
slowly enough that bees could detect and evade the bug before it created
multi-second transaction latencies.  Since 5.7 this isn't really a
problem any more, but bees still has this workaround in its code.

This bug is not that bug:  CPU usage in LOGICAL_INO goes from less than
100 ms to >40 hours (the longest I've let it run before forcing reboot)
in a single step.  On this test setup the 100 ms threshold is hit about
once every 18 testing hours, or once per 2.3M LOGICAL_INO calls, and
it's only 150 ms or so.  Usually these happen when the kernel thread
gets stuck with the CPU bill for flushing delayed refs or something
equally harmless, and they happen hours before the lockup (if at all).

In the "take a lot of time" case, we get a lot of warning as the time
ramps up.  In this case, we go from zero to two days with nothing
in between.
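
A minimal sketch of that measure-and-evade logic, with hypothetical
names (not the actual bees code, which is C++):

	import resource

	TOXIC_THRESHOLD = 0.1   # ~100 ms, per the numbers above

	def kernel_cpu():
	    # system CPU time consumed by this thread so far
	    return resource.getrusage(resource.RUSAGE_THREAD).ru_stime

	def resolve_refs(logical_ino, bytenr):
	    t0 = kernel_cpu()
	    refs = logical_ino(bytenr)   # hypothetical LOGICAL_INO wrapper
	    if kernel_cpu() - t0 > TOXIC_THRESHOLD:
	        return None              # treat the extent as toxic, move on
	    return refs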

> b) Hits a bug in backref resolution and loops forever, which again
> results in the transaction being kept open.
> 
> Now I wonder why you were able to bisect this to the seemingly unrelated
> commit in the vfs code.

Meh, bisection isn't particularly reliable, and this bug might be very
sensitive to its environment, particularly if it's a race condition
between two or more threads running a tight loop over the same objects
(like tree mod log).  I've been running a few more bisection runs and
they're landing all over the place, including things like TI SoC commits.
So I'm now testing btrfs commits chosen myself instead of following
git's bisect algorithm.

There's a definite probability shift somewhere in the middle of v5.11-rc1.
We go from hourly failure rates below 0.002 at v5.10 to above 0.3 by
v5.11-rc1.  But that could as easily be a 10-year-old bug surfacing due
to apparently unrelated changes as it is something that was added in
v5.11-rc1 itself.

> Josef, any ideas how to proceed further to debug why backref resolution
> takes a long time, and whether it's just an infinite loop?
> 
> > 	[<0>] btrfs_ioctl_logical_to_ino+0x183/0x210
> > 	[<0>] btrfs_ioctl+0xa81/0x2fb0
> > 	[<0>] __x64_sys_ioctl+0x91/0xc0
> > 	[<0>] do_syscall_64+0x38/0x90
> > 	[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > 	# cat /proc/*/task/10883/stack
> > 	# cat /proc/*/task/10883/stack
> > 	[<0>] __tree_mod_log_rewind+0x57/0x250
> > 	# cat /proc/*/task/10883/stack
> > 	[<0>] __tree_mod_log_rewind+0x57/0x250
> > 	# cat /proc/*/task/10883/stack
> > 	# cat /proc/*/task/10883/stack
> > 	[<0>] __tree_mod_log_rewind+0x57/0x250
> > 	# cat /proc/*/task/10883/stack
> > 	[<0>] __tree_mod_log_rewind+0x57/0x250
> > 	# cat /proc/*/task/10883/stack
> > 	[<0>] __tree_mod_log_rewind+0x57/0x250
> > 	# cat /proc/*/task/10883/stack
> > 	[<0>] __tree_mod_log_rewind+0x57/0x250
> > 	# cat /proc/*/task/10883/stack
> > 	# cat /proc/*/task/10883/stack
> > 	[<0>] __tree_mod_log_rewind+0x57/0x250
> > 	# cat /proc/*/task/10883/stack
> > 	# cat /proc/*/task/10883/stack
> > 	[<0>] __tree_mod_log_rewind+0x57/0x250
> > 	# cat /proc/*/task/10883/stack
> > 	[<0>] __tree_mod_log_rewind+0x57/0x250
> > 	# cat /proc/*/task/10883/stack
> > 	# cat /proc/*/task/10883/stack
> > 	# cat /proc/*/task/10883/stack
> > 	[<0>] free_extent_buffer.part.0+0x51/0xa0
> > 	# cat /proc/*/task/10883/stack
> > 	[<0>] find_held_lock+0x38/0x90
> > 	[<0>] kmem_cache_alloc+0x22d/0x360
> > 	[<0>] __alloc_extent_buffer+0x2a/0xa0
> > 	[<0>] btrfs_clone_extent_buffer+0x42/0x130
> > 	[<0>] btrfs_search_old_slot+0x660/0x8a0
> > 	[<0>] btrfs_next_old_leaf+0xcb/0x340
> > 	[<0>] find_parent_nodes+0xcd7/0x1c40
> > 	[<0>] btrfs_find_all_leafs+0x63/0xb0
> > 	[<0>] iterate_extent_inodes+0xc8/0x270
> > 	[<0>] iterate_inodes_from_logical+0x9f/0xe0
> > 	[<0>] btrfs_ioctl_logical_to_ino+0x183/0x210
> > 	[<0>] btrfs_ioctl+0xa81/0x2fb0
> > 	[<0>] __x64_sys_ioctl+0x91/0xc0
> > 	[<0>] do_syscall_64+0x38/0x90
> > 	[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > 
> > So it looks like tree mod log is doing some infinite (or very large
> > finite) looping in the LOGICAL_INO ioctl.  That ioctl holds a transaction
> > open while it runs, but it's not blocked per se, so it doesn't show up
> > in SysRq-W output.
> > 
> 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: bisected: btrfs dedupe regression in v5.11-rc1: 3078d85c9a10 vfs: verify source area in vfs_dedupe_file_range_one()
  2021-12-16 22:07               ` Josef Bacik
@ 2021-12-17 20:50                 ` Zygo Blaxell
  2022-01-07 18:31                   ` bisected: btrfs dedupe regression in v5.11-rc1 Zygo Blaxell
  0 siblings, 1 reply; 22+ messages in thread
From: Zygo Blaxell @ 2021-12-17 20:50 UTC (permalink / raw)
  To: Josef Bacik; +Cc: Nikolay Borisov, linux-btrfs

On Thu, Dec 16, 2021 at 05:07:50PM -0500, Josef Bacik wrote:
> On Thu, Dec 16, 2021 at 11:29:06PM +0200, Nikolay Borisov wrote:
> > 
> > 
> > On 16.12.21 г. 7:33, Zygo Blaxell wrote:
> > > On Wed, Dec 15, 2021 at 12:25:04AM +0200, Nikolay Borisov wrote:
> > >> Huhz, this means there is an open transaction handle somewhere o_O. I
> > >> checked back through the stacktraces in your original email but couldn't
> > >> see where that might be coming from. I.e., all processes are waiting in
> > >> wait_current_trans, and this happens _before_ the transaction handle is
> > >> opened, hence num_extwriters can't have been incremented by them.
> > >>
> > >> When the fs wedges again and you get num_extwriters, can you provide
> > >> the output of "echo w > /proc/sysrq-trigger"?
> > > 
> > > Here you go...
> > 
> > <snip>
> > 
> > > 
> > > Again we have "3 locks held" but no list of locks.  WTF is 10883 doing?
> > > Well, first of all it's using 100% CPU in the kernel.  Some samples of
> > > kernel stacks:
> > > 
> > > 	# cat /proc/*/task/10883/stack
> > > 	[<0>] down_read_nested+0x32/0x140
> > > 	[<0>] __btrfs_tree_read_lock+0x2d/0x110
> > > 	[<0>] btrfs_tree_read_lock+0x10/0x20
> > > 	[<0>] btrfs_search_old_slot+0x627/0x8a0
> > > 	[<0>] btrfs_next_old_leaf+0xcb/0x340
> > > 	[<0>] find_parent_nodes+0xcd7/0x1c40
> > > 	[<0>] btrfs_find_all_leafs+0x63/0xb0
> > > 	[<0>] iterate_extent_inodes+0xc8/0x270
> > > 	[<0>] iterate_inodes_from_logical+0x9f/0xe0
> > 
> > That's the real culprit: in this case we are not searching the commit
> > root, hence we've attached to the transaction. So we are doing backref
> > resolution, which either:
> >
> > a) Hits some pathological case and loops for a very long time; backref
> > resolution is known to take a lot of time.
> >
> > b) Hits a bug in backref resolution and loops forever, which again
> > results in the transaction being kept open.
> >
> > Now I wonder why you were able to bisect this to the seemingly unrelated
> > commit in the vfs code.
> >
> > Josef, any ideas how to proceed further to debug why backref resolution
> > takes a long time, and whether it's just an infinite loop?
> > 
> 
> It's probably an infinite loop; I'd just start with something like this:
> 
> bpftrace -e 'tracepoint:btrfs:btrfs_prelim_ref_insert { printf("bytenr is %llu", args->bytenr); }'
> 
> and see if it's spitting out the same shit over and over again.  If it is, can I
> get a btrfs inspect-internal dump-tree -e on the device along with the bytenr
> it's hung up on so I can figure out wtf it's tripping over?

That bpftrace command doesn't output anything after the hang.  Before the
hang, it's thousands and thousands of values changing as bees moves around
the filesystem.

> If it's not looping there, it may be looping higher up, but I don't see where it
> would be doing that.  Let's start here and work our way up if we need to.
> Thanks,
> 
> Josef

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: bisected: btrfs dedupe regression in v5.11-rc1
  2021-12-17 20:50                 ` Zygo Blaxell
@ 2022-01-07 18:31                   ` Zygo Blaxell
  2022-01-20 14:04                     ` Thorsten Leemhuis
  0 siblings, 1 reply; 22+ messages in thread
From: Zygo Blaxell @ 2022-01-07 18:31 UTC (permalink / raw)
  To: Josef Bacik; +Cc: Nikolay Borisov, linux-btrfs

On Fri, Dec 17, 2021 at 03:50:44PM -0500, Zygo Blaxell wrote:
> On Thu, Dec 16, 2021 at 05:07:50PM -0500, Josef Bacik wrote:
> > On Thu, Dec 16, 2021 at 11:29:06PM +0200, Nikolay Borisov wrote:
> > > On 16.12.21 г. 7:33, Zygo Blaxell wrote:
> > > > Again we have "3 locks held" but no list of locks.  WTF is 10883 doing?
> > > > Well, first of all it's using 100% CPU in the kernel.  Some samples of
> > > > kernel stacks:
> > > > 
> > > > 	# cat /proc/*/task/10883/stack
> > > > 	[<0>] down_read_nested+0x32/0x140
> > > > 	[<0>] __btrfs_tree_read_lock+0x2d/0x110
> > > > 	[<0>] btrfs_tree_read_lock+0x10/0x20
> > > > 	[<0>] btrfs_search_old_slot+0x627/0x8a0
> > > > 	[<0>] btrfs_next_old_leaf+0xcb/0x340
> > > > 	[<0>] find_parent_nodes+0xcd7/0x1c40
> > > > 	[<0>] btrfs_find_all_leafs+0x63/0xb0
> > > > 	[<0>] iterate_extent_inodes+0xc8/0x270
> > > > 	[<0>] iterate_inodes_from_logical+0x9f/0xe0
> > > 
> > > That's the real culprit: in this case we are not searching the commit
> > > root, hence we've attached to the transaction. So we are doing backref
> > > resolution, which either:
> > >
> > > a) Hits some pathological case and loops for a very long time; backref
> > > resolution is known to take a lot of time.
> > >
> > > b) Hits a bug in backref resolution and loops forever, which again
> > > results in the transaction being kept open.
> > >
> > > Now I wonder why you were able to bisect this to the seemingly unrelated
> > > commit in the vfs code.
> > >
> > > Josef, any ideas how to proceed further to debug why backref resolution
> > > takes a long time, and whether it's just an infinite loop?
> > > 
> > 
> > It's probably an infinite loop; I'd just start with something like this:
> > 
> > bpftrace -e 'tracepoint:btrfs:btrfs_prelim_ref_insert { printf("bytenr is %llu", args->bytenr); }'
> > 
> > and see if it's spitting out the same shit over and over again.  If it is, can I
> > get a btrfs inspect-internal dump-tree -e on the device along with the bytenr
> > it's hung up on so I can figure out wtf it's tripping over?
> 
> That bpftrace command doesn't output anything after the hang.  Before the
> hang, it's thousands and thousands of values changing as bees moves around
> the filesystem.

I left my VM running tests for a few weeks and got some more information.
Or at least more data; I'm not feeling particularly informed by it.  :-P

1.  It's not a regression.  5.10 has the same symptoms, but about 100x
less often (once a week under these test conditions, compared to once
every 90 minutes or so on 5.11-rc1).
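
(In rate terms, a quick back-of-envelope from those two numbers:)

	# once a week vs. once every 90 minutes, as hourly hang rates
	for label, hours_between_hangs in (("5.10", 7 * 24.0), ("5.11-rc1", 1.5)):
	    print(label, "~", round(1.0 / hours_between_hangs, 3), "hangs/hour")
	# 5.10 ~ 0.006, 5.11-rc1 ~ 0.667: a ratio of roughly 100x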

2.  Bisection doesn't work, because there are patches that are repeatably
good and bad mixed together, so the bisect algorithm (even with stochastic
enhancement) repeatably picks the wrong commits and converges with
high confidence on nonsense.  Instead of bisecting, I picked commits
semi-randomly from 5.11-rc1's patch set, and got these results:

   124  3a160a933111 btrfs: drop never met disk total bytes check in verify_one_dev_extent
	1x hang, 2x slower
   125  bacce86ae8a7 btrfs: drop unused argument step from btrfs_free_extra_devids
	1x pass (fast)
   126  2766ff61762c btrfs: update the number of bytes used by an inode atomically
	1x hang (<20 minutes)
   127  7f458a3873ae btrfs: fix race when defragmenting leads to unnecessary IO
	1x hang, runs 3x slower
   128  5893dfb98f25 btrfs: refactor btrfs_drop_extents() to make it easier to extend
	2x hang (<20 minutes)
   129  e114c545bb69 btrfs: set the lockdep class for extent buffers on creation
	2x pass (but runs 2x slower, both times)
   130  3fbaf25817f7 btrfs: pass the owner_root and level to alloc_extent_buffer
	1x pass
   131  5d81230baa90 btrfs: pass the root owner and level around for readahead
	1x pass
   132  1b7ec85ef490 btrfs: pass root owner to read_tree_block
	1x pass
   133  182c79fcb857 btrfs: use btrfs_read_node_slot in btrfs_qgroup_trace_subtree
   134  3acfbd6a990c btrfs: use btrfs_read_node_slot in qgroup_trace_new_subtree_blocks
	1x hang
   135  6b2cb7cb959a btrfs: use btrfs_read_node_slot in qgroup_trace_extent_swap
   136  c990ada2a0bb btrfs: use btrfs_read_node_slot in walk_down_tree
	1x hang
   137  6b3426be27de btrfs: use btrfs_read_node_slot in replace_path
	1x hang, 1x pass
   138  c975253682e0 btrfs: use btrfs_read_node_slot in do_relocation
	1x hang
   139  8ef385bbf099 btrfs: use btrfs_read_node_slot in walk_down_reloc_tree
	1x hang, 1x pass
   140  206983b72a36 btrfs: use btrfs_read_node_slot in btrfs_realloc_node
	1x pass
   141  bfb484d922a3 btrfs: cleanup extent buffer readahead
	1x pass
   142  416e3445ef80 btrfs: remove lockdep classes for the fs tree
   143  3e48d8d2540d btrfs: discard: reschedule work after sysfs param update
   144  df903e5d294f btrfs: don't miss async discards after scheduled work override
   145  6e88f116bd4c btrfs: discard: store async discard delay as ns not as jiffies
	2x hang
   146  e50404a8a699 btrfs: discard: speed up async discard up to iops_limit

   [snip]

   155  0d01e247a06b btrfs: assert page mapping lock in attach_extent_buffer_page
	1x hang, 1x pass
   156  bbb86a371791 btrfs: protect fs_info->caching_block_groups by block_group_cache_lock
	1x hang
   157  e747853cae3a btrfs: load free space cache asynchronously
	1x pass
   158  4d7240f0abda btrfs: load the free space cache inode extents from commit root
	1x hang
   159  cd79909bc7cd btrfs: load free space cache into a temporary ctl
	2x pass
   160  66b53bae46c8 btrfs: cleanup btrfs_discard_update_discardable usage
	2x hang, 1x pass
   161  2ca08c56e813 btrfs: explicitly protect ->last_byte_to_unpin in unpin_extent_range
	2x pass
   162  27d56e62e474 btrfs: update last_byte_to_unpin in switch_commit_roots
	2x pass
   163  9076dbd5ee83 btrfs: do not shorten unpin len for caching block groups
   164  dc5161648693 btrfs: reorder extent buffer members for better packing
	2x pass
   165  b9729ce014f6 btrfs: locking: rip out path->leave_spinning
   166  ac5887c8e013 btrfs: locking: remove all the blocking helpers
   167  2ae0c2d80d25 btrfs: scrub: remove local copy of csum_size from context
   168  419b791ce760 btrfs: check integrity: remove local copy of csum_size
	1x hang, 1x pass
   169  713cebfb9891 btrfs: remove unnecessary local variables for checksum size
   170  223486c27b36 btrfs: switch cached fs_info::csum_size from u16 to u32
   171  55fc29bed8dd btrfs: use cached value of fs_info::csum_size everywhere
   172  fe5ecbe818de btrfs: precalculate checksums per leaf once
   173  22b6331d9617 btrfs: store precalculated csum_size in fs_info
   174  265fdfa6ce0a btrfs: replace s_blocksize_bits with fs_info::sectorsize_bits
   175  098e63082b9b btrfs: replace div_u64 by shift in free_space_bitmap_size
	2x pass
   176  ab108d992b12 btrfs: use precalculated sectorsize_bits from fs_info

   [snip]

   200  5e8b9ef30392 btrfs: move pos increment and pagecache extension to btrfs_buffered_write
	1x pass
   201  4e4cabece9f9 btrfs: split btrfs_direct_IO to read and write

   [snip]

   215  d70bf7484f72 btrfs: unify the ro checking for mount options
	1x pass
   216  a6889caf6ec6 btrfs: do not start readahead for csum tree when scrubbing non-data block groups
   217  a57ad681f12e btrfs: assert we are holding the reada_lock when releasing a readahead zone
   218  aa8c1a41a1e6 btrfs: set EXTENT_NORESERVE bits side btrfs_dirty_pages()
   219  13f0dd8f7861 btrfs: use round_down while calculating start position in btrfs_dirty_pages()
   220  949b32732eab btrfs: use iosize while reading compressed pages
   221  eefa45f59379 btrfs: calculate num_pages, reserve_bytes once in btrfs_buffered_write
   222  fb8a7e941b1b btrfs: calculate more accurate remaining time to sleep in transaction_kthread
	1x pass

There is some repeatability in these results--some commits have a much
lower failure rate than others--but I don't see a reason why the bad
commits are bad or the good commits are good.  There are some commits with
locking and concurrency implications, but they're as likely to produce
good as bad results in test.  Sometimes there's a consistent change in
test result after a commit that only rearranges function arguments on
the stack.
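
A quick illustration of how noisy one or two runs per commit are
(sketch; takes a ~0.3 hang probability per run for the sake of
argument, in line with the hourly rates above):

	import random

	trials, clean = 100000, 0
	for _ in range(trials):
	    # a commit that hangs 30% of the time, tested twice
	    if not any(random.random() < 0.3 for _ in range(2)):
	        clean += 1
	print("chance a 0.3-rate commit passes 2 runs:", clean / trials)
	# ~0.49, i.e. a genuinely bad commit looks good about half the time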

Maybe what we're looking at is a subtle race that is popping up due
to unrelated changes in the kernel, and disappearing just as often,
and 5.11-rc1 in particular did something innocent that aggravates
it somehow, so all later kernels hit the problem more often than
5.10 did.

3.  Somewhere around "7f458a3873ae btrfs: fix race when defragmenting
leads to unnecessary IO" bees starts running about 3x slower than on
earlier kernels.  bees is a nightmare of nondeterministically racing
worker threads, so I'm not sure how important this observation is,
but it keeps showing up in the data.

4.  I had one machine on 5.10.84 (not a test VM) with a shell process
that got stuck spinning 100% CPU in the kernel on sys_write.  bees was
also running, but its threads were all stuck waiting for the shell to
release the transaction.  Other crashes on 5.10.8x kernels look more
like the one in this thread, with a logical_ino spinning.

> > If it's not looping there, it may be looping higher up, but I don't see where it
> > would be doing that.  Let's start here and work our way up if we need to.
> > Thanks,
> > 
> > Josef

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: bisected: btrfs dedupe regression in v5.11-rc1
  2022-01-07 18:31                   ` bisected: btrfs dedupe regression in v5.11-rc1 Zygo Blaxell
@ 2022-01-20 14:04                     ` Thorsten Leemhuis
  2022-01-21  0:27                       ` Zygo Blaxell
  0 siblings, 1 reply; 22+ messages in thread
From: Thorsten Leemhuis @ 2022-01-20 14:04 UTC (permalink / raw)
  To: Zygo Blaxell, Josef Bacik; +Cc: Nikolay Borisov, linux-btrfs

Hi, this is your Linux kernel regression tracker speaking.

On 07.01.22 19:31, Zygo Blaxell wrote:
> On Fri, Dec 17, 2021 at 03:50:44PM -0500, Zygo Blaxell wrote:
>> On Thu, Dec 16, 2021 at 05:07:50PM -0500, Josef Bacik wrote:
>>> On Thu, Dec 16, 2021 at 11:29:06PM +0200, Nikolay Borisov wrote:
>>>> On 16.12.21 г. 7:33, Zygo Blaxell wrote:
>>>>> Again we have "3 locks held" but no list of locks.  WTF is 10883 doing?
>>>>> Well, first of all it's using 100% CPU in the kernel.  Some samples of
>>>>> kernel stacks:
>>>>>
>>>>> 	# cat /proc/*/task/10883/stack
>>>>> 	[<0>] down_read_nested+0x32/0x140
>>>>> 	[<0>] __btrfs_tree_read_lock+0x2d/0x110
>>>>> 	[<0>] btrfs_tree_read_lock+0x10/0x20
>>>>> 	[<0>] btrfs_search_old_slot+0x627/0x8a0
>>>>> 	[<0>] btrfs_next_old_leaf+0xcb/0x340
>>>>> 	[<0>] find_parent_nodes+0xcd7/0x1c40
>>>>> 	[<0>] btrfs_find_all_leafs+0x63/0xb0
>>>>> 	[<0>] iterate_extent_inodes+0xc8/0x270
>>>>> 	[<0>] iterate_inodes_from_logical+0x9f/0xe0
>>>>
>>>> That's the real culprit, in this case we are not searching the commit
>>>> root hence we've attached to the transaction. So we are doing backref
>>>> resolution which either:
>>>>
>>>> a) Hits some pathological case and loops for very long time, backref
>>>> resolution is known to take a lot of time.
>>>>
>>>> b) We hit a bug in backref resolution and loop forever which again
>>>> results in the transaction being kept open.
>>>>
>>>> Now I wonder why you were able to bisect this to the seemingly unrelated
>>>> commit in the vfs code.
>>>>
>>>> Josef any ideas how to proceed further to debug why backref resolution
>>>> takes a long time and if it's just an infinite loop?
>>>>
>>>
>>> It's probably an infinite loop, I'd just start with something like this
>>>
>>> bpftrace -e 'tracepoint:btrfs:btrfs_prelim_ref_insert { printf("bytenr is %llu", args->bytenr); }'
>>>
>>> and see if it's spitting out the same shit over and over again.  If it is can I
>>> get a btrfs inspect-internal dump-tree -e on the device along with the bytenr
>>> it's hung up on so I can figure out wtf it's tripping over?
>>
>> That bpftrace command doesn't output anything after the hang.  Before the
>> hang, it's thousands and thousands of values changing as bees moves around
>> the filesystem.
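>>
>> An aggregating variant of that one-liner (a sketch; it assumes the
>> same tracepoint and the same args->bytenr field) would make a bytenr
>> that repeats in that flood easier to spot than the raw printf stream:
>>
>> 	bpftrace -e 'tracepoint:btrfs:btrfs_prelim_ref_insert {
>> 		@seen[args->bytenr] = count();
>> 	} interval:s:10 { print(@seen, 5); clear(@seen); }'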
> 
> I left my VM running tests for a few weeks and got some more information.
> Or at least more data, I'm not feeling particularly informed by it.  :-P
> 
> 1.  It's not a regression.  5.10 has the same symptoms, but about 100x
> less often (once a week under these test conditions, compared to once
> every 90 minutes or so on 5.11-rc1).

Well, I'd still call it a regression, as it's now happening way more
often and thus will likely hit more users. It's a bit like a
problem that leads to higher energy consumption: things still work, but
worse than before -- nevertheless it's considered a regression. Anyway:

What's the status here? Are you still investigating the issue? Are any
developers looking into the root cause?

Ciao, Thorsten

P.S.: As a Linux kernel regression tracker I'm getting a lot of reports
on my table. I can only look briefly into most of them, so unfortunately
I will sometimes get things wrong or miss something important.
I hope that's not the case here; if you think it is, don't hesitate to
tell me about it in a public reply; that's in everyone's interest.

BTW, I have no personal interest in this issue, which is tracked using
regzbot, my Linux kernel regression tracking bot
(https://linux-regtracking.leemhuis.info/regzbot/). I'm only posting
this mail to get things rolling again and hence don't need to be CCed on
all further activities wrt this regression.

#regzbot poke

> [snip]


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: bisected: btrfs dedupe regression in v5.11-rc1
  2022-01-20 14:04                     ` Thorsten Leemhuis
@ 2022-01-21  0:27                       ` Zygo Blaxell
  2022-02-09 12:22                         ` Libor Klepáč
  2022-02-18 14:46                         ` Thorsten Leemhuis
  0 siblings, 2 replies; 22+ messages in thread
From: Zygo Blaxell @ 2022-01-21  0:27 UTC (permalink / raw)
  To: Thorsten Leemhuis; +Cc: Josef Bacik, Nikolay Borisov, linux-btrfs

On Thu, Jan 20, 2022 at 03:04:19PM +0100, Thorsten Leemhuis wrote:
> Hi, this is your Linux kernel regression tracker speaking.
> 
> On 07.01.22 19:31, Zygo Blaxell wrote:
> > On Fri, Dec 17, 2021 at 03:50:44PM -0500, Zygo Blaxell wrote:
> > I left my VM running tests for a few weeks and got some more information.
> > Or at least more data, I'm not feeling particularly informed by it.  :-P
> > 
> > 1.  It's not a regression.  5.10 has the same symptoms, but about 100x
> > less often (once a week under these test conditions, compared to once
> > every 90 minutes or so on 5.11-rc1).
> 
> Well, I'd still call it a regression, as it's now happening way more
> often and thus will likely hit more users. It's a bit like a
> problem that leads to higher energy consumption: things still work, but
> worse than before -- nevertheless it's considered a regression. Anyway:
> 
> What's the status here? Are you still investigating the issue? Are any
> developers looking into the root cause?

I think Josef's plan (start inside the logical_ino ioctl with bpftrace
and work upwards to see where the looping is getting stuck) is a good one,
but due to conflicting priorities I haven't found the time to act on it.

I can take experimental patches and throw them at my repro setup if
anyone would like to supply some.
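
If anyone wants a concrete starting point, a latency probe like the
sketch below would show both how long LOGICAL_INO calls take and which
threads entered the ioctl but never came back.  It assumes the handler
is still named btrfs_ioctl_logical_to_ino and that kprobes can attach
to it (i.e. it hasn't been inlined):

	bpftrace -e '
	kprobe:btrfs_ioctl_logical_to_ino { @start[tid] = nsecs; }
	kretprobe:btrfs_ioctl_logical_to_ino /@start[tid]/ {
		@usecs = hist((nsecs - @start[tid]) / 1000);
		delete(@start[tid]);
	}
	interval:s:60 { print(@usecs); print(@start); }'

A tid that stays in @start across intervals is a call that went in and
never returned, which is where the stack sampling would start.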

> [snip]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: bisected: btrfs dedupe regression in v5.11-rc1
  2022-01-21  0:27                       ` Zygo Blaxell
@ 2022-02-09 12:22                         ` Libor Klepáč
  2022-02-18 14:46                         ` Thorsten Leemhuis
  1 sibling, 0 replies; 22+ messages in thread
From: Libor Klepáč @ 2022-02-09 12:22 UTC (permalink / raw)
  To: linux-btrfs

Hi,
is there any progress on this?
Can we test something (in a non-destructive way)? :)

Thanks for the info,
Libor



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: bisected: btrfs dedupe regression in v5.11-rc1
  2022-01-21  0:27                       ` Zygo Blaxell
  2022-02-09 12:22                         ` Libor Klepáč
@ 2022-02-18 14:46                         ` Thorsten Leemhuis
  2022-03-06 10:31                           ` Thorsten Leemhuis
  1 sibling, 1 reply; 22+ messages in thread
From: Thorsten Leemhuis @ 2022-02-18 14:46 UTC (permalink / raw)
  To: Zygo Blaxell, Thorsten Leemhuis
  Cc: Josef Bacik, Nikolay Borisov, linux-btrfs, regressions

Hi, this is your Linux kernel regression tracker speaking. Top-posting
for once, to make this easily accessible to everyone.

What's up here? Can anyone (Zygo? Josef?) please provide a status
update? Yes, it took quite a while till this regression got found, but
nevertheless it looks to be taking awfully long to get resolved for a
regression that was bisected weeks ago.

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)

P.S.: As the Linux kernel's regression tracker I'm getting a lot of
reports on my table. I can only look briefly into most of them and lack
knowledge about most of the areas they concern. I thus unfortunately
will sometimes get things wrong or miss something important. I hope
that's not the case here; if you think it is, don't hesitate to tell me
in a public reply, it's in everyone's interest to set the public record
straight.


On 21.01.22 01:27, Zygo Blaxell wrote:
> On Thu, Jan 20, 2022 at 03:04:19PM +0100, Thorsten Leemhuis wrote:
>> Hi, this is your Linux kernel regression tracker speaking.
>>
>> On 07.01.22 19:31, Zygo Blaxell wrote:
>>> On Fri, Dec 17, 2021 at 03:50:44PM -0500, Zygo Blaxell wrote:
>>> I left my VM running tests for a few weeks and got some more information.
>>> Or at least more data, I'm not feeling particularly informed by it.  :-P
>>>
>>> 1.  It's not a regression.  5.10 has the same symptoms, but about 100x
>>> less often (once a week under these test conditions, compared to once
>>> every 90 minutes or so on 5.11-rc1).
>>
>> Well, I'd still call it a regression, as it's now happening way more
>> often and thus will likely hit more users. It's a bit like a
>> problem that leads to higher energy consumption: things still work, but
>> worse than before -- nevertheless it's considered a regression. Anyway:
>>
>> What's the status here? Are you still investigating the issue? Are any
>> developers looking into the root cause?
> 
> I think Josef's plan (start inside the logical_ino ioctl with bpftrace
> and work upwards to see where the looping is getting stuck) is a good plan,
> but due to conflicting priorities I haven't found the time to act on it.
> 
> I can take experimental patches and throw them at my repro setup if
> anyone would like to supply some.
> 
>> [snip]


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: bisected: btrfs dedupe regression in v5.11-rc1
  2022-02-18 14:46                         ` Thorsten Leemhuis
@ 2022-03-06 10:31                           ` Thorsten Leemhuis
  2022-03-06 23:34                             ` Zygo Blaxell
  0 siblings, 1 reply; 22+ messages in thread
From: Thorsten Leemhuis @ 2022-03-06 10:31 UTC (permalink / raw)
  To: Chris Mason, Josef Bacik, David Sterba
  Cc: Nikolay Borisov, linux-btrfs, regressions, Zygo Blaxell

[Adding Chris and David to the list of recipients]

Hi, this is your Linux kernel regression tracker again.

@Btrfs-Maintainers: is anyone addressing this regression that got
bisected many weeks ago? From here it looks like things are stuck: Libor
asked for a status update 24 days ago, and I did the same 15 days ago, but
neither of us got an answer. Is there some good reason for this? Or did
the discussion continue somewhere else?

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)

P.S.: As the Linux kernel's regression tracker I'm getting a lot of
reports on my table. I can only look briefly into most of them and lack
knowledge about most of the areas they concern. I thus unfortunately
will sometimes get things wrong or miss something important. I hope
that's not the case here; if you think it is, don't hesitate to tell me
in a public reply; it's in everyone's interest to set the public record
straight.

#regzbot poke

On 18.02.22 15:46, Thorsten Leemhuis wrote:
> Hi, this is your Linux kernel regression tracker speaking. Top-posting
> for once, to make this easily accessible to everyone.
> 
> What's up here? Can anyone (Zygo? Josef?) please provide a status
> update? Yes, it took quite a while till this regression got found, but
> nevertheless it looks to be taking awfully long to get resolved for a
> regression that was bisected weeks ago.
> 
> [snip]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: bisected: btrfs dedupe regression in v5.11-rc1
  2022-03-06 10:31                           ` Thorsten Leemhuis
@ 2022-03-06 23:34                             ` Zygo Blaxell
  2022-03-07  6:17                               ` Thorsten Leemhuis
  0 siblings, 1 reply; 22+ messages in thread
From: Zygo Blaxell @ 2022-03-06 23:34 UTC (permalink / raw)
  To: Thorsten Leemhuis
  Cc: Chris Mason, Josef Bacik, David Sterba, Nikolay Borisov,
	linux-btrfs, regressions

On Sun, Mar 06, 2022 at 11:31:05AM +0100, Thorsten Leemhuis wrote:
> [Adding Chris and David to the list of recipients]
> 
> Hi, this is your Linux kernel regression tracker again.
> 
> @Btrfs-Maintainers: is anyone addressing this regression that got
> bisected many weeks ago? From here it looks like things are stuck: Libor
> asked for a status update 24 days ago, and I did the same 15 days ago, but
> neither of us got an answer. Is there some good reason for this? Or did
> the discussion continue somewhere else?

TL;DR it's not a regression and the bisection is invalid.  I apologize
for using the word "regression" too early.  This is a pre-existing bug
that doesn't require special attention.

After multiple bisection test runs, multiple brute-force linear tests
of every btrfs commit between 5.10 and 5.11, and similar event reports
observed in a production fleet, I now believe the original bisection
results are an artifact of the git bisection algorithm colliding with
a bug that has unusual behavior.

Every commit in the range eventually locks up with the same symptoms.
Commits seem to randomly fall into the "minutes to reproduce" or "weeks
to reproduce" buckets, i.e. there's no single point in the commit history
where everything before the point is good and everything after the point
is bad.  The fact that bad commits are interleaved with good ones points
to an issue elsewhere, far away from the commit that bisect happened to
land on.
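
The linear sweep itself was conceptually nothing more than the sketch
below, where build-boot-and-test is a stand-in for whatever builds the
kernel, boots the test VM, and runs the bees workload until it hangs
or a timeout expires:

	for c in $(git rev-list --reverse v5.10..v5.11-rc1 -- fs/btrfs); do
		git checkout "$c" || break
		build-boot-and-test "$c"
	done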

This bug is now at the top of my "reasons why btrfs crashes" list,
but that is because 1) every worse bug on that list has now been fixed
(including some other much more urgent regressions from the past few
kernel cycles) and 2) I'm pushing hard on specific parts of btrfs that
only a few people are using at the moment.  Most of the bugs I've hit
in btrfs turned out to be more than 5 years old at the time, and this
"new" bug could easily fall into that category.

> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
> 
> P.S.: As the Linux kernel's regression tracker I'm getting a lot of
> reports on my table. I can only look briefly into most of them and lack
> knowledge about most of the areas they concern. I thus unfortunately
> will sometimes get things wrong or miss something important. I hope
> that's not the case here; if you think it is, don't hesitate to tell me
> in a public reply, it's in everyone's interest to set the public record
> straight.
> 
> #regzbot poke
> 
> On 18.02.22 15:46, Thorsten Leemhuis wrote:
> > Hi, this is your Linux kernel regression tracker speaking. Top-posting
> > for once, to make this easy accessible to everyone.
> > 
> > What's up here? Can anyone (Zygo? Josef?) please provide a status
> > update? Yes, it took quite a while till this regression got found, but
> > nevertheless this looks to take awfully long to get resolved for a
> > regression was bisected weeks ago.
> > 
> > Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
> > 
> > P.S.: As the Linux kernel's regression tracker I'm getting a lot of
> > reports on my table. I can only look briefly into most of them and lack
> > knowledge about most of the areas they concern. I thus unfortunately
> > will sometimes get things wrong or miss something important. I hope
> > that's not the case here; if you think it is, don't hesitate to tell me
> > in a public reply, it's in everyone's interest to set the public record
> > straight.
> > 
> > 
> > On 21.01.22 01:27, Zygo Blaxell wrote:
> >> On Thu, Jan 20, 2022 at 03:04:19PM +0100, Thorsten Leemhuis wrote:
> >>> Hi, this is your Linux kernel regression tracker speaking.
> >>>
> >>> On 07.01.22 19:31, Zygo Blaxell wrote:
> >>>> On Fri, Dec 17, 2021 at 03:50:44PM -0500, Zygo Blaxell wrote:
> >>>> I left my VM running tests for a few weeks and got some more information.
> >>>> Or at least more data, I'm not feeling particularly informed by it.  :-P
> >>>>
> >>>> 1.  It's not a regression.  5.10 has the same symptoms, but about 100x
> >>>> less often (once a week under these test conditions, compared to once
> >>>> every 90 minutes or so on 5.11-rc1).
> >>>
> >>> Well, I'd still call it a regression, as it's now happening way more
> >>> often and thus will likely hit more users. It's thus a bit like a
> >>> problem that leads to higher energy consumption: things still work, but
> >>> worse than before -- nevertheless it's considered a regression. Anway:
> >>>
> >>> What's the status here? Are you still investigating the issue? Are any
> >>> developers looking out for the root cause?
> >>
> >> I think Josef's plan (start inside the logical_ino ioctl with bpftrace
> >> and work upwards to see where the looping is getting stuck) is a good plan,
> >> but due to conflicting priorities I haven't found the time to act on it.
> >>
> >> I can take experimental patches and throw them at my repro setup if
> >> anyone would like to supply some.
> >>
> >>> Ciao, Thorsten
> >>>
> >>> P.S.: As a Linux kernel regression tracker I'm getting a lot of reports
> >>> on my table. I can only look briefly into most of them. Unfortunately
> >>> therefore I sometimes will get things wrong or miss something important.
> >>> I hope that's not the case here; if you think it is, don't hesitate to
> >>> tell me about it in a public reply, that's in everyone's interest.
> >>>
> >>> BTW, I have no personal interest in this issue, which is tracked using
> >>> regzbot, my Linux kernel regression tracking bot
> >>> (https://linux-regtracking.leemhuis.info/regzbot/). I'm only posting
> >>> this mail to get things rolling again and hence don't need to be CC on
> >>> all further activities wrt to this regression.
> >>>
> >>> #regzbot poke
> >>>
> >>>> 2.  Bisection doesn't work, because there are patches that are repeatably
> >>>> good and bad mixed together, so the bisect algorithm (even with stochastic
> >>>> enhancement) repeatably picks the wrong commits and converges with
> >>>> high confidence on nonsense.  Instead of bisecting, I picked commits
> >>>> semi-randomly from 5.11-rc1's patch set, and got these results:
> >>>>
> >>>>    124  3a160a933111 btrfs: drop never met disk total bytes check in verify_one_dev_extent
> >>>> 	1x hang, 2x slower
> >>>>    125  bacce86ae8a7 btrfs: drop unused argument step from btrfs_free_extra_devids
> >>>> 	1x pass (fast)
> >>>>    126  2766ff61762c btrfs: update the number of bytes used by an inode atomically
> >>>> 	1x hang (<20 minutes)
> >>>>    127  7f458a3873ae btrfs: fix race when defragmenting leads to unnecessary IO
> >>>> 	1x hang, runs 3x slower
> >>>>    128  5893dfb98f25 btrfs: refactor btrfs_drop_extents() to make it easier to extend
> >>>> 	2x hang (<20 minutes)
> >>>>    129  e114c545bb69 btrfs: set the lockdep class for extent buffers on creation
> >>>> 	2x pass (but runs 2x slower, both times)
> >>>>    130  3fbaf25817f7 btrfs: pass the owner_root and level to alloc_extent_buffer
> >>>> 	1x pass
> >>>>    131  5d81230baa90 btrfs: pass the root owner and level around for readahead
> >>>> 	1x pass
> >>>>    132  1b7ec85ef490 btrfs: pass root owner to read_tree_block
> >>>> 	1x pass
> >>>>    133  182c79fcb857 btrfs: use btrfs_read_node_slot in btrfs_qgroup_trace_subtree
> >>>>    134  3acfbd6a990c btrfs: use btrfs_read_node_slot in qgroup_trace_new_subtree_blocks
> >>>> 	1x hang
> >>>>    135  6b2cb7cb959a btrfs: use btrfs_read_node_slot in qgroup_trace_extent_swap
> >>>>    136  c990ada2a0bb btrfs: use btrfs_read_node_slot in walk_down_tree
> >>>> 	1x hang
> >>>>    137  6b3426be27de btrfs: use btrfs_read_node_slot in replace_path
> >>>> 	1x hang, 1x pass
> >>>>    138  c975253682e0 btrfs: use btrfs_read_node_slot in do_relocation
> >>>> 	1x hang
> >>>>    139  8ef385bbf099 btrfs: use btrfs_read_node_slot in walk_down_reloc_tree
> >>>> 	1x hang, 1x pass
> >>>>    140  206983b72a36 btrfs: use btrfs_read_node_slot in btrfs_realloc_node
> >>>> 	1x pass
> >>>>    141  bfb484d922a3 btrfs: cleanup extent buffer readahead
> >>>> 	1x pass
> >>>>    142  416e3445ef80 btrfs: remove lockdep classes for the fs tree
> >>>>    143  3e48d8d2540d btrfs: discard: reschedule work after sysfs param update
> >>>>    144  df903e5d294f btrfs: don't miss async discards after scheduled work override
> >>>>    145  6e88f116bd4c btrfs: discard: store async discard delay as ns not as jiffies
> >>>> 	2x hang
> >>>>    146  e50404a8a699 btrfs: discard: speed up async discard up to iops_limit
> >>>>
> >>>>    [snip]
> >>>>
> >>>>    155  0d01e247a06b btrfs: assert page mapping lock in attach_extent_buffer_page
> >>>> 	1x hang, 1x pass
> >>>>    156  bbb86a371791 btrfs: protect fs_info->caching_block_groups by block_group_cache_lock
> >>>> 	1x hang
> >>>>    157  e747853cae3a btrfs: load free space cache asynchronously
> >>>> 	1x pass
> >>>>    158  4d7240f0abda btrfs: load the free space cache inode extents from commit root
> >>>> 	1x hang
> >>>>    159  cd79909bc7cd btrfs: load free space cache into a temporary ctl
> >>>> 	2x pass
> >>>>    160  66b53bae46c8 btrfs: cleanup btrfs_discard_update_discardable usage
> >>>> 	2x hang, 1x pass
> >>>>    161  2ca08c56e813 btrfs: explicitly protect ->last_byte_to_unpin in unpin_extent_range
> >>>> 	2x pass
> >>>>    162  27d56e62e474 btrfs: update last_byte_to_unpin in switch_commit_roots
> >>>> 	2x pass
> >>>>    163  9076dbd5ee83 btrfs: do not shorten unpin len for caching block groups
> >>>>    164  dc5161648693 btrfs: reorder extent buffer members for better packing
> >>>> 	2x pass
> >>>>    165  b9729ce014f6 btrfs: locking: rip out path->leave_spinning
> >>>>    166  ac5887c8e013 btrfs: locking: remove all the blocking helpers
> >>>>    167  2ae0c2d80d25 btrfs: scrub: remove local copy of csum_size from context
> >>>>    168  419b791ce760 btrfs: check integrity: remove local copy of csum_size
> >>>> 	1x hang, 1x pass
> >>>>    169  713cebfb9891 btrfs: remove unnecessary local variables for checksum size
> >>>>    170  223486c27b36 btrfs: switch cached fs_info::csum_size from u16 to u32
> >>>>    171  55fc29bed8dd btrfs: use cached value of fs_info::csum_size everywhere
> >>>>    172  fe5ecbe818de btrfs: precalculate checksums per leaf once
> >>>>    173  22b6331d9617 btrfs: store precalculated csum_size in fs_info
> >>>>    174  265fdfa6ce0a btrfs: replace s_blocksize_bits with fs_info::sectorsize_bits
> >>>>    175  098e63082b9b btrfs: replace div_u64 by shift in free_space_bitmap_size
> >>>> 	2x pass
> >>>>    176  ab108d992b12 btrfs: use precalculated sectorsize_bits from fs_info
> >>>>
> >>>>    [snip]
> >>>>
> >>>>    200  5e8b9ef30392 btrfs: move pos increment and pagecache extension to btrfs_buffered_write
> >>>> 	1x pass
> >>>>    201  4e4cabece9f9 btrfs: split btrfs_direct_IO to read and write
> >>>>
> >>>>    [snip]
> >>>>
> >>>>    215  d70bf7484f72 btrfs: unify the ro checking for mount options
> >>>> 	1x pass
> >>>>    216  a6889caf6ec6 btrfs: do not start readahead for csum tree when scrubbing non-data block groups
> >>>>    217  a57ad681f12e btrfs: assert we are holding the reada_lock when releasing a readahead zone
> >>>>    218  aa8c1a41a1e6 btrfs: set EXTENT_NORESERVE bits side btrfs_dirty_pages()
> >>>>    219  13f0dd8f7861 btrfs: use round_down while calculating start position in btrfs_dirty_pages()
> >>>>    220  949b32732eab btrfs: use iosize while reading compressed pages
> >>>>    221  eefa45f59379 btrfs: calculate num_pages, reserve_bytes once in btrfs_buffered_write
> >>>>    222  fb8a7e941b1b btrfs: calculate more accurate remaining time to sleep in transaction_kthread
> >>>> 	1x pass
> >>>>
> >>>> There is some repeatability in these results--some commits have a much
> >>>> lower failure rate than others--but I don't see a reason why the bad
> >>>> commits are bad or the good commits are good.  There are some commits with
> >>>> locking and concurrency implications, but they're as likely to produce
> >>>> good as bad results in test.  Sometimes there's a consistent change in
> >>>> test result after a commit that only rearranges function arguments on
> >>>> the stack.
> >>>>
> >>>> Maybe what we're looking at is a subtle race that is popping up due
> >>>> to unrelated changes in the kernel, and disappearing just as often,
> >>>> and 5.11-rc1 in particular did something innocent that aggravates
> >>>> it somehow, so all later kernels hit the problem more often than
> >>>> 5.10 did.
> >>>>
> >>>> 3.  Somewhere around "7f458a3873ae btrfs: fix race when defragmenting
> >>>> leads to unnecessary IO" bees starts running about 3x slower than on
> >>>> earlier kernels.  bees is a nightmare of nondeterministically racing
> >>>> worker threads, so I'm not sure how important this observation is,
> >>>> but it keeps showing up in the data.
> >>>>
> >>>> 4.  I had one machine on 5.10.84 (not a test VM) with a shell process
> >>>> that got stuck spinning 100% CPU in the kernel on sys_write.  bees was
> >>>> also running, but its threads were all stuck waiting for the shell to
> >>>> release the transaction.  Other crashes on 5.10.8x kernels look more
> >>>> like the one in this thread, with a logical_ino spinning.
> >>>>
> >>>>>> If it's not looping there, it may be looping higher up, but I don't see where it
> >>>>>> would be doing that.  Let's start here and work our way up if we need to.
> >>>>>> Thanks,
> >>>
> > 
> 


* Re: bisected: btrfs dedupe regression in v5.11-rc1
  2022-03-06 23:34                             ` Zygo Blaxell
@ 2022-03-07  6:17                               ` Thorsten Leemhuis
  0 siblings, 0 replies; 22+ messages in thread
From: Thorsten Leemhuis @ 2022-03-07  6:17 UTC (permalink / raw)
  To: Zygo Blaxell
  Cc: Chris Mason, Josef Bacik, David Sterba, Nikolay Borisov,
	linux-btrfs, regressions

On 07.03.22 00:34, Zygo Blaxell wrote:
> On Sun, Mar 06, 2022 at 11:31:05AM +0100, Thorsten Leemhuis wrote:
>> [Adding Chris and David to the list of recipients]
>>
>> Hi, this is your Linux kernel regression tracker again.
>>
>> @Btrfs-Maintainers: is anyone addressing this regression that got
>> bisected many weeks ago? From here it looks like things are stuck: Libor
>> asked for a status update 24 days ago, I did the same 15 days ago, but
>> neither of us got an answer. Is there some good reason for this? Or did
>> the discussion continue somewhere else?
> 
> TL;DR it's not a regression and the bisection is invalid.

#regzbot invalid: it's not a regression and the bisection is invalid
according to the reporter

>  I apologize for using the word "regression" too early.

Np, that just didn't become clear to me from the discussion, but I can
only skim those, as I get many reports on my table.

thx for clarifying and good luck in hunting this down!

Ciao, Thorsten

>  This is a pre-existing bug
> that doesn't require special attention.
>
> After multiple bisection test runs and multiple brute-force linear tests
> of every btrfs commit between 5.10 and 5.11, and similar event reports
> observed in a production fleet, I now believe the original bisection
> results are an artifact of the git bisection algorithm interacting with
> a bug that has unusual behavior.
> 
> Every commit in the range eventually locks up with the same symptoms.
> Commits seem to randomly fall into the "minutes to reproduce" or "weeks
> to reproduce" buckets, i.e. there's no single point in the commit history
> where everything before the point is good and everything after the point
> is bad.  The fact that bad commits are interleaved with good ones points
> to an issue elsewhere, far away from the commit that bisect happened to
> land on.
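> 
> (Quick arithmetic on why that defeats bisection: if every commit can
> hang with probability p per test run, a ~13-step bisect produces 13
> correct "this one hangs" verdicts only with probability about p^13 --
> even at p = 0.5 that is roughly 0.01% -- so the algorithm almost
> always converges, with high confidence, on an arbitrary commit.)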
> 
> This bug is now at the top of my "reasons why btrfs crashes" list,
> but that is because 1) every worse bug on that list has now been fixed
> (including some other much more urgent regressions from the past few
> kernel cycles) and 2) I'm pushing hard on specific parts of btrfs that
> only a few people are using at the moment.  Most of the bugs I've hit
> in btrfs turned out to be more than 5 years old at the time, and this
> "new" bug could easily fall into that category.
> 
>> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
>>
>> P.S.: As the Linux kernel's regression tracker I'm getting a lot of
>> reports on my table. I can only look briefly into most of them and lack
>> knowledge about most of the areas they concern. I thus unfortunately
>> will sometimes get things wrong or miss something important. I hope
>> that's not the case here; if you think it is, don't hesitate to tell me
>> in a public reply, it's in everyone's interest to set the public record
>> straight.
>>
>> #regzbot poke
>>
>> On 18.02.22 15:46, Thorsten Leemhuis wrote:
>>> Hi, this is your Linux kernel regression tracker speaking. Top-posting
>>> for once, to make this easily accessible to everyone.
>>>
>>> What's up here? Can anyone (Zygo? Josef?) please provide a status
>>> update? Yes, it took quite a while till this regression got found, but
>>> nevertheless this looks to be taking awfully long to get resolved for a
>>> regression that was bisected weeks ago.
>>>
>>> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
>>>
>>> P.S.: As the Linux kernel's regression tracker I'm getting a lot of
>>> reports on my table. I can only look briefly into most of them and lack
>>> knowledge about most of the areas they concern. I thus unfortunately
>>> will sometimes get things wrong or miss something important. I hope
>>> that's not the case here; if you think it is, don't hesitate to tell me
>>> in a public reply, it's in everyone's interest to set the public record
>>> straight.
>>>
>>>
>>> On 21.01.22 01:27, Zygo Blaxell wrote:
>>>> On Thu, Jan 20, 2022 at 03:04:19PM +0100, Thorsten Leemhuis wrote:
>>>>> Hi, this is your Linux kernel regression tracker speaking.
>>>>>
>>>>> On 07.01.22 19:31, Zygo Blaxell wrote:
>>>>>> On Fri, Dec 17, 2021 at 03:50:44PM -0500, Zygo Blaxell wrote:
>>>>>> I left my VM running tests for a few weeks and got some more information.
>>>>>> Or at least more data, I'm not feeling particularly informed by it.  :-P
>>>>>>
>>>>>> 1.  It's not a regression.  5.10 has the same symptoms, but about 100x
>>>>>> less often (once a week under these test conditions, compared to once
>>>>>> every 90 minutes or so on 5.11-rc1).
>>>>>
>>>>> Well, I'd still call it a regression, as it's now happening way more
>>>>> often and thus will likely hit more users. It's thus a bit like a
>>>>> problem that leads to higher energy consumption: things still work, but
>>>>> worse than before -- nevertheless it's considered a regression. Anyway:
>>>>>
>>>>> What's the status here? Are you still investigating the issue? Are any
>>>>> developers looking into the root cause?
>>>>
>>>> I think Josef's plan (start inside the logical_ino ioctl with bpftrace
>>>> and work upwards to see where the looping is getting stuck) is a good plan,
>>>> but due to conflicting priorities I haven't found the time to act on it.
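>>>>
>>>> (A minimal bpftrace starting point for that plan -- my sketch, not a
>>>> script Josef supplied -- would be to time the ioctl handler seen in
>>>> the stack traces:
>>>>
>>>> 	bpftrace -e '
>>>> 	kprobe:btrfs_ioctl_logical_to_ino { @start[tid] = nsecs; }
>>>> 	kretprobe:btrfs_ioctl_logical_to_ino /@start[tid]/ {
>>>> 		@usecs = hist((nsecs - @start[tid]) / 1000);
>>>> 		delete(@start[tid]);
>>>> 	}'
>>>>
>>>> During a hang, a call that enters but never returns shows up as a
>>>> stuck @start entry, and the histogram shows whether latencies grow
>>>> before the lockup.)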
>>>>
>>>> I can take experimental patches and throw them at my repro setup if
>>>> anyone would like to supply some.
>>>>
>>>>> Ciao, Thorsten
>>>>>
>>>>> P.S.: As a Linux kernel regression tracker I'm getting a lot of reports
>>>>> on my table. I can only look briefly into most of them. Unfortunately
>>>>> therefore I sometimes will get things wrong or miss something important.
>>>>> I hope that's not the case here; if you think it is, don't hesitate to
>>>>> tell me about it in a public reply, that's in everyone's interest.
>>>>>
>>>>> BTW, I have no personal interest in this issue, which is tracked using
>>>>> regzbot, my Linux kernel regression tracking bot
>>>>> (https://linux-regtracking.leemhuis.info/regzbot/). I'm only posting
>>>>> this mail to get things rolling again and hence don't need to be CC on
>>>>> all further activities wrt to this regression.
>>>>>
>>>>> #regzbot poke
>>>>>
>>>>>> 2.  Bisection doesn't work, because there are patches that are repeatably
>>>>>> good and bad mixed together, so the bisect algorithm (even with stochastic
>>>>>> enhancement) repeatably picks the wrong commits and converges with
>>>>>> high confidence on nonsense.  Instead of bisecting, I picked commits
>>>>>> semi-randomly from 5.11-rc1's patch set, and got these results:
>>>>>>
>>>>>>    124  3a160a933111 btrfs: drop never met disk total bytes check in verify_one_dev_extent
>>>>>> 	1x hang, 2x slower
>>>>>>    125  bacce86ae8a7 btrfs: drop unused argument step from btrfs_free_extra_devids
>>>>>> 	1x pass (fast)
>>>>>>    126  2766ff61762c btrfs: update the number of bytes used by an inode atomically
>>>>>> 	1x hang (<20 minutes)
>>>>>>    127  7f458a3873ae btrfs: fix race when defragmenting leads to unnecessary IO
>>>>>> 	1x hang, runs 3x slower
>>>>>>    128  5893dfb98f25 btrfs: refactor btrfs_drop_extents() to make it easier to extend
>>>>>> 	2x hang (<20 minutes)
>>>>>>    129  e114c545bb69 btrfs: set the lockdep class for extent buffers on creation
>>>>>> 	2x pass (but runs 2x slower, both times)
>>>>>>    130  3fbaf25817f7 btrfs: pass the owner_root and level to alloc_extent_buffer
>>>>>> 	1x pass
>>>>>>    131  5d81230baa90 btrfs: pass the root owner and level around for readahead
>>>>>> 	1x pass
>>>>>>    132  1b7ec85ef490 btrfs: pass root owner to read_tree_block
>>>>>> 	1x pass
>>>>>>    133  182c79fcb857 btrfs: use btrfs_read_node_slot in btrfs_qgroup_trace_subtree
>>>>>>    134  3acfbd6a990c btrfs: use btrfs_read_node_slot in qgroup_trace_new_subtree_blocks
>>>>>> 	1x hang
>>>>>>    135  6b2cb7cb959a btrfs: use btrfs_read_node_slot in qgroup_trace_extent_swap
>>>>>>    136  c990ada2a0bb btrfs: use btrfs_read_node_slot in walk_down_tree
>>>>>> 	1x hang
>>>>>>    137  6b3426be27de btrfs: use btrfs_read_node_slot in replace_path
>>>>>> 	1x hang, 1x pass
>>>>>>    138  c975253682e0 btrfs: use btrfs_read_node_slot in do_relocation
>>>>>> 	1x hang
>>>>>>    139  8ef385bbf099 btrfs: use btrfs_read_node_slot in walk_down_reloc_tree
>>>>>> 	1x hang, 1x pass
>>>>>>    140  206983b72a36 btrfs: use btrfs_read_node_slot in btrfs_realloc_node
>>>>>> 	1x pass
>>>>>>    141  bfb484d922a3 btrfs: cleanup extent buffer readahead
>>>>>> 	1x pass
>>>>>>    142  416e3445ef80 btrfs: remove lockdep classes for the fs tree
>>>>>>    143  3e48d8d2540d btrfs: discard: reschedule work after sysfs param update
>>>>>>    144  df903e5d294f btrfs: don't miss async discards after scheduled work override
>>>>>>    145  6e88f116bd4c btrfs: discard: store async discard delay as ns not as jiffies
>>>>>> 	2x hang
>>>>>>    146  e50404a8a699 btrfs: discard: speed up async discard up to iops_limit
>>>>>>
>>>>>>    [snip]
>>>>>>
>>>>>>    155  0d01e247a06b btrfs: assert page mapping lock in attach_extent_buffer_page
>>>>>> 	1x hang, 1x pass
>>>>>>    156  bbb86a371791 btrfs: protect fs_info->caching_block_groups by block_group_cache_lock
>>>>>> 	1x hang
>>>>>>    157  e747853cae3a btrfs: load free space cache asynchronously
>>>>>> 	1x pass
>>>>>>    158  4d7240f0abda btrfs: load the free space cache inode extents from commit root
>>>>>> 	1x hang
>>>>>>    159  cd79909bc7cd btrfs: load free space cache into a temporary ctl
>>>>>> 	2x pass
>>>>>>    [snip]
>>>>>
>>>
>>
> 
> 


* Re: bisected: btrfs dedupe regression in v5.11-rc1: 3078d85c9a10 vfs: verify source area in vfs_dedupe_file_range_one()
  2021-12-10 18:34 bisected: btrfs dedupe regression in v5.11-rc1: 3078d85c9a10 vfs: verify source area in vfs_dedupe_file_range_one() Zygo Blaxell
  2021-12-12 10:03 ` Thorsten Leemhuis
  2021-12-13 13:28 ` Nikolay Borisov
@ 2022-06-13  8:38 ` Libor Klepáč
  2022-06-21  5:08   ` Zygo Blaxell
  2 siblings, 1 reply; 22+ messages in thread
From: Libor Klepáč @ 2022-06-13  8:38 UTC (permalink / raw)
  To: linux-btrfs

Hi,
any news on this?

We have been running beesd in a single thread, as you suggested in
https://github.com/Zygo/bees/issues/215#issuecomment-1104614791

without a kernel lockup for almost two months on kernel 5.11 (don't know
the exact version, it's in Ubuntu 20.04).
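
(For reference, the single-thread setup is just the thread-count option
from the bees docs -- if I read them right, something like:

	beesd --thread-count 1 <filesystem-uuid>

with the UUID of the mounted btrfs filesystem.)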

We have recently switched to kernel 5.15 and started to use it at
another customer too.

I don't know the exact kernel version; it's an Ubuntu 20.04 kernel:
uname says
Linux nakivo 5.15.0-33-generic #34~20.04.1-Ubuntu SMP Thu May 19
15:51:16 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

Package version: 5.15.0-33.34~20.04.1

The only problem is that bees cannot keep up with the data being
added to the system via the Nakivo backup app.
htop says that beesd runs at around 8-20 MB/s.
The system can do I/O at around 400-500 MB/s; it's a dedicated Dell R540
with hardware RAID on Nearline SAS drives.

Is kernel 5.15 safer for multi-thread mode in beesd?

Thanks,
Libor



On Pá, 2021-12-10 at 13:34 -0500, Zygo Blaxell wrote:
> [snip]


* Re: bisected: btrfs dedupe regression in v5.11-rc1: 3078d85c9a10 vfs: verify source area in vfs_dedupe_file_range_one()
  2022-06-13  8:38 ` Libor Klepáč
@ 2022-06-21  5:08   ` Zygo Blaxell
  0 siblings, 0 replies; 22+ messages in thread
From: Zygo Blaxell @ 2022-06-21  5:08 UTC (permalink / raw)
  To: Libor Klepáč; +Cc: linux-btrfs

On Mon, Jun 13, 2022 at 08:38:52AM +0000, Libor Klepáč wrote:
> Hi,
> any news on this?

No news yet.  It is near the top of my todo list, but the top of
my todo list is a highly contended place.

I did run a KCSAN audit on 5.18 and while it did find some btrfs problems,
it didn't find anything relevant to this bug.  5.4 seems not to have
the bug, but 5.4 comes with a ~30% performance drop compared to 5.10
(and I didn't test it long enough to say that 5.4 really has no bug).
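
For anyone who wants to repeat that audit: it was a stock debug build
plus the KCSAN options, roughly this .config fragment (from memory, so
double-check against your tree):

	CONFIG_KCSAN=y
	# rate-limit duplicate reports; 3000 ms is the default
	CONFIG_KCSAN_REPORT_ONCE_IN_MS=3000

then run the same bees workload and watch dmesg for
"BUG: KCSAN: data-race" reports.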

> We have been running beesd in a single thread, as you suggested in
> https://github.com/Zygo/bees/issues/215#issuecomment-1104614791
> 
> without a kernel lockup for almost two months on kernel 5.11 (don't know
> the exact version, it's in Ubuntu 20.04).
> 
> We have recently switched to kernel 5.15 and started to use it at
> another customer too.
> 
> I don't know the exact kernel version; it's an Ubuntu 20.04 kernel:
> uname says
> Linux nakivo 5.15.0-33-generic #34~20.04.1-Ubuntu SMP Thu May 19
> 15:51:16 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
> 
> Package version: 5.15.0-33.34~20.04.1
> 
> The only problem is that bees cannot keep up with the data being
> added to the system via the Nakivo backup app.
> htop says that beesd runs at around 8-20 MB/s.
> The system can do I/O at around 400-500 MB/s; it's a dedicated Dell R540
> with hardware RAID on Nearline SAS drives.
> 
> Is kernel 5.15 safer for multi-thread mode in beesd?

Unfortunately no.  The range of kernels with the bug runs from at least
5.6 through 5.18, and probably kernels outside of that range as well.
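
For context on what multi-thread mode exercises: every bees dedupe
request ends up in the FIDEDUPERANGE ioctl, the path shown in the stack
traces above.  A minimal standalone caller looks roughly like this
(error handling trimmed, offsets fixed at zero for brevity):

	/* dedupe the first LEN bytes of SRC into DST at offset 0 */
	#include <fcntl.h>
	#include <linux/fs.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <sys/ioctl.h>

	int main(int argc, char **argv)
	{
		if (argc < 4)
			return 1;
		int src = open(argv[1], O_RDONLY);           /* SRC */
		int dst = open(argv[2], O_RDWR);             /* DST */
		struct file_dedupe_range *r =
			calloc(1, sizeof(*r) + sizeof(r->info[0]));

		r->src_offset = 0;
		r->src_length = strtoull(argv[3], NULL, 0);  /* LEN */
		r->dest_count = 1;
		r->info[0].dest_fd = dst;
		r->info[0].dest_offset = 0;

		if (ioctl(src, FIDEDUPERANGE, r) < 0)
			perror("FIDEDUPERANGE");
		else
			printf("status %d, %llu bytes deduped\n",
			       r->info[0].status,
			       (unsigned long long)r->info[0].bytes_deduped);
		return 0;
	}

Each call locks both inodes via lock_two_nondirectories() before doing
any btrfs work, which is why several of these in flight plus a
transaction commit can wedge each other, and why a single thread avoids
the worst of it.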

> Thanks,
> Libor
> 
> 
> 
> On Pá, 2021-12-10 at 13:34 -0500, Zygo Blaxell wrote:
> > [snip]


Thread overview: 22+ messages
2021-12-10 18:34 bisected: btrfs dedupe regression in v5.11-rc1: 3078d85c9a10 vfs: verify source area in vfs_dedupe_file_range_one() Zygo Blaxell
2021-12-12 10:03 ` Thorsten Leemhuis
2021-12-13 13:28 ` Nikolay Borisov
2021-12-13 23:12   ` Zygo Blaxell
2021-12-14 11:11     ` Nikolay Borisov
2021-12-14 19:50       ` Zygo Blaxell
2021-12-14 22:25         ` Nikolay Borisov
2021-12-16  5:33           ` Zygo Blaxell
2021-12-16 21:29             ` Nikolay Borisov
2021-12-16 22:07               ` Josef Bacik
2021-12-17 20:50                 ` Zygo Blaxell
2022-01-07 18:31                   ` bisected: btrfs dedupe regression in v5.11-rc1 Zygo Blaxell
2022-01-20 14:04                     ` Thorsten Leemhuis
2022-01-21  0:27                       ` Zygo Blaxell
2022-02-09 12:22                         ` Libor Klepáč
2022-02-18 14:46                         ` Thorsten Leemhuis
2022-03-06 10:31                           ` Thorsten Leemhuis
2022-03-06 23:34                             ` Zygo Blaxell
2022-03-07  6:17                               ` Thorsten Leemhuis
2021-12-17  5:38               ` bisected: btrfs dedupe regression in v5.11-rc1: 3078d85c9a10 vfs: verify source area in vfs_dedupe_file_range_one() Zygo Blaxell
2022-06-13  8:38 ` Libor Klepáč
2022-06-21  5:08   ` Zygo Blaxell
