* 'btrfs replace' hangs at end of replacing a device (v5.10.82)
@ 2021-11-29 21:46 Zygo Blaxell
2021-11-30 10:08 ` Nikolay Borisov
` (2 more replies)
0 siblings, 3 replies; 11+ messages in thread
From: Zygo Blaxell @ 2021-11-29 21:46 UTC (permalink / raw)
To: linux-btrfs
Not a new bug, but it's still there. btrfs replace ends in a transaction
deadlock.
'btrfs replace status' reports the replace completed and exits:
Started on 27.Nov 02:05:07, finished on 29.Nov 14:11:20, 0 write errs, 0 uncorr. read errs
Magic-SysRq-W:
sysrq: Show Blocked State
task:btrfs-transacti state:D stack: 0 pid:29509 ppid: 2 flags:0x00004000
Call Trace:
__schedule+0x35a/0xaa0
schedule+0x68/0xe0
schedule_preempt_disabled+0x15/0x20
__mutex_lock+0x1ac/0x7e0
? lock_acquire+0x190/0x2d0
? btrfs_run_dev_stats+0x46/0x450
? rcu_read_lock_sched_held+0x16/0x80
mutex_lock_nested+0x1b/0x20
btrfs_run_dev_stats+0x46/0x450
? _raw_spin_unlock+0x23/0x30
? release_extent_buffer+0xa7/0xe0
commit_cowonly_roots+0xa2/0x2a0
? btrfs_qgroup_account_extents+0x2d3/0x320
btrfs_commit_transaction+0x51f/0xc60
transaction_kthread+0x15a/0x180
kthread+0x151/0x170
? btrfs_cleanup_transaction.isra.0+0x630/0x630
? kthread_create_worker_on_cpu+0x70/0x70
ret_from_fork+0x22/0x30
task:nfsd state:D stack: 0 pid:31445 ppid: 2 flags:0x00004000
Call Trace:
__schedule+0x35a/0xaa0
schedule+0x68/0xe0
btrfs_bio_counter_inc_blocked+0xe3/0x120
? add_wait_queue_exclusive+0x80/0x80
btrfs_map_bio+0x4d/0x3f0
? rcu_read_lock_sched_held+0x16/0x80
? kmem_cache_alloc+0x2e8/0x360
btrfs_submit_metadata_bio+0xe9/0x100
submit_one_bio+0x67/0x80
read_extent_buffer_pages+0x277/0x380
btree_read_extent_buffer_pages+0xa1/0x120
read_tree_block+0x3b/0x70
read_block_for_search.isra.0+0x1a2/0x350
? rcu_read_lock_sched_held+0x16/0x80
btrfs_search_slot+0x20f/0x910
btrfs_lookup_dir_item+0x78/0xc0
btrfs_lookup_dentry+0xca/0x540
btrfs_lookup+0x13/0x40
__lookup_slow+0x10d/0x1e0
? rcu_read_lock_sched_held+0x16/0x80
lookup_one_len+0x77/0x90
nfsd_lookup_dentry+0xe0/0x440 [nfsd]
nfsd_lookup+0x89/0x150 [nfsd]
nfsd4_lookup+0x1a/0x20 [nfsd]
nfsd4_proc_compound+0x58b/0x8a0 [nfsd]
nfsd_dispatch+0xe6/0x1a0 [nfsd]
svc_process+0x55e/0x990 [sunrpc]
? nfsd_svc+0x6a0/0x6a0 [nfsd]
nfsd+0x173/0x2a0 [nfsd]
kthread+0x151/0x170
? nfsd_destroy+0x190/0x190 [nfsd]
? kthread_create_worker_on_cpu+0x70/0x70
ret_from_fork+0x22/0x30
task:btrfs state:D stack: 0 pid:14692 ppid: 14687 flags:0x00004000
Call Trace:
__schedule+0x35a/0xaa0
schedule+0x68/0xe0
btrfs_rm_dev_replace_blocked+0x8a/0xc0
? add_wait_queue_exclusive+0x80/0x80
btrfs_dev_replace_finishing+0x59a/0x790
btrfs_dev_replace_by_ioctl+0x59d/0x6f0
? btrfs_dev_replace_by_ioctl+0x59d/0x6f0
btrfs_ioctl+0x27b2/0x2fe0
? _raw_spin_unlock_irq+0x28/0x40
? _raw_spin_unlock_irq+0x28/0x40
? trace_hardirqs_on+0x54/0xf0
? _raw_spin_unlock_irq+0x28/0x40
? do_sigaction+0xfd/0x250
? __might_fault+0x79/0x80
__x64_sys_ioctl+0x91/0xc0
? __x64_sys_ioctl+0x91/0xc0
do_syscall_64+0x38/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f1d8a0f4cc7
RSP: 002b:00007ffc6cffe588 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007f1d8a0f4cc7
RDX: 00007ffc6cfff400 RSI: 00000000ca289435 RDI: 0000000000000003
RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000008 R11: 0000000000000202 R12: 0000000000000003
R13: 00005583a17fb2e0 R14: 00007ffc6d001b7a R15: 0000000000000001
task:mkdir state:D stack: 0 pid: 2349 ppid: 2346 flags:0x00000000
Call Trace:
__schedule+0x35a/0xaa0
schedule+0x68/0xe0
wait_current_trans+0xed/0x150
? add_wait_queue_exclusive+0x80/0x80
start_transaction+0x551/0x700
btrfs_start_transaction+0x1e/0x20
btrfs_mkdir+0x5f/0x210
vfs_mkdir+0x150/0x200
do_mkdirat+0x118/0x140
__x64_sys_mkdir+0x1b/0x20
do_syscall_64+0x38/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f608a3c6b07
RSP: 002b:00007fffbbd2bab8 EFLAGS: 00000246 ORIG_RAX: 0000000000000053
RAX: ffffffffffffffda RBX: 0000563d79f1dc30 RCX: 00007f608a3c6b07
RDX: 0000000000000000 RSI: 00000000000001ff RDI: 00007fffbbd2db5e
RBP: 00007fffbbd2db39 R08: 0000000000000000 R09: 0000563d79f1dd60
R10: fffffffffffff284 R11: 0000000000000246 R12: 00000000000001ff
R13: 00007fffbbd2bc30 R14: 00007fffbbd2db5e R15: 0000000000000000
After a reboot (still in degraded mode), btrfs finishes the replace in
a little under 5 seconds:
[ 508.664454] BTRFS info (device dm-34): continuing dev_replace from <missing disk> (devid 5) to target /dev/mapper/md17 @100%
[ 513.285473] BTRFS info (device dm-34): dev_replace from <missing disk> (devid 5) to /dev/mapper/md17 finished
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: 'btrfs replace' hangs at end of replacing a device (v5.10.82)
2021-11-29 21:46 'btrfs replace' hangs at end of replacing a device (v5.10.82) Zygo Blaxell
@ 2021-11-30 10:08 ` Nikolay Borisov
2021-11-30 12:36 ` Nikolay Borisov
2021-11-30 13:55 ` Nikolay Borisov
2 siblings, 0 replies; 11+ messages in thread
From: Nikolay Borisov @ 2021-11-30 10:08 UTC (permalink / raw)
To: Zygo Blaxell, linux-btrfs
On 29.11.21 at 23:46, Zygo Blaxell wrote:
> Not a new bug, but it's still there. btrfs replace ends in a transaction
> deadlock.
>
> 'btrfs replace status' reports the replace completed and exits:
>
> Started on 27.Nov 02:05:07, finished on 29.Nov 14:11:20, 0 write errs, 0 uncorr. read errs
>
> Magic-SysRq-W:
>
> sysrq: Show Blocked State
> task:btrfs-transacti state:D stack: 0 pid:29509 ppid: 2 flags:0x00004000
> Call Trace:
> __schedule+0x35a/0xaa0
> schedule+0x68/0xe0
> schedule_preempt_disabled+0x15/0x20
> __mutex_lock+0x1ac/0x7e0
> ? lock_acquire+0x190/0x2d0
> ? btrfs_run_dev_stats+0x46/0x450
> ? rcu_read_lock_sched_held+0x16/0x80
> mutex_lock_nested+0x1b/0x20
The transaction commit is blocked on device_list_mutex, which is held by the
task sitting in btrfs_rm_dev_replace_blocked (the btrfs ioctl task further down).
> btrfs_run_dev_stats+0x46/0x450
> ? _raw_spin_unlock+0x23/0x30
> ? release_extent_buffer+0xa7/0xe0
> commit_cowonly_roots+0xa2/0x2a0
> ? btrfs_qgroup_account_extents+0x2d3/0x320
> btrfs_commit_transaction+0x51f/0xc60
> transaction_kthread+0x15a/0x180
> kthread+0x151/0x170
> ? btrfs_cleanup_transaction.isra.0+0x630/0x630
> ? kthread_create_worker_on_cpu+0x70/0x70
> ret_from_fork+0x22/0x30
> task:nfsd state:D stack: 0 pid:31445 ppid: 2 flags:0x00004000
> Call Trace:
> __schedule+0x35a/0xaa0
> schedule+0x68/0xe0
> btrfs_bio_counter_inc_blocked+0xe3/0x120
This task wants to read something: it increments the bio counter, then sees
that FS_STATE_DEV_REPLACING is set and goes to sleep, so it's not really
contributing to the deadlock and is irrelevant.
> ? add_wait_queue_exclusive+0x80/0x80
> btrfs_map_bio+0x4d/0x3f0
> ? rcu_read_lock_sched_held+0x16/0x80
> ? kmem_cache_alloc+0x2e8/0x360
> btrfs_submit_metadata_bio+0xe9/0x100
> submit_one_bio+0x67/0x80
> read_extent_buffer_pages+0x277/0x380
<snip>
> task:btrfs state:D stack: 0 pid:14692 ppid: 14687 flags:0x00004000
> Call Trace:
> __schedule+0x35a/0xaa0
> schedule+0x68/0xe0
> btrfs_rm_dev_replace_blocked+0x8a/0xc0
which is held here, yet we are blocked because there are pending bios and
bio_counter is > 0. I'd expect some task that has already issued its bio to be
blocked on the transaction commit; that would explain the cycle: dev replace
waits on a bio, the bio waits on a transaction commit, and the transaction
commit waits on dev replace, making the locking circular. However, I don't see
any task that is blocked on the transaction commit after submitting a bio
without having adjusted the bio count back (as in the nfsd case).
> <snip>
> task:mkdir state:D stack: 0 pid: 2349 ppid: 2346 flags:0x00000000
This just waits on the pending transaction so is irrelevant.
>
> After a reboot (still in degraded mode), btrfs finishes the replace in
> a little under 5 seconds:
>
> [ 508.664454] BTRFS info (device dm-34): continuing dev_replace from <missing disk> (devid 5) to target /dev/mapper/md17 @100%
> [ 513.285473] BTRFS info (device dm-34): dev_replace from <missing disk> (devid 5) to /dev/mapper/md17 finished
>
* Re: 'btrfs replace' hangs at end of replacing a device (v5.10.82)
2021-11-29 21:46 'btrfs replace' hangs at end of replacing a device (v5.10.82) Zygo Blaxell
2021-11-30 10:08 ` Nikolay Borisov
@ 2021-11-30 12:36 ` Nikolay Borisov
2021-11-30 15:23 ` Zygo Blaxell
2021-11-30 13:55 ` Nikolay Borisov
2 siblings, 1 reply; 11+ messages in thread
From: Nikolay Borisov @ 2021-11-30 12:36 UTC (permalink / raw)
To: Zygo Blaxell, linux-btrfs
On 29.11.21 at 23:46, Zygo Blaxell wrote:
> Not a new bug, but it's still there. btrfs replace ends in a transaction
> deadlock.
>
Please provide the output of
grep PREEMPT /path/to/running/kernel/kconfig
* Re: 'btrfs replace' hangs at end of replacing a device (v5.10.82)
2021-11-29 21:46 'btrfs replace' hangs at end of replacing a device (v5.10.82) Zygo Blaxell
2021-11-30 10:08 ` Nikolay Borisov
2021-11-30 12:36 ` Nikolay Borisov
@ 2021-11-30 13:55 ` Nikolay Borisov
2021-11-30 21:18 ` Zygo Blaxell
2022-01-01 19:48 ` Nikolay Borisov
2 siblings, 2 replies; 11+ messages in thread
From: Nikolay Borisov @ 2021-11-30 13:55 UTC (permalink / raw)
To: Zygo Blaxell, linux-btrfs
On 29.11.21 at 23:46, Zygo Blaxell wrote:
> Not a new bug, but it's still there. btrfs replace ends in a transaction
> deadlock.
>
> <snip>
I have a working hypothesis about what might be going wrong; however, without
a crash dump to investigate I can't really confirm it. Basically, I think
btrfs_rm_dev_replace_blocked is not seeing the decrement, i.e. the store to
the running-bios count, because the waker uses cond_wake_up_nomb (no memory
barrier). If I'm right, then the following should fix it:
@@ -1122,7 +1123,8 @@ void btrfs_bio_counter_inc_noblocked(struct btrfs_fs_info *fs_info)
 void btrfs_bio_counter_sub(struct btrfs_fs_info *fs_info, s64 amount)
 {
 	percpu_counter_sub(&fs_info->dev_replace.bio_counter, amount);
-	cond_wake_up_nomb(&fs_info->dev_replace.replace_wait);
+	/* paired with the wait_event barrier in replace_blocked */
+	cond_wake_up(&fs_info->dev_replace.replace_wait);
 }
Can you apply it and see whether the hang still reproduces? I don't know what
the incidence rate of this bug is, so you will have to decide at what point it
can be considered fixed. In any case, this patch can't have any negative
functional impact; it just makes the ordering slightly stronger, ensuring the
write happens before we possibly wake up someone on the queue.
* Re: 'btrfs replace' hangs at end of replacing a device (v5.10.82)
2021-11-30 12:36 ` Nikolay Borisov
@ 2021-11-30 15:23 ` Zygo Blaxell
0 siblings, 0 replies; 11+ messages in thread
From: Zygo Blaxell @ 2021-11-30 15:23 UTC (permalink / raw)
To: Nikolay Borisov; +Cc: linux-btrfs
On Tue, Nov 30, 2021 at 02:36:38PM +0200, Nikolay Borisov wrote:
>
>
> On 29.11.21 at 23:46, Zygo Blaxell wrote:
> > Not a new bug, but it's still there. btrfs replace ends in a transaction
> > deadlock.
> >
>
>
> Please provide the output of
>
> grep PREEMPT /path/to/running/kernel/kconfig
# CONFIG_PREEMPT_NONE is not set
CONFIG_PREEMPT_VOLUNTARY=y
# CONFIG_PREEMPT is not set
CONFIG_PREEMPT_COUNT=y
CONFIG_PREEMPT_NOTIFIERS=y
CONFIG_DRM_I915_PREEMPT_TIMEOUT=640
CONFIG_PREEMPTIRQ_TRACEPOINTS=y
CONFIG_PREEMPTIRQ_DELAY_TEST=m
* Re: 'btrfs replace' hangs at end of replacing a device (v5.10.82)
2021-11-30 13:55 ` Nikolay Borisov
@ 2021-11-30 21:18 ` Zygo Blaxell
2022-01-01 19:48 ` Nikolay Borisov
1 sibling, 0 replies; 11+ messages in thread
From: Zygo Blaxell @ 2021-11-30 21:18 UTC (permalink / raw)
To: Nikolay Borisov; +Cc: linux-btrfs
On Tue, Nov 30, 2021 at 03:55:12PM +0200, Nikolay Borisov wrote:
>
>
> On 29.11.21 at 23:46, Zygo Blaxell wrote:
> > Not a new bug, but it's still there. btrfs replace ends in a transaction
> > deadlock.
> >
> > <snip>
>
>
> I have a working hypothesis what might be going wrong, however without a
> crash dump to investigate I can't really confirm it. Basically I think
> btrfs_rm_dev_replace_blocked is not seeing the decrement aka the store
> to running bios count since it's using cond_wake_up_nomb.
That sounds right. It does behave like a race condition, only locking
up some of the time. There's usually some read IO load on the system
during the replace (other than the replace itself).
There used to be two lockup cases: this one at the end of replace,
and some lockups that happened earlier in the replace operation.
The earlier ones haven't happened on a 5.10 kernel so far. I haven't
rigorously tested 5.5..5.9 on baremetal hardware, so I can't point to
a fix commit for the earlier lockups if there is one.
For reference the previous report is
https://lore.kernel.org/linux-btrfs/20200627024256.GT10769@hungrycats.org/
but it doesn't contain any specific information not already listed here.
> If I'm right
> then the following should fix it:
>
> @@ -1122,7 +1123,8 @@ void btrfs_bio_counter_inc_noblocked(struct btrfs_fs_info *fs_info)
>  void btrfs_bio_counter_sub(struct btrfs_fs_info *fs_info, s64 amount)
>  {
>  	percpu_counter_sub(&fs_info->dev_replace.bio_counter, amount);
> -	cond_wake_up_nomb(&fs_info->dev_replace.replace_wait);
> +	/* paired with the wait_event barrier in replace_blocked */
> +	cond_wake_up(&fs_info->dev_replace.replace_wait);
>  }
>
>
>
> Can you apply it and see if it can reproduce
I'll apply it and set up a test VM with a replica workload to run a loop of
device replaces.
> I don't know what's the incident rate of this bug
Roughly once per disk failure since 2020. Some replaces don't hit this
bug, others hit it two or more times.
> so you have to decide at what point it should
> be fixed. In any case this patch can't have any negative functional
> impact, it just makes the ordering slightly stronger to ensure the write
> happens before possibly waking up someone on the queue.
* Re: 'btrfs replace' hangs at end of replacing a device (v5.10.82)
2021-11-30 13:55 ` Nikolay Borisov
2021-11-30 21:18 ` Zygo Blaxell
@ 2022-01-01 19:48 ` Nikolay Borisov
2022-01-01 21:04 ` Zygo Blaxell
1 sibling, 1 reply; 11+ messages in thread
From: Nikolay Borisov @ 2022-01-01 19:48 UTC (permalink / raw)
To: Zygo Blaxell, linux-btrfs
On 30.11.21 at 15:55, Nikolay Borisov wrote:
>
>
> On 29.11.21 at 23:46, Zygo Blaxell wrote:
>> Not a new bug, but it's still there. btrfs replace ends in a transaction
>> deadlock.
>>
>> <snip>
>
>
> I have a working hypothesis what might be going wrong, however without a
> crash dump to investigate I can't really confirm it. Basically I think
> btrfs_rm_dev_replace_blocked is not seeing the decrement aka the store
> to running bios count since it's using cond_wake_up_nomb. If I'm right
> then the following should fix it:
>
> @@ -1122,7 +1123,8 @@ void btrfs_bio_counter_inc_noblocked(struct btrfs_fs_info *fs_info)
>  void btrfs_bio_counter_sub(struct btrfs_fs_info *fs_info, s64 amount)
>  {
>  	percpu_counter_sub(&fs_info->dev_replace.bio_counter, amount);
> -	cond_wake_up_nomb(&fs_info->dev_replace.replace_wait);
> +	/* paired with the wait_event barrier in replace_blocked */
> +	cond_wake_up(&fs_info->dev_replace.replace_wait);
> }
Ping, any feedback on this patch?
>
>
>
> Can you apply it and see if it can reproduce, I don't know what's the
> incident rate of this bug so you have to decide at what point it should
> be fixed. In any case this patch can't have any negative functional
> impact, it just makes the ordering slightly stronger to ensure the write
> happens before possibly waking up someone on the queue.
>
>
* Re: 'btrfs replace' hangs at end of replacing a device (v5.10.82)
2022-01-01 19:48 ` Nikolay Borisov
@ 2022-01-01 21:04 ` Zygo Blaxell
2022-01-01 21:07 ` Nikolay Borisov
0 siblings, 1 reply; 11+ messages in thread
From: Zygo Blaxell @ 2022-01-01 21:04 UTC (permalink / raw)
To: Nikolay Borisov; +Cc: linux-btrfs
On Sat, Jan 01, 2022 at 09:48:28PM +0200, Nikolay Borisov wrote:
>
>
> On 30.11.21 at 15:55, Nikolay Borisov wrote:
> >
> >
> > On 29.11.21 at 23:46, Zygo Blaxell wrote:
> >> Not a new bug, but it's still there. btrfs replace ends in a transaction
> >> deadlock.
> >>
> >> 'btrfs replace status' reports the replace completed and exits:
> >>
> >> Started on 27.Nov 02:05:07, finished on 29.Nov 14:11:20, 0 write errs, 0 uncorr. read errs
> >>
> >> Magic-SysRq-W:
> >>
> >> sysrq: Show Blocked State
> >> task:btrfs-transacti state:D stack: 0 pid:29509 ppid: 2 flags:0x00004000
> >> Call Trace:
> >> __schedule+0x35a/0xaa0
> >> schedule+0x68/0xe0
> >> schedule_preempt_disabled+0x15/0x20
> >> __mutex_lock+0x1ac/0x7e0
> >> ? lock_acquire+0x190/0x2d0
> >> ? btrfs_run_dev_stats+0x46/0x450
> >> ? rcu_read_lock_sched_held+0x16/0x80
> >> mutex_lock_nested+0x1b/0x20
> >> btrfs_run_dev_stats+0x46/0x450
> >> ? _raw_spin_unlock+0x23/0x30
> >> ? release_extent_buffer+0xa7/0xe0
> >> commit_cowonly_roots+0xa2/0x2a0
> >> ? btrfs_qgroup_account_extents+0x2d3/0x320
> >> btrfs_commit_transaction+0x51f/0xc60
> >> transaction_kthread+0x15a/0x180
> >> kthread+0x151/0x170
> >> ? btrfs_cleanup_transaction.isra.0+0x630/0x630
> >> ? kthread_create_worker_on_cpu+0x70/0x70
> >> ret_from_fork+0x22/0x30
> >> task:nfsd state:D stack: 0 pid:31445 ppid: 2 flags:0x00004000
> >> Call Trace:
> >> __schedule+0x35a/0xaa0
> >> schedule+0x68/0xe0
> >> btrfs_bio_counter_inc_blocked+0xe3/0x120
> >> ? add_wait_queue_exclusive+0x80/0x80
> >> btrfs_map_bio+0x4d/0x3f0
> >> ? rcu_read_lock_sched_held+0x16/0x80
> >> ? kmem_cache_alloc+0x2e8/0x360
> >> btrfs_submit_metadata_bio+0xe9/0x100
> >> submit_one_bio+0x67/0x80
> >> read_extent_buffer_pages+0x277/0x380
> >> btree_read_extent_buffer_pages+0xa1/0x120
> >> read_tree_block+0x3b/0x70
> >> read_block_for_search.isra.0+0x1a2/0x350
> >> ? rcu_read_lock_sched_held+0x16/0x80
> >> btrfs_search_slot+0x20f/0x910
> >> btrfs_lookup_dir_item+0x78/0xc0
> >> btrfs_lookup_dentry+0xca/0x540
> >> btrfs_lookup+0x13/0x40
> >> __lookup_slow+0x10d/0x1e0
> >> ? rcu_read_lock_sched_held+0x16/0x80
> >> lookup_one_len+0x77/0x90
> >> nfsd_lookup_dentry+0xe0/0x440 [nfsd]
> >> nfsd_lookup+0x89/0x150 [nfsd]
> >> nfsd4_lookup+0x1a/0x20 [nfsd]
> >> nfsd4_proc_compound+0x58b/0x8a0 [nfsd]
> >> nfsd_dispatch+0xe6/0x1a0 [nfsd]
> >> svc_process+0x55e/0x990 [sunrpc]
> >> ? nfsd_svc+0x6a0/0x6a0 [nfsd]
> >> nfsd+0x173/0x2a0 [nfsd]
> >> kthread+0x151/0x170
> >> ? nfsd_destroy+0x190/0x190 [nfsd]
> >> ? kthread_create_worker_on_cpu+0x70/0x70
> >> ret_from_fork+0x22/0x30
> >> task:btrfs state:D stack: 0 pid:14692 ppid: 14687 flags:0x00004000
> >> Call Trace:
> >> __schedule+0x35a/0xaa0
> >> schedule+0x68/0xe0
> >> btrfs_rm_dev_replace_blocked+0x8a/0xc0
> >> ? add_wait_queue_exclusive+0x80/0x80
> >> btrfs_dev_replace_finishing+0x59a/0x790
> >> btrfs_dev_replace_by_ioctl+0x59d/0x6f0
> >> ? btrfs_dev_replace_by_ioctl+0x59d/0x6f0
> >> btrfs_ioctl+0x27b2/0x2fe0
> >> ? _raw_spin_unlock_irq+0x28/0x40
> >> ? _raw_spin_unlock_irq+0x28/0x40
> >> ? trace_hardirqs_on+0x54/0xf0
> >> ? _raw_spin_unlock_irq+0x28/0x40
> >> ? do_sigaction+0xfd/0x250
> >> ? __might_fault+0x79/0x80
> >> __x64_sys_ioctl+0x91/0xc0
> >> ? __x64_sys_ioctl+0x91/0xc0
> >> do_syscall_64+0x38/0x90
> >> entry_SYSCALL_64_after_hwframe+0x44/0xa9
> >> RIP: 0033:0x7f1d8a0f4cc7
> >> RSP: 002b:00007ffc6cffe588 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
> >> RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007f1d8a0f4cc7
> >> RDX: 00007ffc6cfff400 RSI: 00000000ca289435 RDI: 0000000000000003
> >> RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
> >> R10: 0000000000000008 R11: 0000000000000202 R12: 0000000000000003
> >> R13: 00005583a17fb2e0 R14: 00007ffc6d001b7a R15: 0000000000000001
> >> task:mkdir state:D stack: 0 pid: 2349 ppid: 2346 flags:0x00000000
> >> Call Trace:
> >> __schedule+0x35a/0xaa0
> >> schedule+0x68/0xe0
> >> wait_current_trans+0xed/0x150
> >> ? add_wait_queue_exclusive+0x80/0x80
> >> start_transaction+0x551/0x700
> >> btrfs_start_transaction+0x1e/0x20
> >> btrfs_mkdir+0x5f/0x210
> >> vfs_mkdir+0x150/0x200
> >> do_mkdirat+0x118/0x140
> >> __x64_sys_mkdir+0x1b/0x20
> >> do_syscall_64+0x38/0x90
> >> entry_SYSCALL_64_after_hwframe+0x44/0xa9
> >> RIP: 0033:0x7f608a3c6b07
> >> RSP: 002b:00007fffbbd2bab8 EFLAGS: 00000246 ORIG_RAX: 0000000000000053
> >> RAX: ffffffffffffffda RBX: 0000563d79f1dc30 RCX: 00007f608a3c6b07
> >> RDX: 0000000000000000 RSI: 00000000000001ff RDI: 00007fffbbd2db5e
> >> RBP: 00007fffbbd2db39 R08: 0000000000000000 R09: 0000563d79f1dd60
> >> R10: fffffffffffff284 R11: 0000000000000246 R12: 00000000000001ff
> >> R13: 00007fffbbd2bc30 R14: 00007fffbbd2db5e R15: 0000000000000000
> >>
> >> After a reboot (still in degraded mode), btrfs finishes the replace in
> >> a little under 5 seconds:
> >>
> >> [ 508.664454] BTRFS info (device dm-34): continuing dev_replace from <missing disk> (devid 5) to target /dev/mapper/md17 @100%
> >> [ 513.285473] BTRFS info (device dm-34): dev_replace from <missing disk> (devid 5) to /dev/mapper/md17 finished
> >>
> >
> >
> > I have a working hypothesis about what might be going wrong, but without
> > a crash dump to investigate I can't really confirm it. Basically, I think
> > btrfs_rm_dev_replace_blocked() is not seeing the decrement (i.e. the store
> > to the running-bios count) because the wakeup uses cond_wake_up_nomb(),
> > which omits the memory barrier. If I'm right, the following should fix it:
> >
> > @@ -1122,7 +1123,8 @@ void btrfs_bio_counter_inc_noblocked(struct btrfs_fs_info *fs_info)
> > void btrfs_bio_counter_sub(struct btrfs_fs_info *fs_info, s64 amount)
> > {
> > percpu_counter_sub(&fs_info->dev_replace.bio_counter, amount);
> > - cond_wake_up_nomb(&fs_info->dev_replace.replace_wait);
> > + /* paired with the wait_event barrier in replace_blocked */
> > + cond_wake_up(&fs_info->dev_replace.replace_wait);
> > }
>
> Ping, any feedback on this patch?
I've had a VM running it: 37 replaces completed without hanging. In the
two failing cases, I hit the KASAN bug [1] and the dedupe/logical_ino/bees
lockup bug [2].
[1] https://lore.kernel.org/linux-btrfs/Ycqu1Wr8p3aJNcaf@hungrycats.org/
[2] https://lore.kernel.org/linux-btrfs/Ybz4JI+Kl2J7Py3z@hungrycats.org/
* Re: 'btrfs replace' hangs at end of replacing a device (v5.10.82)
2022-01-01 21:04 ` Zygo Blaxell
@ 2022-01-01 21:07 ` Nikolay Borisov
2022-01-01 23:12 ` Zygo Blaxell
From: Nikolay Borisov @ 2022-01-01 21:07 UTC (permalink / raw)
To: Zygo Blaxell; +Cc: linux-btrfs
On 1.01.22 23:04, Zygo Blaxell wrote:
> On Sat, Jan 01, 2022 at 09:48:28PM +0200, Nikolay Borisov wrote:
>>
>>
>> On 30.11.21 15:55, Nikolay Borisov wrote:
>>>
>>>
>>> On 29.11.21 23:46, Zygo Blaxell wrote:
>>>> [original bug report and stack traces snipped; quoted in full earlier
>>>> in the thread]
>>>
>>>
>>> I have a working hypothesis about what might be going wrong, but without
>>> a crash dump to investigate I can't really confirm it. Basically, I think
>>> btrfs_rm_dev_replace_blocked() is not seeing the decrement (i.e. the store
>>> to the running-bios count) because the wakeup uses cond_wake_up_nomb(),
>>> which omits the memory barrier. If I'm right, the following should fix it:
>>>
>>> @@ -1122,7 +1123,8 @@ void btrfs_bio_counter_inc_noblocked(struct btrfs_fs_info *fs_info)
>>> void btrfs_bio_counter_sub(struct btrfs_fs_info *fs_info, s64 amount)
>>> {
>>> percpu_counter_sub(&fs_info->dev_replace.bio_counter, amount);
>>> - cond_wake_up_nomb(&fs_info->dev_replace.replace_wait);
>>> + /* paired with the wait_event barrier in replace_blocked */
>>> + cond_wake_up(&fs_info->dev_replace.replace_wait);
>>> }
>>
>> Ping, any feedback on this patch?
>
> I've had a VM running it: 37 replaces completed without hanging. In the
> two failing cases, I hit the KASAN bug [1] and the dedupe/logical_ino/bees
> lockup bug [2].
How does that compare vs without the patch? The KASAN thing looks
raid56-related, so I'd discount it. The logical_ino lockup also isn't
directly related to this patch. So without the patch you would have seen
some nonzero rate of the replace lockup?
>
> [1] https://lore.kernel.org/linux-btrfs/Ycqu1Wr8p3aJNcaf@hungrycats.org/
> [2] https://lore.kernel.org/linux-btrfs/Ybz4JI+Kl2J7Py3z@hungrycats.org/
* Re: 'btrfs replace' hangs at end of replacing a device (v5.10.82)
2022-01-01 21:07 ` Nikolay Borisov
@ 2022-01-01 23:12 ` Zygo Blaxell
2022-01-26 15:02 ` Zygo Blaxell
From: Zygo Blaxell @ 2022-01-01 23:12 UTC (permalink / raw)
To: Nikolay Borisov; +Cc: linux-btrfs
On Sat, Jan 01, 2022 at 11:07:24PM +0200, Nikolay Borisov wrote:
> >>> [working hypothesis and patch snipped; quoted in full earlier in the
> >>> thread]
> >>
> >> Ping, any feedback on this patch?
> >
> > I've had a VM running it: 37 replaces completed without hanging. In the
> > two failing cases, I hit the KASAN bug [1] and the dedupe/logical_ino/bees
> > lockup bug [2].
>
> How does that compare vs without the patch? The KASAN thing looks
> raid56-related, so I'd discount it. The logical_ino lockup also isn't
> directly related to this patch.
I added the device replace loop to my existing regression test workload
(with a kernel that has KASAN enabled) and left it running since the
beginning of December. So it's testing for all the currently known and
active bugs at once (the ghost dirents bug, the logical_ino bug, and the
dev replace hang, as well as general regressions affecting our workload)
and counting how often each kind of failure event occurs.
The KASAN thing would probably disrupt bio counters during a replace?
It wouldn't explain replace hangs on RAID1, but maybe there are multiple
separate bugs here. KASAN isn't enabled on our baremetal hosts where
the majority of device replace operations are running, so any previous
UAF bugs would have been invisible. We are replacing drives in RAID1
and RAID5 profiles.
Note these test results are for 5.10. On kernels starting with 5.11,
the logical_ino bug hits every few hours so replace never gets a chance
to finish running. All the hangs do make it easier to test for ghost
dirents though. We use every part of the buffalo.
> So without the patch you would have seen some nonzero rate of the
> replace lockup?
Under the same conditions without the patch, there would be between 10
and 30 replace hangs by now.
> > [1] https://lore.kernel.org/linux-btrfs/Ycqu1Wr8p3aJNcaf@hungrycats.org/
> > [2] https://lore.kernel.org/linux-btrfs/Ybz4JI+Kl2J7Py3z@hungrycats.org/
* Re: 'btrfs replace' hangs at end of replacing a device (v5.10.82)
2022-01-01 23:12 ` Zygo Blaxell
@ 2022-01-26 15:02 ` Zygo Blaxell
From: Zygo Blaxell @ 2022-01-26 15:02 UTC (permalink / raw)
To: Nikolay Borisov; +Cc: linux-btrfs
On Sat, Jan 01, 2022 at 06:12:55PM -0500, Zygo Blaxell wrote:
> On Sat, Jan 01, 2022 at 11:07:24PM +0200, Nikolay Borisov wrote:
> > >>> [working hypothesis and patch snipped; quoted in full earlier in
> > >>> the thread]
> > >>
> > >> Ping, any feedback on this patch?
> > >
> > > I've had a VM running it: 37 replaces completed without hanging. In the
> > > two failing cases, I hit the KASAN bug [1] and the dedupe/logical_ino/bees
> > > lockup bug [2].
> >
> > How does that compare vs without the patch? The KASAN thing looks
> > raid56-related, so I'd discount it. The logical_ino lockup also isn't
> > directly related to this patch.
>
> I added the device replace loop to my existing regression test workload
> (with a kernel that has KASAN enabled) and left it running since the
> beginning of December.
I changed the setup so that the filesystem started in degraded mode with
one device missing. With raid5 data, raid1 metadata, and some active
writers on the filesystem, the replace crashes with many different
problems, all manifestations of a use-after-free bug. There are lockdep
complaints about a thread taking the same lock twice, assorted list and
tree corruptions, NULL derefs in the block IO code, and KASAN UAF
complaints like the one I posted at [1]. With a mean uptime of 7
minutes, the replace gets maybe 1% done between crashes. It runs longer
between crashes when replace is the only thing touching the filesystem.
Previously my test was running replace with the replaced drive still
online, and apparently that case almost never has problems. That's
a different code path. Obvious in hindsight; I should have noticed
that difference much earlier.
I'm not 100% sure if this is the same bug I found earlier, but it is
an easily reproducible bug, so that's something. Hell, I wouldn't
be surprised if it also caused all the degraded raid5 read corruption
problems too.
> So it's testing for all the currently known and active
> bugs at once (the ghost dirents bug, the logical_ino bug, and the dev
> replace hang, as well as general regressions affecting our workload)
> and counting how often each kind of failure event occurs.
>
> The KASAN thing would probably disrupt bio counters during a replace?
> It wouldn't explain replace hangs on RAID1, but maybe there are multiple
> separate bugs here. KASAN isn't enabled on our baremetal hosts where
> the majority of device replace operations are running, so any previous
> UAF bugs would have been invisible. We are replacing drives in RAID1
> and RAID5 profiles.
>
> Note these test results are for 5.10. On kernels starting with 5.11,
> the logical_ino bug hits every few hours so replace never gets a chance
> to finish running. All the hangs do make it easier to test for ghost
> dirents though. We use every part of the buffalo.
>
> > So without the patch you should have had
> > some incident rate greater than 0 of the replace lock up ?
>
> Under the same conditions without the patch, there would be between 10
> and 30 replace hangs by now.
>
> > > [1] https://lore.kernel.org/linux-btrfs/Ycqu1Wr8p3aJNcaf@hungrycats.org/
> > > [2] https://lore.kernel.org/linux-btrfs/Ybz4JI+Kl2J7Py3z@hungrycats.org/
^ permalink raw reply [flat|nested] 11+ messages in thread
end of thread
Thread overview: 11+ messages
2021-11-29 21:46 'btrfs replace' hangs at end of replacing a device (v5.10.82) Zygo Blaxell
2021-11-30 10:08 ` Nikolay Borisov
2021-11-30 12:36 ` Nikolay Borisov
2021-11-30 15:23 ` Zygo Blaxell
2021-11-30 13:55 ` Nikolay Borisov
2021-11-30 21:18 ` Zygo Blaxell
2022-01-01 19:48 ` Nikolay Borisov
2022-01-01 21:04 ` Zygo Blaxell
2022-01-01 21:07 ` Nikolay Borisov
2022-01-01 23:12 ` Zygo Blaxell
2022-01-26 15:02 ` Zygo Blaxell