* Re: [PATCH v3] sched/core: Always flush pending blk_plug
       [not found] ` <20220709161956.GA20046@windriver.com>
@ 2022-07-10  1:01   ` Hillf Danton
  2022-07-12 18:29     ` Paul Gortmaker
  0 siblings, 1 reply; 3+ messages in thread
From: Hillf Danton @ 2022-07-10  1:01 UTC (permalink / raw)
  To: Paul Gortmaker
  Cc: John Keeping, linux-kernel, linux-mm, Peter Zijlstra,
	Thomas Gleixner, Mel Gorman, Theodore Ts'o

On Sat, 9 Jul 2022 12:19:56 -0400 Paul Gortmaker wrote:
> On 08/07/2022 (Fri 17:27) John Keeping wrote:
> > With CONFIG_PREEMPT_RT, it is possible to hit a deadlock between two
> > normal priority tasks (SCHED_OTHER, nice level zero):
> > 
> > 	INFO: task kworker/u8:0:8 blocked for more than 491 seconds.
> > 	      Not tainted 5.15.49-rt46 #1
> > 	"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > 	task:kworker/u8:0    state:D stack:    0 pid:    8 ppid:     2 flags:0x00000000
> > 	Workqueue: writeback wb_workfn (flush-7:0)
> 
> This is great.  I was on-and-off looking at an ext4 deadlock triggered
> by LTP's "ftest01" - and was just putting my notes together this morning
> to clarify for myself what I thought was going on before I forgot.
> 
> Which is good, because I could then recognize the very similar patterns
> in what you saw to what I'd observed/deduced.  Otherwise I'd probably not
> have had your backtraces grab my attention at all.
> 
> Long story short, I applied this to v5.19-rc3-rt5 and it resolves the
> LTP/ext4 deadlock issue.  So...
> 
> Tested-by: Paul Gortmaker <paul.gortmaker@windriver.com>
> 
> As I'd already put together my ext4 debug info, I'll include it below
> in case it helps future people searching for LTP and ftest01.
> 
> Thanks,
> Paul.
> 
> 	-----------------
> 
> 
> There is a reproducible deadlock in ext4 on preempt-rt, triggered by LTP's
> "ftest01" test.
> 
> Reproducer details are at the bottom.  What seems to happen is as follows:
> 
> The kworker doing writeback wb_workfn (pid 213 below) stalls on the 
> one and only down_write in ext4_map_blocks() and never moves again.
> 
>       /*
>        * New blocks allocate and/or writing to unwritten extent
>        * will possibly result in updating i_data, so we take
>        * the write lock of i_data_sem, and call get_block()
>        * with create == 1 flag.
>        */
>       down_write(&EXT4_I(inode)->i_data_sem);
> 
> So who is blocking the kworker?  There are two of the ftest01 processes
> (pid 818 and pid 821) doing ext4 truncate which look like this:
> 
>   ext4_truncate+0x46a/0x660
>    --> ext4_ext_truncate+0xd6/0xf0
>        --> ext4_ext_remove_space+0x59f/0x1830
>            --> jbd2_journal_get_write_access+0x51/0x80
>                --> do_get_write_access+0x2ad/0x550
>                    --> complete_all+0x60/0x60
>                        --> bit_wait_io+0x11/0x60
> 
> They too are stalled in D state, waiting there forever - presumably
> waiting on the kworker to make the progress they need.
> 
> Here is where I think lockdep shows us the problem:
> 
>   5 locks held by ftest01/818:
>    [...]
>    #4: ffff957c2a679958 (&ei->i_data_sem){++++}-{3:3}, at: ext4_truncate+0x443/0x660
> 
>   5 locks held by ftest01/821:
>    [...]
>    #4: ffff957c2a67bb78 (&ei->i_data_sem){++++}-{3:3}, at: ext4_truncate+0x443/0x660
> 
> Yes - that is the one the kworker was trying the down_write() on.
> 
> Also note that the bit_wait is a big telltale sign here - there is only
> one in jbd2's do_get_write_access() function:
> 
>       /*
>        * There is one case we have to be very careful about.  If the
>        * committing transaction is currently writing this buffer out to disk
>        * and has NOT made a copy-out, then we cannot modify the buffer
>        * contents at all right now.  The essence of copy-out is that it is
>        * the extra copy, not the primary copy, which gets journaled.  If the
>        * primary copy is already going to disk then we cannot do copy-out
>        * here.
>        */
>       if (buffer_shadow(bh)) {
>               JBUFFER_TRACE(jh, "on shadow: sleep");
>               spin_unlock(&jh->b_state_lock);
>               wait_on_bit_io(&bh->b_state, BH_Shadow, TASK_UNINTERRUPTIBLE);
>               goto repeat;
>       }
> 
> Presumably this is a corner case that preempt-rt makes easier to trigger?
> 
> The full trace with interspersed lockdep info follows.
> 
> Various detailed reproducer information follows that.
> 
> 	-----------------------
> 
> sysrq-w backtrace information interspersed with sysrq-d lockdep info
> on a 5.19-rc2-rt3 kernel:
> 
>  task:kworker/u129:34 state:D stack:    0 pid:  213 ppid:     2 flags:0x00004000
>  Workqueue: writeback wb_workfn (flush-7:0)
>  Call Trace:
>   <TASK>
>   __schedule+0x329/0xae0
>   ? __this_cpu_preempt_check+0x13/0x20
>   ? __rt_mutex_slowlock.isra.21.constprop.25+0xc0e/0xc40
>   schedule+0x48/0xc0
>   __rt_mutex_slowlock.isra.21.constprop.25+0x1b8/0xc40
>   rwbase_write_lock+0x98/0x6e0
>   down_write+0x46/0x50
>   ? ext4_map_blocks+0x156/0x5c0
>   ext4_map_blocks+0x156/0x5c0
>   ext4_writepages+0xa1a/0x1290
>   ? __this_cpu_preempt_check+0x13/0x20
>   ? lockdep_hardirqs_on+0xcd/0x150
>   do_writepages+0xd2/0x1a0
>   __writeback_single_inode+0x64/0x850
>   writeback_sb_inodes+0x22d/0x5b0
>   __writeback_inodes_wb+0x67/0xe0
>   wb_writeback+0x269/0x5f0
>   ? get_nr_inodes+0x49/0x70
>   wb_workfn+0x43d/0x780
>   ? __this_cpu_preempt_check+0x13/0x20
>   process_one_work+0x286/0x710
>   worker_thread+0x3c/0x3d0
>   ? process_one_work+0x710/0x710
>   kthread+0x13b/0x150
>   ? kthread_complete_and_exit+0x20/0x20
>   ret_from_fork+0x1f/0x30
>   </TASK>
> 
>   6 locks held by kworker/u129:34/213:
>    #0: ffff957c25e9ef28 ((wq_completion)writeback){+.+.}-{0:0}, at: process_one_work+0x201/0x710
>    #1: ffffb7e9c3797e68 ((work_completion)(&(&wb->dwork)->work)){+.+.}-{0:0}, at: process_one_work+0x201/0x710
>    #2: ffff957b9c9b70d0 (&type->s_umount_key#34){++++}-{3:3}, at: trylock_super+0x1b/0x50
>    #3: ffff957b8a9b2d18 (&sbi->s_writepages_rwsem){.+.+}-{0:0}, at: do_writepages+0xd2/0x1a0
>    #4: ffff957b9c9b5be8 (jbd2_handle){++++}-{0:0}, at: start_this_handle+0x13e/0x5f0
>    #5: ffff957c2a679958 (&ei->i_data_sem){++++}-{3:3}, at: ext4_map_blocks+0x156/0x5c0
> 
> 
>  task:jbd2/loop0-8    state:D stack:    0 pid:  679 ppid:     2 flags:0x00004000
>  Call Trace:
>   <TASK>
>   __schedule+0x329/0xae0
>   schedule+0x48/0xc0
>   jbd2_journal_wait_updates+0x7c/0xf0
>   ? schedstat_stop+0x10/0x10
>   jbd2_journal_commit_transaction+0x284/0x20e0
>   ? __this_cpu_preempt_check+0x13/0x20
>   ? find_held_lock+0x35/0xa0
>   ? _raw_spin_unlock_irqrestore+0x86/0x8f
>   ? _raw_spin_unlock_irqrestore+0x86/0x8f
>   ? try_to_del_timer_sync+0x53/0x80
>   ? trace_hardirqs_on+0x3b/0x120
>   ? try_to_del_timer_sync+0x53/0x80
>   kjournald2+0xd5/0x2b0
>   ? schedstat_stop+0x10/0x10
>   ? commit_timeout+0x20/0x20
>   kthread+0x13b/0x150
>   ? kthread_complete_and_exit+0x20/0x20
>   ret_from_fork+0x1f/0x30
>   </TASK>
> 
> 
>  task:ftest01         state:D stack:    0 pid:  817 ppid:   816 flags:0x00004002
>  Call Trace:
>   <TASK>
>   __schedule+0x329/0xae0
>   ? __x64_sys_tee+0xd0/0xd0
>   ? __x64_sys_tee+0xd0/0xd0
>   schedule+0x48/0xc0
>   wb_wait_for_completion+0x62/0x90
>   ? schedstat_stop+0x10/0x10
>   sync_inodes_sb+0xdd/0x460
>   ? __this_cpu_preempt_check+0x13/0x20
>   ? __x64_sys_tee+0xd0/0xd0
>   sync_inodes_one_sb+0x15/0x20
>   iterate_supers+0x94/0x100
>   ksys_sync+0x42/0xa0
>   __do_sys_sync+0xe/0x20
>   do_syscall_64+0x3f/0x90
>   entry_SYSCALL_64_after_hwframe+0x46/0xb0
>  RIP: 0033:0x7f3f9dae2cdb
>  RSP: 002b:00007ffe06430318 EFLAGS: 00000246 ORIG_RAX: 00000000000000a2
>  RAX: ffffffffffffffda RBX: 0000000000000800 RCX: 00007f3f9dae2cdb
>  RDX: 0000000000000040 RSI: 0000561f57de52f0 RDI: 0000561f57de5340
>  RBP: 00000000000ff000 R08: 0000000000000000 R09: 00007f3f9dbce0a0
>  R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
>  R13: 0000000000000207 R14: 0000000000000049 R15: 0000000000000003
>   </TASK>
> 
>   1 lock held by ftest01/817:
>    #0: ffff957b9c9b70d0 (&type->s_umount_key#34){++++}-{3:3}, at: iterate_supers+0x7d/0x100
> 
> 
>  task:ftest01         state:D stack:    0 pid:  818 ppid:   816 flags:0x00004002
>  Call Trace:
>   <TASK>
>   __schedule+0x329/0xae0
>   ? bit_wait+0x60/0x60
>   schedule+0x48/0xc0
>   io_schedule+0x16/0x40
>   bit_wait_io+0x11/0x60
>   __wait_on_bit_lock+0x5e/0xb0
>   out_of_line_wait_on_bit_lock+0x93/0xb0
>   ? complete_all+0x60/0x60
>   __lock_buffer+0x3f/0x50

Pid 818 was locking the buffer.

>   do_get_write_access+0x2ad/0x550
>   jbd2_journal_get_write_access+0x51/0x80
>   __ext4_journal_get_write_access+0xf9/0x1a0
>   ext4_ext_get_access.isra.32+0x34/0x50
>   ext4_ext_remove_space+0x59f/0x1830
>   ? ext4_ext_truncate+0xa2/0xf0
>   ? ext4_ext_truncate+0x5e/0xf0
>   ext4_ext_truncate+0xd6/0xf0
>   ext4_truncate+0x46a/0x660
>   ext4_setattr+0x6db/0xa40
>   notify_change+0x3e5/0x540
>   do_truncate+0x7b/0xc0
>   ? do_truncate+0x7b/0xc0
>   do_sys_ftruncate+0x164/0x2f0
>   __x64_sys_ftruncate+0x1b/0x20
>   do_syscall_64+0x3f/0x90
>   entry_SYSCALL_64_after_hwframe+0x46/0xb0
>  RIP: 0033:0x7f3f9dae3f7b
>  RSP: 002b:00007ffe06430318 EFLAGS: 00000202 ORIG_RAX: 000000000000004d
>  RAX: ffffffffffffffda RBX: 00000000000001a5 RCX: 00007f3f9dae3f7b
>  RDX: 00000000000001a5 RSI: 00000000000d2800 RDI: 0000000000000003
>  RBP: 00000000000001a5 R08: 00007f3f9dbce090 R09: 00007f3f9dbce0a0
>  R10: 0000000000000008 R11: 0000000000000202 R12: 0000000000000000
>  R13: 0000000000000207 R14: 0000000000000043 R15: 0000000000000003
>   </TASK>
> 
>   5 locks held by ftest01/818:
>    #0: ffff957b9c9b74e8 (sb_writers#3){.+.+}-{0:0}, at: __x64_sys_ftruncate+0x1b/0x20
>    #1: ffff957c2a679b00 (&sb->s_type->i_mutex_key#10){+.+.}-{3:3}, at: do_truncate+0x6a/0xc0
>    #2: ffff957c2a679cc0 (mapping.invalidate_lock){++++}-{3:3}, at: ext4_setattr+0x39a/0xa40
>    #3: ffff957b9c9b5be8 (jbd2_handle){++++}-{0:0}, at: start_this_handle+0x13e/0x5f0
>    #4: ffff957c2a679958 (&ei->i_data_sem){++++}-{3:3}, at: ext4_truncate+0x443/0x660
> 
> 
>  task:ftest01         state:D stack:    0 pid:  819 ppid:   816 flags:0x00004002
>  Call Trace:
>   <TASK>
>   __schedule+0x329/0xae0
>   ? __x64_sys_tee+0xd0/0xd0
>   ? __x64_sys_tee+0xd0/0xd0
>   schedule+0x48/0xc0
>   wb_wait_for_completion+0x62/0x90
>   ? schedstat_stop+0x10/0x10
>   sync_inodes_sb+0xdd/0x460
>   ? __this_cpu_preempt_check+0x13/0x20
>   ? __x64_sys_tee+0xd0/0xd0
>   sync_inodes_one_sb+0x15/0x20
>   iterate_supers+0x94/0x100
>   ksys_sync+0x42/0xa0
>   __do_sys_sync+0xe/0x20
>   do_syscall_64+0x3f/0x90
>   entry_SYSCALL_64_after_hwframe+0x46/0xb0
>  RIP: 0033:0x7f3f9dae2cdb
>  RSP: 002b:00007ffe06430318 EFLAGS: 00000246 ORIG_RAX: 00000000000000a2
>  RAX: ffffffffffffffda RBX: 0000000000000800 RCX: 00007f3f9dae2cdb
>  RDX: 0000000000000040 RSI: 0000561f57de52f0 RDI: 0000561f57de5340
>  RBP: 00000000000c2800 R08: 00000000fffff800 R09: 0000000000004000
>  R10: 0000000000000010 R11: 0000000000000246 R12: 0000000000000000
>  R13: 0000000000000207 R14: 0000000000000075 R15: 0000000000000003
>   </TASK>
> 
>   1 lock held by ftest01/819:
>    #0: ffff957b9c9b70d0 (&type->s_umount_key#34){++++}-{3:3}, at: iterate_supers+0x7d/0x100
> 
> 
>  task:ftest01         state:D stack:    0 pid:  820 ppid:   816 flags:0x00004002
>  Call Trace:
>   <TASK>
>   __schedule+0x329/0xae0
>   ? __x64_sys_tee+0xd0/0xd0
>   ? __x64_sys_tee+0xd0/0xd0
>   schedule+0x48/0xc0
>   wb_wait_for_completion+0x62/0x90
>   ? schedstat_stop+0x10/0x10
>   sync_inodes_sb+0xdd/0x460
>   ? __this_cpu_preempt_check+0x13/0x20
>   ? __x64_sys_tee+0xd0/0xd0
>   sync_inodes_one_sb+0x15/0x20
>   iterate_supers+0x94/0x100
>   ksys_sync+0x42/0xa0
>   __do_sys_sync+0xe/0x20
>   do_syscall_64+0x3f/0x90
>   entry_SYSCALL_64_after_hwframe+0x46/0xb0
>  RIP: 0033:0x7f3f9dae2cdb
>  RSP: 002b:00007ffe06430318 EFLAGS: 00000246 ORIG_RAX: 00000000000000a2
>  RAX: ffffffffffffffda RBX: 0000000000000800 RCX: 00007f3f9dae2cdb
>  RDX: 0000000000000040 RSI: 0000561f57de52f0 RDI: 0000561f57de5340
>  RBP: 000000000006a800 R08: 00000000fffff800 R09: 0000000000004000
>  R10: 0000000000000010 R11: 0000000000000246 R12: 0000000000000000
>  R13: 0000000000000207 R14: 0000000000000069 R15: 0000000000000003
>   </TASK>
> 
>   1 lock held by ftest01/820:
>    #0: ffff957b9c9b70d0 (&type->s_umount_key#34){++++}-{3:3}, at: iterate_supers+0x7d/0x100
> 
> 
>  task:ftest01         state:D stack:    0 pid:  821 ppid:   816 flags:0x00000002
>  Call Trace:
>   <TASK>
>   __schedule+0x329/0xae0
>   ? bit_wait+0x60/0x60
>   schedule+0x48/0xc0
>   io_schedule+0x16/0x40
>   bit_wait_io+0x11/0x60
>   __wait_on_bit_lock+0x5e/0xb0
>   out_of_line_wait_on_bit_lock+0x93/0xb0
>   ? complete_all+0x60/0x60
>   __lock_buffer+0x3f/0x50

So was pid 821.

>   do_get_write_access+0x2ad/0x550
>   jbd2_journal_get_write_access+0x51/0x80
>   __ext4_journal_get_write_access+0xf9/0x1a0
>   ext4_ext_get_access.isra.32+0x34/0x50
>   ext4_ext_remove_space+0x59f/0x1830
>   ? ext4_ext_truncate+0xa2/0xf0
>   ? ext4_ext_truncate+0x5e/0xf0
>   ext4_ext_truncate+0xd6/0xf0
>   ext4_truncate+0x46a/0x660
>   ext4_setattr+0x6db/0xa40
>   notify_change+0x3e5/0x540
>   do_truncate+0x7b/0xc0
>   ? do_truncate+0x7b/0xc0
>   do_sys_ftruncate+0x164/0x2f0
>   __x64_sys_ftruncate+0x1b/0x20
>   do_syscall_64+0x3f/0x90
>   entry_SYSCALL_64_after_hwframe+0x46/0xb0
>  RIP: 0033:0x7f3f9dae3f7b
>  RSP: 002b:00007ffe06430318 EFLAGS: 00000202 ORIG_RAX: 000000000000004d
>  RAX: ffffffffffffffda RBX: 000000000000016a RCX: 00007f3f9dae3f7b
>  RDX: 000000000000016a RSI: 00000000000b5000 RDI: 0000000000000003
>  RBP: 000000000000016a R08: 00007f3f9dbce080 R09: 00007f3f9dbce0a0
>  R10: 0000000000000002 R11: 0000000000000202 R12: 0000000000000000
>  R13: 0000000000000207 R14: 0000000000000040 R15: 0000000000000003
>   </TASK>
> 
>   5 locks held by ftest01/821:
>    #0: ffff957b9c9b74e8 (sb_writers#3){.+.+}-{0:0}, at: __x64_sys_ftruncate+0x1b/0x20
>    #1: ffff957c2a67bd20 (&sb->s_type->i_mutex_key#10){+.+.}-{3:3}, at: do_truncate+0x6a/0xc0
>    #2: ffff957c2a67bee0 (mapping.invalidate_lock){++++}-{3:3}, at: ext4_setattr+0x39a/0xa40
>    #3: ffff957b9c9b5be8 (jbd2_handle){++++}-{0:0}, at: start_this_handle+0x13e/0x5f0
>    #4: ffff957c2a67bb78 (&ei->i_data_sem){++++}-{3:3}, at: ext4_truncate+0x443/0x660
> 

Looks like an ABBA deadlock that should be fed to lockdep, given what was
blocking the flusher (pid 213).
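
For illustration only, here is a minimal user-space sketch of the generic ABBA
pattern suspected above. The two pthread mutexes are purely illustrative
stand-ins for i_data_sem and the buffer lock (the real cycle in the traces runs
through buffer-bit waits and the journal commit, not two plain mutexes), and
the program intentionally deadlocks; if these were lockdep-annotated kernel
locks, the conflicting A->B vs B->A ordering is what lockdep would report.

/* build: cc -pthread abba_sketch.c */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

/* Illustrative stand-ins only, not kernel objects. */
static pthread_mutex_t i_data_sem = PTHREAD_MUTEX_INITIALIZER;	/* "A" */
static pthread_mutex_t buffer_lock = PTHREAD_MUTEX_INITIALIZER;	/* "B" */

/* Truncate-like path: takes A, then wants B. */
static void *truncater(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&i_data_sem);
	sleep(1);				/* widen the race window */
	pthread_mutex_lock(&buffer_lock);	/* blocks forever */
	pthread_mutex_unlock(&buffer_lock);
	pthread_mutex_unlock(&i_data_sem);
	return NULL;
}

/* Writeback-like path: takes B, then wants A. */
static void *flusher(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&buffer_lock);
	sleep(1);
	pthread_mutex_lock(&i_data_sem);	/* blocks forever */
	pthread_mutex_unlock(&i_data_sem);
	pthread_mutex_unlock(&buffer_lock);
	return NULL;
}

int main(void)
{
	pthread_t t1, t2;

	pthread_create(&t1, NULL, truncater, NULL);
	pthread_create(&t2, NULL, flusher, NULL);
	pthread_join(t1, NULL);		/* never returns: classic ABBA deadlock */
	pthread_join(t2, NULL);
	puts("not reached");
	return 0;
}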

> 		--------------------------------------
> 
> Reproducer:
> 
> Get latest preempt-rt kernel:
> 	git://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git
> 	checkout v5.19-rc3-rt5 (or linux-5.19.y-rt branch)
> 	make defconfig
> 	manually enable CONFIG_EXPERT and then CONFIG_PREEMPT_RT
> 	make oldconfig and take defaults for other options hidden behind EXPERT
> 	can optionally enable LOCKDEP options so sysrq-d works.
> 
> Get LTP ; build + install
> 	https://github.com/linux-test-project/ltp.git
> 
> Boot and then run reproducer:
>    dd if=/dev/zero of=/root/tmp_mount_file bs=1M count=512
>    mkfs.ext4 -F /root/tmp_mount_file
>    mount /root/tmp_mount_file /mnt/
>    /opt/ltp/runltp -f fs -s ftest01 -d /mnt/
>    umount /mnt
> 
>    The loop device isn't strictly necessary, but it contains the fs fallout
>    when the inevitable reboot is required to get rid of the D-state processes.
> 
>    Test completes on v5.10-rt, v5.12-rt - should take about 2-3m total.
>    Test hangs forever on v5.15-rt and latest v5.19-rc-rt.  Test threads
>    in D state - any other i/o directed at /mnt also hangs in D state.
> 

Make lock_buffer have mutex_lock() semantics in an attempt to catch the
deadlock above.

Only for thoughts now.

Hillf

--- mainline/fs/buffer.c
+++ b/fs/buffer.c
@@ -66,12 +66,17 @@ EXPORT_SYMBOL(touch_buffer);
 
 void __lock_buffer(struct buffer_head *bh)
 {
+	/* lock_buffer has mutex_lock() semantics here */
+	mutex_acquire(&bh->b_dep_map, 0, 0, _RET_IP_);
+
 	wait_on_bit_lock_io(&bh->b_state, BH_Lock, TASK_UNINTERRUPTIBLE);
 }
 EXPORT_SYMBOL(__lock_buffer);
 
 void unlock_buffer(struct buffer_head *bh)
 {
+	mutex_release(&bh->b_dep_map, _RET_IP_);
+
 	clear_bit_unlock(BH_Lock, &bh->b_state);
 	smp_mb__after_atomic();
 	wake_up_bit(&bh->b_state, BH_Lock);
@@ -3289,12 +3294,22 @@ static void recalc_bh_state(void)
 	buffer_heads_over_limit = (tot > max_buffer_heads);
 }
 
+static void buffer_head_init_lockdep(struct buffer_head *buf)
+{
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+	static struct lock_class_key bh_key;
+
+	lockdep_init_map_wait(&buf->b_dep_map, "buffer_head_lock", &bh_key, 0, LD_WAIT_SLEEP);
+#endif
+}
+
 struct buffer_head *alloc_buffer_head(gfp_t gfp_flags)
 {
 	struct buffer_head *ret = kmem_cache_zalloc(bh_cachep, gfp_flags);
 	if (ret) {
 		INIT_LIST_HEAD(&ret->b_assoc_buffers);
 		spin_lock_init(&ret->b_uptodate_lock);
+		buffer_head_init_lockdep(ret);
 		preempt_disable();
 		__this_cpu_inc(bh_accounting.nr);
 		recalc_bh_state();
--- a/include/linux/buffer_head.h
+++ b/include/linux/buffer_head.h
@@ -76,6 +76,9 @@ struct buffer_head {
 	spinlock_t b_uptodate_lock;	/* Used by the first bh in a page, to
 					 * serialise IO completion of other
 					 * buffers in the page */
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+	struct lockdep_map	b_dep_map;
+#endif
 };
 
 /*



* Re: [PATCH v3] sched/core: Always flush pending blk_plug
  2022-07-10  1:01   ` [PATCH v3] sched/core: Always flush pending blk_plug Hillf Danton
@ 2022-07-12 18:29     ` Paul Gortmaker
  2022-07-12 22:38       ` Hillf Danton
  0 siblings, 1 reply; 3+ messages in thread
From: Paul Gortmaker @ 2022-07-12 18:29 UTC (permalink / raw)
  To: Hillf Danton
  Cc: John Keeping, linux-kernel, linux-mm, Peter Zijlstra,
	Thomas Gleixner, Mel Gorman, Theodore Ts'o

[Re: [PATCH v3] sched/core: Always flush pending blk_plug] On 10/07/2022 (Sun 09:01) Hillf Danton wrote:

> Looks like an ABBA deadlock that should be fed to lockdep, given what was
> blocking the flusher (pid 213).

[...]

> Make lock_buffer have mutex_lock() semantics in an attempt to catch the
> deadlock above.
> 
> Only for thoughts now.

Thanks for the test patch - I'm keeping it as a reference for future
use.

So, preempt-rt issues an early lockdep splat at partition probe bootup,
both with and without the original v3 patch from this thread.

Of course then I figured I'd try the debug patch on vanilla v5.19-rc3
and we get pretty much the same lockdep complaint.


   sd 1:0:0:0: Attached scsi generic sg1 type 0
   sd 1:0:0:0: [sdb] Preferred minimum I/O size 512 bytes
   scsi 2:0:0:0: CD-ROM            HL-DT-ST DVD-ROM DH30N    A101 PQ: 0 ANSI: 5

   =====================================
   WARNING: bad unlock balance detected!
   5.19.0-rc3-dirty #2 Not tainted
   -------------------------------------
   swapper/2/0 is trying to release lock (buffer_head_lock) at:
   [<ffffffffbc494a6b>] end_buffer_async_read+0x5b/0x180
   but there are no more locks to release!
   
   other info that might help us debug this:
   1 lock held by swapper/2/0:
    #0: ffff8bee27744080 (&ret->b_uptodate_lock){....}-{2:2}, at: end_buffer_async_read+0x47/0x180
   
   stack backtrace:
   CPU: 2 PID: 0 Comm: swapper/2 Not tainted 5.19.0-rc3-dirty #2
   Hardware name: Dell Inc. Precision WorkStation T5500  /0D883F, BIOS A16 05/28/2013
   Call Trace:
    <IRQ>
    dump_stack_lvl+0x40/0x5c
    lock_release+0x245/0x3f0
    unlock_buffer+0x15/0x30
    end_buffer_async_read+0x5b/0x180
    end_bio_bh_io_sync+0x1e/0x40
    blk_update_request+0x9a/0x470
    scsi_end_request+0x27/0x190
    scsi_io_completion+0x3c/0x580
    blk_complete_reqs+0x39/0x50
    __do_softirq+0x11d/0x344
    irq_exit_rcu+0xa9/0xc0
    common_interrupt+0xa5/0xc0
    </IRQ>
    <TASK>
    asm_common_interrupt+0x27/0x40
   RIP: 0010:cpuidle_enter_state+0x12d/0x3f0
   Code: 49 43 0f 84 b7 02 00 00 31 ff e8 fe 1c 74 ff 45 84 ff 74 12 9c 58 f6 c4 02 0f 85 8e 02 00 00 31 ff e8 97 95 7a ff fb 45 85 f6 <0f> 88 12 01 00 00 49 63 d6 4c 2b 24 24 48 8d 04 52 48 8d 04 82 49
   RSP: 0000:ffffa2890013fe90 EFLAGS: 00000206
   RAX: 0000000000000000 RBX: 0000000000000003 RCX: 0000000000000002
   RDX: 0000000080000002 RSI: ffffffffbd78d911 RDI: 00000000ffffffff
   RBP: ffff8bee2739a400 R08: 0000000000000000 R09: 00000000001faf40
   R10: ffffa2890013fdc8 R11: 0000000000000000 R12: 00000000a7b149d6
   R13: ffffffffbdbee3a0 R14: 0000000000000003 R15: 0000000000000001
    cpuidle_enter+0x24/0x40
    do_idle+0x1e3/0x230
    cpu_startup_entry+0x14/0x20
    start_secondary+0xe8/0xf0
    secondary_startup_64_no_verify+0xe0/0xeb
    </TASK>
    sda: sda1 sda2 sda3
   sd 0:0:0:0: [sda] Attached SCSI disk
    sdb: sdb1 sdb2
   sd 1:0:0:0: [sdb] Attached SCSI disk

Not quite sure what to make of that.

Paul.
--



* Re: [PATCH v3] sched/core: Always flush pending blk_plug
  2022-07-12 18:29     ` Paul Gortmaker
@ 2022-07-12 22:38       ` Hillf Danton
  0 siblings, 0 replies; 3+ messages in thread
From: Hillf Danton @ 2022-07-12 22:38 UTC (permalink / raw)
  To: Paul Gortmaker
  Cc: John Keeping, linux-kernel, linux-mm, Peter Zijlstra,
	Thomas Gleixner, Mel Gorman, Theodore Ts'o

On Tue, 12 Jul 2022 14:29:14 -0400 Paul Gortmaker wrote:
> On 10/07/2022 (Sun 09:01) Hillf Danton wrote:
> 
> > Looks like an ABBA deadlock that should be fed to lockdep, given what was
> > blocking the flusher (pid 213).
> 
> [...]
> 
> > Make lock_buffer have mutex_lock() semantics in an attempt to catch the
> > deadlock above.
> > 
> > Only for thoughts now.
> 
> Thanks for the test patch - I'm keeping it as a reference for future
> use.
> 
> So, preempt-rt issues an early lockdep splat at partition probe bootup,
> both with and without the original v3 patch from this thread.
> 
> Of course then I figured I'd try the debug patch on vanilla v5.19-rc3
> and we get pretty much the same lockdep complaint.
> 
> 
>    sd 1:0:0:0: Attached scsi generic sg1 type 0
>    sd 1:0:0:0: [sdb] Preferred minimum I/O size 512 bytes
>    scsi 2:0:0:0: CD-ROM            HL-DT-ST DVD-ROM DH30N    A101 PQ: 0 ANSI: 5
> 
>    =====================================
>    WARNING: bad unlock balance detected!
>    5.19.0-rc3-dirty #2 Not tainted
>    -------------------------------------
>    swapper/2/0 is trying to release lock (buffer_head_lock) at:
>    [<ffffffffbc494a6b>] end_buffer_async_read+0x5b/0x180
>    but there are no more locks to release!

Lockdep caught the lock being released by a task different from the one that
acquired it.

>    
>    other info that might help us debug this:
>    1 lock held by swapper/2/0:
>     #0: ffff8bee27744080 (&ret->b_uptodate_lock){....}-{2:2}, at: end_buffer_async_read+0x47/0x180
>    
>    stack backtrace:
>    CPU: 2 PID: 0 Comm: swapper/2 Not tainted 5.19.0-rc3-dirty #2
>    Hardware name: Dell Inc. Precision WorkStation T5500  /0D883F, BIOS A16 05/28/2013
>    Call Trace:
>     <IRQ>
>     dump_stack_lvl+0x40/0x5c
>     lock_release+0x245/0x3f0
>     unlock_buffer+0x15/0x30
>     end_buffer_async_read+0x5b/0x180
>     end_bio_bh_io_sync+0x1e/0x40
>     blk_update_request+0x9a/0x470
>     scsi_end_request+0x27/0x190
>     scsi_io_completion+0x3c/0x580
>     blk_complete_reqs+0x39/0x50
>     __do_softirq+0x11d/0x344
>     irq_exit_rcu+0xa9/0xc0
>     common_interrupt+0xa5/0xc0
>     </IRQ>
>     <TASK>
>     asm_common_interrupt+0x27/0x40
>    RIP: 0010:cpuidle_enter_state+0x12d/0x3f0
>    Code: 49 43 0f 84 b7 02 00 00 31 ff e8 fe 1c 74 ff 45 84 ff 74 12 9c 58 f6 c4 02 0f 85 8e 02 00 00 31 ff e8 97 95 7a ff fb 45 85 f6 <0f> 88 12 01 00 00 49 63 d6 4c 2b 24 24 48 8d 04 52 48 8d 04 82 49
>    RSP: 0000:ffffa2890013fe90 EFLAGS: 00000206
>    RAX: 0000000000000000 RBX: 0000000000000003 RCX: 0000000000000002
>    RDX: 0000000080000002 RSI: ffffffffbd78d911 RDI: 00000000ffffffff
>    RBP: ffff8bee2739a400 R08: 0000000000000000 R09: 00000000001faf40
>    R10: ffffa2890013fdc8 R11: 0000000000000000 R12: 00000000a7b149d6
>    R13: ffffffffbdbee3a0 R14: 0000000000000003 R15: 0000000000000001
>     cpuidle_enter+0x24/0x40
>     do_idle+0x1e3/0x230
>     cpu_startup_entry+0x14/0x20
>     start_secondary+0xe8/0xf0
>     secondary_startup_64_no_verify+0xe0/0xeb
>     </TASK>
>     sda: sda1 sda2 sda3
>    sd 0:0:0:0: [sda] Attached SCSI disk
>     sdb: sdb1 sdb2
>    sd 1:0:0:0: [sdb] Attached SCSI disk
> 
> Not quite sure what to make of that.

You provide the answer to the question of why lock_buffer has gone without an
annotation to help lockdep for quite a long while. Perhaps the same applies to
lock_page as well, even though the related long-running deadlock does not sit
very deep under the water.
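
As a user-space analogy only (plain pthreads, nothing here is kernel code, and
it assumes nothing beyond what the splat already shows): the buffer lock is
taken by the I/O submitter but released from the I/O completion path, i.e. by a
different task, and any per-owner lock tracking - lockdep's mutex semantics in
the debug patch, an error-checking pthread mutex below - flags exactly that
owner mismatch.

/* build: cc -pthread owner_mismatch_sketch.c */
#include <errno.h>
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t bh_lock;		/* stand-in for the buffer lock */

/* Analogue of end_buffer_async_read() -> unlock_buffer() running in I/O
 * completion context, in a different task than the submitter. */
static void *completion_path(void *arg)
{
	(void)arg;
	int ret = pthread_mutex_unlock(&bh_lock);
	printf("unlock by non-owner: %s\n",
	       ret == EPERM ? "EPERM (owner mismatch, like the lockdep splat)"
			    : "unexpectedly succeeded");
	return NULL;
}

int main(void)
{
	pthread_mutexattr_t attr;
	pthread_t completion;

	pthread_mutexattr_init(&attr);
	pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
	pthread_mutex_init(&bh_lock, &attr);

	/* Analogue of lock_buffer() before I/O submission: submitter owns it. */
	pthread_mutex_lock(&bh_lock);

	pthread_create(&completion, NULL, completion_path, NULL);
	pthread_join(completion, NULL);

	pthread_mutex_unlock(&bh_lock);	/* the real owner releases */
	pthread_mutex_destroy(&bh_lock);
	pthread_mutexattr_destroy(&attr);
	return 0;
}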

Thanks
Hillf


