* [PATCH v3] sched/core: Always flush pending blk_plug
@ 2022-07-08 16:27 John Keeping
2022-07-09 16:19 ` Paul Gortmaker
` (3 more replies)
0 siblings, 4 replies; 5+ messages in thread
From: John Keeping @ 2022-07-08 16:27 UTC (permalink / raw)
To: linux-kernel
Cc: linux-rt-users, John Keeping, Sebastian Andrzej Siewior,
Peter Zijlstra, Thomas Gleixner, Steven Rostedt, Ingo Molnar,
Juri Lelli, Vincent Guittot, Dietmar Eggemann, Ben Segall,
Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider
With CONFIG_PREEMPT_RT, it is possible to hit a deadlock between two
normal priority tasks (SCHED_OTHER, nice level zero):
INFO: task kworker/u8:0:8 blocked for more than 491 seconds.
Not tainted 5.15.49-rt46 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/u8:0 state:D stack: 0 pid: 8 ppid: 2 flags:0x00000000
Workqueue: writeback wb_workfn (flush-7:0)
[<c08a3a10>] (__schedule) from [<c08a3d84>] (schedule+0xdc/0x134)
[<c08a3d84>] (schedule) from [<c08a65a0>] (rt_mutex_slowlock_block.constprop.0+0xb8/0x174)
[<c08a65a0>] (rt_mutex_slowlock_block.constprop.0) from [<c08a6708>]
+(rt_mutex_slowlock.constprop.0+0xac/0x174)
[<c08a6708>] (rt_mutex_slowlock.constprop.0) from [<c0374d60>] (fat_write_inode+0x34/0x54)
[<c0374d60>] (fat_write_inode) from [<c0297304>] (__writeback_single_inode+0x354/0x3ec)
[<c0297304>] (__writeback_single_inode) from [<c0297998>] (writeback_sb_inodes+0x250/0x45c)
[<c0297998>] (writeback_sb_inodes) from [<c0297c20>] (__writeback_inodes_wb+0x7c/0xb8)
[<c0297c20>] (__writeback_inodes_wb) from [<c0297f24>] (wb_writeback+0x2c8/0x2e4)
[<c0297f24>] (wb_writeback) from [<c0298c40>] (wb_workfn+0x1a4/0x3e4)
[<c0298c40>] (wb_workfn) from [<c0138ab8>] (process_one_work+0x1fc/0x32c)
[<c0138ab8>] (process_one_work) from [<c0139120>] (worker_thread+0x22c/0x2d8)
[<c0139120>] (worker_thread) from [<c013e6e0>] (kthread+0x16c/0x178)
[<c013e6e0>] (kthread) from [<c01000fc>] (ret_from_fork+0x14/0x38)
Exception stack(0xc10e3fb0 to 0xc10e3ff8)
3fa0: 00000000 00000000 00000000 00000000
3fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
3fe0: 00000000 00000000 00000000 00000000 00000013 00000000
INFO: task tar:2083 blocked for more than 491 seconds.
Not tainted 5.15.49-rt46 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:tar state:D stack: 0 pid: 2083 ppid: 2082 flags:0x00000000
[<c08a3a10>] (__schedule) from [<c08a3d84>] (schedule+0xdc/0x134)
[<c08a3d84>] (schedule) from [<c08a41b0>] (io_schedule+0x14/0x24)
[<c08a41b0>] (io_schedule) from [<c08a455c>] (bit_wait_io+0xc/0x30)
[<c08a455c>] (bit_wait_io) from [<c08a441c>] (__wait_on_bit_lock+0x54/0xa8)
[<c08a441c>] (__wait_on_bit_lock) from [<c08a44f4>] (out_of_line_wait_on_bit_lock+0x84/0xb0)
[<c08a44f4>] (out_of_line_wait_on_bit_lock) from [<c0371fb0>] (fat_mirror_bhs+0xa0/0x144)
[<c0371fb0>] (fat_mirror_bhs) from [<c0372a68>] (fat_alloc_clusters+0x138/0x2a4)
[<c0372a68>] (fat_alloc_clusters) from [<c0370b14>] (fat_alloc_new_dir+0x34/0x250)
[<c0370b14>] (fat_alloc_new_dir) from [<c03787c0>] (vfat_mkdir+0x58/0x148)
[<c03787c0>] (vfat_mkdir) from [<c0277b60>] (vfs_mkdir+0x68/0x98)
[<c0277b60>] (vfs_mkdir) from [<c027b484>] (do_mkdirat+0xb0/0xec)
[<c027b484>] (do_mkdirat) from [<c0100060>] (ret_fast_syscall+0x0/0x1c)
Exception stack(0xc2e1bfa8 to 0xc2e1bff0)
bfa0: 01ee42f0 01ee4208 01ee42f0 000041ed 00000000 00004000
bfc0: 01ee42f0 01ee4208 00000000 00000027 01ee4302 00000004 000dcb00 01ee4190
bfe0: 000dc368 bed11924 0006d4b0 b6ebddfc
Here the kworker is waiting on msdos_sb_info::s_lock, which is held by
tar, which is in turn waiting for a buffer that is locked pending
writeout; but that writeout is plugged in the kworker.
The lock is a normal struct mutex, so tsk_is_pi_blocked() will always
return false on !RT and thus the behaviour changes for RT.
It seems that the intent here is to skip blk_flush_plug() in the case
where a non-preemptible lock (such as a spinlock) has been converted to
a rtmutex on RT, which is the case covered by the SM_RTLOCK_WAIT
schedule flag. But sched_submit_work() is only called from schedule()
which is never called in this scenario, so the check can simply be
deleted.
Looking at the history of the -rt patchset, in fact this change was
present from v5.9.1-rt20 until being dropped in v5.13-rt1 as it was part
of a larger patch [1] most of which was replaced by commit b4bfa3fcfe3b
("sched/core: Rework the __schedule() preempt argument").
As described in [1]:
The schedule process must distinguish between blocking on a regular
sleeping lock (rwsem and mutex) and a RT-only sleeping lock (spinlock
and rwlock):
- rwsem and mutex must flush block requests (blk_schedule_flush_plug())
even if blocked on a lock. This can not deadlock because this also
happens for non-RT.
There should be a warning if the scheduling point is within a RCU read
section.
- spinlock and rwlock must not flush block requests. This will deadlock
if the callback attempts to acquire a lock which is already acquired.
Similarly to being preempted, there should be no warning if the
scheduling point is within a RCU read section.
and with the tsk_is_pi_blocked() in the scheduler path, we hit the first
issue.
[1] https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git/tree/patches/0022-locking-rtmutex-Use-custom-scheduling-function-for-s.patch?h=linux-5.10.y-rt-patches
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: John Keeping <john@metanate.com>
---
v3:
- Add SCHED_WARN_ON(current->__state & TASK_RTLOCK_WAIT) as suggested by
Peter
v2:
- Add Steven's R-b and update the commit message with his suggested
quote from [1]
include/linux/sched/rt.h | 8 --------
kernel/sched/core.c | 8 ++++++--
2 files changed, 6 insertions(+), 10 deletions(-)
diff --git a/include/linux/sched/rt.h b/include/linux/sched/rt.h
index e5af028c08b49..994c25640e156 100644
--- a/include/linux/sched/rt.h
+++ b/include/linux/sched/rt.h
@@ -39,20 +39,12 @@ static inline struct task_struct *rt_mutex_get_top_task(struct task_struct *p)
}
extern void rt_mutex_setprio(struct task_struct *p, struct task_struct *pi_task);
extern void rt_mutex_adjust_pi(struct task_struct *p);
-static inline bool tsk_is_pi_blocked(struct task_struct *tsk)
-{
- return tsk->pi_blocked_on != NULL;
-}
#else
static inline struct task_struct *rt_mutex_get_top_task(struct task_struct *task)
{
return NULL;
}
# define rt_mutex_adjust_pi(p) do { } while (0)
-static inline bool tsk_is_pi_blocked(struct task_struct *tsk)
-{
- return false;
-}
#endif
extern void normalize_rt_tasks(void);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1d4660a1915b3..71d6385ece83f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6578,8 +6578,12 @@ static inline void sched_submit_work(struct task_struct *tsk)
io_wq_worker_sleeping(tsk);
}
- if (tsk_is_pi_blocked(tsk))
- return;
+ /*
+ * spinlock and rwlock must not flush block requests. This will
+ * deadlock if the callback attempts to acquire a lock which is
+ * already acquired.
+ */
+ SCHED_WARN_ON(current->__state & TASK_RTLOCK_WAIT);
/*
* If we are going to sleep and we have plugged IO queued,
--
2.37.0
^ permalink raw reply related [flat|nested] 5+ messages in thread
* Re: [PATCH v3] sched/core: Always flush pending blk_plug
2022-07-08 16:27 [PATCH v3] sched/core: Always flush pending blk_plug John Keeping
@ 2022-07-09 16:19 ` Paul Gortmaker
[not found] ` <20220710010136.2510-1-hdanton@sina.com>
` (2 subsequent siblings)
3 siblings, 0 replies; 5+ messages in thread
From: Paul Gortmaker @ 2022-07-09 16:19 UTC (permalink / raw)
To: John Keeping
Cc: linux-kernel, linux-rt-users, Sebastian Andrzej Siewior,
Peter Zijlstra, Thomas Gleixner, Steven Rostedt, Ingo Molnar,
Juri Lelli, Vincent Guittot, Dietmar Eggemann, Ben Segall,
Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider
[[PATCH v3] sched/core: Always flush pending blk_plug] On 08/07/2022 (Fri 17:27) John Keeping wrote:
> With CONFIG_PREEMPT_RT, it is possible to hit a deadlock between two
> normal priority tasks (SCHED_OTHER, nice level zero):
>
> INFO: task kworker/u8:0:8 blocked for more than 491 seconds.
> Not tainted 5.15.49-rt46 #1
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> task:kworker/u8:0 state:D stack: 0 pid: 8 ppid: 2 flags:0x00000000
> Workqueue: writeback wb_workfn (flush-7:0)
This is great. I was on-and-off looking at an ext4 deadlock triggered
by LTP's "ftest01" - and was just putting my notes together this morning
to clarify for myself what I thought was going on before I forgot.
Which is good, because I could then recognize the very similar patterns
between what you saw and what I'd observed/deduced. Otherwise your
backtraces probably wouldn't have grabbed my attention at all.
Long story short, I applied this to v5.19-rc3-rt5 and it resolves the
LTP/ext4 deadlock issue. So...
Tested-by: Paul Gortmaker <paul.gortmaker@windriver.com>
As I'd already put together my ext4 debug info, I'll include it below
in case it helps future people searching for LTP and ftest01.
Thanks,
Paul.
-----------------
There is a reproducible deadlock in ext4 on preempt-rt triggered by LTP
"ftest01"
Reproducer details are at the bottom. What seems to happen is as follows:
The kworker doing writeback wb_workfn (pid 213 below) stalls on the
one and only down_write in ext4_map_blocks() and never moves again.
/*
* New blocks allocate and/or writing to unwritten extent
* will possibly result in updating i_data, so we take
* the write lock of i_data_sem, and call get_block()
* with create == 1 flag.
*/
down_write(&EXT4_I(inode)->i_data_sem);
So who is blocking the kworker? There are two of the ftest01 processes
(pid 818 and pid 821) doing ext4 truncate which look like this:
ext4_truncate+0x46a/0x660
--> ext4_ext_truncate+0xd6/0xf0
--> ext4_ext_remove_space+0x59f/0x1830
--> jbd2_journal_get_write_access+0x51/0x80
--> do_get_write_access+0x2ad/0x550
--> complete_all+0x60/0x60
--> bit_wait_io+0x11/0x60
They too are stalled in D state forever - presumably waiting on the
kworker to make the progress they need.
Here is where I think lockdep shows us the problem:
5 locks held by ftest01/818:
[...]
#4: ffff957c2a679958 (&ei->i_data_sem){++++}-{3:3}, at: ext4_truncate+0x443/0x660
5 locks held by ftest01/821:
[...]
#4: ffff957c2a67bb78 (&ei->i_data_sem){++++}-{3:3}, at: ext4_truncate+0x443/0x660
Yes - that is the one the kworker was trying the down_write() on.
Also note that the bit_wait is a big telltale sign here - there is only
one in jbd2's do_get_write_access() function:
/*
* There is one case we have to be very careful about. If the
* committing transaction is currently writing this buffer out to disk
* and has NOT made a copy-out, then we cannot modify the buffer
* contents at all right now. The essence of copy-out is that it is
* the extra copy, not the primary copy, which gets journaled. If the
* primary copy is already going to disk then we cannot do copy-out
* here.
*/
if (buffer_shadow(bh)) {
JBUFFER_TRACE(jh, "on shadow: sleep");
spin_unlock(&jh->b_state_lock);
wait_on_bit_io(&bh->b_state, BH_Shadow, TASK_UNINTERRUPTIBLE);
goto repeat;
}
Presumably this is a corner case that preempt-rt makes easier to trigger?
The full trace with interspersed lockdep info follows.
Various detailed reproducer information follows that.
-----------------------
sysrq-w backtrace information interspersed with sysrq-d lockdep info
on the 5.19-rc2-rt3 kernel:
task:kworker/u129:34 state:D stack: 0 pid: 213 ppid: 2 flags:0x00004000
Workqueue: writeback wb_workfn (flush-7:0)
Call Trace:
<TASK>
__schedule+0x329/0xae0
? __this_cpu_preempt_check+0x13/0x20
? __rt_mutex_slowlock.isra.21.constprop.25+0xc0e/0xc40
schedule+0x48/0xc0
__rt_mutex_slowlock.isra.21.constprop.25+0x1b8/0xc40
rwbase_write_lock+0x98/0x6e0
down_write+0x46/0x50
? ext4_map_blocks+0x156/0x5c0
ext4_map_blocks+0x156/0x5c0
ext4_writepages+0xa1a/0x1290
? __this_cpu_preempt_check+0x13/0x20
? lockdep_hardirqs_on+0xcd/0x150
do_writepages+0xd2/0x1a0
__writeback_single_inode+0x64/0x850
writeback_sb_inodes+0x22d/0x5b0
__writeback_inodes_wb+0x67/0xe0
wb_writeback+0x269/0x5f0
? get_nr_inodes+0x49/0x70
wb_workfn+0x43d/0x780
? __this_cpu_preempt_check+0x13/0x20
process_one_work+0x286/0x710
worker_thread+0x3c/0x3d0
? process_one_work+0x710/0x710
kthread+0x13b/0x150
? kthread_complete_and_exit+0x20/0x20
ret_from_fork+0x1f/0x30
</TASK>
6 locks held by kworker/u129:34/213:
#0: ffff957c25e9ef28 ((wq_completion)writeback){+.+.}-{0:0}, at: process_one_work+0x201/0x710
#1: ffffb7e9c3797e68 ((work_completion)(&(&wb->dwork)->work)){+.+.}-{0:0}, at: process_one_work+0x201/0x710
#2: ffff957b9c9b70d0 (&type->s_umount_key#34){++++}-{3:3}, at: trylock_super+0x1b/0x50
#3: ffff957b8a9b2d18 (&sbi->s_writepages_rwsem){.+.+}-{0:0}, at: do_writepages+0xd2/0x1a0
#4: ffff957b9c9b5be8 (jbd2_handle){++++}-{0:0}, at: start_this_handle+0x13e/0x5f0
#5: ffff957c2a679958 (&ei->i_data_sem){++++}-{3:3}, at: ext4_map_blocks+0x156/0x5c0
task:jbd2/loop0-8 state:D stack: 0 pid: 679 ppid: 2 flags:0x00004000
Call Trace:
<TASK>
__schedule+0x329/0xae0
schedule+0x48/0xc0
jbd2_journal_wait_updates+0x7c/0xf0
? schedstat_stop+0x10/0x10
jbd2_journal_commit_transaction+0x284/0x20e0
? __this_cpu_preempt_check+0x13/0x20
? find_held_lock+0x35/0xa0
? _raw_spin_unlock_irqrestore+0x86/0x8f
? _raw_spin_unlock_irqrestore+0x86/0x8f
? try_to_del_timer_sync+0x53/0x80
? trace_hardirqs_on+0x3b/0x120
? try_to_del_timer_sync+0x53/0x80
kjournald2+0xd5/0x2b0
? schedstat_stop+0x10/0x10
? commit_timeout+0x20/0x20
kthread+0x13b/0x150
? kthread_complete_and_exit+0x20/0x20
ret_from_fork+0x1f/0x30
</TASK>
task:ftest01 state:D stack: 0 pid: 817 ppid: 816 flags:0x00004002
Call Trace:
<TASK>
__schedule+0x329/0xae0
? __x64_sys_tee+0xd0/0xd0
? __x64_sys_tee+0xd0/0xd0
schedule+0x48/0xc0
wb_wait_for_completion+0x62/0x90
? schedstat_stop+0x10/0x10
sync_inodes_sb+0xdd/0x460
? __this_cpu_preempt_check+0x13/0x20
? __x64_sys_tee+0xd0/0xd0
sync_inodes_one_sb+0x15/0x20
iterate_supers+0x94/0x100
ksys_sync+0x42/0xa0
__do_sys_sync+0xe/0x20
do_syscall_64+0x3f/0x90
entry_SYSCALL_64_after_hwframe+0x46/0xb0
RIP: 0033:0x7f3f9dae2cdb
RSP: 002b:00007ffe06430318 EFLAGS: 00000246 ORIG_RAX: 00000000000000a2
RAX: ffffffffffffffda RBX: 0000000000000800 RCX: 00007f3f9dae2cdb
RDX: 0000000000000040 RSI: 0000561f57de52f0 RDI: 0000561f57de5340
RBP: 00000000000ff000 R08: 0000000000000000 R09: 00007f3f9dbce0a0
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 0000000000000207 R14: 0000000000000049 R15: 0000000000000003
</TASK>
1 lock held by ftest01/817:
#0: ffff957b9c9b70d0 (&type->s_umount_key#34){++++}-{3:3}, at: iterate_supers+0x7d/0x100
task:ftest01 state:D stack: 0 pid: 818 ppid: 816 flags:0x00004002
Call Trace:
<TASK>
__schedule+0x329/0xae0
? bit_wait+0x60/0x60
schedule+0x48/0xc0
io_schedule+0x16/0x40
bit_wait_io+0x11/0x60
__wait_on_bit_lock+0x5e/0xb0
out_of_line_wait_on_bit_lock+0x93/0xb0
? complete_all+0x60/0x60
__lock_buffer+0x3f/0x50
do_get_write_access+0x2ad/0x550
jbd2_journal_get_write_access+0x51/0x80
__ext4_journal_get_write_access+0xf9/0x1a0
ext4_ext_get_access.isra.32+0x34/0x50
ext4_ext_remove_space+0x59f/0x1830
? ext4_ext_truncate+0xa2/0xf0
? ext4_ext_truncate+0x5e/0xf0
ext4_ext_truncate+0xd6/0xf0
ext4_truncate+0x46a/0x660
ext4_setattr+0x6db/0xa40
notify_change+0x3e5/0x540
do_truncate+0x7b/0xc0
? do_truncate+0x7b/0xc0
do_sys_ftruncate+0x164/0x2f0
__x64_sys_ftruncate+0x1b/0x20
do_syscall_64+0x3f/0x90
entry_SYSCALL_64_after_hwframe+0x46/0xb0
RIP: 0033:0x7f3f9dae3f7b
RSP: 002b:00007ffe06430318 EFLAGS: 00000202 ORIG_RAX: 000000000000004d
RAX: ffffffffffffffda RBX: 00000000000001a5 RCX: 00007f3f9dae3f7b
RDX: 00000000000001a5 RSI: 00000000000d2800 RDI: 0000000000000003
RBP: 00000000000001a5 R08: 00007f3f9dbce090 R09: 00007f3f9dbce0a0
R10: 0000000000000008 R11: 0000000000000202 R12: 0000000000000000
R13: 0000000000000207 R14: 0000000000000043 R15: 0000000000000003
</TASK>
5 locks held by ftest01/818:
#0: ffff957b9c9b74e8 (sb_writers#3){.+.+}-{0:0}, at: __x64_sys_ftruncate+0x1b/0x20
#1: ffff957c2a679b00 (&sb->s_type->i_mutex_key#10){+.+.}-{3:3}, at: do_truncate+0x6a/0xc0
#2: ffff957c2a679cc0 (mapping.invalidate_lock){++++}-{3:3}, at: ext4_setattr+0x39a/0xa40
#3: ffff957b9c9b5be8 (jbd2_handle){++++}-{0:0}, at: start_this_handle+0x13e/0x5f0
#4: ffff957c2a679958 (&ei->i_data_sem){++++}-{3:3}, at: ext4_truncate+0x443/0x660
task:ftest01 state:D stack: 0 pid: 819 ppid: 816 flags:0x00004002
Call Trace:
<TASK>
__schedule+0x329/0xae0
? __x64_sys_tee+0xd0/0xd0
? __x64_sys_tee+0xd0/0xd0
schedule+0x48/0xc0
wb_wait_for_completion+0x62/0x90
? schedstat_stop+0x10/0x10
sync_inodes_sb+0xdd/0x460
? __this_cpu_preempt_check+0x13/0x20
? __x64_sys_tee+0xd0/0xd0
sync_inodes_one_sb+0x15/0x20
iterate_supers+0x94/0x100
ksys_sync+0x42/0xa0
__do_sys_sync+0xe/0x20
do_syscall_64+0x3f/0x90
entry_SYSCALL_64_after_hwframe+0x46/0xb0
RIP: 0033:0x7f3f9dae2cdb
RSP: 002b:00007ffe06430318 EFLAGS: 00000246 ORIG_RAX: 00000000000000a2
RAX: ffffffffffffffda RBX: 0000000000000800 RCX: 00007f3f9dae2cdb
RDX: 0000000000000040 RSI: 0000561f57de52f0 RDI: 0000561f57de5340
RBP: 00000000000c2800 R08: 00000000fffff800 R09: 0000000000004000
R10: 0000000000000010 R11: 0000000000000246 R12: 0000000000000000
R13: 0000000000000207 R14: 0000000000000075 R15: 0000000000000003
</TASK>
1 lock held by ftest01/819:
#0: ffff957b9c9b70d0 (&type->s_umount_key#34){++++}-{3:3}, at: iterate_supers+0x7d/0x100
task:ftest01 state:D stack: 0 pid: 820 ppid: 816 flags:0x00004002
Call Trace:
<TASK>
__schedule+0x329/0xae0
? __x64_sys_tee+0xd0/0xd0
? __x64_sys_tee+0xd0/0xd0
schedule+0x48/0xc0
wb_wait_for_completion+0x62/0x90
? schedstat_stop+0x10/0x10
sync_inodes_sb+0xdd/0x460
? __this_cpu_preempt_check+0x13/0x20
? __x64_sys_tee+0xd0/0xd0
sync_inodes_one_sb+0x15/0x20
iterate_supers+0x94/0x100
ksys_sync+0x42/0xa0
__do_sys_sync+0xe/0x20
do_syscall_64+0x3f/0x90
entry_SYSCALL_64_after_hwframe+0x46/0xb0
RIP: 0033:0x7f3f9dae2cdb
RSP: 002b:00007ffe06430318 EFLAGS: 00000246 ORIG_RAX: 00000000000000a2
RAX: ffffffffffffffda RBX: 0000000000000800 RCX: 00007f3f9dae2cdb
RDX: 0000000000000040 RSI: 0000561f57de52f0 RDI: 0000561f57de5340
RBP: 000000000006a800 R08: 00000000fffff800 R09: 0000000000004000
R10: 0000000000000010 R11: 0000000000000246 R12: 0000000000000000
R13: 0000000000000207 R14: 0000000000000069 R15: 0000000000000003
</TASK>
1 lock held by ftest01/820:
#0: ffff957b9c9b70d0 (&type->s_umount_key#34){++++}-{3:3}, at: iterate_supers+0x7d/0x100
task:ftest01 state:D stack: 0 pid: 821 ppid: 816 flags:0x00000002
Call Trace:
<TASK>
__schedule+0x329/0xae0
? bit_wait+0x60/0x60
schedule+0x48/0xc0
io_schedule+0x16/0x40
bit_wait_io+0x11/0x60
__wait_on_bit_lock+0x5e/0xb0
out_of_line_wait_on_bit_lock+0x93/0xb0
? complete_all+0x60/0x60
__lock_buffer+0x3f/0x50
do_get_write_access+0x2ad/0x550
jbd2_journal_get_write_access+0x51/0x80
__ext4_journal_get_write_access+0xf9/0x1a0
ext4_ext_get_access.isra.32+0x34/0x50
ext4_ext_remove_space+0x59f/0x1830
? ext4_ext_truncate+0xa2/0xf0
? ext4_ext_truncate+0x5e/0xf0
ext4_ext_truncate+0xd6/0xf0
ext4_truncate+0x46a/0x660
ext4_setattr+0x6db/0xa40
notify_change+0x3e5/0x540
do_truncate+0x7b/0xc0
? do_truncate+0x7b/0xc0
do_sys_ftruncate+0x164/0x2f0
__x64_sys_ftruncate+0x1b/0x20
do_syscall_64+0x3f/0x90
entry_SYSCALL_64_after_hwframe+0x46/0xb0
RIP: 0033:0x7f3f9dae3f7b
RSP: 002b:00007ffe06430318 EFLAGS: 00000202 ORIG_RAX: 000000000000004d
RAX: ffffffffffffffda RBX: 000000000000016a RCX: 00007f3f9dae3f7b
RDX: 000000000000016a RSI: 00000000000b5000 RDI: 0000000000000003
RBP: 000000000000016a R08: 00007f3f9dbce080 R09: 00007f3f9dbce0a0
R10: 0000000000000002 R11: 0000000000000202 R12: 0000000000000000
R13: 0000000000000207 R14: 0000000000000040 R15: 0000000000000003
</TASK>
5 locks held by ftest01/821:
#0: ffff957b9c9b74e8 (sb_writers#3){.+.+}-{0:0}, at: __x64_sys_ftruncate+0x1b/0x20
#1: ffff957c2a67bd20 (&sb->s_type->i_mutex_key#10){+.+.}-{3:3}, at: do_truncate+0x6a/0xc0
#2: ffff957c2a67bee0 (mapping.invalidate_lock){++++}-{3:3}, at: ext4_setattr+0x39a/0xa40
#3: ffff957b9c9b5be8 (jbd2_handle){++++}-{0:0}, at: start_this_handle+0x13e/0x5f0
#4: ffff957c2a67bb78 (&ei->i_data_sem){++++}-{3:3}, at: ext4_truncate+0x443/0x660
--------------------------------------
Reproducer:
Get latest preempt-rt kernel:
git://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git
checkout v5.19-rc3-rt5 (or linux-5.19.y-rt branch)
make defconfig
manually enable CONFIG_EXPERT and then CONFIG_PREEMPT_RT
make oldconfig and take defaults for other options hidden behind EXPERT
can optionally enable LOCKDEP options so sysrq-d works.
Get LTP ; build + install
https://github.com/linux-test-project/ltp.git
Boot and then run reproducer:
dd if=/dev/zero of=/root/tmp_mount_file bs=1M count=512
mkfs.ext4 -F /root/tmp_mount_file
mount /root/tmp_mount_file /mnt/
/opt/ltp/runltp -f fs -s ftest01 -d /mnt/
umount /mnt
The loop device isn't strictly necessary, but it contains the fs
fallout when the inevitable reboot is required to get rid of the D
state processes.
The test completes on v5.10-rt and v5.12-rt, taking about 2-3m total.
It hangs forever on v5.15-rt and the latest v5.19-rc-rt: the test
threads sit in D state, and any other I/O directed at /mnt also hangs
in D state.
* Re: [PATCH v3] sched/core: Always flush pending blk_plug
[not found] ` <20220710010136.2510-1-hdanton@sina.com>
@ 2022-07-12 18:29 ` Paul Gortmaker
0 siblings, 0 replies; 5+ messages in thread
From: Paul Gortmaker @ 2022-07-12 18:29 UTC (permalink / raw)
To: Hillf Danton
Cc: John Keeping, linux-kernel, linux-mm, Peter Zijlstra,
Thomas Gleixner, Mel Gorman, Theodore Ts'o
[Re: [PATCH v3] sched/core: Always flush pending blk_plug] On 10/07/2022 (Sun 09:01) Hillf Danton wrote:
> Looks like a ABBA deadlock that should be fed to lockdep, given what was
> blocking flusher (pid 213).
[...]
> Make lock_buffer have mutex_lock() semantics in attempt to catch the
> deadlock above.
>
> Only for thoughts now.
Thanks for the test patch - I'm keeping it as a reference for future
use.
So, preempt-rt issues an early lockdep splat at partition probe bootup,
both with and without the original v3 patch from this thread.
Of course then I figured I'd try the debug patch on vanilla v5.19-rc3
and we get pretty much the same lockdep complaint.
sd 1:0:0:0: Attached scsi generic sg1 type 0
sd 1:0:0:0: [sdb] Preferred minimum I/O size 512 bytes
scsi 2:0:0:0: CD-ROM HL-DT-ST DVD-ROM DH30N A101 PQ: 0 ANSI: 5
=====================================
WARNING: bad unlock balance detected!
5.19.0-rc3-dirty #2 Not tainted
-------------------------------------
swapper/2/0 is trying to release lock (buffer_head_lock) at:
[<ffffffffbc494a6b>] end_buffer_async_read+0x5b/0x180
but there are no more locks to release!
other info that might help us debug this:
1 lock held by swapper/2/0:
#0: ffff8bee27744080 (&ret->b_uptodate_lock){....}-{2:2}, at: end_buffer_async_read+0x47/0x180
stack backtrace:
CPU: 2 PID: 0 Comm: swapper/2 Not tainted 5.19.0-rc3-dirty #2
Hardware name: Dell Inc. Precision WorkStation T5500 /0D883F, BIOS A16 05/28/2013
Call Trace:
<IRQ>
dump_stack_lvl+0x40/0x5c
lock_release+0x245/0x3f0
unlock_buffer+0x15/0x30
end_buffer_async_read+0x5b/0x180
end_bio_bh_io_sync+0x1e/0x40
blk_update_request+0x9a/0x470
scsi_end_request+0x27/0x190
scsi_io_completion+0x3c/0x580
blk_complete_reqs+0x39/0x50
__do_softirq+0x11d/0x344
irq_exit_rcu+0xa9/0xc0
common_interrupt+0xa5/0xc0
</IRQ>
<TASK>
asm_common_interrupt+0x27/0x40
RIP: 0010:cpuidle_enter_state+0x12d/0x3f0
Code: 49 43 0f 84 b7 02 00 00 31 ff e8 fe 1c 74 ff 45 84 ff 74 12 9c 58 f6 c4 02 0f 85 8e 02 00 00 31 ff e8 97 95 7a ff fb 45 85 f6 <0f> 88 12 01 00 00 49 63 d6 4c 2b 24 24 48 8d 04 52 48 8d 04 82 49
RSP: 0000:ffffa2890013fe90 EFLAGS: 00000206
RAX: 0000000000000000 RBX: 0000000000000003 RCX: 0000000000000002
RDX: 0000000080000002 RSI: ffffffffbd78d911 RDI: 00000000ffffffff
RBP: ffff8bee2739a400 R08: 0000000000000000 R09: 00000000001faf40
R10: ffffa2890013fdc8 R11: 0000000000000000 R12: 00000000a7b149d6
R13: ffffffffbdbee3a0 R14: 0000000000000003 R15: 0000000000000001
cpuidle_enter+0x24/0x40
do_idle+0x1e3/0x230
cpu_startup_entry+0x14/0x20
start_secondary+0xe8/0xf0
secondary_startup_64_no_verify+0xe0/0xeb
</TASK>
sda: sda1 sda2 sda3
sd 0:0:0:0: [sda] Attached SCSI disk
sdb: sdb1 sdb2
sd 1:0:0:0: [sdb] Attached SCSI disk
Not quite sure what to make of that.
Paul.
--
* [tip: sched/core] sched/core: Always flush pending blk_plug
2022-07-08 16:27 [PATCH v3] sched/core: Always flush pending blk_plug John Keeping
2022-07-09 16:19 ` Paul Gortmaker
[not found] ` <20220710010136.2510-1-hdanton@sina.com>
@ 2022-07-14 11:35 ` tip-bot2 for John Keeping
2022-07-25 8:56 ` [PATCH v3] " Sebastian Andrzej Siewior
3 siblings, 0 replies; 5+ messages in thread
From: tip-bot2 for John Keeping @ 2022-07-14 11:35 UTC (permalink / raw)
To: linux-tip-commits
Cc: John Keeping, Peter Zijlstra (Intel), Steven Rostedt (Google),
x86, linux-kernel
The following commit has been merged into the sched/core branch of tip:
Commit-ID: 401e4963bf45c800e3e9ea0d3a0289d738005fd4
Gitweb: https://git.kernel.org/tip/401e4963bf45c800e3e9ea0d3a0289d738005fd4
Author: John Keeping <john@metanate.com>
AuthorDate: Fri, 08 Jul 2022 17:27:02 +01:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Wed, 13 Jul 2022 11:29:17 +02:00
sched/core: Always flush pending blk_plug
With CONFIG_PREEMPT_RT, it is possible to hit a deadlock between two
normal priority tasks (SCHED_OTHER, nice level zero):
INFO: task kworker/u8:0:8 blocked for more than 491 seconds.
Not tainted 5.15.49-rt46 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/u8:0 state:D stack: 0 pid: 8 ppid: 2 flags:0x00000000
Workqueue: writeback wb_workfn (flush-7:0)
[<c08a3a10>] (__schedule) from [<c08a3d84>] (schedule+0xdc/0x134)
[<c08a3d84>] (schedule) from [<c08a65a0>] (rt_mutex_slowlock_block.constprop.0+0xb8/0x174)
[<c08a65a0>] (rt_mutex_slowlock_block.constprop.0) from [<c08a6708>]
+(rt_mutex_slowlock.constprop.0+0xac/0x174)
[<c08a6708>] (rt_mutex_slowlock.constprop.0) from [<c0374d60>] (fat_write_inode+0x34/0x54)
[<c0374d60>] (fat_write_inode) from [<c0297304>] (__writeback_single_inode+0x354/0x3ec)
[<c0297304>] (__writeback_single_inode) from [<c0297998>] (writeback_sb_inodes+0x250/0x45c)
[<c0297998>] (writeback_sb_inodes) from [<c0297c20>] (__writeback_inodes_wb+0x7c/0xb8)
[<c0297c20>] (__writeback_inodes_wb) from [<c0297f24>] (wb_writeback+0x2c8/0x2e4)
[<c0297f24>] (wb_writeback) from [<c0298c40>] (wb_workfn+0x1a4/0x3e4)
[<c0298c40>] (wb_workfn) from [<c0138ab8>] (process_one_work+0x1fc/0x32c)
[<c0138ab8>] (process_one_work) from [<c0139120>] (worker_thread+0x22c/0x2d8)
[<c0139120>] (worker_thread) from [<c013e6e0>] (kthread+0x16c/0x178)
[<c013e6e0>] (kthread) from [<c01000fc>] (ret_from_fork+0x14/0x38)
Exception stack(0xc10e3fb0 to 0xc10e3ff8)
3fa0: 00000000 00000000 00000000 00000000
3fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
3fe0: 00000000 00000000 00000000 00000000 00000013 00000000
INFO: task tar:2083 blocked for more than 491 seconds.
Not tainted 5.15.49-rt46 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:tar state:D stack: 0 pid: 2083 ppid: 2082 flags:0x00000000
[<c08a3a10>] (__schedule) from [<c08a3d84>] (schedule+0xdc/0x134)
[<c08a3d84>] (schedule) from [<c08a41b0>] (io_schedule+0x14/0x24)
[<c08a41b0>] (io_schedule) from [<c08a455c>] (bit_wait_io+0xc/0x30)
[<c08a455c>] (bit_wait_io) from [<c08a441c>] (__wait_on_bit_lock+0x54/0xa8)
[<c08a441c>] (__wait_on_bit_lock) from [<c08a44f4>] (out_of_line_wait_on_bit_lock+0x84/0xb0)
[<c08a44f4>] (out_of_line_wait_on_bit_lock) from [<c0371fb0>] (fat_mirror_bhs+0xa0/0x144)
[<c0371fb0>] (fat_mirror_bhs) from [<c0372a68>] (fat_alloc_clusters+0x138/0x2a4)
[<c0372a68>] (fat_alloc_clusters) from [<c0370b14>] (fat_alloc_new_dir+0x34/0x250)
[<c0370b14>] (fat_alloc_new_dir) from [<c03787c0>] (vfat_mkdir+0x58/0x148)
[<c03787c0>] (vfat_mkdir) from [<c0277b60>] (vfs_mkdir+0x68/0x98)
[<c0277b60>] (vfs_mkdir) from [<c027b484>] (do_mkdirat+0xb0/0xec)
[<c027b484>] (do_mkdirat) from [<c0100060>] (ret_fast_syscall+0x0/0x1c)
Exception stack(0xc2e1bfa8 to 0xc2e1bff0)
bfa0: 01ee42f0 01ee4208 01ee42f0 000041ed 00000000 00004000
bfc0: 01ee42f0 01ee4208 00000000 00000027 01ee4302 00000004 000dcb00 01ee4190
bfe0: 000dc368 bed11924 0006d4b0 b6ebddfc
Here the kworker is waiting on msdos_sb_info::s_lock, which is held by
tar, which is in turn waiting for a buffer that is locked pending
writeout; but that writeout is plugged in the kworker.
The lock is a normal struct mutex, so tsk_is_pi_blocked() will always
return false on !RT and thus the behaviour changes for RT.
It seems that the intent here is to skip blk_flush_plug() in the case
where a non-preemptible lock (such as a spinlock) has been converted to
a rtmutex on RT, which is the case covered by the SM_RTLOCK_WAIT
schedule flag. But sched_submit_work() is only called from schedule()
which is never called in this scenario, so the check can simply be
deleted.
Looking at the history of the -rt patchset, in fact this change was
present from v5.9.1-rt20 until being dropped in v5.13-rt1 as it was part
of a larger patch [1] most of which was replaced by commit b4bfa3fcfe3b
("sched/core: Rework the __schedule() preempt argument").
As described in [1]:
The schedule process must distinguish between blocking on a regular
sleeping lock (rwsem and mutex) and a RT-only sleeping lock (spinlock
and rwlock):
- rwsem and mutex must flush block requests (blk_schedule_flush_plug())
even if blocked on a lock. This can not deadlock because this also
happens for non-RT.
There should be a warning if the scheduling point is within a RCU read
section.
- spinlock and rwlock must not flush block requests. This will deadlock
if the callback attempts to acquire a lock which is already acquired.
Similarly to being preempted, there should be no warning if the
scheduling point is within a RCU read section.
and with the tsk_is_pi_blocked() in the scheduler path, we hit the first
issue.
[1] https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git/tree/patches/0022-locking-rtmutex-Use-custom-scheduling-function-for-s.patch?h=linux-5.10.y-rt-patches
Signed-off-by: John Keeping <john@metanate.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Link: https://lkml.kernel.org/r/20220708162702.1758865-1-john@metanate.com
---
include/linux/sched/rt.h | 8 --------
kernel/sched/core.c | 8 ++++++--
2 files changed, 6 insertions(+), 10 deletions(-)
diff --git a/include/linux/sched/rt.h b/include/linux/sched/rt.h
index e5af028..994c256 100644
--- a/include/linux/sched/rt.h
+++ b/include/linux/sched/rt.h
@@ -39,20 +39,12 @@ static inline struct task_struct *rt_mutex_get_top_task(struct task_struct *p)
}
extern void rt_mutex_setprio(struct task_struct *p, struct task_struct *pi_task);
extern void rt_mutex_adjust_pi(struct task_struct *p);
-static inline bool tsk_is_pi_blocked(struct task_struct *tsk)
-{
- return tsk->pi_blocked_on != NULL;
-}
#else
static inline struct task_struct *rt_mutex_get_top_task(struct task_struct *task)
{
return NULL;
}
# define rt_mutex_adjust_pi(p) do { } while (0)
-static inline bool tsk_is_pi_blocked(struct task_struct *tsk)
-{
- return false;
-}
#endif
extern void normalize_rt_tasks(void);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c703d17..a463dbc 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6470,8 +6470,12 @@ static inline void sched_submit_work(struct task_struct *tsk)
io_wq_worker_sleeping(tsk);
}
- if (tsk_is_pi_blocked(tsk))
- return;
+ /*
+ * spinlock and rwlock must not flush block requests. This will
+ * deadlock if the callback attempts to acquire a lock which is
+ * already acquired.
+ */
+ SCHED_WARN_ON(current->__state & TASK_RTLOCK_WAIT);
/*
* If we are going to sleep and we have plugged IO queued,
* Re: [PATCH v3] sched/core: Always flush pending blk_plug
2022-07-08 16:27 [PATCH v3] sched/core: Always flush pending blk_plug John Keeping
` (2 preceding siblings ...)
2022-07-14 11:35 ` [tip: sched/core] " tip-bot2 for John Keeping
@ 2022-07-25 8:56 ` Sebastian Andrzej Siewior
3 siblings, 0 replies; 5+ messages in thread
From: Sebastian Andrzej Siewior @ 2022-07-25 8:56 UTC (permalink / raw)
To: John Keeping
Cc: linux-kernel, linux-rt-users, Peter Zijlstra, Thomas Gleixner,
Steven Rostedt, Ingo Molnar, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Ben Segall, Mel Gorman,
Daniel Bristot de Oliveira, Valentin Schneider
On 2022-07-08 17:27:02 [+0100], John Keeping wrote:
> Looking at the history of the -rt patchset, in fact this change was
> present from v5.9.1-rt20 until being dropped in v5.13-rt1 as it was part
> of a larger patch [1] most of which was replaced by commit b4bfa3fcfe3b
> ("sched/core: Rework the __schedule() preempt argument").
Thanks for spotting that. The plan was to get rid of this with the
introduction of schedule_rtlock() but it somehow remained resilient.
Sebastian