* [PATCH v2] sched/core: Always flush pending blk_plug
@ 2022-07-08 9:32 John Keeping
2022-07-08 13:53 ` Peter Zijlstra
0 siblings, 1 reply; 2+ messages in thread
From: John Keeping @ 2022-07-08 9:32 UTC (permalink / raw)
To: linux-kernel
Cc: linux-rt-users, John Keeping, Sebastian Andrzej Siewior,
Peter Zijlstra, Thomas Gleixner, Steven Rostedt, Ingo Molnar,
Juri Lelli, Vincent Guittot, Dietmar Eggemann, Ben Segall,
Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider
With CONFIG_PREEMPT_RT, it is possible to hit a deadlock between two
normal priority tasks (SCHED_OTHER, nice level zero):
INFO: task kworker/u8:0:8 blocked for more than 491 seconds.
Not tainted 5.15.49-rt46 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/u8:0 state:D stack: 0 pid: 8 ppid: 2 flags:0x00000000
Workqueue: writeback wb_workfn (flush-7:0)
[<c08a3a10>] (__schedule) from [<c08a3d84>] (schedule+0xdc/0x134)
[<c08a3d84>] (schedule) from [<c08a65a0>] (rt_mutex_slowlock_block.constprop.0+0xb8/0x174)
[<c08a65a0>] (rt_mutex_slowlock_block.constprop.0) from [<c08a6708>]
+(rt_mutex_slowlock.constprop.0+0xac/0x174)
[<c08a6708>] (rt_mutex_slowlock.constprop.0) from [<c0374d60>] (fat_write_inode+0x34/0x54)
[<c0374d60>] (fat_write_inode) from [<c0297304>] (__writeback_single_inode+0x354/0x3ec)
[<c0297304>] (__writeback_single_inode) from [<c0297998>] (writeback_sb_inodes+0x250/0x45c)
[<c0297998>] (writeback_sb_inodes) from [<c0297c20>] (__writeback_inodes_wb+0x7c/0xb8)
[<c0297c20>] (__writeback_inodes_wb) from [<c0297f24>] (wb_writeback+0x2c8/0x2e4)
[<c0297f24>] (wb_writeback) from [<c0298c40>] (wb_workfn+0x1a4/0x3e4)
[<c0298c40>] (wb_workfn) from [<c0138ab8>] (process_one_work+0x1fc/0x32c)
[<c0138ab8>] (process_one_work) from [<c0139120>] (worker_thread+0x22c/0x2d8)
[<c0139120>] (worker_thread) from [<c013e6e0>] (kthread+0x16c/0x178)
[<c013e6e0>] (kthread) from [<c01000fc>] (ret_from_fork+0x14/0x38)
Exception stack(0xc10e3fb0 to 0xc10e3ff8)
3fa0: 00000000 00000000 00000000 00000000
3fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
3fe0: 00000000 00000000 00000000 00000000 00000013 00000000
INFO: task tar:2083 blocked for more than 491 seconds.
Not tainted 5.15.49-rt46 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:tar state:D stack: 0 pid: 2083 ppid: 2082 flags:0x00000000
[<c08a3a10>] (__schedule) from [<c08a3d84>] (schedule+0xdc/0x134)
[<c08a3d84>] (schedule) from [<c08a41b0>] (io_schedule+0x14/0x24)
[<c08a41b0>] (io_schedule) from [<c08a455c>] (bit_wait_io+0xc/0x30)
[<c08a455c>] (bit_wait_io) from [<c08a441c>] (__wait_on_bit_lock+0x54/0xa8)
[<c08a441c>] (__wait_on_bit_lock) from [<c08a44f4>] (out_of_line_wait_on_bit_lock+0x84/0xb0)
[<c08a44f4>] (out_of_line_wait_on_bit_lock) from [<c0371fb0>] (fat_mirror_bhs+0xa0/0x144)
[<c0371fb0>] (fat_mirror_bhs) from [<c0372a68>] (fat_alloc_clusters+0x138/0x2a4)
[<c0372a68>] (fat_alloc_clusters) from [<c0370b14>] (fat_alloc_new_dir+0x34/0x250)
[<c0370b14>] (fat_alloc_new_dir) from [<c03787c0>] (vfat_mkdir+0x58/0x148)
[<c03787c0>] (vfat_mkdir) from [<c0277b60>] (vfs_mkdir+0x68/0x98)
[<c0277b60>] (vfs_mkdir) from [<c027b484>] (do_mkdirat+0xb0/0xec)
[<c027b484>] (do_mkdirat) from [<c0100060>] (ret_fast_syscall+0x0/0x1c)
Exception stack(0xc2e1bfa8 to 0xc2e1bff0)
bfa0: 01ee42f0 01ee4208 01ee42f0 000041ed 00000000 00004000
bfc0: 01ee42f0 01ee4208 00000000 00000027 01ee4302 00000004 000dcb00 01ee4190
bfe0: 000dc368 bed11924 0006d4b0 b6ebddfc
Here the kworker is waiting on msdos_sb_info::s_lock which is held by
tar which is in turn waiting for a buffer which is locked waiting to be
flushed, but this operation is plugged in the kworker.
The lock is a normal struct mutex, so tsk_is_pi_blocked() will always
return false on !RT and thus the behaviour changes for RT.
It seems that the intent here is to skip blk_flush_plug() in the case
where a non-preemptible lock (such as a spinlock) has been converted to
a rtmutex on RT, which is the case covered by the SM_RTLOCK_WAIT
schedule flag. But sched_submit_work() is only called from schedule()
which is never called in this scenario, so the check can simply be
deleted.
Looking at the history of the -rt patchset, in fact this change was
present from v5.9.1-rt20 until being dropped in v5.13-rt1 as it was part
of a larger patch [1] most of which was replaced by commit b4bfa3fcfe3b
("sched/core: Rework the __schedule() preempt argument").
As described in [1]:
The schedule process must distinguish between blocking on a regular
sleeping lock (rwsem and mutex) and a RT-only sleeping lock (spinlock
and rwlock):
- rwsem and mutex must flush block requests (blk_schedule_flush_plug())
even if blocked on a lock. This can not deadlock because this also
happens for non-RT.
There should be a warning if the scheduling point is within a RCU read
section.
- spinlock and rwlock must not flush block requests. This will deadlock
if the callback attempts to acquire a lock which is already acquired.
Similarly to being preempted, there should be no warning if the
scheduling point is within a RCU read section.
and with the tsk_is_pi_blocked() in the scheduler path, we hit the first
issue.
[1] https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git/tree/patches/0022-locking-rtmutex-Use-custom-scheduling-function-for-s.patch?h=linux-5.10.y-rt-patches
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: John Keeping <john@metanate.com>
---
v2:
- Add Steven's R-b and update the commit message with his suggested
quote from [1]
include/linux/sched/rt.h | 8 --------
kernel/sched/core.c | 3 ---
2 files changed, 11 deletions(-)
diff --git a/include/linux/sched/rt.h b/include/linux/sched/rt.h
index e5af028c08b49..994c25640e156 100644
--- a/include/linux/sched/rt.h
+++ b/include/linux/sched/rt.h
@@ -39,20 +39,12 @@ static inline struct task_struct *rt_mutex_get_top_task(struct task_struct *p)
}
extern void rt_mutex_setprio(struct task_struct *p, struct task_struct *pi_task);
extern void rt_mutex_adjust_pi(struct task_struct *p);
-static inline bool tsk_is_pi_blocked(struct task_struct *tsk)
-{
- return tsk->pi_blocked_on != NULL;
-}
#else
static inline struct task_struct *rt_mutex_get_top_task(struct task_struct *task)
{
return NULL;
}
# define rt_mutex_adjust_pi(p) do { } while (0)
-static inline bool tsk_is_pi_blocked(struct task_struct *tsk)
-{
- return false;
-}
#endif
extern void normalize_rt_tasks(void);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1d4660a1915b3..e4974fe003b5b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6578,9 +6578,6 @@ static inline void sched_submit_work(struct task_struct *tsk)
io_wq_worker_sleeping(tsk);
}
- if (tsk_is_pi_blocked(tsk))
- return;
-
/*
* If we are going to sleep and we have plugged IO queued,
* make sure to submit it to avoid deadlocks.
--
2.37.0
^ permalink raw reply related [flat|nested] 2+ messages in thread
* Re: [PATCH v2] sched/core: Always flush pending blk_plug
2022-07-08 9:32 [PATCH v2] sched/core: Always flush pending blk_plug John Keeping
@ 2022-07-08 13:53 ` Peter Zijlstra
0 siblings, 0 replies; 2+ messages in thread
From: Peter Zijlstra @ 2022-07-08 13:53 UTC (permalink / raw)
To: John Keeping
Cc: linux-kernel, linux-rt-users, Sebastian Andrzej Siewior,
Thomas Gleixner, Steven Rostedt, Ingo Molnar, Juri Lelli,
Vincent Guittot, Dietmar Eggemann, Ben Segall, Mel Gorman,
Daniel Bristot de Oliveira, Valentin Schneider
On Fri, Jul 08, 2022 at 10:32:12AM +0100, John Keeping wrote:
> It seems that the intent here is to skip blk_flush_plug() in the case
> where a non-preemptible lock (such as a spinlock) has been converted to
> a rtmutex on RT, which is the case covered by the SM_RTLOCK_WAIT
> schedule flag. But sched_submit_work() is only called from schedule()
> which is never called in this scenario, so the check can simply be
> deleted.
> include/linux/sched/rt.h | 8 --------
> kernel/sched/core.c | 3 ---
> 2 files changed, 11 deletions(-)
>
> diff --git a/include/linux/sched/rt.h b/include/linux/sched/rt.h
> index e5af028c08b49..994c25640e156 100644
> --- a/include/linux/sched/rt.h
> +++ b/include/linux/sched/rt.h
> @@ -39,20 +39,12 @@ static inline struct task_struct *rt_mutex_get_top_task(struct task_struct *p)
> }
> extern void rt_mutex_setprio(struct task_struct *p, struct task_struct *pi_task);
> extern void rt_mutex_adjust_pi(struct task_struct *p);
> -static inline bool tsk_is_pi_blocked(struct task_struct *tsk)
> -{
> - return tsk->pi_blocked_on != NULL;
> -}
> #else
> static inline struct task_struct *rt_mutex_get_top_task(struct task_struct *task)
> {
> return NULL;
> }
> # define rt_mutex_adjust_pi(p) do { } while (0)
> -static inline bool tsk_is_pi_blocked(struct task_struct *tsk)
> -{
> - return false;
> -}
> #endif
>
> extern void normalize_rt_tasks(void);
Excellent, glad to see the back of that.
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 1d4660a1915b3..e4974fe003b5b 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -6578,9 +6578,6 @@ static inline void sched_submit_work(struct task_struct *tsk)
> io_wq_worker_sleeping(tsk);
> }
>
> - if (tsk_is_pi_blocked(tsk))
> - return;
> -
Would it make sense to replace this with:
SCHED_WARN_ON(current->__state & TASK_RTLOCK_WAIT);
Along with a comment along the lines of:
> - spinlock and rwlock must not flush block requests. This will deadlock
> if the callback attempts to acquire a lock which is already acquired.
> Similarly to being preempted, there should be no warning if the
> scheduling point is within a RCU read section.
^ permalink raw reply [flat|nested] 2+ messages in thread
end of thread, other threads:[~2022-07-08 13:54 UTC | newest]
Thread overview: 2+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-07-08 9:32 [PATCH v2] sched/core: Always flush pending blk_plug John Keeping
2022-07-08 13:53 ` Peter Zijlstra
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.