linux-kernel.vger.kernel.org archive mirror
* [PATCH v3] sched/core: Always flush pending blk_plug
@ 2022-07-08 16:27 John Keeping
  2022-07-09 16:19 ` Paul Gortmaker
                   ` (3 more replies)
  0 siblings, 4 replies; 5+ messages in thread
From: John Keeping @ 2022-07-08 16:27 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-rt-users, John Keeping, Sebastian Andrzej Siewior,
	Peter Zijlstra, Thomas Gleixner, Steven Rostedt, Ingo Molnar,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider

With CONFIG_PREEMPT_RT, it is possible to hit a deadlock between two
normal priority tasks (SCHED_OTHER, nice level zero):

	INFO: task kworker/u8:0:8 blocked for more than 491 seconds.
	      Not tainted 5.15.49-rt46 #1
	"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
	task:kworker/u8:0    state:D stack:    0 pid:    8 ppid:     2 flags:0x00000000
	Workqueue: writeback wb_workfn (flush-7:0)
	[<c08a3a10>] (__schedule) from [<c08a3d84>] (schedule+0xdc/0x134)
	[<c08a3d84>] (schedule) from [<c08a65a0>] (rt_mutex_slowlock_block.constprop.0+0xb8/0x174)
	[<c08a65a0>] (rt_mutex_slowlock_block.constprop.0) from [<c08a6708>]
	+(rt_mutex_slowlock.constprop.0+0xac/0x174)
	[<c08a6708>] (rt_mutex_slowlock.constprop.0) from [<c0374d60>] (fat_write_inode+0x34/0x54)
	[<c0374d60>] (fat_write_inode) from [<c0297304>] (__writeback_single_inode+0x354/0x3ec)
	[<c0297304>] (__writeback_single_inode) from [<c0297998>] (writeback_sb_inodes+0x250/0x45c)
	[<c0297998>] (writeback_sb_inodes) from [<c0297c20>] (__writeback_inodes_wb+0x7c/0xb8)
	[<c0297c20>] (__writeback_inodes_wb) from [<c0297f24>] (wb_writeback+0x2c8/0x2e4)
	[<c0297f24>] (wb_writeback) from [<c0298c40>] (wb_workfn+0x1a4/0x3e4)
	[<c0298c40>] (wb_workfn) from [<c0138ab8>] (process_one_work+0x1fc/0x32c)
	[<c0138ab8>] (process_one_work) from [<c0139120>] (worker_thread+0x22c/0x2d8)
	[<c0139120>] (worker_thread) from [<c013e6e0>] (kthread+0x16c/0x178)
	[<c013e6e0>] (kthread) from [<c01000fc>] (ret_from_fork+0x14/0x38)
	Exception stack(0xc10e3fb0 to 0xc10e3ff8)
	3fa0:                                     00000000 00000000 00000000 00000000
	3fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
	3fe0: 00000000 00000000 00000000 00000000 00000013 00000000

	INFO: task tar:2083 blocked for more than 491 seconds.
	      Not tainted 5.15.49-rt46 #1
	"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
	task:tar             state:D stack:    0 pid: 2083 ppid:  2082 flags:0x00000000
	[<c08a3a10>] (__schedule) from [<c08a3d84>] (schedule+0xdc/0x134)
	[<c08a3d84>] (schedule) from [<c08a41b0>] (io_schedule+0x14/0x24)
	[<c08a41b0>] (io_schedule) from [<c08a455c>] (bit_wait_io+0xc/0x30)
	[<c08a455c>] (bit_wait_io) from [<c08a441c>] (__wait_on_bit_lock+0x54/0xa8)
	[<c08a441c>] (__wait_on_bit_lock) from [<c08a44f4>] (out_of_line_wait_on_bit_lock+0x84/0xb0)
	[<c08a44f4>] (out_of_line_wait_on_bit_lock) from [<c0371fb0>] (fat_mirror_bhs+0xa0/0x144)
	[<c0371fb0>] (fat_mirror_bhs) from [<c0372a68>] (fat_alloc_clusters+0x138/0x2a4)
	[<c0372a68>] (fat_alloc_clusters) from [<c0370b14>] (fat_alloc_new_dir+0x34/0x250)
	[<c0370b14>] (fat_alloc_new_dir) from [<c03787c0>] (vfat_mkdir+0x58/0x148)
	[<c03787c0>] (vfat_mkdir) from [<c0277b60>] (vfs_mkdir+0x68/0x98)
	[<c0277b60>] (vfs_mkdir) from [<c027b484>] (do_mkdirat+0xb0/0xec)
	[<c027b484>] (do_mkdirat) from [<c0100060>] (ret_fast_syscall+0x0/0x1c)
	Exception stack(0xc2e1bfa8 to 0xc2e1bff0)
	bfa0:                   01ee42f0 01ee4208 01ee42f0 000041ed 00000000 00004000
	bfc0: 01ee42f0 01ee4208 00000000 00000027 01ee4302 00000004 000dcb00 01ee4190
	bfe0: 000dc368 bed11924 0006d4b0 b6ebddfc

Here the kworker is waiting on msdos_sb_info::s_lock, which is held by
tar, which is in turn waiting on a buffer that is locked until it is
flushed; but that flush is plugged in the kworker itself.

The lock is a normal struct mutex, so tsk_is_pi_blocked() will always
return false on !RT and thus the behaviour changes for RT.

It seems that the intent here is to skip blk_flush_plug() in the case
where a non-preemptible lock (such as a spinlock) has been converted to
a rtmutex on RT, which is the case covered by the SM_RTLOCK_WAIT
schedule flag.  But sched_submit_work() is only called from schedule()
which is never called in this scenario, so the check can simply be
deleted.
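
For reference, the two entry points look roughly like this.  This is a
simplified sketch based on the description above, not the exact kernel
source: plain schedule() runs sched_submit_work() and therefore flushes
the plug, while the RT-only schedule_rtlock() used for spinlock/rwlock
waits calls __schedule(SM_RTLOCK_WAIT) directly and never reaches
sched_submit_work():

```c
/* simplified sketch, not the exact kernel code */
asmlinkage __visible void __sched schedule(void)
{
	sched_submit_work(current);	/* flushes any pending blk_plug */
	__schedule(SM_NONE);
	sched_update_worker(current);
}

#ifdef CONFIG_PREEMPT_RT
void __sched notrace schedule_rtlock(void)
{
	/*
	 * Blocking on an RT-converted spinlock/rwlock: no
	 * sched_submit_work(), so no plug flush can deadlock against
	 * a lock that is already held.
	 */
	__schedule(SM_RTLOCK_WAIT);
}
#endif
```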

Looking at the history of the -rt patchset, in fact this change was
present from v5.9.1-rt20 until being dropped in v5.13-rt1 as it was part
of a larger patch [1] most of which was replaced by commit b4bfa3fcfe3b
("sched/core: Rework the __schedule() preempt argument").

As described in [1]:

   The schedule process must distinguish between blocking on a regular
   sleeping lock (rwsem and mutex) and a RT-only sleeping lock (spinlock
   and rwlock):
   - rwsem and mutex must flush block requests (blk_schedule_flush_plug())
     even if blocked on a lock. This can not deadlock because this also
     happens for non-RT.
     There should be a warning if the scheduling point is within a RCU read
     section.

   - spinlock and rwlock must not flush block requests. This will deadlock
     if the callback attempts to acquire a lock which is already acquired.
     Similarly to being preempted, there should be no warning if the
     scheduling point is within a RCU read section.

and with the tsk_is_pi_blocked() in the scheduler path, we hit the first
issue.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git/tree/patches/0022-locking-rtmutex-Use-custom-scheduling-function-for-s.patch?h=linux-5.10.y-rt-patches

Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: John Keeping <john@metanate.com>
---
v3:
- Add SCHED_WARN_ON(current->__state & TASK_RTLOCK_WAIT) as suggested by
  Peter
v2:
- Add Steven's R-b and update the commit message with his suggested
  quote from [1]

 include/linux/sched/rt.h | 8 --------
 kernel/sched/core.c      | 8 ++++++--
 2 files changed, 6 insertions(+), 10 deletions(-)

diff --git a/include/linux/sched/rt.h b/include/linux/sched/rt.h
index e5af028c08b49..994c25640e156 100644
--- a/include/linux/sched/rt.h
+++ b/include/linux/sched/rt.h
@@ -39,20 +39,12 @@ static inline struct task_struct *rt_mutex_get_top_task(struct task_struct *p)
 }
 extern void rt_mutex_setprio(struct task_struct *p, struct task_struct *pi_task);
 extern void rt_mutex_adjust_pi(struct task_struct *p);
-static inline bool tsk_is_pi_blocked(struct task_struct *tsk)
-{
-	return tsk->pi_blocked_on != NULL;
-}
 #else
 static inline struct task_struct *rt_mutex_get_top_task(struct task_struct *task)
 {
 	return NULL;
 }
 # define rt_mutex_adjust_pi(p)		do { } while (0)
-static inline bool tsk_is_pi_blocked(struct task_struct *tsk)
-{
-	return false;
-}
 #endif
 
 extern void normalize_rt_tasks(void);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1d4660a1915b3..71d6385ece83f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6578,8 +6578,12 @@ static inline void sched_submit_work(struct task_struct *tsk)
 			io_wq_worker_sleeping(tsk);
 	}
 
-	if (tsk_is_pi_blocked(tsk))
-		return;
+	/*
+	 * spinlock and rwlock must not flush block requests.  This will
+	 * deadlock if the callback attempts to acquire a lock which is
+	 * already acquired.
+	 */
+	SCHED_WARN_ON(current->__state & TASK_RTLOCK_WAIT);
 
 	/*
 	 * If we are going to sleep and we have plugged IO queued,
-- 
2.37.0



* Re: [PATCH v3] sched/core: Always flush pending blk_plug
  2022-07-08 16:27 [PATCH v3] sched/core: Always flush pending blk_plug John Keeping
@ 2022-07-09 16:19 ` Paul Gortmaker
       [not found] ` <20220710010136.2510-1-hdanton@sina.com>
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 5+ messages in thread
From: Paul Gortmaker @ 2022-07-09 16:19 UTC (permalink / raw)
  To: John Keeping
  Cc: linux-kernel, linux-rt-users, Sebastian Andrzej Siewior,
	Peter Zijlstra, Thomas Gleixner, Steven Rostedt, Ingo Molnar,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider

[[PATCH v3] sched/core: Always flush pending blk_plug] On 08/07/2022 (Fri 17:27) John Keeping wrote:

> With CONFIG_PREEMPT_RT, it is possible to hit a deadlock between two
> normal priority tasks (SCHED_OTHER, nice level zero):
> 
> 	INFO: task kworker/u8:0:8 blocked for more than 491 seconds.
> 	      Not tainted 5.15.49-rt46 #1
> 	"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> 	task:kworker/u8:0    state:D stack:    0 pid:    8 ppid:     2 flags:0x00000000
> 	Workqueue: writeback wb_workfn (flush-7:0)

This is great.  I was on-and-off looking at an ext4 deadlock triggered
by LTP's "ftest01" - and was just putting my notes together this morning
to clarify for myself what I thought was going on before I forgot.

Which is good, because I could then recognize the very similar patterns
in what you saw to what I'd observed/deduced.  Otherwise I'd probably not
have had your backtraces grab my attention at all.

Long story short, I applied this to v5.19-rc3-rt5 and it resolves the
LTP/ext4 deadlock issue.  So...

Tested-by: Paul Gortmaker <paul.gortmaker@windriver.com>

As I'd already put together my ext4 debug info, I'll include it below
in case it helps future people searching for LTP and ftest01.

Thanks,
Paul.

	-----------------


There is a reproducible deadlock in ext4 on preempt-rt, triggered by
LTP's "ftest01".

Reproducer details are at the bottom.  What seems to happen is as follows:

The kworker doing writeback wb_workfn (pid 213 below) stalls on the 
one and only down_write in ext4_map_blocks() and never moves again.

      /*
       * New blocks allocate and/or writing to unwritten extent
       * will possibly result in updating i_data, so we take
       * the write lock of i_data_sem, and call get_block()
       * with create == 1 flag.
       */
      down_write(&EXT4_I(inode)->i_data_sem);

So who is blocking the kworker?  There are two ftest01 processes
(pid 818 and pid 821) doing an ext4 truncate that looks like this:

  ext4_truncate+0x46a/0x660
   --> ext4_ext_truncate+0xd6/0xf0
       --> ext4_ext_remove_space+0x59f/0x1830
           --> jbd2_journal_get_write_access+0x51/0x80
               --> do_get_write_access+0x2ad/0x550
                   --> complete_all+0x60/0x60
                       --> bit_wait_io+0x11/0x60

They too are stalled in D state waiting there forever - presumably
waiting on the kworker to make the progress they need.

Here is where I think lockdep shows us the problem:

  5 locks held by ftest01/818:
   [...]
   #4: ffff957c2a679958 (&ei->i_data_sem){++++}-{3:3}, at: ext4_truncate+0x443/0x660

  5 locks held by ftest01/821:
   [...]
   #4: ffff957c2a67bb78 (&ei->i_data_sem){++++}-{3:3}, at: ext4_truncate+0x443/0x660

Yes - that is the one the kworker was trying the down_write() on.

Also note that the bit_wait is a big telltale sign here - there is only
one in jbd2's do_get_write_access() function:

      /*
       * There is one case we have to be very careful about.  If the
       * committing transaction is currently writing this buffer out to disk
       * and has NOT made a copy-out, then we cannot modify the buffer
       * contents at all right now.  The essence of copy-out is that it is
       * the extra copy, not the primary copy, which gets journaled.  If the
       * primary copy is already going to disk then we cannot do copy-out
       * here.
       */
      if (buffer_shadow(bh)) {
              JBUFFER_TRACE(jh, "on shadow: sleep");
              spin_unlock(&jh->b_state_lock);
              wait_on_bit_io(&bh->b_state, BH_Shadow, TASK_UNINTERRUPTIBLE);
              goto repeat;
      }

Presumably this is a corner case that preempt-rt makes easier to trigger?

The full trace with interspersed lockdep info follows.

Various detailed reproducer information follows that.

	-----------------------

sysrq-w backtrace information interspersed with sysrq-d lockdep info
on a 5.19-rc2-rt3 kernel:

 task:kworker/u129:34 state:D stack:    0 pid:  213 ppid:     2 flags:0x00004000
 Workqueue: writeback wb_workfn (flush-7:0)
 Call Trace:
  <TASK>
  __schedule+0x329/0xae0
  ? __this_cpu_preempt_check+0x13/0x20
  ? __rt_mutex_slowlock.isra.21.constprop.25+0xc0e/0xc40
  schedule+0x48/0xc0
  __rt_mutex_slowlock.isra.21.constprop.25+0x1b8/0xc40
  rwbase_write_lock+0x98/0x6e0
  down_write+0x46/0x50
  ? ext4_map_blocks+0x156/0x5c0
  ext4_map_blocks+0x156/0x5c0
  ext4_writepages+0xa1a/0x1290
  ? __this_cpu_preempt_check+0x13/0x20
  ? lockdep_hardirqs_on+0xcd/0x150
  do_writepages+0xd2/0x1a0
  __writeback_single_inode+0x64/0x850
  writeback_sb_inodes+0x22d/0x5b0
  __writeback_inodes_wb+0x67/0xe0
  wb_writeback+0x269/0x5f0
  ? get_nr_inodes+0x49/0x70
  wb_workfn+0x43d/0x780
  ? __this_cpu_preempt_check+0x13/0x20
  process_one_work+0x286/0x710
  worker_thread+0x3c/0x3d0
  ? process_one_work+0x710/0x710
  kthread+0x13b/0x150
  ? kthread_complete_and_exit+0x20/0x20
  ret_from_fork+0x1f/0x30
  </TASK>

  6 locks held by kworker/u129:34/213:
   #0: ffff957c25e9ef28 ((wq_completion)writeback){+.+.}-{0:0}, at: process_one_work+0x201/0x710
   #1: ffffb7e9c3797e68 ((work_completion)(&(&wb->dwork)->work)){+.+.}-{0:0}, at: process_one_work+0x201/0x710
   #2: ffff957b9c9b70d0 (&type->s_umount_key#34){++++}-{3:3}, at: trylock_super+0x1b/0x50
   #3: ffff957b8a9b2d18 (&sbi->s_writepages_rwsem){.+.+}-{0:0}, at: do_writepages+0xd2/0x1a0
   #4: ffff957b9c9b5be8 (jbd2_handle){++++}-{0:0}, at: start_this_handle+0x13e/0x5f0
   #5: ffff957c2a679958 (&ei->i_data_sem){++++}-{3:3}, at: ext4_map_blocks+0x156/0x5c0


 task:jbd2/loop0-8    state:D stack:    0 pid:  679 ppid:     2 flags:0x00004000
 Call Trace:
  <TASK>
  __schedule+0x329/0xae0
  schedule+0x48/0xc0
  jbd2_journal_wait_updates+0x7c/0xf0
  ? schedstat_stop+0x10/0x10
  jbd2_journal_commit_transaction+0x284/0x20e0
  ? __this_cpu_preempt_check+0x13/0x20
  ? find_held_lock+0x35/0xa0
  ? _raw_spin_unlock_irqrestore+0x86/0x8f
  ? _raw_spin_unlock_irqrestore+0x86/0x8f
  ? try_to_del_timer_sync+0x53/0x80
  ? trace_hardirqs_on+0x3b/0x120
  ? try_to_del_timer_sync+0x53/0x80
  kjournald2+0xd5/0x2b0
  ? schedstat_stop+0x10/0x10
  ? commit_timeout+0x20/0x20
  kthread+0x13b/0x150
  ? kthread_complete_and_exit+0x20/0x20
  ret_from_fork+0x1f/0x30
  </TASK>


 task:ftest01         state:D stack:    0 pid:  817 ppid:   816 flags:0x00004002
 Call Trace:
  <TASK>
  __schedule+0x329/0xae0
  ? __x64_sys_tee+0xd0/0xd0
  ? __x64_sys_tee+0xd0/0xd0
  schedule+0x48/0xc0
  wb_wait_for_completion+0x62/0x90
  ? schedstat_stop+0x10/0x10
  sync_inodes_sb+0xdd/0x460
  ? __this_cpu_preempt_check+0x13/0x20
  ? __x64_sys_tee+0xd0/0xd0
  sync_inodes_one_sb+0x15/0x20
  iterate_supers+0x94/0x100
  ksys_sync+0x42/0xa0
  __do_sys_sync+0xe/0x20
  do_syscall_64+0x3f/0x90
  entry_SYSCALL_64_after_hwframe+0x46/0xb0
 RIP: 0033:0x7f3f9dae2cdb
 RSP: 002b:00007ffe06430318 EFLAGS: 00000246 ORIG_RAX: 00000000000000a2
 RAX: ffffffffffffffda RBX: 0000000000000800 RCX: 00007f3f9dae2cdb
 RDX: 0000000000000040 RSI: 0000561f57de52f0 RDI: 0000561f57de5340
 RBP: 00000000000ff000 R08: 0000000000000000 R09: 00007f3f9dbce0a0
 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
 R13: 0000000000000207 R14: 0000000000000049 R15: 0000000000000003
  </TASK>

  1 lock held by ftest01/817:
   #0: ffff957b9c9b70d0 (&type->s_umount_key#34){++++}-{3:3}, at: iterate_supers+0x7d/0x100


 task:ftest01         state:D stack:    0 pid:  818 ppid:   816 flags:0x00004002
 Call Trace:
  <TASK>
  __schedule+0x329/0xae0
  ? bit_wait+0x60/0x60
  schedule+0x48/0xc0
  io_schedule+0x16/0x40
  bit_wait_io+0x11/0x60
  __wait_on_bit_lock+0x5e/0xb0
  out_of_line_wait_on_bit_lock+0x93/0xb0
  ? complete_all+0x60/0x60
  __lock_buffer+0x3f/0x50
  do_get_write_access+0x2ad/0x550
  jbd2_journal_get_write_access+0x51/0x80
  __ext4_journal_get_write_access+0xf9/0x1a0
  ext4_ext_get_access.isra.32+0x34/0x50
  ext4_ext_remove_space+0x59f/0x1830
  ? ext4_ext_truncate+0xa2/0xf0
  ? ext4_ext_truncate+0x5e/0xf0
  ext4_ext_truncate+0xd6/0xf0
  ext4_truncate+0x46a/0x660
  ext4_setattr+0x6db/0xa40
  notify_change+0x3e5/0x540
  do_truncate+0x7b/0xc0
  ? do_truncate+0x7b/0xc0
  do_sys_ftruncate+0x164/0x2f0
  __x64_sys_ftruncate+0x1b/0x20
  do_syscall_64+0x3f/0x90
  entry_SYSCALL_64_after_hwframe+0x46/0xb0
 RIP: 0033:0x7f3f9dae3f7b
 RSP: 002b:00007ffe06430318 EFLAGS: 00000202 ORIG_RAX: 000000000000004d
 RAX: ffffffffffffffda RBX: 00000000000001a5 RCX: 00007f3f9dae3f7b
 RDX: 00000000000001a5 RSI: 00000000000d2800 RDI: 0000000000000003
 RBP: 00000000000001a5 R08: 00007f3f9dbce090 R09: 00007f3f9dbce0a0
 R10: 0000000000000008 R11: 0000000000000202 R12: 0000000000000000
 R13: 0000000000000207 R14: 0000000000000043 R15: 0000000000000003
  </TASK>

  5 locks held by ftest01/818:
   #0: ffff957b9c9b74e8 (sb_writers#3){.+.+}-{0:0}, at: __x64_sys_ftruncate+0x1b/0x20
   #1: ffff957c2a679b00 (&sb->s_type->i_mutex_key#10){+.+.}-{3:3}, at: do_truncate+0x6a/0xc0
   #2: ffff957c2a679cc0 (mapping.invalidate_lock){++++}-{3:3}, at: ext4_setattr+0x39a/0xa40
   #3: ffff957b9c9b5be8 (jbd2_handle){++++}-{0:0}, at: start_this_handle+0x13e/0x5f0
   #4: ffff957c2a679958 (&ei->i_data_sem){++++}-{3:3}, at: ext4_truncate+0x443/0x660


 task:ftest01         state:D stack:    0 pid:  819 ppid:   816 flags:0x00004002
 Call Trace:
  <TASK>
  __schedule+0x329/0xae0
  ? __x64_sys_tee+0xd0/0xd0
  ? __x64_sys_tee+0xd0/0xd0
  schedule+0x48/0xc0
  wb_wait_for_completion+0x62/0x90
  ? schedstat_stop+0x10/0x10
  sync_inodes_sb+0xdd/0x460
  ? __this_cpu_preempt_check+0x13/0x20
  ? __x64_sys_tee+0xd0/0xd0
  sync_inodes_one_sb+0x15/0x20
  iterate_supers+0x94/0x100
  ksys_sync+0x42/0xa0
  __do_sys_sync+0xe/0x20
  do_syscall_64+0x3f/0x90
  entry_SYSCALL_64_after_hwframe+0x46/0xb0
 RIP: 0033:0x7f3f9dae2cdb
 RSP: 002b:00007ffe06430318 EFLAGS: 00000246 ORIG_RAX: 00000000000000a2
 RAX: ffffffffffffffda RBX: 0000000000000800 RCX: 00007f3f9dae2cdb
 RDX: 0000000000000040 RSI: 0000561f57de52f0 RDI: 0000561f57de5340
 RBP: 00000000000c2800 R08: 00000000fffff800 R09: 0000000000004000
 R10: 0000000000000010 R11: 0000000000000246 R12: 0000000000000000
 R13: 0000000000000207 R14: 0000000000000075 R15: 0000000000000003
  </TASK>

  1 lock held by ftest01/819:
   #0: ffff957b9c9b70d0 (&type->s_umount_key#34){++++}-{3:3}, at: iterate_supers+0x7d/0x100


 task:ftest01         state:D stack:    0 pid:  820 ppid:   816 flags:0x00004002
 Call Trace:
  <TASK>
  __schedule+0x329/0xae0
  ? __x64_sys_tee+0xd0/0xd0
  ? __x64_sys_tee+0xd0/0xd0
  schedule+0x48/0xc0
  wb_wait_for_completion+0x62/0x90
  ? schedstat_stop+0x10/0x10
  sync_inodes_sb+0xdd/0x460
  ? __this_cpu_preempt_check+0x13/0x20
  ? __x64_sys_tee+0xd0/0xd0
  sync_inodes_one_sb+0x15/0x20
  iterate_supers+0x94/0x100
  ksys_sync+0x42/0xa0
  __do_sys_sync+0xe/0x20
  do_syscall_64+0x3f/0x90
  entry_SYSCALL_64_after_hwframe+0x46/0xb0
 RIP: 0033:0x7f3f9dae2cdb
 RSP: 002b:00007ffe06430318 EFLAGS: 00000246 ORIG_RAX: 00000000000000a2
 RAX: ffffffffffffffda RBX: 0000000000000800 RCX: 00007f3f9dae2cdb
 RDX: 0000000000000040 RSI: 0000561f57de52f0 RDI: 0000561f57de5340
 RBP: 000000000006a800 R08: 00000000fffff800 R09: 0000000000004000
 R10: 0000000000000010 R11: 0000000000000246 R12: 0000000000000000
 R13: 0000000000000207 R14: 0000000000000069 R15: 0000000000000003
  </TASK>

  1 lock held by ftest01/820:
   #0: ffff957b9c9b70d0 (&type->s_umount_key#34){++++}-{3:3}, at: iterate_supers+0x7d/0x100


 task:ftest01         state:D stack:    0 pid:  821 ppid:   816 flags:0x00000002
 Call Trace:
  <TASK>
  __schedule+0x329/0xae0
  ? bit_wait+0x60/0x60
  schedule+0x48/0xc0
  io_schedule+0x16/0x40
  bit_wait_io+0x11/0x60
  __wait_on_bit_lock+0x5e/0xb0
  out_of_line_wait_on_bit_lock+0x93/0xb0
  ? complete_all+0x60/0x60
  __lock_buffer+0x3f/0x50
  do_get_write_access+0x2ad/0x550
  jbd2_journal_get_write_access+0x51/0x80
  __ext4_journal_get_write_access+0xf9/0x1a0
  ext4_ext_get_access.isra.32+0x34/0x50
  ext4_ext_remove_space+0x59f/0x1830
  ? ext4_ext_truncate+0xa2/0xf0
  ? ext4_ext_truncate+0x5e/0xf0
  ext4_ext_truncate+0xd6/0xf0
  ext4_truncate+0x46a/0x660
  ext4_setattr+0x6db/0xa40
  notify_change+0x3e5/0x540
  do_truncate+0x7b/0xc0
  ? do_truncate+0x7b/0xc0
  do_sys_ftruncate+0x164/0x2f0
  __x64_sys_ftruncate+0x1b/0x20
  do_syscall_64+0x3f/0x90
  entry_SYSCALL_64_after_hwframe+0x46/0xb0
 RIP: 0033:0x7f3f9dae3f7b
 RSP: 002b:00007ffe06430318 EFLAGS: 00000202 ORIG_RAX: 000000000000004d
 RAX: ffffffffffffffda RBX: 000000000000016a RCX: 00007f3f9dae3f7b
 RDX: 000000000000016a RSI: 00000000000b5000 RDI: 0000000000000003
 RBP: 000000000000016a R08: 00007f3f9dbce080 R09: 00007f3f9dbce0a0
 R10: 0000000000000002 R11: 0000000000000202 R12: 0000000000000000
 R13: 0000000000000207 R14: 0000000000000040 R15: 0000000000000003
  </TASK>

  5 locks held by ftest01/821:
   #0: ffff957b9c9b74e8 (sb_writers#3){.+.+}-{0:0}, at: __x64_sys_ftruncate+0x1b/0x20
   #1: ffff957c2a67bd20 (&sb->s_type->i_mutex_key#10){+.+.}-{3:3}, at: do_truncate+0x6a/0xc0
   #2: ffff957c2a67bee0 (mapping.invalidate_lock){++++}-{3:3}, at: ext4_setattr+0x39a/0xa40
   #3: ffff957b9c9b5be8 (jbd2_handle){++++}-{0:0}, at: start_this_handle+0x13e/0x5f0
   #4: ffff957c2a67bb78 (&ei->i_data_sem){++++}-{3:3}, at: ext4_truncate+0x443/0x660

		--------------------------------------

Reproducer:

Get latest preempt-rt kernel:
	git://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git
	checkout v5.19-rc3-rt5 (or linux-5.19.y-rt branch)
	make defconfig
	manually enable CONFIG_EXPERT and then CONFIG_PREEMPT_RT
	make oldconfig and take defaults for other options hidden behind EXPERT
	can optionally enable LOCKDEP options so sysrq-d works.

Get LTP ; build + install
	https://github.com/linux-test-project/ltp.git

Boot and then run reproducer:
   dd if=/dev/zero of=/root/tmp_mount_file bs=1M count=512
   mkfs.ext4 -F /root/tmp_mount_file
   mount /root/tmp_mount_file /mnt/
   /opt/ltp/runltp -f fs -s ftest01 -d /mnt/
   umount /mnt

   The loop device isn't strictly necessary, but it contains the fs
   fallout when the inevitable reboot is required to get rid of D-state
   processes.

   The test completes on v5.10-rt and v5.12-rt - it should take about
   2-3m total.  The test hangs forever on v5.15-rt and the latest
   v5.19-rc-rt, with the test threads in D state - any other I/O
   directed at /mnt also hangs in D state.


* Re: [PATCH v3] sched/core: Always flush pending blk_plug
       [not found] ` <20220710010136.2510-1-hdanton@sina.com>
@ 2022-07-12 18:29   ` Paul Gortmaker
  0 siblings, 0 replies; 5+ messages in thread
From: Paul Gortmaker @ 2022-07-12 18:29 UTC (permalink / raw)
  To: Hillf Danton
  Cc: John Keeping, linux-kernel, linux-mm, Peter Zijlstra,
	Thomas Gleixner, Mel Gorman, Theodore Ts'o

[Re: [PATCH v3] sched/core: Always flush pending blk_plug] On 10/07/2022 (Sun 09:01) Hillf Danton wrote:

> Looks like a ABBA deadlock that should be fed to lockdep, given what was
> blocking flusher (pid 213).

[...]

> Make lock_buffer have mutex_lock() semantics in attempt to catch the
> deadlock above.
> 
> Only for thoughts now.

Thanks for the test patch - I'm keeping it as a reference for future
use.

So, preempt-rt issues an early lockdep splat at partition probe bootup,
both with and without the original v3 patch from this thread.

Of course, I then figured I'd try the debug patch on vanilla v5.19-rc3,
and we get pretty much the same lockdep complaint.


   sd 1:0:0:0: Attached scsi generic sg1 type 0
   sd 1:0:0:0: [sdb] Preferred minimum I/O size 512 bytes
   scsi 2:0:0:0: CD-ROM            HL-DT-ST DVD-ROM DH30N    A101 PQ: 0 ANSI: 5

   =====================================
   WARNING: bad unlock balance detected!
   5.19.0-rc3-dirty #2 Not tainted
   -------------------------------------
   swapper/2/0 is trying to release lock (buffer_head_lock) at:
   [<ffffffffbc494a6b>] end_buffer_async_read+0x5b/0x180
   but there are no more locks to release!
   
   other info that might help us debug this:
   1 lock held by swapper/2/0:
    #0: ffff8bee27744080 (&ret->b_uptodate_lock){....}-{2:2}, at: end_buffer_async_read+0x47/0x180
   
   stack backtrace:
   CPU: 2 PID: 0 Comm: swapper/2 Not tainted 5.19.0-rc3-dirty #2
   Hardware name: Dell Inc. Precision WorkStation T5500  /0D883F, BIOS A16 05/28/2013
   Call Trace:
    <IRQ>
    dump_stack_lvl+0x40/0x5c
    lock_release+0x245/0x3f0
    unlock_buffer+0x15/0x30
    end_buffer_async_read+0x5b/0x180
    end_bio_bh_io_sync+0x1e/0x40
    blk_update_request+0x9a/0x470
    scsi_end_request+0x27/0x190
    scsi_io_completion+0x3c/0x580
    blk_complete_reqs+0x39/0x50
    __do_softirq+0x11d/0x344
    irq_exit_rcu+0xa9/0xc0
    common_interrupt+0xa5/0xc0
    </IRQ>
    <TASK>
    asm_common_interrupt+0x27/0x40
   RIP: 0010:cpuidle_enter_state+0x12d/0x3f0
   Code: 49 43 0f 84 b7 02 00 00 31 ff e8 fe 1c 74 ff 45 84 ff 74 12 9c 58 f6 c4 02 0f 85 8e 02 00 00 31 ff e8 97 95 7a ff fb 45 85 f6 <0f> 88 12 01 00 00 49 63 d6 4c 2b 24 24 48 8d 04 52 48 8d 04 82 49
   RSP: 0000:ffffa2890013fe90 EFLAGS: 00000206
   RAX: 0000000000000000 RBX: 0000000000000003 RCX: 0000000000000002
   RDX: 0000000080000002 RSI: ffffffffbd78d911 RDI: 00000000ffffffff
   RBP: ffff8bee2739a400 R08: 0000000000000000 R09: 00000000001faf40
   R10: ffffa2890013fdc8 R11: 0000000000000000 R12: 00000000a7b149d6
   R13: ffffffffbdbee3a0 R14: 0000000000000003 R15: 0000000000000001
    cpuidle_enter+0x24/0x40
    do_idle+0x1e3/0x230
    cpu_startup_entry+0x14/0x20
    start_secondary+0xe8/0xf0
    secondary_startup_64_no_verify+0xe0/0xeb
    </TASK>
    sda: sda1 sda2 sda3
   sd 0:0:0:0: [sda] Attached SCSI disk
    sdb: sdb1 sdb2
   sd 1:0:0:0: [sdb] Attached SCSI disk

Not quite sure what to make of that.

Paul.
--


* [tip: sched/core] sched/core: Always flush pending blk_plug
  2022-07-08 16:27 [PATCH v3] sched/core: Always flush pending blk_plug John Keeping
  2022-07-09 16:19 ` Paul Gortmaker
       [not found] ` <20220710010136.2510-1-hdanton@sina.com>
@ 2022-07-14 11:35 ` tip-bot2 for John Keeping
  2022-07-25  8:56 ` [PATCH v3] " Sebastian Andrzej Siewior
  3 siblings, 0 replies; 5+ messages in thread
From: tip-bot2 for John Keeping @ 2022-07-14 11:35 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: John Keeping, Peter Zijlstra (Intel), Steven Rostedt (Google),
	x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     401e4963bf45c800e3e9ea0d3a0289d738005fd4
Gitweb:        https://git.kernel.org/tip/401e4963bf45c800e3e9ea0d3a0289d738005fd4
Author:        John Keeping <john@metanate.com>
AuthorDate:    Fri, 08 Jul 2022 17:27:02 +01:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Wed, 13 Jul 2022 11:29:17 +02:00

sched/core: Always flush pending blk_plug

With CONFIG_PREEMPT_RT, it is possible to hit a deadlock between two
normal priority tasks (SCHED_OTHER, nice level zero):

	INFO: task kworker/u8:0:8 blocked for more than 491 seconds.
	      Not tainted 5.15.49-rt46 #1
	"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
	task:kworker/u8:0    state:D stack:    0 pid:    8 ppid:     2 flags:0x00000000
	Workqueue: writeback wb_workfn (flush-7:0)
	[<c08a3a10>] (__schedule) from [<c08a3d84>] (schedule+0xdc/0x134)
	[<c08a3d84>] (schedule) from [<c08a65a0>] (rt_mutex_slowlock_block.constprop.0+0xb8/0x174)
	[<c08a65a0>] (rt_mutex_slowlock_block.constprop.0) from [<c08a6708>]
	+(rt_mutex_slowlock.constprop.0+0xac/0x174)
	[<c08a6708>] (rt_mutex_slowlock.constprop.0) from [<c0374d60>] (fat_write_inode+0x34/0x54)
	[<c0374d60>] (fat_write_inode) from [<c0297304>] (__writeback_single_inode+0x354/0x3ec)
	[<c0297304>] (__writeback_single_inode) from [<c0297998>] (writeback_sb_inodes+0x250/0x45c)
	[<c0297998>] (writeback_sb_inodes) from [<c0297c20>] (__writeback_inodes_wb+0x7c/0xb8)
	[<c0297c20>] (__writeback_inodes_wb) from [<c0297f24>] (wb_writeback+0x2c8/0x2e4)
	[<c0297f24>] (wb_writeback) from [<c0298c40>] (wb_workfn+0x1a4/0x3e4)
	[<c0298c40>] (wb_workfn) from [<c0138ab8>] (process_one_work+0x1fc/0x32c)
	[<c0138ab8>] (process_one_work) from [<c0139120>] (worker_thread+0x22c/0x2d8)
	[<c0139120>] (worker_thread) from [<c013e6e0>] (kthread+0x16c/0x178)
	[<c013e6e0>] (kthread) from [<c01000fc>] (ret_from_fork+0x14/0x38)
	Exception stack(0xc10e3fb0 to 0xc10e3ff8)
	3fa0:                                     00000000 00000000 00000000 00000000
	3fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
	3fe0: 00000000 00000000 00000000 00000000 00000013 00000000

	INFO: task tar:2083 blocked for more than 491 seconds.
	      Not tainted 5.15.49-rt46 #1
	"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
	task:tar             state:D stack:    0 pid: 2083 ppid:  2082 flags:0x00000000
	[<c08a3a10>] (__schedule) from [<c08a3d84>] (schedule+0xdc/0x134)
	[<c08a3d84>] (schedule) from [<c08a41b0>] (io_schedule+0x14/0x24)
	[<c08a41b0>] (io_schedule) from [<c08a455c>] (bit_wait_io+0xc/0x30)
	[<c08a455c>] (bit_wait_io) from [<c08a441c>] (__wait_on_bit_lock+0x54/0xa8)
	[<c08a441c>] (__wait_on_bit_lock) from [<c08a44f4>] (out_of_line_wait_on_bit_lock+0x84/0xb0)
	[<c08a44f4>] (out_of_line_wait_on_bit_lock) from [<c0371fb0>] (fat_mirror_bhs+0xa0/0x144)
	[<c0371fb0>] (fat_mirror_bhs) from [<c0372a68>] (fat_alloc_clusters+0x138/0x2a4)
	[<c0372a68>] (fat_alloc_clusters) from [<c0370b14>] (fat_alloc_new_dir+0x34/0x250)
	[<c0370b14>] (fat_alloc_new_dir) from [<c03787c0>] (vfat_mkdir+0x58/0x148)
	[<c03787c0>] (vfat_mkdir) from [<c0277b60>] (vfs_mkdir+0x68/0x98)
	[<c0277b60>] (vfs_mkdir) from [<c027b484>] (do_mkdirat+0xb0/0xec)
	[<c027b484>] (do_mkdirat) from [<c0100060>] (ret_fast_syscall+0x0/0x1c)
	Exception stack(0xc2e1bfa8 to 0xc2e1bff0)
	bfa0:                   01ee42f0 01ee4208 01ee42f0 000041ed 00000000 00004000
	bfc0: 01ee42f0 01ee4208 00000000 00000027 01ee4302 00000004 000dcb00 01ee4190
	bfe0: 000dc368 bed11924 0006d4b0 b6ebddfc

Here the kworker is waiting on msdos_sb_info::s_lock, which is held by
tar, which is in turn waiting on a buffer that is locked until it is
flushed; but that flush is plugged in the kworker itself.

The lock is a normal struct mutex, so tsk_is_pi_blocked() will always
return false on !RT and thus the behaviour changes for RT.

It seems that the intent here is to skip blk_flush_plug() in the case
where a non-preemptible lock (such as a spinlock) has been converted to
a rtmutex on RT, which is the case covered by the SM_RTLOCK_WAIT
schedule flag.  But sched_submit_work() is only called from schedule()
which is never called in this scenario, so the check can simply be
deleted.

Looking at the history of the -rt patchset, in fact this change was
present from v5.9.1-rt20 until being dropped in v5.13-rt1 as it was part
of a larger patch [1] most of which was replaced by commit b4bfa3fcfe3b
("sched/core: Rework the __schedule() preempt argument").

As described in [1]:

   The schedule process must distinguish between blocking on a regular
   sleeping lock (rwsem and mutex) and a RT-only sleeping lock (spinlock
   and rwlock):
   - rwsem and mutex must flush block requests (blk_schedule_flush_plug())
     even if blocked on a lock. This can not deadlock because this also
     happens for non-RT.
     There should be a warning if the scheduling point is within a RCU read
     section.

   - spinlock and rwlock must not flush block requests. This will deadlock
     if the callback attempts to acquire a lock which is already acquired.
     Similarly to being preempted, there should be no warning if the
     scheduling point is within a RCU read section.

and with the tsk_is_pi_blocked() in the scheduler path, we hit the first
issue.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git/tree/patches/0022-locking-rtmutex-Use-custom-scheduling-function-for-s.patch?h=linux-5.10.y-rt-patches

Signed-off-by: John Keeping <john@metanate.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Link: https://lkml.kernel.org/r/20220708162702.1758865-1-john@metanate.com
---
 include/linux/sched/rt.h | 8 --------
 kernel/sched/core.c      | 8 ++++++--
 2 files changed, 6 insertions(+), 10 deletions(-)

diff --git a/include/linux/sched/rt.h b/include/linux/sched/rt.h
index e5af028..994c256 100644
--- a/include/linux/sched/rt.h
+++ b/include/linux/sched/rt.h
@@ -39,20 +39,12 @@ static inline struct task_struct *rt_mutex_get_top_task(struct task_struct *p)
 }
 extern void rt_mutex_setprio(struct task_struct *p, struct task_struct *pi_task);
 extern void rt_mutex_adjust_pi(struct task_struct *p);
-static inline bool tsk_is_pi_blocked(struct task_struct *tsk)
-{
-	return tsk->pi_blocked_on != NULL;
-}
 #else
 static inline struct task_struct *rt_mutex_get_top_task(struct task_struct *task)
 {
 	return NULL;
 }
 # define rt_mutex_adjust_pi(p)		do { } while (0)
-static inline bool tsk_is_pi_blocked(struct task_struct *tsk)
-{
-	return false;
-}
 #endif
 
 extern void normalize_rt_tasks(void);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c703d17..a463dbc 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6470,8 +6470,12 @@ static inline void sched_submit_work(struct task_struct *tsk)
 			io_wq_worker_sleeping(tsk);
 	}
 
-	if (tsk_is_pi_blocked(tsk))
-		return;
+	/*
+	 * spinlock and rwlock must not flush block requests.  This will
+	 * deadlock if the callback attempts to acquire a lock which is
+	 * already acquired.
+	 */
+	SCHED_WARN_ON(current->__state & TASK_RTLOCK_WAIT);
 
 	/*
 	 * If we are going to sleep and we have plugged IO queued,


* Re: [PATCH v3] sched/core: Always flush pending blk_plug
  2022-07-08 16:27 [PATCH v3] sched/core: Always flush pending blk_plug John Keeping
                   ` (2 preceding siblings ...)
  2022-07-14 11:35 ` [tip: sched/core] " tip-bot2 for John Keeping
@ 2022-07-25  8:56 ` Sebastian Andrzej Siewior
  3 siblings, 0 replies; 5+ messages in thread
From: Sebastian Andrzej Siewior @ 2022-07-25  8:56 UTC (permalink / raw)
  To: John Keeping
  Cc: linux-kernel, linux-rt-users, Peter Zijlstra, Thomas Gleixner,
	Steven Rostedt, Ingo Molnar, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider

On 2022-07-08 17:27:02 [+0100], John Keeping wrote:
> Looking at the history of the -rt patchset, in fact this change was
> present from v5.9.1-rt20 until being dropped in v5.13-rt1 as it was part
> of a larger patch [1] most of which was replaced by commit b4bfa3fcfe3b
> ("sched/core: Rework the __schedule() preempt argument").

Thanks for spotting that. The plan was to get rid of this with the
introduction of schedule_rtlock() but it somehow remained resilient.

Sebastian



Thread overview: 5+ messages
2022-07-08 16:27 [PATCH v3] sched/core: Always flush pending blk_plug John Keeping
2022-07-09 16:19 ` Paul Gortmaker
     [not found] ` <20220710010136.2510-1-hdanton@sina.com>
2022-07-12 18:29   ` Paul Gortmaker
2022-07-14 11:35 ` [tip: sched/core] " tip-bot2 for John Keeping
2022-07-25  8:56 ` [PATCH v3] " Sebastian Andrzej Siewior
