[PATCH v2] locking/rtmutex: Limit # of lock stealing for non-RT waiters

linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH v2] locking/rtmutex: Limit # of lock stealing for non-RT waiters
@ 2022-06-21 19:36 Waiman Long
  2022-06-21 19:38 ` Waiman Long
  2022-06-23 13:32 ` Sebastian Andrzej Siewior
  0 siblings, 2 replies; 6+ messages in thread
From: Waiman Long @ 2022-06-21 19:36 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Will Deacon, Boqun Feng
  Cc: linux-kernel, Thomas Gleixner, Sebastian Andrzej Siewior,
	Juri Lelli, Mike Stowell, Waiman Long

Commit 48eb3f4fcfd3 ("locking/rtmutex: Implement equal priority lock
stealing") allows unlimited number of lock stealing's for non-RT
tasks. That can lead to lock starvation of non-RT top waiter tasks if
there is a constant incoming stream of non-RT lockers. This can cause
task lockup in PREEMPT_RT kernel. For example,

[ 1249.921363] INFO: task systemd:2178 blocked for more than 622 seconds.
[ 1872.984225] INFO: task kworker/6:4:63401 blocked for more than 622 seconds.

Avoiding this problem and ensuring forward progress by limiting the
number of times that a lock can be stolen from each waiter. This patch
sets a threshold of 10. That number is arbitrary and can be changed
if needed.

With that change, the task lockups previously observed when running
stressful workloads on PREEMPT_RT kernel disappeared.

Fixes: 48eb3f4fcfd3 ("locking/rtmutex: Implement equal priority lock stealing")
Reported-by: Mike Stowell <mstowell@redhat.com>
Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/locking/rtmutex.c        | 9 ++++++---
 kernel/locking/rtmutex_common.h | 8 ++++++++
 2 files changed, 14 insertions(+), 3 deletions(-)

diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index 7779ee8abc2a..bdddb3dc36c2 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -359,10 +359,13 @@ static inline bool rt_mutex_steal(struct rt_mutex_waiter *waiter,
 	if (rt_prio(waiter->prio) || dl_prio(waiter->prio))
 		return false;
 
-	return rt_mutex_waiter_equal(waiter, top_waiter);
-#else
-	return false;
+	if (rt_mutex_waiter_equal(waiter, top_waiter) &&
+	   (top_waiter->nr_steals < RT_MUTEX_LOCK_STEAL_MAX)) {
+		top_waiter->nr_steals++;
+		return true;
+	}
 #endif
+	return false;
 }
 
 #define __node_2_waiter(node) \
diff --git a/kernel/locking/rtmutex_common.h b/kernel/locking/rtmutex_common.h
index c47e8361bfb5..5858efe5cb0e 100644
--- a/kernel/locking/rtmutex_common.h
+++ b/kernel/locking/rtmutex_common.h
@@ -26,6 +26,7 @@
  * @task:		task reference to the blocked task
  * @lock:		Pointer to the rt_mutex on which the waiter blocks
  * @wake_state:		Wakeup state to use (TASK_NORMAL or TASK_RTLOCK_WAIT)
+ * @nr_steals:		Number of times the lock is stolen
  * @prio:		Priority of the waiter
  * @deadline:		Deadline of the waiter if applicable
  * @ww_ctx:		WW context pointer
@@ -36,11 +37,17 @@ struct rt_mutex_waiter {
 	struct task_struct	*task;
 	struct rt_mutex_base	*lock;
 	unsigned int		wake_state;
+	unsigned int		nr_steals;
 	int			prio;
 	u64			deadline;
 	struct ww_acquire_ctx	*ww_ctx;
 };
 
+/*
+ * The maximum number of times where lock can be stolen per waiter.
+ */
+#define	RT_MUTEX_LOCK_STEAL_MAX	10
+
 /**
  * rt_wake_q_head - Wrapper around regular wake_q_head to support
  *		    "sleeping" spinlocks on RT
@@ -194,6 +201,7 @@ static inline void rt_mutex_init_waiter(struct rt_mutex_waiter *waiter)
 	RB_CLEAR_NODE(&waiter->tree_entry);
 	waiter->wake_state = TASK_NORMAL;
 	waiter->task = NULL;
+	waiter->nr_steals = 0;
 }
 
 static inline void rt_mutex_init_rtlock_waiter(struct rt_mutex_waiter *waiter)
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH v2] locking/rtmutex: Limit # of lock stealing for non-RT waiters
  2022-06-21 19:36 [PATCH v2] locking/rtmutex: Limit # of lock stealing for non-RT waiters Waiman Long
@ 2022-06-21 19:38 ` Waiman Long
  2022-06-23 13:32 ` Sebastian Andrzej Siewior
  1 sibling, 0 replies; 6+ messages in thread
From: Waiman Long @ 2022-06-21 19:38 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Will Deacon, Boqun Feng
  Cc: linux-kernel, Thomas Gleixner, Sebastian Andrzej Siewior,
	Juri Lelli, Mike Stowell

On 6/21/22 15:36, Waiman Long wrote:
> Commit 48eb3f4fcfd3 ("locking/rtmutex: Implement equal priority lock
> stealing") allows unlimited number of lock stealing's for non-RT
> tasks. That can lead to lock starvation of non-RT top waiter tasks if
> there is a constant incoming stream of non-RT lockers. This can cause
> task lockup in PREEMPT_RT kernel. For example,
>
> [ 1249.921363] INFO: task systemd:2178 blocked for more than 622 seconds.
> [ 1872.984225] INFO: task kworker/6:4:63401 blocked for more than 622 seconds.
>
> Avoiding this problem and ensuring forward progress by limiting the
> number of times that a lock can be stolen from each waiter. This patch
> sets a threshold of 10. That number is arbitrary and can be changed
> if needed.
>
> With that change, the task lockups previously observed when running
> stressful workloads on PREEMPT_RT kernel disappeared.
>
> Fixes: 48eb3f4fcfd3 ("locking/rtmutex: Implement equal priority lock stealing")
> Reported-by: Mike Stowell <mstowell@redhat.com>
> Signed-off-by: Waiman Long <longman@redhat.com>

There is no code change in v2. I have only updated the patch description.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH v2] locking/rtmutex: Limit # of lock stealing for non-RT waiters
  2022-06-21 19:36 [PATCH v2] locking/rtmutex: Limit # of lock stealing for non-RT waiters Waiman Long
  2022-06-21 19:38 ` Waiman Long
@ 2022-06-23 13:32 ` Sebastian Andrzej Siewior
  2022-06-23 14:41   ` Waiman Long
  1 sibling, 1 reply; 6+ messages in thread
From: Sebastian Andrzej Siewior @ 2022-06-23 13:32 UTC (permalink / raw)
  To: Waiman Long
  Cc: Peter Zijlstra, Ingo Molnar, Will Deacon, Boqun Feng,
	linux-kernel, Thomas Gleixner, Juri Lelli, Mike Stowell

On 2022-06-21 15:36:41 [-0400], Waiman Long wrote:
> Commit 48eb3f4fcfd3 ("locking/rtmutex: Implement equal priority lock
> stealing") allows unlimited number of lock stealing's for non-RT
> tasks. That can lead to lock starvation of non-RT top waiter tasks if
> there is a constant incoming stream of non-RT lockers. This can cause
> task lockup in PREEMPT_RT kernel. For example,
> 
> [ 1249.921363] INFO: task systemd:2178 blocked for more than 622 seconds.
> [ 1872.984225] INFO: task kworker/6:4:63401 blocked for more than 622 seconds.
> 
> Avoiding this problem and ensuring forward progress by limiting the
> number of times that a lock can be stolen from each waiter. This patch
> sets a threshold of 10. That number is arbitrary and can be changed
> if needed.
> 
> With that change, the task lockups previously observed when running
> stressful workloads on PREEMPT_RT kernel disappeared.

Do you have more insight on how this was tested/ created? Based on that,
systemd and a random kworker waited on a lock for more than 10 minutes.

I added a trace-printk each time a non-RT waiter got the lock stolen,
kicked a kernel build and a package upgrade and took a look at the stats
an hour later:
- sh got its lock stolen 3416 times. I didn't lock the pid so I can't
  look back and check how long it waited since the first time.
- the median average of stolen locks is 173.

> Fixes: 48eb3f4fcfd3 ("locking/rtmutex: Implement equal priority lock stealing")
> Reported-by: Mike Stowell <mstowell@redhat.com>
> Signed-off-by: Waiman Long <longman@redhat.com>

Sebastian

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH v2] locking/rtmutex: Limit # of lock stealing for non-RT waiters
  2022-06-23 13:32 ` Sebastian Andrzej Siewior
@ 2022-06-23 14:41   ` Waiman Long
  2022-06-24  6:42     ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 6+ messages in thread
From: Waiman Long @ 2022-06-23 14:41 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Peter Zijlstra, Ingo Molnar, Will Deacon, Boqun Feng,
	linux-kernel, Thomas Gleixner, Juri Lelli, Mike Stowell


On 6/23/22 09:32, Sebastian Andrzej Siewior wrote:
> On 2022-06-21 15:36:41 [-0400], Waiman Long wrote:
>> Commit 48eb3f4fcfd3 ("locking/rtmutex: Implement equal priority lock
>> stealing") allows unlimited number of lock stealing's for non-RT
>> tasks. That can lead to lock starvation of non-RT top waiter tasks if
>> there is a constant incoming stream of non-RT lockers. This can cause
>> task lockup in PREEMPT_RT kernel. For example,
>>
>> [ 1249.921363] INFO: task systemd:2178 blocked for more than 622 seconds.
>> [ 1872.984225] INFO: task kworker/6:4:63401 blocked for more than 622 seconds.
>>
>> Avoiding this problem and ensuring forward progress by limiting the
>> number of times that a lock can be stolen from each waiter. This patch
>> sets a threshold of 10. That number is arbitrary and can be changed
>> if needed.
>>
>> With that change, the task lockups previously observed when running
>> stressful workloads on PREEMPT_RT kernel disappeared.
> Do you have more insight on how this was tested/ created? Based on that,
> systemd and a random kworker waited on a lock for more than 10 minutes.

The hang happens when our QE team run thier kernel tier 1 test which, I 
think, lasts several hours. The hang happens in some runs but not all of 
them. So it is kind of opportunistic. Mike should be able to provide a 
better idea about frequency and so on.

>
> I added a trace-printk each time a non-RT waiter got the lock stolen,
> kicked a kernel build and a package upgrade and took a look at the stats
> an hour later:
> - sh got its lock stolen 3416 times. I didn't lock the pid so I can't
>    look back and check how long it waited since the first time.
> - the median average of stolen locks is 173.
Maybe we should also more lock stealing per waiter than the 10 that I 
used in the patch. I am open to suggestion to what is a good value to use.
>
>> Fixes: 48eb3f4fcfd3 ("locking/rtmutex: Implement equal priority lock stealing")
>> Reported-by: Mike Stowell <mstowell@redhat.com>
>> Signed-off-by: Waiman Long <longman@redhat.com>

Thanks for your time looking at the patch.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH v2] locking/rtmutex: Limit # of lock stealing for non-RT waiters
  2022-06-23 14:41   ` Waiman Long
@ 2022-06-24  6:42     ` Sebastian Andrzej Siewior
  2022-06-25  2:23       ` Waiman Long
  0 siblings, 1 reply; 6+ messages in thread
From: Sebastian Andrzej Siewior @ 2022-06-24  6:42 UTC (permalink / raw)
  To: Waiman Long
  Cc: Peter Zijlstra, Ingo Molnar, Will Deacon, Boqun Feng,
	linux-kernel, Thomas Gleixner, Juri Lelli, Mike Stowell

On 2022-06-23 10:41:17 [-0400], Waiman Long wrote:
> 
> On 6/23/22 09:32, Sebastian Andrzej Siewior wrote:
> > Do you have more insight on how this was tested/ created? Based on that,
> > systemd and a random kworker waited on a lock for more than 10 minutes.
> 
> The hang happens when our QE team run thier kernel tier 1 test which, I
> think, lasts several hours. The hang happens in some runs but not all of
> them. So it is kind of opportunistic. Mike should be able to provide a
> better idea about frequency and so on.

So we talk here about 64+ CPU or more than that?

> > I added a trace-printk each time a non-RT waiter got the lock stolen,
> > kicked a kernel build and a package upgrade and took a look at the stats
> > an hour later:
> > - sh got its lock stolen 3416 times. I didn't lock the pid so I can't
> >    look back and check how long it waited since the first time.
> > - the median average of stolen locks is 173.
> Maybe we should also more lock stealing per waiter than the 10 that I used
> in the patch. I am open to suggestion to what is a good value to use.

I have no idea either. I just looked at a run to see what the number
actually are. I have no numbers in terms of performance. So what most
likely happens is that on an unlock operation the waiter gets a wake-up
but before he gets a chance to acquire the lock, it is already taken and
he goes back to sleep again. While this looks painful it might be better
performance wise because the other task was able to acquire the lock
without waiting. But then it is not fair and this happens.
One thing that I'm curious about is, what lock is it (one or two global
hot spots or many). And how to benchmark this…

> > > Fixes: 48eb3f4fcfd3 ("locking/rtmutex: Implement equal priority lock stealing")
> > > Reported-by: Mike Stowell <mstowell@redhat.com>
> > > Signed-off-by: Waiman Long <longman@redhat.com>
> 
> Thanks for your time looking at the patch.

no problem.

> Cheers,
> Longman

Sebastian

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH v2] locking/rtmutex: Limit # of lock stealing for non-RT waiters
  2022-06-24  6:42     ` Sebastian Andrzej Siewior
@ 2022-06-25  2:23       ` Waiman Long
  0 siblings, 0 replies; 6+ messages in thread
From: Waiman Long @ 2022-06-25  2:23 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Peter Zijlstra, Ingo Molnar, Will Deacon, Boqun Feng,
	linux-kernel, Thomas Gleixner, Juri Lelli, Mike Stowell

On 6/24/22 02:42, Sebastian Andrzej Siewior wrote:
> On 2022-06-23 10:41:17 [-0400], Waiman Long wrote:
>> On 6/23/22 09:32, Sebastian Andrzej Siewior wrote:
>>> Do you have more insight on how this was tested/ created? Based on that,
>>> systemd and a random kworker waited on a lock for more than 10 minutes.
>> The hang happens when our QE team run thier kernel tier 1 test which, I
>> think, lasts several hours. The hang happens in some runs but not all of
>> them. So it is kind of opportunistic. Mike should be able to provide a
>> better idea about frequency and so on.
> So we talk here about 64+ CPU or more than that?

The bug report that I saw showed 2 hangs in two different one-socket 
4-core 8-thread systems. So they are not 64+ cpus. Of course, the more 
cpus that we have, the more chance we will get into trouble.


>
>>> I added a trace-printk each time a non-RT waiter got the lock stolen,
>>> kicked a kernel build and a package upgrade and took a look at the stats
>>> an hour later:
>>> - sh got its lock stolen 3416 times. I didn't lock the pid so I can't
>>>     look back and check how long it waited since the first time.
>>> - the median average of stolen locks is 173.
>> Maybe we should also more lock stealing per waiter than the 10 that I used
>> in the patch. I am open to suggestion to what is a good value to use.
> I have no idea either. I just looked at a run to see what the number
> actually are. I have no numbers in terms of performance. So what most
> likely happens is that on an unlock operation the waiter gets a wake-up
> but before he gets a chance to acquire the lock, it is already taken and
> he goes back to sleep again. While this looks painful it might be better
> performance wise because the other task was able to acquire the lock
> without waiting. But then it is not fair and this happens.
> One thing that I'm curious about is, what lock is it (one or two global
> hot spots or many). And how to benchmark this…

The hang can happen in different places. Like

[ 1873.205398] INFO: task pool-restraintd:122206 blocked for more than 
622 seconds.
[ 1873.205400]       Tainted: G        W  OE  X --------- --- 
5.14.0-104.rt21.104.el9.x86_64 #1
[ 1873.205401] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
disables this message.
[ 1873.205402] task:pool-restraintd state:D stack:    0 pid:122206 
ppid:     1 flags:0x00000000
[ 1873.205405] Call Trace:
[ 1873.205407]  __schedule+0x21c/0x5b0
[ 1873.205410]  ? migrate_enable+0xcd/0x140
[ 1873.205413]  schedule+0x46/0xc0
[ 1873.205415]  rt_mutex_slowlock_block.constprop.0+0x30/0x150
[ 1873.205418]  __rt_mutex_slowlock.constprop.0+0x6b/0xe0
[ 1873.205421]  __rwbase_read_lock+0x9c/0xe0
[ 1873.205425]  do_madvise+0x109/0x9c0
[ 1873.205429]  ? syscall_exit_to_user_mode+0x12/0x30
[ 1873.205432]  ? do_syscall_64+0x69/0x80
[ 1873.205434]  __x64_sys_madvise+0x28/0x30
[ 1873.205436]  do_syscall_64+0x5c/0x80
[ 1873.205438]  ? fpregs_restore_userregs+0x53/0x100
[ 1873.205442]  ? exit_to_user_mode_prepare+0xda/0xe0
[ 1873.205445]  ? syscall_exit_to_user_mode+0x12/0x30
[ 1873.205447]  ? do_syscall_64+0x69/0x80
[ 1873.205449]  ? do_syscall_64+0x69/0x80
[ 1873.205452]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 1873.205455] RIP: 0033:0x7f35f5a93f5b
[ 1873.205457] RSP: 002b:00007f35f4a48e18 EFLAGS: 00000206 ORIG_RAX: 
000000000000001c
[ 1873.205460] RAX: ffffffffffffffda RBX: 00007f35f4a49640 RCX: 
00007f35f5a93f5b
[ 1873.205461] RDX: 0000000000000004 RSI: 00000000007fb000 RDI: 
00007f35f4249000
[ 1873.205463] RBP: 00007f35f4249000 R08: 00007f35ec0026c0 R09: 
0000000000000005
[ 1873.205464] R10: 0000000000000008 R11: 0000000000000206 R12: 
00007f35f4a49640
[ 1873.205466] R13: 0000000000000000 R14: 00007f35f5af4560 R15: 
0000000000000000

or

[ 1872.559562] INFO: task auditd:926 blocked for more than 622 seconds.
[ 1872.559563]       Tainted: G        W        --------- --- 
5.14.0-103.rt21.103.el9.x86_64 #1
[ 1872.559563] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
disables this message.
[ 1872.559564] task:auditd          state:D stack:    0 pid:  926 
ppid:     1 flags:0x00000000
[ 1872.559565] Call Trace:
[ 1872.559566]  __schedule+0x21c/0x5b0
[ 1872.559568]  schedule+0x46/0xc0
[ 1872.559569]  rt_mutex_slowlock_block.constprop.0+0x30/0x150
[ 1872.559570]  __rt_mutex_slowlock.constprop.0+0x6b/0xe0
[ 1872.559572]  __rwbase_read_lock+0x9c/0xe0
[ 1872.559574]  xlog_cil_commit+0x30/0x190 [xfs]
[ 1872.559631]  __xfs_trans_commit+0xb0/0x320 [xfs]
[ 1872.559686]  xfs_vn_update_time+0x12c/0x240 [xfs]
[ 1872.559741]  file_update_time+0xea/0x140
[ 1872.559744]  ? rwbase_write_lock+0xde/0x1b0
[ 1872.559746]  xfs_file_write_checks+0x247/0x310 [xfs]
[ 1872.559799]  xfs_file_buffered_write+0x76/0x2c0 [xfs]
[ 1872.559852]  ? update_load_avg+0x80/0x7d0
[ 1872.559855]  new_sync_write+0x11f/0x1b0
[ 1872.559858]  vfs_write+0x25a/0x2f0
[ 1872.559860]  ksys_write+0x65/0xe0
[ 1872.559862]  do_syscall_64+0x5c/0x80
[ 1872.559864]  ? fpregs_restore_userregs+0x12/0x100
[ 1872.559866]  ? exit_to_user_mode_prepare+0xda/0xe0
[ 1872.559868]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 1872.559870] RIP: 0033:0x7f4bf7545d9f
[ 1872.559872] RSP: 002b:00007ffce95c2920 EFLAGS: 00000293 ORIG_RAX: 
0000000000000001
[ 1872.559874] RAX: ffffffffffffffda RBX: 0000000000000111 RCX: 
00007f4bf7545d9f
[ 1872.559874] RDX: 0000000000000111 RSI: 0000559015de18f0 RDI: 
0000000000000005
[ 1872.559875] RBP: 0000559015de18f0 R08: 0000000000000000 R09: 
00007f4bf75b74e0
[ 1872.559876] R10: 00007f4bf75b73e0 R11: 0000000000000293 R12: 
0000000000000111
[ 1872.559877] R13: 0000559015dd9310 R14: 0000000000000111 R15: 
00007f4bf75fb9e0

Since it is causing 25-50% of our test jobs to fail, we really want to 
get it fixed ASAP.

Regards,
Longman


^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2022-06-25  2:23 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-06-21 19:36 [PATCH v2] locking/rtmutex: Limit # of lock stealing for non-RT waiters Waiman Long
2022-06-21 19:38 ` Waiman Long
2022-06-23 13:32 ` Sebastian Andrzej Siewior
2022-06-23 14:41   ` Waiman Long
2022-06-24  6:42     ` Sebastian Andrzej Siewior
2022-06-25  2:23       ` Waiman Long

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).