* [patch 0/2] locking/rtmutex: Cure two subtle bugs
@ 2021-08-25 10:33 Thomas Gleixner
  2021-08-25 10:33 ` [patch 1/2] locking/rtmutex: Don't dereference waiter lockless Thomas Gleixner
                   ` (3 more replies)
  0 siblings, 4 replies; 10+ messages in thread
From: Thomas Gleixner @ 2021-08-25 10:33 UTC (permalink / raw)
  To: LKML
  Cc: Peter Zijlstra, Ingo Molnar, Steven Rostedt, Waiman Long,
	Sebastian Andrzej Siewior, Davidlohr Bueso

The recent updates to rtmutex introduced two subtle bugs:

  1) The spinwait mechanism added a UAF which triggers a BUG_ON()

  2) The ww_mutex addition leaves a waiter queued in the error exit path
     resulting in rbtree corruption

The fixes are straightforward, but the rtmutex based ww_mutex
implementation still has some more rough edges which need to be ironed out.

Thanks,

	tglx
---
 rtmutex.c        |   12 +++++++++---
 rtmutex_common.h |   13 +++++++++++++
 2 files changed, 22 insertions(+), 3 deletions(-)


* [patch 1/2] locking/rtmutex: Don't dereference waiter lockless
  2021-08-25 10:33 [patch 0/2] locking/rtmutex: Cure two subtle bugs Thomas Gleixner
@ 2021-08-25 10:33 ` Thomas Gleixner
  2021-08-25 14:17   ` [tip: locking/core] " tip-bot2 for Thomas Gleixner
  2021-08-25 10:33 ` [patch 2/2] locking/rtmutex: Dequeue waiter on ww_mutex deadlock Thomas Gleixner
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 10+ messages in thread
From: Thomas Gleixner @ 2021-08-25 10:33 UTC (permalink / raw)
  To: LKML
  Cc: Peter Zijlstra, Ingo Molnar, Steven Rostedt, Waiman Long,
	Sebastian Andrzej Siewior, Davidlohr Bueso

The new rtmutex_spin_on_owner() loop checks whether the spinning waiter is
still the top waiter on the lock by utilizing rt_mutex_top_waiter(), which
is broken because that function contains a sanity check which dereferences
the top waiter pointer to check whether the waiter belongs to the
lock. That's wrong in the lockless spinwait case:

 CPU 0							CPU 1
 rt_mutex_lock(lock)					rt_mutex_lock(lock)
   queue(waiter0)
   waiter0 == rt_mutex_top_waiter(lock)
   rtmutex_spin_on_owner(lock, waiter0) {		queue(waiter1)
   					 		waiter1 == rt_mutex_top_waiter(lock)
   							...
     top_waiter = rt_mutex_top_waiter(lock)
       leftmost = rb_first_cached(&lock->waiters);
							-> signal
							dequeue(waiter1)
							destroy(waiter1)
       w = rb_entry(leftmost, ....)
       BUG_ON(w->lock != lock)	 <- UAF

The BUG_ON() is correct for the case where the caller holds lock->wait_lock
which guarantees that the leftmost waiter entry cannot vanish. For the
lockless spinwait case it's broken.
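
For reference, the dereference happens in the sanity check of
rt_mutex_top_waiter(); a sketch of its shape at the time, reconstructed
around the fragment visible in the diff below (the exact body may differ):

  static inline struct rt_mutex_waiter *rt_mutex_top_waiter(struct rt_mutex_base *lock)
  {
  	struct rb_node *leftmost = rb_first_cached(&lock->waiters);
  	struct rt_mutex_waiter *w = NULL;

  	if (leftmost) {
  		w = rb_entry(leftmost, struct rt_mutex_waiter, tree_entry);
  		/* Dereferences w; the UAF when the entry vanished concurrently */
  		BUG_ON(w->lock != lock);
  	}
  	return w;
  }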

Create a new helper function which avoids the pointer dereference and just
compares the leftmost entry pointer with current's waiter pointer to
validate that current is still eligible for spinning.

Fixes: 992caf7f1724 ("locking/rtmutex: Add adaptive spinwait mechanism")
Reported-by: Sebastian Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/locking/rtmutex.c        |    5 +++--
 kernel/locking/rtmutex_common.h |   13 +++++++++++++
 2 files changed, 16 insertions(+), 2 deletions(-)

--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -1329,8 +1329,9 @@ static bool rtmutex_spin_on_owner(struct
 		 *    for CONFIG_PREEMPT_RCU=y)
 		 *  - the VCPU on which owner runs is preempted
 		 */
-		if (!owner->on_cpu || waiter != rt_mutex_top_waiter(lock) ||
-		    need_resched() || vcpu_is_preempted(task_cpu(owner))) {
+		if (!owner->on_cpu || need_resched() ||
+		    !rt_mutex_waiter_is_top_waiter(lock, waiter) ||
+		    vcpu_is_preempted(task_cpu(owner))) {
 			res = false;
 			break;
 		}
--- a/kernel/locking/rtmutex_common.h
+++ b/kernel/locking/rtmutex_common.h
@@ -95,6 +95,19 @@ static inline int rt_mutex_has_waiters(s
 	return !RB_EMPTY_ROOT(&lock->waiters.rb_root);
 }
 
+/*
+ * Lockless speculative check whether @waiter is still the top waiter on
+ * @lock. This is solely comparing pointers and not dereferencing the
+ * leftmost entry which might be about to vanish.
+ */
+static inline bool rt_mutex_waiter_is_top_waiter(struct rt_mutex_base *lock,
+						 struct rt_mutex_waiter *waiter)
+{
+	struct rb_node *leftmost = rb_first_cached(&lock->waiters);
+
+	return rb_entry(leftmost, struct rt_mutex_waiter, tree_entry) == waiter;
+}
+
 static inline struct rt_mutex_waiter *rt_mutex_top_waiter(struct rt_mutex_base *lock)
 {
 	struct rb_node *leftmost = rb_first_cached(&lock->waiters);



* [patch 2/2] locking/rtmutex: Dequeue waiter on ww_mutex deadlock
  2021-08-25 10:33 [patch 0/2] locking/rtmutex: Cure two subtle bugs Thomas Gleixner
  2021-08-25 10:33 ` [patch 1/2] locking/rtmutex: Don't dereference waiter lockless Thomas Gleixner
@ 2021-08-25 10:33 ` Thomas Gleixner
  2021-08-25 14:17   ` [tip: locking/core] " tip-bot2 for Thomas Gleixner
  2021-08-25 11:40 ` [patch 0/2] locking/rtmutex: Cure two subtle bugs Peter Zijlstra
  2021-08-26 13:26 ` Peter Zijlstra
  3 siblings, 1 reply; 10+ messages in thread
From: Thomas Gleixner @ 2021-08-25 10:33 UTC (permalink / raw)
  To: LKML
  Cc: Peter Zijlstra, Ingo Molnar, Steven Rostedt, Waiman Long,
	Sebastian Andrzej Siewior, Davidlohr Bueso

The rt_mutex based ww_mutex variant queues the new waiter first in the
lock's rbtree before evaluating the ww_mutex specific conditions which
might decide that the waiter should back out. This check and conditional
exit happen before the waiter is enqueued into the PI chain.

The failure handling at the call site assumes that the waiter, if it is the
topmost waiter on the lock, is queued in the PI chain and then proceeds to
adjust the unmodified PI chain, which results in rbtree corruption.
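
For context, a condensed sketch of that ordering in
task_blocks_on_rt_mutex(), with the wait_lock/pi_lock handling and other
details elided (see the diff below for the actual code):

  	waiter->lock = lock;
  	rt_mutex_enqueue(lock, waiter);	/* queued in lock->waiters rbtree */
  	task->pi_blocked_on = waiter;

  	if (build_ww_mutex() && ww_ctx) {
  		res = __ww_mutex_add_waiter(waiter, rtm, ww_ctx);
  		if (res)
  			return res;	/* bug: waiter is still queued */
  	}

  	/* Only below this point is the waiter linked into the PI chain */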

Dequeue the waiter from the lock waiter list in the ww_mutex error exit
path to prevent this.

Fixes: add461325ec5 ("locking/rtmutex: Extend the rtmutex core to support ww_mutex")
Reported-by: Sebastian Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/locking/rtmutex.c |    7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -1082,8 +1082,13 @@ static int __sched task_blocks_on_rt_mut
 		/* Check whether the waiter should back out immediately */
 		rtm = container_of(lock, struct rt_mutex, rtmutex);
 		res = __ww_mutex_add_waiter(waiter, rtm, ww_ctx);
-		if (res)
+		if (res) {
+			raw_spin_lock(&task->pi_lock);
+			rt_mutex_dequeue(lock, waiter);
+			task->pi_blocked_on = NULL;
+			raw_spin_unlock(&task->pi_lock);
 			return res;
+		}
 	}
 
 	if (!owner)



* Re: [patch 0/2] locking/rtmutex: Cure two subtle bugs
  2021-08-25 10:33 [patch 0/2] locking/rtmutex: Cure two subtle bugs Thomas Gleixner
  2021-08-25 10:33 ` [patch 1/2] locking/rtmutex: Don't dereference waiter lockless Thomas Gleixner
  2021-08-25 10:33 ` [patch 2/2] locking/rtmutex: Dequeue waiter on ww_mutex deadlock Thomas Gleixner
@ 2021-08-25 11:40 ` Peter Zijlstra
  2021-08-26 13:26 ` Peter Zijlstra
  3 siblings, 0 replies; 10+ messages in thread
From: Peter Zijlstra @ 2021-08-25 11:40 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Ingo Molnar, Steven Rostedt, Waiman Long,
	Sebastian Andrzej Siewior, Davidlohr Bueso

On Wed, Aug 25, 2021 at 12:33:11PM +0200, Thomas Gleixner wrote:
> The recent updates to rtmutex introduced two subtle bugs:
> 
>   1) The spinwait mechanism added a UAF which triggers a BUG_ON()
> 
>   2) The ww_mutex addition leaves a waiter queued in the error exit path
>      resulting in rbtree corruption
> 
> The fixes are straightforward, but the rtmutex based ww_mutex
> implementation still has some more rough edges which need to be ironed out.

Thanks!


* [tip: locking/core] locking/rtmutex: Dequeue waiter on ww_mutex deadlock
  2021-08-25 10:33 ` [patch 2/2] locking/rtmutex: Dequeue waiter on ww_mutex deadlock Thomas Gleixner
@ 2021-08-25 14:17   ` tip-bot2 for Thomas Gleixner
  0 siblings, 0 replies; 10+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2021-08-25 14:17 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Sebastian Siewior, Thomas Gleixner, Peter Zijlstra (Intel),
	x86, linux-kernel

The following commit has been merged into the locking/core branch of tip:

Commit-ID:     37e8abff2bebbf9947d6b784f5c75ed48a717089
Gitweb:        https://git.kernel.org/tip/37e8abff2bebbf9947d6b784f5c75ed48a717089
Author:        Thomas Gleixner <tglx@linutronix.de>
AuthorDate:    Wed, 25 Aug 2021 12:33:14 +02:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Wed, 25 Aug 2021 15:42:33 +02:00

locking/rtmutex: Dequeue waiter on ww_mutex deadlock

The rt_mutex based ww_mutex variant queues the new waiter first in the
lock's rbtree before evaluating the ww_mutex specific conditions which
might decide that the waiter should back out. This check and conditional
exit happen before the waiter is enqueued into the PI chain.

The failure handling at the call site assumes that the waiter, if it is the
topmost waiter on the lock, is queued in the PI chain and then proceeds to
adjust the unmodified PI chain, which results in rbtree corruption.

Dequeue the waiter from the lock waiter list in the ww_mutex error exit
path to prevent this.

Fixes: add461325ec5 ("locking/rtmutex: Extend the rtmutex core to support ww_mutex")
Reported-by: Sebastian Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210825102454.042280541@linutronix.de
---
 kernel/locking/rtmutex.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index b3c0961..c8fe74e 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -1082,8 +1082,13 @@ static int __sched task_blocks_on_rt_mutex(struct rt_mutex_base *lock,
 		/* Check whether the waiter should back out immediately */
 		rtm = container_of(lock, struct rt_mutex, rtmutex);
 		res = __ww_mutex_add_waiter(waiter, rtm, ww_ctx);
-		if (res)
+		if (res) {
+			raw_spin_lock(&task->pi_lock);
+			rt_mutex_dequeue(lock, waiter);
+			task->pi_blocked_on = NULL;
+			raw_spin_unlock(&task->pi_lock);
 			return res;
+		}
 	}
 
 	if (!owner)


* [tip: locking/core] locking/rtmutex: Don't dereference waiter lockless
  2021-08-25 10:33 ` [patch 1/2] locking/rtmutex: Don't dereference waiter lockless Thomas Gleixner
@ 2021-08-25 14:17   ` tip-bot2 for Thomas Gleixner
  0 siblings, 0 replies; 10+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2021-08-25 14:17 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Sebastian Siewior, Thomas Gleixner, Peter Zijlstra (Intel),
	x86, linux-kernel

The following commit has been merged into the locking/core branch of tip:

Commit-ID:     c3123c431447da99db160264506de9897c003513
Gitweb:        https://git.kernel.org/tip/c3123c431447da99db160264506de9897c003513
Author:        Thomas Gleixner <tglx@linutronix.de>
AuthorDate:    Wed, 25 Aug 2021 12:33:12 +02:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Wed, 25 Aug 2021 15:42:32 +02:00

locking/rtmutex: Don't dereference waiter lockless

The new rtmutex_spin_on_owner() loop checks whether the spinning waiter is
still the top waiter on the lock by utilizing rt_mutex_top_waiter(), which
is broken because that function contains a sanity check which dereferences
the top waiter pointer to check whether the waiter belongs to the
lock. That's wrong in the lockless spinwait case:

 CPU 0							CPU 1
 rt_mutex_lock(lock)					rt_mutex_lock(lock)
   queue(waiter0)
   waiter0 == rt_mutex_top_waiter(lock)
   rtmutex_spin_on_owner(lock, waiter0) {		queue(waiter1)
   					 		waiter1 == rt_mutex_top_waiter(lock)
   							...
     top_waiter = rt_mutex_top_waiter(lock)
       leftmost = rb_first_cached(&lock->waiters);
							-> signal
							dequeue(waiter1)
							destroy(waiter1)
       w = rb_entry(leftmost, ....)
       BUG_ON(w->lock != lock)	 <- UAF

The BUG_ON() is correct for the case where the caller holds lock->wait_lock
which guarantees that the leftmost waiter entry cannot vanish. For the
lockless spinwait case it's broken.

Create a new helper function which avoids the pointer dereference and just
compares the leftmost entry pointer with current's waiter pointer to
validate that current is still eligible for spinning.

Fixes: 992caf7f1724 ("locking/rtmutex: Add adaptive spinwait mechanism")
Reported-by: Sebastian Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210825102453.981720644@linutronix.de
---
 kernel/locking/rtmutex.c        |  5 +++--
 kernel/locking/rtmutex_common.h | 13 +++++++++++++
 2 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index 8aaa352..b3c0961 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -1329,8 +1329,9 @@ static bool rtmutex_spin_on_owner(struct rt_mutex_base *lock,
 		 *    for CONFIG_PREEMPT_RCU=y)
 		 *  - the VCPU on which owner runs is preempted
 		 */
-		if (!owner->on_cpu || waiter != rt_mutex_top_waiter(lock) ||
-		    need_resched() || vcpu_is_preempted(task_cpu(owner))) {
+		if (!owner->on_cpu || need_resched() ||
+		    !rt_mutex_waiter_is_top_waiter(lock, waiter) ||
+		    vcpu_is_preempted(task_cpu(owner))) {
 			res = false;
 			break;
 		}
diff --git a/kernel/locking/rtmutex_common.h b/kernel/locking/rtmutex_common.h
index 61256de..c47e836 100644
--- a/kernel/locking/rtmutex_common.h
+++ b/kernel/locking/rtmutex_common.h
@@ -95,6 +95,19 @@ static inline int rt_mutex_has_waiters(struct rt_mutex_base *lock)
 	return !RB_EMPTY_ROOT(&lock->waiters.rb_root);
 }
 
+/*
+ * Lockless speculative check whether @waiter is still the top waiter on
+ * @lock. This is solely comparing pointers and not dereferencing the
+ * leftmost entry which might be about to vanish.
+ */
+static inline bool rt_mutex_waiter_is_top_waiter(struct rt_mutex_base *lock,
+						 struct rt_mutex_waiter *waiter)
+{
+	struct rb_node *leftmost = rb_first_cached(&lock->waiters);
+
+	return rb_entry(leftmost, struct rt_mutex_waiter, tree_entry) == waiter;
+}
+
 static inline struct rt_mutex_waiter *rt_mutex_top_waiter(struct rt_mutex_base *lock)
 {
 	struct rb_node *leftmost = rb_first_cached(&lock->waiters);


* Re: [patch 0/2] locking/rtmutex: Cure two subtle bugs
  2021-08-25 10:33 [patch 0/2] locking/rtmutex: Cure two subtle bugs Thomas Gleixner
                   ` (2 preceding siblings ...)
  2021-08-25 11:40 ` [patch 0/2] locking/rtmutex: Cure two subtle bugs Peter Zijlstra
@ 2021-08-26 13:26 ` Peter Zijlstra
  2021-08-27  7:56   ` Sebastian Andrzej Siewior
                     ` (2 more replies)
  3 siblings, 3 replies; 10+ messages in thread
From: Peter Zijlstra @ 2021-08-26 13:26 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Ingo Molnar, Steven Rostedt, Waiman Long,
	Sebastian Andrzej Siewior, Davidlohr Bueso

On Wed, Aug 25, 2021 at 12:33:11PM +0200, Thomas Gleixner wrote:
> The fixes are straightforward, but the rtmutex based ww_mutex
> implementation still has some more rough edges which need to be ironed out.

Something like the below (which should be two patches).

There are two issues at hand:

 - 'spurious' -EDEADLK reports due to prior loops
 - actual deadlock detection that will/should be resolved by ww_mutex
   instead.

Both stem from the fact that ww_mutex can legitimately create cycles in
the lock graph. The first issue is because a blocker (P3 below) can get
trapped in a cycle it didn't cause and hit the iteration depth.
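
Sketching the blocking graph of that scenario (my reading of the P1/P2/P3
example in the comment below):

  P1: holds A, holds ww_A, blocks on ww_B --.
  P2: holds ww_B,          blocks on ww_A --'  closes the cycle P1 -> P2 -> P1
  P3:                      blocks on A, held by P1; the PI walk started by
                           P3 keeps circling the P1/P2 cycle it didn't cause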

The latter can be observed when we suppose P2 to be the highest priority
and thus should win progress rights over P1, but P2 would also be 'late'
and be the one to close the cycle. In that case PI-walk would report
-EDEADLK to P2, even though w/w would pick and wake P1 to receive
-EDEADLK.

Now; afaict these conditions below capture the above, but then fail to
adequately handle more complex locking chains where, for example, two
w/w classes are inverted, which is a genuine deadlock, which the below
will suppress.

Luckily w/w stuff is in-kernel only and lockdep should capture them; any
user chains should be unaffected.

famous last words etc..

---
 kernel/locking/rtmutex.c | 40 +++++++++++++++++++++++++++++++++++++++-
 1 file changed, 39 insertions(+), 1 deletion(-)

diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index c8fe74ef8db9..b7145d724b5f 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -656,6 +656,31 @@ static int __sched rt_mutex_adjust_prio_chain(struct task_struct *task,
 	if (next_lock != waiter->lock)
 		goto out_unlock_pi;
 
+	/*
+	 * There could be 'spurious' loops in the lock graph due to ww_mutex,
+	 * consider:
+	 *
+	 *   P1: A, ww_A, ww_B
+	 *   P2: ww_B, ww_A
+	 *   P3: A
+	 *
+	 * P3 should not return -EDEADLK because it gets trapped in the cycle
+	 * created by P1 and P2 (which will resolve -- and runs into
+	 * max_lock_depth above). Therefore disable detect_deadlock such that
+	 * the below termination condition can trigger once all relevant tasks
+	 * are boosted.
+	 *
+	 * Even when we start with ww_mutex we can disable deadlock detection,
+	 * since we would suppress a ww_mutex induced deadlock at [6] anyway.
+	 * Suppressing it here however is not sufficient since we might still
+	 * hit [6] due to adjustment driven iteration.
+	 *
+	 * NOTE: if someone were to create a deadlock between 2 ww_classes we'd
+	 * utterly fail to report it; lockdep should.
+	 */
+	if (waiter->ww_ctx && detect_deadlock)
+		detect_deadlock = false;
+
 	/*
 	 * Drop out, when the task has no waiters. Note,
 	 * top_waiter can be NULL, when we are in the deboosting
@@ -717,8 +742,21 @@ static int __sched rt_mutex_adjust_prio_chain(struct task_struct *task,
 	 * walk, we detected a deadlock.
 	 */
 	if (lock == orig_lock || rt_mutex_owner(lock) == top_task) {
-		raw_spin_unlock(&lock->wait_lock);
 		ret = -EDEADLK;
+
+		/*
+		 * When the deadlock is due to ww_mutex; also see above. Don't
+		 * report the deadlock and instead let the ww_mutex wound/die
+		 * logic pick which of the contending threads gets -EDEADLK.
+		 *
+		 * NOTE: assumes the cycle only contains a single ww_class; any
+		 * other configuration and we fail to report; also, see
+		 * lockdep.
+		 */
+		if (orig_waiter->ww_ctx)
+			ret = 0;
+
+		raw_spin_unlock(&lock->wait_lock);
 		goto out_unlock_pi;
 	}
 


* Re: [patch 0/2] locking/rtmutex: Cure two subtle bugs
  2021-08-26 13:26 ` Peter Zijlstra
@ 2021-08-27  7:56   ` Sebastian Andrzej Siewior
  2021-08-27 12:31   ` [tip: locking/core] locking/rtmutex: Return success on deadlock for ww_mutex waiters tip-bot2 for Peter Zijlstra
  2021-08-27 12:31   ` [tip: locking/core] locking/rtmutex: Prevent spurious EDEADLK return caused by ww_mutexes tip-bot2 for Peter Zijlstra
  2 siblings, 0 replies; 10+ messages in thread
From: Sebastian Andrzej Siewior @ 2021-08-27  7:56 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, LKML, Ingo Molnar, Steven Rostedt, Waiman Long,
	Davidlohr Bueso

On 2021-08-26 15:26:36 [+0200], Peter Zijlstra wrote:
> On Wed, Aug 25, 2021 at 12:33:11PM +0200, Thomas Gleixner wrote:
> > The fixes are straightforward, but the rtmutex based ww_mutex
> > implementation still has some more rough edges which need to be ironed out.
> 
> Something like the below (which should be two patches).

tglx suggested to put an #ifdef CONFIG_PREEMPT_RT around the two checks
since they are only needed by ww_mutex on RT. The patch below is what I
committed to my RT tree.

From: Peter Zijlstra <peterz@infradead.org>
Date: Thu, 26 Aug 2021 16:39:51 +0200
Subject: [PATCH] locking/rtmutex: Let ww-mutex handle deadlock detection.

There are two issues at hand:

 - 'spurious' -EDEADLK reports due to prior loops
 - actual deadlock detection that will/should be resolved by ww_mutex
   instead.

Both stem from the fact that ww_mutex can legitimately create cycles in
the lock graph. The first issue is because a blocker (P3 below) can get
trapped in a cycle it didn't cause and hit the iteration depth.

The latter can be observed when we suppose P2 to be the highest priority
and thus should win progress rights over P1, but P2 would also be 'late'
and be the one to close the cycle. In that case PI-walk would report
-EDEADLK to P2, even though w/w would pick and wake P1 to receive
-EDEADLK.

Now; afaict these conditions below capture the above, but then fail to
adequately handle more complex locking chains where, for example, two
w/w classes are inverted, which is a genuine deadlock, which the below
will suppress.

Luckily w/w stuff is in-kernel only and lockdep should capture them; any
user chains should be unaffected.

famous last words etc..

Link: https://lkml.kernel.org/r/YSeWjCHoK4v5OcOt@hirez.programming.kicks-ass.net
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 kernel/locking/rtmutex.c | 43 +++++++++++++++++++++++++++++++++++++++-
 1 file changed, 42 insertions(+), 1 deletion(-)

diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index 8b14f9a958415..884d4dd9d81e2 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -656,6 +656,33 @@ static int __sched rt_mutex_adjust_prio_chain(struct task_struct *task,
 	if (next_lock != waiter->lock)
 		goto out_unlock_pi;
 
+#ifdef CONFIG_PREEMPT_RT
+	/*
+	 * There could be 'spurious' loops in the lock graph due to ww_mutex,
+	 * consider:
+	 *
+	 *   P1: A, ww_A, ww_B
+	 *   P2: ww_B, ww_A
+	 *   P3: A
+	 *
+	 * P3 should not return -EDEADLK because it gets trapped in the cycle
+	 * created by P1 and P2 (which will resolve -- and runs into
+	 * max_lock_depth above). Therefore disable detect_deadlock such that
+	 * the below termination condition can trigger once all relevant tasks
+	 * are boosted.
+	 *
+	 * Even when we start with ww_mutex we can disable deadlock detection,
+	 * since we would suppress a ww_mutex induced deadlock at [6] anyway.
+	 * Suppressing it here however is not sufficient since we might still
+	 * hit [6] due to adjustment driven iteration.
+	 *
+	 * NOTE: if someone were to create a deadlock between 2 ww_classes we'd
+	 * utterly fail to report it; lockdep should.
+	 */
+	if (waiter->ww_ctx && detect_deadlock)
+		detect_deadlock = false;
+#endif
+
 	/*
 	 * Drop out, when the task has no waiters. Note,
 	 * top_waiter can be NULL, when we are in the deboosting
@@ -717,8 +744,22 @@ static int __sched rt_mutex_adjust_prio_chain(struct task_struct *task,
 	 * walk, we detected a deadlock.
 	 */
 	if (lock == orig_lock || rt_mutex_owner(lock) == top_task) {
-		raw_spin_unlock(&lock->wait_lock);
 		ret = -EDEADLK;
+
+#ifdef CONFIG_PREEMPT_RT
+		/*
+		 * When the deadlock is due to ww_mutex; also see above. Don't
+		 * report the deadlock and instead let the ww_mutex wound/die
+		 * logic pick which of the contending threads gets -EDEADLK.
+		 *
+		 * NOTE: assumes the cycle only contains a single ww_class; any
+		 * other configuration and we fail to report; also, see
+		 * lockdep.
+		 */
+		if (orig_waiter->ww_ctx)
+			ret = 0;
+#endif
+		raw_spin_unlock(&lock->wait_lock);
 		goto out_unlock_pi;
 	}
 
-- 
2.33.0



* [tip: locking/core] locking/rtmutex: Return success on deadlock for ww_mutex waiters
  2021-08-26 13:26 ` Peter Zijlstra
  2021-08-27  7:56   ` Sebastian Andrzej Siewior
@ 2021-08-27 12:31   ` tip-bot2 for Peter Zijlstra
  2021-08-27 12:31   ` [tip: locking/core] locking/rtmutex: Prevent spurious EDEADLK return caused by ww_mutexes tip-bot2 for Peter Zijlstra
  2 siblings, 0 replies; 10+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2021-08-27 12:31 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Sebastian Siewior, Peter Zijlstra (Intel),
	Thomas Gleixner, x86, linux-kernel

The following commit has been merged into the locking/core branch of tip:

Commit-ID:     a055fcc132d4c25b96d1115aea514258810dc6fc
Gitweb:        https://git.kernel.org/tip/a055fcc132d4c25b96d1115aea514258810dc6fc
Author:        Peter Zijlstra <peterz@infradead.org>
AuthorDate:    Thu, 26 Aug 2021 10:48:18 +02:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Fri, 27 Aug 2021 14:28:49 +02:00

locking/rtmutex: Return success on deadlock for ww_mutex waiters

ww_mutexes can legitimately cause a deadlock situation in the lock graph
which is resolved afterwards by the wait/wound mechanics. The rtmutex chain
walk can detect such a deadlock and returns -EDEADLK, which in turn skips the
wait/wound mechanism and returns -EDEADLK to the caller. That's wrong
because both lock chains might get -EDEADLK or the wrong waiter would back
out.

Detect that situation and return 'success' in case that the waiter which
initiated the chain walk is a ww_mutex with context. This allows the
wait/wound mechanics to resolve the situation according to the rules.

[ tglx: Split it apart and added changelog ]

Reported-by: Sebastian Siewior <bigeasy@linutronix.de>
Fixes: add461325ec5 ("locking/rtmutex: Extend the rtmutex core to support ww_mutex")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/YSeWjCHoK4v5OcOt@hirez.programming.kicks-ass.net
---
 kernel/locking/rtmutex.c | 15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index 3c1ba7b..8eabdc7 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -742,8 +742,21 @@ static int __sched rt_mutex_adjust_prio_chain(struct task_struct *task,
 	 * walk, we detected a deadlock.
 	 */
 	if (lock == orig_lock || rt_mutex_owner(lock) == top_task) {
-		raw_spin_unlock(&lock->wait_lock);
 		ret = -EDEADLK;
+
+		/*
+		 * When the deadlock is due to ww_mutex; also see above. Don't
+		 * report the deadlock and instead let the ww_mutex wound/die
+		 * logic pick which of the contending threads gets -EDEADLK.
+		 *
+		 * NOTE: assumes the cycle only contains a single ww_class; any
+		 * other configuration and we fail to report; also, see
+		 * lockdep.
+		 */
+		if (IS_ENABLED(CONFIG_PREEMPT_RT) && orig_waiter->ww_ctx)
+			ret = 0;
+
+		raw_spin_unlock(&lock->wait_lock);
 		goto out_unlock_pi;
 	}
 


* [tip: locking/core] locking/rtmutex: Prevent spurious EDEADLK return caused by ww_mutexes
  2021-08-26 13:26 ` Peter Zijlstra
  2021-08-27  7:56   ` Sebastian Andrzej Siewior
  2021-08-27 12:31   ` [tip: locking/core] locking/rtmutex: Return success on deadlock for ww_mutex waiters tip-bot2 for Peter Zijlstra
@ 2021-08-27 12:31   ` tip-bot2 for Peter Zijlstra
  2 siblings, 0 replies; 10+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2021-08-27 12:31 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Sebastian Siewior, Peter Zijlstra (Intel),
	Thomas Gleixner, x86, linux-kernel

The following commit has been merged into the locking/core branch of tip:

Commit-ID:     6467822b8cc96e5feda98c7bf5c6329c6a896c91
Gitweb:        https://git.kernel.org/tip/6467822b8cc96e5feda98c7bf5c6329c6a896c91
Author:        Peter Zijlstra <peterz@infradead.org>
AuthorDate:    Thu, 26 Aug 2021 09:36:53 +02:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Fri, 27 Aug 2021 14:28:49 +02:00

locking/rtmutex: Prevent spurious EDEADLK return caused by ww_mutexes

rtmutex based ww_mutexes can legitimately create a cycle in the lock graph
which can be observed by a blocker which didn't cause the problem:

   P1: A, ww_A, ww_B
   P2: ww_B, ww_A
   P3: A

P3 might therefore be trapped in the ww_mutex induced cycle and run into
the lock depth limitation of rt_mutex_adjust_prio_chain() which returns
-EDEADLK to the caller.

Disable the deadlock detection walk when the chain walk observes a
ww_mutex to prevent this looping.

[ tglx: Split it apart and added changelog ]

Reported-by: Sebastian Siewior <bigeasy@linutronix.de>
Fixes: add461325ec5 ("locking/rtmutex: Extend the rtmutex core to support ww_mutex")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/YSeWjCHoK4v5OcOt@hirez.programming.kicks-ass.net
---
 kernel/locking/rtmutex.c | 25 +++++++++++++++++++++++++
 1 file changed, 25 insertions(+)

diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index c8fe74e..3c1ba7b 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -657,6 +657,31 @@ static int __sched rt_mutex_adjust_prio_chain(struct task_struct *task,
 		goto out_unlock_pi;
 
 	/*
+	 * There could be 'spurious' loops in the lock graph due to ww_mutex,
+	 * consider:
+	 *
+	 *   P1: A, ww_A, ww_B
+	 *   P2: ww_B, ww_A
+	 *   P3: A
+	 *
+	 * P3 should not return -EDEADLK because it gets trapped in the cycle
+	 * created by P1 and P2 (which will resolve -- and runs into
+	 * max_lock_depth above). Therefore disable detect_deadlock such that
+	 * the below termination condition can trigger once all relevant tasks
+	 * are boosted.
+	 *
+	 * Even when we start with ww_mutex we can disable deadlock detection,
+	 * since we would suppress a ww_mutex induced deadlock at [6] anyway.
+	 * Suppressing it here however is not sufficient since we might still
+	 * hit [6] due to adjustment driven iteration.
+	 *
+	 * NOTE: if someone were to create a deadlock between 2 ww_classes we'd
+	 * utterly fail to report it; lockdep should.
+	 */
+	if (IS_ENABLED(CONFIG_PREEMPT_RT) && waiter->ww_ctx && detect_deadlock)
+		detect_deadlock = false;
+
+	/*
 	 * Drop out, when the task has no waiters. Note,
 	 * top_waiter can be NULL, when we are in the deboosting
 	 * mode!

