[PATCH v2] futex: Avoid reusing outdated pi_state.

From: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
To: Thomas Gleixner <tglx@linutronix.de>
Cc: Jiri Slaby <jirislaby@kernel.org>,
	Peter Zijlstra <peterz@infradead.org>,
	linux-kernel@vger.kernel.org, boqun.feng@gmail.com,
	bristot@redhat.com, bsegall@google.com, dietmar.eggemann@arm.com,
	jstultz@google.com, juri.lelli@redhat.com, longman@redhat.com,
	mgorman@suse.de, mingo@redhat.com, rostedt@goodmis.org,
	swood@redhat.com, vincent.guittot@linaro.org,
	vschneid@redhat.com, will@kernel.org
Subject: [PATCH v2] futex: Avoid reusing outdated pi_state.
Date: Thu, 18 Jan 2024 12:54:51 +0100	[thread overview]
Message-ID: <20240118115451.0TkD_ZhB@linutronix.de> (raw)
In-Reply-To: <87wms7g62b.ffs@tglx>

Jiri Slaby reported a futex state inconsistency resulting in -EINVAL
during a lock operation for a PI futex. A requirement is that the lock
process is interrupted by a timeout or signal:

T1 Owns the futex in userland.

T2 Tries to acquire the futex in kernel (futex_lock_pi()). Allocates a
   pi_state and attaches itself to it.

T2 Times out and removes its rt_waiter from the rt_mutex.

T1 Unlocks the futex (futex_unlock_pi()). Finds a futex_q but no
   rt_waiter. Unlocks the futex (do_uncontended) and makes it available
   to userland.

T3 Acquires the futex in userland.

T4 Tries to acquire the futex in kernel (futex_lock_pi()). Finds the
   existing futex_q and tries to attach itself to the existing pi_state.
   This (attach_to_pi_state()) fails with -EINVAL because uval contains
   the pid of T3 but pi_state points to T1.

We must not unlock the futex and make it available for userland as long
as there is still an existing state in kernel. It can be used by further
futex_lock_pi() caller and observe inconsistent state.

The lock can not be handed over to T1 because it already gave up and
started to clean up.
All futex_q entries point to the same pi_state and the pi_mutex has no
waiters. A waiter can not be enqueued because hb->lock +
pi_mutex.wait_lock is acquired (by the unlock operation) and the same
ordering is used by futex_lock_pi() during locking.

Remove all futex_q entries from the hb list which point to the futex if
no waiter has been observed in the futex_unlock_pi() path. This closes
the race window by removing all pointer to the previous in-kernel state.

The leaving futex_lock_pi() caller can clean up the pi-state once it
acquires hb->lock. During the process futex_unqueue_pi() must remove
the futex_q item from the hb list only if not yet happened.

Fixes: fbeb558b0dd0d ("futex/pi: Fix recursive rt_mutex waiter state")
Reported-by: Jiri Slaby <jirislaby@kernel.org>
Closes: https://lore.kernel.org/all/4611bcf2-44d0-4c34-9b84-17406f881003@kernel.org
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Tested-by: Jiri Slaby <jirislaby@kernel.org>
---
v1…v2:
  - Removed the argument to futex_unqueue_pi(). __futex_unqueue() is now
    always invoked conditionally.
  - Rewrote the patch description and comments in the patch.
  - Corrected the Closes: link and added Tested-by.

 kernel/futex/core.c |  7 ++++++-
 kernel/futex/pi.c   | 11 ++++++++---
 2 files changed, 14 insertions(+), 4 deletions(-)

diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index dad981a865b84..fd56a84b163a1 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -628,10 +628,15 @@ int futex_unqueue(struct futex_q *q)
 /*
  * PI futexes can not be requeued and must remove themselves from the
  * hash bucket. The hash bucket lock (i.e. lock_ptr) is held.
+ * If the lock was not acquired (due to timeout or signal) then the rt_waiter
+ * is removed before futex_q is. If this is observed by unlocker then it will
+ * remove futex_q from the hash bucket list. Thefore the removal here is
+ * optional.
  */
 void futex_unqueue_pi(struct futex_q *q)
 {
-	__futex_unqueue(q);
+	if (!plist_node_empty(&q->list))
+		__futex_unqueue(q);
 
 	BUG_ON(!q->pi_state);
 	put_pi_state(q->pi_state);
diff --git a/kernel/futex/pi.c b/kernel/futex/pi.c
index 90e5197f4e569..50008b7b38e7e 100644
--- a/kernel/futex/pi.c
+++ b/kernel/futex/pi.c
@@ -1135,6 +1135,7 @@ int futex_unlock_pi(u32 __user *uaddr, unsigned int flags)
 
 	hb = futex_hash(&key);
 	spin_lock(&hb->lock);
+retry_hb:
 
 	/*
 	 * Check waiters first. We do not trust user space values at
@@ -1177,12 +1178,17 @@ int futex_unlock_pi(u32 __user *uaddr, unsigned int flags)
 		/*
 		 * Futex vs rt_mutex waiter state -- if there are no rt_mutex
 		 * waiters even though futex thinks there are, then the waiter
-		 * is leaving and the uncontended path is safe to take.
+		 * is leaving. The entry needs to be removed from the list so a
+		 * new futex_lock_pi() is not using this outdated PI-state while
+		 * the futex is available in userland again.
+		 * There can be more than one task on its way out so it needs
+		 * to retry.
 		 */
 		rt_waiter = rt_mutex_top_waiter(&pi_state->pi_mutex);
 		if (!rt_waiter) {
+			__futex_unqueue(top_waiter);
 			raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
-			goto do_uncontended;
+			goto retry_hb;
 		}
 
 		get_pi_state(pi_state);
@@ -1217,7 +1223,6 @@ int futex_unlock_pi(u32 __user *uaddr, unsigned int flags)
 		return ret;
 	}
 
-do_uncontended:
 	/*
 	 * We have no kernel internal state, i.e. no waiters in the
 	 * kernel. Waiters which are about to queue themselves are stuck
-- 
2.43.0