From: Jiri Slaby <jirislaby@kernel.org>
To: Peter Zijlstra <peterz@infradead.org>,
	Thomas Gleixner <tglx@linutronix.de>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>,
	linux-kernel@vger.kernel.org, boqun.feng@gmail.com,
	bristot@redhat.com, bsegall@google.com, dietmar.eggemann@arm.com,
	jstultz@google.com, juri.lelli@redhat.com, longman@redhat.com,
	mgorman@suse.de, mingo@redhat.com, rostedt@goodmis.org,
	swood@redhat.com, vincent.guittot@linaro.org,
	vschneid@redhat.com, will@kernel.org
Subject: Re: [PATCH v3 7/7] locking/rtmutex: Acquire the hb lock via trylock after wait-proxylock.
Date: Mon, 15 Jan 2024 12:40:08 +0100
Message-ID: <4611bcf2-44d0-4c34-9b84-17406f881003@kernel.org>
In-Reply-To: <20230915151943.GD6743@noisy.programming.kicks-ass.net>

On 15. 09. 23, 17:19, Peter Zijlstra wrote:
> On Fri, Sep 15, 2023 at 02:58:35PM +0200, Thomas Gleixner wrote:
> 
>> I spent quite some time to convince myself that this is correct. I was
>> not able to poke a hole into it. So that really should be safe to
>> do. Famous last words ...
> 
> IKR :-/
> 
> Something like so then...
> 
> ---
> Subject: futex/pi: Fix recursive rt_mutex waiter state

So this breaks some random test in APR:

From https://build.opensuse.org/package/live_build_log/openSUSE:Factory:Staging:G/apr/standard/x86_64:
testprocmutex       :  Line 122: child did not terminate with success

The child in fact terminates at
https://github.com/apache/apr/blob/trunk/test/testprocmutex.c#L93:
                 while ((rv = apr_proc_mutex_timedlock(proc_lock, 1))) {
                     if (!APR_STATUS_IS_TIMEUP(rv))
                         exit(1); <----- here

The test creates 6 children which do pthread_mutex_timedlock()/unlock() 
repeatedly (200 times) in parallel, sleeping 1 us while holding the 
lock. The timeout passed above is 1 us, and the test expects every 
failed attempt to be a timeout. But on 6.7 the failure is not always a 
timeout (it's racy, so the failure is semi-random: roughly 1 of 1000 
attempts is bad).

If I revert this patch (commit fbeb558b0dd0d), the test works.

I know, the test could be broken too, but I really have no idea. The 
test suite is sort of hairy and I could not come up with a simple 
reproducer.

Note that APR sets up the PTHREAD_PROCESS_SHARED, PTHREAD_MUTEX_ROBUST, 
and PTHREAD_PRIO_INHERIT attributes for the mutex; a rough standalone 
reduction of the pattern is sketched below.
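
FWIW, roughly what the failing pattern boils down to in plain pthreads 
-- my reconstruction from the description above, not the actual APR 
code (helper names and constants are mine, error handling and 
robust-owner recovery are omitted for brevity; build with -lpthread):

#define _GNU_SOURCE
#include <errno.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

#define CHILDREN 6
#define ITERS    200

/* pthread_mutex_timedlock() wants an absolute CLOCK_REALTIME deadline. */
static int timedlock_us(pthread_mutex_t *m, long us)
{
	struct timespec ts;

	clock_gettime(CLOCK_REALTIME, &ts);
	ts.tv_nsec += us * 1000;
	if (ts.tv_nsec >= 1000000000L) {
		ts.tv_sec++;
		ts.tv_nsec -= 1000000000L;
	}
	return pthread_mutex_timedlock(m, &ts);
}

static void child(pthread_mutex_t *m)
{
	for (int i = 0; i < ITERS; i++) {
		int rv;

		/* Like the APR loop: only a timeout may be retried. */
		while ((rv = timedlock_us(m, 1))) {
			if (rv != ETIMEDOUT)
				exit(1);	/* the failure seen on 6.7 */
		}
		usleep(1);			/* sleep inside the lock */
		pthread_mutex_unlock(m);
	}
	exit(0);
}

int main(void)
{
	pthread_mutexattr_t attr;
	int status, ok = 1;

	/* A shared, robust, priority-inheriting mutex, as APR sets it up. */
	pthread_mutex_t *m = mmap(NULL, sizeof(*m), PROT_READ | PROT_WRITE,
				  MAP_SHARED | MAP_ANONYMOUS, -1, 0);

	pthread_mutexattr_init(&attr);
	pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
	pthread_mutexattr_setrobust(&attr, PTHREAD_MUTEX_ROBUST);
	pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
	pthread_mutex_init(m, &attr);

	for (int i = 0; i < CHILDREN; i++)
		if (fork() == 0)
			child(m);

	for (int i = 0; i < CHILDREN; i++) {
		wait(&status);
		if (!WIFEXITED(status) || WEXITSTATUS(status))
			ok = 0;
	}
	printf("%s\n", ok ? "OK" : "child did not terminate with success");
	return !ok;
}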

Anyway:
downstream report: https://bugzilla.suse.com/show_bug.cgi?id=1218801
APR report: https://bz.apache.org/bugzilla/show_bug.cgi?id=68481

Any idea whether this patch could cause the above (or whether this is 
even desired behavior)?

Thanks.

> From: Peter Zijlstra <peterz@infradead.org>
> Date: Tue, 12 Sep 2023 13:17:11 +0200
> 
> Some new assertions pointed out that the existing code has nested rt_mutex wait
> state in the futex code.
> 
> Specifically, the futex_lock_pi() cancel case uses spin_lock() while there
> still is a rt_waiter enqueued for this task, resulting in a state where there
> are two waiters for the same task (and task_struct::pi_blocked_on gets
> scrambled).
> 
> The reason to take hb->lock at this point is to avoid the wake_futex_pi()
> EAGAIN case.
> 
> This happens when futex_top_waiter() and rt_mutex_top_waiter() state becomes
> inconsistent. The current rules are such that this inconsistency will not be
> observed.
> 
> Notably the case that needs to be avoided is where futex_lock_pi() and
> futex_unlock_pi() interleave such that unlock will fail to observe a new
> waiter.
> 
> *However*, the case at hand is where a waiter is leaving; in this case the race
> means a waiter that is going away is not observed -- which is harmless,
> provided this race is explicitly handled.
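
If I read the above right, the benign direction of the race is (a rough 
sketch of the interleaving, not verbatim kernel code):

   futex_lock_pi() (cancel path)       futex_unlock_pi()
   -----------------------------       -----------------
   rt_mutex_wait_proxy_lock()
     fails (signal/timeout)
                                       spin_lock(&hb->lock);
                                       futex_top_waiter() -> found
   rt_mutex_cleanup_proxy_lock()
     dequeues the rt_waiter
     (hb->lock not held here)
                                       rt_mutex_top_waiter() -> NULL:
                                       the waiter is on its way out,
                                       take the uncontended path
                                       instead of returning -EAGAIN
   spin_lock(q.lock_ptr);
     safe: no rt_waiter of ours is
     enqueued, so the rtlock's own
     waiter cannot nest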
> 
> This is a somewhat dangerous proposition because the converse race is not
> observing a new waiter, which must absolutely not happen. But since the race is
> valid this cannot be asserted.
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>   pi.c      |   76 +++++++++++++++++++++++++++++++++++++++-----------------------
>   requeue.c |    6 +++-
>   2 files changed, 52 insertions(+), 30 deletions(-)
> 
> Index: linux-2.6/kernel/futex/pi.c
> ===================================================================
> --- linux-2.6.orig/kernel/futex/pi.c
> +++ linux-2.6/kernel/futex/pi.c
> @@ -610,29 +610,16 @@ int futex_lock_pi_atomic(u32 __user *uad
>   /*
>    * Caller must hold a reference on @pi_state.
>    */
> -static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_pi_state *pi_state)
> +static int wake_futex_pi(u32 __user *uaddr, u32 uval,
> +			 struct futex_pi_state *pi_state,
> +			 struct rt_mutex_waiter *top_waiter)
>   {
> -	struct rt_mutex_waiter *top_waiter;
>   	struct task_struct *new_owner;
>   	bool postunlock = false;
>   	DEFINE_RT_WAKE_Q(wqh);
>   	u32 curval, newval;
>   	int ret = 0;
>   
> -	top_waiter = rt_mutex_top_waiter(&pi_state->pi_mutex);
> -	if (WARN_ON_ONCE(!top_waiter)) {
> -		/*
> -		 * As per the comment in futex_unlock_pi() this should not happen.
> -		 *
> -		 * When this happens, give up our locks and try again, giving
> -		 * the futex_lock_pi() instance time to complete, either by
> -		 * waiting on the rtmutex or removing itself from the futex
> -		 * queue.
> -		 */
> -		ret = -EAGAIN;
> -		goto out_unlock;
> -	}
> -
>   	new_owner = top_waiter->task;
>   
>   	/*
> @@ -1039,19 +1026,33 @@ retry_private:
>   	ret = rt_mutex_wait_proxy_lock(&q.pi_state->pi_mutex, to, &rt_waiter);
>   
>   cleanup:
> -	spin_lock(q.lock_ptr);
>   	/*
>   	 * If we failed to acquire the lock (deadlock/signal/timeout), we must
> -	 * first acquire the hb->lock before removing the lock from the
> -	 * rt_mutex waitqueue, such that we can keep the hb and rt_mutex wait
> -	 * lists consistent.
> +	 * unwind the above; however, we cannot lock hb->lock because
> +	 * rt_mutex already has a waiter enqueued and hb->lock can itself try
> +	 * and enqueue an rt_waiter through rtlock.
> +	 *
> +	 * Doing the cleanup without holding hb->lock can cause inconsistent
> +	 * state between hb and pi_state, but only in the direction of not
> +	 * seeing a waiter that is leaving.
> +	 *
> +	 * See futex_unlock_pi(), it deals with this inconsistency.
> +	 *
> +	 * There be dragons here, since we must deal with the inconsistency on
> +	 * the way out (here), it is impossible to detect/warn about the race
> +	 * the other way around (missing an incoming waiter).
>   	 *
> -	 * In particular; it is important that futex_unlock_pi() can not
> -	 * observe this inconsistency.
> +	 * What could possibly go wrong...
>   	 */
>   	if (ret && !rt_mutex_cleanup_proxy_lock(&q.pi_state->pi_mutex, &rt_waiter))
>   		ret = 0;
>   
> +	/*
> +	 * Now that the rt_waiter has been dequeued, it is safe to use
> +	 * spinlock/rtlock (which might enqueue its own rt_waiter) and fix up
> +	 * the pi_state.
> +	 */
> +	spin_lock(q.lock_ptr);
>   no_block:
>   	/*
>   	 * Fixup the pi_state owner and possibly acquire the lock if we
> @@ -1132,6 +1133,7 @@ retry:
>   	top_waiter = futex_top_waiter(hb, &key);
>   	if (top_waiter) {
>   		struct futex_pi_state *pi_state = top_waiter->pi_state;
> +		struct rt_mutex_waiter *rt_waiter;
>   
>   		ret = -EINVAL;
>   		if (!pi_state)
> @@ -1144,22 +1146,39 @@ retry:
>   		if (pi_state->owner != current)
>   			goto out_unlock;
>   
> -		get_pi_state(pi_state);
>   		/*
>   		 * By taking wait_lock while still holding hb->lock, we ensure
> -		 * there is no point where we hold neither; and therefore
> -		 * wake_futex_p() must observe a state consistent with what we
> -		 * observed.
> +		 * there is no point where we hold neither; and thereby
> +		 * wake_futex_pi() must observe any new waiters.
> +		 *
> +		 * Since the cleanup: case in futex_lock_pi() removes the
> +		 * rt_waiter without holding hb->lock, it is possible for
> +		 * wake_futex_pi() to not find a waiter while the above does,
> +		 * in this case the waiter is on the way out and it can be
> +		 * ignored.
>   		 *
>   		 * In particular; this forces __rt_mutex_start_proxy() to
>   		 * complete such that we're guaranteed to observe the
> -		 * rt_waiter. Also see the WARN in wake_futex_pi().
> +		 * rt_waiter.
>   		 */
>   		raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock);
> +
> +		/*
> +		 * Futex vs rt_mutex waiter state -- if there are no rt_mutex
> +		 * waiters even though futex thinks there are, then the waiter
> +		 * is leaving and the uncontended path is safe to take.
> +		 */
> +		rt_waiter = rt_mutex_top_waiter(&pi_state->pi_mutex);
> +		if (!rt_waiter) {
> +			raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
> +			goto do_uncontended;
> +		}
> +
> +		get_pi_state(pi_state);
>   		spin_unlock(&hb->lock);
>   
>   		/* drops pi_state->pi_mutex.wait_lock */
> -		ret = wake_futex_pi(uaddr, uval, pi_state);
> +		ret = wake_futex_pi(uaddr, uval, pi_state, rt_waiter);
>   
>   		put_pi_state(pi_state);
>   
> @@ -1187,6 +1206,7 @@ retry:
>   		return ret;
>   	}
>   
> +do_uncontended:
>   	/*
>   	 * We have no kernel internal state, i.e. no waiters in the
>   	 * kernel. Waiters which are about to queue themselves are stuck
> Index: linux-2.6/kernel/futex/requeue.c
> ===================================================================
> --- linux-2.6.orig/kernel/futex/requeue.c
> +++ linux-2.6/kernel/futex/requeue.c
> @@ -850,11 +850,13 @@ int futex_wait_requeue_pi(u32 __user *ua
>   		pi_mutex = &q.pi_state->pi_mutex;
>   		ret = rt_mutex_wait_proxy_lock(pi_mutex, to, &rt_waiter);
>   
> -		/* Current is not longer pi_blocked_on */
> -		spin_lock(q.lock_ptr);
> +		/*
> +		 * See futex_unlock_pi()'s cleanup: comment.
> +		 */
>   		if (ret && !rt_mutex_cleanup_proxy_lock(pi_mutex, &rt_waiter))
>   			ret = 0;
>   
> +		spin_lock(q.lock_ptr);
>   		debug_rt_mutex_free_waiter(&rt_waiter);
>   		/*
>   		 * Fixup the pi_state owner and possibly acquire the lock if we
> 

-- 
js
suse labs

