linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Mel Gorman <mgorman@suse.de>
To: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Will Deacon <will@kernel.org>, Jan Kara <jack@suse.cz>,
	Waiman Long <longman@redhat.com>,
	LKML <linux-kernel@vger.kernel.org>,
	Thomas Gleixner <tglx@linutronix.de>,
	Steven Rostedt <rostedt@goodmis.org>,
	Peter Zijlstra <peterz@infradead.org>,
	Ingo Molnar <mingo@redhat.com>,
	Catalin Marinas <catalin.marinas@arm.com>
Subject: Re: Crash with PREEMPT_RT on aarch64 machine
Date: Wed, 30 Nov 2022 20:22:04 +0000	[thread overview]
Message-ID: <20221130202204.usku3rl6wowiugju@suse.de> (raw)
In-Reply-To: <Y4Tapja2qq8HiHBZ@linutronix.de>

On Mon, Nov 28, 2022 at 04:58:30PM +0100, Sebastian Andrzej Siewior wrote:
> How about this?
> 
> - The fast path is easy???
> 
> - The slow path first sets the WAITER bits (mark_rt_mutex_waiters()) so
>   I made that one _acquire so that it is visible by the unlocker forcing
>   everyone into slow path.
> 
> - If the lock is acquired, then the owner is written via
>   rt_mutex_set_owner(). This happens under wait_lock so it is
>   serialized and so a WRITE_ONCE() is used to write the final owner. I
>   replaced it with a cmpxchg_acquire() to have the owner there.
>   Not sure if I shouldn't make this as you put it:
> |   e.g. by making use of dependency ordering where it already exists.
>   The other (locking) CPU needs to see the owner not only the WAITER
>   bit. I'm not sure if this could be replaced with smp_store_release().
> 
> - After the whole operation completes, fixup_rt_mutex_waiters() cleans
>   the WAITER bit and I kept the _acquire semantic here. Now looking at
>   it again, I don't think that needs to be done since that shouldn't be
>   set here.
> 
> - There is rtmutex_spin_on_owner() which (as the name implies) spins on
>   the owner as long as it active. It obtains it via READ_ONCE() and I'm
>   not sure if any memory barrier is needed. Worst case is that it will
>   spin while owner isn't set if it observers a stale value.
> 
> - The unlock path first clears the waiter bit if there are no waiters
>   recorded (via simple assignments under the wait_lock (every locker
>   will fail with the cmpxchg_acquire() and go for the wait_lock)) and
>   then finally drop it via rt_mutex_cmpxchg_release(,, NULL).
>   Should there be a wait, it will just store the WAITER bit with
>   smp_store_release() (setting the owner is NULL but the WAITER bit
>   forces everyone into the slow path).
> 
> - Added rt_mutex_set_owner_pi() which does simple assignment. This is
>   used from the futex code and here everything happens under a lock.
> 
> - I added a smp_load_acquire() to rt_mutex_base_is_locked() since I
>   *think* want to observe a real waiter and not something stale.
> 
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>

include/linux/rtmutex.h needs to also include asm/barrier.h to resolve
some build problems. Once that was resolved, 10 iterations of the dbench
work completed successfully and without the patch, 1 iteration could not
complete.

Review is trickier as I'm not spent any reasonable amount of time on locking
primitives. I'd have to defer to Peter in that regard but I skimmed it at
least before wrapping up for the evening.

> ---
>  include/linux/rtmutex.h      |  2 +-
>  kernel/locking/rtmutex.c     | 26 ++++++++++++++++++--------
>  kernel/locking/rtmutex_api.c |  4 ++--
>  3 files changed, 21 insertions(+), 11 deletions(-)
> 
> diff --git a/include/linux/rtmutex.h b/include/linux/rtmutex.h
> index 7d049883a08ac..4447e01f723d4 100644
> --- a/include/linux/rtmutex.h
> +++ b/include/linux/rtmutex.h
> @@ -41,7 +41,7 @@ struct rt_mutex_base {
>   */
>  static inline bool rt_mutex_base_is_locked(struct rt_mutex_base *lock)
>  {
> -	return READ_ONCE(lock->owner) != NULL;
> +	return smp_load_acquire(&lock->owner) != NULL;
>  }
>  

I don't believe this is necessary. It's only needed when checking if a
lock is acquired or not and it's inherently race-prone. It's harmless if
a stale value is observed and it does not pair with a release. Mostly it's
useful for debugging checks.

>  extern void rt_mutex_base_init(struct rt_mutex_base *rtb);
> diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
> index 7779ee8abc2a0..e3cc673e0c988 100644
> --- a/kernel/locking/rtmutex.c
> +++ b/kernel/locking/rtmutex.c
> @@ -97,7 +97,7 @@ rt_mutex_set_owner(struct rt_mutex_base *lock, struct task_struct *owner)
>  	if (rt_mutex_has_waiters(lock))
>  		val |= RT_MUTEX_HAS_WAITERS;
>  
> -	WRITE_ONCE(lock->owner, (struct task_struct *)val);
> +	WARN_ON_ONCE(cmpxchg_acquire(&lock->owner, RT_MUTEX_HAS_WAITERS, val) != RT_MUTEX_HAS_WAITERS);
>  }
>  
>  static __always_inline void clear_rt_mutex_waiters(struct rt_mutex_base *lock)
> @@ -106,6 +106,17 @@ static __always_inline void clear_rt_mutex_waiters(struct rt_mutex_base *lock)
>  			((unsigned long)lock->owner & ~RT_MUTEX_HAS_WAITERS);
>  }
>  
> +static __always_inline void
> +rt_mutex_set_owner_pi(struct rt_mutex_base *lock, struct task_struct *owner)
> +{

What does pi mean in this context? I think the naming here might
misleading. rt_mutex_set_owner_pi is used when initialising and when
clearing the owner. rt_mutex_set_owner is set when acquiring the lock.

Consider renaming rt_mutex_set_owner_pi to rt_mutex_clear_owner. The init
could still use rt_mutex_set_owner as an extra barrier is not a big deal
during init if the straight assignment was unpopular.  The init could also
do a plain assignment because it cannot have any waiters yet.

What is less obvious is if rt_mutex_clear_owner should have explicit release
semantics to pair with rt_mutex_set_owner. It looks like it might not
matter because at least some paths end up having release semantics anyway
due to a spinlock but I didn't check all cases and it's potentially fragile.

> +	unsigned long val = (unsigned long)owner;
> +
> +	if (rt_mutex_has_waiters(lock))
> +		val |= RT_MUTEX_HAS_WAITERS;
> +
> +	WRITE_ONCE(lock->owner, val);
> +}
> +
>  static __always_inline void fixup_rt_mutex_waiters(struct rt_mutex_base *lock)
>  {
>  	unsigned long owner, *p = (unsigned long *) &lock->owner;
> @@ -173,7 +184,7 @@ static __always_inline void fixup_rt_mutex_waiters(struct rt_mutex_base *lock)
>  	 */
>  	owner = READ_ONCE(*p);
>  	if (owner & RT_MUTEX_HAS_WAITERS)
> -		WRITE_ONCE(*p, owner & ~RT_MUTEX_HAS_WAITERS);
> +		cmpxchg_acquire(p, owner, owner & ~RT_MUTEX_HAS_WAITERS);
>  }
>  
>  /*
> @@ -196,17 +207,16 @@ static __always_inline bool rt_mutex_cmpxchg_release(struct rt_mutex_base *lock,
>  }
>  
>  /*
> - * Callers must hold the ->wait_lock -- which is the whole purpose as we force
> - * all future threads that attempt to [Rmw] the lock to the slowpath. As such
> - * relaxed semantics suffice.
> + * Callers hold the ->wait_lock. This needs to be visible to the lockowner,
> + * forcing him into the slow path for the unlock operation.
>   */
>  static __always_inline void mark_rt_mutex_waiters(struct rt_mutex_base *lock)
>  {
>  	unsigned long owner, *p = (unsigned long *) &lock->owner;
>  
>  	do {
> -		owner = *p;
> -	} while (cmpxchg_relaxed(p, owner,
> +		owner = READ_ONCE(*p);
> +	} while (cmpxchg_acquire(p, owner,
>  				 owner | RT_MUTEX_HAS_WAITERS) != owner);
>  }
>  

Not 100% sure although I see it's to cover an exit path from
try_to_take_rt_mutex. I'm undecided if rt_mutex_set_owner having acquire
semantics and rt_mutex_clear_owner having clear semantics would be
sufficient. try_to_take_rt_mutex can still return with release semantics
but only in the case where it fails to acquire the lock.

> @@ -1218,7 +1228,7 @@ static void __sched mark_wakeup_next_waiter(struct rt_wake_q_head *wqh,
>  	 * slow path making sure no task of lower priority than
>  	 * the top waiter can steal this lock.
>  	 */
> -	lock->owner = (void *) RT_MUTEX_HAS_WAITERS;
> +	smp_store_release(&lock->owner, (void *) RT_MUTEX_HAS_WAITERS);
>  
>  	/*
>  	 * We deboosted before waking the top waiter task such that we don't

This is within a locked section and would definitely see a barrier if
rt_mutex_wake_q_add_task calls wake_q_add but it's less clear if the
optimisation in rt_mutex_wake_q_add_task could race so I'm undecided if
it's necessary or not.

> diff --git a/kernel/locking/rtmutex_api.c b/kernel/locking/rtmutex_api.c
> index 900220941caac..9acc176f93d38 100644
> --- a/kernel/locking/rtmutex_api.c
> +++ b/kernel/locking/rtmutex_api.c
> @@ -249,7 +249,7 @@ void __sched rt_mutex_init_proxy_locked(struct rt_mutex_base *lock,
>  	 * recursion. Give the futex/rtmutex wait_lock a separate key.
>  	 */
>  	lockdep_set_class(&lock->wait_lock, &pi_futex_key);
> -	rt_mutex_set_owner(lock, proxy_owner);
> +	rt_mutex_set_owner_pi(lock, proxy_owner);
>  }
>  
>  /**
> @@ -267,7 +267,7 @@ void __sched rt_mutex_init_proxy_locked(struct rt_mutex_base *lock,
>  void __sched rt_mutex_proxy_unlock(struct rt_mutex_base *lock)
>  {
>  	debug_rt_mutex_proxy_unlock(lock);
> -	rt_mutex_set_owner(lock, NULL);
> +	rt_mutex_set_owner_pi(lock, NULL);
>  }
>  
>  /**
> -- 
> 2.38.1
> 

-- 
Mel Gorman
SUSE Labs

  parent reply	other threads:[~2022-11-30 20:23 UTC|newest]

Thread overview: 31+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-11-03 11:54 Crash with PREEMPT_RT on aarch64 machine Jan Kara
2022-11-04 16:30 ` Sebastian Andrzej Siewior
2022-11-07 13:56   ` Jan Kara
2022-11-07 15:10     ` Sebastian Andrzej Siewior
2022-11-07 16:30       ` Jan Kara
2022-11-07 17:12         ` Sebastian Andrzej Siewior
2022-11-07 16:49       ` Waiman Long
2022-11-08 10:53         ` Mark Rutland
2022-11-08 17:45           ` Jan Kara
2022-11-09  9:55             ` Mark Rutland
2022-11-09 10:11               ` Pierre Gondois
2022-11-09 10:54                 ` Jan Kara
2022-11-09 11:01               ` Jan Kara
2022-11-09 13:52                 ` Pierre Gondois
2022-11-09 14:21                   ` Pierre Gondois
2022-11-09 12:57         ` Will Deacon
2022-11-09 15:40           ` Jan Kara
2022-11-11 14:27             ` Jan Kara
2022-11-14 12:41               ` Will Deacon
2022-11-28 15:58                 ` Sebastian Andrzej Siewior
2022-11-28 20:30                   ` kernel test robot
2022-11-28 21:11                   ` kernel test robot
2022-11-29  5:16                   ` kernel test robot
2022-11-29  5:26                   ` kernel test robot
2022-11-29  6:48                   ` kernel test robot
2022-11-29  7:39                   ` kernel test robot
2022-11-30 17:20                   ` Pierre Gondois
2022-12-01 12:37                     ` Jan Kara
2022-11-30 20:22                   ` Mel Gorman [this message]
2022-12-01 17:09                     ` Mel Gorman
     [not found] ` <20221104080637.626-1-hdanton@sina.com>
2022-11-07 12:41   ` Jan Kara

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20221130202204.usku3rl6wowiugju@suse.de \
    --to=mgorman@suse.de \
    --cc=bigeasy@linutronix.de \
    --cc=catalin.marinas@arm.com \
    --cc=jack@suse.cz \
    --cc=linux-kernel@vger.kernel.org \
    --cc=longman@redhat.com \
    --cc=mingo@redhat.com \
    --cc=peterz@infradead.org \
    --cc=rostedt@goodmis.org \
    --cc=tglx@linutronix.de \
    --cc=will@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).