Re: Question on smp_mb__before_spinlock

From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
To: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>,
	Will Deacon <will.deacon@arm.com>,
	Oleg Nesterov <oleg@redhat.com>,
	Benjamin Herrenschmidt <benh@kernel.crashing.org>,
	Michael Ellerman <mpe@ellerman.id.au>,
	linux-kernel@vger.kernel.org, Nicholas Piggin <npiggin@gmail.com>,
	Ingo Molnar <mingo@kernel.org>,
	Alan Stern <stern@rowland.harvard.edu>
Subject: Re: Question on smp_mb__before_spinlock
Date: Mon, 5 Sep 2016 03:37:14 -0700
Message-ID: <20160905103714.GZ3663@linux.vnet.ibm.com>
In-Reply-To: <20160905093753.GN10138@twins.programming.kicks-ass.net>

On Mon, Sep 05, 2016 at 11:37:53AM +0200, Peter Zijlstra wrote:
> Hi all,
> 
> So recently I've had two separate issues that touched upon
> smp_mb__before_spinlock().
> 
> 
> Since its inception, our understanding of ACQUIRE, esp. as applied to
> spinlocks, has changed somewhat. Also, I wonder if, with a simple
> change, we cannot make it provide more.
> 
> The problem with the comment is that the STORE done by spin_lock isn't
> itself ordered by the ACQUIRE, and therefore a later LOAD can pass over
> it and cross with any prior STORE, rendering the default WMB
> insufficient (pointed out by Alan).
> 
> Now, this is only really a problem on PowerPC and ARM64, the former of
> which already defined smp_mb__before_spinlock() as a smp_mb(), the
> latter does not, Will?
> 
> The second issue I wondered about is spinlock transitivity. All except
> powerpc have RCsc locks, and since Power already does a full mb, would
> it not make sense to put it _after_ the spin_lock(), which would provide
> the same guarantee, but also upgrade the section to RCsc?
> 
> That would make all schedule() calls fully transitive against one
> another.
> 
> 
> That is, would something like the below make sense?

Looks to me like you have reinvented smp_mb__after_unlock_lock()...

							Thanx, Paul

> (does not deal with mm_types.h and overlayfs using
> smp_mb__before_spinlock).
> 
> ---
>  arch/arm64/include/asm/barrier.h   |  2 ++
>  arch/powerpc/include/asm/barrier.h |  2 +-
>  include/linux/spinlock.h           | 41 +++++++++++++++++++++++++++++---------
>  kernel/sched/core.c                |  5 +++--
>  4 files changed, 38 insertions(+), 12 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/barrier.h b/arch/arm64/include/asm/barrier.h
> index 4eea7f618dce..d5cc8b58f942 100644
> --- a/arch/arm64/include/asm/barrier.h
> +++ b/arch/arm64/include/asm/barrier.h
> @@ -104,6 +104,8 @@ do {									\
>  	VAL;								\
>  })
> 
> +#define smp_mb__after_spinlock()	smp_mb()
> +
>  #include <asm-generic/barrier.h>
> 
>  #endif	/* __ASSEMBLY__ */
> diff --git a/arch/powerpc/include/asm/barrier.h b/arch/powerpc/include/asm/barrier.h
> index c0deafc212b8..23d64d7196b7 100644
> --- a/arch/powerpc/include/asm/barrier.h
> +++ b/arch/powerpc/include/asm/barrier.h
> @@ -74,7 +74,7 @@ do {									\
>  	___p1;								\
>  })
> 
> -#define smp_mb__before_spinlock()   smp_mb()
> +#define smp_mb__after_spinlock()   smp_mb()
> 
>  #include <asm-generic/barrier.h>
> 
> diff --git a/include/linux/spinlock.h b/include/linux/spinlock.h
> index 47dd0cebd204..284616dad607 100644
> --- a/include/linux/spinlock.h
> +++ b/include/linux/spinlock.h
> @@ -118,16 +118,39 @@ do {								\
>  #endif
> 
>  /*
> - * Despite its name it doesn't necessarily has to be a full barrier.
> - * It should only guarantee that a STORE before the critical section
> - * can not be reordered with LOADs and STOREs inside this section.
> - * spin_lock() is the one-way barrier, this LOAD can not escape out
> - * of the region. So the default implementation simply ensures that
> - * a STORE can not move into the critical section, smp_wmb() should
> - * serialize it with another STORE done by spin_lock().
> + * This barrier must provide two things:
> + *
> + *   - it must guarantee a STORE before the spin_lock() is ordered against a
> + *     LOAD after it, see the comments at its two usage sites.
> + *
> + *   - it must ensure the critical section is RCsc.
> + *
> + * The latter is important for cases where we observe values written by other
> + * CPUs in spin-loops, without barriers, while being subject to scheduling.
> + *
> + * CPU0			CPU1			CPU2
> + * 
> + * 			for (;;) {
> + * 			  if (READ_ONCE(X))
> + * 			  	break;
> + * 			}
> + * X=1
> + * 			<sched-out>
> + * 						<sched-in>
> + * 						r = X;
> + *
> + * Without transitivity it could be that CPU1 observes X!=0 and breaks the
> + * loop, we get migrated, and CPU2 sees X==0.
> + *
> + * Since most load-store architectures implement ACQUIRE with an smp_mb() after
> + * the LL/SC loop, they need no further barriers. Similarly all our TSO
> + * architectures imply an smp_mb() for each atomic instruction and equally don't
> + * need more.
> + *
> + * Architectures that can implement ACQUIRE better need to take care.
>   */
> -#ifndef smp_mb__before_spinlock
> -#define smp_mb__before_spinlock()	smp_wmb()
> +#ifndef smp_mb__after_spinlock
> +#define smp_mb__after_spinlock()	do { } while (0)
>  #endif
> 
>  /**
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 556cb07ab1cf..b151a33d393b 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2006,8 +2006,8 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
>  	 * reordered with p->state check below. This pairs with mb() in
>  	 * set_current_state() the waiting thread does.
>  	 */
> -	smp_mb__before_spinlock();
>  	raw_spin_lock_irqsave(&p->pi_lock, flags);
> +	smp_mb__after_spinlock();
>  	if (!(p->state & state))
>  		goto out;
> 
> @@ -3332,8 +3332,9 @@ static void __sched notrace __schedule(bool preempt)
>  	 * can't be reordered with __set_current_state(TASK_INTERRUPTIBLE)
>  	 * done by the caller to avoid the race with signal_wake_up().
>  	 */
> -	smp_mb__before_spinlock();
>  	raw_spin_lock(&rq->lock);
> +	smp_mb__after_spinlock();
> +
>  	cookie = lockdep_pin_lock(&rq->lock);
> 
>  	rq->clock_skip_update <<= 1; /* promote REQ to ACT */
>
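
For readers following the ordering argument in the quoted patch, here is a
minimal user-space sketch of the first requirement (a STORE before the lock
ordered against a LOAD after it), written with C11 atomics rather than kernel
primitives. Everything in it is an assumption made for illustration:
task_state, wake_cond, pi_lock and the toy_* helpers are hypothetical
stand-ins, the lock is a trivial test-and-set lock with ACQUIRE/RELEASE
semantics only, and a seq_cst fence plays the role of smp_mb(). It is not
the kernel implementation.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-ins for p->state, the wakeup condition and p->pi_lock. */
static atomic_int task_state;	/* 0 = running, 1 = about to sleep */
static atomic_int wake_cond;	/* 0 = not yet signalled, 1 = signalled */
static atomic_flag pi_lock = ATOMIC_FLAG_INIT;

/* Toy test-and-set lock: ACQUIRE on lock, RELEASE on unlock, nothing more. */
static void toy_spin_lock(atomic_flag *l)
{
	while (atomic_flag_test_and_set_explicit(l, memory_order_acquire))
		;	/* spin until the flag was previously clear */
}

static void toy_spin_unlock(atomic_flag *l)
{
	atomic_flag_clear_explicit(l, memory_order_release);
}

/* Stand-in for smp_mb__after_spinlock() when ACQUIRE is not a full barrier. */
static void toy_mb_after_spinlock(void)
{
	atomic_thread_fence(memory_order_seq_cst);
}

/* Waker side: publish the condition, take the lock, then look at the state. */
bool waker(void)
{
	bool saw_sleeper;

	atomic_store_explicit(&wake_cond, 1, memory_order_relaxed);
	toy_spin_lock(&pi_lock);
	toy_mb_after_spinlock();	/* orders the store above vs. the load below */
	saw_sleeper = atomic_load_explicit(&task_state, memory_order_relaxed) != 0;
	toy_spin_unlock(&pi_lock);
	return saw_sleeper;
}

/* Waiter side: publish the state, full barrier, then look at the condition. */
bool waiter(void)
{
	atomic_store_explicit(&task_state, 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);	/* mb() in set_current_state() */
	return atomic_load_explicit(&wake_cond, memory_order_relaxed) != 0;
}
```

If waker() and waiter() run concurrently on two threads, the two full fences
rule out the outcome where both return false, i.e. the lost wakeup. Dropping
toy_mb_after_spinlock() would leave only the ACQUIRE of the lock between the
waker's store and its load, which in the C11 model no longer forbids that
outcome.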