linux-kernel.vger.kernel.org archive mirror
* [RFC PATCH v2] memory-barriers: remove smp_mb__after_unlock_lock()
@ 2015-07-13 12:15 Will Deacon
  2015-07-13 13:09 ` Peter Hurley
                   ` (2 more replies)
  0 siblings, 3 replies; 66+ messages in thread
From: Will Deacon @ 2015-07-13 12:15 UTC (permalink / raw)
  To: linux-arch
  Cc: linux-kernel, Will Deacon, Benjamin Herrenschmidt, Paul McKenney,
	Peter Zijlstra

smp_mb__after_unlock_lock is used to promote an UNLOCK + LOCK sequence
into a full memory barrier.

However:

  - This ordering guarantee is already provided without the barrier on
    all architectures apart from PowerPC

  - The barrier only applies to UNLOCK + LOCK, not general
    RELEASE + ACQUIRE operations

  - Locks are generally assumed to offer SC ordering semantics, so
    having this additional barrier is error-prone and complicates the
    callers of LOCK/UNLOCK primitives

  - The barrier is not well used outside of RCU and, because it was
    retrofitted into the kernel, it's not clear whether other areas of
    the kernel are incorrectly relying on UNLOCK + LOCK implying a full
    barrier

This patch removes the barrier and instead requires architectures to
provide full barrier semantics for an UNLOCK + LOCK sequence.

Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Will Deacon <will.deacon@arm.com>
---

This didn't go anywhere last time I posted it, but here it is again.
I'd really appreciate some feedback from the PowerPC guys, especially as
to whether this change requires them to add an additional barrier in
arch_spin_unlock and what the cost of that would be.
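
For reference, the ordering being relied upon is roughly the following
(a sketch in the style of the memory-barriers.txt examples, not code from
this patch):

	*A = a;
	spin_unlock(&M);	/* RELEASE */
	spin_lock(&N);		/* ACQUIRE, different lock variable */
	*B = b;

	/*
	 * With this change, the UNLOCK + LOCK pair must behave as a full
	 * memory barrier: any other CPU observing the store to *B must
	 * also observe the store to *A, with no need for
	 * smp_mb__after_unlock_lock().
	 */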

 Documentation/memory-barriers.txt   | 77 ++-----------------------------------
 arch/powerpc/include/asm/spinlock.h |  2 -
 include/linux/spinlock.h            | 10 -----
 kernel/locking/mcs_spinlock.h       |  9 -----
 kernel/rcu/tree.c                   | 21 +---------
 kernel/rcu/tree_plugin.h            | 11 ------
 6 files changed, 4 insertions(+), 126 deletions(-)

diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
index 13feb697271f..fff21b632893 100644
--- a/Documentation/memory-barriers.txt
+++ b/Documentation/memory-barriers.txt
@@ -1848,74 +1848,9 @@ RELEASE are to the same lock variable, but only from the perspective of
 another CPU not holding that lock.  In short, a ACQUIRE followed by an
 RELEASE may -not- be assumed to be a full memory barrier.
 
-Similarly, the reverse case of a RELEASE followed by an ACQUIRE does not
-imply a full memory barrier.  If it is necessary for a RELEASE-ACQUIRE
-pair to produce a full barrier, the ACQUIRE can be followed by an
-smp_mb__after_unlock_lock() invocation.  This will produce a full barrier
-if either (a) the RELEASE and the ACQUIRE are executed by the same
-CPU or task, or (b) the RELEASE and ACQUIRE act on the same variable.
-The smp_mb__after_unlock_lock() primitive is free on many architectures.
-Without smp_mb__after_unlock_lock(), the CPU's execution of the critical
-sections corresponding to the RELEASE and the ACQUIRE can cross, so that:
-
-	*A = a;
-	RELEASE M
-	ACQUIRE N
-	*B = b;
-
-could occur as:
-
-	ACQUIRE N, STORE *B, STORE *A, RELEASE M
-
-It might appear that this reordering could introduce a deadlock.
-However, this cannot happen because if such a deadlock threatened,
-the RELEASE would simply complete, thereby avoiding the deadlock.
-
-	Why does this work?
-
-	One key point is that we are only talking about the CPU doing
-	the reordering, not the compiler.  If the compiler (or, for
-	that matter, the developer) switched the operations, deadlock
-	-could- occur.
-
-	But suppose the CPU reordered the operations.  In this case,
-	the unlock precedes the lock in the assembly code.  The CPU
-	simply elected to try executing the later lock operation first.
-	If there is a deadlock, this lock operation will simply spin (or
-	try to sleep, but more on that later).	The CPU will eventually
-	execute the unlock operation (which preceded the lock operation
-	in the assembly code), which will unravel the potential deadlock,
-	allowing the lock operation to succeed.
-
-	But what if the lock is a sleeplock?  In that case, the code will
-	try to enter the scheduler, where it will eventually encounter
-	a memory barrier, which will force the earlier unlock operation
-	to complete, again unraveling the deadlock.  There might be
-	a sleep-unlock race, but the locking primitive needs to resolve
-	such races properly in any case.
-
-With smp_mb__after_unlock_lock(), the two critical sections cannot overlap.
-For example, with the following code, the store to *A will always be
-seen by other CPUs before the store to *B:
-
-	*A = a;
-	RELEASE M
-	ACQUIRE N
-	smp_mb__after_unlock_lock();
-	*B = b;
-
-The operations will always occur in one of the following orders:
-
-	STORE *A, RELEASE, ACQUIRE, smp_mb__after_unlock_lock(), STORE *B
-	STORE *A, ACQUIRE, RELEASE, smp_mb__after_unlock_lock(), STORE *B
-	ACQUIRE, STORE *A, RELEASE, smp_mb__after_unlock_lock(), STORE *B
-
-If the RELEASE and ACQUIRE were instead both operating on the same lock
-variable, only the first of these alternatives can occur.  In addition,
-the more strongly ordered systems may rule out some of the above orders.
-But in any case, as noted earlier, the smp_mb__after_unlock_lock()
-ensures that the store to *A will always be seen as happening before
-the store to *B.
+However, the reverse case of a RELEASE followed by an ACQUIRE _does_
+imply a full memory barrier when these accesses are performed as a pair
+of UNLOCK and LOCK operations, irrespective of the lock variable.
 
 Locks and semaphores may not provide any guarantee of ordering on UP compiled
 systems, and so cannot be counted on in such a situation to actually achieve
@@ -2158,7 +2093,6 @@ However, if the following occurs:
 	RELEASE M	     [1]
 	ACCESS_ONCE(*D) = d;		ACCESS_ONCE(*E) = e;
 					ACQUIRE M		     [2]
-					smp_mb__after_unlock_lock();
 					ACCESS_ONCE(*F) = f;
 					ACCESS_ONCE(*G) = g;
 					RELEASE M	     [2]
@@ -2176,11 +2110,6 @@ But assuming CPU 1 gets the lock first, CPU 3 won't see any of:
 	*F, *G or *H preceding ACQUIRE M [2]
 	*A, *B, *C, *E, *F or *G following RELEASE M [2]
 
-Note that the smp_mb__after_unlock_lock() is critically important
-here: Without it CPU 3 might see some of the above orderings.
-Without smp_mb__after_unlock_lock(), the accesses are not guaranteed
-to be seen in order unless CPU 3 holds lock M.
-
 
 ACQUIRES VS I/O ACCESSES
 ------------------------
diff --git a/arch/powerpc/include/asm/spinlock.h b/arch/powerpc/include/asm/spinlock.h
index 4dbe072eecbe..523673d7583c 100644
--- a/arch/powerpc/include/asm/spinlock.h
+++ b/arch/powerpc/include/asm/spinlock.h
@@ -28,8 +28,6 @@
 #include <asm/synch.h>
 #include <asm/ppc-opcode.h>
 
-#define smp_mb__after_unlock_lock()	smp_mb()  /* Full ordering for lock. */
-
 #ifdef CONFIG_PPC64
 /* use 0x800000yy when locked, where yy == CPU number */
 #ifdef __BIG_ENDIAN__
diff --git a/include/linux/spinlock.h b/include/linux/spinlock.h
index 0063b24b4f36..16c5ed5a627c 100644
--- a/include/linux/spinlock.h
+++ b/include/linux/spinlock.h
@@ -130,16 +130,6 @@ do {								\
 #define smp_mb__before_spinlock()	smp_wmb()
 #endif
 
-/*
- * Place this after a lock-acquisition primitive to guarantee that
- * an UNLOCK+LOCK pair act as a full barrier.  This guarantee applies
- * if the UNLOCK and LOCK are executed by the same CPU or if the
- * UNLOCK and LOCK operate on the same lock variable.
- */
-#ifndef smp_mb__after_unlock_lock
-#define smp_mb__after_unlock_lock()	do { } while (0)
-#endif
-
 /**
  * raw_spin_unlock_wait - wait until the spinlock gets unlocked
  * @lock: the spinlock in question.
diff --git a/kernel/locking/mcs_spinlock.h b/kernel/locking/mcs_spinlock.h
index fd91aaa4554c..2ea2fae2b477 100644
--- a/kernel/locking/mcs_spinlock.h
+++ b/kernel/locking/mcs_spinlock.h
@@ -43,15 +43,6 @@ do {									\
 #endif
 
 /*
- * Note: the smp_load_acquire/smp_store_release pair is not
- * sufficient to form a full memory barrier across
- * cpus for many architectures (except x86) for mcs_unlock and mcs_lock.
- * For applications that need a full barrier across multiple cpus
- * with mcs_unlock and mcs_lock pair, smp_mb__after_unlock_lock() should be
- * used after mcs_lock.
- */
-
-/*
  * In order to acquire the lock, the caller should declare a local node and
  * pass a reference of the node to this function in addition to the lock.
  * If the lock has already been acquired, then this will proceed to spin
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 65137bc28b2b..6689fc0808c8 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -1513,10 +1513,8 @@ rcu_start_future_gp(struct rcu_node *rnp, struct rcu_data *rdp,
 	 * hold it, acquire the root rcu_node structure's lock in order to
 	 * start one (if needed).
 	 */
-	if (rnp != rnp_root) {
+	if (rnp != rnp_root)
 		raw_spin_lock(&rnp_root->lock);
-		smp_mb__after_unlock_lock();
-	}
 
 	/*
 	 * Get a new grace-period number.  If there really is no grace
@@ -1769,7 +1767,6 @@ static void note_gp_changes(struct rcu_state *rsp, struct rcu_data *rdp)
 		local_irq_restore(flags);
 		return;
 	}
-	smp_mb__after_unlock_lock();
 	needwake = __note_gp_changes(rsp, rnp, rdp);
 	raw_spin_unlock_irqrestore(&rnp->lock, flags);
 	if (needwake)
@@ -1794,7 +1791,6 @@ static int rcu_gp_init(struct rcu_state *rsp)
 
 	WRITE_ONCE(rsp->gp_activity, jiffies);
 	raw_spin_lock_irq(&rnp->lock);
-	smp_mb__after_unlock_lock();
 	if (!READ_ONCE(rsp->gp_flags)) {
 		/* Spurious wakeup, tell caller to go back to sleep.  */
 		raw_spin_unlock_irq(&rnp->lock);
@@ -1827,7 +1823,6 @@ static int rcu_gp_init(struct rcu_state *rsp)
 	rcu_for_each_leaf_node(rsp, rnp) {
 		rcu_gp_slow(rsp, gp_preinit_delay);
 		raw_spin_lock_irq(&rnp->lock);
-		smp_mb__after_unlock_lock();
 		if (rnp->qsmaskinit == rnp->qsmaskinitnext &&
 		    !rnp->wait_blkd_tasks) {
 			/* Nothing to do on this leaf rcu_node structure. */
@@ -1884,7 +1879,6 @@ static int rcu_gp_init(struct rcu_state *rsp)
 	rcu_for_each_node_breadth_first(rsp, rnp) {
 		rcu_gp_slow(rsp, gp_init_delay);
 		raw_spin_lock_irq(&rnp->lock);
-		smp_mb__after_unlock_lock();
 		rdp = this_cpu_ptr(rsp->rda);
 		rcu_preempt_check_blocked_tasks(rnp);
 		rnp->qsmask = rnp->qsmaskinit;
@@ -1935,7 +1929,6 @@ static int rcu_gp_fqs(struct rcu_state *rsp, int fqs_state_in)
 	/* Clear flag to prevent immediate re-entry. */
 	if (READ_ONCE(rsp->gp_flags) & RCU_GP_FLAG_FQS) {
 		raw_spin_lock_irq(&rnp->lock);
-		smp_mb__after_unlock_lock();
 		WRITE_ONCE(rsp->gp_flags,
 			   READ_ONCE(rsp->gp_flags) & ~RCU_GP_FLAG_FQS);
 		raw_spin_unlock_irq(&rnp->lock);
@@ -1956,7 +1949,6 @@ static void rcu_gp_cleanup(struct rcu_state *rsp)
 
 	WRITE_ONCE(rsp->gp_activity, jiffies);
 	raw_spin_lock_irq(&rnp->lock);
-	smp_mb__after_unlock_lock();
 	gp_duration = jiffies - rsp->gp_start;
 	if (gp_duration > rsp->gp_max)
 		rsp->gp_max = gp_duration;
@@ -1982,7 +1974,6 @@ static void rcu_gp_cleanup(struct rcu_state *rsp)
 	 */
 	rcu_for_each_node_breadth_first(rsp, rnp) {
 		raw_spin_lock_irq(&rnp->lock);
-		smp_mb__after_unlock_lock();
 		WARN_ON_ONCE(rcu_preempt_blocked_readers_cgp(rnp));
 		WARN_ON_ONCE(rnp->qsmask);
 		WRITE_ONCE(rnp->completed, rsp->gpnum);
@@ -1998,7 +1989,6 @@ static void rcu_gp_cleanup(struct rcu_state *rsp)
 	}
 	rnp = rcu_get_root(rsp);
 	raw_spin_lock_irq(&rnp->lock);
-	smp_mb__after_unlock_lock(); /* Order GP before ->completed update. */
 	rcu_nocb_gp_set(rnp, nocb);
 
 	/* Declare grace period done. */
@@ -2246,7 +2236,6 @@ rcu_report_qs_rnp(unsigned long mask, struct rcu_state *rsp,
 		rnp_c = rnp;
 		rnp = rnp->parent;
 		raw_spin_lock_irqsave(&rnp->lock, flags);
-		smp_mb__after_unlock_lock();
 		oldmask = rnp_c->qsmask;
 	}
 
@@ -2294,7 +2283,6 @@ static void rcu_report_unblock_qs_rnp(struct rcu_state *rsp,
 	mask = rnp->grpmask;
 	raw_spin_unlock(&rnp->lock);	/* irqs remain disabled. */
 	raw_spin_lock(&rnp_p->lock);	/* irqs already disabled. */
-	smp_mb__after_unlock_lock();
 	rcu_report_qs_rnp(mask, rsp, rnp_p, gps, flags);
 }
 
@@ -2317,7 +2305,6 @@ rcu_report_qs_rdp(int cpu, struct rcu_state *rsp, struct rcu_data *rdp)
 
 	rnp = rdp->mynode;
 	raw_spin_lock_irqsave(&rnp->lock, flags);
-	smp_mb__after_unlock_lock();
 	if ((rdp->passed_quiesce == 0 &&
 	     rdp->rcu_qs_ctr_snap == __this_cpu_read(rcu_qs_ctr)) ||
 	    rdp->gpnum != rnp->gpnum || rnp->completed == rnp->gpnum ||
@@ -2544,7 +2531,6 @@ static void rcu_cleanup_dead_rnp(struct rcu_node *rnp_leaf)
 		if (!rnp)
 			break;
 		raw_spin_lock(&rnp->lock); /* irqs already disabled. */
-		smp_mb__after_unlock_lock(); /* GP memory ordering. */
 		rnp->qsmaskinit &= ~mask;
 		rnp->qsmask &= ~mask;
 		if (rnp->qsmaskinit) {
@@ -2573,7 +2559,6 @@ static void rcu_cleanup_dying_idle_cpu(int cpu, struct rcu_state *rsp)
 	/* Remove outgoing CPU from mask in the leaf rcu_node structure. */
 	mask = rdp->grpmask;
 	raw_spin_lock_irqsave(&rnp->lock, flags);
-	smp_mb__after_unlock_lock();	/* Enforce GP memory-order guarantee. */
 	rnp->qsmaskinitnext &= ~mask;
 	raw_spin_unlock_irqrestore(&rnp->lock, flags);
 }
@@ -2771,7 +2756,6 @@ static void force_qs_rnp(struct rcu_state *rsp,
 		cond_resched_rcu_qs();
 		mask = 0;
 		raw_spin_lock_irqsave(&rnp->lock, flags);
-		smp_mb__after_unlock_lock();
 		if (rnp->qsmask == 0) {
 			if (rcu_state_p == &rcu_sched_state ||
 			    rsp != rcu_state_p ||
@@ -2843,7 +2827,6 @@ static void force_quiescent_state(struct rcu_state *rsp)
 
 	/* Reached the root of the rcu_node tree, acquire lock. */
 	raw_spin_lock_irqsave(&rnp_old->lock, flags);
-	smp_mb__after_unlock_lock();
 	raw_spin_unlock(&rnp_old->fqslock);
 	if (READ_ONCE(rsp->gp_flags) & RCU_GP_FLAG_FQS) {
 		rsp->n_force_qs_lh++;
@@ -2967,7 +2950,6 @@ static void __call_rcu_core(struct rcu_state *rsp, struct rcu_data *rdp,
 			struct rcu_node *rnp_root = rcu_get_root(rsp);
 
 			raw_spin_lock(&rnp_root->lock);
-			smp_mb__after_unlock_lock();
 			needwake = rcu_start_gp(rsp);
 			raw_spin_unlock(&rnp_root->lock);
 			if (needwake)
@@ -3810,7 +3792,6 @@ rcu_init_percpu_data(int cpu, struct rcu_state *rsp)
 	rnp = rdp->mynode;
 	mask = rdp->grpmask;
 	raw_spin_lock(&rnp->lock);		/* irqs already disabled. */
-	smp_mb__after_unlock_lock();
 	rnp->qsmaskinitnext |= mask;
 	rdp->gpnum = rnp->completed; /* Make CPU later note any new GP. */
 	rdp->completed = rnp->completed;
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 013485fb2b06..79793a7647cf 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -164,7 +164,6 @@ static void rcu_preempt_note_context_switch(void)
 		rdp = this_cpu_ptr(rcu_state_p->rda);
 		rnp = rdp->mynode;
 		raw_spin_lock_irqsave(&rnp->lock, flags);
-		smp_mb__after_unlock_lock();
 		t->rcu_read_unlock_special.b.blocked = true;
 		t->rcu_blocked_node = rnp;
 
@@ -324,7 +323,6 @@ void rcu_read_unlock_special(struct task_struct *t)
 		for (;;) {
 			rnp = t->rcu_blocked_node;
 			raw_spin_lock(&rnp->lock);  /* irqs already disabled. */
-			smp_mb__after_unlock_lock();
 			if (rnp == t->rcu_blocked_node)
 				break;
 			WARN_ON_ONCE(1);
@@ -598,7 +596,6 @@ static void rcu_report_exp_rnp(struct rcu_state *rsp, struct rcu_node *rnp,
 	unsigned long mask;
 
 	raw_spin_lock_irqsave(&rnp->lock, flags);
-	smp_mb__after_unlock_lock();
 	for (;;) {
 		if (!sync_rcu_preempt_exp_done(rnp)) {
 			raw_spin_unlock_irqrestore(&rnp->lock, flags);
@@ -616,7 +613,6 @@ static void rcu_report_exp_rnp(struct rcu_state *rsp, struct rcu_node *rnp,
 		raw_spin_unlock(&rnp->lock); /* irqs remain disabled */
 		rnp = rnp->parent;
 		raw_spin_lock(&rnp->lock); /* irqs already disabled */
-		smp_mb__after_unlock_lock();
 		rnp->expmask &= ~mask;
 	}
 }
@@ -638,7 +634,6 @@ sync_rcu_preempt_exp_init1(struct rcu_state *rsp, struct rcu_node *rnp)
 	struct rcu_node *rnp_up;
 
 	raw_spin_lock_irqsave(&rnp->lock, flags);
-	smp_mb__after_unlock_lock();
 	WARN_ON_ONCE(rnp->expmask);
 	WARN_ON_ONCE(rnp->exp_tasks);
 	if (!rcu_preempt_has_tasks(rnp)) {
@@ -655,7 +650,6 @@ sync_rcu_preempt_exp_init1(struct rcu_state *rsp, struct rcu_node *rnp)
 		if (rnp_up->expmask & mask)
 			break;
 		raw_spin_lock(&rnp_up->lock); /* irqs already off */
-		smp_mb__after_unlock_lock();
 		rnp_up->expmask |= mask;
 		raw_spin_unlock(&rnp_up->lock); /* irqs still off */
 	}
@@ -679,7 +673,6 @@ sync_rcu_preempt_exp_init2(struct rcu_state *rsp, struct rcu_node *rnp)
 	unsigned long flags;
 
 	raw_spin_lock_irqsave(&rnp->lock, flags);
-	smp_mb__after_unlock_lock();
 	if (!rnp->expmask) {
 		/* Phase 1 didn't do anything, so Phase 2 doesn't either. */
 		raw_spin_unlock_irqrestore(&rnp->lock, flags);
@@ -1007,7 +1000,6 @@ static int rcu_boost(struct rcu_node *rnp)
 		return 0;  /* Nothing left to boost. */
 
 	raw_spin_lock_irqsave(&rnp->lock, flags);
-	smp_mb__after_unlock_lock();
 
 	/*
 	 * Recheck under the lock: all tasks in need of boosting
@@ -1195,7 +1187,6 @@ static int rcu_spawn_one_boost_kthread(struct rcu_state *rsp,
 	if (IS_ERR(t))
 		return PTR_ERR(t);
 	raw_spin_lock_irqsave(&rnp->lock, flags);
-	smp_mb__after_unlock_lock();
 	rnp->boost_kthread_task = t;
 	raw_spin_unlock_irqrestore(&rnp->lock, flags);
 	sp.sched_priority = kthread_prio;
@@ -1586,7 +1577,6 @@ static void rcu_prepare_for_idle(void)
 			continue;
 		rnp = rdp->mynode;
 		raw_spin_lock(&rnp->lock); /* irqs already disabled. */
-		smp_mb__after_unlock_lock();
 		needwake = rcu_accelerate_cbs(rsp, rnp, rdp);
 		raw_spin_unlock(&rnp->lock); /* irqs remain disabled. */
 		if (needwake)
@@ -2114,7 +2104,6 @@ static void rcu_nocb_wait_gp(struct rcu_data *rdp)
 	struct rcu_node *rnp = rdp->mynode;
 
 	raw_spin_lock_irqsave(&rnp->lock, flags);
-	smp_mb__after_unlock_lock();
 	needwake = rcu_start_future_gp(rnp, rdp, &c);
 	raw_spin_unlock_irqrestore(&rnp->lock, flags);
 	if (needwake)
-- 
2.1.4



* Re: [RFC PATCH v2] memory-barriers: remove smp_mb__after_unlock_lock()
  2015-07-13 12:15 [RFC PATCH v2] memory-barriers: remove smp_mb__after_unlock_lock() Will Deacon
@ 2015-07-13 13:09 ` Peter Hurley
  2015-07-13 14:24   ` Will Deacon
  2015-07-13 13:11 ` Peter Zijlstra
  2015-07-13 22:31 ` Benjamin Herrenschmidt
  2 siblings, 1 reply; 66+ messages in thread
From: Peter Hurley @ 2015-07-13 13:09 UTC (permalink / raw)
  To: Will Deacon, linux-arch
  Cc: linux-kernel, Benjamin Herrenschmidt, Paul McKenney, Peter Zijlstra

On 07/13/2015 08:15 AM, Will Deacon wrote:
> smp_mb__after_unlock_lock is used to promote an UNLOCK + LOCK sequence
> into a full memory barrier.
> 
> However:
> 
>   - This ordering guarantee is already provided without the barrier on
>     all architectures apart from PowerPC
> 
>   - The barrier only applies to UNLOCK + LOCK, not general
>     RELEASE + ACQUIRE operations

I'm unclear what you mean here: do you mean
A) a memory barrier is not required between RELEASE M + ACQUIRE N when you
   want to maintain distinct order between those operations on all arches
   (with the possible exception of PowerPC), or,
B) no one is using smp_mb__after_unlock_lock() in that way right now.

Regards,
Peter Hurley

>   - Locks are generally assumed to offer SC ordering semantics, so
>     having this additional barrier is error-prone and complicates the
>     callers of LOCK/UNLOCK primitives
> 
>   - The barrier is not well used outside of RCU and, because it was
>     retrofitted into the kernel, it's not clear whether other areas of
>     the kernel are incorrectly relying on UNLOCK + LOCK implying a full
>     barrier
> 
> This patch removes the barrier and instead requires architectures to
> provide full barrier semantics for an UNLOCK + LOCK sequence.
> 
> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Will Deacon <will.deacon@arm.com>
> ---
> 
> This didn't go anywhere last time I posted it, but here it is again.
> I'd really appreciate some feedback from the PowerPC guys, especially as
> to whether this change requires them to add an additional barrier in
> arch_spin_unlock and what the cost of that would be.
> 
>  Documentation/memory-barriers.txt   | 77 ++-----------------------------------
>  arch/powerpc/include/asm/spinlock.h |  2 -
>  include/linux/spinlock.h            | 10 -----
>  kernel/locking/mcs_spinlock.h       |  9 -----
>  kernel/rcu/tree.c                   | 21 +---------
>  kernel/rcu/tree_plugin.h            | 11 ------
>  6 files changed, 4 insertions(+), 126 deletions(-)
> 
> diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
> index 13feb697271f..fff21b632893 100644
> --- a/Documentation/memory-barriers.txt
> +++ b/Documentation/memory-barriers.txt
> @@ -1848,74 +1848,9 @@ RELEASE are to the same lock variable, but only from the perspective of
>  another CPU not holding that lock.  In short, a ACQUIRE followed by an
>  RELEASE may -not- be assumed to be a full memory barrier.
>  
> -Similarly, the reverse case of a RELEASE followed by an ACQUIRE does not
> -imply a full memory barrier.  If it is necessary for a RELEASE-ACQUIRE
> -pair to produce a full barrier, the ACQUIRE can be followed by an
> -smp_mb__after_unlock_lock() invocation.  This will produce a full barrier
> -if either (a) the RELEASE and the ACQUIRE are executed by the same
> -CPU or task, or (b) the RELEASE and ACQUIRE act on the same variable.
> -The smp_mb__after_unlock_lock() primitive is free on many architectures.
> -Without smp_mb__after_unlock_lock(), the CPU's execution of the critical
> -sections corresponding to the RELEASE and the ACQUIRE can cross, so that:
> -
> -	*A = a;
> -	RELEASE M
> -	ACQUIRE N
> -	*B = b;
> -
> -could occur as:
> -
> -	ACQUIRE N, STORE *B, STORE *A, RELEASE M
> -
> -It might appear that this reordering could introduce a deadlock.
> -However, this cannot happen because if such a deadlock threatened,
> -the RELEASE would simply complete, thereby avoiding the deadlock.
> -
> -	Why does this work?
> -
> -	One key point is that we are only talking about the CPU doing
> -	the reordering, not the compiler.  If the compiler (or, for
> -	that matter, the developer) switched the operations, deadlock
> -	-could- occur.
> -
> -	But suppose the CPU reordered the operations.  In this case,
> -	the unlock precedes the lock in the assembly code.  The CPU
> -	simply elected to try executing the later lock operation first.
> -	If there is a deadlock, this lock operation will simply spin (or
> -	try to sleep, but more on that later).	The CPU will eventually
> -	execute the unlock operation (which preceded the lock operation
> -	in the assembly code), which will unravel the potential deadlock,
> -	allowing the lock operation to succeed.
> -
> -	But what if the lock is a sleeplock?  In that case, the code will
> -	try to enter the scheduler, where it will eventually encounter
> -	a memory barrier, which will force the earlier unlock operation
> -	to complete, again unraveling the deadlock.  There might be
> -	a sleep-unlock race, but the locking primitive needs to resolve
> -	such races properly in any case.
> -
> -With smp_mb__after_unlock_lock(), the two critical sections cannot overlap.
> -For example, with the following code, the store to *A will always be
> -seen by other CPUs before the store to *B:
> -
> -	*A = a;
> -	RELEASE M
> -	ACQUIRE N
> -	smp_mb__after_unlock_lock();
> -	*B = b;
> -
> -The operations will always occur in one of the following orders:
> -
> -	STORE *A, RELEASE, ACQUIRE, smp_mb__after_unlock_lock(), STORE *B
> -	STORE *A, ACQUIRE, RELEASE, smp_mb__after_unlock_lock(), STORE *B
> -	ACQUIRE, STORE *A, RELEASE, smp_mb__after_unlock_lock(), STORE *B
> -
> -If the RELEASE and ACQUIRE were instead both operating on the same lock
> -variable, only the first of these alternatives can occur.  In addition,
> -the more strongly ordered systems may rule out some of the above orders.
> -But in any case, as noted earlier, the smp_mb__after_unlock_lock()
> -ensures that the store to *A will always be seen as happening before
> -the store to *B.
> +However, the reverse case of a RELEASE followed by an ACQUIRE _does_
> +imply a full memory barrier when these accesses are performed as a pair
> +of UNLOCK and LOCK operations, irrespective of the lock variable.
>  
>  Locks and semaphores may not provide any guarantee of ordering on UP compiled
>  systems, and so cannot be counted on in such a situation to actually achieve
> @@ -2158,7 +2093,6 @@ However, if the following occurs:
>  	RELEASE M	     [1]
>  	ACCESS_ONCE(*D) = d;		ACCESS_ONCE(*E) = e;
>  					ACQUIRE M		     [2]
> -					smp_mb__after_unlock_lock();
>  					ACCESS_ONCE(*F) = f;
>  					ACCESS_ONCE(*G) = g;
>  					RELEASE M	     [2]
> @@ -2176,11 +2110,6 @@ But assuming CPU 1 gets the lock first, CPU 3 won't see any of:
>  	*F, *G or *H preceding ACQUIRE M [2]
>  	*A, *B, *C, *E, *F or *G following RELEASE M [2]
>  
> -Note that the smp_mb__after_unlock_lock() is critically important
> -here: Without it CPU 3 might see some of the above orderings.
> -Without smp_mb__after_unlock_lock(), the accesses are not guaranteed
> -to be seen in order unless CPU 3 holds lock M.
> -
>  
>  ACQUIRES VS I/O ACCESSES
>  ------------------------
> diff --git a/arch/powerpc/include/asm/spinlock.h b/arch/powerpc/include/asm/spinlock.h
> index 4dbe072eecbe..523673d7583c 100644
> --- a/arch/powerpc/include/asm/spinlock.h
> +++ b/arch/powerpc/include/asm/spinlock.h
> @@ -28,8 +28,6 @@
>  #include <asm/synch.h>
>  #include <asm/ppc-opcode.h>
>  
> -#define smp_mb__after_unlock_lock()	smp_mb()  /* Full ordering for lock. */
> -
>  #ifdef CONFIG_PPC64
>  /* use 0x800000yy when locked, where yy == CPU number */
>  #ifdef __BIG_ENDIAN__
> diff --git a/include/linux/spinlock.h b/include/linux/spinlock.h
> index 0063b24b4f36..16c5ed5a627c 100644
> --- a/include/linux/spinlock.h
> +++ b/include/linux/spinlock.h
> @@ -130,16 +130,6 @@ do {								\
>  #define smp_mb__before_spinlock()	smp_wmb()
>  #endif
>  
> -/*
> - * Place this after a lock-acquisition primitive to guarantee that
> - * an UNLOCK+LOCK pair act as a full barrier.  This guarantee applies
> - * if the UNLOCK and LOCK are executed by the same CPU or if the
> - * UNLOCK and LOCK operate on the same lock variable.
> - */
> -#ifndef smp_mb__after_unlock_lock
> -#define smp_mb__after_unlock_lock()	do { } while (0)
> -#endif
> -
>  /**
>   * raw_spin_unlock_wait - wait until the spinlock gets unlocked
>   * @lock: the spinlock in question.
> diff --git a/kernel/locking/mcs_spinlock.h b/kernel/locking/mcs_spinlock.h
> index fd91aaa4554c..2ea2fae2b477 100644
> --- a/kernel/locking/mcs_spinlock.h
> +++ b/kernel/locking/mcs_spinlock.h
> @@ -43,15 +43,6 @@ do {									\
>  #endif
>  
>  /*
> - * Note: the smp_load_acquire/smp_store_release pair is not
> - * sufficient to form a full memory barrier across
> - * cpus for many architectures (except x86) for mcs_unlock and mcs_lock.
> - * For applications that need a full barrier across multiple cpus
> - * with mcs_unlock and mcs_lock pair, smp_mb__after_unlock_lock() should be
> - * used after mcs_lock.
> - */
> -
> -/*
>   * In order to acquire the lock, the caller should declare a local node and
>   * pass a reference of the node to this function in addition to the lock.
>   * If the lock has already been acquired, then this will proceed to spin
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 65137bc28b2b..6689fc0808c8 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -1513,10 +1513,8 @@ rcu_start_future_gp(struct rcu_node *rnp, struct rcu_data *rdp,
>  	 * hold it, acquire the root rcu_node structure's lock in order to
>  	 * start one (if needed).
>  	 */
> -	if (rnp != rnp_root) {
> +	if (rnp != rnp_root)
>  		raw_spin_lock(&rnp_root->lock);
> -		smp_mb__after_unlock_lock();
> -	}
>  
>  	/*
>  	 * Get a new grace-period number.  If there really is no grace
> @@ -1769,7 +1767,6 @@ static void note_gp_changes(struct rcu_state *rsp, struct rcu_data *rdp)
>  		local_irq_restore(flags);
>  		return;
>  	}
> -	smp_mb__after_unlock_lock();
>  	needwake = __note_gp_changes(rsp, rnp, rdp);
>  	raw_spin_unlock_irqrestore(&rnp->lock, flags);
>  	if (needwake)
> @@ -1794,7 +1791,6 @@ static int rcu_gp_init(struct rcu_state *rsp)
>  
>  	WRITE_ONCE(rsp->gp_activity, jiffies);
>  	raw_spin_lock_irq(&rnp->lock);
> -	smp_mb__after_unlock_lock();
>  	if (!READ_ONCE(rsp->gp_flags)) {
>  		/* Spurious wakeup, tell caller to go back to sleep.  */
>  		raw_spin_unlock_irq(&rnp->lock);
> @@ -1827,7 +1823,6 @@ static int rcu_gp_init(struct rcu_state *rsp)
>  	rcu_for_each_leaf_node(rsp, rnp) {
>  		rcu_gp_slow(rsp, gp_preinit_delay);
>  		raw_spin_lock_irq(&rnp->lock);
> -		smp_mb__after_unlock_lock();
>  		if (rnp->qsmaskinit == rnp->qsmaskinitnext &&
>  		    !rnp->wait_blkd_tasks) {
>  			/* Nothing to do on this leaf rcu_node structure. */
> @@ -1884,7 +1879,6 @@ static int rcu_gp_init(struct rcu_state *rsp)
>  	rcu_for_each_node_breadth_first(rsp, rnp) {
>  		rcu_gp_slow(rsp, gp_init_delay);
>  		raw_spin_lock_irq(&rnp->lock);
> -		smp_mb__after_unlock_lock();
>  		rdp = this_cpu_ptr(rsp->rda);
>  		rcu_preempt_check_blocked_tasks(rnp);
>  		rnp->qsmask = rnp->qsmaskinit;
> @@ -1935,7 +1929,6 @@ static int rcu_gp_fqs(struct rcu_state *rsp, int fqs_state_in)
>  	/* Clear flag to prevent immediate re-entry. */
>  	if (READ_ONCE(rsp->gp_flags) & RCU_GP_FLAG_FQS) {
>  		raw_spin_lock_irq(&rnp->lock);
> -		smp_mb__after_unlock_lock();
>  		WRITE_ONCE(rsp->gp_flags,
>  			   READ_ONCE(rsp->gp_flags) & ~RCU_GP_FLAG_FQS);
>  		raw_spin_unlock_irq(&rnp->lock);
> @@ -1956,7 +1949,6 @@ static void rcu_gp_cleanup(struct rcu_state *rsp)
>  
>  	WRITE_ONCE(rsp->gp_activity, jiffies);
>  	raw_spin_lock_irq(&rnp->lock);
> -	smp_mb__after_unlock_lock();
>  	gp_duration = jiffies - rsp->gp_start;
>  	if (gp_duration > rsp->gp_max)
>  		rsp->gp_max = gp_duration;
> @@ -1982,7 +1974,6 @@ static void rcu_gp_cleanup(struct rcu_state *rsp)
>  	 */
>  	rcu_for_each_node_breadth_first(rsp, rnp) {
>  		raw_spin_lock_irq(&rnp->lock);
> -		smp_mb__after_unlock_lock();
>  		WARN_ON_ONCE(rcu_preempt_blocked_readers_cgp(rnp));
>  		WARN_ON_ONCE(rnp->qsmask);
>  		WRITE_ONCE(rnp->completed, rsp->gpnum);
> @@ -1998,7 +1989,6 @@ static void rcu_gp_cleanup(struct rcu_state *rsp)
>  	}
>  	rnp = rcu_get_root(rsp);
>  	raw_spin_lock_irq(&rnp->lock);
> -	smp_mb__after_unlock_lock(); /* Order GP before ->completed update. */
>  	rcu_nocb_gp_set(rnp, nocb);
>  
>  	/* Declare grace period done. */
> @@ -2246,7 +2236,6 @@ rcu_report_qs_rnp(unsigned long mask, struct rcu_state *rsp,
>  		rnp_c = rnp;
>  		rnp = rnp->parent;
>  		raw_spin_lock_irqsave(&rnp->lock, flags);
> -		smp_mb__after_unlock_lock();
>  		oldmask = rnp_c->qsmask;
>  	}
>  
> @@ -2294,7 +2283,6 @@ static void rcu_report_unblock_qs_rnp(struct rcu_state *rsp,
>  	mask = rnp->grpmask;
>  	raw_spin_unlock(&rnp->lock);	/* irqs remain disabled. */
>  	raw_spin_lock(&rnp_p->lock);	/* irqs already disabled. */
> -	smp_mb__after_unlock_lock();
>  	rcu_report_qs_rnp(mask, rsp, rnp_p, gps, flags);
>  }
>  
> @@ -2317,7 +2305,6 @@ rcu_report_qs_rdp(int cpu, struct rcu_state *rsp, struct rcu_data *rdp)
>  
>  	rnp = rdp->mynode;
>  	raw_spin_lock_irqsave(&rnp->lock, flags);
> -	smp_mb__after_unlock_lock();
>  	if ((rdp->passed_quiesce == 0 &&
>  	     rdp->rcu_qs_ctr_snap == __this_cpu_read(rcu_qs_ctr)) ||
>  	    rdp->gpnum != rnp->gpnum || rnp->completed == rnp->gpnum ||
> @@ -2544,7 +2531,6 @@ static void rcu_cleanup_dead_rnp(struct rcu_node *rnp_leaf)
>  		if (!rnp)
>  			break;
>  		raw_spin_lock(&rnp->lock); /* irqs already disabled. */
> -		smp_mb__after_unlock_lock(); /* GP memory ordering. */
>  		rnp->qsmaskinit &= ~mask;
>  		rnp->qsmask &= ~mask;
>  		if (rnp->qsmaskinit) {
> @@ -2573,7 +2559,6 @@ static void rcu_cleanup_dying_idle_cpu(int cpu, struct rcu_state *rsp)
>  	/* Remove outgoing CPU from mask in the leaf rcu_node structure. */
>  	mask = rdp->grpmask;
>  	raw_spin_lock_irqsave(&rnp->lock, flags);
> -	smp_mb__after_unlock_lock();	/* Enforce GP memory-order guarantee. */
>  	rnp->qsmaskinitnext &= ~mask;
>  	raw_spin_unlock_irqrestore(&rnp->lock, flags);
>  }
> @@ -2771,7 +2756,6 @@ static void force_qs_rnp(struct rcu_state *rsp,
>  		cond_resched_rcu_qs();
>  		mask = 0;
>  		raw_spin_lock_irqsave(&rnp->lock, flags);
> -		smp_mb__after_unlock_lock();
>  		if (rnp->qsmask == 0) {
>  			if (rcu_state_p == &rcu_sched_state ||
>  			    rsp != rcu_state_p ||
> @@ -2843,7 +2827,6 @@ static void force_quiescent_state(struct rcu_state *rsp)
>  
>  	/* Reached the root of the rcu_node tree, acquire lock. */
>  	raw_spin_lock_irqsave(&rnp_old->lock, flags);
> -	smp_mb__after_unlock_lock();
>  	raw_spin_unlock(&rnp_old->fqslock);
>  	if (READ_ONCE(rsp->gp_flags) & RCU_GP_FLAG_FQS) {
>  		rsp->n_force_qs_lh++;
> @@ -2967,7 +2950,6 @@ static void __call_rcu_core(struct rcu_state *rsp, struct rcu_data *rdp,
>  			struct rcu_node *rnp_root = rcu_get_root(rsp);
>  
>  			raw_spin_lock(&rnp_root->lock);
> -			smp_mb__after_unlock_lock();
>  			needwake = rcu_start_gp(rsp);
>  			raw_spin_unlock(&rnp_root->lock);
>  			if (needwake)
> @@ -3810,7 +3792,6 @@ rcu_init_percpu_data(int cpu, struct rcu_state *rsp)
>  	rnp = rdp->mynode;
>  	mask = rdp->grpmask;
>  	raw_spin_lock(&rnp->lock);		/* irqs already disabled. */
> -	smp_mb__after_unlock_lock();
>  	rnp->qsmaskinitnext |= mask;
>  	rdp->gpnum = rnp->completed; /* Make CPU later note any new GP. */
>  	rdp->completed = rnp->completed;
> diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> index 013485fb2b06..79793a7647cf 100644
> --- a/kernel/rcu/tree_plugin.h
> +++ b/kernel/rcu/tree_plugin.h
> @@ -164,7 +164,6 @@ static void rcu_preempt_note_context_switch(void)
>  		rdp = this_cpu_ptr(rcu_state_p->rda);
>  		rnp = rdp->mynode;
>  		raw_spin_lock_irqsave(&rnp->lock, flags);
> -		smp_mb__after_unlock_lock();
>  		t->rcu_read_unlock_special.b.blocked = true;
>  		t->rcu_blocked_node = rnp;
>  
> @@ -324,7 +323,6 @@ void rcu_read_unlock_special(struct task_struct *t)
>  		for (;;) {
>  			rnp = t->rcu_blocked_node;
>  			raw_spin_lock(&rnp->lock);  /* irqs already disabled. */
> -			smp_mb__after_unlock_lock();
>  			if (rnp == t->rcu_blocked_node)
>  				break;
>  			WARN_ON_ONCE(1);
> @@ -598,7 +596,6 @@ static void rcu_report_exp_rnp(struct rcu_state *rsp, struct rcu_node *rnp,
>  	unsigned long mask;
>  
>  	raw_spin_lock_irqsave(&rnp->lock, flags);
> -	smp_mb__after_unlock_lock();
>  	for (;;) {
>  		if (!sync_rcu_preempt_exp_done(rnp)) {
>  			raw_spin_unlock_irqrestore(&rnp->lock, flags);
> @@ -616,7 +613,6 @@ static void rcu_report_exp_rnp(struct rcu_state *rsp, struct rcu_node *rnp,
>  		raw_spin_unlock(&rnp->lock); /* irqs remain disabled */
>  		rnp = rnp->parent;
>  		raw_spin_lock(&rnp->lock); /* irqs already disabled */
> -		smp_mb__after_unlock_lock();
>  		rnp->expmask &= ~mask;
>  	}
>  }
> @@ -638,7 +634,6 @@ sync_rcu_preempt_exp_init1(struct rcu_state *rsp, struct rcu_node *rnp)
>  	struct rcu_node *rnp_up;
>  
>  	raw_spin_lock_irqsave(&rnp->lock, flags);
> -	smp_mb__after_unlock_lock();
>  	WARN_ON_ONCE(rnp->expmask);
>  	WARN_ON_ONCE(rnp->exp_tasks);
>  	if (!rcu_preempt_has_tasks(rnp)) {
> @@ -655,7 +650,6 @@ sync_rcu_preempt_exp_init1(struct rcu_state *rsp, struct rcu_node *rnp)
>  		if (rnp_up->expmask & mask)
>  			break;
>  		raw_spin_lock(&rnp_up->lock); /* irqs already off */
> -		smp_mb__after_unlock_lock();
>  		rnp_up->expmask |= mask;
>  		raw_spin_unlock(&rnp_up->lock); /* irqs still off */
>  	}
> @@ -679,7 +673,6 @@ sync_rcu_preempt_exp_init2(struct rcu_state *rsp, struct rcu_node *rnp)
>  	unsigned long flags;
>  
>  	raw_spin_lock_irqsave(&rnp->lock, flags);
> -	smp_mb__after_unlock_lock();
>  	if (!rnp->expmask) {
>  		/* Phase 1 didn't do anything, so Phase 2 doesn't either. */
>  		raw_spin_unlock_irqrestore(&rnp->lock, flags);
> @@ -1007,7 +1000,6 @@ static int rcu_boost(struct rcu_node *rnp)
>  		return 0;  /* Nothing left to boost. */
>  
>  	raw_spin_lock_irqsave(&rnp->lock, flags);
> -	smp_mb__after_unlock_lock();
>  
>  	/*
>  	 * Recheck under the lock: all tasks in need of boosting
> @@ -1195,7 +1187,6 @@ static int rcu_spawn_one_boost_kthread(struct rcu_state *rsp,
>  	if (IS_ERR(t))
>  		return PTR_ERR(t);
>  	raw_spin_lock_irqsave(&rnp->lock, flags);
> -	smp_mb__after_unlock_lock();
>  	rnp->boost_kthread_task = t;
>  	raw_spin_unlock_irqrestore(&rnp->lock, flags);
>  	sp.sched_priority = kthread_prio;
> @@ -1586,7 +1577,6 @@ static void rcu_prepare_for_idle(void)
>  			continue;
>  		rnp = rdp->mynode;
>  		raw_spin_lock(&rnp->lock); /* irqs already disabled. */
> -		smp_mb__after_unlock_lock();
>  		needwake = rcu_accelerate_cbs(rsp, rnp, rdp);
>  		raw_spin_unlock(&rnp->lock); /* irqs remain disabled. */
>  		if (needwake)
> @@ -2114,7 +2104,6 @@ static void rcu_nocb_wait_gp(struct rcu_data *rdp)
>  	struct rcu_node *rnp = rdp->mynode;
>  
>  	raw_spin_lock_irqsave(&rnp->lock, flags);
> -	smp_mb__after_unlock_lock();
>  	needwake = rcu_start_future_gp(rnp, rdp, &c);
>  	raw_spin_unlock_irqrestore(&rnp->lock, flags);
>  	if (needwake)
> 



* Re: [RFC PATCH v2] memory-barriers: remove smp_mb__after_unlock_lock()
  2015-07-13 12:15 [RFC PATCH v2] memory-barriers: remove smp_mb__after_unlock_lock() Will Deacon
  2015-07-13 13:09 ` Peter Hurley
@ 2015-07-13 13:11 ` Peter Zijlstra
  2015-07-13 14:09   ` Will Deacon
  2015-07-13 22:31 ` Benjamin Herrenschmidt
  2 siblings, 1 reply; 66+ messages in thread
From: Peter Zijlstra @ 2015-07-13 13:11 UTC (permalink / raw)
  To: Will Deacon
  Cc: linux-arch, linux-kernel, Benjamin Herrenschmidt, Paul McKenney

On Mon, Jul 13, 2015 at 01:15:04PM +0100, Will Deacon wrote:
> smp_mb__after_unlock_lock is used to promote an UNLOCK + LOCK sequence
> into a full memory barrier.
> 
> However:

>   - The barrier only applies to UNLOCK + LOCK, not general
>     RELEASE + ACQUIRE operations

No it does too; note that on ppc both acquire and release use lwsync and
two lwsyncs do not make a sync.




* Re: [RFC PATCH v2] memory-barriers: remove smp_mb__after_unlock_lock()
  2015-07-13 13:11 ` Peter Zijlstra
@ 2015-07-13 14:09   ` Will Deacon
  2015-07-13 14:21     ` Will Deacon
  0 siblings, 1 reply; 66+ messages in thread
From: Will Deacon @ 2015-07-13 14:09 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-arch, linux-kernel, Benjamin Herrenschmidt, Paul McKenney

On Mon, Jul 13, 2015 at 02:11:43PM +0100, Peter Zijlstra wrote:
> On Mon, Jul 13, 2015 at 01:15:04PM +0100, Will Deacon wrote:
> > smp_mb__after_unlock_lock is used to promote an UNLOCK + LOCK sequence
> > into a full memory barrier.
> > 
> > However:
> 
> >   - The barrier only applies to UNLOCK + LOCK, not general
> >     RELEASE + ACQUIRE operations
> 
> No it does too; note that on ppc both acquire and release use lwsync and
> two lwsyncs do not make a sync.

Really? IIUC, that means smp_mb__after_unlock_lock needs to be a full
barrier on all architectures implementing smp_store_release as smp_mb() +
STORE, otherwise the following isn't ordered:

  RELEASE X
  smp_mb__after_unlock_lock()
  ACQUIRE Y

On 32-bit ARM (at least), the ACQUIRE can be observed before the RELEASE.
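
Concretely, with the generic fallbacks (a sketch, assuming
smp_store_release() is smp_mb() + STORE and smp_load_acquire() is
LOAD + smp_mb(), as on 32-bit ARM today):

	smp_mb();			/* smp_store_release(&X, 1) ... */
	WRITE_ONCE(X, 1);		/* ... expands to this          */
	smp_mb__after_unlock_lock();	/* currently a no-op here       */
	r1 = READ_ONCE(Y);		/* smp_load_acquire(&Y) ...     */
	smp_mb();			/* ... expands to this          */

	/*
	 * Nothing separates the store to X from the load of Y, so the
	 * ACQUIRE can be satisfied before the RELEASE is visible to
	 * other CPUs.
	 */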

Will


* Re: [RFC PATCH v2] memory-barriers: remove smp_mb__after_unlock_lock()
  2015-07-13 14:09   ` Will Deacon
@ 2015-07-13 14:21     ` Will Deacon
  2015-07-13 15:54       ` Peter Zijlstra
  0 siblings, 1 reply; 66+ messages in thread
From: Will Deacon @ 2015-07-13 14:21 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-arch, linux-kernel, Benjamin Herrenschmidt, Paul McKenney

On Mon, Jul 13, 2015 at 03:09:15PM +0100, Will Deacon wrote:
> On Mon, Jul 13, 2015 at 02:11:43PM +0100, Peter Zijlstra wrote:
> > On Mon, Jul 13, 2015 at 01:15:04PM +0100, Will Deacon wrote:
> > > smp_mb__after_unlock_lock is used to promote an UNLOCK + LOCK sequence
> > > into a full memory barrier.
> > > 
> > > However:
> > 
> > >   - The barrier only applies to UNLOCK + LOCK, not general
> > >     RELEASE + ACQUIRE operations
> > 
> > No it does too; note that on ppc both acquire and release use lwsync and
> > two lwsyncs do not make a sync.
> 
> Really? IIUC, that means smp_mb__after_unlock_lock needs to be a full
> barrier on all architectures implementing smp_store_release as smp_mb() +
> STORE, otherwise the following isn't ordered:
> 
>   RELEASE X
>   smp_mb__after_unlock_lock()
>   ACQUIRE Y
> 
> On 32-bit ARM (at least), the ACQUIRE can be observed before the RELEASE.

I knew we'd had this conversation before ;)

  http://lkml.kernel.org/r/20150120093443.GA11596@twins.programming.kicks-ass.net

Will


* Re: [RFC PATCH v2] memory-barriers: remove smp_mb__after_unlock_lock()
  2015-07-13 13:09 ` Peter Hurley
@ 2015-07-13 14:24   ` Will Deacon
  2015-07-13 15:56     ` Peter Zijlstra
  0 siblings, 1 reply; 66+ messages in thread
From: Will Deacon @ 2015-07-13 14:24 UTC (permalink / raw)
  To: Peter Hurley
  Cc: linux-arch, linux-kernel, Benjamin Herrenschmidt, Paul McKenney,
	Peter Zijlstra

On Mon, Jul 13, 2015 at 02:09:50PM +0100, Peter Hurley wrote:
> On 07/13/2015 08:15 AM, Will Deacon wrote:
> > smp_mb__after_unlock_lock is used to promote an UNLOCK + LOCK sequence
> > into a full memory barrier.
> >
> > However:
> >
> >   - This ordering guarantee is already provided without the barrier on
> >     all architectures apart from PowerPC
> >
> >   - The barrier only applies to UNLOCK + LOCK, not general
> >     RELEASE + ACQUIRE operations
> 
> I'm unclear what you mean here: do you mean
> A) a memory barrier is not required between RELEASE M + ACQUIRE N when you
>    want to maintain distinct order between those operations on all arches
>    (with the possible exception of PowerPC), or,
> B) no one is using smp_mb__after_unlock_lock() in that way right now.

My understanding is (B), but Peter and I don't seem to agree yet!
I'll tighten up the text once we reach a conclusion.

Will


* Re: [RFC PATCH v2] memory-barriers: remove smp_mb__after_unlock_lock()
  2015-07-13 14:21     ` Will Deacon
@ 2015-07-13 15:54       ` Peter Zijlstra
  2015-07-13 17:50         ` Will Deacon
                           ` (2 more replies)
  0 siblings, 3 replies; 66+ messages in thread
From: Peter Zijlstra @ 2015-07-13 15:54 UTC (permalink / raw)
  To: Will Deacon
  Cc: linux-arch, linux-kernel, Benjamin Herrenschmidt, Paul McKenney

On Mon, Jul 13, 2015 at 03:21:10PM +0100, Will Deacon wrote:
> On Mon, Jul 13, 2015 at 03:09:15PM +0100, Will Deacon wrote:
> > On Mon, Jul 13, 2015 at 02:11:43PM +0100, Peter Zijlstra wrote:
> > > On Mon, Jul 13, 2015 at 01:15:04PM +0100, Will Deacon wrote:
> > > > smp_mb__after_unlock_lock is used to promote an UNLOCK + LOCK sequence
> > > > into a full memory barrier.
> > > > 
> > > > However:
> > > 
> > > >   - The barrier only applies to UNLOCK + LOCK, not general
> > > >     RELEASE + ACQUIRE operations
> > > 
> > > No it does too; note that on ppc both acquire and release use lwsync and
> > > two lwsyncs do not make a sync.
> > 
> > Really? IIUC, that means smp_mb__after_unlock_lock needs to be a full
> > barrier on all architectures implementing smp_store_release as smp_mb() +
> > STORE, otherwise the following isn't ordered:
> > 
> >   RELEASE X
> >   smp_mb__after_unlock_lock()
> >   ACQUIRE Y
> > 
> > On 32-bit ARM (at least), the ACQUIRE can be observed before the RELEASE.
> 
> I knew we'd had this conversation before ;)
> 
>   http://lkml.kernel.org/r/20150120093443.GA11596@twins.programming.kicks-ass.net

Ha! yes. And I had indeed forgotten about this argument.

However I think we should look at the insides of the critical sections;
for example (from Documentation/memory-barriers.txt):

"       *A = a;
        RELEASE M
        ACQUIRE N
        *B = b;

could occur as:

        ACQUIRE N, STORE *B, STORE *A, RELEASE M"

This could not in fact happen, even though we could flip M and N, A and
B will remain strongly ordered.

That said, I don't think this could even happen on PPC because we have
load_acquire and store_release, this means that:

	*A = a
	lwsync
	store_release M
	load_acquire N
	lwsync
	*B = b

And since the store to M is wrapped inside two lwsync there must be
strong store order, and because the load from N is equally wrapped in
two lwsyncs there must also be strong load order.

In fact, no store/load can cross from before the first lwsync to after
the latter and the other way around.

So in that respect it does provide full load-store ordering. What it
does not provide is order for M and N, nor does it provide transitivity,
but looking at our documentation I'm not at all sure we guarantee that
in any case.




* Re: [RFC PATCH v2] memory-barriers: remove smp_mb__after_unlock_lock()
  2015-07-13 14:24   ` Will Deacon
@ 2015-07-13 15:56     ` Peter Zijlstra
  0 siblings, 0 replies; 66+ messages in thread
From: Peter Zijlstra @ 2015-07-13 15:56 UTC (permalink / raw)
  To: Will Deacon
  Cc: Peter Hurley, linux-arch, linux-kernel, Benjamin Herrenschmidt,
	Paul McKenney

On Mon, Jul 13, 2015 at 03:24:18PM +0100, Will Deacon wrote:
> On Mon, Jul 13, 2015 at 02:09:50PM +0100, Peter Hurley wrote:
> > On 07/13/2015 08:15 AM, Will Deacon wrote:
> > > smp_mb__after_unlock_lock is used to promote an UNLOCK + LOCK sequence
> > > into a full memory barrier.
> > >
> > > However:
> > >
> > >   - This ordering guarantee is already provided without the barrier on
> > >     all architectures apart from PowerPC
> > >
> > >   - The barrier only applies to UNLOCK + LOCK, not general
> > >     RELEASE + ACQUIRE operations
> > 
> > I'm unclear what you mean here: do you mean
> > A) a memory barrier is not required between RELEASE M + ACQUIRE N when you
> >    want to maintain distinct order between those operations on all arches
> >    (with the possible exception of PowerPC), or,
> > B) no one is using smp_mb__after_unlock_lock() in that way right now.
> 
> My understanding is (B), but Peter and I don't seem to agree yet!
> I'll tighten up the text once we reach a conclusion.

I'm fairly sure (but I've not looked) that nobody does in fact rely on
this.

So I'm in agreement with B, and I'm quibbling on what exactly A means
;-)


* Re: [RFC PATCH v2] memory-barriers: remove smp_mb__after_unlock_lock()
  2015-07-13 15:54       ` Peter Zijlstra
@ 2015-07-13 17:50         ` Will Deacon
  2015-07-13 20:20           ` Paul E. McKenney
  2015-07-13 18:23         ` Paul E. McKenney
  2015-07-13 22:37         ` Benjamin Herrenschmidt
  2 siblings, 1 reply; 66+ messages in thread
From: Will Deacon @ 2015-07-13 17:50 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-arch, linux-kernel, Benjamin Herrenschmidt, Paul McKenney

On Mon, Jul 13, 2015 at 04:54:47PM +0100, Peter Zijlstra wrote:
> However I think we should look at the insides of the critical sections;
> for example (from Documentation/memory-barriers.txt):
> 
> "       *A = a;
>         RELEASE M
>         ACQUIRE N
>         *B = b;
> 
> could occur as:
> 
>         ACQUIRE N, STORE *B, STORE *A, RELEASE M"
> 
> This could not in fact happen, even though we could flip M and N, A and
> B will remain strongly ordered.
> 
> That said, I don't think this could even happen on PPC because we have
> load_acquire and store_release, this means that:
> 
> 	*A = a
> 	lwsync
> 	store_release M
> 	load_acquire N
> 	lwsync
> 	*B = b
> 
> And since the store to M is wrapped inside two lwsync there must be
> strong store order, and because the load from N is equally wrapped in
> two lwsyncs there must also be strong load order.
> 
> In fact, no store/load can cross from before the first lwsync to after
> the latter and the other way around.
> 
> So in that respect it does provide full load-store ordering. What it
> does not provide is order for M and N, nor does it provide transitivity,
> but looking at our documentation I'm not at all sure we guarantee that
> in any case.

So if I'm following along, smp_mb__after_unlock_lock *does* provide
transitivity when used with UNLOCK + LOCK, which is stronger than your
example here.

I don't think we want to make the same guarantee for general RELEASE +
ACQUIRE, because we'd end up forcing most architectures to implement the
expensive macro for a case that currently has no users.

In which case, it boils down to the question of how expensive it would
be to implement an SC UNLOCK operation on PowerPC and whether that justifies
the existence of a complicated barrier macro that isn't used outside of
RCU.
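
For concreteness, I imagine the naive way of doing that would be to
strengthen the release barrier in unlock to a full sync, something like
the (untested, purely illustrative) sketch below; what that costs in
practice is exactly the question:

	static inline void arch_spin_unlock(arch_spinlock_t *lock)
	{
		/* hypothetical: full sync instead of lwsync on release */
		__asm__ __volatile__("sync" : : : "memory");
		lock->slock = 0;
	}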

Will


* Re: [RFC PATCH v2] memory-barriers: remove smp_mb__after_unlock_lock()
  2015-07-13 15:54       ` Peter Zijlstra
  2015-07-13 17:50         ` Will Deacon
@ 2015-07-13 18:23         ` Paul E. McKenney
  2015-07-13 19:41           ` Peter Hurley
  2015-07-13 22:37         ` Benjamin Herrenschmidt
  2 siblings, 1 reply; 66+ messages in thread
From: Paul E. McKenney @ 2015-07-13 18:23 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Will Deacon, linux-arch, linux-kernel, Benjamin Herrenschmidt

On Mon, Jul 13, 2015 at 05:54:47PM +0200, Peter Zijlstra wrote:
> On Mon, Jul 13, 2015 at 03:21:10PM +0100, Will Deacon wrote:
> > On Mon, Jul 13, 2015 at 03:09:15PM +0100, Will Deacon wrote:
> > > On Mon, Jul 13, 2015 at 02:11:43PM +0100, Peter Zijlstra wrote:
> > > > On Mon, Jul 13, 2015 at 01:15:04PM +0100, Will Deacon wrote:
> > > > > smp_mb__after_unlock_lock is used to promote an UNLOCK + LOCK sequence
> > > > > into a full memory barrier.
> > > > > 
> > > > > However:
> > > > 
> > > > >   - The barrier only applies to UNLOCK + LOCK, not general
> > > > >     RELEASE + ACQUIRE operations
> > > > 
> > > > No it does too; note that on ppc both acquire and release use lwsync and
> > > > two lwsyncs do not make a sync.
> > > 
> > > Really? IIUC, that means smp_mb__after_unlock_lock needs to be a full
> > > barrier on all architectures implementing smp_store_release as smp_mb() +
> > > STORE, otherwise the following isn't ordered:
> > > 
> > >   RELEASE X
> > >   smp_mb__after_unlock_lock()
> > >   ACQUIRE Y
> > > 
> > > On 32-bit ARM (at least), the ACQUIRE can be observed before the RELEASE.
> > 
> > I knew we'd had this conversation before ;)
> > 
> >   http://lkml.kernel.org/r/20150120093443.GA11596@twins.programming.kicks-ass.net
> 
> Ha! yes. And I had indeed forgotten about this argument.
> 
> However I think we should look at the insides of the critical sections;
> for example (from Documentation/memory-barriers.txt):
> 
> "       *A = a;
>         RELEASE M
>         ACQUIRE N
>         *B = b;
> 
> could occur as:
> 
>         ACQUIRE N, STORE *B, STORE *A, RELEASE M"
> 
> This could not in fact happen, even though we could flip M and N, A and
> B will remain strongly ordered.
> 
> That said, I don't think this could even happen on PPC because we have
> load_acquire and store_release, this means that:
> 
> 	*A = a
> 	lwsync
> 	store_release M
> 	load_acquire N
> 	lwsync

Presumably the lwsync instructions are part of the store_release and
load_acquire?

> 	*B = b
> 
> And since the store to M is wrapped inside two lwsync there must be
> strong store order, and because the load from N is equally wrapped in
> two lwsyncs there must also be strong load order.
> 
> In fact, no store/load can cross from before the first lwsync to after
> the latter and the other way around.
> 
> So in that respect it does provide full load-store ordering. What it
> does not provide is order for M and N, nor does it provide transitivity,
> but looking at our documentation I'm not at all sure we guarantee that
> in any case.

I have no idea what the other thread is doing, so I put together the
following litmus test, guessing reverse order, inverse operations,
and full ordering:

	PPC peterz.2015.07.13a
	""
	{
	0:r1=1; 0:r2=a; 0:r3=b; 0:r4=m; 0:r5=n;
	1:r1=1; 1:r2=a; 1:r3=b; 1:r4=m; 1:r5=n;
	}
	 P0            | P1            ;
	 stw r1,0(r2)  | lwz r10,0(r3) ;
	 lwsync        | sync          ;
	 stw r1,0(r4)  | stw r1,0(r5)  ;
	 lwz r10,0(r5) | sync          ;
	 lwsync        | lwz r11,0(r4) ;
	 stw r1,0(r3)  | sync          ;
		       | lwz r12,0(r2) ;
	exists
	(0:r10=0 /\ 1:r10=1 /\ 1:r11=1 /\ 1:r12=1)

See http://lwn.net/Articles/608550/ and http://lwn.net/Articles/470681/
for information on tools that operate on these litmus tests.  (Both
the herd and ppcmem tools agree, as is usually the case.)
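
In terms of the kernel primitives under discussion, my reading of P0 and
P1 above is roughly (a sketch, not part of the litmus file):

	/* P0 */
	WRITE_ONCE(a, 1);
	smp_store_release(&m, 1);	/* lwsync; stw m	*/
	r10 = smp_load_acquire(&n);	/* lwz n; lwsync	*/
	WRITE_ONCE(b, 1);

	/* P1 */
	r10 = READ_ONCE(b);
	smp_mb();			/* sync */
	WRITE_ONCE(n, 1);
	smp_mb();			/* sync */
	r11 = READ_ONCE(m);
	smp_mb();			/* sync */
	r12 = READ_ONCE(a);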

Of the 16 possible combinations of values loaded, the following seven
can happen:

	0:r10=0; 1:r10=0; 1:r11=0; 1:r12=0;
	0:r10=0; 1:r10=0; 1:r11=0; 1:r12=1;
	0:r10=0; 1:r10=0; 1:r11=1; 1:r12=1;
	0:r10=0; 1:r10=1; 1:r11=1; 1:r12=1;
	0:r10=1; 1:r10=0; 1:r11=0; 1:r12=0;
	0:r10=1; 1:r10=0; 1:r11=0; 1:r12=1;
	0:r10=1; 1:r10=0; 1:r11=1; 1:r12=1;

P0's store to "m" and load from "n" can clearly be misordered, as there
is nothing to order them.  And all four possible outcomes for 0:r10 and
1:r11 are seen, as expected.

Given that smp_store_release() is only guaranteed to order against prior
operations and smp_load_acquire() is only guaranteed to order against
subsequent operations, P0's load from "n" can be misordered with its
store to "a", and as expected, all four possible outcomes for 0:r10 and
1:r12 are observed.

P0's pairs of stores should all be ordered:

o	"a" and "m" -> 1:r11=1 and 1:r12=0 cannot happen, as expected.

o	"a" and "b" -> 1:r10=1 and 1:r12=0 cannot happen, as expected.

o	"m" and "b" -> 1:r10=1 and 1:r11=0 cannot happen, as expected.

So smp_load_acquire() orders against all subsequent operations, but not
necessarily against any prior ones, and smp_store_release() orders against
all prior operations but not necessarily against any subsequent ones.
But additional stray orderings are permitted, as is the case here.
Which is in fact what these operations are defined to do.

Does that answer the question, or am I missing the point?

							Thanx, Paul



* Re: [RFC PATCH v2] memory-barriers: remove smp_mb__after_unlock_lock()
  2015-07-13 18:23         ` Paul E. McKenney
@ 2015-07-13 19:41           ` Peter Hurley
  2015-07-13 20:16             ` Paul E. McKenney
  0 siblings, 1 reply; 66+ messages in thread
From: Peter Hurley @ 2015-07-13 19:41 UTC (permalink / raw)
  To: paulmck, Peter Zijlstra, Will Deacon
  Cc: linux-arch, linux-kernel, Benjamin Herrenschmidt

On 07/13/2015 02:23 PM, Paul E. McKenney wrote:
> On Mon, Jul 13, 2015 at 05:54:47PM +0200, Peter Zijlstra wrote:
>> On Mon, Jul 13, 2015 at 03:21:10PM +0100, Will Deacon wrote:
>>> On Mon, Jul 13, 2015 at 03:09:15PM +0100, Will Deacon wrote:
>>>> On Mon, Jul 13, 2015 at 02:11:43PM +0100, Peter Zijlstra wrote:
>>>>> On Mon, Jul 13, 2015 at 01:15:04PM +0100, Will Deacon wrote:
>>>>>> smp_mb__after_unlock_lock is used to promote an UNLOCK + LOCK sequence
>>>>>> into a full memory barrier.
>>>>>>
>>>>>> However:
>>>>>
>>>>>>   - The barrier only applies to UNLOCK + LOCK, not general
>>>>>>     RELEASE + ACQUIRE operations
>>>>>
>>>>> No it does too; note that on ppc both acquire and release use lwsync and
>>>>> two lwsyncs do not make a sync.
>>>>
>>>> Really? IIUC, that means smp_mb__after_unlock_lock needs to be a full
>>>> barrier on all architectures implementing smp_store_release as smp_mb() +
>>>> STORE, otherwise the following isn't ordered:
>>>>
>>>>   RELEASE X
>>>>   smp_mb__after_unlock_lock()
>>>>   ACQUIRE Y
>>>>
>>>> On 32-bit ARM (at least), the ACQUIRE can be observed before the RELEASE.
>>>
>>> I knew we'd had this conversation before ;)
>>>
>>>   http://lkml.kernel.org/r/20150120093443.GA11596@twins.programming.kicks-ass.net
>>
>> Ha! yes. And I had indeed forgotten about this argument.
>>
>> However I think we should look at the insides of the critical sections;
>> for example (from Documentation/memory-barriers.txt):
>>
>> "       *A = a;
>>         RELEASE M
>>         ACQUIRE N
>>         *B = b;
>>
>> could occur as:
>>
>>         ACQUIRE N, STORE *B, STORE *A, RELEASE M"
>>
>> This could not in fact happen, even though we could flip M and N, A and
>> B will remain strongly ordered.
>>
>> That said, I don't think this could even happen on PPC because we have
>> load_acquire and store_release, this means that:
>>
>> 	*A = a
>> 	lwsync
>> 	store_release M
>> 	load_acquire N
>> 	lwsync
> 
> Presumably the lwsync instructions are part of the store_release and
> load_acquire?
> 
>> 	*B = b
>>
>> And since the store to M is wrapped inside two lwsync there must be
>> strong store order, and because the load from N is equally wrapped in
>> two lwsyncs there must also be strong load order.
>>
>> In fact, no store/load can cross from before the first lwsync to after
>> the latter and the other way around.
>>
>> So in that respect it does provide full load-store ordering. What it
>> does not provide is order for M and N, nor does it provide transitivity,
>> but looking at our documentation I'm not at all sure we guarantee that
>> in any case.
> 
> I have no idea what the other thread is doing, so I put together the
> following litmus test, guessing reverse order, inverse operations,
> and full ordering:
> 
> 	PPC peterz.2015.07.13a
> 	""
> 	{
> 	0:r1=1; 0:r2=a; 0:r3=b; 0:r4=m; 0:r5=n;
> 	1:r1=1; 1:r2=a; 1:r3=b; 1:r4=m; 1:r5=n;
> 	}
> 	 P0            | P1            ;
> 	 stw r1,0(r2)  | lwz r10,0(r3) ;
> 	 lwsync        | sync          ;
> 	 stw r1,0(r4)  | stw r1,0(r5)  ;
> 	 lwz r10,0(r5) | sync          ;
> 	 lwsync        | lwz r11,0(r4) ;
> 	 stw r1,0(r3)  | sync          ;
> 		       | lwz r12,0(r2) ;
> 	exists
> 	(0:r10=0 /\ 1:r10=1 /\ 1:r11=1 /\ 1:r12=1)
> 
> See http://lwn.net/Articles/608550/ and http://lwn.net/Articles/470681/
> for information on tools that operate on these litmus tests.  (Both
> the herd and ppcmem tools agree, as is usually the case.)
> 
> Of the 16 possible combinations of values loaded, the following seven
> can happen:
> 
> 	0:r10=0; 1:r10=0; 1:r11=0; 1:r12=0;
> 	0:r10=0; 1:r10=0; 1:r11=0; 1:r12=1;
> 	0:r10=0; 1:r10=0; 1:r11=1; 1:r12=1;
> 	0:r10=0; 1:r10=1; 1:r11=1; 1:r12=1;
> 	0:r10=1; 1:r10=0; 1:r11=0; 1:r12=0;
> 	0:r10=1; 1:r10=0; 1:r11=0; 1:r12=1;
> 	0:r10=1; 1:r10=0; 1:r11=1; 1:r12=1;
> 
> P0's store to "m" and load from "n" can clearly be misordered, as there
> is nothing to order them.  And all four possible outcomes for 0:r10 and
> 1:r11 are seen, as expected.
> 
> Given that smp_store_release() is only guaranteed to order against prior
> operations and smp_load_acquire() is only guaranteed to order against
> subsequent operations, P0's load from "n" can be misordered with its
> store to "a", and as expected, all four possible outcomes for 0:r10 and
> 1:r12 are observed.
> 
> P0's pairs of stores should all be ordered:
> 
> o	"a" and "m" -> 1:r11=1 and 1:r12=0 cannot happen, as expected.
> 
> o	"a" and "b" -> 1:r10=1 and 1:r12=0 cannot happen, as expected.
> 
> o	"m" and "b" -> 1:r10=1 and 1:r11=0 cannot happen, as expected.
> 
> So smp_load_acquire() orders against all subsequent operations, but not
> necessarily against any prior ones, and smp_store_release() orders against
> all prior operations but not necessarily against any subsequent ones.
> But additional stray orderings are permitted, as is the case here.
> Which is in fact what these operations are defined to do.
> 
> Does that answer the question, or am I missing the point?

Yes, it shows that smp_mb__after_unlock_lock() has no purpose, since it
is defined only for PowerPC and your test above just showed that for
the sequence

  store a
  UNLOCK M
  LOCK N
  store b

a and b are always observed as an ordered pair {a,b}.

Additionally, the assertion in Documentation/memory-barriers.txt that
the sequence above can be reordered as

  LOCK N
  store b
  store a
  UNLOCK M

is not true on any existing arch in Linux.

Regards,
Peter Hurley


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC PATCH v2] memory-barriers: remove smp_mb__after_unlock_lock()
  2015-07-13 19:41           ` Peter Hurley
@ 2015-07-13 20:16             ` Paul E. McKenney
  2015-07-13 22:15               ` Peter Zijlstra
  0 siblings, 1 reply; 66+ messages in thread
From: Paul E. McKenney @ 2015-07-13 20:16 UTC (permalink / raw)
  To: Peter Hurley
  Cc: Peter Zijlstra, Will Deacon, linux-arch, linux-kernel,
	Benjamin Herrenschmidt

On Mon, Jul 13, 2015 at 03:41:53PM -0400, Peter Hurley wrote:
> On 07/13/2015 02:23 PM, Paul E. McKenney wrote:
> > On Mon, Jul 13, 2015 at 05:54:47PM +0200, Peter Zijlstra wrote:
> >> On Mon, Jul 13, 2015 at 03:21:10PM +0100, Will Deacon wrote:
> >>> On Mon, Jul 13, 2015 at 03:09:15PM +0100, Will Deacon wrote:
> >>>> On Mon, Jul 13, 2015 at 02:11:43PM +0100, Peter Zijlstra wrote:
> >>>>> On Mon, Jul 13, 2015 at 01:15:04PM +0100, Will Deacon wrote:
> >>>>>> smp_mb__after_unlock_lock is used to promote an UNLOCK + LOCK sequence
> >>>>>> into a full memory barrier.
> >>>>>>
> >>>>>> However:
> >>>>>
> >>>>>>   - The barrier only applies to UNLOCK + LOCK, not general
> >>>>>>     RELEASE + ACQUIRE operations
> >>>>>
> >>>>> No it does too; note that on ppc both acquire and release use lwsync and
> >>>>> two lwsyncs do not make a sync.
> >>>>
> >>>> Really? IIUC, that means smp_mb__after_unlock_lock needs to be a full
> >>>> barrier on all architectures implementing smp_store_release as smp_mb() +
> >>>> STORE, otherwise the following isn't ordered:
> >>>>
> >>>>   RELEASE X
> >>>>   smp_mb__after_unlock_lock()
> >>>>   ACQUIRE Y
> >>>>
> >>>> On 32-bit ARM (at least), the ACQUIRE can be observed before the RELEASE.
> >>>
> >>> I knew we'd had this conversation before ;)
> >>>
> >>>   http://lkml.kernel.org/r/20150120093443.GA11596@twins.programming.kicks-ass.net
> >>
> >> Ha! yes. And I had indeed forgotten about this argument.
> >>
> >> However I think we should look at the insides of the critical sections;
> >> for example (from Documentation/memory-barriers.txt):
> >>
> >> "       *A = a;
> >>         RELEASE M
> >>         ACQUIRE N
> >>         *B = b;
> >>
> >> could occur as:
> >>
> >>         ACQUIRE N, STORE *B, STORE *A, RELEASE M"
> >>
> >> This could not in fact happen, even though we could flip M and N, A and
> >> B will remain strongly ordered.
> >>
> >> That said, I don't think this could even happen on PPC because we have
> >> load_acquire and store_release, this means that:
> >>
> >> 	*A = a
> >> 	lwsync
> >> 	store_release M
> >> 	load_acquire N
> >> 	lwsync
> > 
> > Presumably the lwsync instructions are part of the store_release and
> > load_acquire?
> > 
> >> 	*B = b
> >>
> >> And since the store to M is wrapped inside two lwsync there must be
> >> strong store order, and because the load from N is equally wrapped in
> >> two lwsyncs there must also be strong load order.
> >>
> >> In fact, no store/load can cross from before the first lwsync to after
> >> the latter and the other way around.
> >>
> >> So in that respect it does provide full load-store ordering. What it
> >> does not provide is order for M and N, nor does it provide transitivity,
> >> but looking at our documentation I'm not at all sure we guarantee that
> >> in any case.
> > 
> > I have no idea what the other thread is doing, so I put together the
> > following litmus test, guessing reverse order, inverse operations,
> > and full ordering:
> > 
> > 	PPC peterz.2015.07.13a
> > 	""
> > 	{
> > 	0:r1=1; 0:r2=a; 0:r3=b; 0:r4=m; 0:r5=n;
> > 	1:r1=1; 1:r2=a; 1:r3=b; 1:r4=m; 1:r5=n;
> > 	}
> > 	 P0            | P1            ;
> > 	 stw r1,0(r2)  | lwz r10,0(r3) ;
> > 	 lwsync        | sync          ;
> > 	 stw r1,0(r4)  | stw r1,0(r5)  ;
> > 	 lwz r10,0(r5) | sync          ;
> > 	 lwsync        | lwz r11,0(r4) ;
> > 	 stw r1,0(r3)  | sync          ;
> > 		       | lwz r12,0(r2) ;
> > 	exists
> > 	(0:r10=0 /\ 1:r10=1 /\ 1:r11=1 /\ 1:r12=1)
> > 
> > See http://lwn.net/Articles/608550/ and http://lwn.net/Articles/470681/
> > for information on tools that operate on these litmus tests.  (Both
> > the herd and ppcmem tools agree, as is usually the case.)
> > 
> > Of the 16 possible combinations of values loaded, the following seven
> > can happen:
> > 
> > 	0:r10=0; 1:r10=0; 1:r11=0; 1:r12=0;
> > 	0:r10=0; 1:r10=0; 1:r11=0; 1:r12=1;
> > 	0:r10=0; 1:r10=0; 1:r11=1; 1:r12=1;
> > 	0:r10=0; 1:r10=1; 1:r11=1; 1:r12=1;
> > 	0:r10=1; 1:r10=0; 1:r11=0; 1:r12=0;
> > 	0:r10=1; 1:r10=0; 1:r11=0; 1:r12=1;
> > 	0:r10=1; 1:r10=0; 1:r11=1; 1:r12=1;
> > 
> > P0's store to "m" and load from "n" can clearly be misordered, as there
> > is nothing to order them.  And all four possible outcomes for 0:r10 and
> > 1:r11 are seen, as expected.
> > 
> > Given that smp_store_release() is only guaranteed to order against prior
> > operations and smp_load_acquire() is only guaranteed to order against
> > subsequent operations, P0's load from "n" can be misordered with its
> > store to "a", and as expected, all four possible outcomes for 0:r10 and
> > 1:r12 are observed.
> > 
> > P0's pairs of stores should all be ordered:
> > 
> > o	"a" and "m" -> 1:r11=1 and 1:r12=0 cannot happen, as expected.
> > 
> > o	"a" and "b" -> 1:r10=1 and 1:r12=0 cannot happen, as expected.
> > 
> > o	"m" and "b" -> 1:r10=1 and 1:r11=0 cannot happen, as expected.
> > 
> > So smp_load_acquire() orders against all subsequent operations, but not
> > necessarily against any prior ones, and smp_store_release() orders against
> > all prior operations but not necessarily against any subsequent ones.
> > But additional stray orderings are permitted, as is the case here.
> > Which is in fact what these operations are defined to do.
> > 
> > Does that answer the question, or am I missing the point?
> 
> Yes, it shows that smp_mb__after_unlock_lock() has no purpose, since it
> is defined only for PowerPC and your test above just showed that for
> the sequence
> 
>   store a
>   UNLOCK M
>   LOCK N
>   store b
> 
> a and b are always observed as an ordered pair {a,b}.

Not quite.

This is instead the sequence that is of concern:

	store a
	unlock M
	lock N
	load b
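
Rendered as kernel code, and purely as an illustrative sketch (the lock
and variable names here are invented for the example), that sequence is:

	DEFINE_SPINLOCK(M);
	DEFINE_SPINLOCK(N);
	int a, b;

	void sequence_of_concern(void)
	{
		int r;

		spin_lock(&M);
		WRITE_ONCE(a, 1);		/* store a  */
		spin_unlock(&M);		/* unlock M */
		spin_lock(&N);			/* lock N   */
		/*
		 * smp_mb__after_unlock_lock() would go here to keep the
		 * store above from being reordered with the load below.
		 */
		r = READ_ONCE(b);		/* load b   */
		spin_unlock(&N);
	}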

> Additionally, the assertion in Documentation/memory_barriers.txt that
> the sequence above can be reordered as
> 
>   LOCK N
>   store b
>   store a
>   UNLOCK M
> 
> is not true on any existing arch in Linux.

It was at one time and might be again.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC PATCH v2] memory-barriers: remove smp_mb__after_unlock_lock()
  2015-07-13 17:50         ` Will Deacon
@ 2015-07-13 20:20           ` Paul E. McKenney
  2015-07-13 22:23             ` Peter Zijlstra
  0 siblings, 1 reply; 66+ messages in thread
From: Paul E. McKenney @ 2015-07-13 20:20 UTC (permalink / raw)
  To: Will Deacon
  Cc: Peter Zijlstra, linux-arch, linux-kernel, Benjamin Herrenschmidt

On Mon, Jul 13, 2015 at 06:50:29PM +0100, Will Deacon wrote:
> On Mon, Jul 13, 2015 at 04:54:47PM +0100, Peter Zijlstra wrote:
> > However I think we should look at the insides of the critical sections;
> > for example (from Documentation/memory-barriers.txt):
> > 
> > "       *A = a;
> >         RELEASE M
> >         ACQUIRE N
> >         *B = b;
> > 
> > could occur as:
> > 
> >         ACQUIRE N, STORE *B, STORE *A, RELEASE M"
> > 
> > This could not in fact happen, even though we could flip M and N, A and
> > B will remain strongly ordered.
> > 
> > That said, I don't think this could even happen on PPC because we have
> > load_acquire and store_release, this means that:
> > 
> > 	*A = a
> > 	lwsync
> > 	store_release M
> > 	load_acquire N
> > 	lwsync
> > 	*B = b
> > 
> > And since the store to M is wrapped inside two lwsync there must be
> > strong store order, and because the load from N is equally wrapped in
> > two lwsyncs there must also be strong load order.
> > 
> > In fact, no store/load can cross from before the first lwsync to after
> > the latter and the other way around.
> > 
> > So in that respect it does provide full load-store ordering. What it
> > does not provide is order for M and N, nor does it provide transitivity,
> > but looking at our documentation I'm not at all sure we guarantee that
> > in any case.
> 
> So if I'm following along, smp_mb__after_unlock_lock *does* provide
> transitivity when used with UNLOCK + LOCK, which is stronger than your
> example here.

Yes, that is indeed the intent.

> I don't think we want to make the same guarantee for general RELEASE +
> ACQUIRE, because we'd end up forcing most architectures to implement the
> expensive macro for a case that currently has no users.

Agreed, smp_mb__after_unlock_lock() makes a limited guarantee.

> In which case, it boils down to the question of how expensive it would
> be to implement an SC UNLOCK operation on PowerPC and whether that justifies
> the existence of a complicated barrier macro that isn't used outside of
> RCU.

Given that it is either smp_mb() or nothing, I am not seeing the
"complicated" part...

							Thanx, Paul


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC PATCH v2] memory-barriers: remove smp_mb__after_unlock_lock()
  2015-07-13 20:16             ` Paul E. McKenney
@ 2015-07-13 22:15               ` Peter Zijlstra
  2015-07-13 22:43                 ` Benjamin Herrenschmidt
  2015-07-13 22:53                 ` Paul E. McKenney
  0 siblings, 2 replies; 66+ messages in thread
From: Peter Zijlstra @ 2015-07-13 22:15 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Peter Hurley, Will Deacon, linux-arch, linux-kernel,
	Benjamin Herrenschmidt

On Mon, Jul 13, 2015 at 01:16:42PM -0700, Paul E. McKenney wrote:
> On Mon, Jul 13, 2015 at 03:41:53PM -0400, Peter Hurley wrote:
> > > Does that answer the question, or am I missing the point?
> > 
> > Yes, it shows that smp_mb__after_unlock_lock() has no purpose, since it
> > is defined only for PowerPC and your test above just showed that for
> > the sequence

The only purpose is to provide transitivity, but the documentation fails
to explicitly call that out.

> > 
> >   store a
> >   UNLOCK M
> >   LOCK N
> >   store b
> > 
> > a and b are always observed as an ordered pair {a,b}.
> 
> Not quite.
> 
> This is instead the sequence that is of concern:
> 
> 	store a
> 	unlock M
> 	lock N
> 	load b

So it's late and that table didn't parse, but that should be ordered too.
The load of b should not be able to escape the lock N.

If only because LWSYNC is a valid RMB and any LOCK implementation must
load the lock state to observe it unlocked.
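
That load is inherent in even a toy lock; as a minimal sketch (not any
real architecture's implementation, just an illustration):

	struct toy_lock { int val; };		/* 0 = unlocked, 1 = locked */

	static void toy_lock_acquire(struct toy_lock *l)
	{
		/*
		 * cmpxchg() must read l->val and observe it unlocked (0)
		 * before it can install the locked value (1).
		 */
		while (cmpxchg(&l->val, 0, 1) != 0)
			cpu_relax();
	}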

> > Additionally, the assertion in Documentation/memory_barriers.txt that
> > the sequence above can be reordered as
> > 
> >   LOCK N
> >   store b
> >   store a
> >   UNLOCK M
> > 
> > is not true on any existing arch in Linux.
> 
> It was at one time and might be again.

What would be required to make this true? I'm having a hard time seeing
how things can get reordered like that.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC PATCH v2] memory-barriers: remove smp_mb__after_unlock_lock()
  2015-07-13 20:20           ` Paul E. McKenney
@ 2015-07-13 22:23             ` Peter Zijlstra
  2015-07-13 23:04               ` Paul E. McKenney
  0 siblings, 1 reply; 66+ messages in thread
From: Peter Zijlstra @ 2015-07-13 22:23 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Will Deacon, linux-arch, linux-kernel, Benjamin Herrenschmidt

On Mon, Jul 13, 2015 at 01:20:32PM -0700, Paul E. McKenney wrote:
> On Mon, Jul 13, 2015 at 06:50:29PM +0100, Will Deacon wrote:

> > So if I'm following along, smp_mb__after_unlock_lock *does* provide
> > transitivity when used with UNLOCK + LOCK, which is stronger than your
> > example here.
> 
> Yes, that is indeed the intent.

Maybe good to state this explicitly somewhere.

> > I don't think we want to make the same guarantee for general RELEASE +
> > ACQUIRE, because we'd end up forcing most architectures to implement the
> > expensive macro for a case that currently has no users.
> 
> Agreed, smp_mb__after_unlock_lock() makes a limited guarantee.

I'm still not seeing how the archs that implement load_acquire and
store_release with smp_mb() are a problem.

If we look at the inside of the critical section again -- similar
argument as before:

	*A = a
	smp_mb()
	store M
	load N
	smp_mb()
	*B = b

A and B are fully ordered, and in this case even transitivity is
provided.
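
That expansion is essentially what the generic smp_mb()-based fallbacks
produce; a simplified sketch (omitting the type-checking the real
asm-generic macros carry):

	#define smp_store_release(p, v)				\
	do {							\
		smp_mb();					\
		WRITE_ONCE(*(p), (v));				\
	} while (0)

	#define smp_load_acquire(p)				\
	({							\
		typeof(*(p)) ___v = READ_ONCE(*(p));		\
		smp_mb();					\
		___v;						\
	})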

I'm stating that the order of M and N don't matter, only the
load/stores that are inside the acquire/release are constrained.

IOW, I think smp_mb__after_unlock_lock() already works as advertised
with all our acquire/release thingies -- as is stated by the
documentation.

That said, I'm not aware of anybody but RCU actually using this, so it's
not used in that capacity.

> > In which case, it boils down to the question of how expensive it would
> > be to implement an SC UNLOCK operation on PowerPC and whether that justifies
> > the existence of a complicated barrier macro that isn't used outside of
> > RCU.
> 
> Given that it is either smp_mb() or nothing, I am not seeing the
> "complicated" part...

The 'complicated' part is that we need to think about it; that is, we need
to realize and remember that UNLOCK+LOCK is a load-store barrier but
fails to provide transitivity.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC PATCH v2] memory-barriers: remove smp_mb__after_unlock_lock()
  2015-07-13 12:15 [RFC PATCH v2] memory-barriers: remove smp_mb__after_unlock_lock() Will Deacon
  2015-07-13 13:09 ` Peter Hurley
  2015-07-13 13:11 ` Peter Zijlstra
@ 2015-07-13 22:31 ` Benjamin Herrenschmidt
  2015-07-14 10:16   ` Will Deacon
  2015-07-15  3:06   ` Michael Ellerman
  2 siblings, 2 replies; 66+ messages in thread
From: Benjamin Herrenschmidt @ 2015-07-13 22:31 UTC (permalink / raw)
  To: Will Deacon
  Cc: linux-arch, linux-kernel, Paul McKenney, Peter Zijlstra,
	Michael Ellerman

On Mon, 2015-07-13 at 13:15 +0100, Will Deacon wrote:
> smp_mb__after_unlock_lock is used to promote an UNLOCK + LOCK sequence
> into a full memory barrier.
> 
> However:
> 
>   - This ordering guarantee is already provided without the barrier on
>     all architectures apart from PowerPC
> 
>   - The barrier only applies to UNLOCK + LOCK, not general
>     RELEASE + ACQUIRE operations
> 
>   - Locks are generally assumed to offer SC ordering semantics, so
>     having this additional barrier is error-prone and complicates the
>     callers of LOCK/UNLOCK primitives
> 
>   - The barrier is not well used outside of RCU and, because it was
>     retrofitted into the kernel, it's not clear whether other areas of
>     the kernel are incorrectly relying on UNLOCK + LOCK implying a full
>     barrier
> 
> This patch removes the barrier and instead requires architectures to
> provide full barrier semantics for an UNLOCK + LOCK sequence.
> 
> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Will Deacon <will.deacon@arm.com>
> ---
> 
> This didn't go anywhere last time I posted it, but here it is again.
> I'd really appreciate some feedback from the PowerPC guys, especially as
> to whether this change requires them to add an additional barrier in
> arch_spin_unlock and what the cost of that would be.

We'd have to turn the lwsync in unlock or the isync in lock into a full
barrier. As it is, we *almost* have a full barrier semantic, but not
quite, as in things can get mixed up inside spin_lock between the LL and
the SC (things leaking in past LL and things leaking "out" up before SC
and then getting mixed up in there).
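
As a rough sketch only (the real arch_spin_unlock() uses the
PPC_RELEASE_BARRIER macro and carries additional I/O-ordering details),
the "full barrier in unlock" option would amount to:

	static inline void arch_spin_unlock(arch_spinlock_t *lock)
	{
		smp_mb();	/* sync instead of today's lwsync */
		lock->slock = 0;
	}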

Michael, at some point you were experimenting a bit with that and tried
to get some perf numbers of the impact that would have; did that
solidify? Otherwise, I'll have a look when I'm back next week.

Ben.

>  Documentation/memory-barriers.txt   | 77 ++-----------------------------------
>  arch/powerpc/include/asm/spinlock.h |  2 -
>  include/linux/spinlock.h            | 10 -----
>  kernel/locking/mcs_spinlock.h       |  9 -----
>  kernel/rcu/tree.c                   | 21 +---------
>  kernel/rcu/tree_plugin.h            | 11 ------
>  6 files changed, 4 insertions(+), 126 deletions(-)
> 
> diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
> index 13feb697271f..fff21b632893 100644
> --- a/Documentation/memory-barriers.txt
> +++ b/Documentation/memory-barriers.txt
> @@ -1848,74 +1848,9 @@ RELEASE are to the same lock variable, but only from the perspective of
>  another CPU not holding that lock.  In short, a ACQUIRE followed by an
>  RELEASE may -not- be assumed to be a full memory barrier.
>  
> -Similarly, the reverse case of a RELEASE followed by an ACQUIRE does not
> -imply a full memory barrier.  If it is necessary for a RELEASE-ACQUIRE
> -pair to produce a full barrier, the ACQUIRE can be followed by an
> -smp_mb__after_unlock_lock() invocation.  This will produce a full barrier
> -if either (a) the RELEASE and the ACQUIRE are executed by the same
> -CPU or task, or (b) the RELEASE and ACQUIRE act on the same variable.
> -The smp_mb__after_unlock_lock() primitive is free on many architectures.
> -Without smp_mb__after_unlock_lock(), the CPU's execution of the critical
> -sections corresponding to the RELEASE and the ACQUIRE can cross, so that:
> -
> -	*A = a;
> -	RELEASE M
> -	ACQUIRE N
> -	*B = b;
> -
> -could occur as:
> -
> -	ACQUIRE N, STORE *B, STORE *A, RELEASE M
> -
> -It might appear that this reordering could introduce a deadlock.
> -However, this cannot happen because if such a deadlock threatened,
> -the RELEASE would simply complete, thereby avoiding the deadlock.
> -
> -	Why does this work?
> -
> -	One key point is that we are only talking about the CPU doing
> -	the reordering, not the compiler.  If the compiler (or, for
> -	that matter, the developer) switched the operations, deadlock
> -	-could- occur.
> -
> -	But suppose the CPU reordered the operations.  In this case,
> -	the unlock precedes the lock in the assembly code.  The CPU
> -	simply elected to try executing the later lock operation first.
> -	If there is a deadlock, this lock operation will simply spin (or
> -	try to sleep, but more on that later).	The CPU will eventually
> -	execute the unlock operation (which preceded the lock operation
> -	in the assembly code), which will unravel the potential deadlock,
> -	allowing the lock operation to succeed.
> -
> -	But what if the lock is a sleeplock?  In that case, the code will
> -	try to enter the scheduler, where it will eventually encounter
> -	a memory barrier, which will force the earlier unlock operation
> -	to complete, again unraveling the deadlock.  There might be
> -	a sleep-unlock race, but the locking primitive needs to resolve
> -	such races properly in any case.
> -
> -With smp_mb__after_unlock_lock(), the two critical sections cannot overlap.
> -For example, with the following code, the store to *A will always be
> -seen by other CPUs before the store to *B:
> -
> -	*A = a;
> -	RELEASE M
> -	ACQUIRE N
> -	smp_mb__after_unlock_lock();
> -	*B = b;
> -
> -The operations will always occur in one of the following orders:
> -
> -	STORE *A, RELEASE, ACQUIRE, smp_mb__after_unlock_lock(), STORE *B
> -	STORE *A, ACQUIRE, RELEASE, smp_mb__after_unlock_lock(), STORE *B
> -	ACQUIRE, STORE *A, RELEASE, smp_mb__after_unlock_lock(), STORE *B
> -
> -If the RELEASE and ACQUIRE were instead both operating on the same lock
> -variable, only the first of these alternatives can occur.  In addition,
> -the more strongly ordered systems may rule out some of the above orders.
> -But in any case, as noted earlier, the smp_mb__after_unlock_lock()
> -ensures that the store to *A will always be seen as happening before
> -the store to *B.
> +However, the reverse case of a RELEASE followed by an ACQUIRE _does_
> +imply a full memory barrier when these accesses are performed as a pair
> +of UNLOCK and LOCK operations, irrespective of the lock variable.
>  
>  Locks and semaphores may not provide any guarantee of ordering on UP compiled
>  systems, and so cannot be counted on in such a situation to actually achieve
> @@ -2158,7 +2093,6 @@ However, if the following occurs:
>  	RELEASE M	     [1]
>  	ACCESS_ONCE(*D) = d;		ACCESS_ONCE(*E) = e;
>  					ACQUIRE M		     [2]
> -					smp_mb__after_unlock_lock();
>  					ACCESS_ONCE(*F) = f;
>  					ACCESS_ONCE(*G) = g;
>  					RELEASE M	     [2]
> @@ -2176,11 +2110,6 @@ But assuming CPU 1 gets the lock first, CPU 3 won't see any of:
>  	*F, *G or *H preceding ACQUIRE M [2]
>  	*A, *B, *C, *E, *F or *G following RELEASE M [2]
>  
> -Note that the smp_mb__after_unlock_lock() is critically important
> -here: Without it CPU 3 might see some of the above orderings.
> -Without smp_mb__after_unlock_lock(), the accesses are not guaranteed
> -to be seen in order unless CPU 3 holds lock M.
> -
>  
>  ACQUIRES VS I/O ACCESSES
>  ------------------------
> diff --git a/arch/powerpc/include/asm/spinlock.h b/arch/powerpc/include/asm/spinlock.h
> index 4dbe072eecbe..523673d7583c 100644
> --- a/arch/powerpc/include/asm/spinlock.h
> +++ b/arch/powerpc/include/asm/spinlock.h
> @@ -28,8 +28,6 @@
>  #include <asm/synch.h>
>  #include <asm/ppc-opcode.h>
>  
> -#define smp_mb__after_unlock_lock()	smp_mb()  /* Full ordering for lock. */
> -
>  #ifdef CONFIG_PPC64
>  /* use 0x800000yy when locked, where yy == CPU number */
>  #ifdef __BIG_ENDIAN__
> diff --git a/include/linux/spinlock.h b/include/linux/spinlock.h
> index 0063b24b4f36..16c5ed5a627c 100644
> --- a/include/linux/spinlock.h
> +++ b/include/linux/spinlock.h
> @@ -130,16 +130,6 @@ do {								\
>  #define smp_mb__before_spinlock()	smp_wmb()
>  #endif
>  
> -/*
> - * Place this after a lock-acquisition primitive to guarantee that
> - * an UNLOCK+LOCK pair act as a full barrier.  This guarantee applies
> - * if the UNLOCK and LOCK are executed by the same CPU or if the
> - * UNLOCK and LOCK operate on the same lock variable.
> - */
> -#ifndef smp_mb__after_unlock_lock
> -#define smp_mb__after_unlock_lock()	do { } while (0)
> -#endif
> -
>  /**
>   * raw_spin_unlock_wait - wait until the spinlock gets unlocked
>   * @lock: the spinlock in question.
> diff --git a/kernel/locking/mcs_spinlock.h b/kernel/locking/mcs_spinlock.h
> index fd91aaa4554c..2ea2fae2b477 100644
> --- a/kernel/locking/mcs_spinlock.h
> +++ b/kernel/locking/mcs_spinlock.h
> @@ -43,15 +43,6 @@ do {									\
>  #endif
>  
>  /*
> - * Note: the smp_load_acquire/smp_store_release pair is not
> - * sufficient to form a full memory barrier across
> - * cpus for many architectures (except x86) for mcs_unlock and mcs_lock.
> - * For applications that need a full barrier across multiple cpus
> - * with mcs_unlock and mcs_lock pair, smp_mb__after_unlock_lock() should be
> - * used after mcs_lock.
> - */
> -
> -/*
>   * In order to acquire the lock, the caller should declare a local node and
>   * pass a reference of the node to this function in addition to the lock.
>   * If the lock has already been acquired, then this will proceed to spin
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 65137bc28b2b..6689fc0808c8 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -1513,10 +1513,8 @@ rcu_start_future_gp(struct rcu_node *rnp, struct rcu_data *rdp,
>  	 * hold it, acquire the root rcu_node structure's lock in order to
>  	 * start one (if needed).
>  	 */
> -	if (rnp != rnp_root) {
> +	if (rnp != rnp_root)
>  		raw_spin_lock(&rnp_root->lock);
> -		smp_mb__after_unlock_lock();
> -	}
>  
>  	/*
>  	 * Get a new grace-period number.  If there really is no grace
> @@ -1769,7 +1767,6 @@ static void note_gp_changes(struct rcu_state *rsp, struct rcu_data *rdp)
>  		local_irq_restore(flags);
>  		return;
>  	}
> -	smp_mb__after_unlock_lock();
>  	needwake = __note_gp_changes(rsp, rnp, rdp);
>  	raw_spin_unlock_irqrestore(&rnp->lock, flags);
>  	if (needwake)
> @@ -1794,7 +1791,6 @@ static int rcu_gp_init(struct rcu_state *rsp)
>  
>  	WRITE_ONCE(rsp->gp_activity, jiffies);
>  	raw_spin_lock_irq(&rnp->lock);
> -	smp_mb__after_unlock_lock();
>  	if (!READ_ONCE(rsp->gp_flags)) {
>  		/* Spurious wakeup, tell caller to go back to sleep.  */
>  		raw_spin_unlock_irq(&rnp->lock);
> @@ -1827,7 +1823,6 @@ static int rcu_gp_init(struct rcu_state *rsp)
>  	rcu_for_each_leaf_node(rsp, rnp) {
>  		rcu_gp_slow(rsp, gp_preinit_delay);
>  		raw_spin_lock_irq(&rnp->lock);
> -		smp_mb__after_unlock_lock();
>  		if (rnp->qsmaskinit == rnp->qsmaskinitnext &&
>  		    !rnp->wait_blkd_tasks) {
>  			/* Nothing to do on this leaf rcu_node structure. */
> @@ -1884,7 +1879,6 @@ static int rcu_gp_init(struct rcu_state *rsp)
>  	rcu_for_each_node_breadth_first(rsp, rnp) {
>  		rcu_gp_slow(rsp, gp_init_delay);
>  		raw_spin_lock_irq(&rnp->lock);
> -		smp_mb__after_unlock_lock();
>  		rdp = this_cpu_ptr(rsp->rda);
>  		rcu_preempt_check_blocked_tasks(rnp);
>  		rnp->qsmask = rnp->qsmaskinit;
> @@ -1935,7 +1929,6 @@ static int rcu_gp_fqs(struct rcu_state *rsp, int fqs_state_in)
>  	/* Clear flag to prevent immediate re-entry. */
>  	if (READ_ONCE(rsp->gp_flags) & RCU_GP_FLAG_FQS) {
>  		raw_spin_lock_irq(&rnp->lock);
> -		smp_mb__after_unlock_lock();
>  		WRITE_ONCE(rsp->gp_flags,
>  			   READ_ONCE(rsp->gp_flags) & ~RCU_GP_FLAG_FQS);
>  		raw_spin_unlock_irq(&rnp->lock);
> @@ -1956,7 +1949,6 @@ static void rcu_gp_cleanup(struct rcu_state *rsp)
>  
>  	WRITE_ONCE(rsp->gp_activity, jiffies);
>  	raw_spin_lock_irq(&rnp->lock);
> -	smp_mb__after_unlock_lock();
>  	gp_duration = jiffies - rsp->gp_start;
>  	if (gp_duration > rsp->gp_max)
>  		rsp->gp_max = gp_duration;
> @@ -1982,7 +1974,6 @@ static void rcu_gp_cleanup(struct rcu_state *rsp)
>  	 */
>  	rcu_for_each_node_breadth_first(rsp, rnp) {
>  		raw_spin_lock_irq(&rnp->lock);
> -		smp_mb__after_unlock_lock();
>  		WARN_ON_ONCE(rcu_preempt_blocked_readers_cgp(rnp));
>  		WARN_ON_ONCE(rnp->qsmask);
>  		WRITE_ONCE(rnp->completed, rsp->gpnum);
> @@ -1998,7 +1989,6 @@ static void rcu_gp_cleanup(struct rcu_state *rsp)
>  	}
>  	rnp = rcu_get_root(rsp);
>  	raw_spin_lock_irq(&rnp->lock);
> -	smp_mb__after_unlock_lock(); /* Order GP before ->completed update. */
>  	rcu_nocb_gp_set(rnp, nocb);
>  
>  	/* Declare grace period done. */
> @@ -2246,7 +2236,6 @@ rcu_report_qs_rnp(unsigned long mask, struct rcu_state *rsp,
>  		rnp_c = rnp;
>  		rnp = rnp->parent;
>  		raw_spin_lock_irqsave(&rnp->lock, flags);
> -		smp_mb__after_unlock_lock();
>  		oldmask = rnp_c->qsmask;
>  	}
>  
> @@ -2294,7 +2283,6 @@ static void rcu_report_unblock_qs_rnp(struct rcu_state *rsp,
>  	mask = rnp->grpmask;
>  	raw_spin_unlock(&rnp->lock);	/* irqs remain disabled. */
>  	raw_spin_lock(&rnp_p->lock);	/* irqs already disabled. */
> -	smp_mb__after_unlock_lock();
>  	rcu_report_qs_rnp(mask, rsp, rnp_p, gps, flags);
>  }
>  
> @@ -2317,7 +2305,6 @@ rcu_report_qs_rdp(int cpu, struct rcu_state *rsp, struct rcu_data *rdp)
>  
>  	rnp = rdp->mynode;
>  	raw_spin_lock_irqsave(&rnp->lock, flags);
> -	smp_mb__after_unlock_lock();
>  	if ((rdp->passed_quiesce == 0 &&
>  	     rdp->rcu_qs_ctr_snap == __this_cpu_read(rcu_qs_ctr)) ||
>  	    rdp->gpnum != rnp->gpnum || rnp->completed == rnp->gpnum ||
> @@ -2544,7 +2531,6 @@ static void rcu_cleanup_dead_rnp(struct rcu_node *rnp_leaf)
>  		if (!rnp)
>  			break;
>  		raw_spin_lock(&rnp->lock); /* irqs already disabled. */
> -		smp_mb__after_unlock_lock(); /* GP memory ordering. */
>  		rnp->qsmaskinit &= ~mask;
>  		rnp->qsmask &= ~mask;
>  		if (rnp->qsmaskinit) {
> @@ -2573,7 +2559,6 @@ static void rcu_cleanup_dying_idle_cpu(int cpu, struct rcu_state *rsp)
>  	/* Remove outgoing CPU from mask in the leaf rcu_node structure. */
>  	mask = rdp->grpmask;
>  	raw_spin_lock_irqsave(&rnp->lock, flags);
> -	smp_mb__after_unlock_lock();	/* Enforce GP memory-order guarantee. */
>  	rnp->qsmaskinitnext &= ~mask;
>  	raw_spin_unlock_irqrestore(&rnp->lock, flags);
>  }
> @@ -2771,7 +2756,6 @@ static void force_qs_rnp(struct rcu_state *rsp,
>  		cond_resched_rcu_qs();
>  		mask = 0;
>  		raw_spin_lock_irqsave(&rnp->lock, flags);
> -		smp_mb__after_unlock_lock();
>  		if (rnp->qsmask == 0) {
>  			if (rcu_state_p == &rcu_sched_state ||
>  			    rsp != rcu_state_p ||
> @@ -2843,7 +2827,6 @@ static void force_quiescent_state(struct rcu_state *rsp)
>  
>  	/* Reached the root of the rcu_node tree, acquire lock. */
>  	raw_spin_lock_irqsave(&rnp_old->lock, flags);
> -	smp_mb__after_unlock_lock();
>  	raw_spin_unlock(&rnp_old->fqslock);
>  	if (READ_ONCE(rsp->gp_flags) & RCU_GP_FLAG_FQS) {
>  		rsp->n_force_qs_lh++;
> @@ -2967,7 +2950,6 @@ static void __call_rcu_core(struct rcu_state *rsp, struct rcu_data *rdp,
>  			struct rcu_node *rnp_root = rcu_get_root(rsp);
>  
>  			raw_spin_lock(&rnp_root->lock);
> -			smp_mb__after_unlock_lock();
>  			needwake = rcu_start_gp(rsp);
>  			raw_spin_unlock(&rnp_root->lock);
>  			if (needwake)
> @@ -3810,7 +3792,6 @@ rcu_init_percpu_data(int cpu, struct rcu_state *rsp)
>  	rnp = rdp->mynode;
>  	mask = rdp->grpmask;
>  	raw_spin_lock(&rnp->lock);		/* irqs already disabled. */
> -	smp_mb__after_unlock_lock();
>  	rnp->qsmaskinitnext |= mask;
>  	rdp->gpnum = rnp->completed; /* Make CPU later note any new GP. */
>  	rdp->completed = rnp->completed;
> diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> index 013485fb2b06..79793a7647cf 100644
> --- a/kernel/rcu/tree_plugin.h
> +++ b/kernel/rcu/tree_plugin.h
> @@ -164,7 +164,6 @@ static void rcu_preempt_note_context_switch(void)
>  		rdp = this_cpu_ptr(rcu_state_p->rda);
>  		rnp = rdp->mynode;
>  		raw_spin_lock_irqsave(&rnp->lock, flags);
> -		smp_mb__after_unlock_lock();
>  		t->rcu_read_unlock_special.b.blocked = true;
>  		t->rcu_blocked_node = rnp;
>  
> @@ -324,7 +323,6 @@ void rcu_read_unlock_special(struct task_struct *t)
>  		for (;;) {
>  			rnp = t->rcu_blocked_node;
>  			raw_spin_lock(&rnp->lock);  /* irqs already disabled. */
> -			smp_mb__after_unlock_lock();
>  			if (rnp == t->rcu_blocked_node)
>  				break;
>  			WARN_ON_ONCE(1);
> @@ -598,7 +596,6 @@ static void rcu_report_exp_rnp(struct rcu_state *rsp, struct rcu_node *rnp,
>  	unsigned long mask;
>  
>  	raw_spin_lock_irqsave(&rnp->lock, flags);
> -	smp_mb__after_unlock_lock();
>  	for (;;) {
>  		if (!sync_rcu_preempt_exp_done(rnp)) {
>  			raw_spin_unlock_irqrestore(&rnp->lock, flags);
> @@ -616,7 +613,6 @@ static void rcu_report_exp_rnp(struct rcu_state *rsp, struct rcu_node *rnp,
>  		raw_spin_unlock(&rnp->lock); /* irqs remain disabled */
>  		rnp = rnp->parent;
>  		raw_spin_lock(&rnp->lock); /* irqs already disabled */
> -		smp_mb__after_unlock_lock();
>  		rnp->expmask &= ~mask;
>  	}
>  }
> @@ -638,7 +634,6 @@ sync_rcu_preempt_exp_init1(struct rcu_state *rsp, struct rcu_node *rnp)
>  	struct rcu_node *rnp_up;
>  
>  	raw_spin_lock_irqsave(&rnp->lock, flags);
> -	smp_mb__after_unlock_lock();
>  	WARN_ON_ONCE(rnp->expmask);
>  	WARN_ON_ONCE(rnp->exp_tasks);
>  	if (!rcu_preempt_has_tasks(rnp)) {
> @@ -655,7 +650,6 @@ sync_rcu_preempt_exp_init1(struct rcu_state *rsp, struct rcu_node *rnp)
>  		if (rnp_up->expmask & mask)
>  			break;
>  		raw_spin_lock(&rnp_up->lock); /* irqs already off */
> -		smp_mb__after_unlock_lock();
>  		rnp_up->expmask |= mask;
>  		raw_spin_unlock(&rnp_up->lock); /* irqs still off */
>  	}
> @@ -679,7 +673,6 @@ sync_rcu_preempt_exp_init2(struct rcu_state *rsp, struct rcu_node *rnp)
>  	unsigned long flags;
>  
>  	raw_spin_lock_irqsave(&rnp->lock, flags);
> -	smp_mb__after_unlock_lock();
>  	if (!rnp->expmask) {
>  		/* Phase 1 didn't do anything, so Phase 2 doesn't either. */
>  		raw_spin_unlock_irqrestore(&rnp->lock, flags);
> @@ -1007,7 +1000,6 @@ static int rcu_boost(struct rcu_node *rnp)
>  		return 0;  /* Nothing left to boost. */
>  
>  	raw_spin_lock_irqsave(&rnp->lock, flags);
> -	smp_mb__after_unlock_lock();
>  
>  	/*
>  	 * Recheck under the lock: all tasks in need of boosting
> @@ -1195,7 +1187,6 @@ static int rcu_spawn_one_boost_kthread(struct rcu_state *rsp,
>  	if (IS_ERR(t))
>  		return PTR_ERR(t);
>  	raw_spin_lock_irqsave(&rnp->lock, flags);
> -	smp_mb__after_unlock_lock();
>  	rnp->boost_kthread_task = t;
>  	raw_spin_unlock_irqrestore(&rnp->lock, flags);
>  	sp.sched_priority = kthread_prio;
> @@ -1586,7 +1577,6 @@ static void rcu_prepare_for_idle(void)
>  			continue;
>  		rnp = rdp->mynode;
>  		raw_spin_lock(&rnp->lock); /* irqs already disabled. */
> -		smp_mb__after_unlock_lock();
>  		needwake = rcu_accelerate_cbs(rsp, rnp, rdp);
>  		raw_spin_unlock(&rnp->lock); /* irqs remain disabled. */
>  		if (needwake)
> @@ -2114,7 +2104,6 @@ static void rcu_nocb_wait_gp(struct rcu_data *rdp)
>  	struct rcu_node *rnp = rdp->mynode;
>  
>  	raw_spin_lock_irqsave(&rnp->lock, flags);
> -	smp_mb__after_unlock_lock();
>  	needwake = rcu_start_future_gp(rnp, rdp, &c);
>  	raw_spin_unlock_irqrestore(&rnp->lock, flags);
>  	if (needwake)



^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC PATCH v2] memory-barriers: remove smp_mb__after_unlock_lock()
  2015-07-13 15:54       ` Peter Zijlstra
  2015-07-13 17:50         ` Will Deacon
  2015-07-13 18:23         ` Paul E. McKenney
@ 2015-07-13 22:37         ` Benjamin Herrenschmidt
  2 siblings, 0 replies; 66+ messages in thread
From: Benjamin Herrenschmidt @ 2015-07-13 22:37 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Will Deacon, linux-arch, linux-kernel, Paul McKenney

On Mon, 2015-07-13 at 17:54 +0200, Peter Zijlstra wrote:

> That said, I don't think this could even happen on PPC because we have
> load_acquire and store_release, this means that:
> 
> 	*A = a
> 	lwsync
> 	store_release M
> 	load_acquire N
> 	lwsync
> 	*B = b
> 
> And since the store to M is wrapped inside two lwsync there must be
> strong store order, and because the load from N is equally wrapped in
> two lwsyncs there must also be strong load order.
> 
> In fact, no store/load can cross from before the first lwsync to after
> the latter and the other way around.
> 
> So in that respect it does provide full load-store ordering. What it
> does not provide is order for M and N, nor does it provide transitivity,
> but looking at our documentation I'm not at all sure we guarantee that
> in any case.

The problem is that if you have a load after the second lwsync, that load
can go up past the store release. This has caused issues when N or M is what
you are trying to order against. That's why we had to add a sync to
spin_is_locked or similar.

Ben.




^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC PATCH v2] memory-barriers: remove smp_mb__after_unlock_lock()
  2015-07-13 22:15               ` Peter Zijlstra
@ 2015-07-13 22:43                 ` Benjamin Herrenschmidt
  2015-07-14  8:34                   ` Peter Zijlstra
  2015-07-13 22:53                 ` Paul E. McKenney
  1 sibling, 1 reply; 66+ messages in thread
From: Benjamin Herrenschmidt @ 2015-07-13 22:43 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Paul E. McKenney, Peter Hurley, Will Deacon, linux-arch, linux-kernel

On Tue, 2015-07-14 at 00:15 +0200, Peter Zijlstra wrote:

> > This is instead the sequence that is of concern:
> > 
> > 	store a
> > 	unlock M
> > 	lock N
> > 	load b
> 
> So its late and that table didn't parse, but that should be ordered too.
> The load of b should not be able to escape the lock N.
> 
> If only because LWSYNC is a valid RMB and any LOCK implementation must
> load the lock state to observe it unlocked.

What happens is that the load passes the store conditional, though it
doesn't pass the load with reserve. However, both store A and unlock M,
being just stores with an lwsync, can pass a load, so they can pass the
load with reserve. And thus, inside the LL/SC loop, our store A has
passed our load B.

> > > Additionally, the assertion in Documentation/memory_barriers.txt that
> > > the sequence above can be reordered as
> > > 
> > >   LOCK N
> > >   store b
> > >   store a
> > >   UNLOCK M
> > > 
> > > is not true on any existing arch in Linux.
> > 
> > It was at one time and might be again.
> 
> What would be required to make this true? I'm having a hard time seeing
> how things can get reordered like that.



^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC PATCH v2] memory-barriers: remove smp_mb__after_unlock_lock()
  2015-07-13 22:15               ` Peter Zijlstra
  2015-07-13 22:43                 ` Benjamin Herrenschmidt
@ 2015-07-13 22:53                 ` Paul E. McKenney
  1 sibling, 0 replies; 66+ messages in thread
From: Paul E. McKenney @ 2015-07-13 22:53 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Peter Hurley, Will Deacon, linux-arch, linux-kernel,
	Benjamin Herrenschmidt

On Tue, Jul 14, 2015 at 12:15:03AM +0200, Peter Zijlstra wrote:
> On Mon, Jul 13, 2015 at 01:16:42PM -0700, Paul E. McKenney wrote:
> > On Mon, Jul 13, 2015 at 03:41:53PM -0400, Peter Hurley wrote:
> > > > Does that answer the question, or am I missing the point?
> > > 
> > > Yes, it shows that smp_mb__after_unlock_lock() has no purpose, since it
> > > is defined only for PowerPC and your test above just showed that for
> > > the sequence
> 
> The only purpose is to provide transitivity, but the documentation fails
> to explicitly call that out.

It does say that it is a full barrier, but I added explicit mention of
transitivity.

> > > 
> > >   store a
> > >   UNLOCK M
> > >   LOCK N
> > >   store b
> > > 
> > > a and b is always observed as an ordered pair {a,b}.
> > 
> > Not quite.
> > 
> > This is instead the sequence that is of concern:
> > 
> > 	store a
> > 	unlock M
> > 	lock N
> > 	load b
> 
> So it's late and that table didn't parse, but that should be ordered too.
> The load of b should not be able to escape the lock N.
> 
> If only because LWSYNC is a valid RMB and any LOCK implementation must
> load the lock state to observe it unlocked.

If you actually hold a given lock, then yes, you will observe anything
previously done while holding that same lock, even if you don't use
smp_mb__after_unlock_lock().  The smp_mb__after_unlock_lock() comes into
play when code not holding a lock needs to see the ordering.  RCU needs
this because of the strong ordering that grace periods must provide:
regardless of who started or ended the grace period, anything on any
CPU preceding a given grace period is fully ordered before anything on
any CPU following that same grace period.  It is not clear to me that
anything else would need such strong ordering.
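
As a hypothetical illustration of that external-observer case (the names
are invented for the example; x and y are initially zero, and both
critical sections use the same lock M so the barrier's guarantee
applies):

	DEFINE_SPINLOCK(M);
	int x, y;

	void cpu0(void)
	{
		spin_lock(&M);
		WRITE_ONCE(x, 1);
		spin_unlock(&M);
	}

	void cpu1(void)				/* acquires M after cpu0() */
	{
		spin_lock(&M);
		smp_mb__after_unlock_lock();
		WRITE_ONCE(y, 1);
		spin_unlock(&M);
	}

	void cpu2(void)				/* holds no lock at all */
	{
		int r1, r2;

		r1 = READ_ONCE(y);
		smp_rmb();
		r2 = READ_ONCE(x);
		/*
		 * With the barrier in cpu1(), r1 == 1 implies r2 == 1.
		 * Without it, PowerPC may let cpu2() see r1 == 1, r2 == 0.
		 */
	}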

> > > Additionally, the assertion in Documentation/memory_barriers.txt that
> > > the sequence above can be reordered as
> > > 
> > >   LOCK N
> > >   store b
> > >   store a
> > >   UNLOCK M
> > > 
> > > is not true on any existing arch in Linux.
> > 
> > It was at one time and might be again.
> 
> What would be required to make this true? I'm having a hard time seeing
> how things can get reordered like that.

You are right, I failed to merge current and past knowledge.  At one time,
Itanium was said to allow things to bleed into lock-based critical sections.
However, we now know that ld,acq and st,rel really do full ordering.

Compilers might one day do this sort of reordering, but I would guess
that Linux kernel builds would disable this sort of thing.  Something
about wanting critical sections to remain small.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC PATCH v2] memory-barriers: remove smp_mb__after_unlock_lock()
  2015-07-13 22:23             ` Peter Zijlstra
@ 2015-07-13 23:04               ` Paul E. McKenney
  2015-07-14 10:04                 ` Will Deacon
  0 siblings, 1 reply; 66+ messages in thread
From: Paul E. McKenney @ 2015-07-13 23:04 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Will Deacon, linux-arch, linux-kernel, Benjamin Herrenschmidt

On Tue, Jul 14, 2015 at 12:23:46AM +0200, Peter Zijlstra wrote:
> On Mon, Jul 13, 2015 at 01:20:32PM -0700, Paul E. McKenney wrote:
> > On Mon, Jul 13, 2015 at 06:50:29PM +0100, Will Deacon wrote:
> 
> > > So if I'm following along, smp_mb__after_unlock_lock *does* provide
> > > transitivity when used with UNLOCK + LOCK, which is stronger than your
> > > example here.
> > 
> > Yes, that is indeed the intent.
> 
> Maybe good to state this explicitly somewhere.

Fair enough!  Please see patch below.

> > > I don't think we want to make the same guarantee for general RELEASE +
> > > ACQUIRE, because we'd end up forcing most architectures to implement the
> > > expensive macro for a case that currently has no users.
> > 
> > Agreed, smp_mb__after_unlock_lock() makes a limited guarantee.
> 
> I'm still not seeing how the archs that implement load_acquire and
> store_release with smp_mb() are a problem.

I don't know that I ever claimed such architectures to be a problem.
Color me confused.

> If we look at the inside of the critical section again -- similar
> argument as before:
> 
> 	*A = a
> 	smp_mb()
> 	store M
> 	load N
> 	smp_mb()
> 	*B = b
> 
> A and B are fully ordered, and in this case even transitivity is
> provided.
> 
> I'm stating that the order of M and N don't matter, only the
> load/stores that are inside the acquire/release are constrained.

No argument here.

> IOW, I think smp_mb__after_unlock_lock() already works as advertised
> with all our acquire/release thingies -- as is stated by the
> documentation.
> 
> That said, I'm not aware of anybody but RCU actually using this, so it's
> not used in that capacity.

OK, I might actually understand what you are getting at.  And, yes, if
someone actually comes up with a need to combine smp_mb__after_unlock_lock()
with something other than locking, we should worry about it at that point.
And probably rename smp_mb__after_unlock_lock() at that point, as well.
Until then, why lock ourselves into semantics that no one needs, and
that it is quite possible that no one will ever need?

> > > In which case, it boils down to the question of how expensive it would
> > > be to implement an SC UNLOCK operation on PowerPC and whether that justifies
> > > the existence of a complicated barrier macro that isn't used outside of
> > > RCU.
> > 
> > Given that it is either smp_mb() or nothing, I am not seeing the
> > "complicated" part...
> 
> The 'complicated' part is that we need to think about it; that is, we need
> to realize and remember that UNLOCK+LOCK is a load-store barrier but
> fails to provide transitivity.

...  unless you are holding the lock.  So in the common case, you do
get transitivity.

							Thanx, Paul

------------------------------------------------------------------------

commit bae5cf1e2973bb1e8f852abb54f8b1948ffd82e4
Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Date:   Mon Jul 13 15:55:52 2015 -0700

    doc: Call out smp_mb__after_unlock_lock() transitivity
    
    Although "full barrier" should be interpreted as providing transitivity,
    it is worth eliminating any possible confusion.  This commit therefore
    adds "(including transitivity)" to eliminate any possible confusion.
    
    Reported-by: Peter Zijlstra <peterz@infradead.org>
    Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
index 470c07c868e4..318523872db5 100644
--- a/Documentation/memory-barriers.txt
+++ b/Documentation/memory-barriers.txt
@@ -1858,11 +1858,12 @@ Similarly, the reverse case of a RELEASE followed by an ACQUIRE does not
 imply a full memory barrier.  If it is necessary for a RELEASE-ACQUIRE
 pair to produce a full barrier, the ACQUIRE can be followed by an
 smp_mb__after_unlock_lock() invocation.  This will produce a full barrier
-if either (a) the RELEASE and the ACQUIRE are executed by the same
-CPU or task, or (b) the RELEASE and ACQUIRE act on the same variable.
-The smp_mb__after_unlock_lock() primitive is free on many architectures.
-Without smp_mb__after_unlock_lock(), the CPU's execution of the critical
-sections corresponding to the RELEASE and the ACQUIRE can cross, so that:
+(including transitivity) if either (a) the RELEASE and the ACQUIRE are
+executed by the same CPU or task, or (b) the RELEASE and ACQUIRE act on
+the same variable.  The smp_mb__after_unlock_lock() primitive is free
+on many architectures.  Without smp_mb__after_unlock_lock(), the CPU's
+execution of the critical sections corresponding to the RELEASE and the
+ACQUIRE can cross, so that:
 
 	*A = a;
 	RELEASE M


^ permalink raw reply related	[flat|nested] 66+ messages in thread

* Re: [RFC PATCH v2] memory-barriers: remove smp_mb__after_unlock_lock()
  2015-07-13 22:43                 ` Benjamin Herrenschmidt
@ 2015-07-14  8:34                   ` Peter Zijlstra
  0 siblings, 0 replies; 66+ messages in thread
From: Peter Zijlstra @ 2015-07-14  8:34 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Paul E. McKenney, Peter Hurley, Will Deacon, linux-arch, linux-kernel

On Tue, Jul 14, 2015 at 08:43:44AM +1000, Benjamin Herrenschmidt wrote:
> On Tue, 2015-07-14 at 00:15 +0200, Peter Zijlstra wrote:
> 
> > > This is instead the sequence that is of concern:
> > > 
> > > 	store a
> > > 	unlock M
> > > 	lock N
> > > 	load b
> > 
> > So its late and that table didn't parse, but that should be ordered too.
> > The load of b should not be able to escape the lock N.
> > 
> > If only because LWSYNC is a valid RMB and any LOCK implementation must
> > load the lock state to observe it unlocked.
> 
> What happens is that the load passes the store conditional, though it
> doesn't pass the load with reserve. However, both store A and unlock M
> being just stores with an lwsync, can pass a load, so they can pass the
> load with reserve. And thus inside the LL/SC loop, our store A has
> passed our load B.

Ah cute.. Thanks, clearly I wasn't awake enough anymore :-)

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC PATCH v2] memory-barriers: remove smp_mb__after_unlock_lock()
  2015-07-13 23:04               ` Paul E. McKenney
@ 2015-07-14 10:04                 ` Will Deacon
  2015-07-14 12:45                   ` Paul E. McKenney
  0 siblings, 1 reply; 66+ messages in thread
From: Will Deacon @ 2015-07-14 10:04 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Peter Zijlstra, linux-arch, linux-kernel, Benjamin Herrenschmidt

On Tue, Jul 14, 2015 at 12:04:06AM +0100, Paul E. McKenney wrote:
> On Tue, Jul 14, 2015 at 12:23:46AM +0200, Peter Zijlstra wrote:
> > If we look at the inside of the critical section again -- similar
> > argument as before:
> > 
> > 	*A = a
> > 	smp_mb()
> > 	store M
> > 	load N
> > 	smp_mb()
> > 	*B = b
> > 
> > A and B are fully ordered, and in this case even transitivity is
> > provided.
> > 
> > I'm stating that the order of M and N don't matter, only the
> > load/stores that are inside the acquire/release are constrained.
> 
> No argument here.
> 
> > IOW, I think smp_mb__after_unlock_lock() already works as advertised
> > with all our acquire/release thingies -- as is stated by the
> > documentation.
> > 
> > That said, I'm not aware of anybody but RCU actually using this, so its
> > not used in that capacity.
> 
> OK, I might actually understand what you are getting at.  And, yes, if
> someone actually comes up with a need to combine smp_mb__after_unlock_lock()
> with something other than locking, we should worry about it at that point.
> And probably rename smp_mb__after_unlock_lock() at that point, as well.
> Until then, why lock ourselves into semantics that no one needs, and
> that it is quite possible that no one will ever need?

Given that RCU is currently the only user of this barrier, how would you
feel about making the barrier local to RCU and not part of the general
memory-barrier API?

My main reason for proposing its removal is because I don't want to see
it being used (incorrectly) all over the place to order the new RELEASE
and ACQUIRE operations I posted separately, at which point we have to try
fixing up all the callers or retrofitting some semantics. It doesn't help
that memory-barriers.txt lumps things like LOCK and ACQUIRE together,
whereas this barrier is currently only intended to be used in conjunction
with the former.

Will

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC PATCH v2] memory-barriers: remove smp_mb__after_unlock_lock()
  2015-07-13 22:31 ` Benjamin Herrenschmidt
@ 2015-07-14 10:16   ` Will Deacon
  2015-07-15  3:06   ` Michael Ellerman
  1 sibling, 0 replies; 66+ messages in thread
From: Will Deacon @ 2015-07-14 10:16 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: linux-arch, linux-kernel, Paul McKenney, Peter Zijlstra,
	Michael Ellerman

On Mon, Jul 13, 2015 at 11:31:29PM +0100, Benjamin Herrenschmidt wrote:
> On Mon, 2015-07-13 at 13:15 +0100, Will Deacon wrote:
> > This didn't go anywhere last time I posted it, but here it is again.
> > I'd really appreciate some feedback from the PowerPC guys, especially as
> > to whether this change requires them to add an additional barrier in
> > arch_spin_unlock and what the cost of that would be.
> 
> We'd have to turn the lwsync in unlock or the isync in lock into a full
> barrier. As it is, we *almost* have a full barrier semantic, but not
> quite, as in things can get mixed up inside spin_lock between the LL and
> the SC (things leaking in past LL and things leaking "out" up before SC
> and then getting mixed up in there).

Thanks, Ben.

> Michael, at some point you were experimenting a bit with that and tried
> to get some perf numbers of the impact that would have, did that
> solidify ? Otherwise, I'll have a look when I'm back next week.

These numbers would be really interesting...

Will

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC PATCH v2] memory-barriers: remove smp_mb__after_unlock_lock()
  2015-07-14 10:04                 ` Will Deacon
@ 2015-07-14 12:45                   ` Paul E. McKenney
  2015-07-14 12:51                     ` Will Deacon
  0 siblings, 1 reply; 66+ messages in thread
From: Paul E. McKenney @ 2015-07-14 12:45 UTC (permalink / raw)
  To: Will Deacon
  Cc: Peter Zijlstra, linux-arch, linux-kernel, Benjamin Herrenschmidt

On Tue, Jul 14, 2015 at 11:04:29AM +0100, Will Deacon wrote:
> On Tue, Jul 14, 2015 at 12:04:06AM +0100, Paul E. McKenney wrote:
> > On Tue, Jul 14, 2015 at 12:23:46AM +0200, Peter Zijlstra wrote:
> > > If we look at the inside of the critical section again -- similar
> > > argument as before:
> > > 
> > > 	*A = a
> > > 	smp_mb()
> > > 	store M
> > > 	load N
> > > 	smp_mb()
> > > 	*B = b
> > > 
> > > A and B are fully ordered, and in this case even transitivity is
> > > provided.
> > > 
> > > I'm stating that the order of M and N don't matter, only the
> > > load/stores that are inside the acquire/release are constrained.
> > 
> > No argument here.
> > 
> > > IOW, I think smp_mb__after_unlock_lock() already works as advertised
> > > with all our acquire/release thingies -- as is stated by the
> > > documentation.
> > > 
> > > That said, I'm not aware of anybody but RCU actually using this, so its
> > > not used in that capacity.
> > 
> > OK, I might actually understand what you are getting at.  And, yes, if
> > someone actually comes up with a need to combine smp_mb__after_unlock_lock()
> > with something other than locking, we should worry about it at that point.
> > And probably rename smp_mb__after_unlock_lock() at that point, as well.
> > Until then, why lock ourselves into semantics that no one needs, and
> > that it is quite possible that no one will ever need?
> 
> Given that RCU is currently the only user of this barrier, how would you
> feel about making the barrier local to RCU and not part of the general
> memory-barrier API?

In theory, no objection.  Your thought is to leave the definitions where
they are, mark them as being used only by RCU, and remove mention from
memory-barriers.txt?  Or did you have something else in mind?

> My main reason for proposing its removal is because I don't want to see
> it being used (incorrectly) all over the place to order the new RELEASE
> and ACQUIRE operations I posted separately, at which point we have to try
> fixing up all the callers or retrofitting some semantics. It doesn't help
> that memory-barriers.txt lumps things like LOCK and ACQUIRE together,
> whereas this barrier is currently only intended to be used in conjunction
> with the former.

Heh!  That lumping was considered to be a feature at the time.  ;-)

							Thanx, Paul


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC PATCH v2] memory-barriers: remove smp_mb__after_unlock_lock()
  2015-07-14 12:45                   ` Paul E. McKenney
@ 2015-07-14 12:51                     ` Will Deacon
  2015-07-14 14:00                       ` Paul E. McKenney
  0 siblings, 1 reply; 66+ messages in thread
From: Will Deacon @ 2015-07-14 12:51 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Peter Zijlstra, linux-arch, linux-kernel, Benjamin Herrenschmidt

Hi Paul,

On Tue, Jul 14, 2015 at 01:45:40PM +0100, Paul E. McKenney wrote:
> On Tue, Jul 14, 2015 at 11:04:29AM +0100, Will Deacon wrote:
> > Given that RCU is currently the only user of this barrier, how would you
> > feel about making the barrier local to RCU and not part of the general
> > memory-barrier API?
> 
> In theory, no objection.  Your thought is to leave the definitions where
> they are, mark them as being used only by RCU, and removing mention from
> memory-barriers.txt?  Or did you have something else in mind?

Actually, I was thinking of defining them in an RCU header file with an
#ifdef CONFIG_POWERPC for the smp_mb() version. Then you could have a big
comment describing the semantics, or put that in an RCU Documentation file
instead of memory-barriers.txt.

That *should* then mean we notice anybody else trying to use the barrier,
because they'd need to send patches to either add something equivalent
or move the definition out again.
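
For concreteness, the RCU-private definition being proposed would have roughly
this shape (a sketch only; Paul's patch later in this thread does essentially
this in kernel/rcu/tree.h, keyed off CONFIG_PPC rather than CONFIG_POWERPC):

	/* Sketch: RCU-private UNLOCK+LOCK promotion barrier. */
	#ifdef CONFIG_PPC
	#define smp_mb__after_unlock_lock()	smp_mb()	/* Full ordering for lock. */
	#else
	#define smp_mb__after_unlock_lock()	do { } while (0)
	#endif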

> > My main reason for proposing its removal is because I don't want to see
> > it being used (incorrectly) all over the place to order the new RELEASE
> > and ACQUIRE operations I posted separately, at which point we have to try
> > fixing up all the callers or retrofitting some semantics. It doesn't help
> > that memory-barriers.txt lumps things like LOCK and ACQUIRE together,
> > whereas this barrier is currently only intended to be used in conjunction
> > with the former.
> 
> Heh!  That lumping was considered to be a feature at the time.  ;-)

Oh, I'm sure it was added with good intentions!

Will

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC PATCH v2] memory-barriers: remove smp_mb__after_unlock_lock()
  2015-07-14 12:51                     ` Will Deacon
@ 2015-07-14 14:00                       ` Paul E. McKenney
  2015-07-14 14:12                         ` Will Deacon
  0 siblings, 1 reply; 66+ messages in thread
From: Paul E. McKenney @ 2015-07-14 14:00 UTC (permalink / raw)
  To: Will Deacon
  Cc: Peter Zijlstra, linux-arch, linux-kernel, Benjamin Herrenschmidt

On Tue, Jul 14, 2015 at 01:51:46PM +0100, Will Deacon wrote:
> Hi Paul,
> 
> On Tue, Jul 14, 2015 at 01:45:40PM +0100, Paul E. McKenney wrote:
> > On Tue, Jul 14, 2015 at 11:04:29AM +0100, Will Deacon wrote:
> > > Given that RCU is currently the only user of this barrier, how would you
> > > feel about making the barrier local to RCU and not part of the general
> > > memory-barrier API?
> > 
> > In theory, no objection.  Your thought is to leave the definitions where
> > they are, mark them as being used only by RCU, and removing mention from
> > memory-barriers.txt?  Or did you have something else in mind?
> 
> Actually, I was thinking of defining them in an RCU header file with an
> #ifdef CONFIG_POWERPC for the smb_mb() version. Then you could have a big
> comment describing the semantics, or put that in an RCU Documentation file
> instead of memory-barriers.txt.
> 
> That *should* then mean we notice anybody else trying to use the barrier,
> because they'd need to send patches to either add something equivalent
> or move the definition out again.

My concern with this approach is that someone putting together a new
architecture might miss this.  That said, this approach certainly would
work for the current architectures.

> > > My main reason for proposing its removal is because I don't want to see
> > > it being used (incorrectly) all over the place to order the new RELEASE
> > > and ACQUIRE operations I posted separately, at which point we have to try
> > > fixing up all the callers or retrofitting some semantics. It doesn't help
> > > that memory-barriers.txt lumps things like LOCK and ACQUIRE together,
> > > whereas this barrier is currently only intended to be used in conjunction
> > > with the former.
> > 
> > Heh!  That lumping was considered to be a feature at the time.  ;-)
> 
> Oh, I'm sure it was added with good intentions!

And we all know which road is paved with good intentions!  ;-)

							Thanx, Paul


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC PATCH v2] memory-barriers: remove smp_mb__after_unlock_lock()
  2015-07-14 14:00                       ` Paul E. McKenney
@ 2015-07-14 14:12                         ` Will Deacon
  2015-07-14 19:31                           ` Paul E. McKenney
  0 siblings, 1 reply; 66+ messages in thread
From: Will Deacon @ 2015-07-14 14:12 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Peter Zijlstra, linux-arch, linux-kernel, Benjamin Herrenschmidt

On Tue, Jul 14, 2015 at 03:00:14PM +0100, Paul E. McKenney wrote:
> On Tue, Jul 14, 2015 at 01:51:46PM +0100, Will Deacon wrote:
> > On Tue, Jul 14, 2015 at 01:45:40PM +0100, Paul E. McKenney wrote:
> > > On Tue, Jul 14, 2015 at 11:04:29AM +0100, Will Deacon wrote:
> > > > Given that RCU is currently the only user of this barrier, how would you
> > > > feel about making the barrier local to RCU and not part of the general
> > > > memory-barrier API?
> > > 
> > > In theory, no objection.  Your thought is to leave the definitions where
> > > they are, mark them as being used only by RCU, and removing mention from
> > > memory-barriers.txt?  Or did you have something else in mind?
> > 
> > Actually, I was thinking of defining them in an RCU header file with an
> > #ifdef CONFIG_POWERPC for the smb_mb() version. Then you could have a big
> > comment describing the semantics, or put that in an RCU Documentation file
> > instead of memory-barriers.txt.
> > 
> > That *should* then mean we notice anybody else trying to use the barrier,
> > because they'd need to send patches to either add something equivalent
> > or move the definition out again.
> 
> My concern with this approach is that someone putting together a new
> architecture might miss this.  That said, this approach certainly would
> work for the current architectures.

I don't think they're any more likely to miss it than with the current
situation where the generic code defines the macro as a NOP unless you
explicitly override it.
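
(For reference, the generic fallback being referred to is the weak definition
in include/linux/spinlock.h, roughly of this shape -- a sketch with the comment
text omitted -- which an architecture overrides by defining the macro itself:)

	#ifndef smp_mb__after_unlock_lock
	#define smp_mb__after_unlock_lock()	do { } while (0)
	#endif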

Will

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC PATCH v2] memory-barriers: remove smp_mb__after_unlock_lock()
  2015-07-14 14:12                         ` Will Deacon
@ 2015-07-14 19:31                           ` Paul E. McKenney
  2015-07-15  1:38                             ` Paul E. McKenney
  0 siblings, 1 reply; 66+ messages in thread
From: Paul E. McKenney @ 2015-07-14 19:31 UTC (permalink / raw)
  To: Will Deacon
  Cc: Peter Zijlstra, linux-arch, linux-kernel, Benjamin Herrenschmidt

On Tue, Jul 14, 2015 at 03:12:16PM +0100, Will Deacon wrote:
> On Tue, Jul 14, 2015 at 03:00:14PM +0100, Paul E. McKenney wrote:
> > On Tue, Jul 14, 2015 at 01:51:46PM +0100, Will Deacon wrote:
> > > On Tue, Jul 14, 2015 at 01:45:40PM +0100, Paul E. McKenney wrote:
> > > > On Tue, Jul 14, 2015 at 11:04:29AM +0100, Will Deacon wrote:
> > > > > Given that RCU is currently the only user of this barrier, how would you
> > > > > feel about making the barrier local to RCU and not part of the general
> > > > > memory-barrier API?
> > > > 
> > > > In theory, no objection.  Your thought is to leave the definitions where
> > > > they are, mark them as being used only by RCU, and removing mention from
> > > > memory-barriers.txt?  Or did you have something else in mind?
> > > 
> > > Actually, I was thinking of defining them in an RCU header file with an
> > > #ifdef CONFIG_POWERPC for the smb_mb() version. Then you could have a big
> > > comment describing the semantics, or put that in an RCU Documentation file
> > > instead of memory-barriers.txt.
> > > 
> > > That *should* then mean we notice anybody else trying to use the barrier,
> > > because they'd need to send patches to either add something equivalent
> > > or move the definition out again.
> > 
> > My concern with this approach is that someone putting together a new
> > architecture might miss this.  That said, this approach certainly would
> > work for the current architectures.
> 
> I don't think they're any more likely to miss it than with the current
> situation where the generic code defines the macro as a NOP unless you
> explicitly override it.

Fair enough...

							Thanx, Paul


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC PATCH v2] memory-barriers: remove smp_mb__after_unlock_lock()
  2015-07-14 19:31                           ` Paul E. McKenney
@ 2015-07-15  1:38                             ` Paul E. McKenney
  2015-07-15 10:51                               ` Will Deacon
  0 siblings, 1 reply; 66+ messages in thread
From: Paul E. McKenney @ 2015-07-15  1:38 UTC (permalink / raw)
  To: Will Deacon
  Cc: Peter Zijlstra, linux-arch, linux-kernel, Benjamin Herrenschmidt

On Tue, Jul 14, 2015 at 12:31:44PM -0700, Paul E. McKenney wrote:
> On Tue, Jul 14, 2015 at 03:12:16PM +0100, Will Deacon wrote:
> > On Tue, Jul 14, 2015 at 03:00:14PM +0100, Paul E. McKenney wrote:
> > > On Tue, Jul 14, 2015 at 01:51:46PM +0100, Will Deacon wrote:
> > > > On Tue, Jul 14, 2015 at 01:45:40PM +0100, Paul E. McKenney wrote:
> > > > > On Tue, Jul 14, 2015 at 11:04:29AM +0100, Will Deacon wrote:
> > > > > > Given that RCU is currently the only user of this barrier, how would you
> > > > > > feel about making the barrier local to RCU and not part of the general
> > > > > > memory-barrier API?
> > > > > 
> > > > > In theory, no objection.  Your thought is to leave the definitions where
> > > > > they are, mark them as being used only by RCU, and removing mention from
> > > > > memory-barriers.txt?  Or did you have something else in mind?
> > > > 
> > > > Actually, I was thinking of defining them in an RCU header file with an
> > > > #ifdef CONFIG_POWERPC for the smb_mb() version. Then you could have a big
> > > > comment describing the semantics, or put that in an RCU Documentation file
> > > > instead of memory-barriers.txt.
> > > > 
> > > > That *should* then mean we notice anybody else trying to use the barrier,
> > > > because they'd need to send patches to either add something equivalent
> > > > or move the definition out again.
> > > 
> > > My concern with this approach is that someone putting together a new
> > > architecture might miss this.  That said, this approach certainly would
> > > work for the current architectures.
> > 
> > I don't think they're any more likely to miss it than with the current
> > situation where the generic code defines the macro as a NOP unless you
> > explicitly override it.
> 
> Fair enough...

Like this?

							Thanx, Paul

------------------------------------------------------------------------

commit 695c05d4b9666c50b40a1c022678b5f6e2e3e771
Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Date:   Tue Jul 14 18:35:23 2015 -0700

    rcu,locking: Privatize smp_mb__after_unlock_lock()
    
    RCU is the only thing that uses smp_mb__after_unlock_lock(), and is
    likely the only thing that ever will use it, so this commit makes this
    macro private to RCU.
    
    Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    Cc: Will Deacon <will.deacon@arm.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
    Cc: "linux-arch@vger.kernel.org" <linux-arch@vger.kernel.org>

diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
index 318523872db5..eafa6a53f72c 100644
--- a/Documentation/memory-barriers.txt
+++ b/Documentation/memory-barriers.txt
@@ -1854,16 +1854,10 @@ RELEASE are to the same lock variable, but only from the perspective of
 another CPU not holding that lock.  In short, a ACQUIRE followed by an
 RELEASE may -not- be assumed to be a full memory barrier.
 
-Similarly, the reverse case of a RELEASE followed by an ACQUIRE does not
-imply a full memory barrier.  If it is necessary for a RELEASE-ACQUIRE
-pair to produce a full barrier, the ACQUIRE can be followed by an
-smp_mb__after_unlock_lock() invocation.  This will produce a full barrier
-(including transitivity) if either (a) the RELEASE and the ACQUIRE are
-executed by the same CPU or task, or (b) the RELEASE and ACQUIRE act on
-the same variable.  The smp_mb__after_unlock_lock() primitive is free
-on many architectures.  Without smp_mb__after_unlock_lock(), the CPU's
-execution of the critical sections corresponding to the RELEASE and the
-ACQUIRE can cross, so that:
+Similarly, the reverse case of a RELEASE followed by an ACQUIRE does
+not imply a full memory barrier.  Therefore, the CPU's execution of the
+critical sections corresponding to the RELEASE and the ACQUIRE can cross,
+so that:
 
 	*A = a;
 	RELEASE M
@@ -1901,29 +1895,6 @@ the RELEASE would simply complete, thereby avoiding the deadlock.
 	a sleep-unlock race, but the locking primitive needs to resolve
 	such races properly in any case.
 
-With smp_mb__after_unlock_lock(), the two critical sections cannot overlap.
-For example, with the following code, the store to *A will always be
-seen by other CPUs before the store to *B:
-
-	*A = a;
-	RELEASE M
-	ACQUIRE N
-	smp_mb__after_unlock_lock();
-	*B = b;
-
-The operations will always occur in one of the following orders:
-
-	STORE *A, RELEASE, ACQUIRE, smp_mb__after_unlock_lock(), STORE *B
-	STORE *A, ACQUIRE, RELEASE, smp_mb__after_unlock_lock(), STORE *B
-	ACQUIRE, STORE *A, RELEASE, smp_mb__after_unlock_lock(), STORE *B
-
-If the RELEASE and ACQUIRE were instead both operating on the same lock
-variable, only the first of these alternatives can occur.  In addition,
-the more strongly ordered systems may rule out some of the above orders.
-But in any case, as noted earlier, the smp_mb__after_unlock_lock()
-ensures that the store to *A will always be seen as happening before
-the store to *B.
-
 Locks and semaphores may not provide any guarantee of ordering on UP compiled
 systems, and so cannot be counted on in such a situation to actually achieve
 anything at all - especially with respect to I/O accesses - unless combined
@@ -2154,40 +2125,6 @@ But it won't see any of:
 	*E, *F or *G following RELEASE Q
 
 
-However, if the following occurs:
-
-	CPU 1				CPU 2
-	===============================	===============================
-	WRITE_ONCE(*A, a);
-	ACQUIRE M		     [1]
-	WRITE_ONCE(*B, b);
-	WRITE_ONCE(*C, c);
-	RELEASE M	     [1]
-	WRITE_ONCE(*D, d);		WRITE_ONCE(*E, e);
-					ACQUIRE M		     [2]
-					smp_mb__after_unlock_lock();
-					WRITE_ONCE(*F, f);
-					WRITE_ONCE(*G, g);
-					RELEASE M	     [2]
-					WRITE_ONCE(*H, h);
-
-CPU 3 might see:
-
-	*E, ACQUIRE M [1], *C, *B, *A, RELEASE M [1],
-		ACQUIRE M [2], *H, *F, *G, RELEASE M [2], *D
-
-But assuming CPU 1 gets the lock first, CPU 3 won't see any of:
-
-	*B, *C, *D, *F, *G or *H preceding ACQUIRE M [1]
-	*A, *B or *C following RELEASE M [1]
-	*F, *G or *H preceding ACQUIRE M [2]
-	*A, *B, *C, *E, *F or *G following RELEASE M [2]
-
-Note that the smp_mb__after_unlock_lock() is critically important
-here: Without it CPU 3 might see some of the above orderings.
-Without smp_mb__after_unlock_lock(), the accesses are not guaranteed
-to be seen in order unless CPU 3 holds lock M.
-
 
 ACQUIRES VS I/O ACCESSES
 ------------------------
diff --git a/arch/powerpc/include/asm/spinlock.h b/arch/powerpc/include/asm/spinlock.h
index 4dbe072eecbe..523673d7583c 100644
--- a/arch/powerpc/include/asm/spinlock.h
+++ b/arch/powerpc/include/asm/spinlock.h
@@ -28,8 +28,6 @@
 #include <asm/synch.h>
 #include <asm/ppc-opcode.h>
 
-#define smp_mb__after_unlock_lock()	smp_mb()  /* Full ordering for lock. */
-
 #ifdef CONFIG_PPC64
 /* use 0x800000yy when locked, where yy == CPU number */
 #ifdef __BIG_ENDIAN__
diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index 80d974df0ea0..a9fea7395ba2 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -645,3 +645,15 @@ static inline void rcu_nocb_q_lengths(struct rcu_data *rdp, long *ql, long *qll)
 #endif /* #else #ifdef CONFIG_RCU_NOCB_CPU */
 }
 #endif /* #ifdef CONFIG_RCU_TRACE */
+
+/*
+ * Place this after a lock-acquisition primitive to guarantee that
+ * an UNLOCK+LOCK pair act as a full barrier.  This guarantee applies
+ * if the UNLOCK and LOCK are executed by the same CPU or if the
+ * UNLOCK and LOCK operate on the same lock variable.
+ */
+#ifdef CONFIG_PPC
+#define smp_mb__after_unlock_lock()	smp_mb()  /* Full ordering for lock. */
+#else /* #ifdef CONFIG_PPC */
+#define smp_mb__after_unlock_lock()	do { } while (0)
+#endif /* #else #ifdef CONFIG_PPC */
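
For callers, nothing changes in how the barrier is used; as a usage sketch
(hypothetical function name, but the lock/barrier pairing is the pattern RCU
already uses throughout kernel/rcu/tree.c):

	/*
	 * Usage sketch: the barrier sits immediately after the lock acquisition
	 * whose pairing with the previous UNLOCK must act as a full barrier.
	 */
	static void example_rcu_node_step(struct rcu_node *rnp)
	{
		unsigned long flags;

		raw_spin_lock_irqsave(&rnp->lock, flags);
		smp_mb__after_unlock_lock();	/* prior UNLOCK + this LOCK == full barrier */
		/* ... accesses that rely on the full-barrier guarantee ... */
		raw_spin_unlock_irqrestore(&rnp->lock, flags);
	}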


^ permalink raw reply related	[flat|nested] 66+ messages in thread

* Re: [RFC PATCH v2] memory-barriers: remove smp_mb__after_unlock_lock()
  2015-07-13 22:31 ` Benjamin Herrenschmidt
  2015-07-14 10:16   ` Will Deacon
@ 2015-07-15  3:06   ` Michael Ellerman
  2015-07-15 10:44     ` Will Deacon
  2015-07-15 14:18     ` Paul E. McKenney
  1 sibling, 2 replies; 66+ messages in thread
From: Michael Ellerman @ 2015-07-15  3:06 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Will Deacon, linux-arch, linux-kernel, Paul McKenney,
	Peter Zijlstra, Michael Ellerman

On Tue, 2015-07-14 at 08:31 +1000, Benjamin Herrenschmidt wrote:
> On Mon, 2015-07-13 at 13:15 +0100, Will Deacon wrote:
> > smp_mb__after_unlock_lock is used to promote an UNLOCK + LOCK sequence
> > into a full memory barrier.
> > 
> > However:
> > 
> >   - This ordering guarantee is already provided without the barrier on
> >     all architectures apart from PowerPC
> > 
> >   - The barrier only applies to UNLOCK + LOCK, not general
> >     RELEASE + ACQUIRE operations
> > 
> >   - Locks are generally assumed to offer SC ordering semantics, so
> >     having this additional barrier is error-prone and complicates the
> >     callers of LOCK/UNLOCK primitives
> > 
> >   - The barrier is not well used outside of RCU and, because it was
> >     retrofitted into the kernel, it's not clear whether other areas of
> >     the kernel are incorrectly relying on UNLOCK + LOCK implying a full
> >     barrier
> > 
> > This patch removes the barrier and instead requires architectures to
> > provide full barrier semantics for an UNLOCK + LOCK sequence.
> > 
> > Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> > Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
> > Cc: Peter Zijlstra <peterz@infradead.org>
> > Signed-off-by: Will Deacon <will.deacon@arm.com>
> > ---
> > 
> > This didn't go anywhere last time I posted it, but here it is again.
> > I'd really appreciate some feedback from the PowerPC guys, especially as
> > to whether this change requires them to add an additional barrier in
> > arch_spin_unlock and what the cost of that would be.
> 
> We'd have to turn the lwsync in unlock or the isync in lock into a full
> barrier. As it is, we *almost* have a full barrier semantic, but not
> quite, as in things can get mixed up inside spin_lock between the LL and
> the SC (things leaking in past LL and things leaking "out" up before SC
> and then getting mixed up in there).
> 
> Michael, at some point you were experimenting a bit with that and tried
> to get some perf numbers of the impact that would have, did that
> solidify ? Otherwise, I'll have a look when I'm back next week.

I was mainly experimenting with replacing the lwsync in lock with an isync.

But I think you're talking about making it a full sync in lock.

That was about +7% on p8, +25% on p7 and +88% on p6.

We got stuck deciding whether isync was safe to use as a memory barrier,
because the wording in the arch is a bit vague.

But if we're talking about a full sync then I think there is no question that's
OK and we should just do it.

cheers



^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC PATCH v2] memory-barriers: remove smp_mb__after_unlock_lock()
  2015-07-15  3:06   ` Michael Ellerman
@ 2015-07-15 10:44     ` Will Deacon
  2015-07-16  2:00       ` Michael Ellerman
  2015-07-15 14:18     ` Paul E. McKenney
  1 sibling, 1 reply; 66+ messages in thread
From: Will Deacon @ 2015-07-15 10:44 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Benjamin Herrenschmidt, linux-arch, linux-kernel, Paul McKenney,
	Peter Zijlstra, Michael Ellerman

Hi Michael,

On Wed, Jul 15, 2015 at 04:06:18AM +0100, Michael Ellerman wrote:
> On Tue, 2015-07-14 at 08:31 +1000, Benjamin Herrenschmidt wrote:
> > On Mon, 2015-07-13 at 13:15 +0100, Will Deacon wrote:
> > > This didn't go anywhere last time I posted it, but here it is again.
> > > I'd really appreciate some feedback from the PowerPC guys, especially as
> > > to whether this change requires them to add an additional barrier in
> > > arch_spin_unlock and what the cost of that would be.
> > 
> > We'd have to turn the lwsync in unlock or the isync in lock into a full
> > barrier. As it is, we *almost* have a full barrier semantic, but not
> > quite, as in things can get mixed up inside spin_lock between the LL and
> > the SC (things leaking in past LL and things leaking "out" up before SC
> > and then getting mixed up in there).
> > 
> > Michael, at some point you were experimenting a bit with that and tried
> > to get some perf numbers of the impact that would have, did that
> > solidify ? Otherwise, I'll have a look when I'm back next week.
> 
> I was mainly experimenting with replacing the lwsync in lock with an isync.
> 
> But I think you're talking about making it a full sync in lock.
> 
> That was about +7% on p8, +25% on p7 and +88% on p6.

Ok, so that's overhead incurred by moving from isync -> lwsync? The numbers
look quite large...

> We got stuck deciding whether isync was safe to use as a memory barrier,
> because the wording in the arch is a bit vague.
> 
> But if we're talking about a full sync then I think there is no question that's
> OK and we should just do it.

Is this because there's a small overhead from lwsync -> sync? Just want to
make sure I follow your reasoning.

If you went the way of sync in unlock, could you remove the conditional
SYNC_IO stuff?

Will

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC PATCH v2] memory-barriers: remove smp_mb__after_unlock_lock()
  2015-07-15  1:38                             ` Paul E. McKenney
@ 2015-07-15 10:51                               ` Will Deacon
  2015-07-15 13:12                                 ` Paul E. McKenney
  0 siblings, 1 reply; 66+ messages in thread
From: Will Deacon @ 2015-07-15 10:51 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Peter Zijlstra, linux-arch, linux-kernel, Benjamin Herrenschmidt

Hi Paul,

On Wed, Jul 15, 2015 at 02:38:20AM +0100, Paul E. McKenney wrote:
> On Tue, Jul 14, 2015 at 12:31:44PM -0700, Paul E. McKenney wrote:
> > On Tue, Jul 14, 2015 at 03:12:16PM +0100, Will Deacon wrote:
> > > On Tue, Jul 14, 2015 at 03:00:14PM +0100, Paul E. McKenney wrote:
> > > > On Tue, Jul 14, 2015 at 01:51:46PM +0100, Will Deacon wrote:
> > > > > On Tue, Jul 14, 2015 at 01:45:40PM +0100, Paul E. McKenney wrote:
> > > > > > On Tue, Jul 14, 2015 at 11:04:29AM +0100, Will Deacon wrote:
> > > > > > > Given that RCU is currently the only user of this barrier, how would you
> > > > > > > feel about making the barrier local to RCU and not part of the general
> > > > > > > memory-barrier API?
> > > > > > 
> > > > > > In theory, no objection.  Your thought is to leave the definitions where
> > > > > > they are, mark them as being used only by RCU, and removing mention from
> > > > > > memory-barriers.txt?  Or did you have something else in mind?
> > > > > 
> > > > > Actually, I was thinking of defining them in an RCU header file with an
> > > > > #ifdef CONFIG_POWERPC for the smb_mb() version. Then you could have a big
> > > > > comment describing the semantics, or put that in an RCU Documentation file
> > > > > instead of memory-barriers.txt.
> > > > > 
> > > > > That *should* then mean we notice anybody else trying to use the barrier,
> > > > > because they'd need to send patches to either add something equivalent
> > > > > or move the definition out again.
> > > > 
> > > > My concern with this approach is that someone putting together a new
> > > > architecture might miss this.  That said, this approach certainly would
> > > > work for the current architectures.
> > > 
> > > I don't think they're any more likely to miss it than with the current
> > > situation where the generic code defines the macro as a NOP unless you
> > > explicitly override it.
> > 
> > Fair enough...
> 
> Like this?

Precisely! Thanks for cooking the patch -- this lays all my worries to
rest, so:

  Acked-by: Will Deacon <will.deacon@arm.com>

We should continue the discussion with Ben and Michael about whether or
not the PowerPC locking code can be strengthened, though (making this
barrier a NOP on all currently supported archs).

Will

> commit 695c05d4b9666c50b40a1c022678b5f6e2e3e771
> Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> Date:   Tue Jul 14 18:35:23 2015 -0700
> 
>     rcu,locking: Privatize smp_mb__after_unlock_lock()
>     
>     RCU is the only thing that uses smp_mb__after_unlock_lock(), and is
>     likely the only thing that ever will use it, so this commit makes this
>     macro private to RCU.
>     
>     Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
>     Cc: Will Deacon <will.deacon@arm.com>
>     Cc: Peter Zijlstra <peterz@infradead.org>
>     Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
>     Cc: "linux-arch@vger.kernel.org" <linux-arch@vger.kernel.org>
> 
> diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
> index 318523872db5..eafa6a53f72c 100644
> --- a/Documentation/memory-barriers.txt
> +++ b/Documentation/memory-barriers.txt
> @@ -1854,16 +1854,10 @@ RELEASE are to the same lock variable, but only from the perspective of
>  another CPU not holding that lock.  In short, a ACQUIRE followed by an
>  RELEASE may -not- be assumed to be a full memory barrier.
>  
> -Similarly, the reverse case of a RELEASE followed by an ACQUIRE does not
> -imply a full memory barrier.  If it is necessary for a RELEASE-ACQUIRE
> -pair to produce a full barrier, the ACQUIRE can be followed by an
> -smp_mb__after_unlock_lock() invocation.  This will produce a full barrier
> -(including transitivity) if either (a) the RELEASE and the ACQUIRE are
> -executed by the same CPU or task, or (b) the RELEASE and ACQUIRE act on
> -the same variable.  The smp_mb__after_unlock_lock() primitive is free
> -on many architectures.  Without smp_mb__after_unlock_lock(), the CPU's
> -execution of the critical sections corresponding to the RELEASE and the
> -ACQUIRE can cross, so that:
> +Similarly, the reverse case of a RELEASE followed by an ACQUIRE does
> +not imply a full memory barrier.  Therefore, the CPU's execution of the
> +critical sections corresponding to the RELEASE and the ACQUIRE can cross,
> +so that:
>  
>  	*A = a;
>  	RELEASE M
> @@ -1901,29 +1895,6 @@ the RELEASE would simply complete, thereby avoiding the deadlock.
>  	a sleep-unlock race, but the locking primitive needs to resolve
>  	such races properly in any case.
>  
> -With smp_mb__after_unlock_lock(), the two critical sections cannot overlap.
> -For example, with the following code, the store to *A will always be
> -seen by other CPUs before the store to *B:
> -
> -	*A = a;
> -	RELEASE M
> -	ACQUIRE N
> -	smp_mb__after_unlock_lock();
> -	*B = b;
> -
> -The operations will always occur in one of the following orders:
> -
> -	STORE *A, RELEASE, ACQUIRE, smp_mb__after_unlock_lock(), STORE *B
> -	STORE *A, ACQUIRE, RELEASE, smp_mb__after_unlock_lock(), STORE *B
> -	ACQUIRE, STORE *A, RELEASE, smp_mb__after_unlock_lock(), STORE *B
> -
> -If the RELEASE and ACQUIRE were instead both operating on the same lock
> -variable, only the first of these alternatives can occur.  In addition,
> -the more strongly ordered systems may rule out some of the above orders.
> -But in any case, as noted earlier, the smp_mb__after_unlock_lock()
> -ensures that the store to *A will always be seen as happening before
> -the store to *B.
> -
>  Locks and semaphores may not provide any guarantee of ordering on UP compiled
>  systems, and so cannot be counted on in such a situation to actually achieve
>  anything at all - especially with respect to I/O accesses - unless combined
> @@ -2154,40 +2125,6 @@ But it won't see any of:
>  	*E, *F or *G following RELEASE Q
>  
>  
> -However, if the following occurs:
> -
> -	CPU 1				CPU 2
> -	===============================	===============================
> -	WRITE_ONCE(*A, a);
> -	ACQUIRE M		     [1]
> -	WRITE_ONCE(*B, b);
> -	WRITE_ONCE(*C, c);
> -	RELEASE M	     [1]
> -	WRITE_ONCE(*D, d);		WRITE_ONCE(*E, e);
> -					ACQUIRE M		     [2]
> -					smp_mb__after_unlock_lock();
> -					WRITE_ONCE(*F, f);
> -					WRITE_ONCE(*G, g);
> -					RELEASE M	     [2]
> -					WRITE_ONCE(*H, h);
> -
> -CPU 3 might see:
> -
> -	*E, ACQUIRE M [1], *C, *B, *A, RELEASE M [1],
> -		ACQUIRE M [2], *H, *F, *G, RELEASE M [2], *D
> -
> -But assuming CPU 1 gets the lock first, CPU 3 won't see any of:
> -
> -	*B, *C, *D, *F, *G or *H preceding ACQUIRE M [1]
> -	*A, *B or *C following RELEASE M [1]
> -	*F, *G or *H preceding ACQUIRE M [2]
> -	*A, *B, *C, *E, *F or *G following RELEASE M [2]
> -
> -Note that the smp_mb__after_unlock_lock() is critically important
> -here: Without it CPU 3 might see some of the above orderings.
> -Without smp_mb__after_unlock_lock(), the accesses are not guaranteed
> -to be seen in order unless CPU 3 holds lock M.
> -
>  
>  ACQUIRES VS I/O ACCESSES
>  ------------------------
> diff --git a/arch/powerpc/include/asm/spinlock.h b/arch/powerpc/include/asm/spinlock.h
> index 4dbe072eecbe..523673d7583c 100644
> --- a/arch/powerpc/include/asm/spinlock.h
> +++ b/arch/powerpc/include/asm/spinlock.h
> @@ -28,8 +28,6 @@
>  #include <asm/synch.h>
>  #include <asm/ppc-opcode.h>
>  
> -#define smp_mb__after_unlock_lock()	smp_mb()  /* Full ordering for lock. */
> -
>  #ifdef CONFIG_PPC64
>  /* use 0x800000yy when locked, where yy == CPU number */
>  #ifdef __BIG_ENDIAN__
> diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
> index 80d974df0ea0..a9fea7395ba2 100644
> --- a/kernel/rcu/tree.h
> +++ b/kernel/rcu/tree.h
> @@ -645,3 +645,15 @@ static inline void rcu_nocb_q_lengths(struct rcu_data *rdp, long *ql, long *qll)
>  #endif /* #else #ifdef CONFIG_RCU_NOCB_CPU */
>  }
>  #endif /* #ifdef CONFIG_RCU_TRACE */
> +
> +/*
> + * Place this after a lock-acquisition primitive to guarantee that
> + * an UNLOCK+LOCK pair act as a full barrier.  This guarantee applies
> + * if the UNLOCK and LOCK are executed by the same CPU or if the
> + * UNLOCK and LOCK operate on the same lock variable.
> + */
> +#ifdef CONFIG_PPC
> +#define smp_mb__after_unlock_lock()	smp_mb()  /* Full ordering for lock. */
> +#else /* #ifdef CONFIG_PPC */
> +#define smp_mb__after_unlock_lock()	do { } while (0)
> +#endif /* #else #ifdef CONFIG_PPC */
> 

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC PATCH v2] memory-barriers: remove smp_mb__after_unlock_lock()
  2015-07-15 10:51                               ` Will Deacon
@ 2015-07-15 13:12                                 ` Paul E. McKenney
  2015-07-24 11:31                                   ` Will Deacon
  0 siblings, 1 reply; 66+ messages in thread
From: Paul E. McKenney @ 2015-07-15 13:12 UTC (permalink / raw)
  To: Will Deacon
  Cc: Peter Zijlstra, linux-arch, linux-kernel, Benjamin Herrenschmidt

On Wed, Jul 15, 2015 at 11:51:35AM +0100, Will Deacon wrote:
> Hi Paul,
> 
> On Wed, Jul 15, 2015 at 02:38:20AM +0100, Paul E. McKenney wrote:
> > On Tue, Jul 14, 2015 at 12:31:44PM -0700, Paul E. McKenney wrote:
> > > On Tue, Jul 14, 2015 at 03:12:16PM +0100, Will Deacon wrote:
> > > > On Tue, Jul 14, 2015 at 03:00:14PM +0100, Paul E. McKenney wrote:
> > > > > On Tue, Jul 14, 2015 at 01:51:46PM +0100, Will Deacon wrote:
> > > > > > On Tue, Jul 14, 2015 at 01:45:40PM +0100, Paul E. McKenney wrote:
> > > > > > > On Tue, Jul 14, 2015 at 11:04:29AM +0100, Will Deacon wrote:
> > > > > > > > Given that RCU is currently the only user of this barrier, how would you
> > > > > > > > feel about making the barrier local to RCU and not part of the general
> > > > > > > > memory-barrier API?
> > > > > > > 
> > > > > > > In theory, no objection.  Your thought is to leave the definitions where
> > > > > > > they are, mark them as being used only by RCU, and removing mention from
> > > > > > > memory-barriers.txt?  Or did you have something else in mind?
> > > > > > 
> > > > > > Actually, I was thinking of defining them in an RCU header file with an
> > > > > > #ifdef CONFIG_POWERPC for the smb_mb() version. Then you could have a big
> > > > > > comment describing the semantics, or put that in an RCU Documentation file
> > > > > > instead of memory-barriers.txt.
> > > > > > 
> > > > > > That *should* then mean we notice anybody else trying to use the barrier,
> > > > > > because they'd need to send patches to either add something equivalent
> > > > > > or move the definition out again.
> > > > > 
> > > > > My concern with this approach is that someone putting together a new
> > > > > architecture might miss this.  That said, this approach certainly would
> > > > > work for the current architectures.
> > > > 
> > > > I don't think they're any more likely to miss it than with the current
> > > > situation where the generic code defines the macro as a NOP unless you
> > > > explicitly override it.
> > > 
> > > Fair enough...
> > 
> > Like this?
> 
> Precisely! Thanks for cooking the patch -- this lays all my worries to
> rest, so:
> 
>   Acked-by: Will Deacon <will.deacon@arm.com>

Thank you!

> We should continue the discussion with Ben and Michael about whether or
> not the PowerPC locking code can be strengthened, though (making this
> barrier a NOP on all currently supported archs).

Indeed -- if it becomes a NOP on all supported architectures, we might
want to consider just removing it completely.

							Thanx, Paul

> Will
> 
> > commit 695c05d4b9666c50b40a1c022678b5f6e2e3e771
> > Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> > Date:   Tue Jul 14 18:35:23 2015 -0700
> > 
> >     rcu,locking: Privatize smp_mb__after_unlock_lock()
> >     
> >     RCU is the only thing that uses smp_mb__after_unlock_lock(), and is
> >     likely the only thing that ever will use it, so this commit makes this
> >     macro private to RCU.
> >     
> >     Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> >     Cc: Will Deacon <will.deacon@arm.com>
> >     Cc: Peter Zijlstra <peterz@infradead.org>
> >     Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> >     Cc: "linux-arch@vger.kernel.org" <linux-arch@vger.kernel.org>
> > 
> > diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
> > index 318523872db5..eafa6a53f72c 100644
> > --- a/Documentation/memory-barriers.txt
> > +++ b/Documentation/memory-barriers.txt
> > @@ -1854,16 +1854,10 @@ RELEASE are to the same lock variable, but only from the perspective of
> >  another CPU not holding that lock.  In short, a ACQUIRE followed by an
> >  RELEASE may -not- be assumed to be a full memory barrier.
> >  
> > -Similarly, the reverse case of a RELEASE followed by an ACQUIRE does not
> > -imply a full memory barrier.  If it is necessary for a RELEASE-ACQUIRE
> > -pair to produce a full barrier, the ACQUIRE can be followed by an
> > -smp_mb__after_unlock_lock() invocation.  This will produce a full barrier
> > -(including transitivity) if either (a) the RELEASE and the ACQUIRE are
> > -executed by the same CPU or task, or (b) the RELEASE and ACQUIRE act on
> > -the same variable.  The smp_mb__after_unlock_lock() primitive is free
> > -on many architectures.  Without smp_mb__after_unlock_lock(), the CPU's
> > -execution of the critical sections corresponding to the RELEASE and the
> > -ACQUIRE can cross, so that:
> > +Similarly, the reverse case of a RELEASE followed by an ACQUIRE does
> > +not imply a full memory barrier.  Therefore, the CPU's execution of the
> > +critical sections corresponding to the RELEASE and the ACQUIRE can cross,
> > +so that:
> >  
> >  	*A = a;
> >  	RELEASE M
> > @@ -1901,29 +1895,6 @@ the RELEASE would simply complete, thereby avoiding the deadlock.
> >  	a sleep-unlock race, but the locking primitive needs to resolve
> >  	such races properly in any case.
> >  
> > -With smp_mb__after_unlock_lock(), the two critical sections cannot overlap.
> > -For example, with the following code, the store to *A will always be
> > -seen by other CPUs before the store to *B:
> > -
> > -	*A = a;
> > -	RELEASE M
> > -	ACQUIRE N
> > -	smp_mb__after_unlock_lock();
> > -	*B = b;
> > -
> > -The operations will always occur in one of the following orders:
> > -
> > -	STORE *A, RELEASE, ACQUIRE, smp_mb__after_unlock_lock(), STORE *B
> > -	STORE *A, ACQUIRE, RELEASE, smp_mb__after_unlock_lock(), STORE *B
> > -	ACQUIRE, STORE *A, RELEASE, smp_mb__after_unlock_lock(), STORE *B
> > -
> > -If the RELEASE and ACQUIRE were instead both operating on the same lock
> > -variable, only the first of these alternatives can occur.  In addition,
> > -the more strongly ordered systems may rule out some of the above orders.
> > -But in any case, as noted earlier, the smp_mb__after_unlock_lock()
> > -ensures that the store to *A will always be seen as happening before
> > -the store to *B.
> > -
> >  Locks and semaphores may not provide any guarantee of ordering on UP compiled
> >  systems, and so cannot be counted on in such a situation to actually achieve
> >  anything at all - especially with respect to I/O accesses - unless combined
> > @@ -2154,40 +2125,6 @@ But it won't see any of:
> >  	*E, *F or *G following RELEASE Q
> >  
> >  
> > -However, if the following occurs:
> > -
> > -	CPU 1				CPU 2
> > -	===============================	===============================
> > -	WRITE_ONCE(*A, a);
> > -	ACQUIRE M		     [1]
> > -	WRITE_ONCE(*B, b);
> > -	WRITE_ONCE(*C, c);
> > -	RELEASE M	     [1]
> > -	WRITE_ONCE(*D, d);		WRITE_ONCE(*E, e);
> > -					ACQUIRE M		     [2]
> > -					smp_mb__after_unlock_lock();
> > -					WRITE_ONCE(*F, f);
> > -					WRITE_ONCE(*G, g);
> > -					RELEASE M	     [2]
> > -					WRITE_ONCE(*H, h);
> > -
> > -CPU 3 might see:
> > -
> > -	*E, ACQUIRE M [1], *C, *B, *A, RELEASE M [1],
> > -		ACQUIRE M [2], *H, *F, *G, RELEASE M [2], *D
> > -
> > -But assuming CPU 1 gets the lock first, CPU 3 won't see any of:
> > -
> > -	*B, *C, *D, *F, *G or *H preceding ACQUIRE M [1]
> > -	*A, *B or *C following RELEASE M [1]
> > -	*F, *G or *H preceding ACQUIRE M [2]
> > -	*A, *B, *C, *E, *F or *G following RELEASE M [2]
> > -
> > -Note that the smp_mb__after_unlock_lock() is critically important
> > -here: Without it CPU 3 might see some of the above orderings.
> > -Without smp_mb__after_unlock_lock(), the accesses are not guaranteed
> > -to be seen in order unless CPU 3 holds lock M.
> > -
> >  
> >  ACQUIRES VS I/O ACCESSES
> >  ------------------------
> > diff --git a/arch/powerpc/include/asm/spinlock.h b/arch/powerpc/include/asm/spinlock.h
> > index 4dbe072eecbe..523673d7583c 100644
> > --- a/arch/powerpc/include/asm/spinlock.h
> > +++ b/arch/powerpc/include/asm/spinlock.h
> > @@ -28,8 +28,6 @@
> >  #include <asm/synch.h>
> >  #include <asm/ppc-opcode.h>
> >  
> > -#define smp_mb__after_unlock_lock()	smp_mb()  /* Full ordering for lock. */
> > -
> >  #ifdef CONFIG_PPC64
> >  /* use 0x800000yy when locked, where yy == CPU number */
> >  #ifdef __BIG_ENDIAN__
> > diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
> > index 80d974df0ea0..a9fea7395ba2 100644
> > --- a/kernel/rcu/tree.h
> > +++ b/kernel/rcu/tree.h
> > @@ -645,3 +645,15 @@ static inline void rcu_nocb_q_lengths(struct rcu_data *rdp, long *ql, long *qll)
> >  #endif /* #else #ifdef CONFIG_RCU_NOCB_CPU */
> >  }
> >  #endif /* #ifdef CONFIG_RCU_TRACE */
> > +
> > +/*
> > + * Place this after a lock-acquisition primitive to guarantee that
> > + * an UNLOCK+LOCK pair act as a full barrier.  This guarantee applies
> > + * if the UNLOCK and LOCK are executed by the same CPU or if the
> > + * UNLOCK and LOCK operate on the same lock variable.
> > + */
> > +#ifdef CONFIG_PPC
> > +#define smp_mb__after_unlock_lock()	smp_mb()  /* Full ordering for lock. */
> > +#else /* #ifdef CONFIG_PPC */
> > +#define smp_mb__after_unlock_lock()	do { } while (0)
> > +#endif /* #else #ifdef CONFIG_PPC */
> > 
> 


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC PATCH v2] memory-barriers: remove smp_mb__after_unlock_lock()
  2015-07-15  3:06   ` Michael Ellerman
  2015-07-15 10:44     ` Will Deacon
@ 2015-07-15 14:18     ` Paul E. McKenney
  2015-07-16  1:34       ` Michael Ellerman
  1 sibling, 1 reply; 66+ messages in thread
From: Paul E. McKenney @ 2015-07-15 14:18 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Benjamin Herrenschmidt, Will Deacon, linux-arch, linux-kernel,
	Peter Zijlstra, Michael Ellerman

On Wed, Jul 15, 2015 at 01:06:18PM +1000, Michael Ellerman wrote:
> On Tue, 2015-07-14 at 08:31 +1000, Benjamin Herrenschmidt wrote:
> > On Mon, 2015-07-13 at 13:15 +0100, Will Deacon wrote:
> > > smp_mb__after_unlock_lock is used to promote an UNLOCK + LOCK sequence
> > > into a full memory barrier.
> > > 
> > > However:
> > > 
> > >   - This ordering guarantee is already provided without the barrier on
> > >     all architectures apart from PowerPC
> > > 
> > >   - The barrier only applies to UNLOCK + LOCK, not general
> > >     RELEASE + ACQUIRE operations
> > > 
> > >   - Locks are generally assumed to offer SC ordering semantics, so
> > >     having this additional barrier is error-prone and complicates the
> > >     callers of LOCK/UNLOCK primitives
> > > 
> > >   - The barrier is not well used outside of RCU and, because it was
> > >     retrofitted into the kernel, it's not clear whether other areas of
> > >     the kernel are incorrectly relying on UNLOCK + LOCK implying a full
> > >     barrier
> > > 
> > > This patch removes the barrier and instead requires architectures to
> > > provide full barrier semantics for an UNLOCK + LOCK sequence.
> > > 
> > > Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> > > Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
> > > Cc: Peter Zijlstra <peterz@infradead.org>
> > > Signed-off-by: Will Deacon <will.deacon@arm.com>
> > > ---
> > > 
> > > This didn't go anywhere last time I posted it, but here it is again.
> > > I'd really appreciate some feedback from the PowerPC guys, especially as
> > > to whether this change requires them to add an additional barrier in
> > > arch_spin_unlock and what the cost of that would be.
> > 
> > We'd have to turn the lwsync in unlock or the isync in lock into a full
> > barrier. As it is, we *almost* have a full barrier semantic, but not
> > quite, as in things can get mixed up inside spin_lock between the LL and
> > the SC (things leaking in past LL and things leaking "out" up before SC
> > and then getting mixed up in there).
> > 
> > Michael, at some point you were experimenting a bit with that and tried
> > to get some perf numbers of the impact that would have, did that
> > solidify ? Otherwise, I'll have a look when I'm back next week.
> 
> I was mainly experimenting with replacing the lwsync in lock with an isync.
> 
> But I think you're talking about making it a full sync in lock.
> 
> That was about +7% on p8, +25% on p7 and +88% on p6.

Just for completeness, what were you running as benchmark?  ;-)

							Thanx, Paul

> We got stuck deciding whether isync was safe to use as a memory barrier,
> because the wording in the arch is a bit vague.
> 
> But if we're talking about a full sync then I think there is no question that's
> OK and we should just do it.
> 
> cheers
> 
> 


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC PATCH v2] memory-barriers: remove smp_mb__after_unlock_lock()
  2015-07-15 14:18     ` Paul E. McKenney
@ 2015-07-16  1:34       ` Michael Ellerman
  0 siblings, 0 replies; 66+ messages in thread
From: Michael Ellerman @ 2015-07-16  1:34 UTC (permalink / raw)
  To: paulmck
  Cc: Benjamin Herrenschmidt, Will Deacon, linux-arch, linux-kernel,
	Peter Zijlstra

On Wed, 2015-07-15 at 07:18 -0700, Paul E. McKenney wrote:
> On Wed, Jul 15, 2015 at 01:06:18PM +1000, Michael Ellerman wrote:
> > On Tue, 2015-07-14 at 08:31 +1000, Benjamin Herrenschmidt wrote:
> > > Michael, at some point you were experimenting a bit with that and tried
> > > to get some perf numbers of the impact that would have, did that
> > > solidify ? Otherwise, I'll have a look when I'm back next week.
> > 
> > I was mainly experimenting with replacing the lwsync in lock with an isync.
> > 
> > But I think you're talking about making it a full sync in lock.
> > 
> > That was about +7% on p8, +25% on p7 and +88% on p6.
> 
> Just for completeness, what were you running as benchmark?  ;-)

Heh sorry :)

That was my lockcomparison benchmark, based on Anton's original. It just does
100,000,000 lock/unlocks for each type in userspace:

  https://github.com/mpe/lockcomparison/blob/master/lock_comparison.c
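
For readers without the repo handy, the measurement is roughly of the following
shape (a standalone sketch, not the actual lockcomparison code, which hand-codes
the lwsync/isync/sync variants; a pthread spinlock stands in here):

	/* Rough userspace sketch: time N back-to-back lock/unlock pairs. */
	#include <pthread.h>
	#include <stdio.h>
	#include <time.h>

	#define ITERATIONS 100000000UL

	int main(void)
	{
		pthread_spinlock_t lock;
		struct timespec start, end;
		unsigned long i;

		pthread_spin_init(&lock, PTHREAD_PROCESS_PRIVATE);
		clock_gettime(CLOCK_MONOTONIC, &start);
		for (i = 0; i < ITERATIONS; i++) {
			pthread_spin_lock(&lock);
			pthread_spin_unlock(&lock);
		}
		clock_gettime(CLOCK_MONOTONIC, &end);
		printf("%.1f ns per lock/unlock\n",
		       ((end.tv_sec - start.tv_sec) * 1e9 +
			(end.tv_nsec - start.tv_nsec)) / ITERATIONS);
		pthread_spin_destroy(&lock);
		return 0;
	}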

cheers



^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC PATCH v2] memory-barriers: remove smp_mb__after_unlock_lock()
  2015-07-15 10:44     ` Will Deacon
@ 2015-07-16  2:00       ` Michael Ellerman
  2015-07-16  5:03         ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 66+ messages in thread
From: Michael Ellerman @ 2015-07-16  2:00 UTC (permalink / raw)
  To: Will Deacon
  Cc: Benjamin Herrenschmidt, linux-arch, linux-kernel, Paul McKenney,
	Peter Zijlstra

On Wed, 2015-07-15 at 11:44 +0100, Will Deacon wrote:
> Hi Michael,
> 
> On Wed, Jul 15, 2015 at 04:06:18AM +0100, Michael Ellerman wrote:
> > On Tue, 2015-07-14 at 08:31 +1000, Benjamin Herrenschmidt wrote:
> > > On Mon, 2015-07-13 at 13:15 +0100, Will Deacon wrote:
> > > > This didn't go anywhere last time I posted it, but here it is again.
> > > > I'd really appreciate some feedback from the PowerPC guys, especially as
> > > > to whether this change requires them to add an additional barrier in
> > > > arch_spin_unlock and what the cost of that would be.
> > > 
> > > We'd have to turn the lwsync in unlock or the isync in lock into a full
> > > barrier. As it is, we *almost* have a full barrier semantic, but not
> > > quite, as in things can get mixed up inside spin_lock between the LL and
> > > the SC (things leaking in past LL and things leaking "out" up before SC
> > > and then getting mixed up in there).
> > > 
> > > Michael, at some point you were experimenting a bit with that and tried
> > > to get some perf numbers of the impact that would have, did that
> > > solidify ? Otherwise, I'll have a look when I'm back next week.
> > 
> > I was mainly experimenting with replacing the lwsync in lock with an isync.
> > 
> > But I think you're talking about making it a full sync in lock.
> > 
> > That was about +7% on p8, +25% on p7 and +88% on p6.
> 
> Ok, so that's overhead incurred by moving from isync -> lwsync? The numbers
> look quite large...

Sorry no.

Currently we use lwsync in lock. You'll see isync in the code
(PPC_ACQUIRE_BARRIER), but on most modern CPUs that is patched at runtime to
lwsync.

I benchmarked lwsync (current), isync (proposed at the time) and sync (just for
comparison).

The numbers above are going from lwsync -> sync, as I thought that was what Ben
was talking about.

Going from lwsync to isync was actually a small speedup on power8, which was
surprising.

> > We got stuck deciding whether isync was safe to use as a memory barrier,
> > because the wording in the arch is a bit vague.
> > 
> > But if we're talking about a full sync then I think there is no question that's
> > OK and we should just do it.
> 
> Is this because there's a small overhead from lwsync -> sync? Just want to
> make sure I follow your reasoning.

No, I mean that sync is clearly a memory barrier. The issue with switching to
isync in lock was that it's not a memory barrier per se, so we were not 100%
confident in it.

> If you went the way of sync in unlock, could you remove the conditional
> SYNC_IO stuff?

Yeah we could, it's just a conditional sync in unlock when mmio has been done.

That would fix the problem with smp_mb__after_unlock_lock(), but not the
original worry we had about loads happening before the SC in lock.

cheers



^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC PATCH v2] memory-barriers: remove smp_mb__after_unlock_lock()
  2015-07-16  2:00       ` Michael Ellerman
@ 2015-07-16  5:03         ` Benjamin Herrenschmidt
  2015-07-16  5:14           ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 66+ messages in thread
From: Benjamin Herrenschmidt @ 2015-07-16  5:03 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Will Deacon, linux-arch, linux-kernel, Paul McKenney, Peter Zijlstra

On Thu, 2015-07-16 at 12:00 +1000, Michael Ellerman wrote:
> That would fix the problem with smp_mb__after_unlock_lock(), but not
> the original worry we had about loads happening before the SC in lock.

However I think isync fixes *that* :-) The problem with isync is as you
said, it's not a -memory- barrier per-se, it's an execution barrier /
context synchronizing instruction. The combination stwcx. + bne + isync
however prevents the execution of anything past the isync until the
stwcx has completed and the bne has been "decided", which prevents loads
from leaking into the LL/SC loop. It will also prevent a store in the
lock from being issued before the stwcx. has completed. It does *not*
prevent as far as I can tell another unrelated store before the lock
from leaking into the lock, including the one used to unlock a different
lock.
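
To make the sequence being described concrete, a stripped-down sketch of the
LL/SC-plus-isync acquire (PowerPC-only and purely illustrative; not the kernel's
__arch_spin_trylock, which also has the per-CPU lock token and feature fixups):

	/* Simplified trylock: lwarx/stwcx. loop with an isync "acquire". */
	static inline int ppc_trylock_isync(unsigned int *lock)
	{
		unsigned int old;

		__asm__ __volatile__(
	"1:	lwarx	%0,0,%1\n"	/* load-reserve the lock word */
	"	cmpwi	0,%0,0\n"	/* already held? */
	"	bne-	2f\n"
	"	stwcx.	%2,0,%1\n"	/* try to store the locked value */
	"	bne-	1b\n"		/* lost the reservation: retry */
	"	isync\n"		/* nothing later executes until the bne is decided */
	"2:"
		: "=&r" (old)
		: "r" (lock), "r" (1)
		: "cr0", "memory");

		return old == 0;	/* nonzero old means the lock was already taken */
	}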

Cheers,
Ben.



^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC PATCH v2] memory-barriers: remove smp_mb__after_unlock_lock()
  2015-07-16  5:03         ` Benjamin Herrenschmidt
@ 2015-07-16  5:14           ` Benjamin Herrenschmidt
  2015-07-16 15:11             ` Paul E. McKenney
  2015-09-01  2:57             ` Paul Mackerras
  0 siblings, 2 replies; 66+ messages in thread
From: Benjamin Herrenschmidt @ 2015-07-16  5:14 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Will Deacon, linux-arch, linux-kernel, Paul McKenney, Peter Zijlstra

On Thu, 2015-07-16 at 15:03 +1000, Benjamin Herrenschmidt wrote:
> On Thu, 2015-07-16 at 12:00 +1000, Michael Ellerman wrote:
> > That would fix the problem with smp_mb__after_unlock_lock(), but not
> > the original worry we had about loads happening before the SC in lock.
> 
> However I think isync fixes *that* :-) The problem with isync is as you
> said, it's not a -memory- barrier per-se, it's an execution barrier /
> context synchronizing instruction. The combination stwcx. + bne + isync
> however prevents the execution of anything past the isync until the
> stwcx has completed and the bne has been "decided", which prevents loads
> from leaking into the LL/SC loop. It will also prevent a store in the
> lock from being issued before the stwcx. has completed. It does *not*
> prevent as far as I can tell another unrelated store before the lock
> from leaking into the lock, including the one used to unlock a different
> lock.

Except that the architecture says:

<<
Because a Store Conditional instruction may complete before its store
has been performed, a conditional Branch instruction that depends on
the CR0 value set by a Store Conditional instruction does not order the
Store Conditional's store with respect to storage accesses caused by
instructions that follow the Branch
>>

So isync in lock is architecturally incorrect, despite being what the
architecture recommends using, yay!

Ben.



^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC PATCH v2] memory-barriers: remove smp_mb__after_unlock_lock()
  2015-07-16  5:14           ` Benjamin Herrenschmidt
@ 2015-07-16 15:11             ` Paul E. McKenney
  2015-07-16 22:54               ` Benjamin Herrenschmidt
  2015-09-01  2:57             ` Paul Mackerras
  1 sibling, 1 reply; 66+ messages in thread
From: Paul E. McKenney @ 2015-07-16 15:11 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Michael Ellerman, Will Deacon, linux-arch, linux-kernel, Peter Zijlstra

On Thu, Jul 16, 2015 at 03:14:55PM +1000, Benjamin Herrenschmidt wrote:
> On Thu, 2015-07-16 at 15:03 +1000, Benjamin Herrenschmidt wrote:
> > On Thu, 2015-07-16 at 12:00 +1000, Michael Ellerman wrote:
> > > That would fix the problem with smp_mb__after_unlock_lock(), but not
> > > the original worry we had about loads happening before the SC in lock.
> > 
> > However I think isync fixes *that* :-) The problem with isync is as you
> > said, it's not a -memory- barrier per-se, it's an execution barrier /
> > context synchronizing instruction. The combination stwcx. + bne + isync
> > however prevents the execution of anything past the isync until the
> > stwcx has completed and the bne has been "decided", which prevents loads
> > from leaking into the LL/SC loop. It will also prevent a store in the
> > lock from being issued before the stwcx. has completed. It does *not*
> > prevent as far as I can tell another unrelated store before the lock
> > from leaking into the lock, including the one used to unlock a different
> > lock.
> 
> Except that the architecture says:
> 
> <<
> Because a Store Conditional instruction may complete before its store
> has been performed, a conditional Branch instruction that depends on
> the CR0 value set by a Store Conditional instruction does not order the
> Store Conditional's store with respect to storage accesses caused by
> instructions that follow the Branch
> >>
> 
> So isync in lock in architecturally incorrect, despite being what the
> architecture recommends using, yay !

Well, the architecture isn't expecting that crazies like myself would
want to have an unlock-lock provide ordering to some CPU not holding
the lock.  :-/

							Thanx, Paul


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC PATCH v2] memory-barriers: remove smp_mb__after_unlock_lock()
  2015-07-16 15:11             ` Paul E. McKenney
@ 2015-07-16 22:54               ` Benjamin Herrenschmidt
  2015-07-17  9:32                 ` Will Deacon
  0 siblings, 1 reply; 66+ messages in thread
From: Benjamin Herrenschmidt @ 2015-07-16 22:54 UTC (permalink / raw)
  To: paulmck
  Cc: Michael Ellerman, Will Deacon, linux-arch, linux-kernel, Peter Zijlstra

On Thu, 2015-07-16 at 08:11 -0700, Paul E. McKenney wrote:
> > So isync in lock in architecturally incorrect, despite being what
> the
> > architecture recommends using, yay !
> 
> Well, the architecture isn't expecting that crazies like myself would
> want to have an unlock-lock provide ordering to some CPU not holding
> the lock.  :-/

Yes, isync in lock effectively allows any load or store before the lock
to leak into the lock and get re-ordered with things in there.

lwsync leaves us exposed to the re-order inside the LL/SC of a
subsequent load.

So to get the full barrier semantic, the only option is a full sync,
either in lock or unlock. Instinctively I prefer in lock but there's
an argument to have it in unlock so we can get rid of the iosync
business.
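
A sketch of the "full sync in unlock" option being weighed here (hypothetical,
on a bare lock word rather than arch_spinlock_t; the real unlock also carries
the SYNC_IO bookkeeping that the diff further down removes):

	/* Hypothetical unlock with a full sync: makes a prior UNLOCK + later
	 * LOCK pair a full barrier and also orders prior MMIO stores, which
	 * is what would let the io_sync/mmiowb() machinery go away. */
	static inline void ppc_unlock_full_sync(unsigned int *lock)
	{
		__asm__ __volatile__("sync" : : : "memory");	/* instead of lwsync */
		*(volatile unsigned int *)lock = 0;		/* drop the lock */
	}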

Ben.



^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC PATCH v2] memory-barriers: remove smp_mb__after_unlock_lock()
  2015-07-16 22:54               ` Benjamin Herrenschmidt
@ 2015-07-17  9:32                 ` Will Deacon
  2015-07-17 10:15                   ` Peter Zijlstra
  2015-07-17 22:14                   ` Benjamin Herrenschmidt
  0 siblings, 2 replies; 66+ messages in thread
From: Will Deacon @ 2015-07-17  9:32 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: paulmck, Michael Ellerman, linux-arch, linux-kernel, Peter Zijlstra

On Thu, Jul 16, 2015 at 11:54:25PM +0100, Benjamin Herrenschmidt wrote:
> On Thu, 2015-07-16 at 08:11 -0700, Paul E. McKenney wrote:
> > > So isync in lock in architecturally incorrect, despite being what
> > the
> > > architecture recommends using, yay !
> > 
> > Well, the architecture isn't expecting that crazies like myself would
> > want to have an unlock-lock provide ordering to some CPU not holding
> > the lock.  :-/
> 
> Yes, isync in lock effectively allows any load or store before the lock
> to leak into the lock and get re-ordered with things in there.
> 
> lwsync leaves us exposed to the re-order inside the LL/SC of a
> subsequent load.
> 
> So to get the full barrier semantic, the only option is a full sync,
> either in lock or unlock. Instinctively I prefer in lock but there's
> an argument to have it in unlock so we can get rid of the iosync
> business.

Yeah, removing the iosync logic kills mmiowb() as well (totally untested
diff below and I was too scared to touch that paca thing!). It also sticks
the full barrier in one place, since there's no try_unlock to worry about.

Will

--->8

diff --git a/arch/powerpc/include/asm/io.h b/arch/powerpc/include/asm/io.h
index a8d2ef30d473..bb34f3bb66dc 100644
--- a/arch/powerpc/include/asm/io.h
+++ b/arch/powerpc/include/asm/io.h
@@ -105,12 +105,6 @@ extern bool isa_io_special;
  *
  */
 
-#ifdef CONFIG_PPC64
-#define IO_SET_SYNC_FLAG()	do { local_paca->io_sync = 1; } while(0)
-#else
-#define IO_SET_SYNC_FLAG()
-#endif
-
 /* gcc 4.0 and older doesn't have 'Z' constraint */
 #if __GNUC__ < 4 || (__GNUC__ == 4 && __GNUC_MINOR__ == 0)
 #define DEF_MMIO_IN_X(name, size, insn)				\
@@ -127,7 +121,6 @@ static inline void name(volatile u##size __iomem *addr, u##size val)	\
 {									\
 	__asm__ __volatile__("sync;"#insn" %1,0,%2"			\
 		: "=m" (*addr) : "r" (val), "r" (addr) : "memory");	\
-	IO_SET_SYNC_FLAG();						\
 }
 #else /* newer gcc */
 #define DEF_MMIO_IN_X(name, size, insn)				\
@@ -144,7 +137,6 @@ static inline void name(volatile u##size __iomem *addr, u##size val)	\
 {									\
 	__asm__ __volatile__("sync;"#insn" %1,%y0"			\
 		: "=Z" (*addr) : "r" (val) : "memory");			\
-	IO_SET_SYNC_FLAG();						\
 }
 #endif
 
@@ -162,7 +154,6 @@ static inline void name(volatile u##size __iomem *addr, u##size val)	\
 {									\
 	__asm__ __volatile__("sync;"#insn"%U0%X0 %1,%0"			\
 		: "=m" (*addr) : "r" (val) : "memory");			\
-	IO_SET_SYNC_FLAG();						\
 }
 
 DEF_MMIO_IN_D(in_8,     8, lbz);
@@ -626,23 +617,7 @@ static inline void name at					\
 #define writel_relaxed(v, addr)	writel(v, addr)
 #define writeq_relaxed(v, addr)	writeq(v, addr)
 
-#ifdef CONFIG_PPC32
 #define mmiowb()
-#else
-/*
- * Enforce synchronisation of stores vs. spin_unlock
- * (this does it explicitly, though our implementation of spin_unlock
- * does it implicitely too)
- */
-static inline void mmiowb(void)
-{
-	unsigned long tmp;
-
-	__asm__ __volatile__("sync; li %0,0; stb %0,%1(13)"
-	: "=&r" (tmp) : "i" (offsetof(struct paca_struct, io_sync))
-	: "memory");
-}
-#endif /* !CONFIG_PPC32 */
 
 static inline void iosync(void)
 {
diff --git a/arch/powerpc/include/asm/spinlock.h b/arch/powerpc/include/asm/spinlock.h
index 4dbe072eecbe..9da89ea4ff31 100644
--- a/arch/powerpc/include/asm/spinlock.h
+++ b/arch/powerpc/include/asm/spinlock.h
@@ -28,8 +28,6 @@
 #include <asm/synch.h>
 #include <asm/ppc-opcode.h>
 
-#define smp_mb__after_unlock_lock()	smp_mb()  /* Full ordering for lock. */
-
 #ifdef CONFIG_PPC64
 /* use 0x800000yy when locked, where yy == CPU number */
 #ifdef __BIG_ENDIAN__
@@ -41,19 +39,6 @@
 #define LOCK_TOKEN	1
 #endif
 
-#if defined(CONFIG_PPC64) && defined(CONFIG_SMP)
-#define CLEAR_IO_SYNC	(get_paca()->io_sync = 0)
-#define SYNC_IO		do {						\
-				if (unlikely(get_paca()->io_sync)) {	\
-					mb();				\
-					get_paca()->io_sync = 0;	\
-				}					\
-			} while (0)
-#else
-#define CLEAR_IO_SYNC
-#define SYNC_IO
-#endif
-
 static __always_inline int arch_spin_value_unlocked(arch_spinlock_t lock)
 {
 	return lock.slock == 0;
@@ -91,7 +76,6 @@ static inline unsigned long __arch_spin_trylock(arch_spinlock_t *lock)
 
 static inline int arch_spin_trylock(arch_spinlock_t *lock)
 {
-	CLEAR_IO_SYNC;
 	return __arch_spin_trylock(lock) == 0;
 }
 
@@ -122,7 +106,6 @@ extern void __rw_yield(arch_rwlock_t *lock);
 
 static inline void arch_spin_lock(arch_spinlock_t *lock)
 {
-	CLEAR_IO_SYNC;
 	while (1) {
 		if (likely(__arch_spin_trylock(lock) == 0))
 			break;
@@ -140,7 +123,6 @@ void arch_spin_lock_flags(arch_spinlock_t *lock, unsigned long flags)
 {
 	unsigned long flags_dis;
 
-	CLEAR_IO_SYNC;
 	while (1) {
 		if (likely(__arch_spin_trylock(lock) == 0))
 			break;
@@ -158,9 +140,7 @@ void arch_spin_lock_flags(arch_spinlock_t *lock, unsigned long flags)
 
 static inline void arch_spin_unlock(arch_spinlock_t *lock)
 {
-	SYNC_IO;
-	__asm__ __volatile__("# arch_spin_unlock\n\t"
-				PPC_RELEASE_BARRIER: : :"memory");
+	smp_mb();
 	lock->slock = 0;
 }
 

^ permalink raw reply related	[flat|nested] 66+ messages in thread

* Re: [RFC PATCH v2] memory-barriers: remove smp_mb__after_unlock_lock()
  2015-07-17  9:32                 ` Will Deacon
@ 2015-07-17 10:15                   ` Peter Zijlstra
  2015-07-17 12:40                     ` Paul E. McKenney
  2015-07-17 22:14                   ` Benjamin Herrenschmidt
  1 sibling, 1 reply; 66+ messages in thread
From: Peter Zijlstra @ 2015-07-17 10:15 UTC (permalink / raw)
  To: Will Deacon
  Cc: Benjamin Herrenschmidt, paulmck, Michael Ellerman, linux-arch,
	linux-kernel

On Fri, Jul 17, 2015 at 10:32:21AM +0100, Will Deacon wrote:
> @@ -158,9 +140,7 @@ void arch_spin_lock_flags(arch_spinlock_t *lock, unsigned long flags)
>  
>  static inline void arch_spin_unlock(arch_spinlock_t *lock)
>  {
> -	SYNC_IO;
> -	__asm__ __volatile__("# arch_spin_unlock\n\t"
> -				PPC_RELEASE_BARRIER: : :"memory");
> +	smp_mb();
>  	lock->slock = 0;
>  }

Should we then also make smp_store_release() use sync instead of lwsync
to keep it consistent?

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC PATCH v2] memory-barriers: remove smp_mb__after_unlock_lock()
  2015-07-17 10:15                   ` Peter Zijlstra
@ 2015-07-17 12:40                     ` Paul E. McKenney
  0 siblings, 0 replies; 66+ messages in thread
From: Paul E. McKenney @ 2015-07-17 12:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Will Deacon, Benjamin Herrenschmidt, Michael Ellerman,
	linux-arch, linux-kernel

On Fri, Jul 17, 2015 at 12:15:35PM +0200, Peter Zijlstra wrote:
> On Fri, Jul 17, 2015 at 10:32:21AM +0100, Will Deacon wrote:
> > @@ -158,9 +140,7 @@ void arch_spin_lock_flags(arch_spinlock_t *lock, unsigned long flags)
> >  
> >  static inline void arch_spin_unlock(arch_spinlock_t *lock)
> >  {
> > -	SYNC_IO;
> > -	__asm__ __volatile__("# arch_spin_unlock\n\t"
> > -				PPC_RELEASE_BARRIER: : :"memory");
> > +	smp_mb();
> >  	lock->slock = 0;
> >  }
> 
> Should we then also make smp_store_release() use sync instead of lwsync
> to keep it consistent?

Unless smp_store_release() needs to interact with MMIO accesses, it
should still be able to be lwsync.  This means that unlock-lock is a full
barrier, but release-acquire is not necessarily, which should be just fine.
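
As a hedged C11 analogue (illustrative only, not the kernel implementation;
names invented for the sketch): release/acquire still covers the
publish/consume pattern that smp_store_release() exists for, even though it
is not a full barrier:

	#include <stdatomic.h>

	static atomic_int flag;
	static int payload;

	/* roughly smp_store_release(): lwsync-like ordering suffices here */
	static void producer(void)
	{
		payload = 42;
		atomic_store_explicit(&flag, 1, memory_order_release);
	}

	/* roughly smp_load_acquire(): any reader that observes flag == 1 is
	 * guaranteed to observe payload == 42; no full barrier is needed */
	static int consumer(void)
	{
		if (atomic_load_explicit(&flag, memory_order_acquire))
			return payload;
		return -1;
	}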

							Thanx, Paul


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC PATCH v2] memory-barriers: remove smp_mb__after_unlock_lock()
  2015-07-17  9:32                 ` Will Deacon
  2015-07-17 10:15                   ` Peter Zijlstra
@ 2015-07-17 22:14                   ` Benjamin Herrenschmidt
  2015-07-20 13:39                     ` Will Deacon
  1 sibling, 1 reply; 66+ messages in thread
From: Benjamin Herrenschmidt @ 2015-07-17 22:14 UTC (permalink / raw)
  To: Will Deacon
  Cc: paulmck, Michael Ellerman, linux-arch, linux-kernel, Peter Zijlstra

On Fri, 2015-07-17 at 10:32 +0100, Will Deacon wrote:
>  static inline void arch_spin_unlock(arch_spinlock_t *lock)
>  {
> -       SYNC_IO;
> -       __asm__ __volatile__("# arch_spin_unlock\n\t"
> -                               PPC_RELEASE_BARRIER: : :"memory");
> +       smp_mb();
>         lock->slock = 0;
>  }

That probably needs to be mb() in case somebody has the expectation that
it does a barrier vs. DMA on UP.

Cheers,
Ben.



^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC PATCH v2] memory-barriers: remove smp_mb__after_unlock_lock()
  2015-07-17 22:14                   ` Benjamin Herrenschmidt
@ 2015-07-20 13:39                     ` Will Deacon
  2015-07-20 13:48                       ` Paul E. McKenney
  2015-07-20 21:18                       ` Benjamin Herrenschmidt
  0 siblings, 2 replies; 66+ messages in thread
From: Will Deacon @ 2015-07-20 13:39 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: paulmck, Michael Ellerman, linux-arch, linux-kernel, Peter Zijlstra

On Fri, Jul 17, 2015 at 11:14:14PM +0100, Benjamin Herrenschmidt wrote:
> On Fri, 2015-07-17 at 10:32 +0100, Will Deacon wrote:
> >  static inline void arch_spin_unlock(arch_spinlock_t *lock)
> >  {
> > -       SYNC_IO;
> > -       __asm__ __volatile__("# arch_spin_unlock\n\t"
> > -                               PPC_RELEASE_BARRIER: : :"memory");
> > +       smp_mb();
> >         lock->slock = 0;
> >  }
> 
> That probably needs to be mb() in case somebody has the expectation that
> it does a barrier vs. DMA on UP.

Hmm, but on !SMP doesn't arch_spin_unlock effectively expand to barrier()
in the core code?

Will

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC PATCH v2] memory-barriers: remove smp_mb__after_unlock_lock()
  2015-07-20 13:39                     ` Will Deacon
@ 2015-07-20 13:48                       ` Paul E. McKenney
  2015-07-20 13:56                         ` Will Deacon
  2015-07-20 21:18                       ` Benjamin Herrenschmidt
  1 sibling, 1 reply; 66+ messages in thread
From: Paul E. McKenney @ 2015-07-20 13:48 UTC (permalink / raw)
  To: Will Deacon
  Cc: Benjamin Herrenschmidt, Michael Ellerman, linux-arch,
	linux-kernel, Peter Zijlstra

On Mon, Jul 20, 2015 at 02:39:06PM +0100, Will Deacon wrote:
> On Fri, Jul 17, 2015 at 11:14:14PM +0100, Benjamin Herrenschmidt wrote:
> > On Fri, 2015-07-17 at 10:32 +0100, Will Deacon wrote:
> > >  static inline void arch_spin_unlock(arch_spinlock_t *lock)
> > >  {
> > > -       SYNC_IO;
> > > -       __asm__ __volatile__("# arch_spin_unlock\n\t"
> > > -                               PPC_RELEASE_BARRIER: : :"memory");
> > > +       smp_mb();
> > >         lock->slock = 0;
> > >  }
> > 
> > That probably needs to be mb() in case somebody has the expectation that
> > it does a barrier vs. DMA on UP.
> 
> Hmm, but on !SMP doesn't arch_spin_unlock effectively expand to barrier()
> in the core code?

Yes, to barrier(), but that doesn't generate any code.  In contrast, the
mb() that Ben is asking for puts out a sync instruction.  Without that
sync instruction, MMIO accesses can be reordered with the spin_unlock(),
even on single-CPU systems.  So the mb() is really needed if unlock is
to order against MMIO (and thus DMA) on UP.
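
Roughly, the difference looks like this (a simplified sketch of the two
definitions, not verbatim copies of the headers):

	/* compiler barrier only: constrains the compiler, emits no
	 * instruction, so it cannot order MMIO against anything */
	#define barrier()	__asm__ __volatile__("" : : : "memory")

	/* mb() on PowerPC: emits a real sync, which orders cacheable
	 * accesses and MMIO against the unlock store, even on UP */
	#define mb()		__asm__ __volatile__("sync" : : : "memory")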

							Thanx, Paul


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC PATCH v2] memory-barriers: remove smp_mb__after_unlock_lock()
  2015-07-20 13:48                       ` Paul E. McKenney
@ 2015-07-20 13:56                         ` Will Deacon
  0 siblings, 0 replies; 66+ messages in thread
From: Will Deacon @ 2015-07-20 13:56 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Benjamin Herrenschmidt, Michael Ellerman, linux-arch,
	linux-kernel, Peter Zijlstra

On Mon, Jul 20, 2015 at 02:48:49PM +0100, Paul E. McKenney wrote:
> On Mon, Jul 20, 2015 at 02:39:06PM +0100, Will Deacon wrote:
> > On Fri, Jul 17, 2015 at 11:14:14PM +0100, Benjamin Herrenschmidt wrote:
> > > On Fri, 2015-07-17 at 10:32 +0100, Will Deacon wrote:
> > > >  static inline void arch_spin_unlock(arch_spinlock_t *lock)
> > > >  {
> > > > -       SYNC_IO;
> > > > -       __asm__ __volatile__("# arch_spin_unlock\n\t"
> > > > -                               PPC_RELEASE_BARRIER: : :"memory");
> > > > +       smp_mb();
> > > >         lock->slock = 0;
> > > >  }
> > > 
> > > That probably needs to be mb() in case somebody has the expectation that
> > > it does a barrier vs. DMA on UP.
> > 
> > Hmm, but on !SMP doesn't arch_spin_unlock effectively expand to barrier()
> > in the core code?
> 
> Yes, to barrier(), but that doesn't generate any code.  In contrast, the
> mb() that Ben is asking for puts out a sync instruction.  Without that
> sync instruction, MMIO accesses can be reordered with the spin_unlock(),
> even on single-CPU systems.  So the bm() is really needed if unlock is
> to order against MMIO (and thus DMA) on UP.

Understood, but my point was that this needs to be done in the core code
rather than here. Perhaps it's easier to leave mmiowb() alone for PowerPC
for now and instead predicate that on !SMP?

Will

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC PATCH v2] memory-barriers: remove smp_mb__after_unlock_lock()
  2015-07-20 13:39                     ` Will Deacon
  2015-07-20 13:48                       ` Paul E. McKenney
@ 2015-07-20 21:18                       ` Benjamin Herrenschmidt
  2015-07-22 16:49                         ` Will Deacon
  1 sibling, 1 reply; 66+ messages in thread
From: Benjamin Herrenschmidt @ 2015-07-20 21:18 UTC (permalink / raw)
  To: Will Deacon
  Cc: paulmck, Michael Ellerman, linux-arch, linux-kernel, Peter Zijlstra

On Mon, 2015-07-20 at 14:39 +0100, Will Deacon wrote:
> On Fri, Jul 17, 2015 at 11:14:14PM +0100, Benjamin Herrenschmidt wrote:
> > On Fri, 2015-07-17 at 10:32 +0100, Will Deacon wrote:
> > >  static inline void arch_spin_unlock(arch_spinlock_t *lock)
> > >  {
> > > -       SYNC_IO;
> > > -       __asm__ __volatile__("# arch_spin_unlock\n\t"
> > > -                               PPC_RELEASE_BARRIER: : :"memory");
> > > +       smp_mb();
> > >         lock->slock = 0;
> > >  }
> > 
> > That probably needs to be mb() in case somebody has the expectation that
> > it does a barrier vs. DMA on UP.
> 
> Hmm, but on !SMP doesn't arch_spin_unlock effectively expand to barrier()
> in the core code?

include/linux/spinlock_up.h:# define arch_spin_unlock(lock)	do { barrier(); (void)(lock); } while (0)

Indeed...

Ben.


> Will



^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC PATCH v2] memory-barriers: remove smp_mb__after_unlock_lock()
  2015-07-20 21:18                       ` Benjamin Herrenschmidt
@ 2015-07-22 16:49                         ` Will Deacon
  0 siblings, 0 replies; 66+ messages in thread
From: Will Deacon @ 2015-07-22 16:49 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: paulmck, Michael Ellerman, linux-arch, linux-kernel, Peter Zijlstra

On Mon, Jul 20, 2015 at 10:18:14PM +0100, Benjamin Herrenschmidt wrote:
> On Mon, 2015-07-20 at 14:39 +0100, Will Deacon wrote:
> > On Fri, Jul 17, 2015 at 11:14:14PM +0100, Benjamin Herrenschmidt wrote:
> > > On Fri, 2015-07-17 at 10:32 +0100, Will Deacon wrote:
> > > >  static inline void arch_spin_unlock(arch_spinlock_t *lock)
> > > >  {
> > > > -       SYNC_IO;
> > > > -       __asm__ __volatile__("# arch_spin_unlock\n\t"
> > > > -                               PPC_RELEASE_BARRIER: : :"memory");
> > > > +       smp_mb();
> > > >         lock->slock = 0;
> > > >  }
> > > 
> > > That probably needs to be mb() in case somebody has the expectation that
> > > it does a barrier vs. DMA on UP.
> > 
> > Hmm, but on !SMP doesn't arch_spin_unlock effectively expand to barrier()
> > in the core code?
> 
> include/linux/spinlock_up.h:# define arch_spin_unlock(lock)	do {
> barrier(); (void)(lock); } while (0)
> 
> Indeed...

So, leaving mmiowb() alone, we end up with the patch below.

Will

--->8

From 9373a4226986dd69b6aaf896590ea43abeb1cc8e Mon Sep 17 00:00:00 2001
From: Will Deacon <will.deacon@arm.com>
Date: Wed, 22 Jul 2015 17:37:58 +0100
Subject: [PATCH] powerpc: kill smp_mb__after_unlock_lock

PowerPC is the only architecture defining smp_mb__after_unlock_lock,
but we can remove it by adding an unconditional sync (smp_mb()) to the
spin_unlock code.

Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 arch/powerpc/include/asm/io.h       | 13 +------------
 arch/powerpc/include/asm/spinlock.h | 22 +---------------------
 2 files changed, 2 insertions(+), 33 deletions(-)

diff --git a/arch/powerpc/include/asm/io.h b/arch/powerpc/include/asm/io.h
index a8d2ef30d473..a3ad51046b04 100644
--- a/arch/powerpc/include/asm/io.h
+++ b/arch/powerpc/include/asm/io.h
@@ -105,12 +105,6 @@ extern bool isa_io_special;
  *
  */
 
-#ifdef CONFIG_PPC64
-#define IO_SET_SYNC_FLAG()	do { local_paca->io_sync = 1; } while(0)
-#else
-#define IO_SET_SYNC_FLAG()
-#endif
-
 /* gcc 4.0 and older doesn't have 'Z' constraint */
 #if __GNUC__ < 4 || (__GNUC__ == 4 && __GNUC_MINOR__ == 0)
 #define DEF_MMIO_IN_X(name, size, insn)				\
@@ -127,7 +121,6 @@ static inline void name(volatile u##size __iomem *addr, u##size val)	\
 {									\
 	__asm__ __volatile__("sync;"#insn" %1,0,%2"			\
 		: "=m" (*addr) : "r" (val), "r" (addr) : "memory");	\
-	IO_SET_SYNC_FLAG();						\
 }
 #else /* newer gcc */
 #define DEF_MMIO_IN_X(name, size, insn)				\
@@ -144,7 +137,6 @@ static inline void name(volatile u##size __iomem *addr, u##size val)	\
 {									\
 	__asm__ __volatile__("sync;"#insn" %1,%y0"			\
 		: "=Z" (*addr) : "r" (val) : "memory");			\
-	IO_SET_SYNC_FLAG();						\
 }
 #endif
 
@@ -162,7 +154,6 @@ static inline void name(volatile u##size __iomem *addr, u##size val)	\
 {									\
 	__asm__ __volatile__("sync;"#insn"%U0%X0 %1,%0"			\
 		: "=m" (*addr) : "r" (val) : "memory");			\
-	IO_SET_SYNC_FLAG();						\
 }
 
 DEF_MMIO_IN_D(in_8,     8, lbz);
@@ -630,9 +621,7 @@ static inline void name at					\
 #define mmiowb()
 #else
 /*
- * Enforce synchronisation of stores vs. spin_unlock
- * (this does it explicitly, though our implementation of spin_unlock
- * does it implicitely too)
+ * Explicitly enforce synchronisation of stores vs. spin_unlock
  */
 static inline void mmiowb(void)
 {
diff --git a/arch/powerpc/include/asm/spinlock.h b/arch/powerpc/include/asm/spinlock.h
index 4dbe072eecbe..9da89ea4ff31 100644
--- a/arch/powerpc/include/asm/spinlock.h
+++ b/arch/powerpc/include/asm/spinlock.h
@@ -28,8 +28,6 @@
 #include <asm/synch.h>
 #include <asm/ppc-opcode.h>
 
-#define smp_mb__after_unlock_lock()	smp_mb()  /* Full ordering for lock. */
-
 #ifdef CONFIG_PPC64
 /* use 0x800000yy when locked, where yy == CPU number */
 #ifdef __BIG_ENDIAN__
@@ -41,19 +39,6 @@
 #define LOCK_TOKEN	1
 #endif
 
-#if defined(CONFIG_PPC64) && defined(CONFIG_SMP)
-#define CLEAR_IO_SYNC	(get_paca()->io_sync = 0)
-#define SYNC_IO		do {						\
-				if (unlikely(get_paca()->io_sync)) {	\
-					mb();				\
-					get_paca()->io_sync = 0;	\
-				}					\
-			} while (0)
-#else
-#define CLEAR_IO_SYNC
-#define SYNC_IO
-#endif
-
 static __always_inline int arch_spin_value_unlocked(arch_spinlock_t lock)
 {
 	return lock.slock == 0;
@@ -91,7 +76,6 @@ static inline unsigned long __arch_spin_trylock(arch_spinlock_t *lock)
 
 static inline int arch_spin_trylock(arch_spinlock_t *lock)
 {
-	CLEAR_IO_SYNC;
 	return __arch_spin_trylock(lock) == 0;
 }
 
@@ -122,7 +106,6 @@ extern void __rw_yield(arch_rwlock_t *lock);
 
 static inline void arch_spin_lock(arch_spinlock_t *lock)
 {
-	CLEAR_IO_SYNC;
 	while (1) {
 		if (likely(__arch_spin_trylock(lock) == 0))
 			break;
@@ -140,7 +123,6 @@ void arch_spin_lock_flags(arch_spinlock_t *lock, unsigned long flags)
 {
 	unsigned long flags_dis;
 
-	CLEAR_IO_SYNC;
 	while (1) {
 		if (likely(__arch_spin_trylock(lock) == 0))
 			break;
@@ -158,9 +140,7 @@ void arch_spin_lock_flags(arch_spinlock_t *lock, unsigned long flags)
 
 static inline void arch_spin_unlock(arch_spinlock_t *lock)
 {
-	SYNC_IO;
-	__asm__ __volatile__("# arch_spin_unlock\n\t"
-				PPC_RELEASE_BARRIER: : :"memory");
+	smp_mb();
 	lock->slock = 0;
 }
 
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 66+ messages in thread

* Re: [RFC PATCH v2] memory-barriers: remove smp_mb__after_unlock_lock()
  2015-07-15 13:12                                 ` Paul E. McKenney
@ 2015-07-24 11:31                                   ` Will Deacon
  2015-07-24 15:30                                     ` Paul E. McKenney
  0 siblings, 1 reply; 66+ messages in thread
From: Will Deacon @ 2015-07-24 11:31 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Peter Zijlstra, linux-arch, linux-kernel, Benjamin Herrenschmidt

Hi Paul,

On Wed, Jul 15, 2015 at 02:12:21PM +0100, Paul E. McKenney wrote:
> On Wed, Jul 15, 2015 at 11:51:35AM +0100, Will Deacon wrote:
> > On Wed, Jul 15, 2015 at 02:38:20AM +0100, Paul E. McKenney wrote:
> > > On Tue, Jul 14, 2015 at 12:31:44PM -0700, Paul E. McKenney wrote:
> > > > On Tue, Jul 14, 2015 at 03:12:16PM +0100, Will Deacon wrote:
> > > > > On Tue, Jul 14, 2015 at 03:00:14PM +0100, Paul E. McKenney wrote:
> > > > > > On Tue, Jul 14, 2015 at 01:51:46PM +0100, Will Deacon wrote:
> > > > > > > On Tue, Jul 14, 2015 at 01:45:40PM +0100, Paul E. McKenney wrote:
> > > > > > > > On Tue, Jul 14, 2015 at 11:04:29AM +0100, Will Deacon wrote:
> > > > > > > > > Given that RCU is currently the only user of this barrier, how would you
> > > > > > > > > feel about making the barrier local to RCU and not part of the general
> > > > > > > > > memory-barrier API?
> > > > > > > > 
> > > > > > > > In theory, no objection.  Your thought is to leave the definitions where
> > > > > > > > they are, mark them as being used only by RCU, and removing mention from
> > > > > > > > memory-barriers.txt?  Or did you have something else in mind?
> > > > > > > 
> > > > > > > Actually, I was thinking of defining them in an RCU header file with an
> > > > > > > #ifdef CONFIG_POWERPC for the smb_mb() version. Then you could have a big
> > > > > > > comment describing the semantics, or put that in an RCU Documentation file
> > > > > > > instead of memory-barriers.txt.
> > > > > > > 
> > > > > > > That *should* then mean we notice anybody else trying to use the barrier,
> > > > > > > because they'd need to send patches to either add something equivalent
> > > > > > > or move the definition out again.
> > > > > > 
> > > > > > My concern with this approach is that someone putting together a new
> > > > > > architecture might miss this.  That said, this approach certainly would
> > > > > > work for the current architectures.
> > > > > 
> > > > > I don't think they're any more likely to miss it than with the current
> > > > > situation where the generic code defines the macro as a NOP unless you
> > > > > explicitly override it.
> > > > 
> > > > Fair enough...
> > > 
> > > Like this?
> > 
> > Precisely! Thanks for cooking the patch -- this lays all my worries to
> > rest, so:
> > 
> >   Acked-by: Will Deacon <will.deacon@arm.com>
> 
> Thank you!

[...]

> > > commit 695c05d4b9666c50b40a1c022678b5f6e2e3e771
> > > Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> > > Date:   Tue Jul 14 18:35:23 2015 -0700
> > > 
> > >     rcu,locking: Privatize smp_mb__after_unlock_lock()
> > >     
> > >     RCU is the only thing that uses smp_mb__after_unlock_lock(), and is
> > >     likely the only thing that ever will use it, so this commit makes this
> > >     macro private to RCU.
> > >     
> > >     Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> > >     Cc: Will Deacon <will.deacon@arm.com>
> > >     Cc: Peter Zijlstra <peterz@infradead.org>
> > >     Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> > >     Cc: "linux-arch@vger.kernel.org" <linux-arch@vger.kernel.org>

Are you planning to queue this somewhere? I think it makes sense regardless
of whether we change PowerPC or not and ideally it would be merged around
the same time as my relaxed atomics series.

Will

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC PATCH v2] memory-barriers: remove smp_mb__after_unlock_lock()
  2015-07-24 11:31                                   ` Will Deacon
@ 2015-07-24 15:30                                     ` Paul E. McKenney
  2015-08-12 13:44                                       ` Will Deacon
  0 siblings, 1 reply; 66+ messages in thread
From: Paul E. McKenney @ 2015-07-24 15:30 UTC (permalink / raw)
  To: Will Deacon
  Cc: Peter Zijlstra, linux-arch, linux-kernel, Benjamin Herrenschmidt

On Fri, Jul 24, 2015 at 12:31:01PM +0100, Will Deacon wrote:
> Hi Paul,
> 
> On Wed, Jul 15, 2015 at 02:12:21PM +0100, Paul E. McKenney wrote:
> > On Wed, Jul 15, 2015 at 11:51:35AM +0100, Will Deacon wrote:
> > > On Wed, Jul 15, 2015 at 02:38:20AM +0100, Paul E. McKenney wrote:
> > > > On Tue, Jul 14, 2015 at 12:31:44PM -0700, Paul E. McKenney wrote:
> > > > > On Tue, Jul 14, 2015 at 03:12:16PM +0100, Will Deacon wrote:
> > > > > > On Tue, Jul 14, 2015 at 03:00:14PM +0100, Paul E. McKenney wrote:
> > > > > > > On Tue, Jul 14, 2015 at 01:51:46PM +0100, Will Deacon wrote:
> > > > > > > > On Tue, Jul 14, 2015 at 01:45:40PM +0100, Paul E. McKenney wrote:
> > > > > > > > > On Tue, Jul 14, 2015 at 11:04:29AM +0100, Will Deacon wrote:
> > > > > > > > > > Given that RCU is currently the only user of this barrier, how would you
> > > > > > > > > > feel about making the barrier local to RCU and not part of the general
> > > > > > > > > > memory-barrier API?
> > > > > > > > > 
> > > > > > > > > In theory, no objection.  Your thought is to leave the definitions where
> > > > > > > > > they are, mark them as being used only by RCU, and removing mention from
> > > > > > > > > memory-barriers.txt?  Or did you have something else in mind?
> > > > > > > > 
> > > > > > > > Actually, I was thinking of defining them in an RCU header file with an
> > > > > > > > #ifdef CONFIG_POWERPC for the smb_mb() version. Then you could have a big
> > > > > > > > comment describing the semantics, or put that in an RCU Documentation file
> > > > > > > > instead of memory-barriers.txt.
> > > > > > > > 
> > > > > > > > That *should* then mean we notice anybody else trying to use the barrier,
> > > > > > > > because they'd need to send patches to either add something equivalent
> > > > > > > > or move the definition out again.
> > > > > > > 
> > > > > > > My concern with this approach is that someone putting together a new
> > > > > > > architecture might miss this.  That said, this approach certainly would
> > > > > > > work for the current architectures.
> > > > > > 
> > > > > > I don't think they're any more likely to miss it than with the current
> > > > > > situation where the generic code defines the macro as a NOP unless you
> > > > > > explicitly override it.
> > > > > 
> > > > > Fair enough...
> > > > 
> > > > Like this?
> > > 
> > > Precisely! Thanks for cooking the patch -- this lays all my worries to
> > > rest, so:
> > > 
> > >   Acked-by: Will Deacon <will.deacon@arm.com>
> > 
> > Thank you!
> 
> [...]
> 
> > > > commit 695c05d4b9666c50b40a1c022678b5f6e2e3e771
> > > > Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> > > > Date:   Tue Jul 14 18:35:23 2015 -0700
> > > > 
> > > >     rcu,locking: Privatize smp_mb__after_unlock_lock()
> > > >     
> > > >     RCU is the only thing that uses smp_mb__after_unlock_lock(), and is
> > > >     likely the only thing that ever will use it, so this commit makes this
> > > >     macro private to RCU.
> > > >     
> > > >     Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> > > >     Cc: Will Deacon <will.deacon@arm.com>
> > > >     Cc: Peter Zijlstra <peterz@infradead.org>
> > > >     Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> > > >     Cc: "linux-arch@vger.kernel.org" <linux-arch@vger.kernel.org>
> 
> Are you planning to queue this somewhere? I think it makes sense regardless
> of whether we change PowerPc or not and ideally it would be merged around
> the same time as my relaxed atomics series.

I have it in -rcu.  By default, I will push it to the 4.4 merge window.
Please let me know if you need it sooner.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC PATCH v2] memory-barriers: remove smp_mb__after_unlock_lock()
  2015-07-24 15:30                                     ` Paul E. McKenney
@ 2015-08-12 13:44                                       ` Will Deacon
  2015-08-12 15:43                                         ` Paul E. McKenney
  0 siblings, 1 reply; 66+ messages in thread
From: Will Deacon @ 2015-08-12 13:44 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Peter Zijlstra, linux-arch, linux-kernel, Benjamin Herrenschmidt

Hello Paul,

On Fri, Jul 24, 2015 at 04:30:46PM +0100, Paul E. McKenney wrote:
> On Fri, Jul 24, 2015 at 12:31:01PM +0100, Will Deacon wrote:
> > On Wed, Jul 15, 2015 at 02:12:21PM +0100, Paul E. McKenney wrote:
> > > > > commit 695c05d4b9666c50b40a1c022678b5f6e2e3e771
> > > > > Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> > > > > Date:   Tue Jul 14 18:35:23 2015 -0700
> > > > > 
> > > > >     rcu,locking: Privatize smp_mb__after_unlock_lock()
> > > > >     
> > > > >     RCU is the only thing that uses smp_mb__after_unlock_lock(), and is
> > > > >     likely the only thing that ever will use it, so this commit makes this
> > > > >     macro private to RCU.
> > > > >     
> > > > >     Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> > > > >     Cc: Will Deacon <will.deacon@arm.com>
> > > > >     Cc: Peter Zijlstra <peterz@infradead.org>
> > > > >     Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> > > > >     Cc: "linux-arch@vger.kernel.org" <linux-arch@vger.kernel.org>
> > 
> > Are you planning to queue this somewhere? I think it makes sense regardless
> > of whether we change PowerPc or not and ideally it would be merged around
> > the same time as my relaxed atomics series.
> 
> I have is in -rcu.  By default, I will push it to the 4.4 merge window.
> Please let me know if you need it sooner.

The generic relaxed atomics are now queued in -tip, so it would be really
good to see this Documentation update land in 4.3 if at all possible. I
appreciate it's late in the cycle, but it's always worth asking.

Thanks,

Will

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC PATCH v2] memory-barriers: remove smp_mb__after_unlock_lock()
  2015-08-12 13:44                                       ` Will Deacon
@ 2015-08-12 15:43                                         ` Paul E. McKenney
  2015-08-12 17:59                                           ` Paul E. McKenney
  2015-08-17  4:06                                           ` Michael Ellerman
  0 siblings, 2 replies; 66+ messages in thread
From: Paul E. McKenney @ 2015-08-12 15:43 UTC (permalink / raw)
  To: Will Deacon
  Cc: Peter Zijlstra, linux-arch, linux-kernel, Benjamin Herrenschmidt

On Wed, Aug 12, 2015 at 02:44:15PM +0100, Will Deacon wrote:
> Hello Paul,
> 
> On Fri, Jul 24, 2015 at 04:30:46PM +0100, Paul E. McKenney wrote:
> > On Fri, Jul 24, 2015 at 12:31:01PM +0100, Will Deacon wrote:
> > > On Wed, Jul 15, 2015 at 02:12:21PM +0100, Paul E. McKenney wrote:
> > > > > > commit 695c05d4b9666c50b40a1c022678b5f6e2e3e771
> > > > > > Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> > > > > > Date:   Tue Jul 14 18:35:23 2015 -0700
> > > > > > 
> > > > > >     rcu,locking: Privatize smp_mb__after_unlock_lock()
> > > > > >     
> > > > > >     RCU is the only thing that uses smp_mb__after_unlock_lock(), and is
> > > > > >     likely the only thing that ever will use it, so this commit makes this
> > > > > >     macro private to RCU.
> > > > > >     
> > > > > >     Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> > > > > >     Cc: Will Deacon <will.deacon@arm.com>
> > > > > >     Cc: Peter Zijlstra <peterz@infradead.org>
> > > > > >     Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> > > > > >     Cc: "linux-arch@vger.kernel.org" <linux-arch@vger.kernel.org>
> > > 
> > > Are you planning to queue this somewhere? I think it makes sense regardless
> > > of whether we change PowerPc or not and ideally it would be merged around
> > > the same time as my relaxed atomics series.
> > 
> > I have is in -rcu.  By default, I will push it to the 4.4 merge window.
> > Please let me know if you need it sooner.
> 
> The generic relaxed atomics are now queued in -tip, so it would be really
> good to see this Documentation update land in 4.3 if at all possible. I
> appreciate it's late in the cycle, but it's always worth asking.

Can't hurt to give it a try.  I have set -rcu's rcu/next branch to this
commit, and if it passes a few days' worth of testing, I will see what
Ingo has to say about a pull request.

This commit also privatizes smp_mb__after_unlock_lock() as well as
updating documentation.  Looks like we need to strengthen powerpc's
locking primitives, then get rid of smp_mb__after_unlock_lock() entirely.
Or did that already happen and I just missed it?

							Thanx, Paul


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC PATCH v2] memory-barriers: remove smp_mb__after_unlock_lock()
  2015-08-12 15:43                                         ` Paul E. McKenney
@ 2015-08-12 17:59                                           ` Paul E. McKenney
  2015-08-13 10:49                                             ` Will Deacon
  2015-08-17  4:06                                           ` Michael Ellerman
  1 sibling, 1 reply; 66+ messages in thread
From: Paul E. McKenney @ 2015-08-12 17:59 UTC (permalink / raw)
  To: Will Deacon
  Cc: Peter Zijlstra, linux-arch, linux-kernel, Benjamin Herrenschmidt

On Wed, Aug 12, 2015 at 08:43:46AM -0700, Paul E. McKenney wrote:
> On Wed, Aug 12, 2015 at 02:44:15PM +0100, Will Deacon wrote:
> > Hello Paul,
> > 
> > On Fri, Jul 24, 2015 at 04:30:46PM +0100, Paul E. McKenney wrote:
> > > On Fri, Jul 24, 2015 at 12:31:01PM +0100, Will Deacon wrote:
> > > > On Wed, Jul 15, 2015 at 02:12:21PM +0100, Paul E. McKenney wrote:
> > > > > > > commit 695c05d4b9666c50b40a1c022678b5f6e2e3e771
> > > > > > > Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> > > > > > > Date:   Tue Jul 14 18:35:23 2015 -0700
> > > > > > > 
> > > > > > >     rcu,locking: Privatize smp_mb__after_unlock_lock()
> > > > > > >     
> > > > > > >     RCU is the only thing that uses smp_mb__after_unlock_lock(), and is
> > > > > > >     likely the only thing that ever will use it, so this commit makes this
> > > > > > >     macro private to RCU.
> > > > > > >     
> > > > > > >     Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> > > > > > >     Cc: Will Deacon <will.deacon@arm.com>
> > > > > > >     Cc: Peter Zijlstra <peterz@infradead.org>
> > > > > > >     Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> > > > > > >     Cc: "linux-arch@vger.kernel.org" <linux-arch@vger.kernel.org>
> > > > 
> > > > Are you planning to queue this somewhere? I think it makes sense regardless
> > > > of whether we change PowerPc or not and ideally it would be merged around
> > > > the same time as my relaxed atomics series.
> > > 
> > > I have is in -rcu.  By default, I will push it to the 4.4 merge window.
> > > Please let me know if you need it sooner.
> > 
> > The generic relaxed atomics are now queued in -tip, so it would be really
> > good to see this Documentation update land in 4.3 if at all possible. I
> > appreciate it's late in the cycle, but it's always worth asking.
> 
> Can't hurt to give it a try.  I have set -rcu's rcu/next branch to this
> commit, and if it passes a few day's worth of testing, I will see what
> Ingo has to say about a pull request.
> 
> This commit also privatizes smp_mb__after_unlock_lock() as well as
> updating documentation.  Looks like we need to strengthen powerpc's
> locking primitives, then get rid of smp_mb__after_unlock_lock() entirely.
> Or did that already happen and I just missed it?

And just for completeness, here is the current version of that commit.

							Thanx, Paul

------------------------------------------------------------------------

 b/Documentation/memory-barriers.txt   |   71 +---------------------------------
 b/arch/powerpc/include/asm/spinlock.h |    2 
 b/include/linux/spinlock.h            |   10 ----
 b/kernel/rcu/tree.h                   |   12 +++++
 4 files changed, 16 insertions(+), 79 deletions(-)

commit 12d560f4ea87030667438a169912380be00cea4b
Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Date:   Tue Jul 14 18:35:23 2015 -0700

    rcu,locking: Privatize smp_mb__after_unlock_lock()
    
    RCU is the only thing that uses smp_mb__after_unlock_lock(), and is
    likely the only thing that ever will use it, so this commit makes this
    macro private to RCU.
    
    Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    Cc: Will Deacon <will.deacon@arm.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
    Cc: "linux-arch@vger.kernel.org" <linux-arch@vger.kernel.org>

diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
index 318523872db5..eafa6a53f72c 100644
--- a/Documentation/memory-barriers.txt
+++ b/Documentation/memory-barriers.txt
@@ -1854,16 +1854,10 @@ RELEASE are to the same lock variable, but only from the perspective of
 another CPU not holding that lock.  In short, a ACQUIRE followed by an
 RELEASE may -not- be assumed to be a full memory barrier.
 
-Similarly, the reverse case of a RELEASE followed by an ACQUIRE does not
-imply a full memory barrier.  If it is necessary for a RELEASE-ACQUIRE
-pair to produce a full barrier, the ACQUIRE can be followed by an
-smp_mb__after_unlock_lock() invocation.  This will produce a full barrier
-(including transitivity) if either (a) the RELEASE and the ACQUIRE are
-executed by the same CPU or task, or (b) the RELEASE and ACQUIRE act on
-the same variable.  The smp_mb__after_unlock_lock() primitive is free
-on many architectures.  Without smp_mb__after_unlock_lock(), the CPU's
-execution of the critical sections corresponding to the RELEASE and the
-ACQUIRE can cross, so that:
+Similarly, the reverse case of a RELEASE followed by an ACQUIRE does
+not imply a full memory barrier.  Therefore, the CPU's execution of the
+critical sections corresponding to the RELEASE and the ACQUIRE can cross,
+so that:
 
 	*A = a;
 	RELEASE M
@@ -1901,29 +1895,6 @@ the RELEASE would simply complete, thereby avoiding the deadlock.
 	a sleep-unlock race, but the locking primitive needs to resolve
 	such races properly in any case.
 
-With smp_mb__after_unlock_lock(), the two critical sections cannot overlap.
-For example, with the following code, the store to *A will always be
-seen by other CPUs before the store to *B:
-
-	*A = a;
-	RELEASE M
-	ACQUIRE N
-	smp_mb__after_unlock_lock();
-	*B = b;
-
-The operations will always occur in one of the following orders:
-
-	STORE *A, RELEASE, ACQUIRE, smp_mb__after_unlock_lock(), STORE *B
-	STORE *A, ACQUIRE, RELEASE, smp_mb__after_unlock_lock(), STORE *B
-	ACQUIRE, STORE *A, RELEASE, smp_mb__after_unlock_lock(), STORE *B
-
-If the RELEASE and ACQUIRE were instead both operating on the same lock
-variable, only the first of these alternatives can occur.  In addition,
-the more strongly ordered systems may rule out some of the above orders.
-But in any case, as noted earlier, the smp_mb__after_unlock_lock()
-ensures that the store to *A will always be seen as happening before
-the store to *B.
-
 Locks and semaphores may not provide any guarantee of ordering on UP compiled
 systems, and so cannot be counted on in such a situation to actually achieve
 anything at all - especially with respect to I/O accesses - unless combined
@@ -2154,40 +2125,6 @@ But it won't see any of:
 	*E, *F or *G following RELEASE Q
 
 
-However, if the following occurs:
-
-	CPU 1				CPU 2
-	===============================	===============================
-	WRITE_ONCE(*A, a);
-	ACQUIRE M		     [1]
-	WRITE_ONCE(*B, b);
-	WRITE_ONCE(*C, c);
-	RELEASE M	     [1]
-	WRITE_ONCE(*D, d);		WRITE_ONCE(*E, e);
-					ACQUIRE M		     [2]
-					smp_mb__after_unlock_lock();
-					WRITE_ONCE(*F, f);
-					WRITE_ONCE(*G, g);
-					RELEASE M	     [2]
-					WRITE_ONCE(*H, h);
-
-CPU 3 might see:
-
-	*E, ACQUIRE M [1], *C, *B, *A, RELEASE M [1],
-		ACQUIRE M [2], *H, *F, *G, RELEASE M [2], *D
-
-But assuming CPU 1 gets the lock first, CPU 3 won't see any of:
-
-	*B, *C, *D, *F, *G or *H preceding ACQUIRE M [1]
-	*A, *B or *C following RELEASE M [1]
-	*F, *G or *H preceding ACQUIRE M [2]
-	*A, *B, *C, *E, *F or *G following RELEASE M [2]
-
-Note that the smp_mb__after_unlock_lock() is critically important
-here: Without it CPU 3 might see some of the above orderings.
-Without smp_mb__after_unlock_lock(), the accesses are not guaranteed
-to be seen in order unless CPU 3 holds lock M.
-
 
 ACQUIRES VS I/O ACCESSES
 ------------------------
diff --git a/arch/powerpc/include/asm/spinlock.h b/arch/powerpc/include/asm/spinlock.h
index 4dbe072eecbe..523673d7583c 100644
--- a/arch/powerpc/include/asm/spinlock.h
+++ b/arch/powerpc/include/asm/spinlock.h
@@ -28,8 +28,6 @@
 #include <asm/synch.h>
 #include <asm/ppc-opcode.h>
 
-#define smp_mb__after_unlock_lock()	smp_mb()  /* Full ordering for lock. */
-
 #ifdef CONFIG_PPC64
 /* use 0x800000yy when locked, where yy == CPU number */
 #ifdef __BIG_ENDIAN__
diff --git a/include/linux/spinlock.h b/include/linux/spinlock.h
index 0063b24b4f36..16c5ed5a627c 100644
--- a/include/linux/spinlock.h
+++ b/include/linux/spinlock.h
@@ -130,16 +130,6 @@ do {								\
 #define smp_mb__before_spinlock()	smp_wmb()
 #endif
 
-/*
- * Place this after a lock-acquisition primitive to guarantee that
- * an UNLOCK+LOCK pair act as a full barrier.  This guarantee applies
- * if the UNLOCK and LOCK are executed by the same CPU or if the
- * UNLOCK and LOCK operate on the same lock variable.
- */
-#ifndef smp_mb__after_unlock_lock
-#define smp_mb__after_unlock_lock()	do { } while (0)
-#endif
-
 /**
  * raw_spin_unlock_wait - wait until the spinlock gets unlocked
  * @lock: the spinlock in question.
diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index 0412030ca882..2e991f8361e4 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -653,3 +653,15 @@ static inline void rcu_nocb_q_lengths(struct rcu_data *rdp, long *ql, long *qll)
 #endif /* #else #ifdef CONFIG_RCU_NOCB_CPU */
 }
 #endif /* #ifdef CONFIG_RCU_TRACE */
+
+/*
+ * Place this after a lock-acquisition primitive to guarantee that
+ * an UNLOCK+LOCK pair act as a full barrier.  This guarantee applies
+ * if the UNLOCK and LOCK are executed by the same CPU or if the
+ * UNLOCK and LOCK operate on the same lock variable.
+ */
+#ifdef CONFIG_PPC
+#define smp_mb__after_unlock_lock()	smp_mb()  /* Full ordering for lock. */
+#else /* #ifdef CONFIG_PPC */
+#define smp_mb__after_unlock_lock()	do { } while (0)
+#endif /* #else #ifdef CONFIG_PPC */
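
For context, the call pattern this keeps private to RCU looks roughly like
the following (a hedged sketch paraphrasing kernel/rcu/tree.c, not a verbatim
excerpt; rnp and flags are placeholders):

	/* acquire an rcu_node lock, then promote the prior unlock plus this
	 * lock to a full barrier: smp_mb() on PPC, no-op elsewhere with the
	 * definition above */
	raw_spin_lock_irqsave(&rnp->lock, flags);
	smp_mb__after_unlock_lock();

	/* ... grace-period bookkeeping protected by rnp->lock ... */

	raw_spin_unlock_irqrestore(&rnp->lock, flags);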


^ permalink raw reply related	[flat|nested] 66+ messages in thread

* Re: [RFC PATCH v2] memory-barriers: remove smp_mb__after_unlock_lock()
  2015-08-12 17:59                                           ` Paul E. McKenney
@ 2015-08-13 10:49                                             ` Will Deacon
  2015-08-13 13:10                                               ` Paul E. McKenney
  0 siblings, 1 reply; 66+ messages in thread
From: Will Deacon @ 2015-08-13 10:49 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Peter Zijlstra, linux-arch, linux-kernel, Benjamin Herrenschmidt

On Wed, Aug 12, 2015 at 06:59:38PM +0100, Paul E. McKenney wrote:
> On Wed, Aug 12, 2015 at 08:43:46AM -0700, Paul E. McKenney wrote:
> > On Wed, Aug 12, 2015 at 02:44:15PM +0100, Will Deacon wrote:
> > > The generic relaxed atomics are now queued in -tip, so it would be really
> > > good to see this Documentation update land in 4.3 if at all possible. I
> > > appreciate it's late in the cycle, but it's always worth asking.
> > 
> > Can't hurt to give it a try.  I have set -rcu's rcu/next branch to this
> > commit, and if it passes a few day's worth of testing, I will see what
> > Ingo has to say about a pull request.
> > 
> > This commit also privatizes smp_mb__after_unlock_lock() as well as
> > updating documentation.  Looks like we need to strengthen powerpc's
> > locking primitives, then get rid of smp_mb__after_unlock_lock() entirely.
> > Or did that already happen and I just missed it?
> 
> And just for completeness, here is the current version of that commit.
> 
> 							Thanx, Paul
> 
> ------------------------------------------------------------------------
> 
>  b/Documentation/memory-barriers.txt   |   71 +---------------------------------
>  b/arch/powerpc/include/asm/spinlock.h |    2 
>  b/include/linux/spinlock.h            |   10 ----
>  b/kernel/rcu/tree.h                   |   12 +++++
>  4 files changed, 16 insertions(+), 79 deletions(-)
> 
> commit 12d560f4ea87030667438a169912380be00cea4b
> Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> Date:   Tue Jul 14 18:35:23 2015 -0700
> 
>     rcu,locking: Privatize smp_mb__after_unlock_lock()
>     
>     RCU is the only thing that uses smp_mb__after_unlock_lock(), and is
>     likely the only thing that ever will use it, so this commit makes this
>     macro private to RCU.
>     
>     Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
>     Cc: Will Deacon <will.deacon@arm.com>
>     Cc: Peter Zijlstra <peterz@infradead.org>
>     Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
>     Cc: "linux-arch@vger.kernel.org" <linux-arch@vger.kernel.org>

Acked-by: Will Deacon <will.deacon@arm.com>

I don't think the PowerPC spinlock change is queued anywhere (I sent it
out as a diff for discussion, but that was it). This patch doesn't rely
on that though, right?

Will

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC PATCH v2] memory-barriers: remove smp_mb__after_unlock_lock()
  2015-08-13 10:49                                             ` Will Deacon
@ 2015-08-13 13:10                                               ` Paul E. McKenney
  0 siblings, 0 replies; 66+ messages in thread
From: Paul E. McKenney @ 2015-08-13 13:10 UTC (permalink / raw)
  To: Will Deacon
  Cc: Peter Zijlstra, linux-arch, linux-kernel, Benjamin Herrenschmidt

On Thu, Aug 13, 2015 at 11:49:28AM +0100, Will Deacon wrote:
> On Wed, Aug 12, 2015 at 06:59:38PM +0100, Paul E. McKenney wrote:
> > On Wed, Aug 12, 2015 at 08:43:46AM -0700, Paul E. McKenney wrote:
> > > On Wed, Aug 12, 2015 at 02:44:15PM +0100, Will Deacon wrote:
> > > > The generic relaxed atomics are now queued in -tip, so it would be really
> > > > good to see this Documentation update land in 4.3 if at all possible. I
> > > > appreciate it's late in the cycle, but it's always worth asking.
> > > 
> > > Can't hurt to give it a try.  I have set -rcu's rcu/next branch to this
> > > commit, and if it passes a few day's worth of testing, I will see what
> > > Ingo has to say about a pull request.
> > > 
> > > This commit also privatizes smp_mb__after_unlock_lock() as well as
> > > updating documentation.  Looks like we need to strengthen powerpc's
> > > locking primitives, then get rid of smp_mb__after_unlock_lock() entirely.
> > > Or did that already happen and I just missed it?
> > 
> > And just for completeness, here is the current version of that commit.
> > 
> > 							Thanx, Paul
> > 
> > ------------------------------------------------------------------------
> > 
> >  b/Documentation/memory-barriers.txt   |   71 +---------------------------------
> >  b/arch/powerpc/include/asm/spinlock.h |    2 
> >  b/include/linux/spinlock.h            |   10 ----
> >  b/kernel/rcu/tree.h                   |   12 +++++
> >  4 files changed, 16 insertions(+), 79 deletions(-)
> > 
> > commit 12d560f4ea87030667438a169912380be00cea4b
> > Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> > Date:   Tue Jul 14 18:35:23 2015 -0700
> > 
> >     rcu,locking: Privatize smp_mb__after_unlock_lock()
> >     
> >     RCU is the only thing that uses smp_mb__after_unlock_lock(), and is
> >     likely the only thing that ever will use it, so this commit makes this
> >     macro private to RCU.
> >     
> >     Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> >     Cc: Will Deacon <will.deacon@arm.com>
> >     Cc: Peter Zijlstra <peterz@infradead.org>
> >     Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> >     Cc: "linux-arch@vger.kernel.org" <linux-arch@vger.kernel.org>
> 
> Acked-by: Will Deacon <will.deacon@arm.com>
> 
> I don't think the PowerPC spinlock change is queued anywhere (I sent it
> out as a diff for discussion, but that was it). This patch doesn't rely
> on that though, right?

No, this patch just moves the smp_mb__after_unlock_lock() definition,
it does not change the code generated.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC PATCH v2] memory-barriers: remove smp_mb__after_unlock_lock()
  2015-08-12 15:43                                         ` Paul E. McKenney
  2015-08-12 17:59                                           ` Paul E. McKenney
@ 2015-08-17  4:06                                           ` Michael Ellerman
  2015-08-17  6:15                                             ` Paul E. McKenney
  1 sibling, 1 reply; 66+ messages in thread
From: Michael Ellerman @ 2015-08-17  4:06 UTC (permalink / raw)
  To: paulmck
  Cc: Will Deacon, Peter Zijlstra, linux-arch, linux-kernel,
	Benjamin Herrenschmidt

On Wed, 2015-08-12 at 08:43 -0700, Paul E. McKenney wrote:
> On Wed, Aug 12, 2015 at 02:44:15PM +0100, Will Deacon wrote:
> > Hello Paul,
> > 
> > On Fri, Jul 24, 2015 at 04:30:46PM +0100, Paul E. McKenney wrote:
> > > On Fri, Jul 24, 2015 at 12:31:01PM +0100, Will Deacon wrote:
> > > > On Wed, Jul 15, 2015 at 02:12:21PM +0100, Paul E. McKenney wrote:
> > > > > > > commit 695c05d4b9666c50b40a1c022678b5f6e2e3e771
> > > > > > > Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> > > > > > > Date:   Tue Jul 14 18:35:23 2015 -0700
> > > > > > > 
> > > > > > >     rcu,locking: Privatize smp_mb__after_unlock_lock()
> > > > > > >     
> > > > > > >     RCU is the only thing that uses smp_mb__after_unlock_lock(), and is
> > > > > > >     likely the only thing that ever will use it, so this commit makes this
> > > > > > >     macro private to RCU.
> > > > > > >     
> > > > > > >     Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> > > > > > >     Cc: Will Deacon <will.deacon@arm.com>
> > > > > > >     Cc: Peter Zijlstra <peterz@infradead.org>
> > > > > > >     Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> > > > > > >     Cc: "linux-arch@vger.kernel.org" <linux-arch@vger.kernel.org>
> > > > 
> > > > Are you planning to queue this somewhere? I think it makes sense regardless
> > > > of whether we change PowerPc or not and ideally it would be merged around
> > > > the same time as my relaxed atomics series.
> > > 
> > > I have is in -rcu.  By default, I will push it to the 4.4 merge window.
> > > Please let me know if you need it sooner.
> > 
> > The generic relaxed atomics are now queued in -tip, so it would be really
> > good to see this Documentation update land in 4.3 if at all possible. I
> > appreciate it's late in the cycle, but it's always worth asking.
> 
> Can't hurt to give it a try.  I have set -rcu's rcu/next branch to this
> commit, and if it passes a few day's worth of testing, I will see what
> Ingo has to say about a pull request.
> 
> This commit also privatizes smp_mb__after_unlock_lock() as well as
> updating documentation.  Looks like we need to strengthen powerpc's
> locking primitives, then get rid of smp_mb__after_unlock_lock() entirely.
> Or did that already happen and I just missed it?

No it didn't.

I thought the end result of this thread was that we didn't *need* to change the
powerpc lock semantics? Or did I read it wrong?

ie. the docs now say that RELEASE+ACQUIRE is not a full barrier, which is
consistent with our current implementation.

cheers



^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC PATCH v2] memory-barriers: remove smp_mb__after_unlock_lock()
  2015-08-17  4:06                                           ` Michael Ellerman
@ 2015-08-17  6:15                                             ` Paul E. McKenney
  2015-08-17  8:57                                               ` Will Deacon
  0 siblings, 1 reply; 66+ messages in thread
From: Paul E. McKenney @ 2015-08-17  6:15 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Will Deacon, Peter Zijlstra, linux-arch, linux-kernel,
	Benjamin Herrenschmidt

On Mon, Aug 17, 2015 at 02:06:07PM +1000, Michael Ellerman wrote:
> On Wed, 2015-08-12 at 08:43 -0700, Paul E. McKenney wrote:
> > On Wed, Aug 12, 2015 at 02:44:15PM +0100, Will Deacon wrote:
> > > Hello Paul,
> > > 
> > > On Fri, Jul 24, 2015 at 04:30:46PM +0100, Paul E. McKenney wrote:
> > > > On Fri, Jul 24, 2015 at 12:31:01PM +0100, Will Deacon wrote:
> > > > > On Wed, Jul 15, 2015 at 02:12:21PM +0100, Paul E. McKenney wrote:
> > > > > > > > commit 695c05d4b9666c50b40a1c022678b5f6e2e3e771
> > > > > > > > Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> > > > > > > > Date:   Tue Jul 14 18:35:23 2015 -0700
> > > > > > > > 
> > > > > > > >     rcu,locking: Privatize smp_mb__after_unlock_lock()
> > > > > > > >     
> > > > > > > >     RCU is the only thing that uses smp_mb__after_unlock_lock(), and is
> > > > > > > >     likely the only thing that ever will use it, so this commit makes this
> > > > > > > >     macro private to RCU.
> > > > > > > >     
> > > > > > > >     Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> > > > > > > >     Cc: Will Deacon <will.deacon@arm.com>
> > > > > > > >     Cc: Peter Zijlstra <peterz@infradead.org>
> > > > > > > >     Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> > > > > > > >     Cc: "linux-arch@vger.kernel.org" <linux-arch@vger.kernel.org>
> > > > > 
> > > > > Are you planning to queue this somewhere? I think it makes sense regardless
> > > > > of whether we change PowerPc or not and ideally it would be merged around
> > > > > the same time as my relaxed atomics series.
> > > > 
> > > > I have is in -rcu.  By default, I will push it to the 4.4 merge window.
> > > > Please let me know if you need it sooner.
> > > 
> > > The generic relaxed atomics are now queued in -tip, so it would be really
> > > good to see this Documentation update land in 4.3 if at all possible. I
> > > appreciate it's late in the cycle, but it's always worth asking.
> > 
> > Can't hurt to give it a try.  I have set -rcu's rcu/next branch to this
> > commit, and if it passes a few day's worth of testing, I will see what
> > Ingo has to say about a pull request.
> > 
> > This commit also privatizes smp_mb__after_unlock_lock() as well as
> > updating documentation.  Looks like we need to strengthen powerpc's
> > locking primitives, then get rid of smp_mb__after_unlock_lock() entirely.
> > Or did that already happen and I just missed it?
> 
> No it didn't.
> 
> I thought the end result of this thread was that we didn't *need* to change the
> powerpc lock semantics? Or did I read it wrong?
> 
> ie. the docs now say that RELEASE+ACQUIRE is not a full barrier, which is
> consistent with our current implementation.

That change happened about 1.5 years ago, and I thought that the
current discussion was about reversing it, based in part on the
recent powerpc benchmarks of locking primitives with and without the
sync instruction.  But regardless, I clearly cannot remove either the
smp_mb__after_unlock_lock() or the powerpc definition of it to be smp_mb()
if powerpc unlock/lock is not strengthened.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC PATCH v2] memory-barriers: remove smp_mb__after_unlock_lock()
  2015-08-17  6:15                                             ` Paul E. McKenney
@ 2015-08-17  8:57                                               ` Will Deacon
  2015-08-18  1:50                                                 ` Michael Ellerman
  0 siblings, 1 reply; 66+ messages in thread
From: Will Deacon @ 2015-08-17  8:57 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Michael Ellerman, Peter Zijlstra, linux-arch, linux-kernel,
	Benjamin Herrenschmidt

On Mon, Aug 17, 2015 at 07:15:01AM +0100, Paul E. McKenney wrote:
> On Mon, Aug 17, 2015 at 02:06:07PM +1000, Michael Ellerman wrote:
> > On Wed, 2015-08-12 at 08:43 -0700, Paul E. McKenney wrote:
> > > On Wed, Aug 12, 2015 at 02:44:15PM +0100, Will Deacon wrote:
> > > > On Fri, Jul 24, 2015 at 04:30:46PM +0100, Paul E. McKenney wrote:
> > > > > On Fri, Jul 24, 2015 at 12:31:01PM +0100, Will Deacon wrote:
> > > > > > On Wed, Jul 15, 2015 at 02:12:21PM +0100, Paul E. McKenney wrote:
> > > > > > > > > commit 695c05d4b9666c50b40a1c022678b5f6e2e3e771
> > > > > > > > > Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> > > > > > > > > Date:   Tue Jul 14 18:35:23 2015 -0700
> > > > > > > > > 
> > > > > > > > >     rcu,locking: Privatize smp_mb__after_unlock_lock()
> > > > > > > > >     
> > > > > > > > >     RCU is the only thing that uses smp_mb__after_unlock_lock(), and is
> > > > > > > > >     likely the only thing that ever will use it, so this commit makes this
> > > > > > > > >     macro private to RCU.
> > > > > > > > >     
> > > > > > > > >     Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> > > > > > > > >     Cc: Will Deacon <will.deacon@arm.com>
> > > > > > > > >     Cc: Peter Zijlstra <peterz@infradead.org>
> > > > > > > > >     Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> > > > > > > > >     Cc: "linux-arch@vger.kernel.org" <linux-arch@vger.kernel.org>
> > > > > > 
> > > > > > Are you planning to queue this somewhere? I think it makes sense regardless
> > > > > > of whether we change PowerPc or not and ideally it would be merged around
> > > > > > the same time as my relaxed atomics series.
> > > > > 
> > > > > I have it in -rcu.  By default, I will push it to the 4.4 merge window.
> > > > > Please let me know if you need it sooner.
> > > > 
> > > > The generic relaxed atomics are now queued in -tip, so it would be really
> > > > good to see this Documentation update land in 4.3 if at all possible. I
> > > > appreciate it's late in the cycle, but it's always worth asking.
> > > 
> > > Can't hurt to give it a try.  I have set -rcu's rcu/next branch to this
> > > commit, and if it passes a few days' worth of testing, I will see what
> > > Ingo has to say about a pull request.
> > > 
> > > This commit also privatizes smp_mb__after_unlock_lock() as well as
> > > updating documentation.  Looks like we need to strengthen powerpc's
> > > locking primitives, then get rid of smp_mb__after_unlock_lock() entirely.
> > > Or did that already happen and I just missed it?
> > 
> > No it didn't.
> > 
> > I thought the end result of this thread was that we didn't *need* to change the
> > powerpc lock semantics? Or did I read it wrong?
> > 
> > ie. the docs now say that RELEASE+ACQUIRE is not a full barrier, which is
> > consistent with our current implementation.
> 
> That change happened about 1.5 years ago, and I thought that the
> current discussion was about reversing it, based in part on the
> recent powerpc benchmarks of locking primitives with and without the
> sync instruction.  But regardless, I clearly cannot remove either the
> smp_mb__after_unlock_lock() or the powerpc definition of it to be smp_mb()
> if powerpc unlock/lock is not strengthened.

Yup. Peter and I would really like to get rid of smp_mb__after_unlock_lock
entirely, which would mean strengthening the ppc spinlocks. Moving the
barrier primitive into RCU is a good step to prevent more widespread usage
of the barrier, but we'd really like to go further if the performance impact
is deemed acceptable (which is what this thread is about).
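
For anyone skimming the thread, the pattern at stake looks roughly like
the userspace sketch below.  C11 atomics stand in for the kernel
primitives, so every name and type here is illustrative rather than
kernel code:

  /*
   * Thread 0's release store to l1 (the "unlock") followed by its
   * acquire RMW on l2 (the "lock") does not, by itself, order the
   * store to x against the load of y.  The seq_cst fence marked below
   * plays the role of smp_mb__after_unlock_lock(); without it (and
   * without a full-barrier unlock/lock), r0 == 0 && r1 == 0 is a
   * permitted outcome.
   */
  #include <pthread.h>
  #include <stdatomic.h>
  #include <stdio.h>

  static atomic_int l1 = 1, l2, x, y;   /* l1 starts out "held" */
  static int r0 = -1, r1 = -1;

  static void *t0(void *arg)
  {
          atomic_store_explicit(&x, 1, memory_order_relaxed);
          atomic_store_explicit(&l1, 0, memory_order_release);  /* UNLOCK l1 */
          while (atomic_exchange_explicit(&l2, 1, memory_order_acquire))
                  ;                                             /* LOCK l2 */
          atomic_thread_fence(memory_order_seq_cst);    /* smp_mb__after_unlock_lock() stand-in */
          r0 = atomic_load_explicit(&y, memory_order_relaxed);
          return NULL;
  }

  static void *t1(void *arg)
  {
          atomic_store_explicit(&y, 1, memory_order_relaxed);
          atomic_thread_fence(memory_order_seq_cst);
          r1 = atomic_load_explicit(&x, memory_order_relaxed);
          return NULL;
  }

  int main(void)
  {
          pthread_t a, b;

          pthread_create(&a, NULL, t0, NULL);
          pthread_create(&b, NULL, t1, NULL);
          pthread_join(a, NULL);
          pthread_join(b, NULL);
          printf("r0=%d r1=%d\n", r0, r1);
          return 0;
  }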

Will

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC PATCH v2] memory-barriers: remove smp_mb__after_unlock_lock()
  2015-08-17  8:57                                               ` Will Deacon
@ 2015-08-18  1:50                                                 ` Michael Ellerman
  2015-08-18  8:37                                                   ` Will Deacon
  0 siblings, 1 reply; 66+ messages in thread
From: Michael Ellerman @ 2015-08-18  1:50 UTC (permalink / raw)
  To: Will Deacon
  Cc: Paul E. McKenney, Peter Zijlstra, linux-arch, linux-kernel,
	Benjamin Herrenschmidt

On Mon, 2015-08-17 at 09:57 +0100, Will Deacon wrote:
> On Mon, Aug 17, 2015 at 07:15:01AM +0100, Paul E. McKenney wrote:
> > On Mon, Aug 17, 2015 at 02:06:07PM +1000, Michael Ellerman wrote:
> > > On Wed, 2015-08-12 at 08:43 -0700, Paul E. McKenney wrote:
> > > I thought the end result of this thread was that we didn't *need* to change the
> > > powerpc lock semantics? Or did I read it wrong?
> > > 
> > > ie. the docs now say that RELEASE+ACQUIRE is not a full barrier, which is
> > > consistent with our current implementation.
> > 
> > That change happened about 1.5 years ago, and I thought that the
> > current discussion was about reversing it, based in part on the
> > recent powerpc benchmarks of locking primitives with and without the
> > sync instruction.  But regardless, I clearly cannot remove either the
> > smp_mb__after_unlock_lock() or the powerpc definition of it to be smp_mb()
> > if powerpc unlock/lock is not strengthened.
> 
> Yup. Peter and I would really like to get rid of smp_mb__after_unlock_lock
> entirely, which would mean strengthening the ppc spinlocks. Moving the
> barrier primitive into RCU is a good step to prevent more widespread usage
> of the barrier, but we'd really like to go further if the performance impact
> is deemed acceptable (which is what this thread is about).

OK, sorry for completely missing the point, too many balls in the air here.

I'll do some benchmarks and see what we come up with.

cheers



^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC PATCH v2] memory-barriers: remove smp_mb__after_unlock_lock()
  2015-08-18  1:50                                                 ` Michael Ellerman
@ 2015-08-18  8:37                                                   ` Will Deacon
  2015-08-20  9:45                                                     ` Michael Ellerman
  0 siblings, 1 reply; 66+ messages in thread
From: Will Deacon @ 2015-08-18  8:37 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Paul E. McKenney, Peter Zijlstra, linux-arch, linux-kernel,
	Benjamin Herrenschmidt

On Tue, Aug 18, 2015 at 02:50:55AM +0100, Michael Ellerman wrote:
> On Mon, 2015-08-17 at 09:57 +0100, Will Deacon wrote:
> > On Mon, Aug 17, 2015 at 07:15:01AM +0100, Paul E. McKenney wrote:
> > > On Mon, Aug 17, 2015 at 02:06:07PM +1000, Michael Ellerman wrote:
> > > > On Wed, 2015-08-12 at 08:43 -0700, Paul E. McKenney wrote:
> > > > I thought the end result of this thread was that we didn't *need* to change the
> > > > powerpc lock semantics? Or did I read it wrong?
> > > > 
> > > > ie. the docs now say that RELEASE+ACQUIRE is not a full barrier, which is
> > > > consistent with our current implementation.
> > > 
> > > That change happened about 1.5 years ago, and I thought that the
> > > current discussion was about reversing it, based in part on the
> > > recent powerpc benchmarks of locking primitives with and without the
> > > sync instruction.  But regardless, I clearly cannot remove either the
> > > smp_mb__after_unlock_lock() or the powerpc definition of it to be smp_mb()
> > > if powerpc unlock/lock is not strengthened.
> > 
> > Yup. Peter and I would really like to get rid of smp_mb__after_unlock_lock
> > entirely, which would mean strengthening the ppc spinlocks. Moving the
> > barrier primitive into RCU is a good step to prevent more widespread usage
> > of the barrier, but we'd really like to go further if the performance impact
> > is deemed acceptable (which is what this thread is about).
> 
> OK, sorry for completely missing the point, too many balls in the air here.

No problem!

> I'll do some benchmarks and see what we come up with.

Thanks, that sounds great. FWIW, there are multiple ways of implementing
the patch (i.e. whether you strengthen lock or unlock). I had a crack at
something here, but it's not tested:

  http://marc.info/?l=linux-arch&m=143758379023849&w=2
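
To make the options concrete, here is a rough sketch of the two
placements -- it is not the patch at that URL, the SYNC_IO/paravirt
details are left out, and arch_spin_lock_ll_sc() is a made-up name for
the existing lwarx/stwcx. acquire loop:

  /* Simplified powerpc spinlock type, for the sketch only. */
  typedef struct {
          volatile unsigned int slock;          /* 0 = unlocked */
  } arch_spinlock_t;

  void arch_spin_lock_ll_sc(arch_spinlock_t *lock);  /* made-up helper */

  /*
   * Option 1: strengthen unlock -- a full sync (hwsync) instead of
   * lwsync before the releasing store, so everything before the unlock
   * is ordered against everything after it on this CPU, including the
   * accesses made after a subsequent lock.
   */
  static inline void arch_spin_unlock_strong(arch_spinlock_t *lock)
  {
          __asm__ __volatile__("sync" : : : "memory");
          lock->slock = 0;
  }

  /*
   * Option 2: strengthen lock -- a full sync once the lock is held,
   * i.e. fold what smp_mb__after_unlock_lock() does today into the
   * lock operation itself.
   */
  static inline void arch_spin_lock_strong(arch_spinlock_t *lock)
  {
          arch_spin_lock_ll_sc(lock);
          __asm__ __volatile__("sync" : : : "memory");
  }

Either placement puts a sync between the accesses preceding the unlock
and those following the next lock on the same CPU; they just charge the
cost to different operations.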

Will

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC PATCH v2] memory-barriers: remove smp_mb__after_unlock_lock()
  2015-08-18  8:37                                                   ` Will Deacon
@ 2015-08-20  9:45                                                     ` Michael Ellerman
  2015-08-20 15:56                                                       ` Will Deacon
  0 siblings, 1 reply; 66+ messages in thread
From: Michael Ellerman @ 2015-08-20  9:45 UTC (permalink / raw)
  To: Will Deacon
  Cc: Paul E. McKenney, Peter Zijlstra, linux-arch, linux-kernel,
	Benjamin Herrenschmidt

On Tue, 2015-08-18 at 09:37 +0100, Will Deacon wrote:
> On Tue, Aug 18, 2015 at 02:50:55AM +0100, Michael Ellerman wrote:
> > On Mon, 2015-08-17 at 09:57 +0100, Will Deacon wrote:
> > > On Mon, Aug 17, 2015 at 07:15:01AM +0100, Paul E. McKenney wrote:
> > > > On Mon, Aug 17, 2015 at 02:06:07PM +1000, Michael Ellerman wrote:
> > > > > On Wed, 2015-08-12 at 08:43 -0700, Paul E. McKenney wrote:
> > > > > I thought the end result of this thread was that we didn't *need* to change the
> > > > > powerpc lock semantics? Or did I read it wrong?
> > > > > 
> > > > > ie. the docs now say that RELEASE+ACQUIRE is not a full barrier, which is
> > > > > consistent with our current implementation.
> > > > 
> > > > That change happened about 1.5 years ago, and I thought that the
> > > > current discussion was about reversing it, based in part on the
> > > > recent powerpc benchmarks of locking primitives with and without the
> > > > sync instruction.  But regardless, I clearly cannot remove either the
> > > > smp_mb__after_unlock_lock() or the powerpc definition of it to be smp_mb()
> > > > if powerpc unlock/lock is not strengthened.
> > > 
> > > Yup. Peter and I would really like to get rid of smp_mb__after_unlock_lock
> > > entirely, which would mean strengthening the ppc spinlocks. Moving the
> > > barrier primitive into RCU is a good step to prevent more widespread usage
> > > of the barrier, but we'd really like to go further if the performance impact
> > > is deemed acceptable (which is what this thread is about).
> > 
> > OK, sorry for completely missing the point, too many balls in the air here.
> 
> No problem!
> 
> > I'll do some benchmarks and see what we come up with.
> 
> Thanks, that sounds great. FWIW, there are multiple ways of implementing
> the patch (i.e. whether you strengthen lock or unlock). I had a crack at
> something here, but it's not tested:
> 
>   http://marc.info/?l=linux-arch&m=143758379023849&w=2

Thanks.

I notice you are not changing PPC_RELEASE_BARRIER, but only the spin unlock
code. But from my reading of the docs we need to make sure any UNLOCK+LOCK is a
full barrier, not just spin unlock/lock?

So don't we need to worry about some of the other locks as well? At least
rwlock, and mutex fast path?
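
For example, a mutex fast path built from nothing more than an acquire
cmpxchg and a release store has exactly the same UNLOCK+LOCK gap as the
spinlock case, so a sync confined to arch_spin_unlock() would not cover
it.  A sketch with C11 atomics (the type and names are made up, not the
real powerpc mutex code):

  #include <stdatomic.h>
  #include <stdbool.h>

  typedef atomic_int sketch_mutex_t;    /* 0 = unlocked, 1 = locked */

  static inline bool sketch_mutex_trylock(sketch_mutex_t *m)
  {
          int expected = 0;

          /* acquire ordering on success, none on failure */
          return atomic_compare_exchange_strong_explicit(m, &expected, 1,
                                                         memory_order_acquire,
                                                         memory_order_relaxed);
  }

  static inline void sketch_mutex_unlock(sketch_mutex_t *m)
  {
          atomic_store_explicit(m, 0, memory_order_release);  /* release only */
  }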

cheers



^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC PATCH v2] memory-barriers: remove smp_mb__after_unlock_lock()
  2015-08-20  9:45                                                     ` Michael Ellerman
@ 2015-08-20 15:56                                                       ` Will Deacon
  2015-08-26  0:27                                                         ` Paul E. McKenney
  0 siblings, 1 reply; 66+ messages in thread
From: Will Deacon @ 2015-08-20 15:56 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Paul E. McKenney, Peter Zijlstra, linux-arch, linux-kernel,
	Benjamin Herrenschmidt

On Thu, Aug 20, 2015 at 10:45:05AM +0100, Michael Ellerman wrote:
> On Tue, 2015-08-18 at 09:37 +0100, Will Deacon wrote:
> > On Tue, Aug 18, 2015 at 02:50:55AM +0100, Michael Ellerman wrote:
> > > On Mon, 2015-08-17 at 09:57 +0100, Will Deacon wrote:
> > > > On Mon, Aug 17, 2015 at 07:15:01AM +0100, Paul E. McKenney wrote:
> > > > > On Mon, Aug 17, 2015 at 02:06:07PM +1000, Michael Ellerman wrote:
> > > > > > On Wed, 2015-08-12 at 08:43 -0700, Paul E. McKenney wrote:
> > > > > > I thought the end result of this thread was that we didn't *need* to change the
> > > > > > powerpc lock semantics? Or did I read it wrong?
> > > > > > 
> > > > > > ie. the docs now say that RELEASE+ACQUIRE is not a full barrier, which is
> > > > > > consistent with our current implementation.
> > > > > 
> > > > > That change happened about 1.5 years ago, and I thought that the
> > > > > current discussion was about reversing it, based in part on the
> > > > > recent powerpc benchmarks of locking primitives with and without the
> > > > > sync instruction.  But regardless, I clearly cannot remove either the
> > > > > smp_mb__after_unlock_lock() or the powerpc definition of it to be smp_mb()
> > > > > if powerpc unlock/lock is not strengthened.
> > > > 
> > > > Yup. Peter and I would really like to get rid of smp_mb__after_unlock_lock
> > > > entirely, which would mean strengthening the ppc spinlocks. Moving the
> > > > barrier primitive into RCU is a good step to prevent more widespread usage
> > > > of the barrier, but we'd really like to go further if the performance impact
> > > > is deemed acceptable (which is what this thread is about).
> > > 
> > > OK, sorry for completely missing the point, too many balls in the air here.
> > 
> > No problem!
> > 
> > > I'll do some benchmarks and see what we come up with.
> > 
> > Thanks, that sounds great. FWIW, there are multiple ways of implementing
> > the patch (i.e. whether you strengthen lock or unlock). I had a crack at
> > something here, but it's not tested:
> > 
> >   http://marc.info/?l=linux-arch&m=143758379023849&w=2
> 
> Thanks.
> 
> I notice you are not changing PPC_RELEASE_BARRIER, but only the spin unlock
> code. But from my reading of the docs we need to make sure any UNLOCK+LOCK is a
> full barrier, not just spin unlock/lock?
> 
> So don't we need to worry about some of the other locks as well? At least
> rwlock, and mutex fast path?

Hmm, that's a good question. I notice that you don't do any of the SYNC_IO
stuff for any locks other than spinlocks but I don't know whether
smp_mb__after_unlock_lock is similarly limited in scope.

Paul?

Will

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC PATCH v2] memory-barriers: remove smp_mb__after_unlock_lock()
  2015-08-20 15:56                                                       ` Will Deacon
@ 2015-08-26  0:27                                                         ` Paul E. McKenney
  2015-08-26  4:06                                                           ` Michael Ellerman
  0 siblings, 1 reply; 66+ messages in thread
From: Paul E. McKenney @ 2015-08-26  0:27 UTC (permalink / raw)
  To: Will Deacon
  Cc: Michael Ellerman, Peter Zijlstra, linux-arch, linux-kernel,
	Benjamin Herrenschmidt

On Thu, Aug 20, 2015 at 04:56:04PM +0100, Will Deacon wrote:
> On Thu, Aug 20, 2015 at 10:45:05AM +0100, Michael Ellerman wrote:
> > On Tue, 2015-08-18 at 09:37 +0100, Will Deacon wrote:
> > > On Tue, Aug 18, 2015 at 02:50:55AM +0100, Michael Ellerman wrote:
> > > > On Mon, 2015-08-17 at 09:57 +0100, Will Deacon wrote:
> > > > > On Mon, Aug 17, 2015 at 07:15:01AM +0100, Paul E. McKenney wrote:
> > > > > > On Mon, Aug 17, 2015 at 02:06:07PM +1000, Michael Ellerman wrote:
> > > > > > > On Wed, 2015-08-12 at 08:43 -0700, Paul E. McKenney wrote:
> > > > > > > I thought the end result of this thread was that we didn't *need* to change the
> > > > > > > powerpc lock semantics? Or did I read it wrong?
> > > > > > > 
> > > > > > > ie. the docs now say that RELEASE+ACQUIRE is not a full barrier, which is
> > > > > > > consistent with our current implementation.
> > > > > > 
> > > > > > That change happened about 1.5 years ago, and I thought that the
> > > > > > current discussion was about reversing it, based in part on the
> > > > > > recent powerpc benchmarks of locking primitives with and without the
> > > > > > sync instruction.  But regardless, I clearly cannot remove either the
> > > > > > smp_mb__after_unlock_lock() or the powerpc definition of it to be smp_mb()
> > > > > > if powerpc unlock/lock is not strengthened.
> > > > > 
> > > > > Yup. Peter and I would really like to get rid of smp_mb__after_unlock_lock
> > > > > entirely, which would mean strengthening the ppc spinlocks. Moving the
> > > > > barrier primitive into RCU is a good step to prevent more widespread usage
> > > > > of the barrier, but we'd really like to go further if the performance impact
> > > > > is deemed acceptable (which is what this thread is about).
> > > > 
> > > > OK, sorry for completely missing the point, too many balls in the air here.
> > > 
> > > No problem!
> > > 
> > > > I'll do some benchmarks and see what we come up with.
> > > 
> > > Thanks, that sounds great. FWIW, there are multiple ways of implementing
> > > the patch (i.e. whether you strengthen lock or unlock). I had a crack at
> > > something here, but it's not tested:
> > > 
> > >   http://marc.info/?l=linux-arch&m=143758379023849&w=2
> > 
> > Thanks.
> > 
> > I notice you are not changing PPC_RELEASE_BARRIER, but only the spin unlock
> > code. But from my reading of the docs we need to make sure any UNLOCK+LOCK is a
> > full barrier, not just spin unlock/lock?
> > 
> > So don't we need to worry about some of the other locks as well? At least
> > rwlock, and mutex fast path?
> 
> Hmm, that's a good question. I notice that you don't do any of the SYNC_IO
> stuff for any locks other than spinlocks but I don't know whether
> smp_mb__after_unlock_lock is similarly limited in scope.
> 
> Paul?

I would expect the various locks to have similar ordering characteristics.

Or am I missing something subtle here?

							Thanx, Paul


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC PATCH v2] memory-barriers: remove smp_mb__after_unlock_lock()
  2015-08-26  0:27                                                         ` Paul E. McKenney
@ 2015-08-26  4:06                                                           ` Michael Ellerman
  0 siblings, 0 replies; 66+ messages in thread
From: Michael Ellerman @ 2015-08-26  4:06 UTC (permalink / raw)
  To: paulmck
  Cc: Will Deacon, Peter Zijlstra, linux-arch, linux-kernel,
	Benjamin Herrenschmidt

On Tue, 2015-08-25 at 17:27 -0700, Paul E. McKenney wrote:
> On Thu, Aug 20, 2015 at 04:56:04PM +0100, Will Deacon wrote:
> > On Thu, Aug 20, 2015 at 10:45:05AM +0100, Michael Ellerman wrote:
> > > On Tue, 2015-08-18 at 09:37 +0100, Will Deacon wrote:
> > > > 
> > > > Thanks, that sounds great. FWIW, there are multiple ways of implementing
> > > > the patch (i.e. whether you strengthen lock or unlock). I had a crack at
> > > > something here, but it's not tested:
> > > > 
> > > >   http://marc.info/?l=linux-arch&m=143758379023849&w=2
> > > 
> > > I notice you are not changing PPC_RELEASE_BARRIER, but only the spin unlock
> > > code. But from my reading of the docs we need to make sure any UNLOCK+LOCK is a
> > > full barrier, not just spin unlock/lock?
> > > 
> > > So don't we need to worry about some of the other locks as well? At least
> > > rwlock, and mutex fast path?
> > 
> > Hmm, that's a good question. I notice that you don't do any of the SYNC_IO
> > stuff for any locks other than spinlocks but I don't know whether
> > smp_mb__after_unlock_lock is similarly limited in scope.
> > 
> > Paul?
> 
> I would expect the various locks to have similar ordering characteristics.
> 
> Or am I missing something subtle here?

I don't think so.

The docs just talk about ACQUIRE/RELEASE, so I think it needs to apply to all
lock types. Or at least the list mentioned in the docs, which is:

 (*) spin locks
 (*) R/W spin locks
 (*) mutexes
 (*) semaphores
 (*) R/W semaphores
 (*) RCU

cheers



^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC PATCH v2] memory-barriers: remove smp_mb__after_unlock_lock()
  2015-07-16  5:14           ` Benjamin Herrenschmidt
  2015-07-16 15:11             ` Paul E. McKenney
@ 2015-09-01  2:57             ` Paul Mackerras
  1 sibling, 0 replies; 66+ messages in thread
From: Paul Mackerras @ 2015-09-01  2:57 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Michael Ellerman, Will Deacon, linux-arch, linux-kernel,
	Paul McKenney, Peter Zijlstra

On Thu, Jul 16, 2015 at 03:14:55PM +1000, Benjamin Herrenschmidt wrote:
> On Thu, 2015-07-16 at 15:03 +1000, Benjamin Herrenschmidt wrote:
> > On Thu, 2015-07-16 at 12:00 +1000, Michael Ellerman wrote:
> > > That would fix the problem with smp_mb__after_unlock_lock(), but not
> > > the original worry we had about loads happening before the SC in lock.
> > 
> > However I think isync fixes *that* :-) The problem with isync is as you
> > said, it's not a -memory- barrier per-se, it's an execution barrier /
> > context synchronizing instruction. The combination stwcx. + bne + isync
> > however prevents the execution of anything past the isync until the
> > stwcx has completed and the bne has been "decided", which prevents loads
> > from leaking into the LL/SC loop. It will also prevent a store in the
> > lock from being issued before the stwcx. has completed. It does *not*
> > prevent as far as I can tell another unrelated store before the lock
> > from leaking into the lock, including the one used to unlock a different
> > lock.
> 
> Except that the architecture says:
> 
> <<
> Because a Store Conditional instruction may complete before its
> store has been performed, a conditional Branch instruction that
> depends on the CR0 value set by a Store Conditional instruction does
> not order the Store Conditional's store with respect to storage
> accesses caused by instructions that follow the Branch
> >>
> 
> So isync in lock is architecturally incorrect, despite being what the
> architecture recommends using, yay !

That paragraph doesn't say anything about isync.  Doing stcx., bne,
isync to acquire a lock will ensure that the instructions inside the
locked region can't execute until the lock is globally known to be
acquired.  It is true that another CPU that doesn't take the lock
could see the effect of some load or store inside the locked region
and then see the lock variable itself being clear; but any CPU that
tried to acquire the lock would have its stcx. fail (at least until
the first CPU releases the lock).

So isync in the lock sequence is architecturally correct.
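
For reference, the acquire sequence being debated looks roughly like
the sketch below -- not the exact kernel code, which uses
PPC_ACQUIRE_BARRIER and a per-CPU lock token rather than a bare 1:

  /*
   * Sketch of a powerpc lock acquire ending in isync.  The isync keeps
   * the instructions inside the critical section from executing until
   * the stwcx. has succeeded and the bne- has been resolved, which is
   * the property argued for above.  Whether *earlier* stores (such as
   * the store that released a different lock) can sink past the
   * lwarx/stwcx. into the critical section is the separate UNLOCK+LOCK
   * question this thread is about.
   */
  static inline void spin_lock_sketch(volatile unsigned int *lock)
  {
          unsigned int tmp;

          __asm__ __volatile__(
  "1:     lwarx   %0,0,%1\n"      /* load-reserve the lock word */
  "       cmpwi   0,%0,0\n"       /* already held? */
  "       bne-    1b\n"
  "       stwcx.  %2,0,%1\n"      /* try to claim it */
  "       bne-    1b\n"           /* lost the reservation: retry */
  "       isync\n"                /* ACQUIRE: execution barrier */
          : "=&r" (tmp)
          : "r" (lock), "r" (1)
          : "cr0", "memory");
  }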

Paul.

^ permalink raw reply	[flat|nested] 66+ messages in thread

end of thread, other threads:[~2015-09-01  2:57 UTC | newest]

Thread overview: 66+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-07-13 12:15 [RFC PATCH v2] memory-barriers: remove smp_mb__after_unlock_lock() Will Deacon
2015-07-13 13:09 ` Peter Hurley
2015-07-13 14:24   ` Will Deacon
2015-07-13 15:56     ` Peter Zijlstra
2015-07-13 13:11 ` Peter Zijlstra
2015-07-13 14:09   ` Will Deacon
2015-07-13 14:21     ` Will Deacon
2015-07-13 15:54       ` Peter Zijlstra
2015-07-13 17:50         ` Will Deacon
2015-07-13 20:20           ` Paul E. McKenney
2015-07-13 22:23             ` Peter Zijlstra
2015-07-13 23:04               ` Paul E. McKenney
2015-07-14 10:04                 ` Will Deacon
2015-07-14 12:45                   ` Paul E. McKenney
2015-07-14 12:51                     ` Will Deacon
2015-07-14 14:00                       ` Paul E. McKenney
2015-07-14 14:12                         ` Will Deacon
2015-07-14 19:31                           ` Paul E. McKenney
2015-07-15  1:38                             ` Paul E. McKenney
2015-07-15 10:51                               ` Will Deacon
2015-07-15 13:12                                 ` Paul E. McKenney
2015-07-24 11:31                                   ` Will Deacon
2015-07-24 15:30                                     ` Paul E. McKenney
2015-08-12 13:44                                       ` Will Deacon
2015-08-12 15:43                                         ` Paul E. McKenney
2015-08-12 17:59                                           ` Paul E. McKenney
2015-08-13 10:49                                             ` Will Deacon
2015-08-13 13:10                                               ` Paul E. McKenney
2015-08-17  4:06                                           ` Michael Ellerman
2015-08-17  6:15                                             ` Paul E. McKenney
2015-08-17  8:57                                               ` Will Deacon
2015-08-18  1:50                                                 ` Michael Ellerman
2015-08-18  8:37                                                   ` Will Deacon
2015-08-20  9:45                                                     ` Michael Ellerman
2015-08-20 15:56                                                       ` Will Deacon
2015-08-26  0:27                                                         ` Paul E. McKenney
2015-08-26  4:06                                                           ` Michael Ellerman
2015-07-13 18:23         ` Paul E. McKenney
2015-07-13 19:41           ` Peter Hurley
2015-07-13 20:16             ` Paul E. McKenney
2015-07-13 22:15               ` Peter Zijlstra
2015-07-13 22:43                 ` Benjamin Herrenschmidt
2015-07-14  8:34                   ` Peter Zijlstra
2015-07-13 22:53                 ` Paul E. McKenney
2015-07-13 22:37         ` Benjamin Herrenschmidt
2015-07-13 22:31 ` Benjamin Herrenschmidt
2015-07-14 10:16   ` Will Deacon
2015-07-15  3:06   ` Michael Ellerman
2015-07-15 10:44     ` Will Deacon
2015-07-16  2:00       ` Michael Ellerman
2015-07-16  5:03         ` Benjamin Herrenschmidt
2015-07-16  5:14           ` Benjamin Herrenschmidt
2015-07-16 15:11             ` Paul E. McKenney
2015-07-16 22:54               ` Benjamin Herrenschmidt
2015-07-17  9:32                 ` Will Deacon
2015-07-17 10:15                   ` Peter Zijlstra
2015-07-17 12:40                     ` Paul E. McKenney
2015-07-17 22:14                   ` Benjamin Herrenschmidt
2015-07-20 13:39                     ` Will Deacon
2015-07-20 13:48                       ` Paul E. McKenney
2015-07-20 13:56                         ` Will Deacon
2015-07-20 21:18                       ` Benjamin Herrenschmidt
2015-07-22 16:49                         ` Will Deacon
2015-09-01  2:57             ` Paul Mackerras
2015-07-15 14:18     ` Paul E. McKenney
2015-07-16  1:34       ` Michael Ellerman

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).